August 2013

MANAGEMENT BRIEF

Business Case for Enterprise Big Data Deployments
Comparing Costs, Benefits and Risks for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop

International Technology Group
609 Pacific Avenue, Suite 102
Santa Cruz, California 95060-4406
Telephone: 831-427-9260
Email: [email protected]
Website: ITGforInfo.com

TABLE OF CONTENTS

EXECUTIVE SUMMARY
    Challenges and Solutions
    Open Source
    IBM InfoSphere BigInsights Differentiators
    Conclusions
SOLUTION SET
    Overview
    Deployment Options
        Servers and Storage
        Platform Symphony
        GPFS-FPO
DETAILED DATA
    Composite Profiles
    Cost Calculations
    Cost Breakdowns

List of Figures
1. Three-year Costs for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop for Major Applications – Averages for All Installations
2. IBM InfoSphere BigInsights Environment
3. IBM InfoSphere BigInsights Components
4. Composite Profiles
5. FTE Salary Assumptions
6. Three-year Cost Breakdowns

Copyright © 2013 by the International Technology Group. All rights reserved. Material, in whole or part, contained in this document may not be reproduced or distributed by any means or in any form, including original, without the prior written permission of the International Technology Group (ITG). Information has been obtained from sources assumed to be reliable and reflects conclusions at the time. This document was developed with International Business Machines Corporation (IBM) funding. Although the document may utilize publicly available material from various sources, including IBM, it does not necessarily reflect the positions of such sources on the issues addressed in this document. Material contained and conclusions presented in this document are subject to change without notice. All warranties as to the accuracy, completeness or adequacy of such material are disclaimed. There shall be no liability for errors, omissions or inadequacies in the material contained in this document or for interpretations thereof. Trademarks included in this document are the property of their respective owners.

EXECUTIVE SUMMARY

Challenges and Solutions

What more can be said on the subject of big data? A great deal, it turns out.

Industry debate tends to focus on the role which big data analytics may play in transforming business decision-making and organizational competitiveness. The impact on these will clearly be transformative. But there is a downside. Bottlenecks are emerging that may seriously delay realization of the potential of big data in many, perhaps most organizations.

This is particularly the case for the complex of technologies that has developed around Apache Hadoop. As use of Hadoop-based systems has spread beyond social media companies, users have found that developer productivity is often poor – Hadoop requires a great deal of manual coding – and that skills shortages slow new project starts and magnify deployment times and costs.

Hadoop specialists have become among the highest-paid in the IT world. In the United States, for example, starting salaries for Hadoop developers are routinely over $100,000, and salaries for managers, data scientists, architects and other high-level specializations often top $200,000. Worldwide, Hadoop compensation is trending rapidly upward, and is expected to continue doing so for the foreseeable future.

The combination of low developer productivity and high salary levels makes for poor economics.
This is, it is commonly argued, counterbalanced by the fact that most Hadoop components are open source, and may be downloaded free of charge. But overall costs may not necessarily be lower than for vendor-managed Hadoop distributions that enable more cost-effective development and deployment.

This may be illustrated by comparisons of three-year costs for use of open source Hadoop and the IBM BigInsights Hadoop distribution for representative high-impact applications in six companies. Overall costs averaged 28 percent less for use of IBM BigInsights.

These comparisons, whose results are summarized in figure 1, include software licenses, support and personnel costs for use of BigInsights, and personnel costs only for use of open source Hadoop software.

(Bar chart of three-year costs in $ thousands, broken out by licenses & support, development & deployment personnel and ongoing personnel: IBM InfoSphere BigInsights, 3,386.9; Open Source Apache Hadoop, 4,690.4.)
Figure 1: Three-year Costs for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop for Major Applications – Averages for All Installations

Personnel costs are for initial application development and deployment, as well as for ongoing post-production operations over a three-year period. Calculations include data scientists, architects, project managers, developers, and data and installation specialists for initial development and deployment; and developers, data specialists and system administrators for post-production operations. BigInsights costs also include licenses and support.

Comparisons are for composite profiles of financial services, health care, marketing services, media, retail and telecommunications companies. Profiles were constructed based on information supplied by 29 organizations employing BigInsights, open source tools or combinations of these.

Further information on profiles, methodology and assumptions employed for calculations, along with cost breakdowns for individual companies, may be found in the Detailed Data section of this report.

Open Source

Hadoop, in its open source form, is based largely on technologies developed by Google in the early 2000s. The earliest and, to date, largest Hadoop users have been social media and e-commerce companies. In addition to Google itself, these have included Amazon.com, AOL, eBay, Facebook, LinkedIn, Twitter, Yahoo and international equivalents.

Although the field of players has since expanded to include hundreds of venture capital-funded start-ups, along with established systems and services vendors and large end users, social media businesses continue to dominate Hadoop development. Most of the more than one billion lines of code in the Apache Hadoop stack – more than 90 percent, according to some estimates – have to date been contributed by these companies.

The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption that Hadoop developers are highly skilled, capable of working with “raw” open source code and configuring software components on a case-by-case basis as needs change. Manual coding is the norm.

Decades of experience have shown that, regardless of which technologies are employed, manual coding offers lower developer productivity and greater potential for errors than more sophisticated techniques. Continuous updates may be required as business needs and data sources change. Over time, complex, inadequately documented masses of “spaghetti code” may be generated that are expensive to maintain and enhance.
These problems have routinely affected organizations employing legacy mainframe applications. There is no sense in repeating them with new generations of software technology.

Other issues have also emerged. These include:

• Stability. Hadoop and the open source software stack surrounding it are currently defined and enhanced through at least 25 separate Apache Foundation projects, sub-projects and incubators. More can be expected as the scope of the Hadoop environment expands, and new tools and technologies emerge. Initiatives tend to move at different speeds, and release dates are at best loosely coordinated. Developers are exposed to a continuous stream of changes.

Instability can be a significant challenge. It becomes more difficult to plan technology strategies, project schedules and costs become less predictable, and risks of project failure increase. The potential for future interoperability problems also expands.

The Apache stack may, moreover, evolve in an unpredictable manner. Organizations may standardize upon individual components only to find that these receive declining attention over time. The pace of technology change among social media companies is a great deal faster than users in most other industries are accustomed to.

• Interoperability. Few, if any, Hadoop-based systems are “standalone” in the sense that they do not require interoperability with other applications and databases.

All of the organizations surveyed for this report, for example, employed or planned to employ interfaces to relational databases, data warehouses, conventional analytics tools, query and reporting intranets and/or CRM and back-end systems. This was the case even for “pure play” suppliers of Hadoop-based services.

Interoperability requirements were particularly significant among financial services, health care, insurance, retail and telecommunications companies. One large banking institution reported, for example, that it expected to implement 40 to 50 different interfaces before its Hadoop-based system could be brought into full operation.

• Resiliency. The open source Hadoop stack includes a variety of mechanisms designed to maintain availability, and to enable failover and recovery in the event of unplanned (i.e., accidental) outages as well as planned downtime for software modifications, scheduled maintenance and other tasks.

These mechanisms are, however, a great deal less mature than is the case for conventional business-critical systems. They are also, in environments characterized by numerous hand-configured components, a great deal more complex and error-prone. Vulnerabilities are magnified when systems undergo frequent changes.

Major social media companies have often realized high levels of availability. This has, however, typically required expensive investments to harden software, ensure redundancy and provide in-depth operational monitoring and response staff and procedures.

• Manageability. Open source Hadoop limitations have emerged in such areas as configuration and installation, monitoring, job scheduling, workload management, tuning, and availability and security administration. Although some open source components address these issues, they are a comparatively low priority for most Apache contributors.

Users may, to some extent, compensate for these limitations by “labor-intensive” management practices. This approach not only translates into higher personnel costs, but is also less reliable.
Open source manageability limitations may not be visible during application development and deployment. However, they will be reflected in higher ongoing full time equivalent (FTE) system administration staffing, and may impact post-production quality of service.

• Support. Open source software is available only with community support – i.e., users rely upon online peer forums for enhancements, technical advice and problem resolution. This approach may prove appropriate for commonly encountered issues, although it is dependent on the willingness of others to share their time and experience. It has proved to be a great deal less reliable in dealing with organization-specific configuration issues.

The bottom-line implications may be substantial. Delays in resolving problems may undermine developer productivity, and may result in application errors, performance bottlenecks, outages, data loss and other negative effects.

As Hadoop deployments have grown, these issues have led to the appearance of vendor-managed, fee-based distributions that include enhanced tools and functions, and offer more effective customer support. Current examples include the Amazon Elastic MapReduce (Amazon EMR) web service; Cloudera’s Distribution Including Apache Hadoop (CDH); EMC’s Pivotal HD; Hortonworks Data Platform; IBM InfoSphere BigInsights; Intel Distribution for Apache Hadoop (Intel Distribution); and the MapR M series.

IBM InfoSphere BigInsights Differentiators

The BigInsights environment currently includes the components shown in figure 2.

(Diagram of the BigInsights environment, organized into layers:
Applications – 20+ prebuilt applications
Visualization & Discovery – BigSheets, Dashboard & Visualization
Administration – Admin Console, Monitoring
Enablers – Social Data Analytics Accelerator, Machine Data Analytics Accelerator, Eclipse-based toolkits, Web application catalog, REST API, Big SQL, Application framework
Infrastructure – Avro, HBase, Hive, Jaql, Lucene, MapReduce, Flexible Scheduler, BigIndex, Adaptive MapReduce, Oozie, Pig, ZooKeeper, HDFS, Splittable compression, Enhanced security, Integrated Installer, High availability, GPFS-FPO
Data Sources & Connectivity – BoardReader, Web Crawler, Cognos, SPSS, InfoSphere Data Explorer, IBM InfoSphere DataStage, Flume, MicroStrategy, SAS, InfoSphere Optim, R, DB2, Oracle, SQL Server, Teradata, InfoSphere Warehouse, JDBC, ODBC, Sqoop, IBM PureData System for Analytics, Platform Computing, InfoSphere Streams, InfoSphere Guardium
Legend: IBM / Open Source)
Figure 2: IBM InfoSphere BigInsights Environment

Although this environment includes a full Apache Hadoop stack, it is differentiated by numerous IBM components that address the issues outlined above. In BigInsights Version 2.1, which became available in June 2013, these may be summarized as follows:

• Visualization and discovery tools include IBM BigSheets, a highly customizable end user analytical solution for identification, integration and exploration of unstructured and/or structured data patterns. It employs a spreadsheet-like interface, but is more sophisticated than conventional spreadsheets, and is not limited in the amount of data it can address.

• Administration tools include a Web-based administrative console providing a common, high-productivity interface for monitoring, health checking and management of all application and infrastructure components. Integrated Installer automates configuration and installation tasks for all components.
• Development tools include Eclipse-based toolkits supporting the principal Hadoop development tools and languages, as well as a Web application catalog that includes ad hoc query, data import and export, and test applications designed for rapid prototyping.

• Accelerators for social media and machine data analytics include prebuilt templates and components for a range of industry- and application-specific functions. Accelerators were developed based on customer experiences, and have materially improved “time to value” for development and deployment of Hadoop-based applications. More can be expected in the future.

Text analytic capabilities are incorporated as a standard feature of BigInsights. The social media and machine data accelerators include custom text extractors for their respective application domains.

• Big SQL, introduced in BigInsights 2.1, is a native SQL query engine. It allows developers to leverage existing SQL skills and tools to query Hive, HBase or distributed file system data. Developers may use standard SQL syntax and, in some cases, IBM-supplied SQL extensions optimized for use with Hadoop (see the illustrative sketch following this list).

Big SQL offers an alternative to the SQL-like HiveQL, a Hive extension developed by Facebook. Big SQL is easier to use and better aligned with mainstream SQL development tools and techniques. It also incorporates features – which are not found in native HiveQL – that can improve runtime performance for certain applications and workloads.

This approach is likely to see widespread adoption. While skilled Hadoop specialists are still comparatively rare, SQL has been in widespread use since the 1980s. There are believed to be over four million developers worldwide familiar with this language. Most large organizations have longstanding investments in SQL skill sets, and in SQL-based applications and tools.

• Infrastructure enhancements are provided in such areas as large-scale indexing (BigIndex), job scheduling (BigInsights Scheduler), administration and monitoring tools, splittable text compression and security.

BigInsights supports Adaptive MapReduce, which exploits IBM workload management technology in Platform Symphony. Adaptive MapReduce allows smaller MapReduce jobs to be executed more efficiently, and enables more effective, lower-overhead management of mixed workloads than open source MapReduce.

• Platform Symphony is a high-performance grid middleware solution originally developed by Platform Computing, which was acquired by IBM in 2012. In BigInsights, it can be used to replace the open source MapReduce layer while allowing MapReduce jobs to be created in the same manner. Customers may choose which to install.

• IBM General Parallel File System – File Placement Optimizer (GPFS-FPO) is a Hadoop-optimized implementation of the IBM GPFS distributed file system that offers an alternative to HDFS. For more than a decade, GPFS has been widely deployed for scientific and technical computing, as well as for a wide range of commercial applications.

In addition to offering higher performance, GPFS enables higher cluster availability, and benefits from more effective system management, snapshot copying, failover and recovery, and security than HDFS.

(IBM is not alone in adopting this approach. There has been a growing trend among Hadoop users toward use of HDFS alternatives such as the MapR file system, Cassandra and Lustre.)

• High availability features include enhanced HDFS NameNode failover. The IBM implementation enables seamless and transparent failover.
The process is automatic – no administrator intervention is required – and occurs more rapidly and reliably than in a conventional open source environment. More sophisticated features are offered by Platform Symphony.

• Interoperability tools conform to a wide range of industry standards and/or are designed to integrate with key IBM and third-party databases and application solutions. Interfaces are provided to commonly used open source software; JDBC- and ODBC-compliant tools; IBM DB2, Oracle, Microsoft SQL Server and Teradata databases; and key IBM solutions forming part of the company’s Big Data Platform.

These include the InfoSphere Warehouse data warehouse framework; Cognos business intelligence; SPSS statistical modeling and analysis; InfoSphere DataStage extract, transformation and load (ETL) tooling; InfoSphere Guardium for enterprise security management; Platform Symphony high-performance grid middleware; and the IBM PureData System for Analytics appliance.

The Web Crawler application automates Internet searches and collects data based on user-defined criteria. Data may be imported into BigSheets.

BigInsights is compatible with, and is often used alongside, IBM InfoSphere Streams for real-time big data analytics. This solution is architecturally comparable to open source Storm, but contains numerous IBM enhancements for development productivity, manageability, resiliency and interoperability. BigInsights contains a limited-use InfoSphere Streams license.
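To make the Big SQL item above concrete, the following minimal sketch issues an ordinary SQL query from Python through an ODBC connection of the kind the Big SQL drivers support. The data source name, credentials, table and column names are hypothetical assumptions for illustration; they are not drawn from this report.

```python
# Minimal sketch: querying Hadoop-resident data with standard SQL over ODBC.
# Assumptions (not from the report): an ODBC data source named "BIGSQL" points at the
# Big SQL server, and a table web_logs(visit_date, page_url, visitor_id) has been
# defined over data stored in the cluster.
import pyodbc

conn = pyodbc.connect("DSN=BIGSQL;UID=bigsql_user;PWD=example")  # hypothetical DSN and credentials
cursor = conn.cursor()

# Ordinary SQL; no MapReduce or HiveQL-specific coding is required at this level.
cursor.execute(
    """
    SELECT page_url, COUNT(DISTINCT visitor_id) AS unique_visitors
    FROM web_logs
    WHERE visit_date >= '2013-01-01'
    GROUP BY page_url
    ORDER BY unique_visitors DESC
    """
)

for page_url, unique_visitors in cursor.fetchmany(10):  # top ten pages by unique visitors
    print(page_url, unique_visitors)

cursor.close()
conn.close()
```

Because the same statement could equally be issued from any JDBC- or ODBC-compliant query or reporting tool, existing SQL skill sets and tooling carry over largely intact.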
BigInsights capabilities are evolving rapidly. IBM has committed to integrating new open source components as these emerge, and the company is known to be working on a variety of other functional enhancements.

Conclusions

Use of Hadoop is still at an early stage. Apart from a handful of major social media companies, most Hadoop deployments have occurred over the last two years. As industry surveys have shown, many are still not in production.

Adoption, however, is expanding rapidly, and it is clear that Big Data will become a central feature of IT landscapes in most organizations. As this occurs, technology stacks and deployment patterns will inevitably change.

It can be expected that, as with previous waves of open source technology, the Hadoop market will become more segmented, and solution offerings will become more diverse. Enterprise users – a category that will probably include many midsize businesses as well as start-ups – will inevitably move to more productive, resilient, vendor-supported distributions.

Many organizations will also move toward converged Hadoop and SQL environments, applying SQL skill bases and application portfolios to new Big Data challenges. There is also a widespread move toward augmentation of SQL-based data warehouses with subsets or aggregations of Hadoop data.

These trends will increasingly leverage broader IBM differentiators. These include long-established company strengths in software engineering (BigInsights components are not only pre-integrated, but also extensively tested for optimum performance and functional transparency), customization (the ability of IBM services organizations to deliver industry- and organization-specific solutions has already emerged as a major source of BigInsights appeal) and customer support.

IBM, moreover, has decades of experience with relational technology and data warehousing. The company’s SQL strengths exceed – by wide margins – those of any other Hadoop distributor, and its systems integration capabilities are among the world’s best.

As in other areas of its software business, the company has moved aggressively to recruit and support business partners. These currently include more than 300 independent software vendor (ISV) and services firms, including suppliers of a wide range of complementary tools and industry-specific solutions. The number is expanding rapidly.

The Hadoop open source community, no doubt, will remain vibrant, and use of free downloads will continue to expand. But a distinct category of enterprise solutions will clearly emerge, and these will be more strongly focused on development productivity, stability, resilience, manageability, system integration and in-depth customer support. For organizations that expect to move toward the enterprise paradigm, it may make sense to deploy IBM BigInsights sooner rather than later.

SOLUTION SET

Overview

In its present form, BigInsights includes the principal components of Apache Hadoop and related projects, along with the IBM enhancements described earlier. BigInsights is offered by IBM as a licensed software product, and through IBM SmartCloud Enterprise and third-party cloud service providers.

In addition to the flagship Enterprise Edition, which currently includes the components summarized in figure 3, IBM offers two free versions of BigInsights. Basic Edition includes the principal BigInsights open source components, along with database and Web server interfaces, and a simple management console. Quick Start Edition is a near full-function offering restricted to non-production use. It is designed to allow users to evaluate and gain experience with BigInsights enterprise features, and to prototype applications and develop proofs of concept.

Deployment Options

Servers and Storage

IBM offers BigInsights clusters built around IBM System x3550 M4 and x3630 dual-socket x86 servers acting as management and data nodes respectively. Data nodes may be configured with Near Line SAS (NL-SAS) or SATA drives. Configurations are packaged in increments of up to 20, 20 to 50, and 50+ nodes. Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) are supported.

BigInsights may also be deployed on non-IBM x86 servers, and on IBM Power Systems with RHEL or SLES. IBM or third-party arrays may be employed for external storage.

Platform Symphony

IBM offers the option of deploying BigInsights on Platform Symphony. With this approach, Platform Symphony job scheduling and management mechanisms substitute for those of MapReduce, and additional high availability features may be leveraged.

Platform Symphony employs x86-based clusters to support applications requiring extremely high levels of performance and scalability. In principle, configurations of 10,000 or more cores are supported. According to IBM, benchmark tests have demonstrated more than seven times higher performance than open source MapReduce for large-scale social media analytics workloads.

Platform Computing is a longstanding player in HPC for scientific and technical computing, and for commercial applications in financial services, manufacturing, digital media, oil and gas, life sciences and other industries.

GPFS-FPO

GPFS-FPO has been deployed by a number of early BigInsights users in beta mode, and became generally available in Version 2.1.
In HPC applications, GPFS has demonstrated near-linear scalability in extremely large configurations – installations with more than 1,000 nodes are common, and the largest exceed 5,000 nodes. Storage volumes often run to hundreds of terabytes, and there are working petabyte-scale systems.

User experiences, as well as tests run with a variety of HPC benchmarks, have demonstrated significantly higher performance – in some cases by more than 20 times – than HDFS. GPFS also incorporates a distributed metadata structure, policy-driven automated storage tiering, managed high-speed replication, and information lifecycle management (ILM) tooling.

APPLICATION DEVELOPMENT

Social Data Analytics Accelerator – Application suite enabling extraction of social media data, construction of user profiles & association with sentiment, buzz, intent & ownership. Includes customizable tools for brand management, lead generation & other common functions. Pre-integrated options for use with IBM (ex-Unica) Campaign & CCI solutions
Machine Data Analytics Accelerator – Application suite enabling import & aggregation of structured, semi-structured and/or unstructured data from log files, meters, sensors, readers & other machine sources. Provides assists for text, faceted & timeline-based searches, pattern recognition, root cause analysis, chained analysis & other functions
BigSheets – Spreadsheet-like tool for identification, integration & analysis of large volumes of unstructured &/or structured data. Incorporates IBM-developed analytics macros & pattern recognition technology. Highly customizable for individual user requirements
Big SQL – Native SQL query engine allows developers to query Hive, HBase or distributed file system data using standard SQL syntax & Hadoop-optimized SQL extensions. Allows administrators to populate Big SQL tables with data from multiple sources. JDBC & ODBC drivers support many existing SQL query tools
Web application catalog – Includes sample query, data import & export, & test tools designed for “proof of concept” application deployment

INFRASTRUCTURE

Avro – Data serialization & remote procedure call (RPC) framework; defines JSON data schemas
HBase – NoSQL (nonrelational) database incorporating row- & column-based table structures. Based on Google BigTable technology
Hive – Facilitates data extraction, transformation & loading (ETL), & analysis of large HDFS data sets
Jaql – High-level declarative query & scripting language with JSON-based data model & SQL-like interface; processes structured & unstructured data. Originally developed by IBM
Lucene – Text search engine library
Oozie – Workflow scheduler for Hadoop job management; describes job graphs & relationships between these
Pig – Platform for analyzing large data sets; includes high-level language for expressing programs & infrastructure for evaluating them
MapReduce – Parallel programming model for Hadoop clusters
Hadoop Distributed File System (HDFS) – Hadoop distributed file system supporting clusters built around x86-based NameNode (master) & DataNodes. Closely integrated with MapReduce
BigIndex – Implements Hadoop-based indexing as a native InfoSphere BigInsights capability; enables additional complex functions including distributed indexing & faceted search
BigInsights Scheduler – Extension of Hadoop Fair Scheduler; enables policy-based scheduling of MapReduce jobs
Splittable compression – Expanded implementation of Apache Lempel-Ziv-Oberhumer (LZO) algorithm allowing jobs to run against compressed data on multiple mappers
Enhanced security – Includes enhanced authentication, authorization (roles) & auditing functions. Interfaces to IBM InfoSphere Guardium solutions
Integrated Installer – GUI-driven tool allows rapid, automated configuration, installation & assurance of BigInsights clusters. Guided installation features facilitate administrator tasks
Adaptive MapReduce – Platform Symphony technology that accelerates processing of small MapReduce jobs & enables more effective execution of mixed Hadoop workloads
GPFS File Placement Optimizer (GPFS-FPO) – Extension of the IBM General Parallel File System high-performance distributed file system, optimized for use in Hadoop clusters

DATA SOURCES & CONNECTIVITY

BoardReader – Interface to BoardReader search engine enables query access, & data download & import to the BigInsights file system
Web Crawler – Interface to IBM Web Crawler application for Internet data collection & organization
Flume – Facilitates aggregation & integration of large data volumes across Hadoop clusters
R – Enables integration of applications written in the R statistics language
Sqoop – Enables import & export of data between SQL & Hadoop databases
JDBC – Standard Java Database Connectivity interface to DBMS
ODBC – Standard Open Database Connectivity interface to DBMS
MicroStrategy, SAS – Interfaces to widely used third-party analytics tools
Database interfaces – Interfaces to IBM DB2, Oracle Database, Microsoft SQL Server & Teradata Database
IBM data exchanges – Enable exchange of BigInsights data with IBM Cognos Business Intelligence, InfoSphere DataStage ETL tools, InfoSphere Warehouse data warehouse framework, Platform Symphony grid middleware, IBM PureData System for Analytics, SPSS statistical modeling & analysis, & InfoSphere Streams real-time analytics solutions

Legend: IBM / Open Source
Figure 3: IBM InfoSphere BigInsights Components

DETAILED DATA

Composite Profiles

The calculations presented in this report are based upon the six composite profiles shown in figure 4. FTEs refers to numbers of full time equivalent personnel.

Health Care Company
Applications: Health care insurance provider – claims analysis for quality of care recommendations & cost/profitability variables. 80 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (6 months), 5.25; post-production operations, 2.95
Open source FTEs: development & deployment (6 months), 8.25; post-production operations, 4.3

Financial Services Company
Applications: Diversified retail bank – customer sentiment analysis of social media, correspondence & transaction records for loyalty program optimization. Data warehouse interface. 130 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (8 months), 7.5; post-production operations, 3.15
Open source FTEs: development & deployment (10 months), 13.15; post-production operations, 6.0

Retail Company
Applications: Comparative analysis of customer online & in-store purchasing behavior; sources include web logs, point of sale & other data. Predictive analysis for merchandising applications. Data warehouse & decision support interfaces. 200 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (12 months), 11.3; post-production operations, 4.75
Open source FTEs: development & deployment (15 months), 17.0; post-production operations, 8.5

Media Company
Applications: Analysis of web logs for multiple properties to determine usage patterns, customer profiling, tracking ad event activity & identifying new marketing opportunities. 300 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (7 months), 8.4; post-production operations, 3.25
Open source FTEs: development & deployment (9 months), 12.35; post-production operations, 5.0

Marketing Services Company
Applications: Analysis of customer e-mail traffic for demographic & sentiment tracking, campaign management & other applications. 350 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (6 months), 7.55; post-production operations, 2.6
Open source FTEs: development & deployment (8 months), 10.95; post-production operations, 4.5

Telecommunications Company
Applications: Analysis of call detail records (CDRs), Internet & social media activity to identify cross-sell opportunities & improve loyalty program effectiveness. Interface to CIS, data warehouse & operational systems. 500 TB disk storage.
IBM InfoSphere BigInsights FTEs: development & deployment (9 months), 8.85; post-production operations, 3.0
Open source FTEs: development & deployment (12 months), 14.05; post-production operations, 5.5

Figure 4: Composite Profiles

Profiles were constructed using information supplied by 14 companies using open source Hadoop, the same number using BigInsights, and one using both. For each of the industries shown above, comparisons were based on companies of approximately the same size, with generally similar business profiles and applications. Companies were based in the United States (26) and Europe (3).

Companies supplied information on applications; development and deployment times for these; and numbers of FTE personnel for (1) application development and deployment, and (2) ongoing post-production operations. Because job descriptions and titles often varied between companies, numbers of FTEs for equivalent specializations were in some cases estimated by the International Technology Group.

Cost Calculations

Personnel costs were calculated for numbers of FTEs based on the annual salary assumptions shown in figure 5. The same assumptions were employed for use of BigInsights and open source Hadoop tools.

Data scientist (1): $200K
Architect/equivalent (1): $189K
Project manager (1): $154K
Lead developer (1): $147K
Developer (1) (2): $132K
Data specialist (1) (2): $135K
Installation specialist (1): $140K
System administrator (2): $104K

(1) Development & deployment   (2) Post-production operations

Figure 5: FTE Salary Assumptions

Calculations were based on numbers of FTEs for applicable periods. For the health care company, for example, costs were calculated for numbers of development and deployment FTEs for six months, while post-production personnel costs were calculated for 36 – 6 = 30 months. Salaries were increased by 55.48 percent to allow for benefits, bonuses and other per capita costs.
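To illustrate how these assumptions translate into personnel costs, the sketch below applies the loaded-salary method just described. The salaries and the 55.48 percent uplift come from figure 5; the role-level FTE mix is hypothetical (chosen only so that the totals match the health care profile's 5.25 development and 2.95 ongoing FTEs), so the output approximates rather than reproduces the published cost breakdowns.

```python
# Illustrative sketch of the personnel cost method described above ($ thousands).
# Loaded cost = base salary x 1.5548 (benefits, bonuses and other per capita costs),
# prorated for the months each role is engaged. Development & deployment runs for the
# project duration; post-production operations run for the remaining 36 - duration months.
UPLIFT = 1.5548          # 55.48 percent addition for benefits, bonuses, etc.
TOTAL_MONTHS = 36        # three-year analysis period
DEPLOY_MONTHS = 6        # health care profile's development & deployment period

SALARIES = {             # annual base salaries, $ thousands (figure 5)
    "data scientist": 200, "architect": 189, "project manager": 154,
    "lead developer": 147, "developer": 132, "data specialist": 135,
    "installation specialist": 140, "system administrator": 104,
}

# Hypothetical FTE counts, invented for illustration; the report does not publish the role-level split.
DEPLOY_FTES = {"data scientist": 0.5, "architect": 0.5, "project manager": 0.75,
               "lead developer": 1.0, "developer": 1.5, "data specialist": 0.5,
               "installation specialist": 0.5}                      # sums to 5.25 FTEs
ONGOING_FTES = {"developer": 1.25, "data specialist": 0.7, "system administrator": 1.0}  # 2.95 FTEs

def personnel_cost(ftes, months):
    """Sum of loaded salaries, prorated to the number of months worked ($ thousands)."""
    return sum(count * SALARIES[role] * UPLIFT * months / 12 for role, count in ftes.items())

deploy_cost = personnel_cost(DEPLOY_FTES, DEPLOY_MONTHS)
ongoing_cost = personnel_cost(ONGOING_FTES, TOTAL_MONTHS - DEPLOY_MONTHS)
print(f"Development & deployment: ${deploy_cost:,.1f}K")
print(f"Ongoing operations (30 months): ${ongoing_cost:,.1f}K")
```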
Software costs for use of BigInsights were calculated based on IBM pricing per terabyte of disk storage for the capacities shown in figure 4. As BigInsights license fees include one year of software maintenance (SWMA) coverage for no additional charge, support costs are for two years. Calculations allowed for user-reported discounts.

Cost Breakdowns

Breakdowns for individual profiles are shown in figure 6.

COMPANY TYPE ($ thousands)     Health Care   Financial Services     Retail      Media   Marketing Services    Telecom

IBM INFOSPHERE BIGINSIGHTS
Licenses & support                  327.49               439.62     676.34     800.93               747.53     800.93
Personnel
  Development & deployment          596.46             1,125.68   2,526.16   1,097.07               838.97   1,477.74
  Ongoing operations              1,389.21             1,442.26   1,887.53   1,561.21             1,272.80   1,313.61
  Personnel total                 1,985.67             2,567.93   4,413.69   2,658.28             2,111.77   2,791.35
TOTAL                             2,313.16             3,007.55   5,090.03   3,459.21             2,859.30   3,592.28

OPEN SOURCE APACHE HADOOP
Licenses & support                       0                    0          0          0                    0          0
Personnel
  Development & deployment          914.57             2,471.03   4,725.62   2,068.84             1,611.65   3,024.47
  Ongoing operations              2,006.08             2,566.53   2,954.22   2,225.79             1,399.97   2,173.61
  Personnel total                 2,920.65             5,037.56   7,679.84   4,294.63             3,011.62   5,198.08
TOTAL                             2,920.65             5,037.56   7,679.84   4,294.63             3,011.62   5,198.08

Figure 6: Three-year Cost Breakdowns
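The per-company totals above are the basis for the averages cited in the Executive Summary and shown in figure 1. The short sketch below recomputes those averages, and the resulting cost difference, directly from the figure 6 totals.

```python
# Recompute the figure 1 averages from the per-company totals in figure 6 ($ thousands).
biginsights_totals = [2313.16, 3007.55, 5090.03, 3459.21, 2859.30, 3592.28]
open_source_totals = [2920.65, 5037.56, 7679.84, 4294.63, 3011.62, 5198.08]

avg_biginsights = sum(biginsights_totals) / len(biginsights_totals)   # ~3,386.9
avg_open_source = sum(open_source_totals) / len(open_source_totals)   # ~4,690.4
savings = 1 - avg_biginsights / avg_open_source                       # ~0.28

print(f"BigInsights average:  {avg_biginsights:,.1f}")
print(f"Open source average:  {avg_open_source:,.1f}")
print(f"BigInsights costs {savings:.0%} less on average")             # ~28 percent
```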
ABOUT THE INTERNATIONAL TECHNOLOGY GROUP

ITG sharpens your awareness of what’s happening and your competitive edge . . . this could affect your future growth and profit prospects

International Technology Group (ITG), established in 1983, is an independent research and management consulting firm specializing in information technology (IT) investment strategy, cost/benefit metrics, infrastructure studies, deployment tactics, business alignment and financial analysis.

ITG was an early innovator and pioneer in developing total cost of ownership (TCO) and return on investment (ROI) processes and methodologies. In 2004, the firm received a Decade of Education Award from the Information Technology Financial Management Association (ITFMA), the leading professional association dedicated to education and advancement of financial management practices in end-user IT organizations.

The firm has undertaken more than 120 major consulting projects, released more than 250 management reports and white papers, and delivered more than 1,800 briefings and presentations to individual clients, user groups, industry conferences and seminars throughout the world.

Client services are designed to provide factual data and reliable documentation to assist in the decision-making process. Information provided establishes the basis for developing tactical and strategic plans. Important developments are analyzed and practical guidance is offered on the most effective ways to respond to changes that may impact complex IT deployment agendas.

A broad range of services is offered, furnishing clients with the information necessary to complement their internal capabilities and resources. Customized client programs involve various combinations of the following deliverables:

Status Reports – In-depth studies of important issues
Management Briefs – Detailed analysis of significant developments
Management Briefings – Periodic interactive meetings with management
Executive Presentations – Scheduled strategic presentations for decision-makers
Email Communications – Timely replies to informational requests
Telephone Consultation – Immediate response to informational needs

Clients include a cross section of IT end users in the private and public sectors representing multinational corporations, industrial companies, financial institutions, service organizations, educational institutions, and federal and state government agencies, as well as IT system suppliers, software vendors and service firms. Federal government clients have included agencies within the Department of Defense (e.g., DISA), Department of Transportation (e.g., F

International Technology Group
609 Pacific Avenue, Suite 102
Santa Cruz, California 95060-4406
Telephone: 831-427-9260
Email: [email protected]
Website: ITGforInfo.com