August 2013

MANAGEMENT BRIEF

Business Case for Enterprise Big Data Deployments
Comparing Costs, Benefits and Risks for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop
International Technology Group
609 Pacific Avenue, Suite 102
Santa Cruz, California 95060-4406
Telephone: 831-427-9260
Email: [email protected]
Website: ITGforInfo.com
TABLE OF CONTENTS

EXECUTIVE SUMMARY
    Challenges and Solutions
    Open Source
    IBM InfoSphere BigInsights Differentiators
    Conclusions

SOLUTION SET
    Overview
    Deployment Options
        Servers and Storage
        Platform Symphony
        GPFS-FPO

DETAILED DATA
    Composite Profiles
    Cost Calculations
    Cost Breakdowns

List of Figures
    1. Three-year Costs for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop for Major Applications – Averages for All Installations
    2. IBM InfoSphere BigInsights Environment
    3. IBM InfoSphere BigInsights Components
    4. Composite Profiles
    5. FTE Salary Assumptions
    6. Three-year Cost Breakdowns
Copyright © 2013 by the International Technology Group. All rights reserved. Material, in whole or part, contained in this document may not be
reproduced or distributed by any means or in any form, including original, without the prior written permission of the International Technology
Group (ITG). Information has been obtained from sources assumed to be reliable and reflects conclusions at the time. This document was
developed with International Business Machines Corporation (IBM) funding. Although the document may utilize publicly available material from
various sources, including IBM, it does not necessarily reflect the positions of such sources on the issues addressed in this document. Material
contained and conclusions presented in this document are subject to change without notice. All warranties as to the accuracy, completeness or
adequacy of such material are disclaimed. There shall be no liability for errors, omissions or inadequacies in the material contained in this
document or for interpretations thereof. Trademarks included in this document are the property of their respective owners.
EXECUTIVE SUMMARY
Challenges and Solutions
What more can be said on the subject of big data? A great deal, it turns out.
Industry debate tends to focus on the role which big data analytics may play in transforming business
decision-making and organizational competitiveness. The impact on these will clearly be transformative.
But there is a downside. Bottlenecks are emerging that may seriously delay realization of the potential of
big data in many, perhaps most organizations.
This is particularly the case for the complex of technologies that has developed around Apache Hadoop.
As use of Hadoop-based systems has spread beyond social media companies, users have found that
developer productivity is often poor – Hadoop requires a great deal of manual coding – and that skills
shortages slow new project starts and magnify deployment times and costs.
Hadoop specialists have become among the highest-paid in the IT world. In the United States, for
example, starting salaries for Hadoop developers are routinely over $100,000, and salaries for managers,
data scientists, architects and other high-level specializations often top $200,000. Worldwide, Hadoop
compensation is trending rapidly upward, and is expected to continue doing so for the foreseeable future.
The combination of low developer productivity and high salary levels makes for poor economics. This is,
it is commonly argued, counterbalanced by the fact that most Hadoop components are open sourced, and
may be downloaded free of charge. But overall costs may not necessarily be lower than for vendor-managed Hadoop distributions that enable more cost-effective development and deployment.
This may be illustrated by comparisons of three-year costs for use of open source Hadoop and the IBM
BigInsights Hadoop distribution for representative high-impact applications in six companies. Overall
costs averaged 28 percent less for use of IBM BigInsights.
These comparisons, whose results are summarized in figure 1, include software licenses, support and
personnel costs for use of BigInsights, and personnel costs only for use of open source Hadoop software.
Figure 1: Three-year Costs for Use of IBM InfoSphere BigInsights and Open Source Apache Hadoop for Major Applications – Averages for All Installations ($ thousands: IBM InfoSphere BigInsights, 3,386.9; Open Source Apache Hadoop, 4,690.4; comprising licenses & support, development & deployment personnel, and ongoing personnel)
Personnel costs are for initial application development and deployment, as well as for ongoing post-production operations over a three-year period.
Calculations include data scientists, architects, project managers, developers, and data and installation
specialists for initial development and deployment; and developers, data specialists and system
administrators for post-production operations. BigInsights costs also include licenses and support.
Comparisons are for composite profiles of financial services, health care, marketing services, media, retail
and telecommunications companies. Profiles were constructed based on information supplied by 29
organizations employing BigInsights, open source tools or combinations of these.
Further information on profiles, methodology and assumptions employed for calculations, along with cost breakdowns for individual companies, may be found in the Detailed Data section of this report.
Open Source
Hadoop, in its open source form, is based largely on technologies developed by Google in the early
2000s. The earliest and, to date, largest Hadoop users have been social media and e-commerce
companies. In addition to Google itself, these have included Amazon.com, AOL, eBay, Facebook,
LinkedIn, Twitter, Yahoo and international equivalents.
Although the field of players has since expanded to include hundreds of venture capital-funded start-ups,
along with established systems and services vendors and large end users, social media businesses
continue to control Hadoop. Most of the more than one billion lines of code – more than 90 percent,
according to some estimates – in the Apache Hadoop stack have to date been contributed by these companies.
The priorities of this group have inevitably influenced Hadoop evolution. There tends to be an assumption
that Hadoop developers are highly skilled, capable of working with “raw” open source code and
configuring software components on a case-by-case basis as needs change. Manual coding is the norm.
Decades of experience have shown that, regardless of which technologies are employed, manual coding
offers lower developer productivity and greater potential for errors than more sophisticated techniques.
Continuous updates may be required as business needs and data sources change. Over time, complex,
inadequately documented masses of “spaghetti code” may be generated that are expensive to maintain and
enhance. These problems have routinely affected organizations employing legacy mainframe applications.
There is no sense in repeating them with new generations of software technology.
Other issues have also emerged. These include:
• Stability. Hadoop and the open source software stack surrounding it are currently defined and
enhanced through at least 25 separate Apache Foundation projects, sub-projects and incubators.
More can be expected as the scope of the Hadoop environment expands, and new tools and
technologies emerge.
Initiatives tend to move at different speeds, and release dates are at best loosely coordinated.
Developers are exposed to a continuous stream of changes.
Instability can be a significant challenge. It becomes more difficult to plan technology strategies,
project schedules and costs become less predictable, and risks of project failure increase. The
potential for future interoperability problems also expands.
The Apache stack may, moreover, evolve in an unpredictable manner. Organizations may
standardize upon individual components only to find that these receive declining attention over
time. The pace of technology change among social media companies is a great deal faster than
users in most other industries are accustomed to.
• Interoperability. Few, if any, Hadoop-based systems are “standalone” in the sense that they do not
require interoperability with other applications and databases.
All of the organizations surveyed for this report, for example, employed or planned to employ
interfaces to relational databases, data warehouses, conventional analytics tools, query and
reporting intranets and/or CRM and back-end systems. This was the case even for “pure play”
suppliers of Hadoop-based services.
Interoperability requirements were particularly significant among financial services, health care,
insurance, retail and telecommunications companies. One large banking institution reported, for
example, that it expected to implement 40 to 50 different interfaces before its Hadoop-based
system could be brought into full operation.
• Resiliency. The open source Hadoop stack includes a variety of mechanisms designed to maintain
availability, and enable failover and recovery in the event of unplanned (i.e., accidental) outages
as well as planned downtime for software modifications, scheduled maintenance and other tasks.
These mechanisms are, however, a great deal less mature than is the case for conventional
business-critical systems. They are also, in environments characterized by numerous hand-configured components, a great deal more complex and error-prone. Vulnerabilities are magnified
when systems undergo frequent changes.
Major social media companies have often realized high levels of availability. This has, however,
typically required expensive investments to harden software, ensure redundancy and provide in-depth operational monitoring and response staff and procedures.
• Manageability. Open source Hadoop limitations have emerged in such areas as configuration and
installation, monitoring, job scheduling, workload management, tuning and availability and
security administration. Although some open source components address these issues, they are a
comparatively low priority for most Apache contributors.
Users may, to some extent, compensate for these limitations by “labor-intensive” management
practices. This approach not only translates into higher personnel costs, but is also less reliable.
Open source manageability limitations may not be visible during application development and
deployment. However, they will be reflected in higher ongoing full time equivalent (FTE) system
administration staffing and may impact post-production quality of service.
• Support. Open source software is available only with community support – i.e., users rely upon
online peer forums for enhancements, technical advice and problem resolution. This approach
may prove appropriate for commonly encountered issues, although it is dependent on the
willingness of others to share their time and experience. It has proved to be a great deal less
reliable in dealing with organization-specific configuration issues.
The bottom-line implications may be substantial. Delays in resolving problems may undermine
developer productivity, and may result in application errors, performance bottlenecks, outages,
data loss and other negative effects.
As Hadoop deployments have grown, these issues have led to the appearance of vendor-managed, fee-based distributions that include enhanced tools and functions, and offer more effective customer support.
Current examples include Amazon Elastic MapReduce (Amazon EMR) web service; Cloudera’s
Distribution Including Apache Hadoop (CDH); EMC’s Pivotal HD; Hortonworks Data Platform; IBM
InfoSphere BigInsights; Intel Distribution for Apache Hadoop (Intel Distribution); and MapR M series.
IBM InfoSphere BigInsights Differentiators
The BigInsights environment currently includes the components shown in figure 2.
[Figure 2 depicts the BigInsights environment as a stack of IBM and open source components: Applications (20+ prebuilt applications); Visualization & Discovery (BigSheets, dashboard & visualization); Administration (admin console, monitoring); Enablers (Social Data Analytics Accelerator, Machine Data Analytics Accelerator, Eclipse-based toolkits, web application catalog, REST API, Big SQL, application framework); Infrastructure (Avro, HBase, Hive, Jaql, Lucene, MapReduce, Flexible Scheduler, BigIndex, Adaptive MapReduce, Oozie, Pig, ZooKeeper, HDFS, splittable compression, enhanced security, Integrated Installer, high availability, GPFS-FPO); and Data Sources & Connectivity (BoardReader, Web Crawler, Cognos, SPSS, InfoSphere Data Explorer, InfoSphere DataStage, Flume, MicroStrategy, SAS, InfoSphere Optim, R, DB2, Oracle, SQL Server, Teradata, InfoSphere Warehouse, JDBC, ODBC, Sqoop, IBM PureData System for Analytics, Platform Computing, InfoSphere Streams, InfoSphere Guardium).]
Figure 2: IBM InfoSphere BigInsights Environment
Although this environment includes a full Apache Hadoop stack, it is differentiated by numerous IBM
components that address the issues outlined above. In BigInsights Version 2.1, which became available in
June 2013, these may be summarized as follows:
• Visualization and discovery tools include IBM BigSheets, a highly customizable end user
analytical solution for identification, integration and exploration of unstructured and/or structured
data patterns. It employs a spreadsheet-like interface, but is more sophisticated than conventional
spreadsheets, and is not limited in the amount of data it can address.
• Administration tools include a Web-based administrative console providing a common, high-productivity interface for monitoring, health checking and management of all application and
infrastructure components. Integrated Installer automates configuration and installation tasks for
all components.
• Development tools include Eclipse-based toolkits supporting the principal Hadoop development
tools and languages, as well as a Web application catalog that includes ad hoc query, data import
and export, and test applications designed for rapid prototyping.
• Accelerators for social media and machine data analytics include prebuilt templates and
components for a range of industry- and application-specific functions. Accelerators were
developed based on customer experiences, and have materially improved “time to value” for
development and deployment of Hadoop-based applications. More can be expected in the future.
Text analytic capabilities are incorporated as a standard feature of BigInsights. The social media
and machine accelerators include custom text extractors for their respective application domains.
• Big SQL, introduced in BigInsights 2.1, is a native SQL query engine. It allows developers to leverage existing SQL skills and tools to query Hive, HBase or distributed file system data. Developers may use standard SQL syntax and, in some cases, IBM-supplied SQL extensions optimized for use with Hadoop. (A brief illustrative query appears after this list of differentiators.)
Big SQL offers an alternative to the SQL-like HiveQL, a Hive extension developed by Facebook.
Big SQL is easier to use and better aligned with mainstream SQL development tools and
techniques. It also incorporates features – which are not found in native HiveQL – that can
improve runtime performance for certain applications and workloads.
This approach is likely to see widespread adoption. While skilled Hadoop specialists are still
comparatively rare, SQL has been in widespread use since the 1980s. There are believed to be
over four million developers worldwide familiar with this language. Most large organizations
have longstanding investments in SQL skill sets, and SQL-based applications and tools.
• Infrastructure enhancements are provided in such areas as large-scale indexing (BigIndex), job
scheduling (BigInsights Scheduler), administration and monitoring tools, splittable text
compression and security.
BigInsights supports Adaptive MapReduce, which exploits IBM workload management
technology in Platform Symphony. Adaptive MapReduce allows smaller MapReduce jobs to be
executed more efficiently, and enables more effective, lower-overhead management of mixed
workloads than open source MapReduce.
• Platform Symphony is a high-performance grid middleware solution originally developed by
Platform Computing, which was acquired by IBM in 2012. In BigInsights, it can be used to
replace the open source MapReduce layer while allowing MapReduce jobs to be created in the
same manner. Customers may choose which to install.
• IBM General Parallel File System – File Placement Optimizer (GPFS-FPO) is a Hadoop-optimized implementation of the IBM GPFS distributed file system that offers an alternative to HDFS. For more than a decade, GPFS has been widely deployed for scientific and technical computing, as well as for a wide range of commercial applications.
In addition to offering higher performance, GPFS enables higher cluster availability, and provides more effective system management, snapshot copying, failover and recovery, and security
than HDFS. (IBM is not alone in adopting this approach. There has been a growing trend among
Hadoop users toward use of HDFS alternatives such as MapR file system, Cassandra and Lustre.)
• High availability features include enhanced HDFS NameNode failover. The IBM implementation
enables seamless and transparent failover. The process is automatic – no administrator
intervention is required – and occurs more rapidly and reliably than in a conventional open source
environment. More sophisticated features are offered by Platform Symphony.
• Interoperability tools conform to a wide range of industry standards and/or are designed to
integrate with key IBM and third-party databases and application solutions.
Interfaces are provided to commonly used open source software; JDBC- and ODBC-compliant
tools; IBM DB2, Oracle, Microsoft SQL Server and Teradata databases; and key IBM solutions
forming part of the company’s Big Data Platform.
These include the InfoSphere Information Warehouse data warehouse framework; Cognos
business intelligence; SPSS statistical modeling and analysis; InfoSphere DataStage extract,
transformation and load (ETL) tooling; InfoSphere Guardium for enterprise security
management; Platform Symphony high-performance grid middleware; and the IBM PureData
System for Analytics appliance.
The Web Crawler application automates Internet searches and collects data based on user-defined
criteria. Data may be imported into BigSheets.
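To illustrate the Big SQL point above about reusing mainstream SQL skills and tools, the sketch below submits a standard SQL query from Python through an ODBC connection of the kind Big SQL's drivers support. It is a minimal illustration only: the DSN name, credentials, and the web_sales_logs table and its columns are hypothetical placeholders, not part of the product.

    # Minimal sketch: running a standard SQL query against BigInsights data via ODBC.
    # The DSN ("BIGSQL"), user ID, password and table/column names below are
    # hypothetical placeholders; substitute the values configured for an actual cluster.
    import pyodbc

    conn = pyodbc.connect("DSN=BIGSQL;UID=biadmin;PWD=secret")
    cursor = conn.cursor()

    # Ordinary SQL syntax against a (hypothetical) Hive-managed table --
    # no manual MapReduce coding is required of the developer.
    cursor.execute(
        "SELECT region, COUNT(*) AS events, SUM(amount) AS revenue "
        "FROM web_sales_logs "
        "WHERE event_date >= '2013-01-01' "
        "GROUP BY region "
        "ORDER BY revenue DESC"
    )

    for region, events, revenue in cursor.fetchall():
        print(region, events, revenue)

    cursor.close()
    conn.close()

Because the query is plain SQL submitted through a standard driver, existing reporting and analysis tools that speak JDBC or ODBC can be pointed at the same data with little or no rework.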
BigInsights is compatible with, and is often used alongside, IBM InfoSphere Streams for real-time big
data analytics. This solution is architecturally comparable to open source Storm, but contains numerous
IBM enhancements for development productivity, manageability, resiliency and interoperability.
BigInsights contains a limited-use InfoSphere Streams license.
BigInsights capabilities are evolving rapidly. IBM has committed to integrating new open source
components as these emerge, and the company is known to be working on a variety of other functional
enhancements.
Conclusions
Use of Hadoop is still at an early stage. Apart from a handful of major social media companies, most
Hadoop deployments have occurred over the last two years. As industry surveys have shown, many are
still not in production.
Adoption, however, is expanding rapidly, and it is clear that Big Data will become a central feature of IT
landscapes in most organizations. As this occurs, technology stacks and deployment patterns will
inevitably change. It can be expected that, as with previous waves of open source technology, the Hadoop
market will become more segmented, and solution offerings will become more diverse.
Enterprise users – a category that will probably include many midsize businesses as well as start-ups –
will inevitably move to more productive, resilient, vendor-supported distributions.
Many organizations will also move toward converged Hadoop and SQL environments, applying SQL
skill bases and application portfolios to new Big Data challenges. There is also a widespread move toward
augmenting SQL-based data warehouses with subsets or aggregations of Hadoop data.
These trends will increasingly leverage broader IBM differentiators. These include long established
company strengths in software engineering (BigInsights components are not only pre-integrated, but also
extensively tested for optimum performance and functional transparency), customization (the ability of
IBM services organizations to deliver industry- and organization-specific solutions has already emerged
as a major source of BigInsights appeal) and customer support.
IBM, moreover, has decades of experience with relational technology and data warehousing. The
company’s SQL strengths exceed – by wide margins – those of any other Hadoop distributor, and its
systems integration capabilities are among the world’s best.
As in other areas of its software business, the company has moved aggressively to recruit and support
business partners. These currently include more than 300 independent software vendor (ISV) and services
firms, including suppliers of a wide range of complementary tools and industry-specific solutions. The
number is expanding rapidly.
The Hadoop open source community, no doubt, will remain vibrant, and use of free downloads will
continue to expand. But a distinct category of enterprise solutions will clearly emerge, and these will
be more strongly focused on development productivity, stability, resilience, manageability, system
integration and in-depth customer support.
For organizations that expect to move toward the enterprise paradigm, it may make sense to deploy IBM
BigInsights sooner rather than later.
SOLUTION SET
Overview
In its present form, BigInsights includes the principal components of Apache Hadoop and related
projects, along with IBM enhancements described earlier. BigInsights is offered by IBM as a licensed
software product, and through IBM SmartCloud Enterprise and third-party cloud service providers.
In addition to the flagship Enterprise Edition, which currently includes the components summarized in
figure 3, IBM offers two free versions of BigInsights.
Basic Edition includes the principal BigInsights open source components, along with database and Web
server interfaces, and a simple management console. Quick Start Edition is a near full-function offering
restricted to non-production use. It is designed to allow users to evaluate and gain experience with
BigInsights enterprise features, and to prototype applications and develop proofs of concept.
Deployment Options
Servers and Storage
IBM offers BigInsights clusters built around IBM System x3550 M4 and x3630 dual-socket x86 servers
acting as management and data nodes respectively. Data nodes may be configured with Near Line SAS
(NL-SAS) or SATA drives. Configurations are packaged in increments of up to 20, 20 to 50 and 50+
nodes. Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) are supported.
BigInsights may also be deployed on non-IBM x86 servers, and on IBM Power Systems with RHEL or
SLES. IBM or third-party arrays may be employed for external storage.
Platform Symphony
IBM offers the option of deploying BigInsights on Platform Symphony. With this approach, Platform
Symphony job scheduling and management mechanisms substitute for those of MapReduce, and
additional high availability features may be leveraged.
Platform Symphony employs x86-based clusters to support applications requiring extremely high levels of
performance and scalability. In principle, configurations of 10,000 or more cores are supported.
According to IBM, benchmark tests have demonstrated more than seven times higher performance than
open source MapReduce for large-scale social media analytics workloads.
Platform Computing is a longstanding player in HPC for scientific and technical computing, and for
commercial applications in financial services, manufacturing, digital media, oil and gas, life sciences and
other industries.
GPFS-FPO
GPFS-FPO has been deployed by a number of early BigInsights users in beta mode, and became
generally available in Version 2.1.
In HPC applications, GPFS has demonstrated near-linear scalability in extremely large configurations –
installations with more than 1,000 nodes are common, and the largest exceed 5,000 nodes. Storage
volumes often run to hundreds of terabytes, and there are working petabyte-scale systems.
User experiences, as well as tests run with a variety of HPC benchmarks, have demonstrated significantly
higher performance – in some cases by more than 20 times – than HDFS.
GPFS also incorporates a distributed metadata structure, policy-driven automated storage tiering,
managed high-speed replication, and information lifecycle management (ILM) tooling.
APPLICATION DEVELOPMENT
Social Data Analytics Accelerator: Application suite enabling extraction of social media data, construction of user profiles & association with sentiment, buzz, intent & ownership. Includes customizable tools for brand management, lead generation & other common functions. Pre-integrated options for use with IBM (ex-Unica) Campaign & CCI solutions
Machine Data Analytics Accelerator: Application suite enabling import & aggregation of structured, semi-structured and/or unstructured data from log files, meters, sensors, readers & other machine sources. Provides assists for text, faceted & timeline-based searches, pattern recognition, root cause analysis, chained analysis & other functions
BigSheets: Spreadsheet-like tool for identification, integration & analysis of large volumes of unstructured &/or structured data. Incorporates IBM-developed analytics macros & pattern recognition technology. Highly customizable for individual user requirements
Big SQL: Native SQL query engine allows developers to query Hive, HBase or distributed file system data using standard SQL syntax & Hadoop-optimized SQL extensions. Allows administrators to populate Big SQL tables with data from multiple sources. JDBC & ODBC drivers support many existing SQL query tools
Web application catalog: Includes sample query, data import & export, & test tools designed for “proof of concept” application deployment

INFRASTRUCTURE
Avro: Data serialization & remote procedure call (RPC) framework; defines JSON data schemas
HBase: NoSQL (nonrelational) database incorporating row- & column-based table structures. Based on Google BigTable technology
Hive: Facilitates data extraction, transformation & loading (ETL), & analysis of large HDFS data sets
Jaql: High-level declarative query & scripting language with JSON-based data model & SQL-like interface; processes structured & unstructured data. Originally developed by IBM
Lucene: Text search engine library
Oozie: Workflow scheduler for Hadoop job management; describes job graphs & relationships between these
Pig: Platform for analyzing large data sets; includes a high-level language for expressing analysis programs & infrastructure for evaluating them
MapReduce: Parallel programming model for Hadoop clusters
Hadoop Distributed File System (HDFS): Distributed file system supporting clusters built around x86-based NameNode (master) & DataNodes. Closely integrated with MapReduce
BigIndex: Implements Hadoop-based indexing as a native InfoSphere BigInsights capability; enables additional complex functions including distributed indexing & faceted search
BigInsights Scheduler: Extension of Hadoop Fair Scheduler; enables policy-based scheduling of MapReduce jobs
Splittable compression: Expanded implementation of Apache Lempel-Ziv-Oberhumer (LZO) algorithm allowing jobs against compressed data to run on multiple mappers
Enhanced security: Includes enhanced authentication, authorization (roles) & auditing functions. Interfaces to IBM InfoSphere Guardium solutions
Integrated Installer: GUI-driven tool allows rapid, automated configuration, installation & assurance of BigInsights clusters. Guided installation features facilitate administrator tasks
Adaptive MapReduce: Platform Symphony technology that accelerates processing of small MapReduce jobs & enables more effective execution of mixed Hadoop workloads
GPFS File Placement Optimizer (GPFS-FPO): Extension of IBM General Parallel File System high-performance distributed file system optimized for use in Hadoop clusters

Figure 3: IBM InfoSphere BigInsights Components
DATA SOURCES & CONNECTIVITY
BoardReader: Interface to BoardReader search engine; enables query access, & data download & import to BigInsights file system
Web Crawler: Interface to IBM Web Crawler application for Internet data collection & organization
Flume: Facilitates aggregation & integration of large data volumes across Hadoop clusters
R: Enables integration of applications written in R statistics language
Sqoop: Enables import & export of data between SQL & Hadoop databases
JDBC: Standard Java Database Connectivity interface to DBMS
ODBC: Standard Open Database Connectivity interface to DBMS
MicroStrategy, SAS: Interfaces to widely used third-party analytics tools
Database interfaces: Interfaces to IBM DB2, Oracle Database, Microsoft SQL Server & Teradata Database
IBM data exchanges: Enable exchange of BigInsights data with IBM Cognos Business Intelligence, InfoSphere DataStage ETL tools, InfoSphere Warehouse data warehouse framework, Platform Symphony grid middleware, IBM PureData System for Analytics, SPSS statistical modeling & analysis, & InfoSphere Streams real-time analytics solutions

(In the original figure, a legend distinguishes IBM components from open source components.)

Figure 3 (cont.): IBM InfoSphere BigInsights Components
DETAILED DATA
Composite Profiles
The calculations presented in this report are based upon the six composite profiles shown in figure 4.
FTEs refers to numbers of full time equivalent personnel.
Health Care Company
    Applications: Health care insurance provider – claims analysis for quality of care recommendations & cost/profitability variables. 80 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (6 months): 5.25; Post-production operations: 2.95
    OPEN SOURCE FTEs: Development & deployment (6 months): 8.25; Post-production operations: 4.3

Financial Services Company
    Applications: Diversified retail bank – customer sentiment analysis of social media, correspondence & transaction records for loyalty program optimization. Data warehouse interface. 130 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (8 months): 7.5; Post-production operations: 3.15
    OPEN SOURCE FTEs: Development & deployment (10 months): 13.15; Post-production operations: 6.0

Retail Company
    Applications: Comparative analysis of customer online & in-store purchasing behavior. Sources include web logs, point of sale & other data. Predictive analysis for merchandising applications. Data warehouse & decision support interfaces. 200 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (12 months): 11.3; Post-production operations: 4.75
    OPEN SOURCE FTEs: Development & deployment (15 months): 17.0; Post-production operations: 8.5

Media Company
    Applications: Analysis of web logs for multiple properties to determine usage patterns, customer profiling, tracking ad event activity & identifying new marketing opportunities. 300 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (7 months): 8.4; Post-production operations: 3.25
    OPEN SOURCE FTEs: Development & deployment (9 months): 12.35; Post-production operations: 5.0

Marketing Services Company
    Applications: Analysis of customer e-mail traffic for demographic & sentiment tracking, campaign management & other applications. 350 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (6 months): 7.55; Post-production operations: 2.6
    OPEN SOURCE FTEs: Development & deployment (8 months): 10.95; Post-production operations: 4.5

Telecommunications Company
    Applications: Analysis of call detail records (CDRs), Internet & social media activity to identify cross-sell opportunities & improve loyalty program effectiveness. Interface to CIS, data warehouse & operational systems. 500 TB disk storage
    IBM INFOSPHERE BIGINSIGHTS FTEs: Development & deployment (9 months): 8.85; Post-production operations: 3.0
    OPEN SOURCE FTEs: Development & deployment (12 months): 14.05; Post-production operations: 5.5

Figure 4: Composite Profiles
Profiles were constructed using information supplied by 14 companies using open source Hadoop, the
same number using BigInsights, and one using both. For each of the industries shown above, comparisons
were based on companies of approximately the same size, with generally similar business profiles and
applications. Companies were based in the United States (26) and Europe (3).
Companies supplied information on applications; development and deployment times for these; and
numbers of FTE personnel for (1) application development and deployment, and (2) ongoing post-production operations. Because job descriptions and titles often varied between companies, numbers of
FTEs for equivalent specializations were in some cases estimated by the International Technology Group.
Cost Calculations
Personnel costs were calculated for numbers of FTEs based on the annual salary assumptions shown in
figure 5. The same assumptions were employed for use of BigInsights and open source Hadoop tools.
SPECIALIZATION                   SALARY     SPECIALIZATION                   SALARY
Data scientist (1)               $200K      Data specialist (1) (2)          $132K
Architect/equivalent (1)         $189K      Installation specialist (1)      $140K
Project manager (1)              $154K      System administrator (2)         $104K
Lead developer (1)               $147K
Developer (1) (2)                $135K

(1) Development & deployment   (2) Post-production operations

Figure 5: FTE Salary Assumptions
Calculations were based on numbers of FTEs for applicable periods. For the health care company, for
example, costs were calculated for numbers of development and deployment FTEs for six months, while
post-production personnel costs were calculated for 36 – 6 = 30 months. Salaries were increased by 55.48
percent to allow for benefits, bonuses and other per capita costs.
Software costs for use of BigInsights were calculated based on IBM pricing per terabyte of disk storage for the capacities shown in figure 4. As BigInsights license fees include one year of software maintenance
(SWMA) coverage for no additional charge, support costs are for two years. Calculations allowed for
user-reported discounts.
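The method described above can be summarized in a few lines of arithmetic. The sketch below is a minimal illustration, assuming hypothetical FTE counts, blended salaries, per-terabyte license and support fees and a discount; it applies the stated formulas (FTEs x applicable months x loaded salary, plus per-terabyte licenses with one bundled year of support) but does not reproduce the inputs used for the composite profiles.

    # Illustrative three-year cost calculation following the method described above.
    # All input values are hypothetical placeholders, not the report's actual inputs.

    BENEFITS_UPLIFT = 1.5548   # salaries increased by 55.48% for benefits, bonuses, etc.
    ANALYSIS_MONTHS = 36       # three-year analysis period

    def personnel_cost(ftes, months, avg_salary):
        """Loaded cost of a pool of FTEs employed for the given number of months."""
        return ftes * (months / 12.0) * avg_salary * BENEFITS_UPLIFT

    # Hypothetical profile: six-month deployment followed by 30 months of operations.
    deploy_months = 6
    development = personnel_cost(ftes=6.0, months=deploy_months, avg_salary=150_000)
    operations = personnel_cost(ftes=3.0, months=ANALYSIS_MONTHS - deploy_months,
                                avg_salary=120_000)

    # Hypothetical BigInsights licensing: priced per terabyte, with the first year of
    # support (SWMA) bundled, so two further years are charged, less a discount.
    storage_tb = 100
    license_per_tb = 3_000
    support_per_tb_year = 600
    discount = 0.20
    licensing = storage_tb * (license_per_tb + 2 * support_per_tb_year) * (1 - discount)

    total = development + operations + licensing
    print(f"Three-year total: ${total:,.0f}")

For the open source cases, the same personnel formula applies but the licensing term is zero, which is why the comparisons in figure 6 show no license or support costs on the open source side.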
Cost Breakdowns
Breakdowns for individual profiles are shown in figure 6.
COMPANY TYPE ($ thousands)        Health      Financial                              Marketing
                                  Care        Services    Retail      Media          Services    Telecom

IBM INFOSPHERE BIGINSIGHTS
Licenses & support                327.49      439.62      676.34      800.93         747.53      800.93
Personnel
  Development & deployment        596.46      1,125.68    2,526.16    1,097.07       838.97      1,477.74
  Ongoing operations              1,389.21    1,442.26    1,887.53    1,561.21       1,272.80    1,313.61
  Personnel total                 1,985.67    2,567.93    4,413.69    2,658.28       2,111.77    2,791.35
TOTAL                             2,313.16    3,007.55    5,090.03    3,459.21       2,859.30    3,592.28

OPEN SOURCE APACHE HADOOP
Licenses & support                0           0           0           0              0           0
Personnel
  Development & deployment        914.57      2,471.03    4,725.62    2,068.84       1,611.65    3,024.47
  Ongoing operations              2,006.08    2,566.53    2,954.22    2,225.79       1,399.97    2,173.61
  Personnel total                 2,920.65    5,037.56    7,679.84    4,294.63       3,011.62    5,198.08
TOTAL                             2,920.65    5,037.56    7,679.84    4,294.63       3,011.62    5,198.08

Figure 6: Three-year Cost Breakdowns
ABOUT THE INTERNATIONAL TECHNOLOGY GROUP
ITG sharpens your awareness of what’s happening and your competitive edge
. . . this could affect your future growth and profit prospects
International Technology Group (ITG), established in 1983, is an independent research and management
consulting firm specializing in information technology (IT) investment strategy, cost/benefit metrics,
infrastructure studies, deployment tactics, business alignment and financial analysis.
ITG was an early innovator and pioneer in developing total cost of ownership (TCO) and return on
investment (ROI) processes and methodologies. In 2004, the firm received a Decade of Education Award
from the Information Technology Financial Management Association (ITFMA), the leading professional
association dedicated to education and advancement of financial management practices in end-user IT
organizations.
The firm has undertaken more than 120 major consulting projects, released more than 250 management reports and white papers, and delivered more than 1,800 briefings and presentations to individual clients, user groups, industry conferences and seminars throughout the world.
Client services are designed to provide factual data and reliable documentation to assist in the decision-making process. Information provided establishes the basis for developing tactical and strategic plans.
Important developments are analyzed and practical guidance is offered on the most effective ways to
respond to changes that may impact complex IT deployment agendas.
A broad range of services is offered, furnishing clients with the information necessary to complement
their internal capabilities and resources. Customized client programs involve various combinations of the
following deliverables:
Status Reports: In-depth studies of important issues
Management Briefs: Detailed analysis of significant developments
Management Briefings: Periodic interactive meetings with management
Executive Presentations: Scheduled strategic presentations for decision-makers
Email Communications: Timely replies to informational requests
Telephone Consultation: Immediate response to informational needs
Clients include a cross section of IT end users in the private and public sectors representing
multinational corporations, industrial companies, financial institutions, service organizations,
educational institutions, federal and state government agencies as well as IT system suppliers,
software vendors and service firms. Federal government clients have included agencies within
the Department of Defense (e.g., DISA), Department of Transportation (e.g., F
International Technology Group
609 Pacific Avenue, Suite 102
Santa Cruz, California 95060-4406
Telephone: 831-427-9260
Email: [email protected]
Website: ITGforInfo.com