PwC Technology Forecast
2014 Issue 1
Rethinking integration: Emerging patterns from cloud computing leaders

Contents

Features

Data lake: The enterprise data lake: Better integration and deeper analytics
Microservices architecture (MSA): Microservices: The resurgence of SOA principles and an alternative to the monolith
Linux containers and Docker: Containers are redefining application-infrastructure integration
Zero integration: Zero-integration technologies and their role in transformation

Related interviews

Mike Lang, CEO of Revelytix, on how companies are using data lakes
Dale Sanders, SVP at Health Catalyst, on agile data warehousing in healthcare
John Pritchard, Director of platform services at Adobe, on agile coding in the software industry
Richard Rodger, CTO of nearForm, on the advantages of microservices architecture
Sam Ramji, VP of Strategy at Apigee, on integration trends and the bigger picture
Ben Golub, CEO of Docker, on the outlook for Linux containers

Technology Forecast: Rethinking integration, Issue 1, 2014

The enterprise data lake: Better integration and deeper analytics
By Brian Stein and Alan Morrison

Data lakes that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.

Data lakes: An emerging approach to cloud-based big data

Enterprises across industries are starting to extract and place data for analytics into a single, Hadoop-based repository.

UC Irvine Medical Center maintains millions of records for more than a million patients, including radiology images and other semistructured reports, unstructured physicians' notes, plus volumes of spreadsheet data. To solve the challenge the hospital faced with data storage, integration, and accessibility, the hospital created a data lake based on a Hadoop architecture, which enables distributed big data processing by using broadly accepted open software standards and massively parallel commodity hardware.

Hadoop allows the hospital's disparate records to be stored in their native formats for later parsing, rather than forcing all-or-nothing integration up front as in a data warehousing scenario. Preserving the native format also helps maintain data provenance and fidelity, so different analyses can be performed using different contexts. The data lake has made possible several data analysis projects, including the ability to predict the likelihood of readmissions and take preventive measures to reduce the number of readmissions.1

Like the hospital, enterprises across industries are starting to extract and place data for analytics into a single Hadoop-based repository without first transforming the data the way they would need to for a relational data warehouse.2 The basic concepts behind Hadoop3 were devised by Google to meet its need for a flexible, cost-effective data processing model that could scale as data volumes grew faster than ever. Yahoo, Facebook, Netflix, and others whose business models also are based on managing enormous data volumes quickly adopted similar methods.
Costs were certainly a factor, as Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing. Another driver of adoption has been the opportunity to defer labor-intensive schema development and data cleanup until an organization has identified a clear business need. And data lakes are more suitable for the less-structured data these companies needed to process.

Today, companies in all industries find themselves at a similar point of necessity. Enterprises that must use enormous volumes and myriad varieties of data to respond to regulatory and competitive pressures are adopting data lakes. Data lakes are an emerging and powerful approach to the challenges of data integration as enterprises increase their exposure to mobile and cloud-based applications, the sensor-driven Internet of Things, and other aspects of what PwC calls the New IT Platform.

1 "UC Irvine Health does Hadoop," Hortonworks, http://hortonworks.com/customer/uc-irvine-health/.
2 See Oliver Halter, "The end of data standardization," March 20, 2014, http://usblogs.pwc.com/emerging-technology/the-end-of-datastandardization/, accessed April 17, 2014.
3 Apache Hadoop is a collection of open standard technologies that enable users to store and process petabyte-sized data volumes via commodity computer clusters in the cloud. For more information on Hadoop and related NoSQL technologies, see "Making sense of Big Data," PwC Technology Forecast 2010, Issue 3 at http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml.

Figure: A basic Hadoop architecture for scalable data lake infrastructure. The Hadoop Distributed File System (HDFS) stores and preserves data in any format across a commodity server cluster. The system splits up jobs and distributes, processes, and recombines them via a cluster that can scale to thousands of server nodes: input splits feed map, partition, and combine tasks, and their sorted output feeds reduce tasks. With YARN, Hadoop now supports various programming models and near-real-time outputs in addition to batch. (Source: Electronic Design, 2012, and Hortonworks, 2014)
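The figure above outlines the split, map, sort, and reduce stages of a Hadoop job. The short Python sketch below simulates that flow locally for a simple word count; it is illustrative only. On a real cluster, Hadoop Streaming would run the mapper and reducer as separate processes over HDFS input splits, and the record contents shown here are hypothetical.

```python
# Minimal local simulation of the MapReduce flow shown in the figure.
# On a real cluster, Hadoop Streaming would run map() and reduce()
# as separate tasks over HDFS input splits; this sketch is illustrative only.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map task: emit (key, value) pairs for each input record."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce task: combine all values that share a key."""
    # The shuffle/sort step groups pairs by key before reduction.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

if __name__ == "__main__":
    # Stand-in for input splits read from HDFS (hypothetical content).
    splits = [
        "patient readmitted after surgery",
        "patient discharged no readmission",
        "surgery scheduled for patient",
    ]
    for word, count in reduce_phase(map_phase(splits)):
        print(word, count)
```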
Issue overview: Integration fabric

The data lake topic is the first of three topics covered as part of the integration fabric research in this issue of the PwC Technology Forecast. The integration fabric is a central component of PwC's New IT Platform.* Enterprises are starting to embrace more practical integration.** A range of these new approaches is now emerging, and during the next few months we'll ponder what the new cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore include these:

Integration fabric layer: Data
Integration challenges: Data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types
Emerging technology solutions: Hadoop data lakes, late binding, and metadata provenance tools. Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what's necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.

Integration fabric layer: Applications and services
Integration challenges: Rigid, monolithic systems that are difficult to update in response to business needs
Emerging technology solutions: Microservices. Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.

Integration fabric layer: Infrastructure
Integration challenges: Multiple clouds and operating systems that lack standardization
Emerging technology solutions: Software containers for resource isolation and abstraction. New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.

* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.
** Integration as PwC defines it means making diverse components work together so they work as a single entity. See "integrated system" at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.

What is a data lake?

A data lake is a repository for large quantities and varieties of data, both structured and unstructured. Data lakes take advantage of commodity cluster computing techniques for massively scalable, low-cost storage of data files in any format. The lake accepts input from various sources and can preserve both the original data fidelity and the lineage of data transformations. Data models emerge with usage over time rather than being imposed up front. The lake can serve as a staging area for the data warehouse, the location of more carefully "treated" data for reporting and analysis in batch mode. Data generalists and programmers can tap the stream data for real-time analytics, and data scientists use the lake for discovery and ideation.

Why a data lake?

Data lakes can help resolve the nagging problem of accessibility and data integration. Using big data infrastructures, enterprises are starting to pull together increasing data volumes for analytics or simply to store for undetermined future use. (See the sidebar "Data lakes defined.") Mike Lang, CEO of Revelytix, a provider of data management tools for Hadoop, notes that "Business owners at the C level are saying, 'Hey guys, look. It's no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.'"

Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.

Recent innovation is helping companies to collaboratively create models—or views—of the data and then manage incremental improvements to the metadata. Data scientists and business analysts using the newest lineage tracking tools such as Revelytix Loom or Apache Falcon can follow each other's purpose-built data schemas. The lineage tracking metadata also is placed in the Hadoop Distributed File System (HDFS)—which stores pieces of files across a distributed cluster of servers in the cloud—where the metadata is accessible and can be collaboratively refined.
Analytics drawn from the lake become increasingly valuable as the metadata describing different views of the data accumulates. Every industry has a potential data lake use case. A data lake can be a way to gain more visibility or put an end to data silos. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers or to analyze social media trends.

In the financial services industry, where Dodd-Frank regulation is one impetus, an institution has begun centralizing multiple data warehouses into a repository comparable to a data lake, but one that standardizes on XML. The institution is moving reconciliation, settlement, and Dodd-Frank reporting to the new platform. In this case, the approach reduces integration overhead because data is communicated and stored in a standard yet flexible format suitable for less-structured data. The system also provides a consistent view of a customer across operational functions, business functions, and products.

Some companies have built big data sandboxes for analysis by data scientists. Such sandboxes are somewhat similar to data lakes, albeit narrower in scope and purpose. PwC, for example, built a social media data sandbox to help clients monitor their brand health by using its SocialMind application.4

4 For more information on SocialMind and other analytics applications PwC offers, see http://www.pwc.com/us/en/analytics/analyticsapplications.jhtml.

Data lakes defined

Many people have heard of data lakes, but like the term big data, definitions vary. Four criteria are central to a good definition:

• Size and low cost: Data lakes are big. They can be an order of magnitude less expensive on a per-terabyte basis to set up and maintain than data warehouses. With Hadoop, petabyte-scale data volumes are neither expensive nor complicated to build and maintain. Some vendors that advocate the use of Hadoop claim that the cost per terabyte for data warehousing can be as much as $250,000, versus $2,500 per terabyte (or even less than $1,000 per terabyte) for a Hadoop cluster. Other vendors advocating traditional data warehousing and storage infrastructure dispute these claims and make a distinction between the cost of storing terabytes and the cost of writing or written terabytes.*

• Fidelity: Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal audit. If the data has undergone transformations, aggregations, and updates, most organizations typically struggle to piece data together when the need arises and have little hope of determining clear provenance.

• Ease of accessibility: Accessibility is easy in the data lake, which is one benefit of preserving the data in its original form. Whether structured, unstructured, or semi-structured, data is loaded and stored as is to be transformed later. Customer, supplier, and operations data are consolidated with little or no effort from data owners, which eliminates internal political or technical barriers to increased data sharing. Neither detailed business requirements nor painstaking data modeling are prerequisites.

• Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models.

* For more on data accessibility, data lake cost, and collective metadata refinement including lineage tracking technology, see the interview with Mike Lang, "Making Hadoop suitable for enterprise data science," at http://www.pwc.com/us/en/technology-forecast/2014/issue1/interviews/interview-revelytix.jhtml. For more on cost estimate considerations, see Loraine Lawson, "What's the Cost of a Terabyte?" ITBusinessEdge, May 17, 2013, at http://www.itbusinessedge.com/blogs/integration/whats-the-cost-of-a-terabyte.html.

Motivating factors behind the move to data lakes

Relational data warehouses and their big price tags have long dominated complex analytics, reporting, and operations. (The hospital described earlier, for example, first tried a relational data warehouse.) However, their slow-changing data models and rigid field-to-field integration mappings are too brittle to support big data volume and variety. The vast majority of these systems also leave business users dependent on IT for even the smallest enhancements, due mostly to inelastic design, unmanageable system complexity, and low system tolerance for human error. The data lake approach circumvents these problems.

Freedom from the shackles of one big data model

Job number one in a data lake project is to pull all data together into one repository while giving minimal attention to creating schemas that define integration points between disparate data sets.
This approach facilitates access, but the work required to turn that data into actionable insights is a substantial challenge. While integrating the data takes place at the Hadoop layer, contextualizing the metadata takes place at schema creation time.

Integrating data involves fewer steps because data lakes don't enforce a rigid metadata schema as relational data warehouses do. Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries. Data is bound to a dynamic schema created upon query execution. The late-binding principle shifts the data modeling from centralized data warehousing teams and database administrators, who are often remote from data sources, to localized teams of business analysts and data scientists, who can help create flexible, domain-specific context. For those accustomed to SQL, this shift opens a whole new world.

In this approach, the more that is known about the metadata, the easier it is to query. Pre-tagged data, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Resource Description Framework (RDF), offers a starting point and is highly useful in implementations with limited data variety. In most cases, however, pre-tagged data is a small portion of incoming data formats.
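To make the late-binding idea concrete, the sketch below shows schema on read in miniature: raw records stay in the lake exactly as they arrived, and each analyst supplies a schema only at query time. It illustrates the principle rather than any particular product's API, and the field names and records are hypothetical.

```python
# Schema on read: raw records are stored as-is; a schema is applied only
# when a query runs. Records and field names here are hypothetical.
import json

# Stand-in for files landed in the lake without any up-front modeling.
raw_lake = [
    '{"patient_id": "P001", "dept": "radiology", "charge": "1250.50"}',
    '{"patient_id": "P002", "dept": "cardiology", "charge": "310.00", "readmit": "Y"}',
    '{"patient_id": "P003", "charge": "87.25"}',  # missing fields are tolerated
]

def query(records, schema):
    """Bind each raw record to the caller's schema at read time."""
    for raw in records:
        doc = json.loads(raw)
        # Each schema field is (source_key, cast_function, default_value).
        yield {name: cast(doc.get(key, default))
               for name, (key, cast, default) in schema.items()}

# One analyst's view: charges as numbers, department defaulted to "unknown".
billing_view = {
    "patient": ("patient_id", str, ""),
    "department": ("dept", str, "unknown"),
    "charge_usd": ("charge", float, 0.0),
}

for row in query(raw_lake, billing_view):
    print(row)
```

A different team could define a different view over the same raw records without touching the stored data, which is the point of deferring the schema to query time.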
Few technologies in recent memory have as much change potential as Hadoop and the NoSQL (Not only SQL) category of databases, especially when they can enable a single enterprise-wide repository and provide access to data previously trapped in silos. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. A means of creating, enriching, and managing semantic metadata incrementally is essential.

Early lessons and pitfalls to avoid

Some data lake initiatives have not succeeded, producing instead more silos or empty sandboxes. Given the risk, everyone is proceeding cautiously. "We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what's there," says Sean Martin, CTO of Cambridge Semantics, a data management tools provider. Companies avoid creating big data graveyards by developing and executing a solid strategic plan that applies the right technology and methods to the problem.

Figure: Data flow in the data lake. The data lake loads upstream data extracts (XML, .xls, and other formats), irrespective of format, into a big data repository that stores data as is, loading existing data and accepting new feeds regularly. Metadata is decoupled from its underlying data and stored independently, enabling flexibility for multiple end-user perspectives and incrementally maturing semantics. Metadata grows and matures over time via user interaction: users collaborate to identify, organize, and make sense of the data through tagging, synonyms, and linking; business and data analysts select and report on domain-specific data; data scientists and app developers prepare and analyze attribute-level data; and machines help discover patterns and create data views. New data comes into the lake, cross-domain data analysis follows, and new actions (such as customer campaigns) are based on insights from the data. The data lake offers a unique opportunity for flexible, evolving, and maturing big data insights.

How a data lake matures

With the data lake, users can take what is relevant and leave the rest. Individual business domains can mature independently and gradually. Sourcing new data into the lake can occur gradually and will not impact existing models. The lake starts with raw data, and it matures as more data flows in, as users and machines build up metadata, and as user adoption broadens. Ambiguous and competing terms eventually converge into a shared understanding (that is, semantics) within and across business domains. Data maturity results as a natural outgrowth of the ongoing user interaction and feedback at the metadata management layer—interaction that continually refines the lake and enhances discovery. (See the sidebar "Maturity and governance.")

Figure: How a data lake matures. Perfect data classification is not required; users throughout the enterprise can see across all disciplines, not limited by organizational silos or rigid schema. The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end-user feedback. The increasing value of analytics is then directly correlated to increases in user adoption across the enterprise. As usage across the enterprise and data maturity increase, the lake progresses through five stages: 1. Consolidated and categorized raw data; 2. Attribute-level metadata tagging and linking (i.e., joins); 3. Data set extraction and analysis; 4. Business-specific tagging, synonym identification, and links; 5. Convergence of meaning within context.
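The maturity stages above hinge on metadata that users and machines build up incrementally. The sketch below shows one hypothetical shape such catalog metadata could take: tags, synonyms, and links attached to a data set, as in stages 2 and 4. It is illustrative only, not the format of any particular metadata management tool.

```python
# A hypothetical catalog entry for one data set in the lake, carrying the
# tags, synonyms, and links that users add as the lake matures.
# Field names are illustrative, not any vendor's metadata format.
catalog = {
    "claims_2014_q1": {
        "location": "hdfs://lake/raw/claims/2014/q1/",   # hypothetical path
        "tags": ["claims", "finance", "quarterly"],
        "synonyms": {"mbr_id": "patient_id", "svc_dt": "service_date"},
        "links": ["patients_master"],  # related data sets (joins)
    },
    "patients_master": {
        "location": "hdfs://lake/raw/emr/patients/",
        "tags": ["patients", "clinical"],
        "synonyms": {"mrn": "patient_id"},
        "links": [],
    },
}

def find_by_term(term):
    """Return data sets whose tags or synonyms mention a business term."""
    term = term.lower()
    return [name for name, entry in catalog.items()
            if term in entry["tags"] or term in entry["synonyms"].values()]

print(find_by_term("patient_id"))  # ['claims_2014_q1', 'patients_master']
```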
Maturity and governance

Many who hear the term data lake might associate the concept with a big data sandbox, but the range of potential use cases for data lakes is much broader. Enterprises envision lake-style repositories as staging areas, as alternatives to data warehouses, or even as operational data hubs, assuming the appropriate technologies and use cases. A key enabler is Hadoop and many of the big data analytics technologies associated with it. What began as a means of ad hoc batch analytics in Hadoop and MapReduce is evolving rapidly with the help of YARN and Storm to offer more general-purpose distributed analytics and real-time capabilities. At least one retailer has been running a Hadoop cluster of more than 2,000 nodes to support eight customer behavior analysis applications.*

Despite these advances, enterprises will remain concerned about the risks surrounding data lake deployments, especially at this still-early stage of development. How can enterprises effectively mitigate the risk and manage a Hadoop-based lake for broad-ranging exploration? Lakes can provide unique benefits over traditional data management methods at a substantially lower cost, but they require many practical considerations and a thoughtful approach to governance, particularly in more heavily regulated industries. Areas to consider include:

• Complexity of legacy data: Many legacy systems contain a hodgepodge of software patches, workarounds, and poor design. As a result, the raw data may provide limited value outside its legacy context. The data lake performs optimally when supplied with unadulterated data from source systems, and rich metadata built on top.

• Metadata management: Data lakes require advanced metadata management methods, including machine-assisted scans, characterizations of the data files, and lineage tracking for each transformation. Should schema on read be the rule and predefined schema the exception? It depends on the sources. The former is ideal for working with rapidly changing data structures, while the latter is best for sub-second query response on highly structured data.

• Lake maturity: Data scientists will take the lead in the use and maturation of the data lake. Organizations will need to place the needs of others who will benefit within the context of existing organizational processes, systems, and controls.

• Staging area or buffer zone: The lake can serve as a cost-effective place to land, stage, and conduct preliminary analysis of data that may have been prohibitively expensive to analyze in data warehouses or other systems.

To adopt a data lake approach, enterprises should take a full step toward multipurpose (rather than single-purpose) commodity cluster computing for enterprise-wide analysis of less-structured data. To take that full step, they first must acknowledge that a data lake is a separate discipline of endeavor that requires separate treatment. Enterprises that set up data lakes must simultaneously make a long-term commitment to hone the techniques that provide this new analytic potential. Half measures won't suffice.

* Timothy Prickett Morgan, "Cluster Sizes Reveal Hadoop Maturity Curve," Enterprise Tech: Systems Edition, November 8, 2013, http://www.enterprisetech.com/2013/11/08/cluster-sizesreveal-hadoop-maturity-curve/, accessed March 20, 2014.
Technology Forecast: Rethinking integration, Issue 1, 2014

Making Hadoop suitable for enterprise data science

Creating data lakes enables enterprises to expand discovery and predictive analytics.

Interview conducted by Alan Morrison, Bo Parker, and Brian Stein

Mike Lang is CEO of Revelytix.

PwC: You're in touch with a number of customers who are in the process of setting up Hadoop data lakes. Why are they doing this?

ML: There has been resistance on the part of business owners to share data, and a big part of the justification for not sharing data has been the cost of making that data available. The data owners complain they must write in some special way to get the data extracted, the system doesn't have time to process queries for building extracts, and so forth. But a lot of the resistance has been political. Owning data has power associated with it.

Hadoop is changing that, because C-level executives are saying, "It's no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop."

But they haven't integrated anything. They're just getting an extract. The benefit is that to add value to the integration process, business owners don't have nearly the same hill to climb that they had in the past. C-level executives are not asking the business owner to add value. They're just saying, "Dump it," and I think that's under way right now. With a Hadoop-based data lake, the enterprise has provided a capability to store vast amounts of data, and the user doesn't need to worry about restructuring the data to begin. The data owners just need to do the dump, and they can go on their merry way.

PwC: So one major obstacle was just the ability to share data cost-effectively?

ML: Yes, and that was a huge obstacle. Huge. It is difficult to overstate how big that obstacle has been to nimble analytics and data integration projects during my career. For the longest time, there was no such thing as nimble when talking about data integration projects. Once that data is in Hadoop, nimble is the order of the day. All of a sudden, the ETL [extract, transform, load] process is totally turned on its head—from contemplating the integration of eight data sets, for example, to figuring out which of a company's policyholders should receive which kinds of offers at what price in which geographic regions. Before Hadoop, that might have been a two-year project.

PwC: What are the main use cases for Hadoop data lakes?

ML: There are two main use cases for the data lake. One is as a staging area to support some specific application. A company might want to analyze three streams of data to reduce customer churn by 10 percent. They plan to build an app to do that using three known streams of data, and the data lake is just part of that workflow of receiving, processing, and then dumping data off to generate the churn analytics. The last time we talked [in 2013], that was the main use case of the data lake.

The second use case is supporting data science groups all around the enterprise. Now, that's probably 70 percent of the companies we've worked with.
PwC: Why use Hadoop?

ML: Data lakes are driven by three factors. The first one is cost. Everybody we talk to really believes data lakes will cost much less than current alternatives. The cost of data processing and data storage could be 90 percent lower. If I want to add a terabyte node to my current analytics infrastructure, the cost could be $250,000. But if I want to add a terabyte node to my Hadoop data lake, the cost is more like $25,000.

The second factor is flexibility. The flexibility comes from the late-binding principle. When I have all this data in the lake and want to analyze it, I'll basically build whatever schema I want on the fly and I'll conduct my analysis the way data scientists do. Hadoop lends itself to late binding.

The third factor relates to scale. Hadoop data lakes will have a lot more scale than the data warehouse, because they're designed to scale and process any type of data.

PwC: What's the first step in creating such a data lake?

ML: We're working with a number of big companies that are implementing some version of the data lake. The first step is to create a place that stores any data that the business units want to dump in it. Once that's done, the business units make that place available to their stakeholders.

The first step is not as easy as it sounds. The companies we've been in touch with spend an awful lot of time building security apparatuses. They also spend a fair amount of time performing quality checks on the data as it comes in, so at least they can say something about the quality of the data that's available in the cluster. But after they have that framework in place, they just make the data available for data science. They don't know what it's going to be used for, but they do know it's going to be used.

PwC: So then there's the data preparation process, which is where the metadata reuse potential comes in. How does the dynamic ELT [extract, load, transform] approach to preparing the data in the data science use case compare with the static ETL [extract, transform, load] approach traditionally used by business analysts?

ML: In the data lake, the files land in Hadoop in whatever form they're in. They're extracted from some system and literally dumped into Hadoop, and that is one of the great attractions of the data lake—data professionals don't need to do any expensive ETL work beforehand. They can just dump the data in there, and it's available to be processed in a relatively inexpensive storage and processing framework.

The challenge, then, is when data scientists need to use the data. How do they get it into the shape that's required for their R frame or their Python code for their advanced analytics? The answer is that the process is very iterative. This iterative process is the distinguishing difference between business analysts and data warehousing on the one hand and data scientists and Hadoop on the other. Traditional ETL is not iterative at all. It takes a long time to transform the different data into one schema, and then the business analysts perform their analysis using that schema. Data scientists don't like the ETL paradigm used by business analysts.
Data scientists have no idea at the beginning of their job what the schema should be, and so they go through this process of looking at the data that's available to them. Let's say a telecom company has set-top box data and finance systems that contain customer information. Let's say the data scientists for the company have four different types of data. They'll start looking into each file and determine whether the data is unstructured or structured this way or that way. They need to extract some pieces of it. They don't want the whole file. They want some pieces of each file, and they want to get those pieces into a shape so they can pull them into an R server.

So they look into Hadoop and find the file. Maybe they use Apache Hive to transform selected pieces of that file into some structured format. Then they pull that out into R and use some R code to start splitting columns and performing other kinds of operations. The process takes a long time, but that is the paradigm they use. These data scientists actually bind their schema at the very last step of running the analytics.

Let's say that in one of these Hadoop files from the set-top box, there are 30 tables. They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But they clean it as they're developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.

PwC: How can the schema become dynamic and enable greater reuse?

ML: That's why you need lineage. As data scientists assemble their intermediate data sets, if they look at a lineage graph in our Loom product, they might see 20 or 30 different sets of data that have been created. Of course some of those sets will be useful to other data scientists. Dozens of hours of work have been invested there. The problem is how to find those intermediate data sets. In Hadoop, they are actually realized, persisted data sets. So, how do you find them and know what their structure is so you can use them? You need to know that this data set originally contained data from this stream or that stream, this application and that application. If you don't know that, then the data set is useless.

At this point, we're able to preserve the input sets—the person who did it, when they did it, and the actual transformation code that produced this output set. It is pretty straightforward for users to go backward or forward to find the data set, and then find something downstream or upstream that they might be able to use by combining it, for example, with two other files. Right now, we provide the bare-bones capability for them to do that kind of navigation. From my point of view, that capability is still in its infancy.
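Lang's point about lineage is that every intermediate data set should carry its inputs, its author, the time it was created, and the transformation that produced it. The record structure below is hypothetical, not Loom's actual format; it only illustrates the kind of bookkeeping that makes derived data sets findable and reusable.

```python
# Hypothetical lineage records for derived data sets in the lake.
# Field names are illustrative only, not the format of Loom or any other tool.
lineage = {
    "settop_sessions_clean": {
        "inputs": ["settop_raw_2014"],
        "author": "data_scientist_a",
        "created": "2014-03-02T10:15:00",
        "transform": "hive: SELECT device_id, session_start FROM settop_raw_2014",
    },
    "churn_features_v1": {
        "inputs": ["settop_sessions_clean", "billing_extract"],
        "author": "data_scientist_b",
        "created": "2014-03-09T16:40:00",
        "transform": "r: join on customer_id, derive minutes_watched",
    },
}

def upstream(dataset, graph):
    """Walk backward through the lineage graph to find all original sources."""
    sources = []
    for parent in graph.get(dataset, {}).get("inputs", []):
        if parent in graph:                 # derived set: keep walking upstream
            sources.extend(upstream(parent, graph))
        else:                               # raw landed data: a true source
            sources.append(parent)
    return sources

print(upstream("churn_features_v1", lineage))
# ['settop_raw_2014', 'billing_extract']
```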
PwC: And there's also more freedom and flexibility on the querying side?

ML: Predictive analytics and statistical analysis are easier with a large-scale data lake. That's another sea change that's happening with the advent of big data. Everyone we talk to says SQL worked great. They look at the past through SQL. They know their current financial state, but they really need to know the characteristics of the customer in a particular zip code that they should target with a particular product.

When you can run statistical models on enormous data sets, you get better predictive capability. The bigger the set, the better your predictions. Predictive modeling and analytics are not being done timidly in Hadoop. That's one of the main uses of Hadoop. This sort of analysis wasn't performed 10 years ago, and it's only just become mainstream practice.

A colleague told me a story about a credit card company. He lives in Maryland, and he went to New York on a trip. He used his card one time in New York and then he went to buy gas, and the card was cut off. His card didn't work at the gas station. He called the credit card company and asked, "Why did you cut off my card?" And they said, "We thought it was a case of fraud. You never have made a charge in New York and all of a sudden you made two charges in New York." They asked, "Are you at the gas station right now?" He said yes.

It's remarkable what the credit card company did. It ticked him off that they could figure out that much about him, but the credit card company potentially saved itself tens of thousands of dollars in charges it would have had to eat. This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it's predictive in its nature.

The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can't-do-my-work-without-them capabilities. There's a sea of data.

Technology Forecast: Rethinking integration, Issue 1, 2014

A step toward the data lake in healthcare: Late-bound data warehouses

Dale Sanders of Health Catalyst describes how healthcare providers are addressing their need for better analytics.

Interview conducted by Alan Morrison, Bo Parker, and Brian Stein

Dale Sanders is senior vice president of Health Catalyst.

PwC: How are healthcare enterprises scaling and maturing their analytics efforts at this point?

DS: It's chaotic right now. High-tech funding facilitated the adoption of EMRs [electronic medical records] and billing systems as data collection systems. And HIEs [health information exchanges] encouraged more data sharing. Now there's a realization that analytics is critical. Other industries experienced the same pattern, but healthcare is going through it just now.

The bad news for healthcare is that the market is so overwhelmed from the adoption of EMRs and HIEs. And now the changes from ICD-9 [International Classification of Diseases, Ninth Revision] are coming, as well as the changes to the HIPAA [Health Insurance Portability and Accountability Act] regulation. Meaningful use is still a challenge. Accountable care is a challenge. There's so much turmoil in the market, and it's hard to admit that you need to buy yet another IT system. But it's hard to deny that, as well. Lots of vendors claim they can do analytics. Trying to find the way through that maze and that decision making is challenging.

PwC: How did you get started in this area to begin with, and what has your approach been?

DS: Well, to go way back in history, when I was in the Air Force, I conceived the idea for late binding in data warehouses after I'd seen some different failures of data warehouses using relational database systems.
If you look at the early history of data warehousing in the government and military—it was all on mainframes. And those mainframe data warehouses look a lot like Hadoop today. Hadoop is emerging with better tools, but conceptually the two types of systems are very similar. When relational databases became popular, we all rushed to those as a solution for data warehousing. We went from the flat files associated with mainframes to Unix-based data warehouses that used relational database systems. And we thought it was a good idea.

But one of the first big mistakes everyone made was to develop these enterprise data models using a relational form. I watched several failures happen as a consequence of that type of early binding to those enterprise models. I made some adjustments to my strategy in the Air Force, and I made some further adjustments when I worked for companies in the private sector and further refined it. I came into healthcare with that.

I started at Intermountain Healthcare, which was an early adopter of informatics. The organization had a struggling data warehouse project because it was built around this tightly coupled, early-binding relational model. We put a team together, scrubbed that model, and applied late binding. And, knock on wood, it's been doing very well. It's now 15 years in its evolution, and Intermountain still loves it. The origins of Health Catalyst come from that history.

PwC: How mature are the analytics systems at a typical customer of yours these days?

DS: We generally get two types of customers. One is the customer with a fairly advanced analytics vision and aspirations. They understand the whole notion of population health management and capitated reimbursement and things like that. So they're naturally attracted to us. The dialogue with those folks tends to move quickly. Then there are folks who don't have that depth of background, but they still understand that they need analytics.

We have an analytics adoption model that we use to frame the progression of analytics in an organization. We also use it to help drive a lot of our product development. It's an eight-level maturity model. Intermountain operates pretty consistently at levels six and seven. But most of the industry operates at level zero—trying to figure out how to get to levels one and two. When we polled participants in our webinars about where they think they reside in that model, about 70 percent of the respondents said level two and below.

So we've needed to adjust our message and not talk about levels five, six, and seven with some of these clients. Instead, we talk about how to get basic reporting, such as internal dashboards and KPIs [key performance indicators], or how to meet the external reporting requirements for joint commission and accountable care organizations [ACOs] and that kind of thing.

If they have a technical background, some organizations are attracted to this notion of late binding. And we can relate at that level.
If they’re familiar with Intermountain, they’re immediately attracted to that track record and that heritage. There are a lot of different reactions. PwC: With customers who are just getting started, you seem to focus on already well-structured data. You’re not opening up the repository to data that’s less structured as well. DS: The vast majority of data in healthcare is still bound in some form of a relational structure, or we pull it into a relational form. Late binding puts us between the worlds of traditional relational data warehouses and Hadoop—between a very structured representation of data and a very unstructured representation of data. But late binding lets us pull in unstructured content. We can pull in clinical notes and pretext and that sort of thing. Health Catalyst is developing some products to take advantage of that. But if you look at the analytic use cases and the analytic maturity of the industry right now, there’s not a lot of need to bother with unstructured data. That’s reserved for a few of the leading innovators. The vast majority of the market doesn’t need unstructured content at the moment. In fact, we really don’t even have that much unstructured content that’s very useful. PwC: What’s the pain point that the late-binding approach addresses? DS: This is where we borrow from Hadoop and also from the old mainframe days. 15 PwC Technology Forecast When we pull a data source into the late-binding data warehouse, we land that data in a form that looks and feels much like the original source system. Then we make a few minor modifications to the data. If you’re familiar with data modeling, we flatten it a little bit. We denormalize it a little bit. But for the most part, that data looks like the data that was contained in the source system, which is a characteristic of a Hadoop data lake—very little transformation to data. So we retain the binding and the fidelity of the data as it appeared in the source system. If you contrast that approach with the other vendors in healthcare, they remap that data from the source system into an enterprise data model first. But when you map that data from the source system into a new relational data model, you inherently make compromises about the way the data is modeled, represented, named, and related. You lose a lot of fidelity when you do that. You lose familiarity with the data. And it’s a time-consuming process. It’s not unusual for that early binding, monolithic data model approach to take 18 to 24 months to deploy a basic data warehouse. In contrast, we can deploy content and start exposing it to analytics within a matter of days and weeks. We can do it in days, depending on how aggressive we want to be. There’s no binding early on. There are six different places where you can bind data to vocabulary or relationships as it flows from the source system out to the analytic visualization layer. Before we bind data to new vocabulary, a new business rule, or any analytic logic, we ask ourselves what use case we’re A step toward the data lake in healthcare: Late-bound data warehouses “We are building an enterprise data model one object at a time.” trying to satisfy. We ask on a use case basis, rather than assuming a use case, because that assumption could lead to problems. We can build just about whatever we want to, whenever we want to. PwC: In essence, you’re moving toward an enterprise data model. But you’re doing it over time, a model that’s driven by use cases. 
DS: Are we actually building an enterprise data model one object at a time? That's the net effect. Let's say we land half a dozen different source systems in the enterprise data warehouse. One of the first things we do is provide a foreign key across those sources of data that allows you to query across those sources as if they were an enterprise data model. And typically the first foreign key that we add to those sources—using a common name and a common data type—is patient identifier. That's the most fundamental. Then you add vocabularies such as CPT [Current Procedural Terminology] and ICD-9 as that need arises. When you land the data, you have what amounts to a virtual enterprise model already. You haven't remodeled the data at all, but it looks and functions like an enterprise model.

Then we'll spin targeted analytics data marts off those source systems to support specific analytic use cases. For example, perhaps you want to drill down on the variability, quality, and cost of care in a clinical program for women and newborns. We'll spin off a registry of those patients and the physicians treating those patients into its own separate data mart. And then we will associate every little piece of data that we can find: costing data, materials management data, human resources data about the physicians and nurses, patient satisfaction data, outcomes data, and eventually social data. We'll pull that data into the data mart that's specific to that analytic use case to support women and newborns.

PwC: So you might need to perform some transform rationalization, because systems might not call the same thing by the same name. Is that part of the late-binding vocabulary rationalization?

DS: Yes, in each of those data marts.

PwC: Do you then use some sort of provenance record—a way of rationalizing the fact that we call these 14 things different things—that becomes reusable?

DS: Oh, yes, that's the heart of it. We reuse all of that from organization to organization. There's always some modification. And there's always some difference of opinion about how to define a patient cohort or a disease state. But first we offer something off the shelf, so you don't need to re-create them.
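Sanders' description of landing sources largely as-is and then adding a shared patient identifier can be pictured with a small sketch. The example below uses SQLite purely for illustration; the table and column names are hypothetical, and a real late-bound warehouse would land far richer source extracts, but the idea is the same: data stays close to its source shape, and a common key plus a view provide the "virtual enterprise model" for querying across sources.

```python
# Illustration of a late-bound warehouse: source tables are landed close to
# their original shape, and a shared patient_id key plus a view make them
# queryable together. Table and column names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")

# Land two source systems with minimal reshaping; each keeps its own fields.
db.executescript("""
CREATE TABLE emr_encounters (patient_id TEXT, encounter_date TEXT, dx_code TEXT);
CREATE TABLE billing_charges (patient_id TEXT, charge_date TEXT, amount REAL);

INSERT INTO emr_encounters VALUES ('P001', '2014-02-01', '428.0'),
                                  ('P002', '2014-02-03', '410.9');
INSERT INTO billing_charges VALUES ('P001', '2014-02-05', 1250.50),
                                   ('P001', '2014-02-20', 310.00);

-- The "virtual enterprise model": a late-bound view joined on patient_id.
CREATE VIEW patient_activity AS
SELECT e.patient_id, e.encounter_date, e.dx_code, b.amount
FROM emr_encounters e
LEFT JOIN billing_charges b ON b.patient_id = e.patient_id;
""")

for row in db.execute("SELECT * FROM patient_activity WHERE patient_id = 'P001'"):
    print(row)
```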
PwC: What if somebody wanted to perform analytics across the data marts or across different business domains? In this framework, would the best strategy be to somehow consolidate the data marts, or instead go straight to the underlying data warehouse?

DS: You can do either one. Let's take a comorbidity situation, for example, where a patient has three or four different disease states. Let's say you want to look at that patient's continuum of care across all of those. Over the top of those data marts is still this common late-binding vocabulary that allows you to query the patient as that patient appears in each of those different subject areas, whatever disease state it is. It ends up looking like a virtual enterprise model for that patient's record.

After we've formally defined a patient cohort and the key metrics that the organization wants to understand about that patient cohort, we want to lock that down and tightly bind it at that point. First you get people to agree. You get physicians and administrators to agree how they want to identify a patient cohort. You get agreement on the metrics they want to understand about clinical effectiveness. After you get comprehensive agreement, then you look for it to stick for a while. When it sticks for a period of time, then you can tightly bind that data together and feel comfortable about doing so—so you don't need to rip it apart and rebind it again.

PwC: When you speak about coming toward an agreement among the various constituencies, is it a process that takes place more informally outside the system, where everybody is just going to come up with the model? Or is there some way to investigate the data first? Or by using tagging or some collaborative online utility, is there an opportunity to arrive at consensus through an interface?

DS: We have ready-to-use definitions around all these metrics—patient registries and things like that. But we also recognize that the state of the industry being what it is, there's still a lot of fingerprinting and opinions about those definitions. So even though an enterprise might reference the National Quality Forum, the Agency for Healthcare Research and Quality, and the British Medical Journal as the sources for the definitions, local organizations always want to put their own fingerprint on these rules for data binding.

We have a suite of tools to facilitate that exploration process. You can look at your own definitions, and you can ask, "How do we really want to define a diabetic patient? How do we define congestive heart failure and myocardial infarction patients?" We'll let folks play around with the data, visualize it, and explore it in definitions. When we see them coming toward a comprehensive and persistent agreement, then we'll suggest, "If you agree to that definition, let's bind it together behind that visualization layer." That's exactly what happens. And you must allow that to happen. You must let that exploration and fingerprinting happen. A drawback of traditional ways of deploying data warehouses is that they presuppose all of those bindings and rules. They don't allow that exploration and local fingerprinting.

PwC: So how do companies get started with this approach? Assuming they have existing data warehouses, are you using those warehouses in a new way? Are you starting up from scratch? Do you leave those data warehouses in place when you're implementing the late-bound idea?

DS: Some organizations have an existing data warehouse. And a lot of organizations don't. The greenfield organizations are the easiest to deal with. The strategy is pretty complicated to decouple all of the analytic logic that's been built around those existing data warehouses and then import that to the future. Like most transitions of this kind, it often happens through attrition. First you build the new enterprise data warehouse around those late-binding concepts. And then you start populating it with data.

The one thing you don't want to do is build your new data warehouse under a dependency to those existing data warehouses. You want to go around those data warehouses and pull your data straight from source systems in the new architecture. It's a really bad strategy to build a data warehouse on top of data warehouses.
PwC: Some of the people we've interviewed about Hadoop assert that using Hadoop versus a data warehouse can result in a cost benefit that's at least an order of magnitude cheaper. They claim, for example, that storing data costs $250,000 per terabyte in a traditional warehouse versus $25,000 per terabyte for Hadoop. If you're talking with the C-suite about an exploratory analytics strategy, what's the advantage of staying with a warehousing approach?

DS: In healthcare, the compelling use case for Hadoop right now is the license fee. Contrast that case with what compels Silicon Valley web companies and everybody else to go to Hadoop. Their compelling reason wasn't so much about money. It was about scalability. If you consider the nature of the data that they're pulling into Hadoop, there's no such thing as a data model for the web. All the data that they're streaming into Hadoop comes tagged with its own data model. They don't need a relational database engine. There's no value to them in that setting at all.

For CIOs, the fact that Hadoop is inexpensive open source is very attractive. The downside, however, is the lack of skills. The skills and the tools and the ways to really take advantage of Hadoop are still a few years off in healthcare. Given the nature of the data that we're dealing with in healthcare right now, there's nothing particularly compelling about Hadoop in healthcare right now.

Probably in the next year, we will start using Hadoop as a preprocessor ETL [extract, transform, load] platform that we can stream data into. During the next three to four years, as the skills and the tools evolve to take advantage of Hadoop, I think you'll see companies like Health Catalyst being more aggressive about the adoption of Hadoop in a data lake scenario. But if you add just enough foreign keys and dimensions of analytics across that data lake, that approach greatly facilitates reliable landing and loading. It's really, really hard to pull meaningful data out of those lakes without something to get the relationship started.

Technology Forecast: Rethinking integration, Issue 1, 2014

Microservices: The resurgence of SOA principles and an alternative to the monolith
By Galen Gruman and Alan Morrison

Big SOA was overkill. In its place, a more agile form of services is taking hold.

Moving away from the monolith

Companies such as Netflix, Gilt, PayPal, and Condé Nast are known for their ability to scale high-volume websites. Yet even they have recently performed major surgery on their systems. Their older, more monolithic architectures would not allow them to add new or change old functionality rapidly enough. So they're now adopting a more modular and loosely coupled approach based on microservices architecture (MSA). Their goal is to eliminate dependencies and enable quick testing and deployment of code changes. Greater modularity, loose coupling, and reduced dependencies all hold promise in simplifying the integration task.

If MSA had a T-shirt, it would read: "Code small. Code local." Early signs indicate this approach to code management and deployment is helping companies become more responsive to shifting customer demands.
Yet adopters might encounter a challenge when adjusting the traditional software development mindset to the MSA way—a less elegant, less comprehensive but more nimble approach. PwC believes MSA is worth considering as a complement to traditional methods when speed and flexibility are paramount—typically in web-facing and mobile apps. Microservices also provide the services layer in what PwC views as an emerging cloudinspired enterprise integration fabric, which companies are starting to adopt for greater business model agility. Why microservices? In the software development community, it is an article of faith that apps should be written with standard application programming interfaces (APIs), using common services when possible, and managed through one or more orchestration technologies. Often, there’s a superstructure of middleware, integration methods, and management tools. That’s great for software designed to handle complex tasks for long-term, core enterprise functions—it’s how transaction systems and other systems of record need to be designed. But these methods hinder what Silicon Valley companies call web-scale development: software that must evolve quickly, whose functionality is subject to change or obsolescence in a couple of years—even months—and where the level of effort must fit a compressed and reactive schedule. It’s more like web page design than developing traditional enterprise software. Dependencies from a developer’s perspective 1990s and earlier 2000s 2010s Pre-SOA (monolithic) Traditional SOA Microservices Tight coupling Looser coupling Decoupled Team Team Team Team Team For a monolith to change, all must agree on each change. Each change has unanticipated effects requiring careful testing beforehand. Elements in SOA are developed more autonomously but must be coordinated with others to fit into the overall design. 20 PwC Technology Forecast Developers can create and activate new microservices without prior coordination with others. Their adherence to MSA principles makes continuous delivery of new or modified services possible. Microservices: An alternative to the monolith It is important to understand that MSA is still evolving and unproven over the long term. But like the now common agile methods, Node.js coding framework, and NoSQL data management approaches before it, MSA is an experiment many hope will prove to be a strong arrow in software development quivers. MSA: A think-small approach for rapid development orchestration brokers, but rather simpler messaging systems such as Apache Kafka. MSA proponents tend to code in web-oriented languages such as Node.js that favor small components with direct interfaces, and in functional languages like Scala or the Clojure Lisp library that favor “immutable” approaches to data and functions, says Richard Rodger, a Node.js expert and CEO of nearForm, a development consultancy. This fine-grained approach lets you update, add, replace, or remove services—in short, to integrate code changes—from your application easily, with minimal effect on anything else. For example, you could change the zip-code lookup to a UK postal-code lookup by changing or adding a microservice. Or you could change the communication protocol from HTTP to AMQP, the emerging standard associated with RabbitMQ. Or you could pull data from a NoSQL database like MongoDB at one stage of an application’s lifecycle and from a relational product like MySQL at another. In each case, you would change or add a service. 
MSA: A think-small approach for rapid development

Simply put, MSA breaks an application into very small components that perform discrete functions, and no more. The definition of "very small" is inexact, but think of functional calls or low-level library modules, not applets or complete services. For example, a microservice could be an address-based or geolocation-based zip-code lookup, not a full mapping module.

MSA lets you move from quick-and-dirty to quick-and-clean changes to applications or their components that are able to function by themselves. You would use other techniques—conventional service-oriented architecture (SOA), service brokers, and platform as a service (PaaS)—to handle federated application requirements. In other words, MSA is one technique among many that you might use in any application.

In MSA, you want simple parts with clean, messaging-style interfaces; the less elaborate the better. And you don't want elaborate middleware, service buses, or other orchestration brokers, but rather simpler messaging systems such as Apache Kafka.

Figure: Evolution of services orientation, from pre-SOA monoliths (1990s and earlier, tight coupling) through traditional SOA (2000s, looser coupling) to microservices (2010s), which are decoupled and exist in a "dumb" messaging environment.

Traditional SOA versus microservices:
• Messaging type: traditional SOA uses a smart but dependency-laden ESB; microservices use dumb, fast messaging (as with Apache Kafka).
• Programming style: traditional SOA follows an imperative model; microservices follow a reactive actor programming model that echoes agent-based systems.
• Lines of code per service: hundreds or thousands of lines of code in traditional SOA; 100 or fewer lines of code in microservices.
• State: traditional SOA services are stateful; microservices are stateless.
• Messaging mode: synchronous (wait to connect) in traditional SOA; asynchronous (publish and subscribe) in microservices.
• Databases: large relational databases in traditional SOA; NoSQL or micro-SQL databases blended with conventional databases in microservices.
• Code type: procedural in traditional SOA; functional in microservices.
• Means of evolution: in traditional SOA, each big service evolves; in microservices, each small service is immutable and can be abandoned or ignored.
• Means of systemic change: modify the monolith in traditional SOA; create a new service in microservices.
• Means of scaling: optimize the monolith in traditional SOA; add more powerful services and cluster by activity in microservices.
• System-level awareness: less aware and event driven in traditional SOA; more aware and event driven in microservices.

The fine-grained, stateless, self-contained nature of microservices creates decoupling between different parts of a code base and is what makes them easy to update, replace, remove, or augment. Rather than rewrite a module for a new capability or version and then coordinate the propagation of changes the rewrite causes across a monolithic code base, you add a microservice. Other services that want this new functionality can choose to direct their messages to this new service, but the old service remains for parts of the code you want to leave alone. That's a significant difference from the way traditional enterprise software development works.

Some of the leading web properties use MSA because it comes from a mindset similar to other technologies and development approaches popular in web-scale companies: agile software development, DevOps, and the use of Node.js and Not only SQL (NoSQL). These approaches all strive for simplicity, tight scope, and the ability to take action without calling an all-hands meeting or working through a tedious change management process. Managing code in the MSA context is often ad hoc and something one developer or a small team can handle without complex superstructure and management. In practice, the actual code in any specific module is quite small—a few dozen lines, typically. It is designed to address a narrow function and can be conceived and managed by one person or a small group.

Issue overview: Integration fabric

The microservices topic is the second of three topics as part of the integration fabric research covered in this issue of the PwC Technology Forecast. The integration fabric is a central component for PwC's New IT Platform.* Enterprises are starting to embrace more practical integration.** A range of these new approaches is now emerging, and during the next few months we'll ponder what the new cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore include these integration fabric layers:

• Data. Integration challenges: data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types. Emerging technology solutions: Hadoop data lakes, late binding, and metadata provenance tools. Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what's necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.
• Applications and services. Integration challenges: rigid, monolithic systems that are difficult to update in response to business needs. Emerging technology solutions: microservices. Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.
• Infrastructure. Integration challenges: multiple clouds and operating systems that lack standardization. Emerging technology solutions: software containers for resource isolation and abstraction. New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.

* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.
** Integration as PwC defines it means making diverse components work together so they work as a single entity. See "integrated system" at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.

Thinking the MSA way: Minimalism is a must

The MSA approach is the opposite of the traditional "let's scope out all the possibilities and design in the framework, APIs, and data structures to handle them all so the application is complete." Think of MSA as almost-plug-and-play in-app integration of discrete services both local and external. These services are expected to change, and some eventually will become disposable. When services have a small focus, they become simple to develop, understand, manage, and integrate. They do only what's necessary, and they can be removed or ignored when no longer needed.
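The comparison above between traditional SOA and microservices largely comes down to how messages move: synchronous, stateful calls on one side, stateless services exchanging asynchronous messages on the other. A rough sketch of the publish-and-subscribe style in Node.js follows; the event name and handlers are illustrative, and the built-in EventEmitter stands in for a dumb, fast broker such as Kafka. In a real system each subscriber would be a separately deployed process.

var EventEmitter = require('events').EventEmitter;
var bus = new EventEmitter();  // stand-in for an external message pipe

// Stateless subscribers: each does one thing, and new ones can be added
// without touching the publisher or the existing services.
bus.on('order.created', function (order) {
  console.log('inventory service reserves stock for order', order.id);
});
bus.on('order.created', function (order) {
  console.log('email service sends a confirmation for order', order.id);
});

// The publisher does not wait for, or even know about, its consumers.
bus.emit('order.created', { id: 42, total: 99.5 });

Retiring a capability means unsubscribing a service and ignoring it; the publisher never changes. That is minimalism that also pays off in how easily such services are understood.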
There’s an important benefit to this minimalist approach, says Gregg Caines, a freelance web developer and co-author of programming books: “When a package doesn’t do more than is absolutely necessary, it’s easy to understand and to integrate into other applications.” In many ways, MSA is a return to some of the original SOA principles of independence and composition—without the complexity and superstructure that become common when SOA is used to implement enterprise software. The use of multiple, specific services with short lifetimes might sound sloppy, but remember that MSA is for applications, or their components, that are likely to change frequently. It makes no sense to design and develop software over an 18-month process 23 PwC Technology Forecast to accommodate all possible use cases when those use cases can change unexpectedly and the life span of code modules might be less than 18 months. The pace at which new code creation and changes happen in mobile applications and websites simply doesn’t support the traditional application development model. In such cases, the code is likely to change due to rapidly evolving social media services, or because it runs in iOS, Android, or some other environment where new capabilities are available annually, or because it needs to search a frequently updated product inventory. For such mutable activities, you want to avoid—not build in—legacy management requirements. You live with what nearForm’s Rodger considers a form of technical debt, because it is an easier price to pay for functional flexibility than a full-blown architecture that tries to anticipate all needs. It’s the difference between a two-week update and a two-year project. Microservices: An alternative to the monolith This mentality is different from that required in traditional enterprise software, which assumes complex, multivariate systems are being integrated, requiring many-tomany interactions that demand some sort of intelligent interpretation and complex framework. You invest a lot up front to create a platform, framework, and architecture that can handle a wide range of needs that might be extensive but change only at the edges. MSA assumes you’re building for the short term; that the needs, opportunities, and context will change; and that you will handle them as they occur. That’s why a small team of developers familiar with their own microservices are the services’ primary users. And the clean, easily understood nature lets developers even more quickly add, remove, update, and replace their services and better ensure interoperation with other services. In MSA, governance, data architecture, and the microservices are decentralized, which minimizes the dependencies. As a result of this independence, you can use the right language for the microservice in question, as well as the right database or other related service, rather than use a single language or back-end service to accomplish all your application’s needs, says David Morgantini, a developer at ThoughtWorks. Where MSA makes sense MSA is most appropriate for applications whose functions may need to change frequently; that may need to run on multiple, changing platforms whose local services and capabilities differ; or whose life spans are not long enough to warrant a heavily architected framework. MSA is great for disposable services. Mobile apps and web apps are natural venues for MSA. But whatever platform the application runs on, some key attributes favor MSA: • Fast is more important than elegant. 
• Change occurs at different rates within the application, so functional isolation and simple integration are more important than module cohesiveness. • Functionality is easily separated into simple, isolatable components. For example, an app that draws data from social networks might use separate microservices for each network’s data extraction and data normalization. As social networks wax and wane in popularity, they can be added to the app without changing anything else. And as APIs evolve, the app can support several versions concurrently but independently. Microservices can make media distribution platforms, for example, easier to update and faster than before, says Adrian Cockcroft, a technology fellow at Battery Ventures, a venture capital firm.The key is to separate concerns along these dimensions: • Each single-function microservice has one action. • A small set of data and UI elements is involved. • One developer, or a small team, independently produces a microservice. • Each microservice is its own build, to avoid trunk conflict. • The business logic is stateless. • The data access layer is statefully cached. • New functions are added swiftly, but old ones are retired slowly.1 These dimensions create the independence needed for the microservices to achieve the goals of fast development and easy integration of discrete services limited in scope. • Change in the application’s functionality and usage is frequent. 1 Adrian Cockcroft, “Migrating to Microservices,” (presentation, QCon London, March 6, 2014), http://qconlondon.com/london-2014/qconlondon.com/london-2014/presentation/Migrating%20to%20Microservices.html. 24 PwC Technology Forecast Microservices: An alternative to the monolith MSA is not entirely without structure. There is a discipline and framework for developing and managing code the MSA way, says nearForm’s Rodger. The more experienced a team is with other methods—such as agile development and DevOps—that rely on small, focused, individually responsible approaches, the easier it is to learn to use MSA. It does require a certain groupthink. The danger of approaching MSA without such a culture or operational framework is the chaos of individual developers acting without regard to each other. In MSA, integration is the problem, not the solution of microservices. You might have “planned community” neighborhoods made from coarser-grained services or even monolithic modules that interact with more organic MSAstyle neighborhoods.2 It’s important to remember that by keeping services specific, there’s little to integrate. You typically deal with a handful of data, so rather than work through a complex API, you directly pull the specific data you want in a RESTful way. You keep your own state, again to reduce dependencies. You bind data and functions late for the same reasons. Many enterprise developers shake their heads and ask how microservices can possibly integrate with other microservices and with other applications, data sets, and services. MSA sounds like an integration nightmare, a morass of individual connections causing a rat’s nest that looks like spaghetti code. Integration is a problem MSA tries to avoid by reducing dependencies and keeping them local. If you need complex integration, you shouldn’t use MSA for that part of your software development. Instead, use MSA where broad integration is not a key need. Ironically, integration is almost a byproduct of MSA, because the functionality, data, and interface aspects are so constrained in number and role. 
(Rodger says Node.js developers will understand this implicit integration, which is a principle of the language.) In other words, your integration connections are local, so you're building more of a chain than a web of connections.

MSA is not a cure-all, nor is it meant to be the only or even dominant approach for developing applications. But it's an emerging approach that bucks the trend of elaborate, elegant, complete frameworks where that doesn't work well. Sometimes, doing just what you need to do is a better answer than figuring out all the things you might need and constructing an environment to handle it all. MSA serves the "do just what you need to do" scenario.

When you have fine-grained components, you do have more integration points. Wouldn't that make the development more difficult and changes within the application more likely to cause breakage? Not necessarily, but it is a risk, says Morgantini. The key is to create small teams focused on business-relevant tasks and to conceive of the microservices they create as neighbors living together in a small neighborhood, so the relationships are easily apparent and proximate. In this model, an application can be viewed as a city of neighborhoods assigned to specific business functions, with each neighborhood composed of microservices.

Conclusion

This approach has proven effective in contexts already familiar with agile development, DevOps, and loosely coupled, event-driven technologies such as Node.js. MSA applies the same mentality to the code itself, which may be why early adopters are those who are using the other techniques and technologies. They already have an innate culture that makes it easier to think and act in the MSA way. Any enterprise looking to serve users and partners via the web, mobile, and other fast-evolving venues should explore MSA.

2 David Morgantini, "Micro-services—Why shouldn't you use micro-services?" Dare to dream (blog), August 27, 2013, http://davidmorgantini.blogspot.com/2013/08/micro-services-why-shouldnt-you-use.html, accessed May 12, 2014.

Technology Forecast: Rethinking integration Issue 1, 2014

Microservices in a software industry context

John Pritchard offers some thoughts on the rebirth of SOA and an API-first strategy from the vantage point of a software provider. Interview conducted by Alan Morrison, Wunan Li, and Akshay Rao. John Pritchard is director of platform services at Adobe.

PwC: What are some of the challenges when moving to an API-first business model?

JP: I see APIs as a large oncoming wave that will create a lot of benefit for a lot of companies, especially companies in our space that are trying to migrate to SaaS.1 With the API model, there's a new economy of sorts and lots of talk about how to monetize the services. People discuss the models by which those services could be made available and how they could be sold. At Adobe, we have moved from being a licensed desktop product company to a subscription-based SaaS company. We're in the process of disintegrating our desktop products to services that can be reassembled and packaged in interesting ways by our own product teams or third-party developers.

There's still immaturity in the very coarse way that APIs tend to be exposed now.
I might want to lease the use of APIs to a third-party developer, for instance, with a usage-based pricing model. This model allows the developer to white label the experience to its customers without requiring a license. Usage-based pricing triggers some thought around how to instrument APIs and the connection between API usage and commerce. It leads to some interesting conversations about identity and authentication, especially when third-party developers might be integrating multiple API sets from different companies into a customer-exposed application.

PwC: Isn't there substantial complexity associated with the API model once you get down to the very granular services suggested by a microservices architecture?

JP: At one level, the lack of standards and tooling for APIs has resulted in quite a bit of simplification. Absent standards, we are required to use what I'll call the language of the Internet: HTTP, JSON, and OAuth. That's it. This approach has led to beautiful, simple designs because you can only do things a few ways. But at another level, techniques to wire together capabilities with some type of orchestration have been missing. This absence creates a big risk in my mind of trying to do things in the API space like the industry did with SOA and WS*.2

PwC: How are microservices related to what you're doing on the API front?

JP: We don't use the term microservices; I wouldn't say you'd hear that term in conversations with our design teams. But I'm familiar with some of Martin Fowler's writing on the topic.3 If you think about how the term is defined in industry and this idea of smaller statements that are transactions, that concept is very consistent with design principles and the API-first strategy we adhere to. What I've observed on my own team and some of the other product teams we work with is that the design philosophy we use is less architecturally driven than it is team dynamic driven. When you move to an end-to-end team or a DevOps4 type of construct, you tend to want to define things that you can own completely and that you can release so you have some autonomy to serve a particular need. We use APIs to integrate internally as well. We want these available to our product engineering community in the most consumable way. How do we describe these APIs so we clear the path for self-service as quickly as possible? Those sorts of questions and answers have led us to the design model we use.

1 Abbreviations are as follows: API: application programming interface; SaaS: software as a service. For more information on APIs, see "The business value of APIs," PwC Technology Forecast 2012, Issue 2, http://www.pwc.com/us/en/technology-forecast/2012/issue2/index.jhtml.
2 Abbreviations are as follows: HTTP: hypertext transfer protocol; JSON: JavaScript Object Notation; SOA: service-oriented architecture; WS*: web services.
3 For example, see James Lewis and Martin Fowler, "Microservices," March 25, 2014, http://martinfowler.com/articles/microservices.html, accessed June 18, 2014.
4 DevOps is a working style designed to encourage closer collaboration between developers and operations people: DevOps=Dev+Ops. For more information on DevOps, continuous delivery, and antifragile system development, see "DevOps: Solving the engineering productivity challenge," PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.
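Pritchard's "language of the Internet" (HTTP, JSON, and OAuth), together with the usage instrumentation that usage-based pricing depends on, can be illustrated with a small Node.js sketch. The bearer tokens, counters, and endpoint here are hypothetical, not Adobe's implementation; a production service would verify tokens against an OAuth provider and persist the metering data.

var http = require('http');

var apiKeys = { 'token-abc123': 'partner-1' };  // hypothetical issued tokens
var usage = {};                                 // per-partner call counts for billing

http.createServer(function (req, res) {
  var token = (req.headers.authorization || '').replace('Bearer ', '');
  var partner = apiKeys[token];
  if (!partner) {
    res.writeHead(401, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify({ error: 'invalid or missing token' }));
  }
  usage[partner] = (usage[partner] || 0) + 1;   // instrument the API for usage-based pricing
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ partner: partner, callsThisPeriod: usage[partner] }));
}).listen(3000);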
PwC: When you think about the problems that a microservices approach might help with, what is top of mind for you?

JP: I've definitely experienced the rebirth of SOA. In my mind, APIs are SOA realized. We remember the ESB and WS* days and the attempt to do real top-down governance. We remember how difficult that was not only in the enterprise, but also in the commercial market, where it didn't really happen at all.5 Developer-friendly consumability has helped us bring APIs to market. Internally, that has led to greater efficiencies. And it encourages some healthy design practices by making things small. Some of the connectivity becomes less important than the consumability.

PwC: What's the approach you're taking to a more continuous form of delivery in general?

JP: For us, continuous delivery brings to mind end-to-end teams or the DevOps model. Culturally, we're trying to treat everything like code. I treat infrastructure like code. I treat security like code. Everything is assigned to sprints. APIs must be instrumented for deployment, and then we test around the APIs being deployed. We've borrowed many of the Netflix constructs around monkeys.6 We use monkeys not only for infrastructure components but also for scripted security attacks to validate our operational run times. We've seen an increased need for automation. With every deployment we look for opportunities for automation. But what's been key for the success in my team is this idea of treating all these different aspects just like we treat code.

PwC: Would that include infrastructure as well?

JP: Yes. My experience is that the line is almost completely blurred about what's software and what's infrastructure now. It's all software defined.

PwC: As systems become less monolithic, how will that change the marketplace for software?

JP: At the systems level, we're definitely seeing a trend away from centralized core systems—like core ERP or core large platforms that provide lots of capabilities—to a model where a broad selection of SaaS vendors provide very niche capabilities. Those SaaS operators may change over time as new ones come into the market. The service provider model, abstracting SaaS provider capabilities with APIs, gives us the flexibility to evaluate newcomers that might be better providers for each API we've defined.

5 Abbreviations are as follows: ESB: enterprise service bus; ERP: enterprise resource planning.
6 Chaos Monkey is an example. See "The evolution from lean and agile to antifragile," PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/features/new-cloud-development-styles.jhtml for more on Chaos Monkey.

Technology Forecast: Rethinking integration Issue 1, 2014

The critical elements of microservices

Richard Rodger describes his view of the emerging microservices landscape and its impact on enterprise development. Interview conducted by Alan Morrison and Bo Parker. Richard Rodger is the CTO of nearForm, a software development and training consultancy specializing in Node.js.

PwC: What's the main advantage of a microservices approach versus object-oriented programming?

RR: Object-oriented programming failed miserably. With microservices, it's much harder to shoot yourself in the foot. The traditional anti-patterns and problems that happen in object-oriented code—such as the big bowl of mud where a single task has a huge amount of responsibilities or goes all over the place—are less likely in the microservices world.
Consider the proliferation of patterns in the object-oriented world. Any programming paradigm that requires you to learn 50 different design patterns to get things right and makes it so easy to get things wrong is probably not the right way to be doing things. That's not to say that patterns aren't good. Pattern designs are good and they are necessary. It's just that in the microservices world, there are far fewer patterns.

PwC: What is happening in companies that are eager to try the microservices approach?

RR: It's interesting to think about why change happens in the software industry. Sometimes the organizational politics is a much more important factor than the technology itself. Our experience is that politics often drives the adoption of microservices. We're observing aggressive, ambitious vice presidents who have the authority to fund large software projects. In light of how long most of these projects usually take, the vice presidents see an opportunity for career advancement by executing much more rapidly. A lot of our engagements are with forward-looking managers who essentially are sponsoring the adoption of a microservices approach. Once those initial projects have been deemed successful because they were delivered faster and more effectively, that proves the point and creates its own force for the broader adoption of microservices.

PwC: How does a typical microservices project begin?

RR: In large projects that can take six months or more, we develop the user story and then define and map capabilities to microservices. And then we map microservices onto messages. We do that very, very quickly. Part of what we do, and part of what microservices enable us to do, is show a working live demo of the system after week one. If we kick off on a Monday, the following Monday we show a live version of the system. You might only be able to log in and perhaps get to the main screen. But there's a running system that may be deployed on whatever infrastructure is chosen. Every Monday there's a new live demo. And that system stays running during the lifetime of the project. Anybody can look at the system, play with it, break it, or whatever at any point in time. Those capabilities are possible because we started to build services very quickly within the first week.

With a traditional approach, even approaches that are agile, you must make an awful lot of decisions up front. And if you make the wrong decisions, you back yourself into a corner. For example, if you decide to use a particular database technology or commit to a certain structure of object hierarchies, you must be very careful and spend a lot of time analyzing. The use of microservices reduces that cost significantly.

An analogy might help to explain how this type of decision making happens. When UC Irvine laid out its campus, the landscapers initially put in grass and watched where people walked. They later built paths where the grass was worn down. Microservices are like that. If you have a particular data record and you build a microservice to look back at that data record, you don't need to define all of the fields up front.
A practical example might be if a system will capture transactions and ultimately use a relational database. We might use MongoDB for the first four weeks of development because it's schema free. After four weeks of development, the schema will be stabilized to a considerable extent. On week five, we throw away MongoDB and start using a relational product. We saved ourselves from huge hassles in database migrations by developing this way. The key is using a microservice as the interface to the database. That lets us throw away the initial database and use a new one—a big win.

PwC: Do microservices have skeletal frameworks of code that you can just grab, plug in, and compose the first week's working prototype?

RR: We open source a lot, and we have developed a whole bunch of precut microservices. That's a benefit of being part of the Node [server-side JavaScript] community. There's this ethic in the Node community about sharing your Node services. It's an emergent property of the ecosystem. You can't really compile JavaScript, so a lot of it's going to be open source anyway. You publish a module onto the npm public repository, which is open source by definition.

PwC: There are very subtle and nuanced aspects of the whole microservices scene, and if you look at it from just a traditional development perspective, you'd miss these critical elements. What's the integration pattern most closely associated with microservices?

RR: It all comes back to thinking about your system in terms of messages. If you need a search engine for your system, for example, there are various options and cloud-based search services you can use now. Normally this is a big integration task with heavy semantics and coordination required to make it work. If you define your search capability in terms of messages, the integration is to write a microservice that talks to whatever back end you are using. In a sense, the work is to define how to interact with the search service.

Let's say the vendor is rolling out a new version. It's your choice when you go with the upgrade. If you decide you want to move ahead with the upgrade, you write your microservices so both version 1 and version 2 can subscribe to the same messages. You can route a certain part of your message to version 1 and a certain part to version 2. To gracefully phase in version 2 before fully committing, you might start by directing 5 percent of traffic to the new version, monitor it for issues, and gradually increase the traffic to version 2. Because it doesn't require a full redeployment of your entire system, it's easy to do. You don't need to wait three months for a lockdown. Monolithic systems often have these scenarios where the system is locked down on November 30 because there's a Christmas sales period or something like that. With microservices, you don't have such issues anymore.

PwC: So using this message pattern, you could easily fall into the trap of having a fat message bus, which seems to be the anti-pattern here for microservices. You're forced to maintain this additional code that is filtering the messages, interpreting the messages, and transforming data. You're back in the ESB world.

RR: Exactly. An enterprise spaghetti bowl, I think it's called.
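Rodger's "microservice as the interface to the database" can be outlined as a single small service that owns every read and write, so the store behind it can move from MongoDB to a relational product in week five without touching callers. The message shapes and the in-memory stand-in below are an illustrative sketch, not nearForm's code; the comments mark where real database drivers would plug in.

// record-store.js: the only code that knows which database is in use
var records = {};  // in-memory stand-in; in weeks 1 to 4 this might wrap the MongoDB
                   // driver, in week 5 a relational driver, with no change for callers

function handle(msg, respond) {
  if (msg.cmd === 'save') {
    records[msg.record.id] = msg.record;           // e.g., an insert via the current driver
    return respond(null, { ok: true });
  }
  if (msg.cmd === 'load') {
    return respond(null, records[msg.id] || null); // e.g., a lookup by primary key
  }
  respond(new Error('unknown command: ' + msg.cmd));
}

// Callers only ever send messages like these:
handle({ cmd: 'save', record: { id: 'tx-1', amount: 250 } }, console.log);
handle({ cmd: 'load', id: 'tx-1' }, console.log);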
PwC: How do you get your message to the right places efficiently while still having what some are calling a dumb pipe to the message management? RR: This principle of the dumb pipe is really, really important. You must push the intelligence of what to do with messages out to the edges. And that means some types of message brokers are better suited to this architecture than others. For example, traditional message brokers like RabbitMQ—ones that maintain internal knowledge of where individual consumers are, message queues, and that sort of thing— are much less suited to what we want to do here. Something like Apache Kafka is much better because it’s purposely dumb. It forces the message-queue consumers to remember their own place in the queue. The critical elements of microservices “If we have less intellectual work to do, that actually lets us do more.” As a result, you don’t end up with scaling issues if the queue gets overloaded. You can deal with the scaling issue at the point of actually intercepting the message, so you’re getting the messages passed through as quickly as possible. You don’t need to use a message queue for everything, either. If you end up with a very, very high throughput system, you move the intelligence into the producer so it knows you have 10 consumers. If one dies, it knows to trigger the surrounding system to create a new consumer, for example. It’s the same idea as when we were using MongoDB to determine the schema ahead of time. After a while, you’ll notice that the bus is less suitable for certain types of messages because of the volumes or the latency or whatever. 32 PwC Technology Forecast PwC: Would Docker provide a parallel example for infrastructure? RR: Yes. Let’s say you’re deploying 50 servers, 50 Amazon instances, and you set them up with a Docker recipe. And you deploy that. If something goes wrong, you could kill it. There’s no way for a sys admin to SSH [Secure Shell] into that machine and start tinkering with the configurations to fix it. When you deploy, the services either work or they don’t. PwC: The cognitive load facing programmers of monoliths and the coordination load facing programmer teams seem to represent the new big mountain to climb. RR: Yes. And that’s where the productivity comes from, really. It actually isn’t about best practices or a particular architecture or a particular version of Node.js. It’s just that if we have less intellectual work to do, that actually lets us do more. The critical elements of microservices Technology Forecast: Rethinking integration Issue 1, 2014 Containers are redefining application-infrastructure integration By Alan Morrison and Pini Reznik With containers like Docker, developers can deploy the same app on different infrastructure without rework. 33 Issue overview: Rethinking integration This article focuses on one of three topics covered in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/ us/en/technologyforecast/2014/issue1/ index.jhtml). The integration fabric is a central component for PwC’s New IT Platform. (See http://www.pwc. com/us/en/increasingit-effectiveness/new-itplatform.jhtml for more information. Early evaluations of Docker suggest it is a flexible, cost-effective, and more nimble way to deploy rapidly changing applications on infrastructure that also must evolve quickly. Spotify, the Swedish streaming music service, grew by leaps and bounds after its launch in 2006. 
As its popularity soared, the company managed its scaling challenge simply by adding physical servers to its infrastructure. Spotify tolerated low utilization in exchange for speed and convenience. In November 2013, Spotify was offering 20 million songs to 24 million users in 28 countries. By that point, with a computing infrastructure of 5,000 servers in 33 Cassandra clusters at four locations processing more than 50 terabytes, the scaling challenge demanded a new solution. Spotify chose Docker, an open source application deployment container that evolved from the LinuX Containers (LXCs) used for the past decade. LXCs allow different applications to share operating system (OS) kernel, CPU, and RAM. Docker containers go further, adding layers of abstraction and deployment management features. Among the benefits of this new infrastructure technology, containers that have these capabilities reduce coding, deployment time, and OS licensing costs. Not every company is a web-scale enterprise like Spotify, but increasingly many companies need scalable infrastructure with maximum flexibility to support the rapid changes in services and applications that today’s business environment demands. Early evaluations of Docker suggest it is a flexible, cost-effective, and more nimble way to deploy rapidly changing applications on infrastructure that also must evolve quickly. PwC expects containers will become a standard fixture of the infrastructure layer in the evolving cloud-inspired integration fabric. This integration fabric includes microservices at the services layer and data lakes at the data layer, which other articles explore in this “Rethinking integration” issue of the PwC Technology Forecast.1 This article examines Docker containers and their implications for infrastructure integration. A stretch goal solves a problem Spotify’s infrastructure scale dwarfs those of many enterprises. But its size and complexity make Spotify an early proof case for the value and viability of Docker containers in the agile business environment that companies require. By late 2013, Spotify could no longer continue to scale or manage its infrastructure one server at a time. The company used state-ofthe-art configuration management tools such as Puppet, but keeping those 5,000 servers consistently configured was still difficult and time-consuming. Spotify had avoided conventional virtualization technologies. “We didn’t want to deal with the overhead of virtual machines (VMs),” says Rohan Singh, a Spotify infrastructure engineer. The company required some kind of lightweight alternative to VMs, because it needed to deploy changes to 60 services and add new services across the infrastructure in a more manageable way. “We wanted to make our service deployments more repeatable and less painful for developers,” Singh says. Singh was a member of a team that first looked at LXCs, which—unlike VMs—allow applications to share an OS kernel, CPU, and RAM. With containers, developers can isolate applications and their dependencies. Advocates of containers tout the efficiencies and deployment speed compared with VMs. Spotify wrote some deployment service scripts for LXCs, but decided it was needlessly duplicating what existed in Docker, which includes additional layers of abstraction and deployment management features. Singh’s group tested Docker on a few internal services to good effect. 
Although the vendor had not yet released a production version of Docker and advised against production use, Spotify took a chance and did just that. "As a stretch goal, we ignored the warning labels and went ahead and deployed a container into production and started throwing production traffic at it," Singh says, referring to a service that provided album metadata such as the album or track titles.2

Thanks to Spotify and others, adoption had risen steadily even before Docker 1.0 was available in June 2014. The Docker application container engine posted on GitHub, the code-sharing network, had received more than 14,800 stars (up-votes by users) by August 14, 2014.3 Container-oriented management tools and orchestration capabilities are just now emerging. When Docker, Inc. (formerly dotCloud, Inc.) released Docker 1.0, the vendor also released Docker Hub, a proprietary orchestration tool available for licensing. PwC anticipates that orchestration tools will also become available from other vendors. Spotify is currently using tools it developed.

Why containers?

LXCs have existed for many years, and some companies have used them extensively. Google, for example, now starts as many as 2 billion containers a week, according to Joe Beda, a senior staff software engineer at Google.4 LXCs abstract the OS more efficiently than VMs. The VM model blends an application, a full guest OS, and disk emulation. In contrast, the container model uses just the application's dependencies and runs them directly on a host OS. Containers do not launch a separate OS for each application, but share the host kernel while maintaining the isolation of resources and processes where required.

The fact that a container does not run its own OS instance reduces dramatically the overhead associated with starting and running instances. Startup time can typically be reduced from 30 seconds (or more) to one-tenth of a second. The number of containers running on a typical server can reach dozens or even hundreds. The same server, in contrast, might support 10 to 15 VMs.

Developer teams such as those at Spotify, which write and deploy services for large software-as-a-service (SaaS) environments, need to deploy new functionality quickly, at scale, and to test and see the results immediately. Increasingly, they say containerization delivers those benefits. SaaS environments by their very test-driven nature require frequent infusions of new code to respond to shifting customer demands. Without containers, developers who write more and more distributed applications would spend much time on repetitive drudgery.

1 For more information, see "Rethinking integration: Emerging patterns from cloud computing leaders," PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml.
2 Rohan Singh, "Docker at Spotify," Twitter University YouTube channel, December 11, 2013, https://www.youtube.com/watch?v=pts6F00GFuU, accessed May 13, 2014, and Jack Clark, "Docker blasts into 1.0, throwing dust onto traditional hypervisors," The Register, June 9, 2014, http://www.theregister.co.uk/2014/06/09/docker_milestone_release/, accessed June 11, 2014.
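What "just the application's dependencies running directly on a host OS" looks like in practice is a short build recipe. The following Dockerfile is a minimal, illustrative sketch for a small Node.js service; the base image tag, file names, service name, and port are assumptions, not Spotify's configuration.

# Dockerfile: package the service and its dependencies as one portable image.
# The container shares the host kernel, so no separate guest OS is booted.
FROM node:0.10
WORKDIR /app
COPY package.json server.js /app/
# Bake the dependencies into the image at build time.
RUN npm install
EXPOSE 3000
CMD ["node", "server.js"]

# Build once, then run the same image on a laptop, a VM, or a bare-metal host:
#   docker build -t album-metadata .
#   docker run -d -p 3000:3000 album-metadata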
Docker: LXC simplification and an emerging multicloud abstraction

A Docker application container takes the basic notion of LXCs, adds simplified ways of interacting with the underlying kernel, and makes the whole portable (or interoperable) across environments that have different operating systems. Portability is currently limited to Linux environments—Ubuntu, SUSE, or Red Hat Enterprise Linux, for example. But Ben Golub, CEO of Docker, Inc., sees no reason why a Dockerized container created on a laptop for Linux couldn't eventually run on a Windows server unchanged. "With Docker, you no longer need to worry in advance about where the apps will run, because the same containerized application will run without being modified on any Linux server today. Going to Windows is a little trickier because the primitives aren't as well defined, but there's no rocket science involved.5 It's just hard work that we won't get to until the second half of 2015."

Figure 1: Virtual machines on a Type 2 hypervisor versus application containerization with a shared OS. In the VM model, each application and its bins/libraries run on their own guest OS atop a Type 2 hypervisor and the host OS. In the container model, applications and their bins/libraries run on a container engine directly on the host OS; containers are isolated, but share the OS and, where appropriate, bins/libraries. Source: Docker, Inc., 2014.

That level of portability can therefore extend across clouds and operating environments, because containerized applications can run on a VM or a bare-metal server, or in clouds from different service providers. The amount of application isolation that Docker containers provide—a primary reason for their portability—distinguishes them from basic LXCs. In Docker, applications and their dependencies, such as binaries and libraries, all become part of a base working image. That containerized image can run on different machines. "Docker defines an abstraction for these machine-specific settings, so the exact same Docker container can run—unchanged—on many different machines, with many different configurations," says Solomon Hykes, CTO of Docker, Inc.6

Another advantage of Docker containerization is that updates, such as vulnerability patches, can be pushed out to the containers that need them without disruption. "You can push changes to 1,000 running containers without taking any of them down, without restarting an OS, without rebuilding a VM," Golub says. Docker's ability to extend the reach of security policy and apply it uniformly is substantial. "The security model becomes much better with containers. In the VM-based world, every application has its own guest OS, which is a slightly different version. These different versions are difficult to patch. In a container-based world, it's easier to standardize the OS and deploy just one patch across all hosts," he adds.

Containerized applications also present opportunities for more comprehensive governance. Docker tracks the provenance of each container by using a method that digitally signs each one.

3 See "dotcloud/docker," GitHub, https://github.com/dotcloud/docker, accessed August 14, 2014.
4 Joe Beda, "Containers At Scale," Gluecon 2014 conference presentation slides, May 22, 2014, https://speakerdeck.com/jbeda/containers-at-scale, accessed June 11, 2014.
Golub sees the potential, over time, for a completely provenanced library of components, each with its own automated documentation and access control capability.7 When VMs were introduced, they formed a new abstraction layer, a way to decouple software from a hardware dependency. VMs led to the creation of clouds, which allowed the load to be distributed among multiple hardware clusters. Containerization using the open Docker standard extends this notion of abstraction in new ways, across homogeneous or heterogeneous clouds. Even more importantly, it lowers the time and cost associated with creating, maintaining, and using the abstraction. Docker management tools such as Docker Hub, CenturyLink Panamax, Apache Mesos, and Google Kubernetes are emerging to address container orchestration and related challenges. Outlook: Containers, continuous deployment, and the rethinking of integration Software engineering has generally trended away from monolithic applications and toward the division of software into orchestrated groups of smaller, semi-autonomous pieces that have a smaller footprint and shorter deployment cycle.8 Microservices principles are leading this change in application architecture, and containers will do the same when it comes to deploying those microservices on any cloud infrastructure. The smaller size, the faster creation, and the subsecond deployment of containers allow enterprises to reduce both the infrastructure and application deployment 5 A primitive is a low-level object, components of which can be used to compose functions. See http://www.webopedia.com/TERM/P/primitive.html, accessed July 28, 2014, for more information. 6 Solomon Hykes, “What does Docker add to just plain LXC?” answer to Stack Overflow Q&A site, August 13, 2013, http://stackoverflow.com/questions/17989306/what-does-docker-add-to-just-plain-lxc, accessed June 11, 2014. 7 See the PwC interview with Ben Golub, “Docker’s role in simplifying and securing multicloud development,” http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/interviews/interview-ben-golub-docker.jhtml for more information. 8 For more detail and a services perspective on this evolution, see “Microservices: The resurgence of SOA principles and an alternative to the monolith,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml. 36 PwC Technology Forecast Containers are redefining application-infrastructure integration From2:monolithic to multicloud architectures Figure From monolithic to multi-cloud architectures 1995 2014 2017? Thick client-server client Thin mobile client Thin mobile/web UI Middleware/OS stack Assembled from available services Microservices Monolithic physical infrastructure VMs on cloud Multicloud container-based infrastructure Physical VMs Containers Cloud Source: Docker, Inc. and PwC, 2014 Containers Multiple clouds Source: Docker, Inc. and PwC, 2014 The smaller size, the faster creation, and the subsecond deployment of containers allow enterprises to reduce both the infrastructure and application deployment cycles from hours to minutes. cycles from hours to minutes. When enterprises can reduce deployment time so it’s comparable to the execution time of the application itself, infrastructure development can become an integral part of the main development process. 
These changes should be accompanied by changes in organizational structures, such as transitioning from waterfall to agile and DevOps teams.9 When VMs became popular, they were initially used to speed up and simplify the deployment of a single server. Once the application architecture internalized the change and monolithic apps started to be divided into smaller pieces, the widely accepted approach of that time—the golden image—could not keep up. VM proliferation and management became the new headaches in the typical organization. This problem led to the creation of configuration management tools that help maintain the desired state of the system. CFEngine pioneered these tools, which Puppet, Chef, and Ansible later popularized. Another cycle of growth led to a new set of orchestration tools. These tools—such as MCollective, Capistrano, and Fabric—manage the complex system deployment on multihost environments in the correct order. Containers might allow the deployment of a single application in less than a second, but now different parts of the application must run on different clouds. The network will become the next bottleneck. The network issues will require systems to have a combination of statelessness and segmentation. Organizations will need to deploy and run subsystems separately with only loose, software-defined network connections. That’s a difficult path. Some centralization may still be necessary. 9 For a complete analysis of the DevOps movement and its implications, see “DevOps: Solving the engineering productivity challenge,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml. 37 PwC Technology Forecast Containers are redefining application-infrastructure integration Conclusion: Beyond application and infrastructure integration Microservices and containers are symbiotic. Together their growth has produced an alternative to integration entirely different from traditional enterprise application integration (EAI). Some of the differences include: Traditional EAI Microservices + Containers Translation (via an enterprise service bus [ESB], for example) Encapsulation Articulation Abstraction Bridging between systems Portability across systems Monolithic, virtualized OS Fit for purpose, distributed OS Wired Loosely coupled 38 PwC Technology Forecast The blend of containers, microservices, and associated management tools will redefine the nature of the components of a system. As a result, organizations that use the blend can avoid the software equivalent of wired, hardto-create, and hard-to-maintain connections. Instead of constantly tinkering with a polyglot connection bus, system architects can encapsulate the application and its dependencies in a lingua franca container. Instead of virtualizing the old OS into the new context, developers can create distributed, slimmed-down operating systems. Instead of building bridges between systems, architects can use containers that allow applications to run anywhere. By changing the nature of integration, containers and microservices enable enterprises to move beyond it. Containers are redefining application-infrastructure integration Technology Forecast: Rethinking integration Issue 1, 2014 Docker’s role in simplifying and securing multicloud development Ben Golub of Docker outlines the company’s application container road map. 
Interview conducted by Alan Morrison, Bo Parker, and Pini Reznik. Ben Golub is CEO of Docker, Inc.

PwC: You mentioned that one of the reasons you decided to join Docker, Inc., as CEO was because the capabilities of the tool itself intrigued you. What intrigued you most?1

BG: The VM was created when applications were long-lived, monolithic, built on a well-defined stack, and deployed to a single server. More and more, applications today are built dynamically through rapid modification. They're built from loosely coupled components in a variety of different stacks, and they're not deployed to a single server. They're deployed to a multitude of servers, and the application that's working on a developer's laptop also must work in the test stage, in production, when scaling, across clouds, in a customer environment on a VM, on an OpenStack cluster, and so forth. The model for how you would do that is really very different from how you would deal with a VM, which is in essence trying to treat an application as if it were an application server. What containers do is pretty radical if you consider their impact on how applications are built, deployed, and managed.2

PwC: What predisposed the market to say that now is the time to start looking at tools like Docker?

BG: When creating an application as if it were an application server, the VM model blends an application, a full guest operating system, and disk emulation. By contrast, the container model uses just the application's dependencies and runs them directly on a host OS. In the server world, the use of containers was limited to companies, such as Google, that had lots of specialized tools and training. Those tools weren't transferable between environments; they didn't make it possible for containers to interact with each other. We often use the shipping container as an analogy. The analogous situation before Docker was one in which steel boxes had been invented but nobody had made them a standard size, put holes in all the same places, and figured out how to build cranes and ships and trains that could use them. We aim to add to the core container technology, so containers are easy to use and interoperable between environments. We want to make them portable between clouds and different operating systems, between physical and virtual. Most importantly, we're working to build an ecosystem around it, so there will be people, tools, and standard libraries that will all work with Docker.3

1 Docker is an open source application deployment container tool released by Docker, Inc., that allows developers to package applications and their dependencies in a virtual container that can run on any Linux server. Docker Hub is Docker, Inc.'s related, proprietary set of image distribution, change management, collaboration, workflow, and integration tools. For more information on Docker Hub, see Ben Golub, "Announcing Docker Hub and Official Repositories," Docker, Inc. (blog), June 9, 2014, http://blog.docker.com/2014/06/announcing-dockerhub-and-official-repositories/, accessed July 18, 2014.

PwC: What impact is Docker having on the evolution of PaaS?4

BG: The traditional VM links together the application management and the infrastructure management.
We provide a very clean separation, so people can use Docker without deciding in advance whether the ideal infrastructure is a public or private cloud, an OpenStack cluster, or a set of servers all running RHEL or Ubuntu. The same container will run in all of those places without modification or delay. Because containers are so much more efficient and lightweight, you can usually gain 10 times greater density when you get rid of that guest operating system. That density really changes the economics of providing XaaS as well as the economics and the ease of moving between different infrastructures. In a matter of milliseconds, a container can be moved between provider A and provider B or between provider A and something private that you’re running. That speed really changes how people think about containers. Docker has become a standard container format for a lot of different platforms as a service, both private and public PaaS. At this point, a lot of people are questioning whether they really need a full PaaS to build a flexible app environment.5 2 Abbreviations are as follows: • VM: virtual machine 3 Abbreviations are as follows: • OS: operating system 4 Abbreviations are as follows: • PaaS: platform as a service 5 Abbreviations are as follows: • RHEL: Red Hat Enterprise Linux 40 PwC Technology Forecast Docker’s role in simplifying and securing multicloud development “What people want is the ability to choose any stack and run it on any platform.” PwC: Why are so many questioning whether or not they need a full PaaS? BG: A PaaS is a set of preselected stacks intended to run the infrastructure for you. What people increasingly want is the ability to choose any stack and run it on any platform. That’s beyond the capability of any one organization to provide. With Docker, you no longer need to worry in advance about where the apps will run, because the same containerized application will run without being modified on any Linux server today. You might build it in an environment that has a lot of VMs and decide you want to push it to a bare-metal cluster for greater performance. All of those options are possible, and you don’t really need to know or think about them in advance. PwC: When you can move Docker containers so easily, are you shifting the challenge to orchestration? BG: Certainly. Rather than having components tightly bound together and stitched up in advance, they’re orchestrated and moved around as needs dictate. Docker provides the primitives that let you orchestrate between containers using a bridge. Ultimately, we’ll introduce more full-fledged orchestration that lets you orchestrate across different data centers. Docker Hub—our commercial services announced in June 2014—is a set of services you can use to orchestrate containers both within a data center and between data centers. PwC: What should an enterprise that’s starting to look at Docker think about before really committing? BG: We’re encouraging people to start introducing Docker as part of the overall workflow, from development to test and then to production. For example, eBay has been using Docker for quite some time. The company previously took weeks to go from development to production. A team would start work on the developer’s laptop, move it to staging or test, and it would break and they weren’t sure why. And as they moved it from test or staging, it would break again and they wouldn’t know why.6 Then you get to production with Docker. The entire runtime environment is defined in the container. 
PwC: The standard operating model with VMs today commingles app and infrastructure management. How do you continue to take advantage of those management tools, at least for now?

BG: If you use Docker with a VM for the host rather than bare metal, you can continue to use those tools. And in the modified VM scenario, rather than having 1,000 applications equal 1,000 VMs, you have 10 VMs, each of which would be running 100 containers.

PwC: What about management tools for Docker that could supplant the VM-based management tools?

BG: We have good tools now, but they're certainly nowhere near as mature as the VM toolset.

PwC: What plans are there to move Docker beyond Linux to Windows, Solaris, or other operating systems?

BG: This year we're focused on Linux, but we've already given ourselves the ability to use different container formats within Linux, including LXC, libvirt, and libcontainer. People who are already in the community are working on having Docker manage Solaris zones and jails. We don't see any huge technical reasons why Docker for Solaris can't happen. Going to Windows is a little bit trickier because the primitives aren't as well defined, but there's no rocket science involved. It's just hard work that we likely won't get to until the second half of 2015.7

7 Abbreviations are as follows: • LXC: LinuX Container

PwC: Docker gets an ecstatic response from developers, but the response from operations people is more lukewarm. A number of those we've spoken with say it's a very interesting technology, but they already have Puppet running in the VMs. Some don't really see the benefit. What would you say to these folks?

BG: I would say there are lots of folks who disagree with them who actually use it in production. Folks in ops are more conservative than developers for good reason. But people will get much greater density and a significant reduction in the amount they're spending on server virtualization licenses and hardware. One other factor even more compelling is that Docker enables developers to deliver what they create in a standardized form. While admin types might hope that developers embrace Chef and Puppet, developers rarely do. You can combine Docker with tools such as Chef and Puppet that the ops folks like and often get the best of both worlds.
PwC: What about security?

BG: People voice concerns about security just because they think containers are new. They're actually not new. The base container technology has been used at massive scale by companies such as Google for several years. The security model becomes much better with containers. Most organizations face hundreds of thousands of vulnerabilities that they know about but have very little ability to address. In the VM-based world, where every application has its own VM, every application has its own guest OS, each of which is a slightly different version. These different versions are difficult to patch. In a container-based world, it's easier to standardize the OS across all hosts. If there's an OS-level vulnerability, there's one patch that just needs to be redeployed across all hosts. Containerized apps are also much easier to update. If there's an application vulnerability, you can push changes to 1,000 running containers without taking any of them down, without restarting an OS, without rebuilding a VM. Once the ops folks begin to understand better what Docker really does, they can get a lot more excited.

PwC: Could you extend a governance model along these same lines?

BG: Absolutely. Generally, when developers build with containers, they start with base images. Having a trusted library to start with is a really good approach. These days, containers are created directly from source, which in essence means you can put a set of instructions in a source code repository. As that source code is used, the changes that get committed essentially translate automatically into an updated container. What we're adding to that is what we call provenance. That's the ability to digitally sign every container so you know where it came from, all the way back to the source. That's a much more comprehensive security and governance model than trying to control what different black boxes are doing.
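The provenance idea (sign every container so it can be traced back to its source) can be illustrated in a few lines. The sketch below is an editorial illustration, not Docker Hub's signing mechanism: it content-addresses an exported image and attaches a verifiable tag to that digest, with an HMAC and a shared secret standing in for a real publisher key, so a host can refuse to run anything whose contents or signature do not check out.

```python
import hashlib
import hmac
import subprocess

SIGNING_KEY = b"not-a-real-key"   # stand-in for a publisher's signing key

def image_digest(image, tarball="image.tar"):
    # Export the image and content-address it: any change to the image
    # contents changes the digest.
    subprocess.check_call(["docker", "save", "-o", tarball, image])
    h = hashlib.sha256()
    with open(tarball, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def sign(digest):
    # A real system would sign the digest with a private key held by the
    # publisher; an HMAC with a shared secret stands in for that here.
    return hmac.new(SIGNING_KEY, digest.encode(), hashlib.sha256).hexdigest()

def verify(image, claimed_digest, signature):
    # Refuse to run anything whose signature or contents fail the check.
    sig_ok = hmac.compare_digest(sign(claimed_digest), signature)
    content_ok = image_digest(image) == claimed_digest
    return sig_ok and content_ok
```

In practice a registry would store the digest and signature alongside the image, and the host would check both before starting the container.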
PwC: What's the outlook for the distributed services model generally?

BG: I won't claim that we can change the laws of physics. A terabyte of data doesn't move easily across narrow pipes. But if applications and databases can be moved rapidly, and if they consistently define where they look for data, then the things that should be flexible can be. For example, if you want the data resident in two different data centers, that could be a lot of data. Either you could arrange it so the data eventually become consistent or you could set up continuous replication of data from one location to the other using something like CDP.8 I think either of those models works.

8 Abbreviations are as follows: • CDP: continuous data protection

Technology Forecast: Rethinking integration Issue 1, 2014

What do businesses need to know about emerging integration approaches? Sam Ramji of Apigee views the technologies of the integration fabric through a strategy lens.

Interview conducted by the Technology Forecast team

PwC: We've been looking at three emerging technologies: data lakes, microservices, and Docker containers. Each has a different impact at a different layer of the integration fabric. What do you think they have in common?1

Sam Ramji Sam Ramji is vice president of strategy at Apigee.

SR: What has happened here has been the rightsizing of all the components. IT providers previously built things assuming that compute, storage, and networking capacity were scarce. Now they're abundant. But even when they became abundant, end users didn't have tools that were the right size. Containers have rightsized computing, and Hadoop has rightsized storage. With HDFS or Cassandra or a NoSQL database, companies can process enormous amounts of data very easily. And HTTP-based, bindable endpoints that can talk to any compute source have rightsized the network. So between Docker containers for compute, data lakes for storage, and APIs for networking, these pieces are finally small enough that they all fit very nicely, cleanly, and perfectly together at the same time.2

1 For more background on data lakes, microservices, and Docker containers, see "Rethinking integration: Emerging patterns from cloud computing leaders," PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml. 2 Abbreviations are as follows: • HDFS: Hadoop Distributed File System • NoSQL: Not only SQL • API: application programming interface

PwC: To take advantage of the three integration-related trends that are emerging at the same time, what do executives generally need to be cautious about?

SR: One risk is how you pitch this new integration approach politically. It may not be possible to convert the people who are currently working on integration. For those who are a level or two above the fray, one of the top messages is to make the hard decisions and say, "Yes, we're moving our capital investments from solving all problems the old ways to getting ready for new problems." It's difficult.

PwC: How do you justify such a major change when presenting the pitch?

SR: The incremental cost in time and money for new apps is too high—companies must make a major shift. Building a new app can take nine months. Meanwhile, marketing departments want their companies to build three new apps per quarter. That will consume a particular amount of money and will require a certain amount of planning time. And then by the time the app ships nine months later, it needs a new feature because there's a new service like Pinterest that didn't exist earlier and now must be tied in.

PwC: Five years ago, nine months would have been impossibly fast. Now it's impossibly slow.

SR: And that means the related business processes are too slow and complicated, and costs are way too high. Companies spend between $300,000 and $700,000 in roughly six to seven months to implement a new partner integration. That high cost is becoming prohibitive in today's value-network-based world, where companies constantly try to add new nodes to their value network and maybe prune others. Let's say there's a new digital pure play, and your company absolutely must be integrated with it. Or perhaps you must come up with some new joint offer to a customer segment you're trying to target. You can't possibly afford to be in business if you rely on those old approaches because now you need to get 10 new partnerships per order. And you certainly can't do that at a cost of $500,000 per partner.
This new approach to integration could enable companies to develop and deliver apps in three months. Partner integration would be complete in two months for $50,000, not $500,000.

PwC: How about CIOs in particular? How should they pitch the need for change to their departments?

SR: Web scale is essential now. ESBs absolutely do not scale to meet mobile demands. They cannot support three fundamental components of mobile and Internet access. In general, ESBs were built for different purposes. A fully loaded ESB that's performing really well will typically cost an organization millions of dollars to run about 50 TPS. The average back end that's processing mobile transactions must run closer to 1,000 TPS. Today's transaction volumes require systems to run at web scale. They will crush an ESB. The second issue is that ESBs are not built for identity. ESBs generally perform system-to-system identity. They're handling a maximum of 10,000 different identities in the enterprise, and those identities are organizations or systems—not individual end users, which is crucial for the new IT. If companies don't have a user's identity, they'll have a lot of other issues around user profiling or behavior. They'll have user amnesia and problems with audits or analytics. The third issue is the ability to handle the security handoff between external devices that are built in tools and languages such as JavaScript and to bridge those devices into ESB native security. ESB is just not a good fit when organizations need scale, identity, and security.3

PwC: IT may still be focused on core systems where things haven't really changed a lot.

SR: Yes, but that's not where the growth is. We're seeing a ton of new growth in these edge systems, specifically from mobile. There are app-centric uses that require new infrastructure, and they're distinct from what I call plain old integration.

PwC: How about business unit managers and their pitches to the workforce? People in the business units may wonder why there's such a preoccupation with going digital.

SR: When users say digital, what they really mean is digital data that's ubiquitous and consumable and computable by pretty much any device now. The industry previously built everything for a billion PCs. These PCs were available only when people chose to walk to their desks. Now people typically have three or more devices, many of which are mobile. They spend more time computing. It's not situated computing where they get stuck at a desk. It's wherever they happen to be. So the volume of interactions has gone up, and the number of participants has gone up. About 3 billion people work at computing devices in some way, and the volume of interactions has gone up many times. The shift to digital interactions has been basically an order of magnitude greater than what we supported earlier, even for web-based computing through desktop computers.

3 Abbreviations are as follows: • ESB: enterprise service bus • TPS: transactions per second

PwC: The continuous delivery mentality of DevOps has had an impact, too.
If the process is in software, the expectations are that you should be able to turn on a dime.4 SR: Consumer expectations about services are based on what they’ve seen from largescale services such as Facebook and Google that operate in continuous delivery mode. They can scale up whenever they need to. Availability is as important as variability. Catastrophic successes consistently occur in global corporations. The service gets launched, and all of a sudden they have 100,000 users. That’s fantastic. Then they have 200,000 users, which is still fantastic. Then they reach 300,000. Crunch. That’s when companies realize that moving around boxes to try to scale up doesn’t work anymore. They start learning from web companies how to scale. PwC: The demand is for fluidity and availability, but also variability. The load is highly variable. SR: Yes. In highly mobile computing, the demand patterns for digital interactions are extremely spiky and unpredictable. None of these ideas is new. Eleven years ago when I was working for Adam Bosworth at BEA Systems, he wrote a paper about the autonomic model of computing in which he anticipated natural connectedness and smaller services. We thought web services would take us there. We were wrong about that as a technology, but we were right about the direction. We lacked the ability to get people to understand how to do it. People were building services that were too big, and we didn’t realize why the web services stack was still too bulky to be consumed and easily adopted by a lot of people. It wasn’t the right size before, but now it’s shrunk down to the right size. I think that’s the big difference here. 4 DevOps refers to a closer collaboration between developers and operations people that becomes necessary for a more continuous flow of changes to an operational code base, also known as continuous delivery. Thus, DevOps=Dev+Ops. For more on continuous delivery and DevOps, see “DevOps: Solving the engineering productivity challenge,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml. 47 PwC Technology Forecast What do businesses need to know about emerging integration approaches? Technology Forecast: Rethinking integration Issue 1, 2014 Zero-integration technologies and their role in transformation By Bo Parker The key to integration success is reducing the need for integration in the first place. 48 Issue overview: Rethinking integration This article summarizes three topics also covered individually in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/ us/en/technologyforecast/2014/issue1/ index.jhtml). The integration fabric is a central component for PwC’s New IT Platform. (See http://www.pwc. com/us/en/increasingit-effectiveness/new-itplatform.jhtml for more information.) Social, mobile, analytics, cloud—SMAC for short—have set new expectations for what a high-performing IT organization delivers to the enterprise. Yet they can be saviors if IT figures out how to embrace them. As PwC states in “Reinventing Information Technology in the Digital Enterprise”: Business volatility, innovation, globalization and fierce competition are forcing business leaders to review all aspects of their businesses. High on the agenda: Transforming the IT organization to meet the needs of businesses today. 
Successful IT organizations of the future will be those that evaluate new technologies with a discerning eye and cherry pick those that will help solve the organization's most important business problems. This shift requires change far greater than technology alone. It requires a new mindset and a strong focus on collaboration, innovation and "outside-in" thinking with a customer-centric point of view.1

The shift starts with rethinking the purpose and function of IT while building on its core historical role of delivering and maintaining stable, rock-solid transaction engines. Rapidly changing business needs are pushing enterprises to adopt a digital operating model. This move reaches beyond the back-office and front-office technology. Every customer, distributor, supplier, investor, partner, employee, contractor, and especially any software agents substituting for those conventional roles now expects a digital relationship. Such a relationship entails more than converting paper to web screens. Digital relationships are highly personalized, analytics-driven interactions that are absolutely reliable, that deliver surprise and delight, and that evolve on the basis of previous learnings. Making digital relationships possible is a huge challenge, but falling short will have severe consequences for every enterprise unable to make a transition to a digital operating model.

How to proceed? Successfully adopting a digital operating model requires what PwC calls a New IT Platform. This innovative platform aligns IT's capabilities to the dynamic needs of the business and empowers the entire organization with technology. Empowerment is an important focus. That's because a digital operating model won't be something IT builds from the center out. It won't be something central IT builds much of at all. Instead, building out the digital operating model—whether that involves mobile apps, software as a service, or business units developing digital value propositions on third-party infrastructure as a service or on internal private clouds—will happen closest to the relevant part of the ecosystem. What defines a New IT Platform? The illustration highlights the key ingredients.

PwC's New IT Platform: The New IT Platform encompasses transformation across the organization. [Figure: The Mandate (Broker of Services) + The Process (Assemble-to-Order) + The Architecture (Integration Fabric) + The Organization (Professional Services Structure) + The Governance (Empowering Governance) = New IT Platform]

1 "Reinventing Information Technology in the Digital Enterprise," PwC, December 2013, http://www.pwc.com/us/en/increasing-it-effectiveness/publications/new-it-platform.jhtml.

The New IT Platform emphasizes consulting, guiding, brokering, and using existing technology to assemble digital assets rather than build from scratch. A major technology challenge that remains—and one that central IT is uniquely suited to address—is to establish an architecture that facilitates the integration of an empowered, decentralized enterprise technology landscape. PwC calls it the new integration fabric. Like the threads that combine to create a multicolored woven blanket, a variety of new integration tools and methods will combine to meet a variety of challenges. And like a fabric, these emerging tools and methods rely on each other to weave in innovations, new business partners, and new operating models.
The common denominator of these new integration tools and methods is time: The time it takes to use new data and discover new insights from old data. The time it takes to modify a business process supported by software. The time it takes to promote new code into production. The time it takes to scale up infrastructure to support the overnight success of a new mobile app. The bigger the denominator (time), the bigger the numerator (expected business value) must be before a business will take a chance on a new innovation, a new service, or an improved process. Every new integration approach tries to reduce integration time to as close to zero as possible.

Given the current state of systems integration, getting to zero might seem like a pipe dream. In fact, most of the key ideas behind zero-integration technologies aren't coming from traditional systems integrators or legacy technologies. They are coming from web-scale companies facing critical problems for which new approaches had to be invented. The great news is that these inventions are often available as open source, and a number of service providers support them.

How to reach zero integration? What has driven web-scale companies to push toward zero-integration technologies? These companies operate in ecosystems that innovate in web time. Every web-scale company is conceivably one startup away from oblivion. As a result, today's smart engineers provide a new project deliverable in addition to working code. They deliver IT that is change-forward friendly. Above all, change-forward friendly means that doing something new and different is just as easy four years and 10 million users into a project as it was six months and 1,000 users into the project. It's all about how doing something new integrates with the old.

More specifically, change-forward-friendly data integration is about data lakes.2 All data is in the lake, schemas are created on read, metadata generation is collaborative, and data definitions are flexible rather than singular definitions fit for a business purpose. Change-forward-friendly data integration means no time is wasted getting agreement across the enterprise about what means what. Just do it.

Change-forward-friendly application and services integration is about microservices frameworks and principles.3 It uses small, single-purpose code modules, relaxed approaches to many versions of the same service, and event loop messaging. It relies on organizational designs that acknowledge Conway's law, which says the code architecture reflects the IT organization architecture. In other words, when staffing large code efforts, IT should organize people into small teams around business-meaningful neighborhoods to minimize the cognitive load associated with working together. Just code it.

2 See the article "The enterprise data lake: Better integration and deeper analytics," PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml. 3 See the article "Microservices: The resurgence of SOA principles and an alternative to the monolith," PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml.
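To ground the schema-on-read idea in the data-lake paragraph above, here is a small, self-contained sketch. Plain Python strings and parsers stand in for files in HDFS and a query engine, and the record formats are invented for illustration: raw records land in the lake untouched, and each consumer imposes only the structure it needs at read time.

```python
import csv
import io
import json

# Records land in the lake exactly as they arrived: no up-front model,
# no all-or-nothing transformation.
RAW_LAKE = [
    '{"user": "u-17", "event": "click", "page": "/pricing"}',
    '{"user": "u-42", "event": "search", "query": "container hosting"}',
    "u-17,2014-06-03,order,149.00",   # a CSV extract from an order system
]

def clickstream_view():
    # One consumer's schema: JSON events only, and only the fields it needs.
    docs = []
    for line in RAW_LAKE:
        try:
            docs.append(json.loads(line))
        except ValueError:
            continue                  # not JSON; some other consumer's concern
    return [(d["user"], d["event"]) for d in docs]

def orders_view():
    # A different consumer applies a different schema to the same raw records.
    rows = [line for line in RAW_LAKE if not line.lstrip().startswith("{")]
    reader = csv.reader(io.StringIO("\n".join(rows)))
    return [{"user": u, "date": d, "amount": float(amount)}
            for u, d, kind, amount in reader]

print(clickstream_view())
print(orders_view())
```

Neither view required agreement on an enterprise-wide schema before the data could be stored, which is the point of binding structure late.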
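The microservices paragraph can be reduced to a sketch in the same spirit: one small, stateless, single-purpose service behind an HTTP endpoint, owned end to end by one small team. The service below uses only the Python standard library; the endpoint, port, and data are illustrative.

```python
import json
from wsgiref.simple_server import make_server

PRICES = {"sku-1": 19.99, "sku-2": 4.50}   # this service's one concern

def price_service(environ, start_response):
    # Single purpose: GET /price/<sku> returns {"sku": ..., "price": ...}.
    sku = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]
    if sku in PRICES:
        body = json.dumps({"sku": sku, "price": PRICES[sku]}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
    else:
        body = b'{"error": "unknown sku"}'
        start_response("404 Not Found", [("Content-Type", "application/json")])
    return [body]

if __name__ == "__main__":
    # Catalog, checkout, and inventory are someone else's services,
    # reachable only over the network, and each can be updated,
    # replaced, or scaled without touching this one.
    make_server("", 8000, price_service).serve_forever()
```

Keeping each module this small is what makes the relaxed versioning and independent deployment described above workable.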
50 PwC Technology Forecast Zero-integration technologies Change-forward-friendly infrastructure integration is about container frameworks, especially Docker.4 The speed required by data science innovators using data lakes and by ecosystem innovators using microservices frameworks will demand infrastructure that is broadly consistent with zero-integration principles. That means rethinking the IT stack and the roles of the operating system, hypervisors, and automation tools such as Chef and Puppet. Such an infrastructure also means rethinking operations and managing by chaos principles, where failures are expected, their impacts are isolated, and restarts are instantaneous. Just run it. This is the story of the new integration fabric. Read more about data lakes, microservices, and containers in the articles in the PwC Technology Forecast 2014, Issue 1. But always recall what Ronald Reagan once said about government, rephrased here in the context of technology: “Integration is not the solution to our problem, integration is the problem.” Change-forwardfriendly integration means doing whatever it takes to bring time to integration to zero. 4 See the article “Containers are redefining application-infrastructure integration,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/open-source-application-deployment-containers.jhtml. 51 PwC Technology Forecast Zero-integration technologies Acknowledgments Advisory Reviewers US Technology Consulting Leader Gerard Verweij Rohit Antao Phil Berman Julien Furioli Oliver Halter Glen Hobbs Henry Hwangbo Rajesh Rajan Hemant Ramachandra Ritesh Ramesh Zach Sachen Chief Technologist Chris Curran New IT Platform Leader Michael Pearl Strategic Marketing Lock Nelson Bruce Turner US Thought Leadership Partner Rob Gittings Center for Technology and Innovation Managing Editor Bo Parker Editors Vinod Baya Alan Morrison Contributors Galen Gruman Pini Resnik Bill Roberts Brian Stein Editorial Advisor Larry Marion Copy Editor Lea Anne Bantsari US Creative Team Infographics Tatiana Pechenik Chris Pak Layout Jyll Presley Web Design Jaime Dirr Greg Smith Special thanks Eleni Manetas and Gabe Taylor Mindshare PR Wunan Li Akshay Rao Industry perspectives During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives: Darren Cunningham Vice President of Marketing SnapLogic Michael Facemire Principal Analyst Forrester Ben Golub CEO Docker, Inc. Mike Lang CEO Revelytix Ross Mason Founder and Vice President of Product Strategy MuleSoft Sean Martin CTO Cambridge Semantics John Pritchard Director of Platform Services Adobe Systems Sam Ramji Vice President of Strategy Apigee Richard Rodger CTO nearForm Dale Sanders Senior Vice President Health Catalyst Ted Schadler Vice President and Principal Analyst Forrester Brett Shepherd Director of Big Data Product Marketing Splunk Eric Simone CEO ClearBlade Sravish Sridhar Founder and CEO Kinvey Michael Topalovich CTO Delivered Innovation Michael Voellinger Managing Director ClearBlade Glossary Data lake A single, very large repository for less-structured data that doesn’t require up-front modeling, a data lake can help resolve the nagging problem of accessibility and data integration. Microservices architecture Microservices architecture (MSA) breaks an application into very small components that perform discrete functions, and no more. 
The fine-grained, stateless, selfcontained nature of microservices creates decoupling between different parts of a code base and is what makes them easy to update, replace, remove, or augment. Linux containers and Docker LinuX Containers (LXCs) allow different applications to share operating system (OS) kernel, CPU, and RAM. Docker containers go further, adding layers of abstraction and deployment management features. Among the benefits of this new infrastructure technology, containers that have these capabilities reduce coding, deployment time, and OS licensing costs. Zero integration Every new integration approach tries to reduce integration time to as close to zero as possible. Zero integration means no time is wasted getting agreement across the enterprise about what means what. To have a deeper conversation about this subject, please contact: Gerard Verweij Principal and US Technology Consulting Leader +1 (617) 530 7015 [email protected] Chris Curran Chief Technologist +1 (214) 754 5055 [email protected] Michael Pearl Principal New IT Platform Leader +1 (408) 817 3801 [email protected] Bo Parker Managing Director Center for Technology and Innovation +1 (408) 817 5733 [email protected] Alan Morrison Technology Forecast Issue Editor and Researcher Center for Technology and Innovation +1 (408) 817 5723 [email protected] About PwC’s Technology Forecast Published by PwC’s Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today’s leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast. About PwC PwC US helps organizations and individuals create the value they’re looking for. We’re a member of the PwC network of firms in 157 countries with more than 195,000 people. We’re committed to delivering quality in assurance, tax and advisory services. Find out more and tell us what matters to you by visiting us at www.pwc.com. Comments or requests? Please visit www.pwc.com/ techforecast or send e-mail to [email protected]. © 2014 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/ structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-0186