2014, Issue 1

Rethinking integration: Emerging patterns from cloud computing leaders

1   The enterprise data lake: Better integration and deeper analytics
19  Microservices: The resurgence of SOA principles and an alternative to the monolith
33  Containers are redefining application-infrastructure integration
48  Zero-integration technologies and their role in transformation
Contents
Features
2014
Issue 1
Data lake
The enterprise data lake: Better
integration and deeper analytics
1
Microservices
architecture (MSA)
Microservices: The resurgence
of SOA principles and an
alternative to the monolith
19
Linux containers and Docker
Containers are redefining application-infrastructure integration
33
Zero integration
Zero-integration technologies and
their role in transformation
48
Related interviews
Mike Lang
CEO of Revelytix on
how companies are
using data lakes
9
Dale Sanders
SVP at Health
Catalyst on agile
data warehousing
in healthcare
13
John Pritchard
Director of platform
services at Adobe on
agile coding in the
software industry
26
Richard Rodger
CTO of nearForm
on the advantages
of microservices
architecture
29
Sam Ramji
VP of Strategy at Apigee on integration trends and the bigger picture
39

Ben Golub
CEO of Docker on the outlook for Linux containers
44
Technology Forecast: Rethinking integration
Issue 1, 2014
The enterprise data lake:
Better integration and
deeper analytics
By Brian Stein and
Alan Morrison
Data lakes that can scale at the pace of the cloud remove
integration barriers and clear a path for more timely and
informed business decisions.
1
Data lakes: An emerging
approach to cloud-based big data
Enterprises across industries are starting to extract and place data for analytics into a single, Hadoop-based repository.
UC Irvine Medical Center maintains millions
of records for more than a million patients,
including radiology images and other semi-structured reports, unstructured physicians'
notes, plus volumes of spreadsheet data. To solve its challenges with data storage, integration, and accessibility, the hospital created a data lake based on
a Hadoop architecture, which enables
distributed big data processing by using
broadly accepted open software standards and
massively parallel commodity hardware.
Hadoop allows the hospital’s disparate records
to be stored in their native formats for later
parsing, rather than forcing all-or-nothing
integration up front as in a data warehousing
scenario. Preserving the native format also
helps maintain data provenance and fidelity,
so different analyses can be performed
using different contexts. The data lake has
made possible several data analysis projects,
including the ability to predict the likelihood
of readmissions and take preventive measures
to reduce the number of readmissions.1
Like the hospital, enterprises across industries are starting to extract and place data for analytics into a single Hadoop-based repository without first transforming the data the way they would need to for a relational data warehouse.2 The basic concepts behind Hadoop3 were devised by Google to meet its need for a flexible, cost-effective data processing model that could scale as data volumes grew faster than ever. Yahoo, Facebook, Netflix, and others whose business models also are based on managing enormous data volumes quickly adopted similar methods. Costs were certainly a factor, as Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing.
A basic Hadoop architecture for scalable data lake infrastructure

Hadoop stores and preserves data in any format across a commodity server cluster in the Hadoop Distributed File System (HDFS). The system splits up the jobs and distributes, processes, and recombines them (map, partition, combine, sort, reduce, coordinated by a job tracker) via a cluster that can scale to thousands of server nodes, turning input files into output files. With YARN, Hadoop now supports various programming models and near-real-time outputs in addition to batch.

Source: Electronic Design, 2012, and Hortonworks, 2014
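As a rough, single-process sketch of the map, partition/sort, and reduce flow the figure describes, the following Python fragment counts words across input splits. It only imitates in one process what Hadoop distributes across a cluster, and the sample input is invented.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_task(split: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Map phase: turn each line of an input split into (key, value) pairs."""
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Partition/sort stand-in: group values by key, as the framework would
    route them to reduce tasks across regions of the cluster."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce phase: combine each key's values into one output record."""
    return {key: sum(values) for key, values in groups.items()}

splits = [["the lake stores raw data"], ["raw data stays raw"]]   # two input splits
pairs = (pair for split in splits for pair in map_task(split))
print(reduce_task(shuffle(pairs)))   # {'the': 1, 'lake': 1, ..., 'raw': 3, 'data': 2}
```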
1 “UC Irvine Health does Hadoop,” Hortonworks, http://hortonworks.com/customer/uc-irvine-health/.
2 See Oliver Halter, “The end of data standardization,” March 20, 2014, http://usblogs.pwc.com/emerging-technology/the-end-of-datastandardization/, accessed April 17, 2014.
3 Apache Hadoop is a collection of open standard technologies that enable users to store and process petabyte-sized data volumes via
commodity computer clusters in the cloud. For more information on Hadoop and related NoSQL technologies, see “Making sense of Big
Data,” PwC Technology Forecast 2010, Issue 3 at http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml.
Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing.
Another driver
of adoption has been the opportunity to defer
labor-intensive schema development and data
cleanup until an organization has identified a
clear business need. And data lakes are more
suitable for the less-structured data these
companies needed to process.
Today, companies in all industries find
themselves at a similar point of necessity.
Enterprises that must use enormous volumes
and myriad varieties of data to respond
to regulatory and competitive pressures
are adopting data lakes. Data lakes are an
emerging and powerful approach to the
challenges of data integration as enterprises
increase their exposure to mobile and cloud-based applications, the sensor-driven Internet
of Things, and other aspects of what PwC calls
the New IT Platform.
Issue overview: Integration fabric
The data lake topic is the first of three topics covered as part of the integration fabric research in this issue of the PwC Technology Forecast. The integration fabric is a central component of PwC's New IT Platform.*
Enterprises are starting to embrace more practical integration.** A range of these new
approaches is now emerging, and during the next few months we’ll ponder what the new
cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore
include these:
Integration fabric layers, the integration challenges they face, and emerging technology solutions:

Data
Integration challenges: Data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types
Emerging technology solutions: Hadoop data lakes, late binding, and metadata provenance tools
Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what's necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.

Applications and services
Integration challenges: Rigid, monolithic systems that are difficult to update in response to business needs
Emerging technology solutions: Microservices
Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.

Infrastructure
Integration challenges: Multiple clouds and operating systems that lack standardization
Emerging technology solutions: Software containers for resource isolation and abstraction
New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.
* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.
**Integration as PwC defines it means making diverse components work together so that they function as a single entity. See "integrated system" at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.
What is a data lake?

A data lake is a repository for large quantities and varieties of data, both structured and unstructured. The data lake accepts input from various sources and can preserve both the original data fidelity and the lineage of data transformations. Data models emerge with usage over time rather than being imposed up front. Data scientists use the lake for discovery and ideation; data generalists and programmers can tap the stream data for real-time analytics; and the lake can serve as a staging area for the data warehouse, the location of more carefully "treated" data for reporting and analysis in batch mode. Data lakes take advantage of commodity cluster computing techniques for massively scalable, low-cost storage of data files in any format.
Why a data lake?
Data lakes can help resolve the nagging
problem of accessibility and data integration.
Using big data infrastructures, enterprises
are starting to pull together increasing data
volumes for analytics or simply to store for
undetermined future use. (See the sidebar
“Data lakes defined.”) Mike Lang, CEO of
Revelytix, a provider of data management tools
for Hadoop, notes that “Business owners at the
C level are saying, ‘Hey guys, look. It’s no longer
inordinately expensive for us to store all of our
data. I want all of you to make copies. OK, your
systems are busy. Find the time, get an extract,
and dump it in Hadoop.’”
Previous approaches to broad-based data
integration have forced all users into a common
predetermined schema, or data model. Unlike
this monolithic view of a single enterprise-wide data model, the data lake relaxes
standardization and defers modeling, resulting
in a nearly unlimited potential for operational
insight and data discovery. As data volumes,
data variety, and metadata richness grow, so
does the benefit.
Recent innovation is helping companies to
collaboratively create models—or views—
of the data and then manage incremental
improvements to the metadata. Data scientists
and business analysts using the newest lineage
tracking tools such as Revelytix Loom or Apache
Falcon can follow each other’s purpose-built
data schemas. The lineage tracking metadata
also is placed in the Hadoop Distributed File
System (HDFS)—which stores pieces of files
across a distributed cluster of servers in the
cloud—where the metadata is accessible and
can be collaboratively refined. Analytics drawn
from the lake become increasingly valuable as
the metadata describing different views of the
data accumulates.
Every industry has a potential data lake use
case. A data lake can be a way to gain more
visibility or put an end to data silos. Many
companies see data lakes as an opportunity to
capture a 360-degree view of their customers or
to analyze social media trends.
In the financial services industry, where
Dodd-Frank regulation is one impetus, an
institution has begun centralizing multiple
data warehouses into a repository comparable
to a data lake, but one that standardizes on
XML. The institution is moving reconciliation,
settlement, and Dodd-Frank reporting to
the new platform. In this case, the approach reduces integration overhead because data is communicated and stored in a standard yet flexible format suitable for less-structured data. The system also provides a consistent view of a customer across operational functions, business functions, and products.
Some companies have built big data sandboxes
for analysis by data scientists. Such sandboxes
are somewhat similar to data lakes, albeit
narrower in scope and purpose. PwC, for
example, built a social media data sandbox to
help clients monitor their brand health by using
its SocialMind application.4
Data lakes defined

Many people have heard of data lakes, but like the term big data, definitions vary. Four criteria are central to a good definition:

• Size and low cost: Data lakes are big. They can be an order of magnitude less expensive on a per-terabyte basis to set up and maintain than data warehouses. With Hadoop, petabyte-scale data volumes are neither expensive nor complicated to build and maintain. Some vendors that advocate the use of Hadoop claim that the cost per terabyte for data warehousing can be as much as $250,000, versus $2,500 per terabyte (or even less than $1,000 per terabyte) for a Hadoop cluster. Other vendors advocating traditional data warehousing and storage infrastructure dispute these claims and make a distinction between the cost of storing terabytes and the cost of writing or written terabytes.*

• Fidelity: Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal audit. If the data has undergone transformations, aggregations, and updates, most organizations typically struggle to piece data together when the need arises and have little hope of determining clear provenance.

• Ease of accessibility: Accessibility is easy in the data lake, which is one benefit of preserving the data in its original form. Whether structured, unstructured, or semi-structured, data is loaded and stored as is, to be transformed later. Customer, supplier, and operations data are consolidated with little or no effort from data owners, which eliminates internal political or technical barriers to increased data sharing. Neither detailed business requirements nor painstaking data modeling is a prerequisite.

• Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models.

*For more on data accessibility, data lake cost, and collective metadata refinement including lineage tracking technology, see the interview with Mike Lang, "Making Hadoop suitable for enterprise data science," at http://www.pwc.com/us/en/technology-forecast/2014/issue1/interviews/interview-revelytix.jhtml. For more on cost estimate considerations, see Loraine Lawson, "What's the Cost of a Terabyte?" ITBusinessEdge, May 17, 2013, at http://www.itbusinessedge.com/blogs/integration/whats-the-cost-of-a-terabyte.html.

Motivating factors behind the move to data lakes
Relational data warehouses and their big price
tags have long dominated complex analytics,
reporting, and operations. (The hospital
described earlier, for example, first tried a
relational data warehouse.) However, their
slow-changing data models and rigid field-to-field integration mappings are too brittle to
support big data volume and variety. The vast
majority of these systems also leave business
users dependent on IT for even the smallest
enhancements, due mostly to inelastic design,
unmanageable system complexity, and low
system tolerance for human error. The data
lake approach circumvents these problems.
Freedom from the shackles of one
big data model
Job number one in a data lake project is to
pull all data together into one repository
while giving minimal attention to creating
schemas that define integration points between
disparate data sets. This approach facilitates
access, but the work required to turn that
data into actionable insights is a substantial
challenge. While integrating the data takes
place at the Hadoop layer, contextualizing the
metadata takes place at schema creation time.
Integrating data involves fewer steps because data lakes don't enforce a rigid metadata schema as do relational data warehouses. Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries. Data is bound to a dynamic schema created upon query execution. The late-binding principle shifts the data modeling from centralized data warehousing teams and database administrators, who are often remote from data sources, to localized teams of business analysts and data scientists, who can help create flexible, domain-specific context. For those accustomed to SQL, this shift opens a whole new world.
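As a minimal sketch of schema on read: the analyst declares a schema only when querying, not when the extract lands in the lake. The article names no particular engine, so Apache Spark (PySpark), the HDFS path, and the field names below are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("late-binding-sketch").getOrCreate()

# The raw extracts were dumped into the lake as is; nothing was modeled up front.
# The schema below is bound at read/query time by the analyst who needs it.
schema = StructType([
    StructField("household_id", StringType()),
    StructField("channel", StringType()),
    StructField("event_time", LongType()),
])

events = spark.read.schema(schema).json("hdfs:///lake/raw/settop_box/")  # hypothetical path
events.createOrReplaceTempView("viewing_events")

# A different team could bind a completely different schema to the same files.
spark.sql("""
    SELECT household_id, COUNT(*) AS events
    FROM viewing_events
    GROUP BY household_id
""").show()
```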
4 For more information on SocialMind and other analytics applications PwC offers, see http://www.pwc.com/us/en/analytics/analyticsapplications.jhtml.
“We see customers
creating big
data graveyards,
dumping
everything into
HDFS and hoping
to do something
with it down the
road. But then
they just lose track
of what’s there.”
—Sean Martin,
Cambridge Semantics
In this approach, the more that is known about the metadata, the easier it is to query. Pre-tagged data, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Resource Description Framework (RDF), offers a starting point and is highly useful in implementations with limited data variety. In most cases, however, pre-tagged data is a small portion of incoming data formats.
Early lessons and pitfalls to avoid

Some data lake initiatives have not succeeded, producing instead more silos or empty sandboxes. Given the risk, everyone is proceeding cautiously. "We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what's there," says Sean Martin, CTO of Cambridge Semantics, a data management tools provider.
Companies avoid creating big data graveyards by developing and executing a solid strategic plan that applies the right technology and methods to the problem. Few technologies in recent memory have as much change potential as Hadoop and the NoSQL (Not only SQL) category of databases, especially when they can enable a single enterprise-wide repository and provide access to data previously trapped in silos. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. A means of creating, enriching, and managing semantic metadata incrementally is essential.
Data flow in the data lake

The data lake loads data extracts, irrespective of format, into a big data store. Metadata is decoupled from its underlying data and stored independently, enabling flexibility for multiple end-user perspectives and incrementally maturing semantics. The data lake offers a unique opportunity for flexible, evolving, and maturing big data insights.

Upstream data extracts (XML, .xls, and other formats) flow into a big data repository that stores data as is, loading existing data and accepting new feeds regularly. Metadata grows and matures over time via user interaction: through tagging, synonyms, and linking, users collaborate to identify, organize, and make sense of the data in the data lake. Business and data analysts select and report on domain-specific data; data scientists and app developers prepare and analyze attribute-level data; and machines help discover patterns and create data views. Cross-domain data analysis leads to new actions (such as customer campaigns) based on insights from the data, while new data keeps coming into the lake.
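A rough sketch of the idea that metadata lives apart from the underlying files and matures through user interaction: the snippet below keeps a small catalog entry of tags, synonyms, and links next to, but separate from, a raw extract. The catalog layout and field names are invented for illustration; tools such as Revelytix Loom or Apache Falcon handle this far more completely.

```python
import json
from pathlib import Path

CATALOG = Path("lake_catalog.json")   # hypothetical, file-based stand-in for a metadata store

def tag_dataset(dataset_path: str, tags: list[str], synonyms: dict[str, str],
                links: list[str]) -> None:
    """Record or refine catalog metadata for a raw file without touching the file itself."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    entry = catalog.setdefault(dataset_path, {"tags": [], "synonyms": {}, "links": []})
    entry["tags"] = sorted(set(entry["tags"]) | set(tags))          # collaborative tagging
    entry["synonyms"].update(synonyms)                              # e.g., cust_id -> customer_id
    entry["links"] = sorted(set(entry["links"]) | set(links))       # relationships users found
    CATALOG.write_text(json.dumps(catalog, indent=2))

# Two users refine the same entry at different times; the raw extract is never modified.
tag_dataset("raw/crm/accounts_2014.csv", ["customer"], {"cust_id": "customer_id"}, [])
tag_dataset("raw/crm/accounts_2014.csv", ["marketing"], {}, ["raw/web/clickstream_2014.json"])
```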
How a data lake matures
With the data lake, users can take what is
relevant and leave the rest. Individual business
domains can mature independently and
gradually. Perfect data classification is not
required. Users throughout the enterprise
can see across all disciplines, not limited by
organizational silos or rigid schema.
Data lake maturity

The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end-user feedback. The increasing value of analytics is then directly correlated to increases in user adoption across the enterprise.
Sourcing new data into the lake can occur
gradually and will not impact existing models.
The lake starts with raw data, and it matures as
more data flows in, as users and machines build
up metadata, and as user adoption broadens.
Ambiguous and competing terms eventually
converge into a shared understanding (that
is, semantics) within and across business
domains. Data maturity results as a natural
outgrowth of the ongoing user interaction
and feedback at the metadata management
layer—interaction that continually refines
the lake and enhances discovery. (See the
sidebar “Maturity and governance.”)
Data lake maturity progresses through five stages as the value of analytics and usage across the enterprise increase:
1. Consolidated and categorized raw data
2. Attribute-level metadata tagging and linking (i.e., joins)
3. Data set extraction and analysis
4. Business-specific tagging, synonym identification, and links
5. Convergence of meaning within context
Maturity and governance
Many who hear the term data lake might
associate the concept with a big data
sandbox, but the range of potential use cases
for data lakes is much broader. Enterprises
envision lake-style repositories as staging
areas, as alternatives to data warehouses, or
even as operational data hubs, assuming the
appropriate technologies and use cases.
A key enabler is Hadoop and many of the
big data analytics technologies associated
with it. What began as a means of ad hoc
batch analytics in Hadoop and MapReduce
is evolving rapidly with the help of YARN
and Storm to offer more general-purpose
distributed analytics and real-time
capabilities. At least one retailer has been
running a Hadoop cluster of more than
2,000 nodes to support eight customer
behavior analysis applications.*
Despite these advances, enterprises
will remain concerned about the risks
surrounding data lake deployments,
especially at this still-early stage of
development. How can enterprises
effectively mitigate the risk and manage
a Hadoop-based lake for broad-ranging
exploration? Lakes can provide unique
benefits over traditional data management
methods at a substantially lower cost, but
they require many practical considerations
and a thoughtful approach to governance,
particularly in more heavily regulated
industries. Areas to consider include:
• Complexity of legacy data: Many legacy
systems contain a hodgepodge of software
patches, workarounds, and poor design.
As a result, the raw data may provide
limited value outside its legacy context.
The data lake performs optimally when
supplied with unadulterated data from
source systems, and rich metadata built
on top.
• Metadata management: Data lakes
require advanced metadata management
methods, including machine-assisted
scans, characterizations of the data
files, and lineage tracking for each
transformation. Should schema on read
be the rule and predefined schema the
exception? It depends on the sources. The
former is ideal for working with rapidly
changing data structures, while the latter
is best for sub-second query response on
highly structured data.
• Lake maturity: Data scientists will take
the lead in the use and maturation of
the data lake. Organizations will need
to place the needs of others who will
benefit within the context of existing
organizational processes, systems,
and controls.
• Staging area or buffer zone: The lake
can serve as a cost-effective place to land,
stage, and conduct preliminary analysis
of data that may have been prohibitively
expensive to analyze in data warehouses
or other systems.
To adopt a data lake approach, enterprises
should take a full step toward multipurpose
(rather than single purpose) commodity
cluster computing for enterprise-wide
analysis of less-structured data. To take that
full step, they first must acknowledge that a
data lake is a separate discipline of endeavor
that requires separate treatment. Enterprises
that set up data lakes must simultaneously
make a long-term commitment to hone the
techniques that provide this new analytic
potential. Half measures won’t suffice.
* Timothy Prickett Morgan, “Cluster Sizes Reveal Hadoop Maturity
Curve,” Enterprise Tech: Systems Edition, November 8, 2013,
http://www.enterprisetech.com/2013/11/08/cluster-sizesreveal-hadoop-maturity-curve/, accessed March 20, 2014.
Technology Forecast: Rethinking integration
Issue 1, 2014
Making Hadoop suitable
for enterprise data science
Creating data lakes enables enterprises to expand
discovery and predictive analytics.
Interview conducted by Alan Morrison, Bo Parker, and Brian Stein
PwC: You’re in touch with a number
our data. I want all of you to make copies.
of customers who are in the process of OK, your systems are busy. Find the time,
setting up Hadoop data lakes. Why are get an extract, and dump it in Hadoop.”
they doing this?
Mike Lang
Mike Lang is CEO of Revelytix.
ML: There has been resistance on the part of
business owners to share data, and a big part
of the justification for not sharing data has
been the cost of making that data available.
The data owners complain they must write
in some special way to get the data extracted,
the system doesn’t have time to process
queries for building extracts, and so forth.
But a lot of the resistance has been political.
Owning data has power associated with it.
Hadoop is changing that, because C-level executives are saying, "It's no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop."
But they haven’t integrated anything. They’re
just getting an extract. The benefit is that
to add value to the integration process,
business owners don’t have nearly the same
hill to climb that they had in the past. C-level
executives are not asking the business owner
to add value. They’re just saying, “Dump it,”
and I think that’s under way right now.
With a Hadoop-based data lake, the enterprise
has provided a capability to store vast
amounts of data, and the user doesn’t need
to worry about restructuring the data to
begin. The data owners just need to do the
dump, and they can go on their merry way.
“If I want to add a terabyte node to my current
analytics infrastructure, the cost could be
$250,000. But if I want to add a terabyte
node to my Hadoop data lake, the cost is
more like $25,000.”
PwC: So one major obstacle was
just the ability to share data
cost-effectively?
ML: Yes, and that was a huge obstacle.
Huge. It is difficult to overstate how big that
obstacle has been to nimble analytics and data
integration projects during my career. For the
longest time, there was no such thing as nimble
when talking about data integration projects.
Once that data is in Hadoop, nimble is
the order of the day. All of a sudden, the
ETL [extract, transform, load] process
is totally turned on its head—from
contemplating the integration of eight data
sets, for example, to figuring out which of
a company’s policyholders should receive
which kinds of offers at what price in which
geographic regions. Before Hadoop, that
might have been a two-year project.
PwC: What are the main use cases for
Hadoop data lakes?
ML: There are two main use cases for the
data lake. One is as a staging area to support
some specific application. A company might
want to analyze three streams of data to
reduce customer churn by 10 percent.
They plan to build an app to do that using
three known streams of data, and the
data lake is just part of that workflow of
receiving, processing, and then dumping
data off to generate the churn analytics.
The last time we talked [in 2013], that was
the main use case of the data lake. The
second use case is supporting data science
groups all around the enterprise. Now, that’s
probably 70 percent of the companies we’ve
worked with.
PwC: Why use Hadoop?
ML: Data lakes are driven by three factors.
The first one is cost. Everybody we talk to
really believes data lakes will cost much
less than current alternatives. The cost of
data processing and data storage could be
90 percent lower. If I want to add a terabyte
node to my current analytics infrastructure,
the cost could be $250,000. But if I want
to add a terabyte node to my Hadoop data
lake, the cost is more like $25,000.
The second factor is flexibility. The flexibility
comes from the late-binding principle. When
I have all this data in the lake and want
to analyze it, I’ll basically build whatever
schema I want on the fly and I’ll conduct
my analysis the way data scientists do.
Hadoop lends itself to late binding.
The third factor relates to scale. Hadoop
data lakes will have a lot more scale than the
data warehouse, because they’re designed
to scale and process any type of data.
PwC: What’s the first step in creating
such a data lake?
ML: We’re working with a number of big
companies that are implementing some
version of the data lake. The first step is to
create a place that stores any data that the
business units want to dump in it. Once
that’s done, the business units make that
place available to their stakeholders.
The first step is not as easy as it sounds.
The companies we’ve been in touch with
spend an awful lot of time building security
apparatuses. They also spend a fair amount
of time performing quality checks on the
data as it comes in, so at least they can
say something about the quality of the
data that’s available in the cluster.
“Data scientists don’t like the ETL paradigm used
by business analysts. Data scientists have no idea
at the beginning of their job what the schema
should be, and so they go through this process of
looking at the data that’s available to them.”
But after they have that framework in place,
they just make the data available for data
science. They don’t know what it’s going to be
used for, but they do know it’s going to be used.
PwC: So then there’s the data
preparation process, which is
where the metadata reuse potential
comes in. How does the dynamic ELT
[extract, load, transform] approach
to preparing the data in the data
science use case compare with the
static ETL [extract, transform, load]
approach traditionally used by
business analysts?
ML: In the data lake, the files land in Hadoop
in whatever form they’re in. They’re extracted
from some system and literally dumped into
Hadoop, and that is one of the great attractions
of the data lake—data professionals don’t need
to do any expensive ETL work beforehand.
They can just dump the data in there, and
it’s available to be processed in a relatively
inexpensive storage and processing framework.
The challenge, then, is when data scientists
need to use the data. How do they get it into
the shape that’s required for their R frame or
their Python code for their advanced analytics?
The answer is that the process is very iterative.
This iterative process is the distinguishing
difference between business analysts and data
warehousing and data scientists and Hadoop.
Traditional ETL is not iterative at all. It takes a
long time to transform the different data into
one schema, and then the business analysts
perform their analysis using that schema.
Data scientists don’t like the ETL paradigm
used by business analysts. Data scientists
have no idea at the beginning of their
job what the schema should be, and so
they go through this process of looking
at the data that’s available to them.
Let’s say a telecom company has set-top box
data and finance systems that contain customer
information. Let’s say the data scientists for
the company have four different types of
data. They’ll start looking into each file and
determine whether the data is unstructured
or structured this way or that way. They need
to extract some pieces of it. They don’t want
the whole file. They want some pieces of each
file, and they want to get those pieces into a
shape so they can pull them into an R server.
So they look into Hadoop and find the file.
Maybe they use Apache Hive to transform
selected pieces of that file into some
structured format. Then they pull that out
into R and use some R code to start splitting
columns and performing other kinds of
operations. The process takes a long time,
but that is the paradigm they use. These
data scientists actually bind their schema at
the very last step of running the analytics.
Let’s say that in one of these Hadoop files
from the set-top box, there are 30 tables. They
might choose one table and spend quite a bit
of time understanding and cleaning up that
table and getting the data into a shape that
can be used in their tool. They might do that
across three different files in HDFS [Hadoop
Distributed File System]. But, they clean it as
they’re developing their model, they shape it,
and at the very end both the model and the
schema come together to produce the analytics.
PwC: How can the schema become
dynamic and enable greater reuse?
ML: That’s why you need lineage. As data
scientists assemble their intermediate data
sets, if they look at a lineage graph in our Loom
product, they might see 20 or 30 different
sets of data that have been created. Of course
some of those sets will be useful to other data
scientists. Dozens of hours of work have been
invested there. The problem is how to find
those intermediate data sets. In Hadoop, they
are actually realized persisted data sets.
So, how do you find them and know what
their structure is so you can use them? You
need to know that this data set originally
contained data from this stream or that stream,
this application and that application. If you
don’t know that, then the data set is useless.
At this point, we’re able to preserve the input
sets—the person who did it, when they
did it, and the actual transformation code
that produced this output set. It is pretty
straightforward for users to go backward or
forward to find the data set, and then find
something downstream or upstream that
they might be able to use by combining it, for
example, with two other files. Right now, we
provide the bare-bones capability for them
to do that kind of navigation. From my point
of view, that capability is still in its infancy.
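The lineage record Lang describes (the input sets, the person, the time, and the transformation code that produced an output set) can be pictured with a small sketch. This is not Loom's format or API; the file layout and field names are invented for illustration.

```python
import getpass
import json
from datetime import datetime, timezone
from pathlib import Path

def write_with_lineage(records: list[dict], out_path: str,
                       inputs: list[str], transform_code: str) -> None:
    """Persist an intermediate data set along with a bare-bones lineage record."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(json.dumps(r) for r in records))
    lineage = {
        "output": str(out),
        "inputs": inputs,                             # upstream data sets used
        "author": getpass.getuser(),                  # who produced it
        "created_at": datetime.now(timezone.utc).isoformat(),
        "transform": transform_code,                  # the code that produced the output
    }
    out.with_name(out.stem + ".lineage.json").write_text(json.dumps(lineage, indent=2))

write_with_lineage(
    [{"household_id": "h1", "events": 42}],
    "derived/churn_features.jsonl",
    inputs=["raw/settop_box/2014-01.json"],
    transform_code="SELECT household_id, COUNT(*) FROM viewing_events GROUP BY household_id",
)
```

With records like this alongside each derived set, another data scientist can walk upstream or downstream from any file and decide whether dozens of hours of someone else's preparation work can be reused.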
PwC: And there’s also more freedom
and flexibility on the querying side?
ML: Predictive analytics and statistical
analysis are easier with a large-scale data
lake. That’s another sea change that’s
happening with the advent of big data.
Everyone we talk to says SQL worked great.
They look at the past through SQL. They
know their current financial state, but they
really need to know the characteristics of the
customer in a particular zip code that they
should target with a particular product.
When you can run statistical models on
enormous data sets, you get better predictive
capability. The bigger the set, the better
your predictions. Predictive modeling and
analytics are not being done timidly in Hadoop.
That’s one of the main uses of Hadoop.
This sort of analysis wasn’t performed
10 years ago, and it’s only just become
mainstream practice. A colleague told me a
story about a credit card company. He lives
in Maryland, and he went to New York on a
trip. He used his card one time in New York
and then he went to buy gas, and the card
was cut off. His card didn’t work at the gas
station. He called the credit card company
and asked, “Why did you cut off my card?”
And they said, “We thought it was a case
of fraud. You never have made a charge in
New York and all of a sudden you made two
charges in New York.” They asked, “Are you
at the gas station right now?” He said yes.
It’s remarkable what the credit card
company did. It ticked him off that they
could figure out that much about him,
but the credit card company potentially
saved itself tens of thousands of dollars
in charges it would have had to eat.
This new generation of processing platforms
focuses on analytics. That problem right there
is an analytical problem, and it’s predictive
in its nature. The tools to help with that
are just now emerging. They will get much
better about helping data scientists and other
users. Metadata management capabilities in
these highly distributed big data platforms
will become crucial—not nice-to-have
capabilities, but I-can’t-do-my-work-withoutthem capabilities. There’s a sea of data.
Technology Forecast: Rethinking integration
Issue 1, 2014
A step toward the data lake in healthcare: Late-bound data warehouses
Dale Sanders of Health Catalyst describes how healthcare
providers are addressing their need for better analytics.
Interview conducted by Alan Morrison, Bo Parker, and Brian Stein
PwC: How are healthcare enterprises
scaling and maturing their
analytics efforts at this point?
Dale Sanders
Dale Sanders is senior vice
president of Health Catalyst.
DS: It’s chaotic right now. High-tech
funding facilitated the adoption of EMRs
[electronic medical records] and billing
systems as data collection systems. And
HIEs [health information exchanges]
encouraged more data sharing. Now there’s
a realization that analytics is critical. Other
industries experienced the same pattern, but
healthcare is going through it just now.
The bad news for healthcare is that the
market is so overwhelmed from the adoption
of EMRs and HIEs. And now the changes
from ICD-9 [International Classification of
Diseases, Ninth Revision] are coming, as
well as the changes to the HIPAA [Health
Insurance Portability and Accountability
Act] regulation. Meaningful use is still a
challenge. Accountable care is a challenge.
There’s so much turmoil in the market, and it’s
hard to admit that you need to buy yet another
IT system. But it’s hard to deny that, as well.
Lots of vendors claim they can do analytics.
Trying to find the way through that maze
and that decision making is challenging.
PwC: How did you get started
in this area to begin with, and
what has your approach been?
DS: Well, to go way back in history, when I
was in the Air Force, I conceived the idea for
late binding in data warehouses after I’d seen
some different failures of data warehouses
using relational database systems.
“We have an analytics adoption model that we
use to frame the progression of analytics in an
organization. Most of the [healthcare] industry
operates at level zero.”
If you look at the early history of data
warehousing in the government and
military—it was all on mainframes. And
those mainframe data warehouses look a
lot like Hadoop today. Hadoop is emerging
with better tools, but conceptually the
two types of systems are very similar.
When relational databases became popular,
we all rushed to those as a solution for
data warehousing. We went from the flat
files associated with mainframes to Unix-based data warehouses that used relational
database systems. And we thought it was a
good idea. But one of the first big mistakes
everyone made was to develop these enterprise
data models using a relational form.
I watched several failures happen as a
consequence of that type of early binding
to those enterprise models. I made some
adjustments to my strategy in the Air Force,
and I made some further adjustments
when I worked for companies in the
private sector and further refined it.
I came into healthcare with that. I started at
Intermountain Healthcare, which was an early
adopter of informatics. The organization had
a struggling data warehouse project because
it was built around this tightly coupled, early-binding relational model. We put a team
together, scrubbed that model, and applied late
binding. And, knock on wood, it’s been doing
very well. It’s now 15 years in its evolution,
and Intermountain still loves it. The origins
of Health Catalyst come from that history.
PwC: How mature are the
analytics systems at a typical
customer of yours these days?
DS: We generally get two types of customers.
One is the customer with a fairly advanced
analytics vision and aspirations. They
understand the whole notion of population
health management and capitated
reimbursement and things like that. So
they’re naturally attracted to us. The dialogue
with those folks tends to move quickly.
Then there are folks who don’t have
that depth of background, but they still
understand that they need analytics.
We have an analytics adoption model that we
use to frame the progression of analytics in
an organization. We also use it to help drive a
lot of our product development. It’s an eightlevel maturity model. Intermountain operates
pretty consistently at levels six and seven.
But most of the industry operates at level
zero—trying to figure out how to get to levels
one and two. When we polled participants
in our webinars about where they think they
reside in that model, about 70 percent of
the respondents said level two and below.
So we’ve needed to adjust our message and not
talk about levels five, six, and seven with some
of these clients. Instead, we talk about how to
get basic reporting, such as internal dashboards
and KPIs [key performance indicators], or how
to meet the external reporting requirements
for joint commission and accountable care
organizations [ACOs] and that kind of thing.
“The vast majority of data in healthcare is still
bound in some form of a relational structure, or
we pull it into a relational form.”
If they have a technical background,
some organizations are attracted to this
notion of late binding. And we can relate
at that level. If they’re familiar with
Intermountain, they’re immediately attracted
to that track record and that heritage.
There are a lot of different reactions.
PwC: With customers who are just
getting started, you seem to focus on
already well-structured data. You’re
not opening up the repository to
data that’s less structured as well.
DS: The vast majority of data in healthcare
is still bound in some form of a relational
structure, or we pull it into a relational
form. Late binding puts us between the
worlds of traditional relational data
warehouses and Hadoop—between a very
structured representation of data and a
very unstructured representation of data.
But late binding lets us pull in unstructured
content. We can pull in clinical notes
and free text and that sort of thing.
Health Catalyst is developing some
products to take advantage of that.
But if you look at the analytic use cases and
the analytic maturity of the industry right
now, there’s not a lot of need to bother
with unstructured data. That’s reserved
for a few of the leading innovators. The
vast majority of the market doesn’t need
unstructured content at the moment. In
fact, we really don’t even have that much
unstructured content that’s very useful.
PwC: What’s the pain point that the
late-binding approach addresses?
DS: This is where we borrow from Hadoop
and also from the old mainframe days.
When we pull a data source into the
late-binding data warehouse, we land
that data in a form that looks and feels
much like the original source system.
Then we make a few minor modifications to
the data. If you’re familiar with data modeling,
we flatten it a little bit. We denormalize it a
little bit. But for the most part, that data looks
like the data that was contained in the source
system, which is a characteristic of a Hadoop
data lake—very little transformation to data.
So we retain the binding and the fidelity
of the data as it appeared in the source
system. If you contrast that approach with
the other vendors in healthcare, they remap
that data from the source system into an
enterprise data model first. But when you
map that data from the source system into
a new relational data model, you inherently
make compromises about the way the data is
modeled, represented, named, and related.
You lose a lot of fidelity when you do that.
You lose familiarity with the data. And it’s
a time-consuming process. It’s not unusual
for that early binding, monolithic data
model approach to take 18 to 24 months
to deploy a basic data warehouse.
In contrast, we can deploy content and start
exposing it to analytics within a matter of days
and weeks. We can do it in days, depending
on how aggressive we want to be. There’s
no binding early on. There are six different
places where you can bind data to vocabulary
or relationships as it flows from the source
system out to the analytic visualization layer.
Before we bind data to new vocabulary, a
new business rule, or any analytic logic,
we ask ourselves what use case we’re
“We are building an enterprise data model one
object at a time.”
trying to satisfy. We ask on a use case
basis, rather than assuming a use case,
because that assumption could lead to
problems. We can build just about whatever
we want to, whenever we want to.
PwC: In essence, you’re moving
toward an enterprise data model.
But you’re doing it over time, a
model that’s driven by use cases.
DS: Are we actually building an enterprise
data model one object at a time? That’s the
net effect. Let’s say we land half a dozen
different source systems in the enterprise
data warehouse. One of the first things
we do is provide a foreign key across
those sources of data that allows you to
query across those sources as if they were
an enterprise data model. And typically
the first foreign key that we add to those
sources—using a common name and a
common data type—is patient identifier.
That’s the most fundamental. Then you add
vocabularies such as CPT [Current Procedural
Terminology] and ICD-9 as that need arises.
When you land the data, you have what
amounts to a virtual enterprise model
already. You haven’t remodeled the data
at all, but it looks and functions like an
enterprise model. Then we’ll spin targeted
analytics data marts off those source systems
to support specific analytic use cases.
For example, perhaps you want to drill
down on the variability, quality, and cost
of care in a clinical program for women
and newborns. We’ll spin off a registry of
those patients and the physicians treating
those patients into its own separate data
mart. And then we will associate every
little piece of data that we can find: costing
data, materials management data, human
resources data about the physicians and nurses,
patient satisfaction data, outcomes data, and
eventually social data. We’ll pull that data into
the data mart that’s specific to that analytic
use case to support women and newborns.
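The "virtual enterprise model" Sanders describes, where source extracts are landed as is and then queried across a shared patient identifier, can be pictured with a small sketch. The table names, columns, and values are hypothetical, and SQLite here simply stands in for the warehouse.

```python
import sqlite3

# In-memory stand-in for two source extracts landed in the warehouse as-is
# (table and column names are illustrative, not Health Catalyst's).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE emr_encounters (patient_id TEXT, encounter_date TEXT, diagnosis TEXT);
    CREATE TABLE billing_claims (patient_id TEXT, claim_date TEXT, amount REAL);
    INSERT INTO emr_encounters VALUES ('P001', '2014-01-05', 'CHF');
    INSERT INTO billing_claims VALUES ('P001', '2014-01-07', 1250.0);
""")

# The shared patient_id acts as the late-added foreign key: the sources keep
# their original shape, yet can be queried together like an enterprise model.
rows = db.execute("""
    SELECT e.patient_id, e.diagnosis, b.amount
    FROM emr_encounters AS e
    JOIN billing_claims AS b ON b.patient_id = e.patient_id
""").fetchall()
print(rows)   # [('P001', 'CHF', 1250.0)]
```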
PwC: So you might need to perform
some transform rationalization,
because systems might not call
the same thing by the same name.
Is that part of the late-binding
vocabulary rationalization?
DS: Yes, in each of those data marts.
PwC: Do you then use some sort
of provenance record—a way of
rationalizing the fact that we
call these 14 things different
things—that becomes reusable?
DS: Oh, yes, that’s the heart of it. We reuse
all of that from organization to organization.
There’s always some modification. And there’s
always some difference of opinion about
how to define a patient cohort or a disease
state. But first we offer something off the
shelf, so you don’t need to re-create them.
PwC: What if somebody wanted to
perform analytics across the data
marts or across different business
domains? In this framework,
would the best strategy be to
somehow consolidate the data
marts, or instead go straight to
the underlying data warehouse?
DS: You can do either one. Let’s take a
comorbidity situation, for example, where
a patient has three or four different disease
states. Let’s say you want to look at that
patient’s continuum of care across all of those.
“A drawback of traditional ways of deploying
data warehouses is that they presuppose various
bindings and rules. They don’t allow for data
exploration and local fingerprinting.”
Over the top of those data marts is still this
common late-binding vocabulary that allows
you to query the patient as that patient
appears in each of those different subject
areas, whatever disease state it is. It ends up
looking like a virtual enterprise model for
that patient’s record. After we’ve formally
defined a patient cohort and the key metrics
that the organization wants to understand
about that patient cohort, we want to lock
that down and tightly bind it at that point.
First you get people to agree. You get
physicians and administrators to agree how
they want to identify a patient cohort. You
get agreement on the metrics they want
to understand about clinical effectiveness.
After you get comprehensive agreement,
then you look for it to stick for a while.
When it sticks for a period of time, then you
can tightly bind that data together and feel
comfortable about doing so—so you don’t
need to rip it apart and rebind it again.
PwC: When you speak about coming
toward an agreement among
the various constituencies, is it
a process that takes place more
informally outside the system,
where everybody is just going to
come up with the model? Or is
there some way to investigate the
data first? Or by using tagging or
some collaborative online utility,
is there an opportunity to arrive at
consensus through an interface?
DS: We have ready-to-use definitions around
all these metrics—patient registries and
things like that. But we also recognize that
the state of the industry being what it is,
there’s still a lot of fingerprinting and opinions
about those definitions. So even though
an enterprise might reference the National
Quality Forum, the Agency for Healthcare
Research and Quality, and the British Medical
Journal as the sources for the definitions, local
organizations always want to put their own
fingerprint on these rules for data binding.
We have a suite of tools to facilitate that
exploration process. You can look at your
own definitions, and you can ask, “How do
we really want to define a diabetic patient?
How do we define congestive heart failure
and myocardial infarction patients?”
We’ll let folks play around with the data,
visualize it, and explore it in definitions. When
we see them coming toward a comprehensive
and persistent agreement, then we’ll suggest,
“If you agree to that definition, let’s bind it
together behind that visualization layer.”
That’s exactly what happens. And you must
allow that to happen. You must let that
exploration and fingerprinting happen.
A drawback of traditional ways of deploying
data warehouses is that they presuppose all
of those bindings and rules. They don’t allow
that exploration and local fingerprinting.
PwC: So how do companies get
started with this approach? Assuming
they have existing data warehouses,
are you using those warehouses in a
new way? Are you starting up from
scratch? Do you leave those data
warehouses in place when you’re
implementing the late-bound idea?
DS: Some organizations have an
existing data warehouse. And a lot of
organizations don’t. The greenfield
organizations are the easiest to deal with.
The strategy is pretty complicated to decouple
all of the analytic logic that’s been built around
those existing data warehouses and then
import that to the future. Like most transitions
of this kind, it often happens through attrition.
First you build the new enterprise data
warehouse around those late-binding concepts.
And then you start populating it with data.
The one thing you don’t want to do is build
your new data warehouse under a dependency
to those existing data warehouses. You want to
go around those data warehouses and pull your
data straight from source systems in the new
architecture. It’s a really bad strategy to build
a data warehouse on top of data warehouses.
PwC: Some of the people we’ve
interviewed about Hadoop assert
that using Hadoop versus a data
warehouse can result in a cost
benefit that’s at least an order of
magnitude cheaper. They claim,
for example, that storing data
costs $250,000 per terabyte in
a traditional warehouse versus
$25,000 per terabyte for Hadoop. If
you’re talking with the C-suite about
an exploratory analytics strategy,
what’s the advantage of staying
with a warehousing approach?
DS: In healthcare, the compelling use case for
Hadoop right now is the license fee. Contrast
that case with what compels Silicon Valley
web companies and everybody else to go to
Hadoop. Their compelling reason wasn’t so
much about money. It was about scalability.
If you consider the nature of the data that
they’re pulling into Hadoop, there’s no such
thing as a data model for the web. All the data
that they’re streaming into Hadoop comes
tagged with its own data model. They don’t
need a relational database engine. There’s
no value to them in that setting at all.
For CIOs, the fact that Hadoop is inexpensive
open source is very attractive. The downside,
however, is the lack of skills. The skills
and the tools and the ways to really take
advantage of Hadoop are still a few years off in
healthcare. Given the nature of the data that
we’re dealing with in healthcare right now,
there’s nothing particularly compelling about
Hadoop in healthcare right now. Probably
in the next year, we will start using Hadoop
as a preprocessor ETL [extract, transform,
load] platform that we can stream data into.
During the next three to four years, as the
skills and the tools evolve to take advantage
of Hadoop, I think you’ll see companies like
Health Catalyst being more aggressive about
the adoption of Hadoop in a data lake scenario.
But if you add just enough foreign keys and
dimensions of analytics across that data lake,
that approach greatly facilitates reliable
landing and loading. It’s really, really hard to
pull meaningful data out of those lakes without
something to get the relationship started.
Technology Forecast: Rethinking integration
Issue 1, 2014
Microservices: The resurgence
of SOA principles and an
alternative to the monolith
By Galen Gruman and
Alan Morrison
Big SOA was overkill. In its place, a more agile form of
services is taking hold.
Moving away from the monolith
Greater modularity,
loose coupling,
and reduced
dependencies all
hold promise in
simplifying the
integration task.
Companies such as Netflix, Gilt, PayPal, and
Condé Nast are known for their ability to
scale high-volume websites. Yet even they
have recently performed major surgery on
their systems. Their older, more monolithic
architectures would not allow them to add
new or change old functionality rapidly
enough. So they’re now adopting a more
modular and loosely coupled approach
based on microservices architecture (MSA).
Their goal is to eliminate dependencies and
enable quick testing and deployment of code
changes. Greater modularity, loose coupling,
and reduced dependencies all hold promise in
simplifying the integration task.
If MSA had a T-shirt, it would read: “Code
small. Code local.”
Early signs indicate this approach to code
management and deployment is helping
companies become more responsive to
shifting customer demands. Yet adopters
might encounter a challenge when adjusting
the traditional software development
mindset to the MSA way—a less elegant, less
comprehensive but more nimble approach.
PwC believes MSA is worth considering as
a complement to traditional methods when
speed and flexibility are paramount—typically
in web-facing and mobile apps.
Microservices also provide the services layer
in what PwC views as an emerging cloud-inspired enterprise integration fabric, which
companies are starting to adopt for greater
business model agility.
Why microservices?
In the software development community, it is
an article of faith that apps should be written
with standard application programming
interfaces (APIs), using common services when
possible, and managed through one or more
orchestration technologies. Often, there’s a
superstructure of middleware, integration
methods, and management tools. That’s great
for software designed to handle complex tasks
for long-term, core enterprise functions—it’s
how transaction systems and other systems of
record need to be designed.
But these methods hinder what Silicon Valley
companies call web-scale development:
software that must evolve quickly, whose
functionality is subject to change or
obsolescence in a couple of years—even
months—and where the level of effort must
fit a compressed and reactive schedule. It’s
more like web page design than developing
traditional enterprise software.
Dependencies from a developer's perspective
• 1990s and earlier, pre-SOA (monolithic), tight coupling: For a monolith to change, all must agree on each change. Each change has unanticipated effects requiring careful testing beforehand.
• 2000s, traditional SOA, looser coupling: Elements in SOA are developed more autonomously but must be coordinated with others to fit into the overall design.
• 2010s, microservices, decoupled: Developers can create and activate new microservices without prior coordination with others. Their adherence to MSA principles makes continuous delivery of new or modified services possible.
It is important to understand that MSA is still
evolving and unproven over the long term.
But like the now common agile methods,
Node.js coding framework, and NoSQL data
management approaches before it, MSA is an
experiment many hope will prove to be a strong
arrow in software development quivers.
MSA: A think-small approach
for rapid development
MSA proponents tend to code in web-oriented
languages such as Node.js that favor small
components with direct interfaces, and in
functional languages like Scala or the Clojure
Lisp library that favor “immutable” approaches
to data and functions, says Richard Rodger,
a Node.js expert and CEO of nearForm, a
development consultancy.
This fine-grained approach lets you update,
add, replace, or remove services—in short, to
integrate code changes—from your application
easily, with minimal effect on anything else.
For example, you could change the zip-code
lookup to a UK postal-code lookup by changing
or adding a microservice. Or you could change
the communication protocol from HTTP to
AMQP, the emerging standard associated
with RabbitMQ. Or you could pull data from a
NoSQL database like MongoDB at one stage of
an application’s lifecycle and from a relational
product like MySQL at another. In each case,
you would change or add a service.
Simply put, MSA breaks an application into
very small components that perform discrete
functions, and no more. The definition of “very
small” is inexact, but think of functional calls
or low-level library modules, not applets or
complete services. For example, a microservice
could be an address-based or geolocation-based
zip-code lookup, not a full mapping module.
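To make the granularity concrete, here is a minimal sketch of such a zip-code lookup as a single-purpose service (the message shape, table contents, and file name are hypothetical, not drawn from any of the companies mentioned):

```typescript
// zip-lookup.ts: a microservice as a small, single-purpose message handler.
// Any "dumb" transport (HTTP, a message queue) can carry these messages.
type ZipQuery = { zip: string };
type ZipResult =
  | { zip: string; city: string; state: string }
  | { zip: string; error: string };

// Stand-in data source; a real service might call a postal-data API or database.
const zipTable: Record<string, { city: string; state: string }> = {
  "94105": { city: "San Francisco", state: "CA" },
  "10001": { city: "New York", state: "NY" },
};

export function handleZipLookup(msg: ZipQuery): ZipResult {
  const hit = zipTable[msg.zip];
  return hit ? { zip: msg.zip, ...hit } : { zip: msg.zip, error: "not found" };
}
```

Swapping in a UK postal-code lookup would mean adding another equally small handler and routing the relevant messages to it, not editing a monolith.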
MSA lets you move from quick-and-dirty to
quick-and-clean changes to applications or
their components that are able to function by
themselves. You would use other techniques—
conventional service-oriented architecture
(SOA), service brokers, and platform as
a service (PaaS)—to handle federated
application requirements. In other words, MSA
is one technique among many that you might
use in any application.
In MSA, you want simple parts with clean, messaging-style interfaces; the less elaborate the better. And you don't want elaborate middleware, service buses, or other orchestration brokers, but rather simpler messaging systems such as Apache Kafka.
Evolution of services orientation: pre-SOA (monolithic), tight coupling, in the 1990s and earlier; traditional SOA, looser coupling, in the 2000s; microservices, decoupled and existing in a "dumb" messaging environment, in the 2010s.
Some of the leading web properties use MSA
because it comes from a mindset similar to other
technologies and development approaches
popular in web-scale companies: agile software
development, DevOps, and the use of Node.js
and Not only SQL (NoSQL). These approaches
all strive for simplicity, tight scope, and the
ability to take action without calling an all-hands meeting or working through a tedious
change management process. Managing
code in the MSA context is often ad hoc and
something one developer or a small team can
handle without complex superstructure and
management. In practice, the actual code in
any specific module is quite small—a few dozen
lines, typically—is designed to address a narrow
function, and can be conceived and managed by
one person or a small group.
The fine-grained, stateless, self-contained nature of microservices creates decoupling between different parts of a code base and is what makes them easy to update, replace, remove, or augment. Rather than rewrite a
module for a new capability or version and
then coordinate the propagation of changes
the rewrite causes across a monolithic code
base, you add a microservice. Other services
that want this new functionality can choose to
direct their messages to this new service, but
the old service remains for parts of the code
you want to leave alone. That’s a significant
difference from the way traditional enterprise
software development works.
Thinking the MSA way:
Minimalism is a must
The MSA approach is the opposite of the
traditional “let’s scope out all the possibilities
and design in the framework, APIs, and data
structures to handle them all so the application
is complete.”
Think of MSA as almost-plug-and-play in-app integration of discrete services, both local and external.
Issue overview: Integration fabric
The microservices topic is the second of three topics as part of the integration fabric research
covered in this issue of the PwC Technology Forecast. The integration fabric is a central
component for PwC’s New IT Platform.*
Enterprises are starting to embrace more practical integration.** A range of these new
approaches is now emerging, and during the next few months we’ll ponder what the new
cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore
include these:
Integration fabric layers, the integration challenges they address, and the emerging technology solutions:

• Data. Challenges: data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types. Emerging solutions: Hadoop data lakes, late binding, and metadata provenance tools. Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what's necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.

• Applications and services. Challenge: rigid, monolithic systems that are difficult to update in response to business needs. Emerging solution: microservices. Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.

• Infrastructure. Challenge: multiple clouds and operating systems that lack standardization. Emerging solution: software containers for resource isolation and abstraction. New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.
* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.
**Integration as PwC defines it means making diverse components work together so they work as a single entity.
See “integrated system” at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.
Traditional SOA versus microservices (in each pair, the first term describes traditional SOA and the second describes microservices):
• Messaging infrastructure: smart, but dependency-laden ESB versus dumb, fast messaging (as with Apache Kafka)
• Programming style: imperative model versus a reactive actor programming model that echoes agent-based systems
• Lines of code per service: hundreds or thousands of lines of code versus 100 or fewer lines of code
• State: stateful versus stateless
• Messaging style: synchronous (wait to connect) versus asynchronous (publish and subscribe)
• Databases: large relational databases versus NoSQL or micro-SQL databases blended with conventional databases
• Code type: procedural versus functional
• Means of evolution: each big service evolves versus each small service is immutable and can be abandoned or ignored
• Means of systemic change: modify the monolith versus create a new service
• Means of scaling: optimize the monolith versus add more powerful services and cluster by activity
• System-level awareness: less aware and event driven versus more aware and event driven
These services are expected to
change, and some eventually will become
disposable. When services have a small focus,
they become simple to develop, understand,
manage, and integrate. They do only what’s
necessary, and they can be removed or ignored
when no longer needed.
There’s an important benefit to this minimalist
approach, says Gregg Caines, a freelance web
developer and co-author of programming
books: “When a package doesn’t do more than
is absolutely necessary, it’s easy to understand
and to integrate into other applications.” In
many ways, MSA is a return to some of the
original SOA principles of independence and
composition—without the complexity and
superstructure that become common when
SOA is used to implement enterprise software.
The use of multiple, specific services with
short lifetimes might sound sloppy, but
remember that MSA is for applications, or
their components, that are likely to change
frequently. It makes no sense to design and
develop software over an 18-month process
to accommodate all possible use cases when
those use cases can change unexpectedly and
the life span of code modules might be less
than 18 months.
The pace at which new code creation and
changes happen in mobile applications
and websites simply doesn’t support the
traditional application development model.
In such cases, the code is likely to change
due to rapidly evolving social media services,
or because it runs in iOS, Android, or some
other environment where new capabilities
are available annually, or because it needs to
search a frequently updated product inventory.
For such mutable activities, you want to
avoid—not build in—legacy management
requirements. You live with what nearForm’s
Rodger considers a form of technical
debt, because it is an easier price to pay
for functional flexibility than a full-blown
architecture that tries to anticipate all needs.
It’s the difference between a two-week update
and a two-year project.
This mentality is different from that required
in traditional enterprise software, which
assumes complex, multivariate systems
are being integrated, requiring many-to-many interactions that demand some sort
of intelligent interpretation and complex
framework. You invest a lot up front to create
a platform, framework, and architecture that
can handle a wide range of needs that might be
extensive but change only at the edges.
MSA assumes you’re building for the short
term; that the needs, opportunities, and
context will change; and that you will handle
them as they occur. That’s why a small
team of developers familiar with their own
microservices are the services’ primary users.
And the clean, easily understood nature lets
developers even more quickly add, remove,
update, and replace their services and better
ensure interoperation with other services.
In MSA, governance, data architecture, and
the microservices are decentralized, which
minimizes the dependencies. As a result of this
independence, you can use the right language
for the microservice in question, as well as
the right database or other related service,
rather than use a single language or back-end
service to accomplish all your application’s
needs, says David Morgantini, a developer at
ThoughtWorks.
Where MSA makes sense
MSA is most appropriate for applications whose
functions may need to change frequently;
that may need to run on multiple, changing
platforms whose local services and capabilities
differ; or whose life spans are not long enough
to warrant a heavily architected framework.
MSA is great for disposable services.
Mobile apps and web apps are natural venues
for MSA. But whatever platform the application
runs on, some key attributes favor MSA:
• Fast is more important than elegant.
• Change occurs at different rates within
the application, so functional isolation
and simple integration are more
important than module cohesiveness.
• Functionality is easily separated into simple, isolatable components.
• Change in the application's functionality and usage is frequent.
For example, an app that draws data
from social networks might use separate
microservices for each network’s data
extraction and data normalization. As social
networks wax and wane in popularity, they
can be added to the app without changing
anything else. And as APIs evolve, the app
can support several versions concurrently
but independently.
Microservices can make media distribution
platforms, for example, easier to update and
faster than before, says Adrian Cockcroft,
a technology fellow at Battery Ventures, a
venture capital firm. The key is to separate
concerns along these dimensions:
• Each single-function microservice
has one action.
• A small set of data and UI
elements is involved.
• One developer, or a small team,
independently produces a microservice.
• Each microservice is its own build,
to avoid trunk conflict.
• The business logic is stateless.
• The data access layer is statefully cached.
• New functions are added swiftly,
but old ones are retired slowly.1
These dimensions create the independence
needed for the microservices to achieve the
goals of fast development and easy integration
of discrete services limited in scope.
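Two of those dimensions, stateless business logic sitting on a statefully cached data access layer, can be sketched in a few lines (hypothetical names; an in-memory map stands in for a real cache such as Redis or memcached):

```typescript
// profile-service.ts: stateless logic over a statefully cached data access layer.
type Profile = { id: string; displayName: string };

// Data access layer: keeps a cache so repeated reads avoid the backing store.
const cache = new Map<string, Profile>();

async function fetchProfileFromStore(id: string): Promise<Profile> {
  // Placeholder for a real database or downstream service call.
  return { id, displayName: `user-${id}` };
}

async function getProfile(id: string): Promise<Profile> {
  const cached = cache.get(id);
  if (cached) return cached;
  const fresh = await fetchProfileFromStore(id);
  cache.set(id, fresh); // the state lives here, in the data access layer
  return fresh;
}

// Business logic: pure and stateless, so any instance can handle any request.
export async function greet(id: string): Promise<string> {
  const profile = await getProfile(id);
  return `Hello, ${profile.displayName}`;
}
```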
1 Adrian Cockcroft, “Migrating to Microservices,” (presentation, QCon London, March 6, 2014),
http://qconlondon.com/london-2014/qconlondon.com/london-2014/presentation/Migrating%20to%20Microservices.html.
MSA is not entirely without structure. There is
a discipline and framework for developing and
managing code the MSA way, says nearForm’s
Rodger. The more experienced a team is with
other methods—such as agile development
and DevOps—that rely on small, focused,
individually responsible approaches, the easier
it is to learn to use MSA. It does require a
certain groupthink. The danger of approaching
MSA without such a culture or operational
framework is the chaos of individual developers
acting without regard to each other.
In MSA, integration is the
problem, not the solution
It’s important to remember that by keeping
services specific, there’s little to integrate. You
typically deal with a handful of data, so rather
than work through a complex API, you directly
pull the specific data you want in a RESTful
way. You keep your own state, again to reduce
dependencies. You bind data and functions late
for the same reasons.
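A sketch of that kind of narrow, RESTful pull between two small services (the endpoint and field names are hypothetical; the global fetch call assumes a runtime that provides it, such as recent Node.js versions):

```typescript
// order-summary.ts: one microservice pulling only the data it needs from another.
// No shared library and no service bus; just a local HTTP call for two fields.
type CustomerName = { displayName: string };

export async function orderSummary(
  orderId: string,
  customerId: string
): Promise<string> {
  const res = await fetch(`http://customer-service/customers/${customerId}`);
  if (!res.ok) {
    // Degrade locally rather than coupling to the other service's availability.
    return `Order ${orderId} for customer ${customerId}`;
  }
  const customer = (await res.json()) as CustomerName;
  return `Order ${orderId} for ${customer.displayName}`;
}
```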
Many enterprise developers shake their heads
and ask how microservices can possibly
integrate with other microservices and with
other applications, data sets, and services. MSA
sounds like an integration nightmare, a morass
of individual connections causing a rat’s nest
that looks like spaghetti code.
Integration is a problem MSA tries to avoid by
reducing dependencies and keeping them local.
If you need complex integration, you shouldn’t
use MSA for that part of your software
development. Instead, use MSA where broad
integration is not a key need.
Ironically, integration is almost a byproduct
of MSA, because the functionality, data, and
interface aspects are so constrained in number
and role. (Rodger says Node.js developers will
understand this implicit integration, which is a
principle of the language.) In other words,
your integration connections are local, so
you’re building more of a chain than a web
of connections.
MSA is not a cure-all, nor is it meant to be the
only or even dominant approach for developing
applications. But it’s an emerging approach
that bucks the trend of elaborate, elegant,
complete frameworks where that doesn’t work
well. Sometimes, doing just what you need to
do is a better answer than figuring out all the
things you might need and constructing an
environment to handle it all. MSA serves the
“do just what you need to do” scenario.
When you have fine-grained components, you
do have more integration points. Wouldn’t
that make the development more difficult and
changes within the application more likely
to cause breakage? Not necessarily, but it is
a risk, says Morgantini. The key is to create
small teams focused on business-relevant
tasks and to conceive of the microservices
they create as neighbors living together in a
small neighborhood, so the relationships are
easily apparent and proximate. In this model,
an application can be viewed as a city of
neighborhoods assigned to specific business
functions, with each neighborhood composed of microservices. You might have "planned community" neighborhoods made from coarser-grained services or even monolithic modules that interact with more organic MSA-style neighborhoods.2
Conclusion
This approach has proven effective in contexts
already familiar with agile development,
DevOps, and loosely coupled, event-driven
technologies such as Node.js. MSA applies the
same mentality to the code itself, which may
be why early adopters are those who are using
the other techniques and technologies. They
already have an innate culture that makes it
easier to think and act in the MSA way.
Any enterprise looking to serve users and
partners via the web, mobile, and other fast-evolving venues should explore MSA.
2 David Morgantini, “Micro-services—Why shouldn’t you use micro-services?” Dare to dream (blog), August 27, 2013,
http://davidmorgantini.blogspot.com/2013/08/micro-services-why-shouldnt-you-use.html, accessed May 12, 2014.
Technology Forecast: Rethinking integration
Issue 1, 2014
Microservices in a
software industry context
John Pritchard offers some thoughts on the rebirth of SOA and an
API-first strategy from the vantage point of a software provider.
Interview conducted by Alan Morrison, Wunan Li, and Akshay Rao
PwC: What are some of the
challenges when moving to an
API-first business model?
John Pritchard
John Pritchard is director of
platform services at Adobe.
JP: I see APIs as a large oncoming wave
that will create a lot of benefit for a lot of
companies, especially companies in our
space that are trying to migrate to SaaS.1
With the API model, there’s a new economy
of sorts and lots of talk about how to monetize
the services. People discuss the models
by which those services could be made
available and how they could be sold.
At Adobe, we have moved from being a
licensed desktop product company to a
subscription-based SaaS company. We’re
in the process of disintegrating our desktop
products to services that can be reassembled and packaged in interesting ways by our own product teams or third-party developers.
There’s still immaturity in the very coarse
way that APIs tend to be exposed now. I
might want to lease the use of APIs to a
third-party developer, for instance, with a
1 Abbreviations are as follows:
• API: application programming interface
• SaaS: software as a service
For more information on APIs, see “The business value of APIs,” PwC Technology Forecast 2012, Issue 2,
http://www.pwc.com/us/en/technology-forecast/2012/issue2/index.jhtml.
usage-based pricing model. This model allows
the developer to white label the experience
to its customers without requiring a license.
Usage-based pricing triggers some thought
around how to instrument APIs and the
connection between API usage and commerce.
It leads to some interesting conversations about
identity and authentication, especially when
third-party developers might be integrating
multiple API sets from different companies
into a customer-exposed application.
PwC: Isn’t there substantial
complexity associated with the API
model once you get down to the
very granular services suggested
by a microservices architecture?
JP: At one level, the lack of standards
and tooling for APIs has resulted in quite
a bit of simplification. Absent standards,
we are required to use what I’ll call the
language of the Internet: HTTP, JSON,
and OAuth. That’s it. This approach has
led to beautiful, simple designs because
you can only do things a few ways.
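A minimal sketch of what that constraint looks like in practice (an illustrative endpoint only, not Adobe's API; the bearer check stands in for a real OAuth token validation):

```typescript
// assets-api.ts: HTTP in, JSON out, an OAuth-style bearer token as the credential.
import * as http from "http";

const server = http.createServer((req, res) => {
  const auth = req.headers.authorization ?? "";
  if (!auth.startsWith("Bearer ")) {
    res.writeHead(401, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "missing or invalid access token" }));
    return;
  }
  // A real deployment would validate the token against its issuer and meter
  // the call for usage-based pricing.
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ assets: [{ id: "a-123", kind: "image" }] }));
});

server.listen(8080);
```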
But at another level, techniques to wire together
capabilities with some type of orchestration
have been missing. This absence creates a big
risk in my mind of trying to do things in the API
space like the industry did with SOA and WS*.2
PwC: How are microservices related
to what you’re doing on the API front?
JP: We don’t use the term microservices;
I wouldn’t say you’d hear that term in
conversations with our design teams. But I’m
familiar with some of Martin Fowler’s writing
on the topic.3 If you think about how the term
is defined in industry and this idea of smaller
statements that are transactions, that concept
is very consistent with design principles
and the API-first strategy we adhere to.
What I’ve observed on my own team and
some of the other product teams we work
with is that the design philosophy we use
is less architecturally driven than it is team
dynamic driven. When you move to an end-to-end team or a DevOps4 type of construct, you
tend to want to define things that you can own
completely and that you can release so you
have some autonomy to serve a particular need.
We use APIs to integrate internally as well.
We want these available to our product
engineering community in the most
consumable way. How do we describe
these APIs so we clear the path for self-service as quickly as possible? Those
sorts of questions and answers have
led us to the design model we use.
2 Abbreviations are as follows:
• HTTP: hypertext transfer protocol
• JSON: JavaScript Object Notation
• SOA: service-oriented architecture
• WS*: web services
3 For example, see James Lewis and Martin Fowler, “Microservices,” March 25, 2014, http://martinfowler.com/articles/microservices.html,
accessed June 18, 2014.
4 DevOps is a working style designed to encourage closer collaboration between developers and operations people: DevOps=Dev+Ops. For
more information on DevOps, continuous delivery, and antifragile system development, see “DevOps: Solving the engineering productivity
challenge,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.
PwC: When you think about the
problems that a microservices
approach might help with,
what is top of mind for you?
JP: I’ve definitely experienced the rebirth
of SOA. In my mind, APIs are SOA realized.
We remember the ESB and WS* days and the
attempt to do real top-down governance. We
remember how difficult that was not only in
the enterprise, but also in the commercial
market, where it didn’t really happen at all.5
Developer-friendly consumability has helped
us bring APIs to market. Internally, that has
led to greater efficiencies. And it encourages
some healthy design practices by making
things small. Some of the connectivity becomes
less important than the consumability.
PwC: What’s the approach you’re
taking to a more continuous
form of delivery in general?
JP: For us, continuous delivery brings
to mind end-to-end teams or the DevOps
model. Culturally, we’re trying to treat
everything like code. I treat infrastructure
like code. I treat security like code.
Everything is assigned to sprints. APIs must
be instrumented for deployment, and then
we test around the APIs being deployed.
We've borrowed many of the Netflix constructs around monkeys.6 We use monkeys not only for infrastructure components but
also for scripted security attacks to validate
our operational run times. We’ve seen an
increased need for automation. With every
deployment we look for opportunities for
automation. But what’s been key for the
success in my team is this idea of treating all
these different aspects just like we treat code.
PwC: Would that include
infrastructure as well?
JP: Yes. My experience is that the line is
almost completely blurred about what’s
software and what’s infrastructure
now. It’s all software defined.
PwC: As systems become less
monolithic, how will that change
the marketplace for software?
JP: At the systems level, we’re definitely
seeing a trend away from centralized core
systems—like core ERP or core large platforms
that provide lots of capabilities—to a model
where a broad selection of SaaS vendors
provide very niche capabilities. Those SaaS
operators may change over time as new
ones come into the market. The service
provider model, abstracting SaaS provider
capabilities with APIs, gives us the flexibility
to evaluate newcomers that might be better
providers for each API we’ve defined.
5 Abbreviations are as follows:
• ESB: enterprise service bus
• ERP: enterprise resource planning
6 Chaos Monkey is an example. See “The evolution from lean and agile to antifragile,” PwC Technology Forecast 2013, Issue 2,
http://www.pwc.com/us/en/technology-forecast/2013/issue2/features/new-cloud-development-styles.jhtml for more on Chaos Monkey.
Technology Forecast: Rethinking integration
Issue 1, 2014
The critical elements
of microservices
Richard Rodger describes his view of the emerging microservices
landscape and its impact on enterprise development.
Interview conducted by Alan Morrison and Bo Parker
PwC: What’s the main advantage
of a microservices approach versus
object-oriented programming?
RR: Object-oriented programming failed
miserably. With microservices, it’s much harder
to shoot yourself in the foot. The traditional
anti-patterns and problems that happen in
object-oriented code—such as the big bowl of
mud where a single task has a huge amount
of responsibilities or goes all over the place—
are less likely in the microservices world.
Richard Rodger
Richard Rodger is the CTO
of nearForm, a software
development and training
consultancy specializing in
Node.js.
Consider the proliferation of patterns in the
object-oriented world. Any programming
paradigm that requires you to learn 50
different design patterns to get things right
and makes it so easy to get things wrong is
probably not the right way to be doing things.
That’s not to say that patterns aren’t good.
Pattern designs are good and they are
necessary. It’s just that in the microservices
world, there are far fewer patterns.
PwC: What is happening in
companies that are eager to try
the microservices approach?
RR: It’s interesting to think about why change
happens in the software industry. Sometimes
the organizational politics is a much more
important factor than the technology itself.
Our experience is that politics often drives the
adoption of microservices. We’re observing
aggressive, ambitious vice presidents who
have the authority to fund large software
projects. In light of how long most of these
projects usually take, the vice presidents
see an opportunity for career advancement
by executing much more rapidly.
A lot of our engagements are with forward-looking managers who essentially are
sponsoring the adoption of a microservices
approach. Once those initial projects have
been deemed successful because they were
delivered faster and more effectively, that
proves the point and creates its own force
for the broader adoption of microservices.
PwC: How does a typical
microservices project begin?
RR: In large projects that can take six months
or more, we develop the user story and then
define and map capabilities to microservices.
And then we map microservices onto messages.
We do that very, very quickly. Part of what
we do, and part of what microservices
enable us to do, is show a working live
demo of the system after week one.
If we kick off on a Monday, the following
Monday we show a live version of the
system. You might only be able to log in and
perhaps get to the main screen. But there’s
a running system that may be deployed
on whatever infrastructure is chosen.
Every Monday there’s a new live demo.
And that system stays running during the
lifetime of the project. Anybody can look at
the system, play with it, break it, or whatever
at any point in time. Those capabilities
are possible because we started to build
services very quickly within the first week.
With a traditional approach, even approaches
that are agile, you must make an awful
lot of decisions up front. And if you make
the wrong decisions, you back yourself
into a corner. For example, if you decide
to use a particular database technology
or commit to a certain structure of object
hierarchies, you must be very careful and
spend a lot of time analyzing. The use of
microservices reduces that cost significantly.
An analogy might help to explain how this
type of decision making happens. When UC
Irvine laid out its campus, the landscapers
initially put in grass and watched where
people walked. They later built paths
where the grass was worn down.
Microservices are like that. If you have
a particular data record and you build
a microservice to look back at that data
record, you don’t need to define all of the
fields up front. A practical example might
be if a system will capture transactions and
ultimately use a relational database. We
might use MongoDB for the first four weeks
of development because it’s schema free.
After four weeks of development, the schema
will be stabilized to a considerable extent.
On week five, we throw away MongoDB and
start using a relational product. We saved
ourselves from huge hassles in database
migrations by developing this way. The key
is using a microservice as the interface to the
database. That lets us throw away the initial
database and use a new one—a big win.
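A sketch of that pattern, with the microservice acting as the interface to the database (hypothetical names; the two classes stand in for the schema-free store used early on and the relational product adopted later):

```typescript
// transaction-store.ts: the rest of the system sees only this narrow interface,
// so the backing database can be swapped without touching other services.
export type Transaction = { id: string; amount: number };

export interface TransactionStore {
  save(tx: Transaction): Promise<void>;
  find(id: string): Promise<Transaction | undefined>;
}

// Early weeks: a schema-free, in-memory stand-in (MongoDB in the example above).
export class DocumentStore implements TransactionStore {
  private docs = new Map<string, Transaction>();
  async save(tx: Transaction): Promise<void> {
    this.docs.set(tx.id, tx);
  }
  async find(id: string): Promise<Transaction | undefined> {
    return this.docs.get(id);
  }
}

// Later, once the schema stabilizes: a relational-backed implementation can
// replace the document store behind the same interface (queries are stubbed).
export class RelationalStore implements TransactionStore {
  async save(tx: Transaction): Promise<void> {
    // e.g., INSERT INTO transactions (id, amount) VALUES (?, ?) via a SQL client
  }
  async find(id: string): Promise<Transaction | undefined> {
    // e.g., SELECT id, amount FROM transactions WHERE id = ?
    return undefined;
  }
}
```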
PwC: Do microservices have skeletal
frameworks of code that you can
just grab, plug in, and compose the
first week’s working prototype?
RR: We open source a lot, and we have
developed a whole bunch of precut
microservices. That’s a benefit of being
part of the Node [server-side JavaScript]
community. There’s this ethic in the Node
community about sharing your Node services.
It’s an emergent property of the ecosystem.
You can’t really compile JavaScript, so a
lot of it’s going to be open source anyway.
You publish a module onto the npm public
repository, which is open source by definition.
PwC: There are very subtle and
nuanced aspects of the whole
microservices scene, and if you
look at it from just a traditional
development perspective, you’d miss
these critical elements. What’s the
integration pattern most closely
associated with microservices?
RR: It all comes back to thinking about your
system in terms of messages. If you need a
search engine for your system, for example,
there are various options and cloud-based
search services you can use now. Normally this
is a big integration task with heavy semantics
and coordination required to make it work.
If you define your search capability in terms
of messages, the integration is to write a
microservice that talks to whatever back end
you are using. In a sense, the work is to define
how to interact with the search service.
Let’s say the vendor is rolling out a new version.
It’s your choice when you go with the upgrade.
If you decide you want to move ahead with
the upgrade, you write your microservices so
both version 1 and version 2 can subscribe to
the same messages. You can route a certain
part of your message to version 1 and a certain
part to version 2. To gracefully phase in version
2 before fully committing, you might start
by directing 5 percent of traffic to the new
version, monitor it for issues, and gradually
increase the traffic to version 2. Because it
doesn’t require a full redeployment of your
entire system, it’s easy to do. You don’t need to
wait three months for a lockdown. Monolithic
systems often have these scenarios where
the system is locked down on November 30
because there’s a Christmas sales period or
something like that. With microservices,
you don’t have such issues anymore.
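A minimal sketch of that gradual cutover (hypothetical names; the router simply splits traffic by percentage, as a real message router or consumer-group setup would):

```typescript
// search-router.ts: phase in version 2 of a search service by routing a small,
// adjustable share of requests to it while version 1 keeps serving the rest.
type SearchQuery = { text: string };
type SearchHandler = (q: SearchQuery) => Promise<string[]>;

export function makeRouter(
  v1: SearchHandler,
  v2: SearchHandler,
  v2Share = 0.05 // start with 5 percent of traffic on the new version
): SearchHandler {
  return async (query) => {
    const handler = Math.random() < v2Share ? v2 : v1;
    return handler(query);
  };
}
```

Raising v2Share toward 1.0 retires version 1 gradually, with no redeployment of the rest of the system.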
PwC: So using this message pattern,
you could easily fall into the trap
of having a fat message bus, which
seems to be the anti-pattern here
for microservices. You’re forced to
maintain this additional code that is
filtering the messages, interpreting
the messages, and transforming
data. You’re back in the ESB world.
RR: Exactly. An enterprise spaghetti
bowl, I think it’s called.
PwC: How do you get your message to
the right places efficiently while still
having what some are calling a dumb
pipe to the message management?
RR: This principle of the dumb pipe is
really, really important. You must push the
intelligence of what to do with messages
out to the edges. And that means some
types of message brokers are better suited
to this architecture than others. For
example, traditional message brokers like
RabbitMQ—ones that maintain internal
knowledge of where individual consumers
are, message queues, and that sort of thing—
are much less suited to what we want to
do here. Something like Apache Kafka is
much better because it’s purposely dumb.
It forces the message-queue consumers to
remember their own place in the queue.
As a result, you don’t end up with scaling issues
if the queue gets overloaded. You can deal
with the scaling issue at the point of actually
intercepting the message, so you’re getting the
messages passed through as quickly as possible.
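As a small illustration of consumers owning their own position in the log, here is a sketch using the kafkajs client for Node.js (broker address, topic, and group names are hypothetical):

```typescript
// audit-consumer.ts: the broker stays "dumb"; each consumer group tracks its
// own offset, so the intelligence sits at the edges rather than in the pipe.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "order-audit", brokers: ["broker-1:9092"] });

async function run(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "order-audit-group" });
  await consumer.connect();
  // fromBeginning: false means this group resumes from its own committed offset.
  await consumer.subscribe({ topic: "orders", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      // Deciding what a message means happens here, at the edge.
      console.log(topic, partition, message.offset, message.value?.toString());
    },
  });
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```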
You don’t need to use a message queue
for everything, either. If you end up with
a very, very high throughput system, you
move the intelligence into the producer so it
knows you have 10 consumers. If one dies,
it knows to trigger the surrounding system
to create a new consumer, for example.
It’s the same idea as when we were using
MongoDB to determine the schema
ahead of time. After a while, you’ll
notice that the bus is less suitable for
certain types of messages because of the
volumes or the latency or whatever.
PwC: Would Docker provide a parallel
example for infrastructure?
RR: Yes. Let’s say you’re deploying 50 servers,
50 Amazon instances, and you set them up
with a Docker recipe. And you deploy that.
If something goes wrong, you could kill it.
There’s no way for a sys admin to SSH [Secure
Shell] into that machine and start tinkering
with the configurations to fix it. When you
deploy, the services either work or they don’t.
PwC: The cognitive load facing
programmers of monoliths and
the coordination load facing
programmer teams seem to represent
the new big mountain to climb.
RR: Yes. And that’s where the productivity
comes from, really. It actually isn’t about
best practices or a particular architecture
or a particular version of Node.js. It’s
just that if we have less intellectual work
to do, that actually lets us do more.
Technology Forecast: Rethinking integration
Issue 1, 2014
Containers are redefining
application-infrastructure
integration
By Alan Morrison
and Pini Reznik
With containers like Docker, developers can deploy the
same app on different infrastructure without rework.
Issue overview: Rethinking integration
This article focuses on one of three topics covered in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml). The integration fabric is a central component for PwC's New IT Platform. (See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.)
Spotify, the Swedish streaming music service,
grew by leaps and bounds after its launch in
2006. As its popularity soared, the company
managed its scaling challenge simply by adding
physical servers to its infrastructure. Spotify
tolerated low utilization in exchange for speed
and convenience. In November 2013, Spotify
was offering 20 million songs to 24 million
users in 28 countries. By that point, with a
computing infrastructure of 5,000 servers in 33
Cassandra clusters at four locations processing
more than 50 terabytes, the scaling challenge
demanded a new solution.
Spotify chose Docker, an open source
application deployment container that evolved
from the LinuX Containers (LXCs) used for the
past decade. LXCs allow different applications
to share operating system (OS) kernel, CPU,
and RAM. Docker containers go further,
adding layers of abstraction and deployment
management features. Among the benefits of
this new infrastructure technology, containers
that have these capabilities reduce coding,
deployment time, and OS licensing costs.
Not every company is a web-scale enterprise
like Spotify, but increasingly many companies
need scalable infrastructure with maximum
flexibility to support the rapid changes in
services and applications that today’s business
environment demands. Early evaluations of
Docker suggest it is a flexible, cost-effective,
and more nimble way to deploy rapidly
changing applications on infrastructure that
also must evolve quickly.
PwC expects containers will become a standard
fixture of the infrastructure layer in the
evolving cloud-inspired integration fabric.
This integration fabric includes microservices
at the services layer and data lakes at the
data layer, which other articles explore in this
“Rethinking integration” issue of the PwC
Technology Forecast.1 This article examines
Docker containers and their implications for
infrastructure integration.
A stretch goal solves a problem
Spotify’s infrastructure scale dwarfs those of
many enterprises. But its size and complexity
make Spotify an early proof case for the value
and viability of Docker containers in the agile
business environment that companies require.
By late 2013, Spotify could no longer continue
to scale or manage its infrastructure one
server at a time. The company used state-of-the-art configuration management tools such
as Puppet, but keeping those 5,000 servers
consistently configured was still difficult and
time-consuming.
Spotify had avoided conventional virtualization
technologies. “We didn’t want to deal with
the overhead of virtual machines (VMs),”
says Rohan Singh, a Spotify infrastructure
engineer. The company required some kind
of lightweight alternative to VMs, because it
needed to deploy changes to 60 services and
add new services across the infrastructure in
a more manageable way. “We wanted to make
our service deployments more repeatable and
less painful for developers,” Singh says.
Singh was a member of a team that first
looked at LXCs, which—unlike VMs—allow
applications to share an OS kernel, CPU,
and RAM. With containers, developers can
isolate applications and their dependencies.
Advocates of containers tout the efficiencies
and deployment speed compared with VMs.
Spotify wrote some deployment service
scripts for LXCs, but decided it was needlessly
duplicating what existed in Docker, which
includes additional layers of abstraction and
deployment management features.
Singh’s group tested Docker on a few internal
services to good effect. Although the vendor
had not yet released a production version of
Docker and advised against production use,
Spotify took a chance and did just that. “As a
stretch goal, we ignored the warning labels
and went ahead and deployed a container into
production and started throwing production
traffic at it,” Singh says, referring to a service
that provided album metadata such as the
album or track titles.2
Thanks to Spotify and others, adoption had
risen steadily even before Docker 1.0 was
available in June 2014.
1 For more information, see “Rethinking integration: Emerging patterns from cloud computing leaders,” PwC Technology Forecast 2014,
Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml.
2 Rohan Singh, “Docker at Spotify,” Twitter University YouTube channel, December 11, 2013,
https://www.youtube.com/watch?v=pts6F00GFuU, accessed May 13, 2014, and Jack Clark, “Docker blasts into 1.0, throwing dust onto
traditional hypervisors,” The Register, June 9, 2014, http://www.theregister.co.uk/2014/06/09/docker_milestone_release/, accessed
June 11, 2014.
The Docker application container engine posted on GitHub, the code-sharing network, had received more than
14,800 stars (up-votes by users) by August
14, 2014.3 Container-oriented management
tools and orchestration capabilities are just
now emerging. When Docker, Inc., (formerly
dotCloud, Inc.) released Docker 1.0, the
vendor also released Docker Hub, a proprietary
orchestration tool available for licensing. PwC
anticipates that orchestration tools will also
become available from other vendors. Spotify is
currently using tools it developed.
Why containers?
LXCs have existed for many years, and some
companies have used them extensively. Google,
for example, now starts as many as 2 billion
containers a week, according to Joe Beda, a
senior staff software engineer at Google.4 LXCs
abstract the OS more efficiently than VMs.
The VM model blends an application, a full
guest OS, and disk emulation. In contrast, the
container model uses just the application’s
dependencies and runs them directly on a host
OS. Containers do not launch a separate OS
for each application, but share the host kernel
while maintaining the isolation of resources
and processes where required.
The fact that a container does not run its
own OS instance reduces dramatically the
overhead associated with starting and running
instances. Startup time can typically be
reduced from 30 seconds (or more) to one-tenth of a second. The number of containers
running on a typical server can reach dozens
or even hundreds. The same server, in
contrast, might support 10 to 15 VMs.
Developer teams such as those at Spotify, which
write and deploy services for large software-as-a-service (SaaS) environments, need to deploy
new functionality quickly, at scale, and to test
and see the results immediately. Increasingly,
they say containerization delivers those
benefits. SaaS environments by their very test-driven nature require frequent infusions of new
code to respond to shifting customer demands.
Without containers, developers who write more
and more distributed applications would spend
much time on repetitive drudgery.
Docker: LXC simplification and an
emerging multicloud abstraction
A Docker application container takes the
basic notion of LXCs, adds simplified ways of
interacting with the underlying kernel, and
makes the whole portable (or interoperable) across environments that have different operating systems.
Figure 1: Virtual machines on a Type 2 hypervisor versus application containerization with a shared OS. In the VM model, each application carries its own binaries/libraries and a guest OS on top of a Type 2 hypervisor, the host OS, and the server. In the container model, applications and their binaries/libraries run as containers on a container engine above the host OS and server; containers are isolated, but share the OS and, where appropriate, bins/libraries. Source: Docker, Inc., 2014
3 See “dotcloud/docker,” GitHub, https://github.com/dotcloud/docker, accessed August 14, 2014.
4 Joe Beda, “Containers At Scale,” Gluecon 2014 conference presentation slides, May 22, 2014,
https://speakerdeck.com/jbeda/containers-at-scale, accessed June 11, 2014.
Portability is currently
limited to Linux environments—Ubuntu, SUSE,
or Red Hat Enterprise Linux, for example.
But Ben Golub, CEO of Docker, Inc., sees no
reason why a Dockerized container created on
a laptop for Linux couldn’t eventually run on
a Windows server unchanged. “With Docker,
you no longer need to worry in advance about
where the apps will run, because the same
containerized application will run without
being modified on any Linux server today.
Going to Windows is a little trickier because the
primitives aren’t as well defined, but there’s no
rocket science involved.5 It’s just hard work that
we won’t get to until the second half of 2015.”
That level of portability can therefore extend
across clouds and operating environments,
because containerized applications can run on
a VM or a bare-metal server, or in clouds from
different service providers.
The amount of application isolation that
Docker containers provide—a primary reason
for their portability—distinguishes them from
basic LXCs. In Docker, applications and their
dependencies, such as binaries and libraries,
all become part of a base working image. That
containerized image can run on different
machines. “Docker defines an abstraction for
these machine-specific settings, so the exact
same Docker container can run—unchanged—
on many different machines, with many
different configurations,” says Solomon Hykes,
CTO of Docker, Inc.6
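As a generic illustration of how an application and its dependencies become one image, here is a minimal Dockerfile sketch (the base image, port, and file names are assumptions, not any vendor's recipe):

```dockerfile
# Build one image that carries the application plus its binaries and libraries,
# so the same container runs unchanged on a laptop, a VM, or a bare-metal host.

# Base layer: a minimal Linux userland with the Node.js runtime.
FROM node:lts-alpine

WORKDIR /app

# Dependencies are baked into the image rather than installed on the host.
COPY package.json ./
RUN npm install --production

# The application code becomes the top layer of the image.
COPY . .

# The service listens on this port; Docker maps it to the host at run time.
EXPOSE 3000
CMD ["node", "server.js"]
```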
Another advantage of Docker containerization
is that updates, such as vulnerability patches,
can be pushed out to the containers that
need them without disruption. “You can push
changes to 1,000 running containers without
taking any of them down, without restarting
an OS, without rebuilding a VM,” Golub says.
Docker’s ability to extend the reach of security
policy and apply it uniformly is substantial.
“The security model becomes much better
with containers. In the VM-based world, every
application has its own guest OS, which is
a slightly different version. These different
versions are difficult to patch. In a container-based world, it's easier to standardize the OS
and deploy just one patch across all hosts,”
he adds.
Containerized applications also present
opportunities for more comprehensive
governance. Docker tracks the provenance of
each container by using a method that digitally
signs each one. Golub sees the potential, over
time, for a completely provenanced library
of components, each with its own automated
documentation and access control capability.7
When VMs were introduced, they formed
a new abstraction layer, a way to decouple
software from a hardware dependency. VMs
led to the creation of clouds, which allowed
the load to be distributed among multiple
hardware clusters. Containerization using
the open Docker standard extends this
notion of abstraction in new ways, across
homogeneous or heterogeneous clouds. Even
more importantly, it lowers the time and cost
associated with creating, maintaining, and
using the abstraction. Docker management
tools such as Docker Hub, CenturyLink
Panamax, Apache Mesos, and Google
Kubernetes are emerging to address container
orchestration and related challenges.
Outlook: Containers, continuous
deployment, and the rethinking
of integration
Software engineering has generally trended
away from monolithic applications and toward
the division of software into orchestrated
groups of smaller, semi-autonomous pieces
that have a smaller footprint and shorter
deployment cycle.8 Microservices principles are
leading this change in application architecture,
and containers will do the same when it comes
to deploying those microservices on any cloud
infrastructure. The smaller size, the faster
creation, and the subsecond deployment of
containers allow enterprises to reduce both
the infrastructure and application deployment cycles from hours to minutes.
5 A primitive is a low-level object, components of which can be used to compose functions.
See http://www.webopedia.com/TERM/P/primitive.html, accessed July 28, 2014, for more information.
6 Solomon Hykes, “What does Docker add to just plain LXC?” answer to Stack Overflow Q&A site, August 13, 2013,
http://stackoverflow.com/questions/17989306/what-does-docker-add-to-just-plain-lxc, accessed June 11, 2014.
7 See the PwC interview with Ben Golub, “Docker’s role in simplifying and securing multicloud development,”
http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/interviews/interview-ben-golub-docker.jhtml for more information.
8 For more detail and a services perspective on this evolution, see “Microservices: The resurgence of SOA principles and
an alternative to the monolith,” PwC Technology Forecast 2014, Issue 1,
http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml.
Figure 2: From monolithic to multicloud architectures. 1995: thick client-server client, a middleware/OS stack, and monolithic physical infrastructure. 2014: thin mobile client, applications assembled from available services, and VMs on cloud. 2017 (projected): thin mobile/web UI, microservices, and a multicloud container-based infrastructure with containers spanning multiple clouds. Source: Docker, Inc. and PwC, 2014
When enterprises
can reduce deployment time so it’s comparable
to the execution time of the application itself,
infrastructure development can become an
integral part of the main development process.
These changes should be accompanied by
changes in organizational structures, such
as transitioning from waterfall to agile and
DevOps teams.9
When VMs became popular, they were initially
used to speed up and simplify the deployment
of a single server. Once the application
architecture internalized the change and
monolithic apps started to be divided into
smaller pieces, the widely accepted approach of
that time—the golden image—could not keep
up. VM proliferation and management became
the new headaches in the typical organization.
This problem led to the creation of
configuration management tools that help
maintain the desired state of the system.
CFEngine pioneered these tools, which Puppet,
Chef, and Ansible later popularized. Another
cycle of growth led to a new set of orchestration
tools. These tools—such as MCollective,
Capistrano, and Fabric—manage the complex
system deployment on multihost environments
in the correct order.
Containers might allow the deployment of a
single application in less than a second, but
now different parts of the application must run
on different clouds. The network will become
the next bottleneck. The network issues will
require systems to have a combination of
statelessness and segmentation. Organizations
will need to deploy and run subsystems
separately with only loose, software-defined
network connections. That’s a difficult path.
Some centralization may still be necessary.
9 For a complete analysis of the DevOps movement and its implications, see “DevOps: Solving the engineering productivity challenge,”
PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.
Conclusion: Beyond application
and infrastructure integration
Microservices and containers are symbiotic.
Together their growth has produced an
alternative to integration entirely different from
traditional enterprise application integration
(EAI). Some of the differences include:
Traditional EAI versus microservices plus containers (in each pair, the first term describes traditional EAI and the second describes microservices plus containers):
• Translation (via an enterprise service bus [ESB], for example) versus encapsulation
• Articulation versus abstraction
• Bridging between systems versus portability across systems
• Monolithic, virtualized OS versus fit-for-purpose, distributed OS
• Wired versus loosely coupled
The blend of containers, microservices, and
associated management tools will redefine
the nature of the components of a system. As
a result, organizations that use the blend can
avoid the software equivalent of wired, hard-to-create, and hard-to-maintain connections.
Instead of constantly tinkering with a
polyglot connection bus, system architects
can encapsulate the application and its
dependencies in a lingua franca container.
Instead of virtualizing the old OS into the new
context, developers can create distributed,
slimmed-down operating systems. Instead of
building bridges between systems, architects
can use containers that allow applications
to run anywhere. By changing the nature of
integration, containers and microservices
enable enterprises to move beyond it.
Technology Forecast: Rethinking integration
Issue 1, 2014
Docker’s role in
simplifying and securing
multicloud development
Ben Golub of Docker outlines the company’s application
container road map.
Interview conducted by Alan Morrison, Bo Parker, and Pini Reznik
PwC: You mentioned that one of
the reasons you decided to join
Docker, Inc., as CEO was because the
capabilities of the tool itself intrigued
you. What intrigued you most?1
Ben Golub
Ben Golub is CEO of Docker, Inc.
BG: The VM was created when applications
were long-lived, monolithic, built on a
well-defined stack, and deployed to a
single server. More and more, applications
today are built dynamically through rapid
modification. They’re built from loosely
coupled components in a variety of different
stacks, and they’re not deployed to a single
server. They’re deployed to a multitude of
servers, and the application that’s working
on a developer’s laptop also must work in
the test stage, in production, when scaling,
across clouds, in a customer environment on
a VM, on an OpenStack cluster, and so forth.
The model for how you would do that is really
very different from how you would deal with
a VM, which is in essence trying to treat an
application as if it were an application server.
1 Docker is an open source application deployment container tool released by Docker, Inc., that allows developers to package applications
and their dependencies in a virtual container that can run on any Linux server. Docker Hub is Docker, Inc.’s related, proprietary set of image
distribution, change management, collaboration, workflow, and integration tools. For more information on Docker Hub, see Ben Golub,
"Announcing Docker Hub and Official Repositories," Docker, Inc. (blog), June 9, 2014, http://blog.docker.com/2014/06/announcing-docker-hub-and-official-repositories/, accessed July 18, 2014.
What containers do is pretty radical if you
consider their impact on how applications
are built, deployed, and managed.2
PwC: What predisposed the market
to say that now is the time to start
looking at tools like Docker?
BG: When creating an application as if it were
an application server, the VM model blends
an application, a full guest operating system,
and disk emulation. By contrast, the container
model uses just the application’s dependencies
and runs them directly on a host OS.
In the server world, the use of containers was
limited to companies, such as Google, that
had lots of specialized tools and training.
Those tools weren’t transferable between
environments; they didn’t make it possible
for containers to interact with each other.
We often use the shipping container as an
analogy. The analogous situation before
Docker was one in which steel boxes had
been invented but nobody had made them
a standard size, put holes in all the same
places, and figured out how to build cranes
and ships and trains that could use them.
We aim to add to the core container
technology, so containers are easy to use and
interoperable between environments. We
want to make them portable between clouds
and different operating systems, between
physical and virtual. Most importantly, we’re working to build an ecosystem around it, so there will be people, tools, and standard libraries that will all work with Docker.3
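To make this concrete, the container model Golub describes can be expressed in a single Dockerfile that declares an application and its dependencies, which then run directly on a shared host kernel. The sketch below is purely illustrative; the base image, packages, and file names are hypothetical and do not come from the interview.

    # Minimal sketch: the application and its dependencies are declared
    # here rather than baked into a guest operating system image.
    FROM ubuntu:14.04

    # Install only what this application needs.
    RUN apt-get update && apt-get install -y python python-pip

    # Add the application code and its library dependencies.
    COPY requirements.txt /app/requirements.txt
    RUN pip install -r /app/requirements.txt
    COPY app.py /app/app.py
    WORKDIR /app

    # The container runs a single process; the host OS kernel is shared.
    CMD ["python", "app.py"]

Building this with docker build -t myapp . produces one artifact that behaves the same way on a developer laptop, a test server, or a cloud VM.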
PwC: What impact is Docker having on the evolution of PaaS?4

BG: The traditional VM links together
the application management and the
infrastructure management. We provide
a very clean separation, so people can
use Docker without deciding in advance
whether the ideal infrastructure is a public
or private cloud, an OpenStack cluster, or a
set of servers all running RHEL or Ubuntu.
The same container will run in all of those
places without modification or delay.
Because containers are so much more efficient
and lightweight, you can usually gain 10
times greater density when you get rid of
that guest operating system. That density
really changes the economics of providing
XaaS as well as the economics and the ease of
moving between different infrastructures.
In a matter of milliseconds, a container can
be moved between provider A and provider B
or between provider A and something private
that you’re running. That speed really changes
how people think about containers. Docker has
become a standard container format for a lot
of different platforms as a service, both private
and public PaaS. At this point, a lot of people
are questioning whether they really need a full
PaaS to build a flexible app environment.5
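The portability described here comes down to a small number of commands: build an image once, publish it to a registry, and pull and run that same image on whatever infrastructure is available. A minimal sketch follows, with the image and registry names invented for illustration.

    # On the build machine: package the application and publish the image.
    docker build -t registry.example.com/myteam/myapp:1.0 .
    docker push registry.example.com/myteam/myapp:1.0

    # On any Linux host -- public cloud, private cloud, or bare metal:
    docker pull registry.example.com/myteam/myapp:1.0
    docker run -d -p 8080:8080 registry.example.com/myteam/myapp:1.0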
2 Abbreviations are as follows:
• VM: virtual machine
3 Abbreviations are as follows:
• OS: operating system
4 Abbreviations are as follows:
• PaaS: platform as a service
5 Abbreviations are as follows:
• RHEL: Red Hat Enterprise Linux
PwC: Why are so many questioning
whether or not they need a full PaaS?
BG: A PaaS is a set of preselected stacks
intended to run the infrastructure for
you. What people increasingly want is the
ability to choose any stack and run it on
any platform. That’s beyond the capability
of any one organization to provide.
With Docker, you no longer need to worry
in advance about where the apps will run,
because the same containerized application
will run without being modified on any
Linux server today. You might build it in an
environment that has a lot of VMs and decide
you want to push it to a bare-metal cluster
for greater performance. All of those options
are possible, and you don’t really need to
know or think about them in advance.
PwC: When you can move Docker
containers so easily, are you shifting
the challenge to orchestration?
BG: Certainly. Rather than having
components tightly bound together and
stitched up in advance, they’re orchestrated
and moved around as needs dictate.
Docker provides the primitives that let you
orchestrate between containers using a
bridge. Ultimately, we’ll introduce more
full-fledged orchestration that lets you
orchestrate across different data centers.
Docker Hub—our commercial services
announced in June 2014—is a set of services
you can use to orchestrate containers both
within a data center and between data centers.
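The orchestration primitives Golub refers to are visible at the command line: containers get names, share a bridge network on the host, and can be linked so one container can find another without hard-coded addresses. A minimal sketch, with image and container names chosen only for illustration:

    # Start a database container on the host's default Docker bridge.
    docker run -d --name db postgres

    # Start the application container and link it to the database.
    # Docker injects connection details (environment variables and a
    # hosts entry for "db") so the app can locate the database.
    docker run -d --name web --link db:db -p 8080:8080 myapp:1.0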
PwC: What should an enterprise
that’s starting to look at Docker think
about before really committing?
BG: We’re encouraging people to start
introducing Docker as part of the overall
workflow, from development to test and
then to production. For example, eBay has
been using Docker for quite some time.
The company previously took weeks to go
from development to production. A team
would start work on the developer’s laptop,
move it to staging or test, and it would
break and they weren’t sure why. And as they moved it from test or staging to production, it would break again and they wouldn’t know why.6
Contrast that with Docker: the entire runtime environment is defined in the container. The developer pushes a button
and commits code to the source repository,
the container gets built and goes through test
automatically, and 90 percent of the time
the app goes into production. That whole
process takes minutes rather than weeks.
During the 10 percent of the time when
this approach doesn’t work, it’s really clear
what went wrong, whether the problem was
inside the container and the developer did
something wrong or the problem was outside
the container and ops did something wrong.
Some people want to really crank up
efficiency and performance and use Docker
on bare metal. Others use Docker inside
of a VM, which works perfectly well.
6 For more detail, see Ted Dziuba, “Docker at eBay,” presentation slides, July 13, 2013, https://speakerdeck.com/teddziuba/docker-at-ebay,
accessed July 18, 2014.
For those just starting out, I’d recommend they
do a proof of concept this year and move to
production early next year. According to our
road map, we’re estimating that by the second
half of 2015, they can begin to use the control
tools to really understand what’s running
where, set policies about deployment, and
set rules about who has the right to deploy.
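The commit-to-production flow described above, in which a committed change automatically becomes a built and tested container, can be sketched as a short pipeline script. Everything here is hypothetical: the registry, the image name, and the run_tests.sh entry point are stand-ins, not details from eBay or Docker.

    #!/bin/sh
    # Minimal sketch of a Docker-based build/test/release step,
    # for example triggered by a commit hook or a CI job.
    set -e

    SHA=$(git rev-parse --short HEAD)
    IMAGE=registry.example.com/myteam/myapp:$SHA

    # Build the image that defines the entire runtime environment.
    docker build -t "$IMAGE" .

    # Run the test suite inside the exact image that would ship.
    docker run --rm "$IMAGE" ./run_tests.sh

    # Publish the tested image; deployment pulls this same artifact.
    docker push "$IMAGE"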
PwC: The standard operating model
with VMs today commingles app
and infrastructure management.
How do you continue to take
advantage of those management
tools, at least for now?
BG: If you use Docker with a VM for the host
rather than bare metal, you can continue
to use those tools. And in the modified VM
scenario, rather than having 1,000 applications
equal 1,000 VMs, you have 10 VMs, each of
which would be running 100 containers.
PwC: What about management tools
for Docker that could supplant the
VM-based management tools?
BG: We have good tools now, but they’re certainly nowhere near as mature as the VM toolset.
PwC: What plans are there to move
Docker beyond Linux to Windows,
Solaris, or other operating systems?
BG: This year we’re focused on Linux, but
we’ve already given ourselves the ability to
use different container formats within Linux,
including LXC, libvirt, and libcontainer. People
who are already in the community are working
on having Docker manage Solaris zones and
jails. We don’t see any huge technical reasons
why Docker for Solaris can’t happen.
Going to Windows is a little bit trickier
because the primitives aren’t as well defined,
but there’s no rocket science involved.
It’s just hard work that we likely won’t
get to until the second half of 2015.7
PwC: Docker gets an ecstatic response
from developers, but the response
from operations people is more
lukewarm. A number of those we’ve
spoken with say it’s a very interesting
technology, but they already have
Puppet running in the VMs. Some
don’t really see the benefit. What
would you say to these folks?
BG: I would say there are lots of folks who disagree with them and who actually use it in production.
Folks in ops are more conservative than
developers for good reason. But people will
get much greater density and a significant
reduction in the amount they’re spending on
server virtualization licenses and hardware.
One other factor even more compelling is that
Docker enables developers to deliver what
they create in a standardized form. While
admin types might hope that developers
embrace Chef and Puppet, developers rarely
do. You can combine Docker with tools such
as Chef and Puppet that the ops folks like
and often get the best of both worlds.
PwC: What about security?
BG: People voice concerns about security
just because they think containers are new.
They’re actually not new. The base container
technology has been used at massive scale by
companies such as Google for several years.
7 Abbreviations are as follows:
• LXC: LinuX Container
The security model becomes much better with
containers. Most organizations face hundreds
of thousands of vulnerabilities that they know
about but have very little ability to address. In
the VM-based world where every application
has its own VM, every application has its own
guest OS, which is a slightly different version.
These different versions are difficult to patch.
In a container-based world, it’s easier to
standardize the OS across all hosts. If there’s an
OS-level vulnerability, there’s one patch that
just needs to be redeployed across all hosts.
Containerized apps are also much easier
to update. If there’s an application
vulnerability, you can push changes to
1,000 running containers without taking
any of them down, without restarting
an OS, without rebuilding a VM.
Once the ops folks begin to understand
better what Docker really does, they
can get a lot more excited.
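Golub’s patching argument can be illustrated with a simple rolling replacement: a patched image is published once, and each host swaps its running container for the new one without rebuilding a VM or restarting an operating system. The sketch below assumes one container per host and a shared registry; host names, tags, and ports are hypothetical, and a production fleet would normally rely on an orchestration tool rather than a bare loop.

    #!/bin/sh
    # Minimal sketch of rolling a patched image across a small fleet.
    set -e

    IMAGE=registry.example.com/myteam/myapp:1.0.1   # patched build

    for HOST in web1 web2 web3; do
      ssh "$HOST" "
        docker pull $IMAGE &&
        docker stop myapp && docker rm myapp &&
        docker run -d --name myapp -p 8080:8080 $IMAGE
      "
    done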
PwC: Could you extend a governance
model along these same lines?
BG: Absolutely. Generally, when developers
build with containers, they start with base
images. Having a trusted library to start with
is a really good approach. These days, containers are created directly from source, which in essence means you can put a set of instructions in a source code repository. As that source code is used and matures, the changes that get committed essentially translate automatically into an updated container.
What we’re adding to that is what we call
provenance. That’s the ability to digitally sign
every container so you know where it came
from, all the way back to the source. That’s
a much more comprehensive security and
governance model than trying to control
what different black boxes are doing.
PwC: What’s the outlook for the
distributed services model generally?
BG: I won’t claim that we can change the laws
of physics. A terabyte of data doesn’t move
easily across narrow pipes. But if applications
and databases can be moved rapidly, and if
they consistently define where they look for
data, then the things that should be flexible can
be. For example, if you want the data resident
in two different data centers, that could be a
lot of data. Either you could arrange it so the
data eventually become consistent or you could
set up continuous replication of data from one
location to the other using something like CDP.8
I think either of those models works.
8 Abbreviations are as follows:
• CDP: continuous data protection
Technology Forecast: Rethinking integration
Issue 1, 2014
What do businesses need
to know about emerging
integration approaches?
Sam Ramji of Apigee views the technologies of the integration
fabric through a strategy lens.
Interview conducted by the Technology Forecast team
PwC: We’ve been looking at three
emerging technologies: data
lakes, microservices, and Docker
containers. Each has a different
impact at a different layer of the
integration fabric. What do you
think they have in common?1
Sam Ramji
Sam Ramji is vice president of
strategy at Apigee.
SR: What has happened here has been
the rightsizing of all the components. IT
providers previously built things assuming that
compute, storage, and networking capacity
were scarce. Now they’re abundant. But
even when they became abundant, end users
didn’t have tools that were the right size.
Containers have rightsized computing, and
Hadoop has rightsized storage. With HDFS or
Cassandra or a NoSQL database, companies can
process enormous amounts of data very easily.
And HTTP-based, bindable endpoints that can
talk to any compute source have rightsized
the network. So between Docker containers
for compute, data lakes for storage, and APIs
for networking, these pieces are finally small
enough that they all fit very nicely, cleanly,
and perfectly together at the same time.2
1 For more background on data lakes, microservices, and Docker containers, see “Rethinking integration: Emerging patterns from cloud
computing leaders,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml.
2 Abbreviations are as follows:
• HDFS: Hadoop Distributed File System
• NoSQL: Not only SQL
• API: application programming interface
PwC: To take advantage of the
three integration-related trends
that are emerging at the same
time, what do executives generally
need to be cautious about?
SR: One risk is how you pitch this new
integration approach politically. It may not
be possible to convert the people who are
currently working on integration. For those
who are a level or two above the fray, one of the
top messages is to make the hard decisions and
say, “Yes, we’re moving our capital investments
from solving all problems the old ways to
getting ready for new problems.” It’s difficult.
PwC: How do you justify such a major
change when presenting the pitch?
SR: The incremental cost in time and money
for new apps is too high—companies must
make a major shift. Building a new app can
take nine months. Meanwhile, marketing
departments want their companies to
build three new apps per quarter. That will
consume a particular amount of money and
will require a certain amount of planning
time. And then by the time the app ships nine
months later, it needs a new feature because
there’s a new service like Pinterest that didn’t
exist earlier and now must be tied in.
PwC: Five years ago, nine months
would have been impossibly fast.
Now it’s impossibly slow.
SR: And that means the related business
processes are too slow and complicated,
and costs are way too high. Companies
spend between $300,000 and $700,000 in
roughly six to seven months to implement
a new partner integration. That high cost
is becoming prohibitive in today’s value-network-based world, where companies
constantly try to add new nodes to their
value network and maybe prune others.
Let’s say there’s a new digital pure play, and
your company absolutely must be integrated
with it. Or perhaps you must come up with
some new joint offer to a customer segment
you’re trying to target. You can’t possibly
afford to be in business if you rely on those old
approaches because now you need to get 10
new partnerships per order. And you certainly
can’t do that at a cost of $500,000 per partner.
This new approach to integration could enable
companies to develop and deliver apps in three
months. Partner integration would be complete
in two months for $50,000, not $500,000.
PwC: How about CIOs in particular?
How should they pitch the need for
change to their departments?
SR: Web scale is essential now. ESBs absolutely do not scale to meet mobile demands. They cannot support three fundamental components of mobile and Internet access.
In general, ESBs were built for different
purposes. A fully loaded ESB that’s performing
really well will typically cost an organization
millions of dollars to run about 50 TPS. The
average back end that’s processing mobile
transactions must run closer to 1,000 TPS.
Today’s transaction volumes require systems
to run at web scale. They will crush an ESB.
The second issue is that ESBs are not built
for identity. ESBs generally perform system-to-system identity. They’re handling a
maximum of 10,000 different identities
in the enterprise, and those identities are
organizations or systems—not individual
end users, which is crucial for the new IT.
If companies don’t have a user’s identity,
they’ll have a lot of other issues around user
profiling or behavior. They’ll have user amnesia
and problems with audits or analytics.
The third issue is the ability to handle
the security handoff between external
devices that are built in tools and languages
such as JavaScript and to bridge those
devices into ESB native security.
ESB is just not a good fit when organizations
need scale, identity, and security.3

PwC: IT may still be focused on core systems where things haven’t really changed a lot.

SR: Yes, but that’s not where the growth is. We’re seeing a ton of new growth in these edge systems, specifically from mobile. There are app-centric uses that require new infrastructure, and they’re distinct from what I call plain old integration.
PwC: How about business unit
managers and their pitches to the
workforce? People in the business
units may wonder why there’s such
a preoccupation with going digital.
SR: When users say digital, what we really
mean is digital data that’s ubiquitous
and consumable and computable by
pretty much any device now.
The industry previously built everything
for a billion PCs. These PCs were available
only when people chose to walk to their
desks. Now people typically have three or
more devices, many of which are mobile.
They spend more time computing. It’s not
situated computing where they get stuck at
a desk. It’s wherever they happen to be. So
the volume of interactions has gone up, and
the number of participants has gone up.
About 3 billion people work at computing
devices in some way, and the volume of
interactions has gone up many times. The
shift to digital interactions has been basically
an order of magnitude greater than what
we supported earlier, even for web-based
computing through desktop computers.
3 Abbreviations are as follows:
• ESB: enterprise service bus
• TPS: transactions per second
PwC: The continuous delivery
mentality of DevOps has had an
impact, too. If the process is in
software, the expectations are that
you should be able to turn on a dime.4
SR: Consumer expectations about services
are based on what they’ve seen from large-scale services such as Facebook and Google
that operate in continuous delivery mode.
They can scale up whenever they need to.
Availability is as important as variability.
Catastrophic successes consistently occur in
global corporations. The service gets launched,
and all of a sudden they have 100,000 users.
That’s fantastic. Then they have 200,000
users, which is still fantastic. Then they reach
300,000. Crunch. That’s when companies
realize that moving around boxes to try to
scale up doesn’t work anymore. They start
learning from web companies how to scale.
PwC: The demand is for fluidity and
availability, but also variability.
The load is highly variable.
SR: Yes. In highly mobile computing, the
demand patterns for digital interactions
are extremely spiky and unpredictable.
None of these ideas is new. Eleven years ago
when I was working for Adam Bosworth at BEA
Systems, he wrote a paper about the autonomic
model of computing in which he anticipated
natural connectedness and smaller services.
We thought web services would take us there.
We were wrong about that as a technology,
but we were right about the direction.
We lacked the ability to get people to
understand how to do it. People were building
services that were too big, and we didn’t
realize why the web services stack was still
too bulky to be consumed and easily adopted
by a lot of people. It wasn’t the right size
before, but now it’s shrunk down to the right
size. I think that’s the big difference here.
4 DevOps refers to a closer collaboration between developers and operations people that becomes necessary for a more continuous flow of
changes to an operational code base, also known as continuous delivery. Thus, DevOps=Dev+Ops. For more on continuous delivery and
DevOps, see “DevOps: Solving the engineering productivity challenge,” PwC Technology Forecast 2013, Issue 2,
http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.
Technology Forecast: Rethinking integration
Issue 1, 2014
Zero-integration
technologies and their
role in transformation
By Bo Parker
The key to integration success is reducing the need for
integration in the first place.
Issue overview: Rethinking integration

This article summarizes three topics also covered individually in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml). The integration fabric is a central component for PwC’s New IT Platform. (See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.)
Social, mobile, analytics, cloud—SMAC for short—have set new expectations for what a high-performing IT organization delivers to the enterprise. Those expectations can strain IT, yet SMAC technologies can also be saviors if IT figures out how to embrace them. As PwC states in
“Reinventing Information Technology in the
Digital Enterprise”:
Business volatility, innovation, globalization
and fierce competition are forcing
business leaders to review all aspects of
their businesses. High on the agenda:
Transforming the IT organization to meet
the needs of businesses today. Successful
IT organizations of the future will be
those that evaluate new technologies with
a discerning eye and cherry pick those
that will help solve the organization’s
most important business problems. This
shift requires change far greater than
technology alone. It requires a new mindset
and a strong focus on collaboration,
innovation and “outside-in” thinking
with a customer-centric point of view.1
The shift starts with rethinking the purpose
and function of IT while building on its core
historical role of delivering and maintaining
stable, rock-solid transaction engines.
Rapidly changing business needs are pushing
enterprises to adopt a digital operating
model. This move reaches beyond back-office and front-office technology. Every
customer, distributor, supplier, investor,
partner, employee, contractor, and especially
any software agents substituting for those
conventional roles now expects a digital
relationship. Such a relationship entails more
than converting paper to web screens. Digital
relationships are highly personalized, analytics-driven interactions that are absolutely reliable,
that deliver surprise and delight, and that
evolve on the basis of previous learnings.
Making digital relationships possible is a huge
challenge, but falling short will have severe
consequences for every enterprise unable to
make a transition to a digital operating model.
How to proceed?
Successfully adopting a digital operating model
requires what PwC calls a New IT Platform.
This innovative platform aligns IT’s capabilities
to the dynamic needs of the business and
empowers the entire organization with
technology. Empowerment is an important
focus. That’s because a digital operating model
won’t be something IT builds from the center
out. It won’t be something central IT builds
much of at all. Instead, building out the digital
operating model—whether that involves
mobile apps, software as a service, or business
units developing digital value propositions on
third-party infrastructure as a service or on
internal private clouds—will happen closest to
the relevant part of the ecosystem.
What defines a New IT Platform? The
illustration highlights the key ingredients.
PwC’s New IT Platform
[Figure: The New IT Platform encompasses transformation across the organization. The mandate (Broker of Services) + the process (Assemble-to-Order) + the architecture (Integration Fabric) + the organization (Professional Services Structure) + the governance (Empowering Governance) = the New IT Platform.]
1 “Reinventing Information Technology in the Digital Enterprise,” PwC, December 2013,
http://www.pwc.com/us/en/increasing-it-effectiveness/publications/new-it-platform.jhtml.
The New IT Platform emphasizes consulting,
guiding, brokering, and using existing
technology to assemble digital assets rather
than build from scratch. A major technology
challenge that remains—and one that
central IT is uniquely suited to address—is to
establish an architecture that facilitates the
integration of an empowered, decentralized
enterprise technology landscape. PwC calls
it the new integration fabric. Like the threads
that combine to create a multicolored woven
blanket, a variety of new integration tools and
methods will combine to meet a variety of
challenges. And like a fabric, these emerging
tools and methods rely on each other to weave
in innovations, new business partners, and new
operating models.
The common denominator of these new
integration tools and methods is time: The
time it takes to use new data and discover
new insights from old data. The time it takes
to modify a business process supported by
software. The time it takes to promote new
code into production. The time it takes to scale
up infrastructure to support the overnight
success of a new mobile app.
The bigger the denominator (time), the
bigger the numerator (expected business
value) must be before a business will take a
chance on a new innovation, a new service, or
an improved process. Every new integration
approach tries to reduce integration time to as
close to zero as possible.
Given the current state of systems integration,
getting to zero might seem like a pipe dream.
In fact, most of the key ideas behind zero-integration technologies aren’t coming from
traditional systems integrators or legacy
technologies. They are coming from web-scale
companies facing critical problems for which
new approaches had to be invented. The great
news is that these inventions are often available
as open source, and a number of service
providers support them.
How to reach zero integration?
What has driven web-scale companies to push
toward zero-integration technologies? These
companies operate in ecosystems that innovate
in web time. Every web-scale company is
conceivably one startup away from oblivion. As
a result, today’s smart engineers provide a new
project deliverable in addition to working code.
They deliver IT that is change-forward friendly.
Above all, change-forward friendly means
that doing something new and different is
just as easy four years and 10 million users
into a project as it was six months and 1,000
users into the project. It’s all about how doing
something new integrates with the old.
More specifically, change-forward-friendly
data integration is about data lakes.2 All data
is in the lake, schema are created on read,
metadata generation is collaborative, and data
definitions are flexible rather than singular
definitions fit for a business purpose. Change-forward-friendly data integration means no
time is wasted getting agreement across the
enterprise about what means what. Just do it.
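To make the schema-on-read idea concrete: records stay in the lake in their native format, and each analysis projects only the fields and types it needs at the moment it reads them. The short Python sketch below is illustrative; the file layout and field names are hypothetical, and a real lake would typically sit on HDFS or object storage rather than a local file.

    import json

    def read_clicks(path):
        # Project a context-specific schema over raw JSON-lines events
        # at read time; the stored data is never transformed up front.
        with open(path) as events:
            for line in events:
                raw = json.loads(line)
                yield {
                    "user_id": str(raw.get("uid", "")),
                    "page": raw.get("page"),
                    "ts": raw.get("timestamp"),
                }

A different team can read the same files with a different projection, without any enterprise-wide agreement about what means what.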
Change-forward-friendly application and
services integration is about microservices
frameworks and principles.3 It uses small,
single-purpose code modules, relaxed
approaches to many versions of the same
service, and event loop messaging. It relies
on organizational designs that acknowledge
Conway’s law, which says the code architecture
reflects the IT organization architecture. In
other words, when staffing large code efforts,
IT should organize people into small teams
of business-meaningful neighborhoods to
minimize the cognitive load associated with
working together. Just code it.
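A microservice in this sense is deliberately tiny: one business-meaningful function behind a network endpoint, owned by a small team and replaceable on its own schedule. The following sketch uses only the Python standard library; the endpoint and payload are hypothetical and stand in for whatever single purpose the service owns.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class PriceQuoteService(BaseHTTPRequestHandler):
        # Single-purpose service: return a price quote, and nothing more.
        def do_GET(self):
            if self.path.startswith("/quote"):
                body = json.dumps({"sku": "ABC-123", "price": 19.99}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), PriceQuoteService).serve_forever()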
2 See the article “The enterprise data lake: Better integration and deeper analytics,” PwC Technology Forecast 2014, Issue 1,
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml.
3 See the article “Microservices: The resurgence of SOA principles and an alternative to the monolith,” PwC Technology Forecast 2014,
Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml.
Change-forward-friendly infrastructure
integration is about container frameworks,
especially Docker.4 The speed required by
data science innovators using data lakes and
by ecosystem innovators using microservices
frameworks will demand infrastructure that
is broadly consistent with zero-integration
principles. That means rethinking the IT
stack and the roles of the operating system,
hypervisors, and automation tools such as
Chef and Puppet. Such an infrastructure also
means rethinking operations and managing by
chaos principles, where failures are expected,
their impacts are isolated, and restarts are
instantaneous. Just run it.
This is the story of the new integration fabric.
Read more about data lakes, microservices, and
containers in the articles in the PwC Technology
Forecast 2014, Issue 1. But always recall what
Ronald Reagan once said about government,
rephrased here in the context of technology:
“Integration is not the solution to our problem,
integration is the problem.” Change-forward-friendly integration means doing whatever it
takes to bring time to integration to zero.
4 See the article “Containers are redefining application-infrastructure integration,” PwC Technology Forecast 2014, Issue 1,
http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/open-source-application-deployment-containers.jhtml.
Acknowledgments
Advisory
US Technology Consulting Leader
Gerard Verweij
Chief Technologist
Chris Curran
New IT Platform Leader
Michael Pearl
Strategic Marketing
Lock Nelson
Bruce Turner

Reviewers
Rohit Antao
Phil Berman
Julien Furioli
Oliver Halter
Glen Hobbs
Henry Hwangbo
Rajesh Rajan
Hemant Ramachandra
Ritesh Ramesh
Zach Sachen
US Thought Leadership
Partner
Rob Gittings
Center for Technology
and Innovation
Managing Editor
Bo Parker
Editors
Vinod Baya
Alan Morrison
Contributors
Galen Gruman
Pini Reznik
Bill Roberts
Brian Stein
Editorial Advisor
Larry Marion
Copy Editor
Lea Anne Bantsari
US Creative Team
Infographics
Tatiana Pechenik
Chris Pak
Layout
Jyll Presley
Web Design
Jaime Dirr
Greg Smith
Special thanks
Eleni Manetas and Gabe Taylor
Mindshare PR
Wunan Li
Akshay Rao
Industry perspectives
During the preparation of this
publication, we benefited greatly
from interviews and conversations
with the following executives:
Darren Cunningham
Vice President of Marketing
SnapLogic
Michael Facemire
Principal Analyst
Forrester
Ben Golub
CEO
Docker, Inc.
Mike Lang
CEO
Revelytix
Ross Mason
Founder and Vice President
of Product Strategy
MuleSoft
Sean Martin
CTO
Cambridge Semantics
John Pritchard
Director of Platform Services
Adobe Systems
Sam Ramji
Vice President of Strategy
Apigee
Richard Rodger
CTO
nearForm
Dale Sanders
Senior Vice President
Health Catalyst
Ted Schadler
Vice President and Principal Analyst
Forrester
Brett Shepherd
Director of Big Data Product Marketing
Splunk
Eric Simone
CEO
ClearBlade
Sravish Sridhar
Founder and CEO
Kinvey
Michael Topalovich
CTO
Delivered Innovation
Michael Voellinger
Managing Director
ClearBlade
Glossary
Data lake
A single, very large repository for less-structured
data that doesn’t require up-front modeling, a
data lake can help resolve the nagging problem
of accessibility and data integration.
Microservices
architecture
Microservices architecture (MSA) breaks an application
into very small components that perform discrete
functions, and no more. The fine-grained, stateless, self-contained nature of microservices creates decoupling
between different parts of a code base and is what makes
them easy to update, replace, remove, or augment.
Linux
containers
and Docker
LinuX Containers (LXCs) allow different applications to share an operating system (OS) kernel, CPU, and RAM. Docker containers go further, adding layers of abstraction and deployment management features. Containers with these capabilities reduce coding effort, deployment time, and OS licensing costs.
Zero
integration
Every new integration approach tries to reduce
integration time to as close to zero as possible. Zero
integration means no time is wasted getting agreement
across the enterprise about what means what.
To have a deeper conversation about
this subject, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
[email protected]
Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]
Michael Pearl
Principal
New IT Platform Leader
+1 (408) 817 3801
[email protected]
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
[email protected]
Alan Morrison
Technology Forecast Issue
Editor and Researcher
Center for Technology and Innovation
+1 (408) 817 5723
[email protected]
About PwC’s
Technology Forecast
Published by PwC’s Center for
Technology and Innovation (CTI), the
Technology Forecast explores emerging
technologies and trends to help
business and technology executives
develop strategies to capitalize
on technology opportunities.
Recent issues of the Technology
Forecast have explored a number of
emerging technologies and topics
that have ultimately become many
of today’s leading technology and
business issues. To learn more
about the Technology Forecast, visit
www.pwc.com/technologyforecast.
About PwC
PwC US helps organizations and
individuals create the value they’re
looking for. We’re a member of the
PwC network of firms in 157 countries
with more than 195,000 people. We’re
committed to delivering quality in
assurance, tax and advisory services.
Find out more and tell us what matters
to you by visiting us at www.pwc.com.
Comments or requests?
Please visit www.pwc.com/
techforecast or send e-mail to
[email protected].
© 2014 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member
firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/
structure for further details. This content is for general information purposes only, and should not be used as a substitute for
consultation with professional advisors.