Technology Making sense of Big Data A quarterly journal
by user
Comments
Transcript
Technology Making sense of Big Data A quarterly journal
Technologyforecast Making sense of Big Data A quarterly journal 2010, Issue 3 In this issue 04 22 36 Tapping into the power of Big Data Building a bridge to the rest of your data Revising the CIO’s data playbook Contents Features 04 Tapping into the power of Big Data Treating it differently from your core enterprise data is essential. 22 Building a bridge to the rest of your data How companies are using open-source cluster-computing techniques to analyze their data. 36 Revising the CIO’s data playbook Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation. Interviews 14 The data scalability challenge John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years. 18 Creating a cost-effective Big Data strategy Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies. 34 Hadoop’s foray into the enterprise Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach. 46 New approaches to customer data analysis Razorfish’s Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data. Departments 02 Message from the editor 50 Acknowledgments 54 Subtext Message from the editor Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned “sabermetrician” (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams. James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James “doesn’t just understand information. He has shown people a different way of interpreting that information.” Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does. James challenged these assumptions. He asked critical questions that didn’t have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days’ rest does a reliever need? James’s answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can’t a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don’t use the best relievers to their maximum potential. The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn’t stop with just one set of statistics. The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It’s a continual process of investigation, one that’s focused on surfacing the best questions rather than assuming those questions have already been asked. Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren’t making best use of. And not all of the data is the same. Some of it has value, and some, not so much. The problem with this data has been twofold: (1) it’s difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive. 02 PricewaterhouseCoopers Technology Forecast Addressing these problems effectively doesn’t require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They’ve demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to handle different data differently. “Revising the CIO’s data playbook,” on page 36 emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of “gray data,” or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn’t yet have established value. Gray data analysis requires a different skill set—people who are more exploratory by nature. Enterprises shouldn’t treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they’re generating. With this approach, they can do what Bill James does and find better questions to ask. As always, in this issue we’ve included interviews with knowledgeable executives who have insights on the overall topic of interest: In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article, “Tapping into the power of Big Data,” on page 04, we begin with a consideration of exploratory analytics—methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack. • Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop’s adoption at search engine, social media, and financial services companies. The article, “Building a bridge to the rest of your data,” on page 22 highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides highvolume, low-cost computing with the help of opensource software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn’t have the means to analyze before. The buzz around Big Data and “cloud storage” (a term some vendors use to describe less-expensive clustercomputing techniques) is considerable, but the article, Message from the editor • John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years. • Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision. • Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods. Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. We welcome your feedback and your ideas for future research and analysis topics to cover. Tom DeGarmo Principal Technology Leader [email protected] 03 Tapping into the power of Big Data Treating it differently from your core enterprise data is essential. By Galen Gruman 04 PricewaterhouseCoopers Technology Forecast Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC. “In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence,” observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. “The challenge becomes what do you do with it all?” Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by Apache Software Foundation. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems. platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration—and the ability to generate value from data that hasn’t been scrubbed or fully modeled into relational tables. Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney’s still-nascent but illustrative effort. “In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence.” —Bud Albers of Disney This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster Tapping into the power of Big Data 05 Bringing Big Data under control Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. Like everyone else, Disney’s Big Data is huge, more unstructured than structured, and growing much faster than transactional data. The Disney Technology Shared Services Group, which is responsible for Disney’s core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units. Disney’s data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney’s Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks. The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs. Albers assumes that Big Data analysis is destined to become essential. “The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what’s in there, and figuring out how we deal with it,” he says. The team stumbled upon an inexpensive way to improve the business while pursuing more IT costeffectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division’s cost curve so IT expenses would rise more slowly than the business usage of IT—the opposite had been true—he turned to an approach that many companies use to make data centers more efficient: virtualization. Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to –3 percent, while increasing its annual processing growth from 17 percent to 45 percent. While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney’s many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems. During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions. Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn’t been able to before. It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers. “The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what’s in there, and figuring out how we deal with it.” —Bud Albers of Disney 06 PricewaterhouseCoopers Technology Forecast Improved experience 4 Internal business partners Site visitors Affiliated businesses Interface to cluster (MapReduce/Hive/Pig) 1 Usage data D-Cloud data cluster 2 Central logging service Core IT and business unit systems 3 Hadoop Metadata repository Figure 1: Disney’s Hadoop cluster and central logging service Disney’s new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more responsive and personalized user experience. Source: Disney, 2010 Tapping into the power of Big Data 07 Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000. Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows: “Before, I would have needed to figure on spending $3 million to $5 million for such an initiative,” Albers says. “Now I can do this without charging to the bottom line.” • Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that “the risk is lower due to all the other costs being lower.” Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid. • Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior. How Big Data analysis is different What should other enterprises anticipate from Hadoopstyle analytics? It is a type of exploratory BI they haven’t done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses that are tested before moving on to validation and consolidation. These methods could be used to answer questions such as, “What indicators might there be that predate a surge in Web traffic?” or “What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?” or “What’s the value of an influencer on Web traffic through his or her social network?” See the sidebar “Opportunities for Big Data insights” for more examples of the kinds of questions that can be asked of Big Data. 08 Opportunities for Big Data insights • Customer churn, based on analysis of call center, help desk, and Web site traffic patterns • Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil Disney and others explore their data without a lot of preconceptions. They know the results won’t be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense. Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects—such as sales and purchasing records, product development costs, and new employee hire records—diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don’t work well for messy questions, they’ve been too expensive for questions you’re not sure there’s any value in asking, and they haven’t been able to scale to analyze large data sets efficiently. (See Figure 2.) PricewaterhouseCoopers Technology Forecast Large data sets Small data sets Big Data (via Hadoop/MapReduce) Little analytical value Non-relational data Less scalability Traditional BI Relational data Figure 2: Where Big Data fits in Other companies have also tapped into the excitement brewing over Big Data technologies. Several Weboriented companies that have always dealt with huge amounts of data—such as Yahoo, Twitter, and Google—were early adopters. Now, more traditional companies—such as TransUnion, a credit rating service—are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized. Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues. Source: PricewaterhouseCoopers, 2010 In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn’t afford to consider in the past. “Part of the analytics role is to challenge assumptions,” Estes says. BI systems aren’t designed to do that; instead, they’re designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes. Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That’s different from the “single source of truth” approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. “We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang,” Albers says. Tapping into the power of Big Data High processing power Low processing power Enterprises facing scaling and capacity/cost problems Google, Amazon, Facebook, Twitter, etc. (all use nonrelational data stores for reasons of scale) Most enterprises Cloud users with low compute requirements Centralized compute architecture Distributed compute architecture Figure 3: The data architecture landscape in 2010 Source: PricewaterhouseCoopers, 2010 Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn’t have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools. 09 The ways different enterprises approach Big Data It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools. “At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately,” says John Parkinson, TransUnion’s acting CTO. “We want to do accurate but approximate matching and categorization in very large low-structure data sets.” Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. “MapReduce also, at least in its theoretical formulation, is very amenable to highly parallelized execution,” which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes. However, Parkinson thinks Hadoop and MapReduce are too immature. “MapReduce really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. As for Hadoop, they have done a good job, but it’s like a lot of open-source software—80 percent done. There were limits in the code that broke the stack well before what we thought was a good theoretical limit.” Parkinson echoes many IT executives who are skeptical of open-source software in general. “If I have a bunch of engineers, I don’t want them spending their day being the technology support environment for what should be a product in our architecture,” he says. That’s a legitimate point of view, especially considering the data volumes TransUnion manages—8 petabytes from 83,000 sources in 4,000 formats and growing— and its focus on mission-critical capabilities for this data. Credit scoring must run successfully and deliver top-notch credit scores several times a day. It’s an operational system that many depend on for critical business decisions that happen millions of times a day. (For more on TransUnion, see the interview with Parkinson on page 14.) 10 Disney’s system is purely intended for exploratory efforts or at most for reporting that eventually may feed up to product strategy or Web site design decisions. If it breaks or needs a little retooling, there’s no crisis. But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.” Data architect Estes also sees responsiveness in open-source development that’s laudable. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.” Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.” Albers’ points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later, and are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps. Asking new business questions Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations’ top line. “There’s a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously,” says Tom Urquhart, chief architect at PricewaterhouseCoopers. Think of it as a “Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured,” he says. PricewaterhouseCoopers Technology Forecast The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn’t been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don’t discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts. These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.) Pattern analysis mashup services There’s another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data–based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line—based on their previous online and offline behavior with Disney. O’Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O’Reilly’s research director. Exploration Pre-consolidated data (never collected) All collected data tio n Insight da Information loss oli ns on ati lid o ns Summary enterprise data Information loss Co Co Summary departmental data All collected data Summary departmental data Less information loss Information loss Summary enterprise data Greater insight Figure 4: Information loss in the data consolidation process Source: PricewaterhouseCoopers, 2010 Tapping into the power of Big Data 11 Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers. Exploiting the power of human analysis Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren’t designed for, acting as an enabler and an amplifier of human analysis. Ad hoc exploration at a bargain Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better. Medical data is an example of the potential for ad hoc analysis. “A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around,” says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities. Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration. Analyzing data that wasn’t designed for BI Big Data also lets you work with “gray data,” or data from multiple sources that isn’t formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy—and thus cannot be examined by BI systems. One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. From there, you do further research using a mix of 12 information resources whose accuracy and completeness may be more established. People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data. Black data Classified Provenanced Cleaned Actual Gray data Raw Data and context commingled Noisy Hypothetical e.g., Financial system data e.g., Wikipedia Reviewed Confirming More trustworthy Managed by IT Unchecked Indicative Less trustworthy Managed by business unit Figure 5: Gray versus black data Source: PricewaterhouseCoopers, 2010 Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency—for example, trends in a neighborhood’s housing values or in local merchants’ sales patterns— to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located. The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found. All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis. PricewaterhouseCoopers Technology Forecast Why the time is ripe for Big Data Conclusion The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What’s changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately. PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don’t limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today. This potential can be realized only if you pull together and analyze all that data. Right now, there’s simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale. Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can’t accomplish. Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools that use familiar data-analysis tools— similar to those for SQL databases and Excel spreadsheets—to explore Big Data sources, thus broadening the ability to explore to a wider set of knowledge workers. Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries). Tapping into the power of Big Data Big Data analysis does not replace other systems. Rather, it supplements the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that these information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance—while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution. As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today’s information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors—collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information. We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. They’ll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible—and a lot of “noise” certain to mask what’s important. Insight-oriented analytics in this sea of information— where interactions cause untold ripples and eddies in the flow and delivery of business value—will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights. 13 The data scalability challenge John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years. Interview conducted by Vinod Baya and Alan Morrison John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion’s considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today—challenges he says that more companies will face in the near future. PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce? JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it. PwC: What class of problem would that be? JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, lowstructured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions 14 of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t. The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts. From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance. PricewaterhouseCoopers Technology Forecast We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit. Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be awhile before it stabilizes to the point that I want to bet revenue on it. PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that’s not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to? JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won’t pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don’t always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don’t mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we’ve discovered that it doesn’t work. PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge? JP: My dilemma is that things that are known to work usually don’t scale to what we need—for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren’t mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn’t work, or goes away, I can fit something else into its place relatively easily. That’s why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn’t matter if I have to unplug one in 18 months or two years because something better came along. I can’t do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative. “I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture.” The data scalability challenge 15 PwC: What are you using in place of something like Hadoop? PwC: Of the three kinds of data, which is the most challenging? JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem. JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going—where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing. PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text? JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That’s the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it’s generally hooked together around a well-understood set of identifiers. But the cost of this data is essentially free—we don’t pay for it. It’s also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build. At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day. One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms. 16 PwC: It’s more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct? JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that. PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years? JP: Yes, I believe so. PwC: What are some of the other problems you think will become more widespread? JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast PricewaterhouseCoopers Technology Forecast as they go for reads and writes and you can’t go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I’m not done until I’m done. Additionally, I have some bugs and new vulnerabilities to deal with. Today, we don’t have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won’t go faster than that. The average IT department doesn’t worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue. We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don’t have a computational problem, but if I can’t shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can’t move the data further. The data scalability challenge PwC: Cloudera [a vendor offering a Hadoop distribution] would say bring the computation to the data. JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store. What we have discovered—because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world—is that once you go beyond a certain size, the parallelization management tools break. That’s why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren’t telling me how. n “We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It.” 17 Creating a cost-effective Big Data strategy Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies. Interview conducted by Galen Gruman and Alan Morrison Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management team includes Scott Thompson, vice president of architecture, and Matt Estes, principal data architect. The Technology Shared Services Group, located in Seattle, has a heritage dating back to the late 1990s, when Disney acquired Starwave and Infoseek. The group supports all the Disney businesses ($38 billion in annual revenue), managing the company’s portfolio of Web properties. These include properties for the studio, store, and park; ESPN; ABC; and a number of local television stations in major cities. In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s Web data analysis footprint without incurring additional cost by implementing a Hadoop cluster. Albers and team freed up budget for this cluster by virtualizing servers and eliminating other redundancies. PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective? BA: We try and understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you’re selling to millions, you’re trying to understand the different audiences and how they connect. 18 One of the things I’ve been telling my folks from a data perspective is that you don’t send terabytes one way to be mated with a spreadsheet on the other side, right? We’re thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We’re trying to stay ahead of where all the businesses are. In that respect, the questions I’m asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility? PricewaterhouseCoopers Technology Forecast We hope to do in other areas what we’ve done with content distribution networks [CDNs]. We’ve had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of Lost, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we’re going to turn it back, and I’m going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically. PwC: What are the other main strengths of the Technology Shared Services Group at Disney? BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one—shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It’s all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared—all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites. Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That’s the way we split up the world, if that makes sense. PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business? BA: It’s more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it’s a constant struggle to answer the question, “Do we have everything?” We’re headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value. It may have only a temporal level of importance, and so we’re trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business. “It’s more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today.” —Bud Albers Creating a cost-effective Big Data strategy 19 PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster? BA: Guys like me never get called when it’s all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren’t being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you’re going to use only 5 percent of? CPU utilization isn’t the only measure, but it’s the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization on fiveminute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period. Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn’t have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service. We call this our D-Cloud effort. Another step in this effort was moving to a REST [REpresentational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. We can then use the Hadoop and MapReduce paradigm to go after that data. 20 PwC: How does the central logging service fit into your overall strategy? ST: As we looked at it, we said, it’s not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we’re working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what’s going on, and make better decisions on the basis of what you learned. PwC: How do you evolve so that the data strategy is really served well, so that it’s more of a data-driven approach in some ways? ME: On one side, you have a very transactional OLTP [online transactional processing] kind of world, RDBMSs [relational database management systems], and major vendors that we’re using there. On the other side of it, you have traditional analytical warehousing. And where we’ve slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There’s a freedom that’s derived from blending these two kinds of data. Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances. Then the key will be putting an expert system in place. That will give us the ability to really understand what’s going on in the actual operational environment. We’re starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold. PricewaterhouseCoopers Technology Forecast PwC: This kind of information doesn’t go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps. ST: We think storing the unstructured data in its raw format is what’s coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a place that someone can dig around in, but you keep the data in its raw format and pull out only what you need. BA: The wonderful thing about where we’re headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we’re compliant from a legal perspective with licensing and so forth. After that’s taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, I’m going to go spend how much for Teradata? We’re using the basic premise of the cloud, and we’re using those techniques of standardizing the interface to virtualize and drive cost out. I’m taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing. ME: Refining some of this reinvestment in new capabilities doesn’t have to be put in the category of traditional “$5 million projects” companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000. BA: It’s then a matter of how you’re redeploying an investment in resources that you’ve already made as a corporation. It’s a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off. I can try it, and I don’t have to get great big governancebased permission to do it, because it’s not a bet of half the staff and all of this stuff. It’s, OK, let’s get something on the ground, let’s work with the business unit, let’s pilot it, let’s go somewhere where we know we have a need, let’s validate it against this need, and let’s make sure that it’s working. It’s not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast. n “We think storing the unstructured data in its raw format is what’s coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer.” —Scott Thompson Creating a cost-effective Big Data strategy 21 Building a bridge to the rest of your data How companies are using open-source cluster-computing techniques to analyze their data. By Alan Morrison 22 PricewaterhouseCoopers Technology Forecast As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data— as if projects such as Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn’t exist. In a May 2008 blog, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops1 of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. “Have the supercomputer folks been bypassed and don’t even know it?” Turner wondered.2 Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google’s research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google’s wake. Many of them were Web companies that had data processing scalability challenges similar to Google’s. Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google’s parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes. By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O’Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business. By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O’Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Building a bridge to the rest of your data 23 “Hadoop will process the data set and output a new data set, as opposed to changing the data set in place.” —Amr Awadallah of Cloudera What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data. Hadoop clusters Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray and others—the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.3 Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.4 Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include: • Price/performance over peak performance—The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo’s Hadoop clusters have won Gray’s terabyte sort benchmarking test.5 24 • Software tolerance for hardware failures—When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O’Reilly Media, says, “If you are going to have 40 or 100 machines, you don’t expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time.” • High compute power per query—The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem. • Modularity and extensibility—Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture. Hadoop isn’t intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) and other relational data systems. They don’t work well with transactional data or records that require frequent updating. “Hadoop will process the data set and output a new data set, as opposed to changing the data set in place,” says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop. A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah’s words, “You move your processing to where your data lives.” Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks—the kind used in most PCs and servers—and Gigabit Ethernet for most network interconnections. (See Figure 1.) PricewaterhouseCoopers Technology Forecast Client Switch 1000Mbps Switch 100Mbps Switch 100Mbps Typical node setup 2 quad-core Intel Nehalem 24GB of RAM Task tracker/ DataNode JobTracker Task tracker/ DataNode NameNode Task tracker/ DataNode Task tracker/ DataNode Effective file space per node: 20TB Task tracker/ DataNode Task tracker/ DataNode Claimed benefits Task tracker/ DataNode Task tracker/ DataNode Task tracker/ DataNode Task tracker/ DataNode Rack 12 1TB SATA disks (non-RAID) 1 Gigabit Ethernet card Cost per node: $5,000 Rack Linear scaling at $250 per user TB (versus $5,000–$100,000 for alternatives) Compute placed near the data and fewer writes limit networking and storage costs Modularity and extensibility Figure 1: Hadoop cluster layout and characteristics Source: IBM, 2008, and Cloudera, 2010 Building a bridge to the rest of your data 25 “Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces.” —Chris Wensel of Concurrent The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, “The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative.”6 The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, Disney’s Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multiterabyte cluster with Hadoop and other free opensource tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, “Tapping into the power of Big Data,” on page 04.) These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon’s or Cloudera’s distribution on the Amazon Elastic Compute Cloud (EC2) platform. The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon’s EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press. 26 Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7 “Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs],” says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) “I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That’s extraordinarily powerful.” The Hadoop Distributed File System The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they’re a “data operating system.” This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. “It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it,” Awadallah says. HDFS, like all Hadoop tools, is Java based. An HDFS contains two kinds of nodes: • A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs • Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode PricewaterhouseCoopers Technology Forecast HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage. HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. “HDFS was never designed for structured data and therefore it’s not optimal to perform queries on structured data,” says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.8 Equally important are fault tolerance within the same disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method. Some developers are structuring data in ways that are suitable for HDFS; they’re just doing it differently from the way relational data would be structured. Nathan Marz, a lead engineer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. “A lot of people think that Hadoop is meant for unstructured data, like log files,” Marz says. “While Hadoop is great for log files, it’s also fantastic for strongly typed, structured data.” For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.9 (See the discussion of Thrift later in this article.) Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce. Client NameNode (metadata) Files File A File A Blocks 1, 2, 4 3, 5 DataNode DataNode DataNode DataNode 1 5 4 5 2 4 2 3 3 1 2 5 Figure 2: The Hadoop Distributed File System, or HDFS Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008 Input data Input applications Less-structured information such as: log files messages images Cascading Thrift Zookeeper Pig Output applications Core Hadoop data processing Mashups RDBMS apps BI systems 1 Jobs 1 M 1 2 M 2 2 3 M R Results 3 3 M Map R Reduce 64MB blocks Figure 3: Hadoop ecosystem overview Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010 Building a bridge to the rest of your data 27 MapReduce MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, “it hides the details of parallelization” and the other nuts and bolts of HDFS.10 MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn’t mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. “I avoid using MapReduce directly at all cost,” Marz says. “I actually do almost all my MapReduce work with a library called Cascading.” The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as Hypertext Markup Language (HTML). MapReduce’s main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems and less adaptation of the programs is required. The following sections examine some of these tools. Data store 1 Data store n Input key-value pairs Input key-value pairs Map key 1 values Map key 2 values Barrier ... key 1 values key 3 values key 2 values key 3 values Aggregates intermediate values by output key ... Barrier key 1 intermediate values key 2 intermediate values key 3 intermediate values Reduce Reduce Reduce final key 1 values final key 2 values final key 3 values Figure 4: MapReduce phases Source: Google, 2004, and Cloudera, 2009 28 PricewaterhouseCoopers Technology Forecast “You can code in whatever JVM-based language you want, and then shove that into the cluster.” —Chris Wensel of Concurrent Cascading Wensel, who created Cascading, calls it an alternative API to MapReduce, a single library of operations that developers can tap. It’s another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, “you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster.” Wensel wanted to obviate the need for “thinking in MapReduce.” When using Cascading, developers don’t think in key-value pair terms—they think in terms of fields and lists of values called “tuples.” A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through “pipe” assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.) Map Reduce [f1, f2, ...] P Map [f1, f2, ...] P [f1, f2, ...] So Assembly Flow A A A A A A A A MR MR MR MR Cluster Job Job Reduce [f1, f2, ...] P Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz’s terms, “compile to MapReduce.” In this way, Cascading smoothes the bumpy MapReduce terrain so more developers—including those who work mainly in Client scripting languages—can build flows. (See Figure 6.) [f1, f2, ...] P A [f1, f2, ...] P P MR [f1, f2, ...] Pipe assembly Hadoop MR (translation to MapReduce) MapReduce jobs Si Figure 6: Cascading assembly and flow [f1, f2, ...] So Si P Tuples with field names Source Sink Pipe Source: Concurrent, 2010 Figure 5: A Cascading assembly Source: Concurrent, 2010 Building a bridge to the rest of your data 29 Some useful tools for MapReduce-style analytics programming Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don’t seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.11 A few of the more prominent tools relevant to analytics, and used by developers we’ve interviewed, are listed in the sections that follow. With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.� This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls “dynamic development.” Any code entered in the console interface, he points out, is automatically compiled on the fly.� Clojure Thrift Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that’s rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced “closure.” Clojure combines a LISP library with Java libraries. Clojure’s mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delayprediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for “getting the right view into unstructured data from heterogeneous sources,” says Bradford Cross, FlightCaster co-founder.� Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can “define all the necessary data structures and interfaces for a complex service in a single short file.”� LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he’s done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code “uses a huge amount of memory-resident data,” such as lists of proper nouns, text categories, common last names, and nationalities. “Getting the right view into unstructured data from heterogeneous sources can be quite tricky.” —Bradford Cross of FlightCaster 30 A more important aspect of Thrift, according to BackType’s Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called NoSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis. Marz uses Thrift’s serialization features, which turn objects�into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes. BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema. PricewaterhouseCoopers Technology Forecast Marz’s use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data. Bob Gender male Age Charlie Gender female Gender male 25 Age Apache Thrift Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following: • Multidimensional map store—Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google’s Bigtable. 39 Alice Age Open-source, non-relational data stores 22 Language: C++ Figure 7: An example of a social graph modeled using Thrift schema Source: Nathan Marz, 2010 BackType doesn’t just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.12 • Key-value store—Each record consists of a key, or unique identifier, mapped to one or more values. • Graph store—Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both. • Document store—Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents. Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available. Map Key-value Document Graph HBase Tokyo Cabinet/Tyrant MongoDB Resource Description Framework (RDF) Hypertable Project Voldemort CouchDB Neo4j Cassandra Redis Xindice InfoGrid Table 1: Example open-source, non-relational data stores Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010 Building a bridge to the rest of your data 31 “We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.” —Scott Thompson of Disney Other related technologies and vendors A comprehsensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they’ve been mentioned elsewhere in this issue: • Pig—A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets “directly from the console” than is possible using MapReduce, according to author Tom White. • Hive—Hive is designed as “mainly an ETL [extract, transform, and load] system” for use at Facebook, according to Chris Wensel. • Zookeeper—Zookeeper provides an interface for creating distributed applications, according to Apache. Big Data covers many vendor niches, and some vendors’ products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar “Selected Big Data tool vendors.”) Conclusion Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop’s popularity include: • Open, dynamic development—The Hadoop/ MapReduce environment offers cost-effective distributed computing to a community of opensource programmers who’ve grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as Clojure. The openness and interaction can lead to faster development cycles. 32 • Cost-effective scalability—Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president of infrastructure at the Disney Technology Shared Services Group, says, “We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.” • Fault tolerance—Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used. • Suitability for less-structured data—Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera’s Awadallah calls “complex” data. Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don’t have an RDBMS mentality. They have an NLP mentality, and they’re focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn’t have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them. PricewaterhouseCoopers Technology Forecast Selected Big Data tool vendors Amazon Amazon provides a Hadoop framework on its Elastic Compute Cloud (EC2) and S3 storage service it calls Elastic MapReduce. Appistry Appistry’s CloudIQ Storage platform offers a substitute for HDFS, one designed to eliminate the single point of failure of the NameNode. Cloudera Cloudera takes a Red Hat approach to Hadoop, offering its own distribution on EC2/S3 with management tools, training, support, and professional services. Cloudscale Cloudscale’s first product, Cloudcel, marries an Excel-based interface to a back end that’s a massively parallel stream processing engine. The product is designed to process stored, historical, or real-time data. Concurrent Concurrent developed Cascading, for which it offers licensing, training, and support. Drawn to Scale Drawn to Scale offers an analytical and transactional database product on Hadoop and HBase, with occasional consulting. IBM IBM introduced a distribution of Hadoop called BigInsights in May 2010. The company’s jStart team offers briefings and workshops on Hadoop pilots. IBM BigSheets acts as an aggregation, analysis, and visualization point for large amounts of Web data. 1 FLOPS stands for “floating point operations per second.” Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second. 2 Brough Turner, “Google Surpasses Supercomputer Community, Unnoticed?” May 20, 2008, http://blogs.broughturner.com/ communications/2008/05/google-surpasses-supercomputercommunity-unnoticed.html (accessed April 8, 2010). 3 See, for example, Tim Kientzle, “Beowulf: Linux clustering,” Dr. Dobb’s Journal, November 1, 1998, Factiva Document dobb000020010916dub100045 (accessed April 9, 2010). 4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, “Web Search for a Planet: The Google Cluster Architecture,” Google Research Publications, http://research.google.com/archive/googlecluster.html (accessed April 10, 2010). 5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/ (accessed April 9, 2010). 6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2009), 4. 7 See Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” The New York Times Open Blog, November 1, 2007, http://open. blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/(accessed June 4, 2010) and Bill Snyder, “Cloud Computing: Not Just Pie in the Sky,” CIO, March 5, 2008, Factiva Document CIO0000020080402e4350000 (accessed March 28, 2010). 8 See “HadoopDB” at http://db.cs.yale.edu/hadoopdb/hadoopdb.html (accessed April 11, 2010). 9 Nathan Marz, “Thrift + Graphs = Strong, flexible schemas on Hadoop,” http://nathanmarz.com/blog/schemas-on-hadoop/ (accessed April 11, 2010). 10 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, December 2004, http://labs.google.com/papers/mapreduce.html (accessed April 22, 2010). 11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http:// www.uspto.gov. For an example of the participation by Google and IBM in Hadoop’s development, see “Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges,” Google press release, October 8, 2007, http://www.google.com/intl/en/ press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010). 12 See the Apache site at http://apache.org/ for descriptions of many tools that take advantage of MapReduce and/or HDFS that are not profiled in this article. Microsoft Microsoft Pivot uses the company’s Deep Zoom technology to provide visual data browsing capabilities for XML files. Azure Table services is in some ways comparable to Bigtable or HBase. (See the interview with Mark Taylor and Ray Velez of Razorfish on page 46.) ParaScale ParaScale offers software for enterprises to set up their own public or private cloud storage environments with parallel processing and large-scale data handling capability. Building a bridge to the rest of your data 33 Hadoop’s foray into the enterprise Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach. Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya Amr Awadallah is vice president of engineering and CTO at Cloudera, a company that offers products and services around Hadoop, an open-source technology that allows efficient mining of large, complex data sets. In this interview, Awadallah provides an overview of Hadoop’s capabilities and how Cloudera customers are using them. PwC: Were you at Yahoo before coming to Cloudera? AA: Yes. I was with Yahoo from mid-2000 until mid2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together. That is when I first experienced Hadoop. Its model of “mine first, govern later” fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it’s best to complement a well-governed data mart with a more agile complex data processing system like Hadoop. PwC: How did Yahoo start using Hadoop? AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, 34 and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today. Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today, Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting. PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]? AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I’m going to be loading in, these are the types of columns I’m going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track. With Hadoop, it’s the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [JavaScript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on PricewaterhouseCoopers Technology Forecast “We are not talking about a replacement technology for data warehouses— let’s be clear on this. No customers are using Hadoop in that fashion.” the fly when reading it out. This approach lets you extract the columns that map to the data structure you’re interested in. Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much quicker without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types. Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL (extract, transform, and load) processes to reload the full history of that metric]. PwC: What about the cost advantages? AA: The cost basis is 10 to 100 times cheaper than other solutions. But it’s not just about cost. Relational databases are really good at what they were designed for, which is running interactive SQL queries against well-structured data. We are not talking about a replacement technology for data warehouses—let’s be clear on this. No customers are using Hadoop in that fashion. They recognize that the nature of data is changing. Where is the data growing? It’s growing around complex data types. Is a relational container the best and most Hadoop’s foray into the enterprise interesting place to ask questions of complex plus relational data? Probably not, although organizations still need to use, collect, and present relational data for questions that are routine and require, in some cases, a real-time response. PwC: How have companies benefited from querying across both structured and complex data? AA: When you query against complex data types, such as Web log files and customer support forums, as well as against the structured data you have already been collecting, such as customer records, sales history, and transactions, you get a much more accurate answer to the question you’re asking. For example, a large credit card company we’ve worked with can identify which transactions are most likely fraudulent and can prioritize which accounts need to be addressed. PwC: Are the companies you work with aware that this is a totally different paradigm? AA: Yes and no. The main use case we see is in companies that have a mix of complex data and structured data that they want to query across. Some large financial institutions that we talk to have 10, 20, or even hundreds of Oracle systems—it’s amazing. They have all of these file servers storing XML files or log files, and they want to consolidate all these tables and files onto one platform that can handle both data types so they can run comprehensive queries. This is where Hadoop really shines; it allows companies to run jobs across both data types. n 35 Revising the CIO’s data playbook Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation. By Jimmy Guterman 36 PricewaterhouseCoopers Technology Forecast Like pioneers exploring a new territory, a few enterprises are making discoveries by exploring Big Data. The terrain is complex and far less structured than the data CIOs are accustomed to. And it is growing by exabytes each year. But it is also getting easier and less expensive to explore and analyze, in part because software tools built to take advantage of cloud computing infrastructures are now available. Our advice to CIOs: You don’t need to rush, but do begin to acquire the necessary mind-set, skill set, and tool kit. These are still the early days. The prime directive for any CIO is to deliver value to the business through technology. One way to do that is to integrate new technologies in moderation, with a focus on the long-term opportunities they may yield. Leading CIOs pride themselves on waiting until a technology has proven value before they adopt it. Fair enough. However, CIOs who ignore the Big Data trends described in the first two articles risk being marginalized in the C-suite. As they did with earlier technologies, including traditional business intelligence, business unit executives are ready to seize the Big Data opportunity and make it their own. This will be good for their units and their careers, but it would be better for the organization as a whole if someone—the CIO is the natural person—drove a single, central, cross-enterprise Big Data initiative. With this in mind, PricewaterhouseCoopers encourages CIOs to take these steps: • Start to add the discipline and skill set for Big Data to your organizations; the people for this may or may not come from existing staff. • Set up sandboxes (which you can rent or buy) to experiment with Big Data technologies. • Understand the open-source nature of the tools and how to manage risk. Enterprises have the opportunity to analyze more kinds of data more cheaply than ever before. It is also important to remember that Big Data tools did not originate with vendors that were simply trying to create new markets. The tools sprung from a real need among the enterprises that first confronted the scalability and cost challenges of Big Data— challenges that are now felt more broadly. These pioneers also discovered the need for a wider variety of talent than IT has typically recruited. Enterprises have the opportunity to analyze more kinds of data more cheaply than ever before. It is also important to remember that Big Data tools did not originate with vendors that were simply trying to create new markets. Revising the CIO’s data playbook 37 Big Data lessons from Web companies Today’s CIO literature is full of lessons you can learn from companies such as Google. Some of the comparisons are superficial because most companies do not have a Web company’s data complexities and will never attain the original singleness of purpose that drove Google, for example, to develop Big Data innovations. But there is no niche where the development of Big Data tools, techniques, mind-set, and usage is greater than in companies such as Google, Yahoo, Facebook, Twitter, and LinkedIn. And there is plenty that CIOs can learn from these companies. Every major service these companies create is built on the idea of extracting more and more value from more and more data. For example, the 1-800-GOOG-411 service, which individuals can call to get telephone numbers and addresses of local businesses, does not merely take an ax to the high-margin directory assistance services run by incumbent carriers (although it does that). That is just a by-product. More important, the 800-number service has let Google compile what has been described as the world’s largest database of spoken language. Google is using that database to improve the quality of voice recognition in Google Voice, in its sundry mobile-phone applications, and in other services under development. Some of the ways companies such as Google capture data and convert it into services are listed in Table 1. Service Data that Web companies capture Self-serve advertising Ad-clicking and -picking behavior Analytics Aggregated Web site usage tracking Social networking Sundry online Browser Limited browser behaviors E-mail Words used in e-mails Search engine Searches and clicking information RSS feeds Detailed reading habits Extra browser functionality All browser behavior View videos All site behavior Free directory assistance Database of spoken words Table 1: Web portal Big Data strategy Source: PricewaterhouseCoopers, 2010 38 PricewaterhouseCoopers Technology Forecast “I see inspiration from the Google model ... just having lots of cheap stuff that you can use to crunch vast quantities of data.” —Phil Buckle of the UK National Policing Improvement Agency Many Web companies are finding opportunity in “gray data.” Gray data is the raw and unvalidated data that arrives from various sources, in huge quantities, and not in the most usable form. Yet gray data can deliver value to the business even if the generators of that content (for example, people calling directory assistance) are contributing that data for a reason far different from improving voice-recognition algorithms. They just want the right phone number; the data they leave is a gift to the company providing the service. The new technologies and services described in the article, “Building a bridge to the rest of your data,” on page 22 are making it possible to search for enterprise value in gray data in agile ways at low cost. Much of this value is likely to be in the area of knowing your customers, a sure path for CIOs looking for ways to contribute to company growth and deepen their relationships with the rest of the C-suite. What Web enterprise use of Big Data shows CIOs, most of all, is that there is a way to think and manage differently when you conclude that standard transactional data analysis systems are not and should not be the only models. New models are emerging. CIOs who recognize these new models without throwing away the legacy systems that still serve them well will see that having more than one tool set, one skill set, and one set of controls makes their organizations more sophisticated, more agile, less expensive to maintain, and more valuable to the business. The business case Besides Google, Yahoo, and other Web-based enterprises that have complex data sets, there are stories of brick and mortar organizations that will be making more use of Big Data. For example, Rollin Ford, Wal-Mart’s CIO, told The Economist earlier this year, “Every day I wake up and ask, ‘How can I flow data better, manage data better, analyze data better?’” The answer to that question today implies a budget reallocation, with less-expensive hardware and software carrying more of the load. “I see inspiration from the Google model and the notion of moving into commodity-based computing—just having lots of cheap stuff that you can use to crunch vast quantities of data. I think that really contrasts quite heavily with the historic model of paying lots of money for really specialist stuff,” says Phil Buckle, CTO of the UK’s National Policing Improvement Agency, which oversees law enforcement infrastructure nationwide. That’s a new mind-set for the CIO, who ordinarily focuses on keeping the plumbing and the data it carries safe, secure, in-house, and functional. Seizing the Big Data initiative would give CIOs in particular and IT in general more clout in the executive suite. But are CIOs up to the task? “It would be a positive if IT could harness unstructured data effectively,” former Gartner analyst Howard Dresner, president and founder of Dresner Advisory Services, observes. “However, they haven’t always done a great job with structured data, and unstructured is far more complex and exists predominantly outside the firewall and beyond their control.” Tools are not the issue. Many evolving tools, as noted in the previous article, come from the open-source community; they can be downloaded and experimented with for low cost and are certainly up to supporting any pilot project. More important is the aforementioned mind-set and a new kind of talent IT will need. Revising the CIO’s data playbook 39 “The talent demand isn’t so much for Java developers or statisticians per se as it is for people who know how to work with denormalized data.” —Ray Velez of Razorfish To whom does the future of IT belong? The ascendance of Big Data means that CIOs need a more data-centric approach. But what kind of talent can help a CIO succeed in a more data-centric business environment, and what specific skills do the CIO’s teams focused on the area need to develop and balance? Hal Varian, a University of California, Berkeley, professor and Google’s chief economist, says, “The sexy job in the next 10 years will be statisticians.” He and others, such as IT and management professor Erik Brynjolfsson at the Massachusetts Institute of Technology (MIT), contend this demand will happen because the amount of data to be analyzed is out of control. Those who can make sense of the flood will reap the greatest rewards. They have a point, but the need is not just for statisticians—it’s for a wide range of analytically minded people. Today, larger companies still need staff with expertise in package implementations and customizations, systems integration, and business process reengineering, as well as traditional data management and business intelligence that’s focused on transactional data. But there is a growing role for people with flexible minds to analyze data and suggest solutions to problems or identify opportunities from that data. In Silicon Valley and elsewhere, where businesses such as Google, Facebook, and Twitter are built on the rigorous and speedy analysis of data, programming frameworks such as MapReduce (which works with Hadoop) and NoSQL (a database approach for nonrelational data stores) are becoming more popular. 40 Chris Wensel, who created Cascading (an alternative application programming interface [API] to MapReduce) and straddles the worlds of startups and entrenched companies, says, “When I talk to CIOs, I tell them: ‘You know those people you have who know about data. You probably don’t use those people as much as you should. But once you take advantage of that expertise and reallocate that talent, you can take advantage of these new techniques.’” The increased emphasis on data analysis does not mean that traditional programmers will be replaced by quantitative analysts or data warehouse specialists. “The talent demand isn’t so much for Java developers or statisticians per se as it is for people who know how to work with denormalized data,” says Ray Velez, CTO at Razorfish, an interactive marketing and technology consulting firm involved in many Big Data initiatives. “It’s about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and the relational format, so the real skill set evolution doesn’t have quite as much to do with whether it’s Java or Python or other technologies.” Velez points to Bill James as a useful case. James, a baseball writer and statistician, challenged conventional wisdom by taking an exploratory mind-set to baseball statistics. He literally changed how baseball management makes talent decisions, and even how they manage on the field. In fact, James became senior advisor for baseball operations in the Boston Red Sox’s front office. PricewaterhouseCoopers Technology Forecast For example, James showed that batting average is less an indicator of a player’s future success than how often he’s involved in scoring runs—getting on base, advancing runners, or driving them in. In this example and many others, James used his knowledge of the topic, explored the data, asked questions no one had asked, and then formulated, tested, and refined hypotheses. Says Velez: “Our analytics team within Razorfish has the James types of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people. They’re not very easy to find. There is an aspect of James that just has to do with boldness and courage, a willingness to challenge those who are in the habit of using metrics they’ve been using for years.” The CIO will need people throughout the organization who have all sorts of relevant analysis and coding skills, who understand the value of data, and who are not afraid to explore. This does not mean the end of the technology- or application-centric organizational chart of the typical IT organization. Rather, it means the addition of a data-exploration dimension that is more than one or two people. These people will be using a blend of tools that differ depending on requirements, as Table 2 illustrates. More of the tools will be open source than in the past. Skills Tools (a sampler) Comments Natural language processing and text mining Clojure, Redis, Scala, Crane, other Java functional language libraries, Python Natural Language ToolKit To some extent, each of these serves as a layer of abstraction on top of Hadoop. Those familiar keep adding layers on top of layers. FlightCaster, for example, uses a stack consisting of Amazon S3 -> Amazon EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure.1 Data mining R, Matlab Scripting and NoSQL database programming skills Python and related frameworks, HBase, Cassandra, CouchDB, Tokyo Cabinet R is more suited to finance and statistics, whereas Matlab is more engineering oriented.2 These lend themselves to or are based on functional languages such as LISP, or comparable to LISP. CouchDB, for example, is written in Erlang.3 (See the discussion of Clojure and LISP on page 30.) Table 2: New skills and tools for the IT department Source: Cited online postings and PricewaterhouseCoopers, 2008–2010 Pete Skomoroch, “How FlightCaster Squeezes Predictions from Flight Data,” Data Wrangling blog, August 24, 2009, http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data (accessed May 14, 2010). 1 Brendan O’Connor, “Comparison of data analysis packages,” AI and Social Science blog, February 23, 2009, http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (accessed May 25, 2010). 2 Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff to increase their own productivity. Some companies have created their own frameworks and released these to open source. See Klaas Bosteels, “Python + Hadoop = Flying Circus Elephant,” Last.HQ Last.fm blog, May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010). 3 Revising the CIO’s data playbook 41 “Every technology department has a skunkworks, no matter how informal—a sandbox where they can test and prove technologies. That’s how open source entered our organization. A small Hadoop installation might be a gateway that leads you to more open source. But it might turn out to be a neat little open-source project that sits by itself and doesn’t bother anything else.” —CIO of a small Massachusetts company Where do CIOs find such talent? Start with your own enterprise. For example, business analysts managing the marketing department’s lead-generation systems could be promoted onto an IT data staff charged with exploring the data flow. Most large consumer-oriented companies already have people in their business units who can analyze data and suggest solutions to problems or identify opportunities. These people need to be groomed and promoted, and more of them hired for IT, to enable the entire organization, not just the marketing department, to reap the riches. Set up a sandbox Although the business case CIOs can make for Big Data is inarguable, even inarguable business cases carry some risk. Many CIOs will look at the risks associated with Big Data and find a familiar canard. Many Big Data technologies—Hadoop in particular—are open source, and open source is often criticized for carrying too much risk. The open-source versus proprietary technology argument is nothing new. CIOs who have tried to implement open-source programs, from the Apache Web server to the Drupal content-management system, have faced the usual arguments against code being available to all comers. Some of those arguments, especially concerns revolving around security and reliability, verge on the specious. Google built its internal Web servers atop Apache. And it would be difficult to find a Big Data site as reliable as Google’s. 42 Clearly, one challenge CIOs face has nothing to do with data or skill sets. Open-source projects become available earlier in their evolution than do proprietary alternatives. In this respect, Big Data tools are less stable and complete than are Apache or Linux open-source tool kits. Introducing an open-source technology such as Hadoop into a mostly proprietary environment does not necessarily mean turning the organization upside down. A CIO at a small Massachusetts company says, “Every technology department has a skunkworks, no matter how informal—a sandbox where they can test and prove technologies. That’s how open source entered our organization. A small Hadoop installation might be a gateway that leads you to more open source. But it might turn out to be a neat little opensource project that sits by itself and doesn’t bother anything else. Either can be OK, depending on the needs of your company.” Bud Albers, executive vice president and CTO of Disney Technology Shared Services Group, concurs. “It depends on your organizational mind-set,” he says. “It depends on your organizational capability. There is a certain ‘don’t try this at home’ kind of warning that goes with technologies like Hadoop. You have to be willing at this stage of its maturity to maybe have a little higher level of capability to go in.” PricewaterhouseCoopers Technology Forecast PricewaterhouseCoopers agrees with those sentiments and strongly urges large enterprises to establish a sandbox dedicated to Big Data and Hadoop/ MapReduce. This move should be standard operating procedure for large companies in 2010, as should a small, dedicated staff of data explorers and modest budget for the efforts. For more information on what should be in your sandbox, refer to the article, “Building a bridge to the rest of your data,” on page 22. And for some ideas on how the sandbox could fit in your organization chart, see Figure 1. VP of IT Marketing manager Director of application development Web site manager Director of data analysis Sales manager Data Data analysis exploration team team Operations manager Finance manager Figure 1: Where a data exploration team might fit in an organization chart Source: PricewaterhouseCoopers, 2010 Revising the CIO’s data playbook Different companies will want to experiment with Hadoop in different ways, or segregate it from the rest of the IT infrastructure with stronger or weaker walls. The CIO must determine how to encourage this kind of experimentation. Understand and manage the risks Some of the risks associated with Big Data are legitimate, and CIOs must address them. In the case of Hadoop clusters, security is a pressing question: it was a feature added as the project developed, not cooked in from the beginning. It’s still far from perfect. Many open-source projects start as projects intended to prove a concept or solve a particular problem. Some, such as Linux or Mozilla, become massive successes, but they rarely start with the sort of requirements a CIO faces when introducing systems to corporate settings. Beyond open source, regardless of which tools are used to manipulate data, there are always risks associated with making decisions based on the analysis of Big Data. To give one dramatic example, the recent financial crisis was caused in part by banks and rating agencies whose models for understanding value at risk and the potential for securities based on subprime mortgages to fail were flat-out wrong. Just as there is risk in data that is not sufficiently clean, there is risk in data manipulation techniques that have not been sufficiently vetted. Many times, the only way to understand big, complicated data is through the use of big, complicated algorithms, which leaves a door open to big, catastrophic mistakes in analysis. Proactively preventing these mistakes from happening requires the risk mind-set, says Larry Best, IT risk manager at PricewaterhouseCoopers. “You have to think carefully about what can go wrong, do a quantitative analysis of the likelihood of such a mistake, and anticipate the impact if the mistake occurs.” 43 Table 3 includes a list of the risks associated with Big Data analysis and ways to mitigate them. Risk Mitigation tactic Over-reliance on insights gleaned from data analysis leads to loss Testing Inaccurate or obsolete data Maintain strong metadata management; unverified information must be flagged Analysis leads to paralysis Keep the sandbox related to the business problem or opportunity Security Keep the Hadoop clusters away from the firewall, be vigilant, ask chief security officer for help Buggy code and other glitches Make sure the team keeps track of modifications and other implementation history, since documentation isn’t plentiful Rejection by other parts of the organization Perform change management to help improve the odds of acceptance, along with quick successes Table 3: How to mitigate the risks of Big Data analysis Source: PricewaterhouseCoopers, 2010 Best points out that choosing the right controls to implement in a given risk scenario is essential. The only way to make sound choices is by adopting a risk mind-set and approach, that allow a focus on the most critical controls, he says. Enterprises simply don’t have the resources to implement blanket controls. The Control Objectives for Information and related Technology (COBIT) framework, a popular reference for IT risk assessment, is a “phone book of thousands of controls.” he says. Risk is not juggling a lot of balls. “Risk is knowing which balls are made out of rubber and which are made out of glass.” By nature and by work experiences, most CIOs are risk averse. Blue-chip CIOs hold off installing new versions of software until they have been proven beyond a doubt, and these CIOs don’t standardize on new platforms until the risk for change appears to be less than the risk of stasis. “The fundamental issue is whether IT is willing to depart from the status quo, such as an RDBMS [relational database management system], in favor 44 of more powerful technologies,” Dresner says. “This means massive change, and IT doesn’t always embrace change.” More forward-thinking IT organizations constantly review their software portfolio and adjust accordingly. In this case, the need to manipulate larger and larger amounts of data that companies are collecting is pressing. Even risk-averse CIOs are exploring the possibilities of Big Data for their businesses. Bud Mathaisel, CIO of the outsourcing vendor Achievo, divides the risks of Big Data and their solutions into three areas: • Accessibility—The data repository used for data analysis should be access managed. • Classification—Gray data should be identified as such. • Governance—Who’s doing what with this? Yes, Big Data is new. But accessibility, classification, and governance are matters CIOs have had to deal with for many years in many guises. PricewaterhouseCoopers Technology Forecast Conclusion At many companies, Big Data is both an opportunity (what useful needles can we find in a terabyte-sized haystack?) and a source of stress (Big Data is overwhelming our current tools and methods; they don’t scale up to meet the challenge). The prefix “tera” in “terabyte,” after all, comes from the Greek word for “monster.” CIOs aiming to use Big Data to add value to their businesses are monster slayers. CIOs don’t just manage hardware and software now; they’re expected to manage the data stored in that hardware and used by that software—and provide a framework for delivering insights from the data. From Amazon.com to the Boston Red Sox, diverse companies compete based on what data they collect and what they learn from it. CIOs must deliver easy, reliable, secure access to that data and develop consistent, trustworthy ways to explore and wrench wisdom from that data. CIOs do not need to rush, but they do need to be prepared for the changes that Big Data is likely to require. Perhaps the most productive way for CIOs to frame the issue is to acknowledge that Big Data isn’t merely a new model; it’s a new way to think about all data models. Big Data isn’t merely more data; it is different data that requires different tools. As more and more internal and external sources cast off more and more data, basic notions about the size and attributes of data sets are likely to change. With those changes, CIOs will be expected to capture more data and deliver it to the executive team in a manner that reveals the business— and how to grow it—in new ways. Web companies have set the bar high already. John Avery, a partner at Sungard Consulting Services, points to the YouTube example: “YouTube’s ability to index a data store of such immense size and then accrete additional analysis on top of that, as an ongoing process with no foresight into what those analyses would look like when the data was originally stored, is very, very impressive. That is something that has challenged folks in financial technology for years.” Revising the CIO’s data playbook As companies with a history of cautious data policies begin to test and embrace Hadoop, MapReduce, and the like, forward-looking CIOs will turn to the issues that will become more important as Big Data becomes the norm. The communities arising around Hadoop (and the inevitable open-source and proprietary competitors that follow) will grow and become influential, inspiring more CIOs to become more data-centric. The profusion of new data sources will lead to dramatic growth in the use and diversity of metadata. As the data grows, so will our vocabulary for understanding it. Whether learning from Google’s approach to Big Data, hiring a staff primed to maximize its value, or managing the new risks, forward-looking CIOs will, as always, be looking to enable new business opportunities through technology. As companies with a history of cautious data policies begin to test and embrace Hadoop, MapReduce, and the like, forwardlooking CIOs will turn to the issues that will become more important as Big Data becomes the norm. The communities arising around Hadoop (and the inevitable opensource and proprietary competitors that follow) will grow and become influential, inspiring more CIOs to become more data-centric. 45 New approaches to customer data analysis Razorfish’s Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data. Interview conducted by Alan Morrison and Bo Parker Mark Taylor is global solutions director and Ray Velez is CTO of Razorfish, an interactive marketing and technology consulting firm that is now a part of Publicis Groupe. In this interview, Taylor and Velez discuss how they use Amazon’s Elastic Compute Cloud (EC2) and Elastic MapReduce services, as well as Microsoft Azure Table services, for large-scale customer segmentation and other data mining functions. PwC: What business problem were you trying to solve with the Amazon services? MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, those data sets have not been able to be joined at the capacity level that we were able to achieve using the cloud. In our traditional data environment, we were limited to the scope of real clickstream data that we could actually access for processing and leveraging bandwidth, because we procured a fixed size of data. We managed and worked with a third party to serve that data center. This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together to really start categorizing that information so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers. 46 That capability gives us a much smarter way to apply rules to our clients’ merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight. RV: It was slightly different from a traditional database approach. The traditional approach just isn’t going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with. PwC: The scalability aspect of it seems clear. But is the nature of the data you’re collecting such that it may not be served well by a relational approach? RV: It’s not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of the normalized format, and then you can slice and dice and look at the data in lots of different ways. Until you PricewaterhouseCoopers Technology Forecast “Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.” —Mark Taylor put it into a data warehouse format or a denormalized EMR [Elastic MapReduce] or Bigtable type of format, you really don’t get the performance that you need when dealing with larger data sets. So it’s really that classic tradeoff; the data doesn’t necessarily lend itself perfectly to either approach. When you’re looking at performance and the amount of data, even a data warehouse can’t deal with the amount of data that we would get from a lot of our data sources. PwC: What motivated you to look at this new technology to solve that old problem? RV: Here’s a similar example where we used a slightly different technology. We were working with a large financial services institution, and we were dealing with massive amounts of spending-pattern and anonymous data. We knew we had to scale to Internet volumes, and we were talking about columnar databases. We wondered, can we use a relational structure with enough indexes to make it perform well? We experimented with a relational structure and it just didn’t work. So early on we jumped into what Microsoft Azure technology allowed us to do, and we put it into a Bigtable format, or a Hadoop-style format, using Azure Table services. The real custom element was designing the partitioning structure of this data to denormalize what would usually be five or six tables into one huge table with lots of columns, to the point where we started to bump up against the maximum number of columns they had. New approaches to customer data analysis We were able to build something that we never would have thought of exposing to the world, because it never would have performed well. It actually spurred a whole new business idea for us. We were able to take what would typically be a BusinessObjects or a Cognos application, which would not scale to Internet volumes. We did some sizing to determine how big the data footprint would be. Obviously, when you do that, you tend to have a ton more space than you require, because you’re duplicating lots and lots of data that, with a relational database table, would be lookup data or other things like that. But it turned out that when I laid the indexes on top of the traditionally relational data, the resulting data set actually had even greater storage requirements than performing the duplication and putting the data set into a large denormalized format. That was a bit of a surprise to us. The size of the indexes got so large. When you think about it, maybe that’s just how an index works anyway—it puts things into this denormalized format. An index file is just some closed concept in your database or memory space. The point is, we would have never tried to expose that to consumers, but we were able to expose it to consumers because of this new format. MT: The first commercial benefits were the ability to aggregate large and disparate data into one place and the extra processing power. But the next phase of benefits really derives from the ability to identify true relationships across that data. 47 “The stat section on [the MLB] site was always the most difficult part of the site, but the business insisted it needed it.” —Ray Velez Tiny percentages of these data sets have the most significant impact on our customer interactions. We are already developing new data measurement and KPI [key performance indicator] strategies as we’re starting to ask ourselves, “Do our clients really need all of the data and measurement points to solve their business goals?” PwC: Given these new techniques, is the skill set that’s most beneficial to have at Razorfish changing? RV: It’s about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and relational format, so I think the real skill set evolution doesn’t have quite as much to do with whether the tool of choice is Java or Python or other technologies; it’s more about do I understand normalized versus denormalized structures. MT: From a more commercial viewpoint, there’s a shift away from product type and skill set, which is based around constraints and managing known parameters, and very much more toward what else can we do. It changes the impact—not just in the technology organization, but in the other disciplines as well. I’ve already seen a profound effect on the old ways of doing things. Rather than thinking of doing the same things better, it really comes down to having the people and skills to meet your intended business goals. Using the Elastic MapReduce service with Cascading, our solutions can have a ripple effect on all of the non-technical business processes and engagements across teams. For example, conventional marketing segmentation used to involve teams of analysts who waded through various data sets and stages of processing and analysis to make sense of how a business might view groups of customers. Using the Hadoop-style alternative and Cascading, we’re able to identify unconventional relationships across many data points with less effort, and in the process create new segmentations and insights. 48 This way, we stay relevant and respond more quickly to customer demand. We’re identifying new variations and shifts in the data on a real-time basis that would have taken weeks or months, or that we might have missed completely, using the old approach. The analyst’s role in creating these new algorithms and designing new methods of campaign planning is clearly key to this type of solution design. The outcome of all this is really interesting and I’m starting to see a subtle, organic response to different changes in the way our solution tracks and targets customer behavior. PwC: Are you familiar with Bill James, a Major League Baseball statistician who has taken a rather different approach to metrics? James developed some metrics that turned out to be more useful than those used for many years in baseball. That kind of person seems to be the type that you’re enabling to hypothesize, perhaps even do some machine learning to generate hypotheses. RV: Absolutely. Our analytics team within Razorfish has the Bill James type of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people. They’re not very easy to find. And we have some of the leading folks in the industry. You know, a long, long time ago we designed the Major League Baseball site and the platform. The stat section on that site was always the most difficult part of the site, but the business insisted it needed it. The amount of people who really wanted to churn through that data was small. We were using Oracle at the time. We used the concept of temporary tables, which would denormalize lots of different relational tables for performance reasons, and that was a challenge. If I had the cluster technology we do now back in 1999 and 2000, we could have built to scale much more than going to two measly servers that we could cluster. PricewaterhouseCoopers Technology Forecast PwC: The Bill James analogy goes beyond batting averages, which have been the age-old metric for assessing the contribution of a hitter to a team, to measuring other things that weren’t measured before. RV: Even crazy stuff. We used to do things like, show me all of Derek Jeter’s hits at night on a grass field. PwC: There you go. Exactly. RV: That’s the example I always use, because that was the hardest thing to get to scale, but if you go to the stat section, you can do a lot of those things. But if too many people went to the stat section on the site, the site would melt down, because Oracle couldn’t handle it. If I were to rebuild that today, I could use an EMR or a Bigtable and I’d be much happier. PwC: Considering the size of the Bigtable that you’re able to put together without using joins, it seems like you’re initially able to filter better and maybe do multistage filtering to get to something useful. You can take a cyclical approach to your analysis, correct? RV: Yes, you’re almost peeling away the layer of the onion. But putting data into a denormalized format does restrict flexibility, because you have so much more power with a where clause than you do with a standard EMR or Bigtable access mechanism. It’s like the difference between something built for exactly one task versus something built to handle tasks I haven’t even thought of. If you peel away the layer of the onion, you might decide, wow, this data’s interesting and we’re going in a very interesting direction, so what about this? You may not be able to slice it that way. You might have to step back and come up with a different partition structure to support it. New approaches to customer data analysis PwC: Social media is helping customers become more active and engaged. From a marketing analysis perspective, it’s a variation on a Super Bowl advertisement, just scaled down to that social media environment. And if that’s going to happen frequently, you need to know what is the impact, who’s watching it, and how are the people who were watching it affected by it. If you just think about the data ramifications of that, it sort of blows your mind. RV: If you think about the popularity of Hadoop and Bigtable, which is really looking under the covers of the way Google does its search, and when you think about search at the end of the day, search really is recommendations. It’s relevancy. What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting. We used to say we could never re-create that infrastructure that Google has; Google is the second largest server manufacturer in the world. But now we have a way to create small targeted ways of doing what Google does. I think that’s pretty exciting. n “What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting.” —Ray Velez 49 Acknowledgments Advisory Sponsor & Technology Leader Tom DeGarmo US Thought Leadership Partner-in-Charge Tom Craren Center for Technology and Innovation Managing Editor Bo Parker Editors Vinod Baya, Alan Morrison Contributors Larry Best, Galen Gruman, Jimmy Guterman, Larry Marion, Bill Roberts Editorial Advisers Markus Anderle, Stephen Bay, Brian Butte, Tom Johnson, Krishna Kumaraswamy, Bud Mathaisel, Sean McClowry, Rajesh Munavalli, Luis Orama, Dave Patton, Jonathan Reichental, Terry Retter, Deepak Sahi, Carter Shock, David Steier, Joe Tagliaferro, Dimpsy Teckchandani, Cindi Thompson, Tom Urquhart, Christine Wendin, Dean Wotkiewich Copyedit Lea Anne Bantsari, Ellen Dunn Transcription Paula Burns 50 PricewaterhouseCoopers Technology Forecast Graphic Design Industry perspectives Art Director Jacqueline Corliss During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives and industry analysts: Designers Jacqueline Corliss, Suzanne Lau Illustrators Donald Bernhardt, Suzanne Lau, Tatiana Pechenik Photographers Tim Szumowski, David Tipling (Getty Images), Marina Waltz Bud Albers, executive vice president and chief technology officer, Technology Shared Services Group, Disney Matt Aslett, analyst, enterprise software, the451 John Avery, partner, Sungard Consulting Services Amr Awadallah, vice president, engineering, and chief technology officer, Cloudera Online Phil Buckle, chief technology officer, National Policing Improvement Agency Director, Online Marketing Jack Teuber Howard Dresner, president and founder, Dresner Advisory Services Designer and Producer Scott Schmidt Brian Donnelly, founder and chief executive officer, InSilico Discovery Reviewers Dave Stuckey, Chris Wensel Marketing Bob Kramer Special thanks to Ray George, Page One Rachel Lovinger, Razorfish Mariam Sughayer, Disney Matt Estes, principal data architect, Technology Shared Services Group, Disney Jim Kobelius, senior analyst, Forrester Research Doug Lenat, founder and chief executive officer, Cycorp Roger Magoulas, research director, O’Reilly Media Nathan Marz, lead engineer, BackType Bill McColl, founder and chief executive officer, Cloudscale John Parkinson, acting chief technology officer, TransUnion David Smoley, chief information officer, Flextronics Mark Taylor, global solutions director, Razorfish Scott Thompson, vice president, architecture, Technology Shared Services Group, Disney Ray Velez, chief technology officer, Razorfish Acknowledgments 51 pwc.com/us To have a deeper conversation about how this subject may affect your business, please contact: Tom DeGarmo Principal, Technology Leader PricewaterhouseCoopers +1 267-330-2658 [email protected] This publication is printed on Coronado Stipple Cover made from 30% recycled fiber; and Endeavor Velvet Book made from 50% recycled fiber, a Forest Stewardship Council (FSC) certified stock using 25% post-consumer waste. Recycled paper Subtext Big Data Data sets that range from many terabytes to petabytes in size, and that usually consist of less-structured information such as Web log files. Hadoop cluster A type of scalable computer cluster inspired by the Google Cluster Architecture and intended for cost-effectively processing less-structured information. Apache Hadoop The core of an open-source ecosystem that makes Big Data analysis more feasible through the efficient use of commodity computer clusters. Cascading A bridge from Hadoop to common Java-based programming techniques not previously usable in cluster-computing environments. NoSQL A class of non-relational data stores and data analysis techniques that are intended for various kinds of less-structured data. Many of these techniques are part of the Hadoop ecosystem. Gray data Data from multiple sources that isn’t formatted or vetted for specific needs, but worth exploring with the help of Hadoop cluster analysis techniques. Comments or requests? Please visit www.pwc.com/techforecast OR send e-mail to: [email protected] PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice. © 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.