InfoSphere BigInsights Overview PureData Ecosystem May 29, 2013 © 2013 IBM Corporation
What is Big Data?
All kinds of data
  – Large volumes
  – Valuable insight, but difficult to extract
Data is the new oil
  – In its raw form, oil has little value
  – Once processed and refined, it helps power the world
Big data is a hot topic because technology makes it possible to analyze ALL available data
“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis.”
Source: Matt Eastwood, IDC

Big Data Presents Big Opportunities
Big data characteristics
  – Variety
    • Manage and benefit from diverse data types and data structures
  – Volume
    • Scale from terabytes to zettabytes
  – Velocity
    • Analyze streaming data and large volumes of persistent data
  – Veracity
    • 1 in 3 business leaders don’t trust the information they use to make decisions
    • Establishing the Veracity of big data sources
Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner

Big Difference: Schema on Run
Regular database
  – Schema on load
  – Raw data -> Schema to filter -> Storage (pre-filtered data)
Big Data (Hadoop)
  – Schema on run
  – Raw data -> Storage (unfiltered, raw data) -> Schema to filter -> Output

Analyze Raw Data
Customer Need
  – Ingest data as-is into Hadoop and derive insight from it
  – Process large volumes of diverse data within Hadoop
  – Combine insights with the data warehouse
  – Low-cost ad-hoc analysis with Hadoop to test new hypotheses
Value Statement
  – Gain new insights from a variety and combination of different data sources
  – Overcome the prohibitively high cost of converting unstructured data sources to a structured format
  – Experiment with analysis of different data combinations to modify the analytic models in the data warehouse
Customer example
  – Financial Services Regulatory Org – managed additional data types and integrated with their existing data warehouse
Get started with: InfoSphere BigInsights

Reduce Costs with Hadoop
Customer Need
  – Reduce the overall costs to maintain data in the warehouse
  – Lower costs as data grows within the data warehouse
  – Reduce expensive infrastructure used for processing and transformations
Value Statement
  – Support existing and new workloads on the most cost effective alternative, while preserving existing access and queries
  – Lower storage costs
  – Reduce processing costs by pushing processing onto commodity hardware and the parallel processing of Hadoop
Customer example
  – Financial Services Firm – moved processing of applications and reports to Hadoop HBase while preserving existing queries
Get started with: InfoSphere BigInsights

What is BigInsights?
Flexible, enterprise-class support for processing large volumes of data
  – Based on Google’s MapReduce technology
  – Inspired by Apache Hadoop; compatible with its ecosystem and distribution
  – Well-suited to batch-oriented, read-intensive applications
  – Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
  – CPU + disks = “node”
  – Nodes can be combined into clusters
  – New nodes can be added as needed without changing
[Diagram: editions plotted by “Enterprise class” against “Breadth of capabilities”]
Enterprise Edition (Licensed)
  – Business process accelerators (“Apps”)
  – Text analytics, Spreadsheet-style analysis tool
  – RDBMS connectivity
  – Integrated Web-based console
  – Flexible job scheduler
  – Performance enhancements
  – Eclipse-based tooling
  – LDAP authentication
  – ....
Basic Edition (Free download)
  – Integrated Install
  – Web-based console

What’s so Special About Open Source Hadoop?
Storage
  • Distributed
  • Reliable
  • Commodity gear
MapReduce
  • Parallel Programming
  • Fault Tolerant
Scalable
  • New nodes can be added on the fly
Affordable
  • Massively parallel computing on commodity servers (easily and affordably available)
Flexible
  • Hadoop is schema-less – can absorb any type of data
Fault Tolerant
  • Through MapReduce software framework

Hadoop is Well Suited for Handling Big Data Challenges
  – Analyzing larger volumes may provide better results
  – Deriving new insights from combinations of data types
  – Larger data volumes are cost prohibitive with existing technology
  – Exploring data – a sandbox for ad-hoc analytics

InfoSphere BigInsights – A Closer Look
More Than Hadoop
  • Performance & workload optimizations
  • Unique text analytic engines
  • Spreadsheet-style visualization for data discovery & exploration
  • Built-in IDE & admin consoles
  • Enterprise-class security
  • High-speed connectors for integration with other systems
  • Analytical accelerators
[Diagram labels: User Interfaces (Visualization, Dev Tools, Admin Console); Accelerators (Application Accelerators, Text Analytics); Integration (Databases, Content Management, Information Governance); BigInsights Engine (Map Reduce +, Indexing, Workload Mgmt, Security); Apache Hadoop]

BigInsights Enterprise Edition Components
[Figure: component stack of IBM InfoSphere BigInsights. Section titles: Visualization & Discovery; Applications & Development; Advanced Analytic Engines; Workload Optimization; Runtime; Data Store; File System; Integration; Administration; Management; Security; Audit & History; Lineage. Legend: Open Source, IBM.]
Components named in the figure: BigSheets, Apps, Dashboard & Visualization, Workflow, Text Analytics, MapReduce, Pig & Jaql, Hive, Admin Console, Monitoring, JDBC, PureData, DB2, Streams, DataStage, Guardium, Platform Computing, Cognos, R, Text Processing Engine & Extractor Library (AQL+HIL), Adaptive Algorithms, Integrated Installer, Enhanced Security, Splittable Text Compression, Adaptive MapReduce, Flexible Scheduler, ZooKeeper, Oozie, Jaql, Lucene, Pig, Hive, Index, HCatalog, Flume, HBase, Sqoop, HDFS

Integrated Installation
Integrated installation of supported open source and IBM components
  – Seamless process for single node and cluster environments
  – Integrated installation of all selected components
Post-install validation of IBM and open source components
Roll Your Own
  – Many disparate components
  – Manual install
  – Leg-work required
    • What components?
    • Which versions?
BigInsights (Easiest)
  – Single install
    • No need to worry about components & versions
  – Install requires very little interaction
    • No extra prerequisites to download

BigInsights Web Console for Administration
Real-time iteration and visualization of the cluster
  – Add/remove nodes
  – HDFS file system administration
  – Configure components
  – Monitor workflow, jobs, storage
  – Metrics export for enterprise monitoring tools
  – Health summary
Discover and Analyze
  – Run or schedule pre-built applications
  – Discover information, entities
  – Load, explore data
  – Review log records

Enhanced tools for Business Users: Application linking
Application linking
  – Compose new applications from existing applications and BigSheets
  – Invoke analytics applications from the web console, including integration within BigSheets
New Apps to provide enhanced data import capability
  – REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
  – Sampling App that enables users to sample data for analysis
  – Subsetting App that enables users to subset data for data analysis

Data Visualization and Analytics
  – BigSheets – spreadsheet metaphor for exploring data
  – Rich platform for the analysis and visualization of Internet-scale data volumes
  – Ad-hoc analytics for LOB user
  – Analyze a variety of data – unstructured and structured

Enhanced tools for Business Users
A centralized dashboard to visualize analytic results, leveraging a new charting engine
  – BigSheets collections
  – Analytic application results
  – Monitoring metrics
BigSheets usability enhancements
  – The ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
  – Inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections and application of analytics to BigSheets columns, … etc

Text Analytics – Accurate Analysis of Unstructured Big Data
How it works
  – Parses text and detects meaning with annotators
  – Understands the context in which the text is analyzed
Hundreds of pre-built annotators
  – E.g. names, addresses, phone numbers, among others
  – Multilingual support
Distills structured info from unstructured text
  – Sentiment analysis
  – Consumer behavior
Unstructured text (document, email, etc)
  Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
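
As a rough illustration of how rule-based annotators can distill structured records from a passage like the one above, here is a minimal Python sketch. It uses plain regular expressions, not the BigInsights text analytics engine or its AQL language; the rule patterns, the `extract_players` function, and the record fields are hypothetical and tuned only to this sample text.

```python
import re

# Sample passage from the slide (apostrophe normalized to ASCII).
TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. Early in the second "
    "half, Netherlands' striker, Arjen Robben, had a breakaway, but the "
    "keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta "
    "scored for Spain for the win."
)

# Hypothetical building blocks: a two-word capitalized name, a one-word team.
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"
TEAM = r"[A-Z][a-z]+"

# Each rule plays the part of a tiny "annotator" for one phrasing.
RULES = [
    # e.g. "Netherlands' striker, Arjen Robben"
    re.compile(rf"(?P<team>{TEAM})' (?P<role>[a-z]+), (?P<name>{NAME})"),
    # e.g. "keeper for Spain, Iker Casillas"
    re.compile(rf"(?P<role>[a-z]+) for (?P<team>{TEAM}), (?P<name>{NAME})"),
    # e.g. "Winger Andres Iniesta scored for Spain"
    re.compile(
        rf"(?P<role>[A-Z][a-z]+) (?P<name>{NAME}) scored for (?P<team>{TEAM})"
    ),
]


def extract_players(text):
    """Return one {name, role, team} record per rule match, in rule order."""
    records = []
    for rule in RULES:
        for match in rule.finditer(text):
            records.append(
                {
                    "name": match["name"],
                    "role": match["role"].lower(),
                    "team": match["team"],
                }
            )
    return records


if __name__ == "__main__":
    for record in extract_players(TEXT):
        print(record)
```

Hand-written patterns like these break as soon as the phrasing changes, which is the gap that a library of pre-built, context-aware annotators is meant to close.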
Classification and Insight
Benefits
  – More precise and correct answers
  – 50% faster than manual method
  – Run faster text analysis

Improved Security
The web console leverages several mechanisms to ensure security
User authentication approaches
  – None, Flat file, LDAP, PAM
Authorization (role-based)
  – Four default roles
    • System administrator
    • Data administrator
    • Application administrator
    • User
  – Feature-based access control based on BigInsights role membership
Credentials store
  – Store potentially sensitive info: tokens, passwords, etc.
  – Maintained in BigInsights distributed file system
[Diagram labels: BigInsights; Authentication Store]

Enhanced tools for Developers
Unified tooling for Big Data Application Development Lifecycle
  – Enables users to sample data and define, test, and deploy analytics applications from the BigInsights Eclipse tools
  – Users can administer, execute, and monitor the deployed applications from the BigInsights Web Console
1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster

IBM Accelerators
Packaged Applications included with BigInsights
  – IBM Accelerator for Machine Data Analytics
    • Convert raw, dispersed machine logs and data files into informed, intelligent decisions
    • Locate log entries across servers and time zones
    • Add/Extract new log types to repository
  – IBM Accelerator for Social Data Analytics
    • Extract information from social media sites and build user profiles based on participation
    • Import social media data
    • Associate profiles with buzz and intent around brands, products, and companies

InfoSphere BigInsights – Advanced Features
Adaptive MapReduce
  – Mappers can decide at runtime to take on more work
  – Balance workload across Map tasks
  – Speeds up a class of jobs (e.g. jobs that process small files)
  – Supported on Jaql/Java jobs
Flexible Scheduler
  – Optimize response time for small jobs
  – Available in addition to FAIR and FIFO scheduling
  – Based on average response time metrics
    • Allocates maximum resources to small jobs, guaranteeing that these jobs are completed quickly
Compression
  – BigInsights LZO-based compression technology
  – Splittable: use multiple map tasks to process compressed text files
  – Good performance with a reasonable compression ratio (~60%)

BigInsights 2.1 features a variety of enhancements that deliver key Enterprise Hadoop capabilities
Big SQL
  • Comprehensive Standard ANSI SQL support to access data stored in BigInsights
  • Standards compliant JDBC & ODBC drivers
  • Leverages MapReduce parallelism in complex data sets
  • Direct access for low-latency in small queries, e.g. sub-second response to HBase queries
GPFS-FPO support
  • No single point of failure
  • Built-in High Availability
  • POSIX compliance
  • Enhanced Security with ACL support
  • Support for Storage Pools
  • SnapShot capability
High Availability
  • Out of the box High Availability
  • Seamless, automatic and transparent failover for HDFS NameNode and JobTracker
  • Eliminates admin intervention
  • Reduces downtime for recovery of the cluster
  • Hardware fencing to guarantee data integrity

BigInsights 2.1 updates to the Accelerators and BigSheets also enhance consumability
Machine Data Analytics Accelerator
  • New configuration UI enables an easy and intuitive way to perform the workflow configuration
  • The new configuration interface significantly improves the time to value
  • Expand data sources to support and analyze BigInsights/Hadoop logs
Social Data Analytics Accelerator
  • Improved profile generation performance
Enhanced data discovery & visualization capabilities in BigSheets
  • Provides 10+ built-in functions to extract names, addresses, organizations, email, and phone numbers

The Only Platform to Support Multiple Hadoop Distributions
  • Provides a rich set of big data analytics and accelerators on top of open source
  • Delivers a comprehensive big data platform on top of open source, that addresses all big data requirements
[Diagram labels: User Interfaces; Accelerators; Integration; BigInsights Engine (Map Reduce +, Indexing, Workload Mgmt, Security)]
[Diagram labels: IBM tested & supported open source components; Distribution of Hadoop open source components; future]

Purpose-Built High Speed Connectors for Multiple Data Sources
Connect any type of data through optimized connectors and information integration capabilities
[Diagram labels: Structured; Unstructured; Streaming; InfoSphere Information Server; InfoSphere BigInsights]
Includes connectivity to:
  – DB2
  – InfoSphere Information Server
  – InfoSphere Warehouse
  – InfoSphere Streams
  – IBM Smart Analytics System
  – PureData for Smart Analytics
  – JDBC connector for connectivity to any JDBC compatible data store

Enterprise Integration With Multiple Products Brings the Power of the Big Data Platform to BigInsights
  – IBM InfoSphere Data Explorer (Vivisimo): NEW & BUNDLED: Indexing and “on the glass” integration
  – DB2 and JDBC: High speed parallel read-write for DB2 and JDBC connectivity
  – InfoSphere Guardium: NEW: Auditing
  – PureData for Analytics: NEW: Query and join data using UDFs
  – Cognos Business Intelligence: NEW & BUNDLED: Support for Hive; Business Intelligence capabilities
  – R: NEW: Application that allows users to execute R jobs directly from BigInsights web console
  – InfoSphere Streams: BUNDLED: Enables realtime, continuous analysis of data on the fly
  – InfoSphere DataStage: Data collection and integration
  – Platform Computing: NEW: High performance, low-latency platform computing grid
  – WebSphere: NEW: WAS 8.5 Liberty Profile – high performance secure REST access
  – Rational & Data Studio: RAD, Rational Team Concert & Data Studio collaborative development integration
[Diagram center, InfoSphere BigInsights: Visualization & Exploration; Development Tools; Advanced Engines; Connectors; Workload Optimization; Administration & Security; Open source Hadoop components]

InfoSphere DataStage
Integration, transformation and delivery of data on demand, across multiple sources and targets
  – Complete ETL functionality with metadata-driven productivity
  – Supports team-based development and collaboration
Provides integration with a broad range of sources
  – Connector integrates BigInsights and the underlying HDFS file system
  – Leverages clustered architecture
[Diagram labels: InfoSphere BigInsights; InfoSphere DataStage; Data Warehouses; Data Warehouse]
  – DataStage Connector: push and pull data to and from BigInsights clusters
  – DataStage: read/write data from/to databases, warehouses

BigInsights Connectivity to DBMS / Warehouse

Filter and Summarize Big Data for the Warehouse
BigInsights can manage all enterprise data upon arrival
  – Organizations can manipulate, analyze, and summarize incoming data
BigInsights can be utilized as a source for a data warehouse
  – Broaden analytic coverage without undue burden on systems
  – Augment existing corporate data within warehouses

BigInsights as a Query-ready Archive for a Data Warehouse
Allow firms to manage the size of their existing data management platforms
  – Use BigInsights as a query-ready archive
  – With frequently accessed data maintained in the warehouse and “cold” or outdated information offloaded to BigInsights
  – Better manage the size and usability of data within the enterprise

InfoSphere Streams – BigInsights Integration
Sink adapter for BigInsights
Source adapter for BigInsights
Use cases:
  – Use stored and analyzed BigInsights data to respond to real-time events
  – Use Streams as a large-scale data ingest engine to filter, decorate, or otherwise manipulate a stream of data to be stored in the BigInsights cluster
[Diagram labels: InfoSphere BigInsights (Hadoop-based low latency analytics for variety and volume); InfoSphere Streams (real-time data in motion, analytic streaming tool); Sink/Source Adaptor for Bi-Directional Data Flow]

Recognized for Our Leadership
“IBM has the deepest Hadoop platform and application portfolio.”
  • Functionality
  • Subproject integration
  • Modeling
  • Storage
  • Acceleration and optimization
  • Real-time/low-latency data management
  • Cluster management
  • Packaging
  • Distributed EDW file store connectors
  • Business applications
  • Strategic direction
  • Professional services
  • Solution adoption
  • Solution revenues
  • Solution partners
Source: “The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”, February 2012

Where to get BigInsights software
IBM Big Data software, including BigInsights basic and enterprise editions
  http://www-01.ibm.com/software/data/infosphere/biginsights/
BigInsights on the cloud
  – IBM SmartCloud Enterprise: http://www-935.ibm.com/services/us/en/cloud-enterprise/index.html
  – GoGrid: www.gogrid.com/pr/gogrid-and-ibm-team-deliver-big-data-cloud
BigInsights Infocenter
  http://pic.dhe.ibm.com/infocenter/bigins/v2r0/index.jsp

Cross Industry Use Cases
Use patterns
  – Customer sentiment analysis
  – Internet behavior & buying pattern analysis
  – Predictive modeling (credit card fraud)
  – System log analytics (reduce operational risk)
Common requirements
  – Extract business insight from large volumes of raw data (often outside operational systems)
  – Integrate with other existing software
  – Ready for enterprise
Industries: Financial Services, IT, Retail, Telco, Healthcare, Media & Entertainment, Utilities

Vestas optimizes capital investments based on 2.5 Petabytes of information
  – Model the weather to optimize placement of turbines, maximizing power generation and longevity.
  – Reduce time required to identify placement of turbine from weeks to hours.
  – Incorporate 2.5 PB of structured and semi-structured information flows.
  – Data volume expected to grow to 6 PB.

Asian Health Bureau reduces diagnostic errors
Capabilities Utilized: Hadoop System
  • Telemedicine imaging diagnostics service to improve rural healthcare
  • Automatically sifts and analyzes large collections looking for anomalies and disease
  • Makes it possible for radiologists and pathologists to analyze 1000s of patient images
“Over 80% of healthcare data is medical imaging”
Significant improvements expected:
  • Reduction in diagnostic errors
  • Improved outcomes by leveraging physicians treating similar cases

For Big Data, InfoSphere BigInsights is the Clear Choice
  – Faster: Performance enhancements & workload optimizations resulting in faster answers
  – Smarter: Unique analytic engines that get more accurate results
  – Easier: Built-in development environment and administration consoles enable your resources skills to utilize Hadoop
  – Secure: Enterprise-class security to protect your big data
  – Plugged-in: Pre-integrated to your existing enterprise IT systems, ensuring that big data doesn't become a silo

Summary
Big data is a strategic initiative for IBM
  – Significant investments across software, hardware and services
InfoSphere BigInsights
  – Enables firms to exploit growing variety, velocity, and volume of data
  – Delivers diverse range of analytics
  – Leverages and extends open source
  – Provides enterprise-class features and supporting services
  – Complements existing software investments and commercial offerings
  – Available in basic (free) and enterprise editions
IBM advantage
  – Full solution spanning software, hardware & services
  – Rapid technology advances through partnerships with IBM Research
  – Global reach