InfoSphere BigInsights Overview PureData Ecosystem May 29, 2013 © 2013 IBM Corporation

What is Big Data?
▪ All kinds of data
– Large volumes
– Valuable insight, but difficult to extract
▪ Data is the new oil
– In its raw form, oil has little value
– Once processed and refined, it helps power the world
▪ Big data is a hot topic because technology makes it possible to analyze ALL available data
“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis.”
Source: Matt Eastwood, IDC
Big Data Presents Big Opportunities
▪ Big data characteristics
– Variety
• Manage and benefit from diverse data types and data structures
– Volume
• Scale from terabytes to zettabytes
– Velocity
• Analyze streaming data and large volumes of persistent data
– Veracity
• 1 in 3 business leaders don’t trust the information they use to make decisions
• Establishing the veracity of big data sources
Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner
Big Difference: Schema on Run
▪ Regular database
– Schema on load
– Raw data → Schema to filter → Storage (pre-filtered data)
▪ Big Data (Hadoop)
– Schema on run
– Raw data → Storage (unfiltered, raw data) → Schema to filter → Output
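The two flows on this slide can be sketched in a few lines of Python. This is a minimal illustration, not BigInsights or Hadoop code: the sample records, the field names and the helper functions (`parse`, `load_with_schema`, `query_raw`) are all assumptions made up for the example. It shows that filtering at load time discards whatever does not match the schema, while keeping the raw data lets a different schema be applied later, at run time.

```python
import json

# Hypothetical raw input: two JSON events and one malformed line.
RAW_EVENTS = [
    '{"user": "ana", "action": "login", "ms": 120}',
    '{"user": "bo", "action": "purchase", "amount": 19.5}',
    'not json at all',
]

def parse(line, fields):
    """Return the requested fields of one record, or None if it does not fit."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or any(f not in record for f in fields):
        return None
    return {f: record[f] for f in fields}

def load_with_schema(raw_lines, fields):
    """Schema on load: only records matching the schema are stored."""
    parsed = (parse(line, fields) for line in raw_lines)
    return [record for record in parsed if record is not None]

def query_raw(storage, fields):
    """Schema on run: storage holds raw lines; the schema is applied per query."""
    return load_with_schema(storage, fields)

# Schema on load: the purchase event is dropped and cannot be recovered later.
warehouse = load_with_schema(RAW_EVENTS, ["user", "ms"])

# Schema on run: everything is kept, so a new question can use a new schema.
raw_store = list(RAW_EVENTS)
purchases = query_raw(raw_store, ["user", "amount"])
```

The trade-off is that the schema-on-run store pays the parsing cost on every query, which is why the slide pairs it with low-cost, highly parallel storage.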
Analyze Raw Data
▪ Customer Need
– Ingest data as-is into Hadoop and derive insight from it
– Process large volumes of diverse data within Hadoop
– Combine insights with the data warehouse
– Low-cost ad-hoc analysis with Hadoop to test new hypotheses
▪ Value Statement
– Gain new insights from a variety and combination of different data sources
– Overcome the prohibitively high cost of converting unstructured data sources to a structured format
– Experiment with analysis of different data combinations to modify the analytic models in the data warehouse
▪ Customer example
– Financial Services Regulatory Org – managed additional data types and integrated with their existing data warehouse
▪ Get started with: InfoSphere BigInsights
Reduce Costs with Hadoop
▪ Customer Need
– Reduce the overall costs to maintain data in the warehouse
– Lower costs as data grows within the data warehouse
– Reduce expensive infrastructure used for processing and transformations
▪ Value Statement
– Support existing and new workloads on the most cost-effective alternative, while preserving existing access and queries
– Lower storage costs
– Reduce processing costs by pushing processing onto commodity hardware and the parallel processing of Hadoop
▪ Customer example
– Financial Services Firm – moved processing of applications and reports to Hadoop HBase while preserving existing queries
▪ Get started with: InfoSphere BigInsights
What is BigInsights?
▪ Flexible, enterprise-class support for processing large volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
▪ Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
[Chart: editions plotted by breadth of capabilities against enterprise class]
Basic Edition (free download)
– Integrated install
– Web-based console
Enterprise Edition (licensed)
– Business process accelerators (“Apps”)
– Text analytics, spreadsheet-style analysis tool
– RDBMS connectivity
– Integrated Web-based console
– Flexible job scheduler
– Performance enhancements
– Eclipse-based tooling
– LDAP authentication
– ....
What’s so Special About Open Source Hadoop?
Storage
• Distributed
• Reliable
• Commodity gear
MapReduce
• Parallel Programming
• Fault Tolerant
Scalable
• New nodes can be added on the fly
Affordable
• Massively parallel computing on commodity servers (easily and affordably available)
Flexible
• Hadoop is schema-less – can absorb any type of data
Fault Tolerant
• Through MapReduce software framework
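For readers new to the MapReduce programming model named on this slide, here is a minimal single-process sketch in Python of the classic word count. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. On a real cluster the map and reduce calls would run in parallel on different nodes, and failed tasks would simply be rerun.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different node; here it is sequential.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["big data big insights", "data is the new oil"])
```

Because each map call sees only its own document and each reduce call sees only one key, the work splits cleanly across machines, which is what makes the model a good fit for commodity clusters.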
Hadoop is Well Suited for Handling Big Data Challenges
Analyzing larger volumes may provide better results
Deriving new insights from combinations of data types
Larger data volumes are cost prohibitive with existing technology
Exploring data – a sandbox for ad-hoc analytics
InfoSphere BigInsights – A Closer Look
[Diagram layers: User Interfaces (Visualization, Dev Tools, Admin Console); Accelerators (Application Accelerators, Text Analytics); Integration (Databases, Content Management, Information Governance); BigInsights Engine (Map Reduce +, Indexing, Workload Mgmt, Security); Apache Hadoop]
More Than Hadoop
• Performance & workload optimizations
• Unique text analytic engines
• Spreadsheet-style visualization for data discovery & exploration
• Built-in IDE & admin consoles
• Enterprise-class security
• High-speed connectors for integration with other systems
• Analytical accelerators
BigInsights Enterprise Edition Components
[Diagram: IBM InfoSphere BigInsights component stack; the legend distinguishes Open Source components from IBM components]
Panels: Visualization & Discovery; Applications & Development; Administration; Integration; Advanced Analytic Engines; Workload Optimization; Runtime; Data Store; File System; Management
Components shown: BigSheets, Apps, Dashboard & Visualization, Workflow, Text Analytics, MapReduce, Pig & Jaql, Hive, Admin Console, Monitoring, JDBC, PureData, DB2, R, Text Processing Engine & Extractor Library (AQL+HIL), Adaptive Algorithms, Streams, DataStage, Integrated Installer, Enhanced Security, Splittable Text Compression, Adaptive MapReduce, ZooKeeper, Oozie, Jaql, Flexible Scheduler, Lucene, Pig, Index, Guardium, HCatalog, Platform Computing, Cognos, Security, Flume, HBase, Audit & History, Lineage, Sqoop, HDFS
Integrated Installation
▪ Integrated installation of supported open source and IBM components
– Seamless process for single node and cluster environments
– Integrated installation of all selected components
▪ Post-install validation of IBM and open source components
Roll Your Own
▪ Many disparate components
▪ Manual install
▪ Leg-work required
• What components?
• Which versions?
BigInsights (Easiest)
▪ Single install
• No need to worry about components & versions
▪ Install requires very little interaction
• No extra prerequisites to download
BigInsights Web Console for Administration
▪ Real-time iteration and visualization of the cluster
– Add/remove nodes
– HDFS file system administration
– Configure components
– Monitor workflow, jobs, storage
– Metrics export for enterprise monitoring tools
– Health summary
▪ Discover and Analyze
– Run or schedule pre-built applications
– Discover information, entities
– Load, explore data
– Review log records
Enhanced tools for Business Users: Application linking
▪ Application linking
– Compose new applications from existing applications and BigSheets
– Invoke analytics applications from the web console, including integration within BigSheets
▪ New Apps to provide enhanced data import capability
– REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
– Sampling App that enables users to sample data for analysis
– Subsetting App that enables users to subset data for data analysis
Data Visualization and Analytics
▪ BigSheets – spreadsheet metaphor for exploring data
▪ Rich platform for the analysis and visualization of Internet-scale data volumes
▪ Ad-hoc analytics for LOB user
▪ Analyze a variety of data, unstructured and structured
Enhanced tools for Business Users
A centralized dashboard to visualize analytic results:
▪ BigSheets collections
▪ Analytic application results
▪ Monitoring metrics
▪ .. leveraging a new charting engine
BigSheets usability enhancements:
▪ The ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
▪ Inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections, application of analytics to BigSheets columns, etc.
Text Analytics – Accurate Analysis of Unstructured Big Data
▪ How it works
– Parses text and detects meaning with annotators
– Understands the context in which the text is analyzed
▪ Hundreds of pre-built annotators
– E.g. names, addresses, phone numbers, among others
– Multilingual support
▪ Distills structured info from unstructured text
– Sentiment analysis
– Consumer behavior
Example: unstructured text (document, email, etc.) → classification and insight
“Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”
▪ Benefits
– More precise and correct answers
– 50% faster than manual method
– Run faster text analysis
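To make the idea of an annotator concrete, the sketch below uses plain Python regular expressions to pull phone numbers and e-mail addresses out of free text. It is only an illustration of the concept: BigInsights text analytics is built on its own extractor language (AQL) and far richer, context-aware annotators, none of which is reproduced here. The patterns, the `ANNOTATORS` table, the `annotate` function and the sample sentence are all assumptions made up for this example.

```python
import re

# Hypothetical, deliberately simple annotators: one regex per entity type.
ANNOTATORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),  # US-style numbers only
}

def annotate(text):
    """Return structured spans (type, text, offsets) found in unstructured text."""
    spans = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            spans.append({
                "type": label,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Order by position so the output follows the reading order of the text.
    return sorted(spans, key=lambda span: span["start"])

sample = "Call 555-867-5309 or write to help@example.com for details."
annotations = annotate(sample)
```

Each span is a small structured record, which is the sense in which annotators distill structured information from unstructured text; downstream steps such as sentiment analysis would consume records like these.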
Improved Security
▪ The web console leverages several mechanisms to ensure security
▪ User authentication approaches
– None, Flat file, LDAP, PAM
▪ Authorization (role-based)
– Four default roles
• System administrator
• Data administrator
• Application administrator
• User
– Feature-based access control based on BigInsights role membership
▪ Credentials store
– Store potentially sensitive info: tokens, passwords, etc.
– Maintained in BigInsights distributed file system
[Diagram: BigInsights, Authentication Store]
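Feature-based access control by role membership can be sketched as a lookup from roles to permitted features. The four role names below come from the slide; everything else is an assumption for illustration only: the feature names, the role-to-feature mapping and the `can_access` helper are hypothetical and do not describe what each BigInsights role is actually allowed to do.

```python
# Hypothetical mapping: which console features each default role may use.
ROLE_FEATURES = {
    "system_administrator": {"manage_cluster", "manage_files", "manage_apps", "run_apps"},
    "data_administrator": {"manage_files", "run_apps"},
    "application_administrator": {"manage_apps", "run_apps"},
    "user": {"run_apps"},
}

def can_access(user_roles, feature):
    """A feature is allowed if any of the user's roles grants it."""
    return any(feature in ROLE_FEATURES.get(role, set()) for role in user_roles)

# A user holding two roles gets the union of both roles' features.
analyst_roles = ["user", "data_administrator"]
analyst_can_manage_files = can_access(analyst_roles, "manage_files")
analyst_can_manage_cluster = can_access(analyst_roles, "manage_cluster")
```

Unknown roles grant nothing in this sketch, so a misconfigured role name fails closed rather than open.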
Enhanced tools for Developers
▪ Unified tooling for Big Data Application Development Lifecycle
– Enables users to sample data and define, test, and deploy analytics applications from the BigInsights Eclipse tools
– Users can administer, execute, and monitor the deployed applications from the BigInsights Web Console
1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster
IBM Accelerators
▪ Packaged Applications included with BigInsights
– IBM Accelerator for Machine Data Analytics
• Convert raw, dispersed machine logs and data files into informed, intelligent decisions
• Locate log entries across servers and time zones
• Add/Extract new log types to repository
– IBM Accelerator for Social Data Analytics
• Extract information from social media sites and build user profiles based on participation
• Import social media data
• Associate profiles with buzz and intent around brands, products, and companies
InfoSphere BigInsights – Advanced Features
▪ Adaptive MapReduce
– Mappers can decide at runtime to take on more work
– Balance workload across Map tasks
– Speeds up a class of jobs (e.g. jobs that process small files)
– Supported on Jaql/Java jobs
▪ Flexible Scheduler
– Optimize response time for small jobs
– Available in addition to FAIR and FIFO scheduling
– Based on average response time metrics
• Allocates maximum resources to small jobs, guaranteeing that these jobs are completed quickly
▪ Compression
– BigInsights LZO-based compression technology
– Splittable: use multiple map tasks to process compressed text files
– Good performance with a reasonable compression ratio (~60%)
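The reason a scheduler tuned for average response time favors small jobs can be shown with simple arithmetic. The sketch below is not the BigInsights Flexible Scheduler, whose algorithm the slides do not describe; it assumes a single queue that runs one job at a time, and the job durations are made-up numbers. It compares arrival order (FIFO) with running the smallest jobs first.

```python
def average_response_time(job_durations):
    """Average completion time when jobs run back to back in the given order."""
    elapsed = 0
    total = 0
    for duration in job_durations:
        elapsed += duration   # this job finishes once all earlier ones are done
        total += elapsed
    return total / len(job_durations)

# Hypothetical job durations in seconds, listed in arrival order.
jobs = [100, 5, 10, 2]

fifo_average = average_response_time(jobs)                 # large job blocks the rest
small_first_average = average_response_time(sorted(jobs))  # small jobs go first
```

With these numbers the FIFO average is 109.25 seconds and the small-jobs-first average is 35.75 seconds, while the last job finishes at 117 seconds either way. The large job loses very little and every small job finishes far sooner.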
BigInsights 2.1 features a variety of enhancements that deliver key Enterprise Hadoop capabilities
Big SQL
• Comprehensive standard ANSI SQL support to access data stored in BigInsights
• Standards-compliant JDBC & ODBC drivers
• Leverages MapReduce parallelism in complex data sets
• Direct access for low latency in small queries, e.g. sub-second response to HBase queries
GPFS-FPO support
• No single point of failure
• Built-in High Availability
• POSIX compliance
• Enhanced Security with ACL support
• Support for Storage Pools
• SnapShot capability
High Availability
• Out of the box High Availability
• Seamless, automatic and transparent failover for HDFS NameNode and JobTracker
• Eliminates admin intervention
• Reduces downtime for recovery of the cluster
• Hardware fencing to guarantee data integrity
BigInsights 2.1 updates to the Accelerators and BigSheets also enhance consumability
• Machine Data Analytics Accelerator
– New configuration UI enables an easy and intuitive way to perform the workflow configuration
– The new configuration interface significantly improves the time to value
– Expanded data sources to support and analyze BigInsights/Hadoop logs
• Social Data Analytics Accelerator
– Improved profile generation performance
• Enhanced data discovery & visualization capabilities in BigSheets
– Provides 10+ built-in functions to extract names, addresses, organizations, email, and phone numbers
The Only Platform to Support Multiple Hadoop Distributions
• Provides a rich set of big data analytics and accelerators on top of open source
• Delivers a comprehensive big data platform on top of open source that addresses all big data requirements
[Diagram: User Interfaces, Accelerators, Integration, and the BigInsights Engine (Map Reduce +, Indexing, Workload Mgmt, Security) layered over either IBM tested & supported open source components or a distribution of Hadoop open source components (future)]
Purpose-Built High Speed Connectors for Multiple Data Sources
Connect any type of data through optimized connectors and information integration capabilities
[Diagram: Structured, Unstructured, and Streaming data; InfoSphere Information Server; InfoSphere BigInsights]
Includes connectivity to:
▪ DB2
▪ InfoSphere Information Server
▪ InfoSphere Warehouse
▪ InfoSphere Streams
▪ IBM Smart Analytics System
▪ PureData for Smart Analytics
▪ JDBC connector for connectivity to any JDBC compatible data store
Enterprise Integration With Multiple Products Brings the Power of the Big Data Platform to BigInsights
InfoSphere BigInsights: Visualization & Exploration; Development Tools; Advanced Engines; Connectors; Workload Optimization; Administration & Security; Open source Hadoop components
▪ IBM InfoSphere Data Explorer (Vivisimo) – NEW & BUNDLED: Indexing and “on the glass” integration
▪ DB2 and JDBC – High speed parallel read-write for DB2 and JDBC connectivity
▪ InfoSphere Guardium – NEW: Auditing
▪ PureData for Analytics – NEW: Query and join data using UDFs
▪ Cognos Business Intelligence – NEW & BUNDLED: Support for Hive; Business Intelligence capabilities
▪ R – NEW: Application that allows users to execute R jobs directly from BigInsights web console
▪ InfoSphere Streams – BUNDLED: Enables real-time, continuous analysis of data on the fly
▪ InfoSphere DataStage – Data collection and integration
▪ Platform Computing – NEW: High performance, low-latency platform computing grid
▪ WebSphere – NEW: WAS 8.5 Liberty Profile, high performance secure REST access
▪ Rational & Data Studio – RAD, Rational Team Concert & Data Studio collaborative development integration
InfoSphere DataStage
▪ Integration, transformation and delivery of data on demand, across multiple sources and targets
– Complete ETL functionality with metadata-driven productivity
– Supports team-based development and collaboration
▪ Provides integration with a broad range of sources
– Connector integrates BigInsights and the underlying HDFS file system
– Leverages clustered architecture
[Diagram: InfoSphere BigInsights ↔ InfoSphere DataStage ↔ Data Warehouses]
– DataStage Connector: push and pull data to and from BigInsights clusters
– DataStage: read/write data from/to databases and warehouses
BigInsights Connectivity to DBMS / Warehouse
Filter and Summarize Big Data for the Warehouse
▪ BigInsights can manage all enterprise data upon arrival
– Organizations can manipulate, analyze, and summarize incoming data
▪ BigInsights can be utilized as a source for a data warehouse
– Broaden analytic coverage without undue burden on systems
– Augment existing corporate data within warehouses
BigInsights as a Query-ready Archive for a Data Warehouse
▪ Allow firms to manage the size of their existing data management platforms
– Use BigInsights as a query-ready archive
– With frequently accessed data maintained in the warehouse and “cold” or outdated information offloaded to BigInsights
– Better manage the size and usability of data within the enterprise
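The hot/cold split described on this slide can be sketched as a simple partition by last access date. This is an assumption-laden illustration, not a BigInsights or warehouse feature: the row layout, the `last_accessed` field, the 90-day threshold and the `split_hot_cold` helper are all hypothetical, and a real archive policy would usually also weigh table size and query patterns.

```python
from datetime import date, timedelta

def split_hot_cold(rows, today, hot_days):
    """Partition rows: recently accessed stay in the warehouse, the rest are archived."""
    cutoff = today - timedelta(days=hot_days)
    hot = [row for row in rows if row["last_accessed"] >= cutoff]
    cold = [row for row in rows if row["last_accessed"] < cutoff]
    return hot, cold

# Hypothetical table metadata.
rows = [
    {"id": "orders_2013_q2", "last_accessed": date(2013, 5, 1)},
    {"id": "orders_2012_q4", "last_accessed": date(2012, 11, 15)},
    {"id": "orders_2013_q1", "last_accessed": date(2013, 3, 15)},
]

# Keep anything touched in the last 90 days; offload the rest to the archive.
hot, cold = split_hot_cold(rows, today=date(2013, 5, 29), hot_days=90)
```

Because the archive stays queryable, rows in `cold` are moved rather than deleted; the warehouse shrinks without losing access to older data.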
InfoSphere Streams – BigInsights Integration
▪ Sink adapter for BigInsights
▪ Source adapter for BigInsights
▪ Use cases:
– Use stored and analyzed BigInsights data to respond to real-time events
– Use Streams as a large-scale data ingest engine to filter, decorate, or otherwise manipulate a stream of data to be stored in the BigInsights cluster
[Diagram: InfoSphere BigInsights (Hadoop-based low latency analytics for variety and volume) ↔ Sink/Source Adaptor for bi-directional data flow ↔ InfoSphere Streams (real-time data in motion, analytic streaming tool)]
Recognized for Our Leadership
“IBM has the deepest Hadoop platform and application portfolio.”
• Functionality
• Subproject integration
• Modeling
• Storage
• Acceleration and optimization
• Real-time/low-latency data management
• Cluster management
• Packaging
• Distributed EDW file store connectors
• Business applications
• Strategic direction
• Professional services
• Solution adoption
• Solution revenues
• Solution partners
February 2012 “The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012”
Where to get BigInsights software
▪ IBM Big Data software, including BigInsights Basic and Enterprise editions
http://www-01.ibm.com/software/data/infosphere/biginsights/
▪ BigInsights on the cloud
– IBM SmartCloud Enterprise: http://www-935.ibm.com/services/us/en/cloud-enterprise/index.html
– GoGrid: www.gogrid.com/pr/gogrid-and-ibm-team-deliver-big-data-cloud
▪ BigInsights Infocenter
http://pic.dhe.ibm.com/infocenter/bigins/v2r0/index.jsp
Cross Industry Use Cases
▪ Use patterns
– Customer sentiment analysis
– Internet behavior & buying pattern analysis
– Predictive modeling (credit card fraud)
– System log analytics (reduce operational risk)
▪ Common requirements
– Extract business insight from large volumes of raw data (often outside operational systems)
– Integrate with other existing software
– Ready for enterprise
Industries: Financial Services, IT, Retail, Telco, Healthcare, Media & Entertainment, Utilities
Vestas optimizes capital investments based on 2.5 Petabytes of information.
▪ Model the weather to optimize placement of turbines, maximizing power generation and longevity.
▪ Reduce time required to identify placement of turbine from weeks to hours.
▪ Incorporate 2.5 PB of structured and semi-structured information flows. Data volume expected to grow to 6 PB.
Asian Health Bureau reduces diagnostic errors
Capabilities Utilized: Hadoop System
• Telemedicine imaging diagnostics service to improve rural healthcare
• Automatically sifts and analyzes large collections looking for anomalies and disease
• Makes it possible for radiologists and pathologists to analyze 1000s of patient images
“Over 80% of healthcare data is medical imaging”
Significant improvements expected:
• Reduction in diagnostic errors
• Improved outcomes by leveraging physicians treating similar cases
For Big Data, InfoSphere BigInsights is the Clear Choice
Faster: Performance enhancements & workload optimizations resulting in faster answers
Smarter: Unique analytic engines that get more accurate results
Easier: Built-in development environment and administration consoles enable your resources’ skills to utilize Hadoop
Secure: Enterprise-class security to protect your big data
Plugged-in: Pre-integrated to your existing enterprise IT systems, ensuring that big data doesn't become a silo
Summary
▪ Big data is a strategic initiative for IBM
– Significant investments across software, hardware and services
▪ InfoSphere BigInsights
– Enables firms to exploit growing variety, velocity, and volume of data
– Delivers diverse range of analytics
– Leverages and extends open source
– Provides enterprise-class features and supporting services
– Complements existing software investments and commercial offerings
– Available in basic (free) and enterprise editions
▪ IBM advantage
– Full solution spanning software, hardware & services
– Rapid technology advances through partnerships with IBM Research
– Global reach