Using DB2 BLU to enable IBM Cúram Business Intelligence and Analytics
by user
Comments
Transcript
Using DB2 BLU to enable IBM Cúram Business Intelligence and Analytics
Using DB2 BLU to enable IBM Cúram Business Intelligence and Analytics Anthony Farrell is a senior software engineer in the IBM Cúram platform group. Anthony has technical responsibility for Business Intelligence in the Cúram platform, has 10 years’ experience of Cúram product development, is a certified Java programmer with strong Java, Java Platform Enterprise Edition, and business intelligence experience. Anthony Farrell [email protected] @ Copyright International Business Machines Corporation 2015. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. CONTENTS 1 Introduction .......................................................................................................... 1 1.1 Purpose.................................................................................................... 1 1.2 Audience .................................................................................................. 1 1.3 Pre-requisites ........................................................................................... 1 2 What is DB2 BLU? ............................................................................................... 1 3 Enabling BIA in a BLU sandbox ........................................................................... 2 4 3.1 Enabling BLU in a new database.............................................................. 2 3.2 Enabling BLU in an existing database ...................................................... 2 3.3 Creating column-organized tables ............................................................ 3 Using DDL statements .............................................................................. 3 3.3.2 Using db2convert ...................................................................................... 5 3.3.3 Using the Workload Table Organization Advisor (WTOA) ....................... 5 Planning your BIA Rollout with BLU ..................................................................... 6 4.1 When BLU is best..................................................................................... 6 4.2 Columnar table adoption strategies for Cúram BIA databases ................. 7 4.3 5 3.3.1 4.2.1 BIA staging area ....................................................................................... 7 4.2.2 BIA warehouse.......................................................................................... 7 4.2.3 BIA data mart ............................................................................................ 7 Tuning and other considerations .............................................................. 8 4.3.1 Row and column-based tables co-existing ............................................... 8 4.3.2 Column organized tables and Indexes ..................................................... 8 4.3.3 Materialized Query Tables ........................................................................ 8 4.3.4 Reading from files ..................................................................................... 9 4.3.5 DB2 Warehouse data flow configuration settings ..................................... 9 Acknowledgements .............................................................................................. 9 Appendix A – Sample SQL ............................................................................................ 10 Appendix B – Further Reading ...................................................................................... 11 ii Introduction 1 Introduction Cúram Business Intelligence and Analytics is a decision support solution that helps social program organizations analyze the effectiveness of their programs and gain insight into the efficiency of their operations. It is configurable to meet each organization’s specific needs and scalable from the program to enterprise level. It consists of embedded analytics, domain-specific dashboards, a complete Extract, Transform, Load (ETL) framework and tool-independent, pre-defined, domain-specific data marts. Cúram Business Intelligence and Analytics delivers contextually relevant, embedded Analytics through pre-packaged domain-specific reports. 1.1 Purpose This article shows that IBM Cúram Business Intelligence and Analytics (BIA) can work with DB2 BLU version 10.5 FP5. In this article, you are introduced to DB2 10.5 BLU and you are shown how to start by using BLU technology. You are shown how to set up a BLU Cúram BIA sandbox and several scenarios for adopting BLU technology with Cúram BIA are discussed. 1.2 Audience This article is intended for technical users who are tasked with developing solutions that use Cúram BIA, or for technical users who are aware of DB2 10.5 BLU without extensive expertise. 1.3 2 Pre-requisites IBM Cúram development environment with Business Intelligence and Analytics installed. The SQL tools in this article were tested against Cúram version 6.1.0.1. Knowledge of the Cúram Business Intelligence and Analytics components. Awareness of DB2 10.5 BLU Acceleration. SQL skills and the ability to run the sample SQL. The SQL in this article was tested against DB2 10.5 Fix Pack 5. What is DB2 BLU? IBM DB2 with BLU Acceleration is the next generation database technology for in-memory computing. It delivers a combination of innovations from IBM Research & Development labs, BLU Acceleration provides breakthrough performance. BLU technology is seamlessly integrated in DB2 10.5. In addition, BLU enables faster time–tovalue through a choice of deployment models of on-premises or cloud. DB2 BLU uses in-memory computing for faster insight into your data; it increases the speed of your analytic workloads with columnar tables. It does not require SQL or schema changes to be implemented to your application code. 1 If you are seeing storage savings with adaptive compression, further savings are possible with column-organized tables in BLU. Another direct impact of reducing the on-disk footprint is faster backup, restore, and smaller backup images. 3 Enabling BIA in a BLU sandbox For further guidance on creating BIA data warehouse database’s please review the Reporting developer guide, creating DB2 target schema’s. 3.1 Enabling BLU in a new database Creating a DB2 database is the typical approach when setting up a new BIA sandbox, this approach provides the fastest setup to experience BLU technology. Configure your DB2 instance with the following steps: 1. Set the analytics registry variable. This variable tells the database that it is intended to be used for analytic workloads. Run: db2set db2_workload=analytics. 2. Restart the instance by using the following command. db2stop force followed by db2start. 3. Apply the db2_workload setting to the database by using the following command. Run: autoconfigure apply db and dbm. 4. Create the database by using the following command. create database BIA_BLU using codeset utf-8 territory us collate using identity 5. Create the table space by using the following command. create tablespace BIA_BLU_tbsp Because the database was created after setting DB2_WORKLOAD=ANALYTICS, the default page size for the table spaces is 32K and the extent size is 4 pages. You are now ready to create column organized tables. In fact, all tables that are created in this database default to columnar, to create row organized tables in this database you must use the “organise by row” clause on the create table definition. 3.2 Enabling BLU in an existing database You can enable BLU in an existing database, follow the steps to enable BLU in an existing database. Set the analytics registry variable. This variable tells the database that it is intended to be used for analytic workloads. Run: 2 Enabling BIA in a BLU sandbox db2set db2_workload=analytics Restart the instance by using. db2stop force followed by db2start. The following command applies the db2_workload setting to the database. autoconfigure apply db and dbm Now enable automatic storage. Run the following command: create stogroup BIA_BLU_mystg on '/dbfs1' set as default To convert existing DBMS table spaces to use automatic storage, run the two ALTER TABLESPACE commands: alter tablespace BIA_BLU_tbsp managed by automatic storage alter tablespace BIA_BLU_tbsp rebalance To create new automatic storage table spaces, run the CREATE TABLESPACE command. Because this table space is used to store column-organized table data, use the recommended page size of 32K and the extent size of 4 pages. create tablespace BIA_BLU_tbsp pagesize 32k extentsize 4 The cost of database administration can be a large part of the cost of the managing database software. BLU supports autonomics to lower the cost of database administration. Autonomics is the concept of database self-management, allowing the database to self-tune, self-manage containers and self-tune memory all of which drives better default operation that lowers the total cost of ownership. If you are not already using automatic storage table spaces, you can get started by running the CREATE STOGROUP command. 3.3 3.3.1 Creating column-organized tables Using DDL statements If the database was configured with DB2_WORKLOAD=ANALYTICS, newly created tables are column-organized by default. For example, running the following DDL statements create a table that is column organized. create table DM_DIMCASESTATUSCOL (statuskey bigint not null, status varchar(60)); Otherwise, you must use the optional “organise by clause”, the following example creates a row organized table. create table DM_DIMCASESTATUSROW (statuskey bigint not null, status varchar(60)) organize by row; 3 4 Enabling BIA in a BLU sandbox You can also load data from a row-organized table into a column-organized table by using the following commands: load from DIMCASESTATUSROW of del replace into DM_DIMCASESTATUSCOL The load command automatically builds the columnar compression dictionaries, synopsis table and collect statistics. With regards to compression, bear in mind you need a large enough data set (several gigabytes) that contains a good representative subset of data values to achieve optimal compression. In cases with strongly unsorted input data, the effectiveness of data skipping can sometimes be improved by pre-sorting data on numeric or date/time columns that are frequently used in query predicates. 3.3.2 Using db2convert Use the db2convert command to convert one or all row-organized tables into column-organized tables, row-organized tables remain online during this conversion. The db2convert command starts the ADMIN_MOVE_TABLE stored procedure. Ensure you back up the database or table spaces as the conversion is one way. db2convert -d BIA_BLU -z db2admin -t DM_DIMCASESTATUSROW This command converts the DM_DIMCASESTATUSROW table in the BIA_BLU database from roworganized to column-organized format. The -z option specifies the schema for the table and the -ts option species the table space in which the column-organized table should be created. To convert all tables in a database use the command: db2convert –d BIA_BLU When the command completes you get a message: SQL2446I The db2convert command completed successfully. All row-organized tables that satisfy the specified matching criteria have been converted to column-organized tables. 3.3.3 Using the Workload Table Organization Advisor (WTOA) Before you use this tool, make sure to back up the database, and as always refer to the DB2 documentation for full details before implementing any changes. This advisor tool can be used to identify tables that might be good candidates for columnorganization. The DB2 advisor presents its analysis and rationale, the advisor lists the tables that are might benefit from conversion and those tables that might not. You can see the cardinalities of the tables, how many statements in the query workload referenced them, the cumulative cost of running the statements that reference a table, the statements themselves, and more. With the advisor, you can view and save DDL for implementing the changes that are necessary to convert the tables to column organized. Now, you should be able to measure the compression savings and start running queries without any SQL or DDL changes. Question two in the Appendix shows the SQL to measure disk space consumption. 5 4 Planning your BIA Rollout with BLU 4.1 When BLU is best The BLU feature in DB2 with version 10.5 is intended for analytic (warehouse / data mart type) workloads. These workloads can often involve many thousands of users who run regular report queries. If you have databases that have a mixture of tables that are used for transaction processing workloads (OLTP) and analytic workloads (OLAP). The following table shows some of the markers that suggest tables can be suited to column-organized. If the number of read requests is much greater than the number of write requests, then columnorganized tables can help. Column-organized tables have several advantages when compared to the row-organized tables: Read efficiency. In a column-organized table, only those columns that are needed are read, while each row is read in a row-organized table. Compression efficiency. Typically, a column-organized table features a compression ratio that is higher than a row-organized table compression ratio. Sorting and indexing efficiency. Sorted columns compress more efficiently and range queries run faster. Use column-organized tables. Use row-organized tables Analytical Workloads servicing Warehouse or data marts (Star schema or dimensionally designed tables) Online transaction processing workloads (highly normalized databases that are designed for high levels of transaction through put) Queries that use grouping, sorting, aggregation. Applications where the majority of queries return one or a few rows. Queries where a subset of table columns is typically used. Table scans over a star schema are typically suited to column-organized tables. Applications that insert or update one or few records in a single transaction. Performance is optimal when inserting large numbers of records, 100's or 1000's of records for large insert or update transactions. Typically 1000's to amortize the transaction costs. Materialized Query Tables (MQT) are created to improve performance. MQT is no longer required with column-organized tables as performance is already optimal. Applications that store LOB within tables or XML within tables. 6 Planning your BIA Rollout with BLU 4.2 Columnar table adoption strategies for Cúram BIA databases The default database configuration for BIA is a two tiered model. That is, a single database is created to host the BIA Staging schema and the BIA Warehouse schema. A second database is created to host the BIA data mart schema. 4.2.1 BIA staging area Row-based organization is best for these tables. The number of read requests is not much greater than the number of write requests. Also, typically, all columns are read by warehouse ETL processes when updating warehouse tables. Using the Workload Table Advisor Tool, you can identify any staging tables that might benefit from column organization. 4.2.2 BIA warehouse Row-based organization is best for these tables. The number of read requests is not much greater than the number of write requests. Also, typically, all columns are read by data mart ETL processes when updating data mart tables. Using the Workload Table Advisor Tool, you can identify any staging tables that might benefit from column organization. 4.2.3 BIA data mart Analytics are built from data mart tables, as a result, the number of read requests is much greater than the number of write requests. With green field projects, deciding how to use columnar tables is simpler. The data mart database is created as a columnar database, this configuration results in all Fact and Dimensions tables being column-organised. If you have production data, you might want a more fine grained strategy to achieve “the greatest return for lowest effort”. Using the Workload Table Advisor Tool, identify the Fact table’s most in use, for each Fact table identify the most used dimensions and then convert this cohort of Fact and dimension tables to column-organised tables. Prioritize the remaining output from the advisor tool, you can then convert the remaining Fact and dimensions tables. 7 4.3 4.3.1 Tuning and other considerations Row and column-based tables co-existing To, gain full advantage of the optimizations that are provided by the columnar execution engine, SQL statements must join columnar tables to columnar tables. Introducing row-based tables into SQL queries can negate many of the optimizations that are otherwise possible. If you are running a mixture of row based and columnar tables in a single database, ensure that your database is tuned such that OLAP workloads are not negatively affected by your columnar tables. Columnar databases typically require a very high level of database sort memory, this can negatively affect row-based queries. With FP4 DB2 introduced configuration that allows row and column-based tables to co-exist in the same database, without OLTP queries on row tables being adversely affected. Also, DB2 FP5 contains many updates to the DB2 Warehouse components; Warehouse Design Client, SQW, and so on. To ensure that you are getting the best from your DB2 stack you are advised to stay current with Fix Packs. 4.3.2 Column organized tables and Indexes By default BIA DDL includes indexes to improve performance, these indexes are no longer required with a columnar database. Only indexes that are required to enforce primary keys or unique constraints are required, other indexes that support performance are no longer required. 4.3.3 Materialized Query Tables Your current architecture might include MQT to support analytics. For example, to support charts that are written with BIRT or reports that are written in other technologies like Cognos. If you created MQT for performance reasons then you should consider changing your analytics to point directly at columnar tables, as DB2 BLU provides performance that is good when compared to MQT. DB2 advises that there is now little performance gain from MQT when compared to columnar tables. Materialized Query Tables used as part of a “resilience or failover” 1 process. MQT are sometimes used as part of a “resilience or failover” process. For example, MQT can remain online for analytics users when the underlying warehouse is being updated. If the warehouse load completes successfully, MQT are then rebuilt with the latest data. If the warehouse load fails the existing MQT remain in place until the warehouse load succeeds. Because you are using MQT for resilience or failover, you can continue to do so. 1 MQT is used by Cúram clients as a resilience technique during ETL processing. 8 Acknowledgements 4.3.4 Reading from files Some BIA ETL processes read data from comma-separated-files (CSV). As you upgrade to newer versions of DB2 you might need to rebind ETL data flows against any CSV files. This might be required as newer versions of DB2 might provide updated usage notes for the LOAD command. 4.3.5 DB2 Warehouse data flow configuration settings Further design time and runtime configuration options are possible with BIA ETL data flows including: row compression for internal staging tables, enable optimization for code generation, jdbc batch size and so on. Depending on your local topology and data patterns, you are advised to change these configuration settings locally. 5 Acknowledgements I would like to thank Patrick Fagan, Chief Architect, Cúram Platform Group for his encouragement. I also want to thank the following members of the Cúram product development team for their input and support: Joseph Glackin, Senior Software Engineer, who tested the SQL tools in this article and performed a technical review this document. Eoin Fitzpatrick, Software Engineer, Cúram Product Development, Cúram and Smarter Care, IBM Watson Health, who performed a technical review this article. Caitriona nic Lughadha, Information Development Manager, Functional Verification Test Manager Cúram SPM, who performed an editorial review of this article. Andrew Foley, System Verification Test Engineer, who installed DB2 version 10.5 FP5 on Linux. Conor Cullen, Partner Enablement Manager, Cúram and Smarter Care, IBM Watson Health, who provided a business review of the document. I would also like to thank, Albert Z. Yao, Warehouse Tooling, InfoSphere Warehouse, for confirming that DB2 Warehouse supports DB2 BLU. It was also noted by Albert that clients should stay current with DB2 Fix packs to avail of the latest supported features and fixes. . 9 APPENDIX A – SAMPLE SQL Question: How do I verify which tables are row-organized and, which are column-organized? Answer: Read the TABLEORG column from the syscat.tables to verify if a table is row-organized (TABLEORG=R) or column-organized (TABLEORG=C). See the example SQL which lists the table organization for the tables created in this article. select tabname, tableorg from syscat.tables Question: How do I evaluate any storage saving with BLU compression? Answer: The following example query can be used to get the total on-disk footprint for the columnorganized tables versus the same data loaded into row-organized tables select systemTables.tabname, rowcompmode, pctpagessaved, data_object_p_size, index_object_p_size, col_object_p_size, data_object_p_size + index_object_p_size + col_object_p_size as totalSizeInKB from syscat.tables systemTables, sysibmadm.admintabinfo adminInfo where systemTables.tabname = adminInfo.tabname and systemTables.tabschema = adminInfo.tabschema 10 APPENDIX B – FURTHER READING DB2 10.5. What’s new in the SQL Warehousing Tool What's new in SQW for DB2 Warehouse V10.5 DB2 10.5 Knowledge Center: www01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.kc.doc/welcome.html Big Data & Analytics Hub http://www.ibmbigdatahub.com/blog/super-analytics-super-easy DB2 Best Practices Community https://www.ibm.com/developerworks/community/groups/ Tuning and Configuring DB2 BLU: https://www.ibm.com/developerworks/community Optimizing ETL performance with DB2 BLU: http://www.ibm.com/developerworks/data/library/techarticle/dm-1504etl-blu-accel/dm-1504etl-bluaccel-pdf.pdf Options for converting tables to column organized tables: http://www.ibm.com/developerworks/library/dm-1406convert-table-db2105/ Cúram Social Program Management http://www-01.ibm.com/support/knowledgecenter/SS8S5A/welcome?lang=en Cúram SPMP Analytics Accelerator http://www01.ibm.com/support/knowledgecenter/SS8S5A_6.1.0/com.ibm.curam.content.doc/SPMAnalytics/SP MOverview/c_SPM_SPMAnalyticsOverview.html Installing a Cúram Development Environment http://www01.ibm.com/support/knowledgecenter/SS8S5A_6.1.0/com.ibm.curam.content.doc/install_Developm entEnvironment/ctr_install_installingthedevelopmentenvironment.html?lang=en Cúram Business Intelligence and Analytics http://www-03.ibm.com/software/products/en/business-intelligence-analytics Cúram Business Intelligence and Analytics Knowledge Center http://www01.ibm.com/support/knowledgecenter/SS8S5A_6.1.1/com.ibm.curam.nav.doc/common/t_ctr_bia.ht ml Cúram Business Intelligence and Analytics Knowledge Center – Creating BIA database’s http://www01.ibm.com/support/knowledgecenter/SS8S5A_6.1.1/com.ibm.curam.content.doc/BIReportingDeve loper/c_BEREP_Installandconfig1CreateDbTargetSchemas1.html 11 ® © Copyright IBM Corporation 2015 IBM United States of America Produced in the United States of America US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. IBM can not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PAPER “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes may be made periodically to the information herein; these changes may be incorporated in subsequent versions of the paper. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this paper at any time without notice. Any references in this document to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. 12 IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation 4205 South Miami Boulevard Research Triangle Park, NC 27709 U.S.A. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available. If you are viewing this information softcopy, the photographs and color illustrations may not appear. 13 Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at http://www.ibm.com/legal/copytrade.shtml. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. 14