Active/Active with InfoSphere Replication Server - Taking DB2 DR from Failover to Continuous Availability
Active/Active with InfoSphere Replication Server: Taking DB2 DR from Failover to Continuous Availability
Session TDZ-1430A - Tue, Oct 26th 2010, 3:00 PM - 4:00 PM
Mandalay Bay North Convention Center - South Pacific A
Serge Bourbonnais, IBM Silicon Valley Laboratory

Abstract
Q Replication, which is part of InfoSphere Replication Server for DB2 z/OS and for Linux, Unix, and Windows (LUW), can replicate transactions between databases with low latency (sub-second), at very high throughputs (millions of changes per minute), over practically unlimited distances, and for thousands of DB2 tables, while preserving transactional integrity and tolerating system and network outages. Each database is active, allowing for workload distribution. Databases can be configured very differently and can reside on different types of hardware and different software versions, allowing for immediate failover during outages, as well as maintenance, upgrades, and migrations without any downtime. Replication is multi-directional and can manage conflicting database changes. A replication infrastructure can also feed a warehouse or distribute and consolidate data between large numbers of servers. InfoSphere Replication Server has been used by several thousands of customers for nearly two decades, and has allowed some customers to achieve as much as 99.99999% DB2 availability.

Outline
• Business Continuity
• Replication Technologies
• Functional and Technological Overview of Q Replication
• Deploying an Active/Active configuration
  – Use-Cases
  – Operational Procedures
  – Database Constructs that may require special treatment
  – Value Proposition

Causes of System Unavailability
1. Planned Maintenance
   • System and application upgrades or reconfiguration
   • Scheduled during 'off-hours'
   • Requires careful planning and swift execution; may need to be called off if there is a glitch
2. Component Outages
   • Caused by human error, software defects, disk failure, subsystem failure, hardware failure, or a power grid outage
   • Data is (generally) recoverable
   • But changes might be stranded until the failed component is recovered or the error is corrected
3. Data Corruptions
   • Caused by human errors, software defects
   • Data is not recoverable
   • Requires 'un-doing' the mistake, or going back to a version of the database known to be reliable
4. Disasters
   • Flood, earthquake, fire, …, loss of a site
Addressing each one of these causes requires a data replication strategy.

How Much Interruption can your Business Tolerate?
Ensuring Business Continuity:
1. Disaster Recovery (Standby) - Restore business after a database loss
2. High-Availability (Standby) - Meet service availability objectives, e.g., 99.9% availability or 8.8 hours of downtime a year
3. Continuous Availability (Active/Active) - No downtime (planned or not)
Global enterprises that operate across timezones no longer have any 'off-hours' window. Continuous Availability is required.
What is the cost of 1 hour of downtime during core business hours?
Cost of Downtime by Industry

  Industry Sector           Loss per Hour
  Financial                 $8,213,470
  Telecommunications        $4,611,604
  Information Technology    $3,316,058
  Insurance                 $2,582,382
  Pharmaceuticals           $2,058,710
  Energy                    $1,468,798
  Transportation            $1,463,128
  Banking                   $1,145,129
  Chemicals                 $1,071,404
  Consumer Products         $989,795

Source: Robert Frances Group 2006, "Picking up the value of PKI: Leveraging z/OS for Improving Manageability, Reliability, and Total Cost of Ownership of PKI and Digital Certificates." (*)

Establishing the System Recovery Objectives
• Recovery Time Objective (RTO): How much time is needed to restore business operations?
• Recovery Point Objective (RPO): How much data could we really afford to lose? --> How far does the recovery site need to be?

Replication Technologies
Technologies, and examples of products, by requirement:

  Requirement:         Disaster Recovery       High Availability       Continuous Availability
                       (maintain a site copy)  (maintain a database    (maintain one or more copies
                                               copy)                   of critical data)
  Replicated objects:  Disk volumes            Log pages               Tables / row changes
  Technology:          Disk copy:              DB recovery:            Transaction replay:
                       XRC (asynch),           HADR (LUW only),        Q Replication
                       PPRC (synch) (1)        log shipping

Considerations for comparing technologies:
• Recovery Point Objective: 0 data loss --> seconds
• Recovery Time Objective: a few seconds --> 2 hours or more
• Distance required between sites: 10s of kilometers --> unlimited
• Hardware utilization: Active/Standby --> Active/Active
• Recovery requirement: site --> database subsystem --> application recovery
• Impact on applications: direct overhead (synchronous technologies) --> no impact (asynchronous technologies)
• CPU overhead: negligible (hardware, e.g., PPRC) --> proportional to the workload (transaction replay technology)
(1) PPRC with Metro Mirror on z/OS provides a Continuous Availability solution for the mainframe.

Q Replication
• Part of the InfoSphere Replication Server product
• A software-based asynchronous replication solution
  – For relational databases
  – For selected tables/columns/transactions/operations
  – Changes are captured from the database recovery log, transmitted as (compact) binary data, and then applied to the remote database(s) using SQL statements. The technology is transaction replay
  – Each DB2 is ACTIVE and can be used for READ/WRITE operations
• NOT disk-level copy (where disk volumes are shadowed)
• NOT log shipping (where the remote database is in recovery mode)
• Asynchronous = no impact on application response time
[Diagram: Site A (active DB2 with user tables and control tables; Q Capture reads the database recovery log and publishes changes to WebSphere MQ) connected over unlimited distance to Site B (active DB2, where Q Apply agents perform parallel SQL transaction replay into the user tables, with configuration and monitoring control tables).]
Q Replication also supports several N-node topologies, with conflict detection/resolution.
Comparing Q Replication with GDPS Solutions for DB2 z/OS Offsite Disaster Recovery

GDPS/XRC (z/OS Global Mirror technology) and GDPS Global Mirror:
  – Requirements covered: Disaster Recovery (site)
  – Replicated objects: disk volumes
  – Scope of consistency: related disk updates
  – Technology: asynchronous disk replication with volume group consistency (Sysplex needed to timestamp I/O)
  – Automation and management: Tivoli System Automation, NetView; fully automated solution with minimal manual intervention
  – Platforms: z/OS, zLinux, and zVM; GDPS control system on z/OS, but any data for Global Mirror
  – Configurations: ACTIVE/STANDBY
  – Data loss (RPO): < 3 seconds
  – Time to recover (RTO): < 1 hour; restart the workload to release shared resources
  – Distance: unlimited
  – Impact on application response time: no impact (asynch)
  – CPU overhead: XRC System Data Movers run on the STANDBY; zIIP-enabled

GDPS/PPRC (Metro Mirror technology):
  – Requirements covered: Disaster Recovery (site), Continuous Availability (site)
  – Replicated objects: disk volumes
  – Scope of consistency: related disk updates
  – Technology: synchronous (Metro Mirror) disk replication with volume group consistency for Parallel Sysplex, zVM, and zLinux images
  – Automation and management: Tivoli System Automation, NetView; fully automated solution with minimal manual intervention
  – Platforms: z/OS; GDPS control system on z/OS, but any data for Metro Mirror
  – Configurations: ACTIVE/ACTIVE (z/OS-only), ACTIVE/STANDBY
  – Data loss (RPO): 0 (sync)
  – Time to recover (RTO): A/S < 1 hour (IPL z/OS, release shared resources, restart apps); A/A < a few minutes
  – Distance: 10s of kms (possibly up to 100 kms *)
  – Impact on application response time: yes (synch); increases with distance, 1 ms/100 kms
  – CPU overhead: negligible (hardware)

Q Replication:
  – Requirements covered: Disaster Recovery (DB2), Continuous Availability (DB2), Information Integration (DB2)
  – Replicated objects: relational tables
  – Scope of consistency: DB2 database transactions
  – Technology: asynchronous replay of database transactions using SQL
  – Automation and management: Replication Center, web-based Replication Dashboard; manual takeover
  – Platforms: z/OS and LUW
  – Configurations: ACTIVE/ACTIVE
  – Data loss (RPO): < 3 seconds
  – Time to recover (RTO): < 30 seconds (detect timeout and redirect)
  – Distance: unlimited
  – Impact on application response time: no impact (async)
  – CPU overhead: 2-10% at the source database; 10-20% of the replicated IUD at the target

Comparing Q Replication with HADR for DB2 LUW Offsite Disaster Recovery

HADR (log replay):
  – Requirements covered: Disaster Recovery (DB2 LUW)
  – OS platforms supported: LUW ESE only
  – Source/target mismatch, HW/middleware: similar hardware and software only
  – Source/target mismatch, data: identical DB2 data at each site
  – Support for DB2 Data Partitions (warehouse): no
  – Objects replicated: the entire database
  – Operations replicated: all logged operations (both DML and DDL); non-logged operations not replicated (e.g., LOBs, LOAD)
  – Configurations: ACTIVE/STANDBY (2)
  – Scope of consistency: DB2 database at a point in time
  – Technology: log page shipping and database rollforward; nearsync or sync (for DR usage; async also exists)
  – Administration and automation: DB2 Control Center, DB2 Data Studio; can be enabled with TSA or HACMP; manual intervention
  – Administrative overhead: low; one database to administer, all logged operations are replicated
  – Data loss objective (RPO): 0 (in peer state)
  – Recovery time objective (RTO), detection and failover: 30 seconds
  – Distance between sites: shorter distances (100s of kilometers); each 100 kms adds about 1 ms to response time
  – Impact on application response time: yes (with sync or nearsync, for DR usage)
  – Upgrade to a new DB2 release: requires an application outage

Q Replication (transaction replay):
  – Requirements covered: Disaster Recovery (DB2), Continuous Availability (DB2), Information Integration (DB2)
  – OS platforms supported: z/OS and LUW, including pureScale and DPF (ISAS)
  – Source/target mismatch, HW/middleware: sites can be very different (e.g., OS, DB2 versions, codepages, etc.)
  – Source/target mismatch, data: transformations allowed
  – Support for DB2 Data Partitions (warehouse): yes
  – Objects replicated: selected tables/columns/operations
  – Operations replicated: DML operations; non-logged LOBs OK; no DDL except ADD COLUMN; can trigger a refresh when a LOAD is detected
  – Configurations: ACTIVE/ACTIVE configurations
  – Scope of consistency: DB2 database transactions
  – Technology: asynchronous replay of database transactions using SQL
  – Administration and automation: Replication Center, web-based Replication Dashboard; manual takeover
  – Administrative overhead: higher; requires administration of each active database
  – Data loss objective (RPO): < 1-3 seconds
  – Recovery time objective (RTO), detection and failover: 2-5 seconds
  – Distance between sites: unlimited
  – Impact on application response time: no impact (async)
  – Upgrade to a new DB2 release: no application outage; can skip versions (V8.2 to V9.7)

(2) ACTIVE with read-only on standby is available as of V9.7 FP1. You can query the standby in UNCOMMITTED READ only; no locks are obtained at the secondary with HADR.

Customer Requirements for Continuous Availability
1. Utilize all hardware
   – Cannot afford to have hardware sitting in standby mode
2. Immediate failover (RTO near 0)
   – One hour of downtime during business hours may translate to millions of dollars in losses for some enterprises
3. Zero-downtime (online) maintenance
   – Eliminate any downtime for maintenance, system and application upgrades
4. Ability to replicate only a subset of the transactions
   – Replicate data that is critical for Business Continuity, not necessarily all data
5. Remote recovery site
   – Particularly for financial institutions, to comply with US Government recommended practices (1)
(1) "Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System" [Docket No. R-1128] (April 7, 2003): "Back-up …"

Addressing Customer Requirements for Continuous Availability
• Requires a shift in strategy:
  – From a local to a remote recovery site
  – From 'copy all' to copying the data that is critical for core business operations
  – From failover to Active/Active
• Q Replication technology meets these requirements for DB2
• Often combined with other technologies, e.g., disk mirroring for site protection + Q Replication for DB2 applications

InfoSphere Replication Server Product Offering
• The InfoSphere Replication Server product offers 2 technologies:
  1. SQL Replication (formerly DB2 DataPropagator), first released in 1994
     • Captured changes are staged in relational tables and delivered over database connections
  2. Q Replication (including Q Data Event Publishing), first released in 2004
     • Captured changes are staged and delivered via WebSphere MQ
• Common code, technology, and interfaces for z/OS and LUW, for both SQL and Q Replication
  – Some functions are only available in Q Replication today (e.g., XML data, Replication Dashboard)
• Packaged free-of-charge with Replication Server (restricted licenses): WebSphere MQ, InfoSphere Federation Server, and DB2 (on distributed only)
• The DB2 LUW Homogeneous Replication Feature, for DB2 <--> DB2 replication
  – Contains both Q and SQL Replication, but no Federation Server and no Event Publisher capability
  – Included free-of-charge for 2 sites with DB2 Advanced Enterprise Server Edition (AESE)
• Current versions: LUW V9.7 FP3, released Sept/2010; z/OS V10, released Oct/2010
  – Any version can run with a down-level DB2 (e.g., Replication V9.7 with DB2 V8)
• SQL Replication is particularly suited for very large (100s of servers) consolidation/distribution
• Q Replication is the best-of-breed technology for Continuous Availability and high performance

InfoSphere Replication Server Today
• High performance and linear scalability: Q Replication can replicate 1000s of tables, 100,000s of changes per second, over distances of 1000s of miles, with sub-second end-to-end latency (from DB2 commit at the source to DB2 commit at the target), as reported by our customers.
• Continuous operations: add/remove tables to a live replication configuration; add columns or ALTER column types for a table already being replicated; automatically re-initialize a subscription when a source table is LOADed; perform maintenance (e.g., REORG of a table at the target). All without interruption of service (no application outage, no replication suspend) and while tolerating hardware, software, and network outages.

How Customers use Replication Server
Replication Server is used by several thousands of customers worldwide; it has its strongest presence in the Financial and Insurance sectors, as well as in the Retail, Manufacturing, Services, and Government sectors. Most customers use Replication Server for several of the following scenarios:
1. Live Reporting
   – Offload queries away from the OLTP server to improve response times
2. Data Distribution and Consolidation
   – Multiple locations (e.g., hundreds of points of sale, or business branches)
   – Often bidirectional, with an assumption of no conflicts (e.g., roll up sales, distribute price lists)
3. Feeding the Warehouse
   – DB2 --> [staging] --> ETL
   – With row-level transformations (set-level transformations and integration done by the ETL application)
4. Continuous Availability and Disaster Recovery
   – Workload distribution (read/write possible at each site)
   – Immediate failover capability for Business Continuity
5. Upgrades and Migrations, without downtime (High Availability)
   – Hardware or software upgrades, including upgrades that require database schema changes

SQL Replication
• Formerly known as DPropR or Data Propagator
• Staging is in relational tables; publish once for N subscribers
• Comes free of charge with the purchase of DB2 distributed (for DB2 homogeneous replication only)
• Highly scalable architecture for large distribution/consolidation topologies (hundreds of databases)
[Diagram: SQL Capture reads the database recovery log at the source server (DB2 z/OS, DB2 distributed, DB2 DPF, iSeries) into staging tables; SQL Apply delivers changes over SQL connections to DB2 targets (z/OS, iSeries, distributed) and, through Federation Server, to non-DB2 targets (Informix, Oracle, Sybase, SQL Server, Teradata). Common admin GUI, APIs, and commands.]

Q Replication
• Designed for high throughput, low latency, and continuous operations
  – The parallel Capture program publishes captured database transactions as compact MQ messages; only the data is sent
  – The Apply program applies non-dependent transactions in parallel using multiple database agents
  – The Capture/Apply programs run independently and can be individually stopped/restarted at any time
• Staging/transport over WebSphere MQ
  – Highly resilient and reliable; staging allows for moving data off the DB2 log quickly, even if the target database is down
  – Each captured database transaction is published in an MQ message (messages are sent at each commit_interval)
  – Less CPU at the source compared to SQL Capture (3x less, measured for some workloads)
[Diagram: Q Capture reads the database recovery log at the source server (DB2 z/OS, DB2 distributed, DB2 DPF, pureScale; also Oracle sources) and publishes to send queues on a WebSphere MQ queue manager (or client on LUW); MQ channels deliver to receive queues at the target servers, where Q Apply agents apply changes to DB2 (z/OS, distributed, DPF, pureScale) and, through Federation Server, to non-DB2 targets (Informix, Oracle, Sybase, SQL Server, Teradata); admin and restart queues support administration; the same infrastructure can feed a data warehouse via ETL with Data Event Publishing. Common admin GUI, APIs, and utilities.]
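The architecture above is driven by Q subscriptions. As a minimal, hedged asnclp sketch of defining one unidirectional subscription (the database aliases SRCDB/TGTDB, queue map QMAP1, subscription name ORDERS0001, and table AIRUKU.ORDERS are hypothetical; the CREATE QSUB USING REPLQMAP form is the one shown in the tooling slides later in this deck, but verify the exact clause syntax for your version):

    ASNCLP SESSION SET TO Q REPLICATION;
    SET SERVER CAPTURE TO DB SRCDB;
    SET SERVER TARGET TO DB TGTDB;
    SET CAPTURE SCHEMA SOURCE ASN;
    SET APPLY SCHEMA ASN;
    CREATE QSUB USING REPLQMAP QMAP1 (SUBNAME ORDERS0001 AIRUKU.ORDERS);

Because changes are staged in MQ, such a subscription can be added while replication for the other tables continues, per the continuous-operations bullet above.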
How Q Apply Delivers Scalable Performance
• Patented technology that is best-of-breed in the industry
• Q Apply can match the level of parallelism of the applications at the source database
• Non-dependent transactions are applied and committed in parallel
  – Using up to num_apply_agents connections to DB2; an agent never waits for any other agent before it commits (see the sample start command below)
  – Any 'monster' transaction, i.e., larger than memory_limit, is applied by the browser (temporarily suspending parallelism)
  – On z/OS, batches can be limited to a maximum number of agents with max_agent_correlid, to prevent page lock contention
  – The DONEMSG table keeps track of applied messages; MQ messages are deleted asynchronously. No two-phase commit with MQ
• Dependent transactions are applied in source commit order
  – Q Apply keeps track of the DB2 unique indexes, the replication key, and the constraints for all subscribed tables, and uses them to detect dependencies between transactions
[Diagram: inside the Q Apply (asnqapp) program, a browser thread reads the MQ receive queue(s) into a work queue, tracking table and transaction dependencies (parent/child); an agent pool (NUM_APPLY_AGENTS, MAX_AGENT_CORRELID, MEMORY_LIMIT) performs highly parallel transaction replay over DB2 connections into the user tables; done and retry queues plus a pruning thread delete applied messages from the DONEMSG control table.]

Parallel Q Apply Effect on Latency
• With DB2 commits in source order, latency increases with the workload
• With DB2 commits in parallel (same workload, but with parallelism enabled), the positive effect is that latency drops immediately
• Easily monitored: Q Replication Dashboard; historical statistics maintained in tables
• Watch the video demo on ChannelDB2.com

Single Capture for Multiple Configurations
• A single Capture reads the DB2 log once for multiple replication configurations, e.g.:
  – Bi-directional replication between two DB2 z/OS systems, for workload balancing and failover in case of a disaster
  – Unidirectional replication from DB2 z/OS to DB2 LUW, for live reporting
• Several other targets and configurations are possible with the same Capture program
  – Including feeding a warehouse with Data Event Publishing, and replicating from DB2 to non-DB2 targets
• Reduces CPU overhead on the server
• Multi-system databases
  – DB2 z/OS Parallel Sysplex with data sharing
  – InfoSphere Warehouse with Data Partitions (DPF)
  – DB2 pureScale
  – Only one Capture/Apply is needed per cluster; each can run on any of the members
[Diagram: Mainframe Site 1 and Mainframe Site 2 connected by bi-directional replication over WebSphere MQ (send, receive, admin, and restart queues), with the same Q Capture at Site 1 also feeding uni-directional replication to a Q Apply on a distributed server at Site 3.]
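On LUW, the Q Apply parallelism parameters described above are passed when starting the program. A hedged example (SAMPLEDB and the parameter values are placeholders, not recommendations):

    asnqapp apply_server=SAMPLEDB apply_schema=ASN num_apply_agents=16 memory_limit=64

Raising num_apply_agents increases the number of DB2 connections replaying non-dependent transactions in parallel; memory_limit (in MB) bounds how large a transaction can be before the browser applies it serially.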
Q Replication on pureScale
• Q Replication already has a long-established history of providing Continuous Availability for DB2 z/OS data sharing across long distances
• Q Replication is IBM's Disaster Recovery solution for pureScale 9.8
  – It is packaged with pureScale; the user only needs to run db2licm to enable the replication license (after purchasing replication)
• Scalable performance: Q Replication scales as members are added to a data sharing cluster
• Configuration: one Capture and one Apply per data sharing group; each can run on any member

Continuous Operations
• Q Apply can automatically perform initial table loads while replication continues for all tables already being replicated
  – Q Apply connects back to the source database to fetch the data (load from cursor)
  – Changes continue against the source table; they are staged at the target in a spill queue during the load
• Columns can also be added or altered for a table already being replicated, without any interruption
• The administrator can independently start/stop/resume Capture or Apply for a single queue or a single table for maintenance, e.g., to add an index, run runstats, or reorg a table
• Applications continue using the table being added (or modified) for replication; existing tables continue to be replicated
[Diagram: Site A and Site B connected over a network of 1000s of miles; captured transactions from the log flow through send and receive queues; while a new table is being loaded or is under maintenance, its changes spill into SPILL queue(s) at the target (during LOAD of a new copy, or on a SPILLSUB command).]

Initial Load of Target Tables
Transfer existing data and ongoing changes against the source table to the new target table, without impacting other tables already being replicated or suspending writes to this table.
• Q Apply LOAD choices:
  – Internal: Q Apply automatically loads the table when a subscription is started
    • Load from cursor (DSNUTILS on z/OS and LUW)
    • Export/Import or Export/Load, LUW only (not suitable when LOB or XML columns are involved in a remote configuration)
  – External: the user loads the target table with the LOAD utility of her choice
    • The user tells replication before invoking LOAD and after LOAD is done: issue CAPSTART, load the table, issue LOADDONE (see the SIGNAL sketch below)
  – NO LOAD: changes are immediately applied at the target
    • Useful when tables are created from a backup image

How Q Apply Internal LOAD works
• Q Apply connects back to the source database to fetch the data
• Changes occurring at the source are captured and spilled at the target into an automatically created MQ queue while the table is being loaded
• Q Apply deals with referential integrity and check constraints during the load process
  – Referential integrity constraints are dropped during the LOAD and restored when all related tables have finished loading
  – Q Apply invokes CHECKDATA once all tables in a group are loaded
• The subscription state is set to ACTIVE (A) once the load is done
• Several tables can be loaded in parallel (default 15); Q Apply throttles LOAD utility scheduling (it is OK to start all subscriptions at once)
• Q Capture can automatically trigger a LOAD at the target when data is LOADed at the source using a LOAD utility
• Best practice: to reduce spilling of changes during the load of very large tables that are heavily updated, start subscriptions a few at a time
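For the external load choice above, CAPSTART and LOADDONE are signaled by inserting rows into the Q Capture SIGNAL control table at the source. A hedged SQL sketch, assuming the default ASN Capture schema and a hypothetical subscription named ORDERS0001:

    INSERT INTO ASN.IBMQREP_SIGNAL
        (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
      VALUES (CURRENT TIMESTAMP, 'CMD', 'CAPSTART', 'ORDERS0001', 'P');
    -- load the target table with the utility of your choice, then:
    INSERT INTO ASN.IBMQREP_SIGNAL
        (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
      VALUES (CURRENT TIMESTAMP, 'CMD', 'LOADDONE', 'ORDERS0001', 'P');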
Tooling and Administration
• Graphical user interfaces:
  – DB2 Replication Center: configuration and operations; generates SQL scripts
  – Q Replication Dashboard: runs from a web browser; multi-site monitoring, troubleshooting, tuning, and reporting
• Scripting for configuration
  – asnclp: interactive mode or script, e.g.:
      C:\> asnclp
      REPL > CREATE QSUB USING REPLQMAP ...
      REPL > CREATE SUBSCRIPTION SET SETNAME ...
      REPL > CREATE MEMBER IN SETNAME ...
• Commands for operations
  – Commands for controlling replication are issued to the Capture and Apply started tasks:
    • with the asnqccmd and asnqacmd programs, e.g.: asnqacmd apply_server=qtest status show details
    • with a MODIFY command on z/OS, e.g.: F jobname,status show details
• Utilities
  – Check table integrity: asntdiff
  – Display queued messages: asnqmfmt
  – Detect alerts and send email: asnmon

asnclp scripting (some recent V10 enhancements)
• Subscriptions can be created for many tables with just two statements
  – 'LIKE' allows wildcards and can be used for the owner (schema) and the table name
  – Sample script (a single script sets up both sites for bidirectional replication):
      ASNCLP SESSION SET TO Q REPLICATION;
      SET RUN SCRIPT NOW STOP ON SQL ERROR OFF;
      SET BIDI NODE 1 SERVER DBALIAS SOURCEDB;
      SET BIDI NODE 2 SERVER DBALIAS TARGETDB;
      SET TABLES (NODE 1 SRC OWNER LIKE "AIRUKU%");
      CREATE QSUB SUBTYPE B;   <--- creates all your subscriptions
• Create all MQ objects with a single command
  – Queue manager, queues, channels; run on each system where MQ is needed:
      CREATE MQ SCRIPT RUN NOW

A few Q Replication V10 enhancements that are important for Active/Active
• Improved schema evolution:
  – Automatically replicate ALTER COLUMN DATA TYPE (available in both SQL and Q Replication)
  – Automatically replicate ALTER TABLE ADD COLUMN (in the past, this required issuing a command to add the column to the subscription)
• STOP/START of individual Q Capture send queues
  – Useful when a target is down: Capture continues sending changes on all the other queues, and resumes where it left off for this queue on STARTQ
  – In the past, stopping a queue was only possible at the Q Apply side
• Controlled stop of Q Capture for planned maintenance, e.g.:
      asnqccmd SRCDB stopq=Q1 captureupto='2010-06-05-13.00.00' after=data_applied
      ASN7304I "Q Capture" : "ASN" : "WorkerThread" All ...

Monitoring Performance and Operations
• Several dozen data elements are logged by both the Capture and Apply programs into DB2 tables at their respective monitor_interval; they report performance, status, operational, and troubleshooting information
• Data remains available for days or more (until pruned as per monitor_limit). A few examples:

  IBMQREP_CAPQMON, at the SOURCE database (one row per send queue):
    ROWS_PUBLISHED    Total number of captured rows put to MQ by Q Capture
    TRANS_PUBLISHED   Total number of captured transactions put to MQ by Q Capture

  IBMQREP_APPLYMON, at the TARGET database (one row per receive queue):
    OLDEST_TRANS      All transactions prior to this timestamp have been applied
    ROWS_APPLIED      Total number of rows applied to the target database
    QDEPTH            Number of messages currently in the MQ receive queue
    END2END_LATENCY   Elapsed time in milliseconds between the DB2 commit time at the
                      source and the DB2 commit time at the target (the source commit
                      time is from the DB2 commit log record; the target commit time is
                      obtained by taking a timestamp after committing the transaction in DB2)
    Broken down by component:
    CAPTURE_LATENCY   Time spent in Capture
    QLATENCY          Time spent in MQ (over the network)
    APPLY_LATENCY     Time spent in Apply
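A hedged example of querying these counters at the target, assuming the default ASN Apply schema (MONITOR_TIME is the interval timestamp column; verify column names against the control tables for your version):

    SELECT MONITOR_TIME, RECVQ, END2END_LATENCY, QDEPTH, ROWS_APPLIED
      FROM ASN.IBMQREP_APPLYMON
      ORDER BY MONITOR_TIME DESC
      FETCH FIRST 10 ROWS ONLY;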
• Latency counters, such as END2END_LATENCY, report an average for the monitor interval
  – If no transactions are applied in the monitor interval, the latency is 0; intervals are reported in milliseconds
• Counters are reset at each monitor interval

Q Replication Dashboard
• A global live view: health and status of programs, queues, subscriptions, and topology
  – For both DB2 z/OS and LUW, and Oracle
• With aggregation: rolled-up status of queues, subscriptions, latency, and exceptions
• Drill-down: the details behind the status
• Access to historical data: reports for any time period
• Support for alerts: view exceeded thresholds
• Operations: start/stop/reinit a subscription; start/stop a send queue; synchronize an external LOAD
• And user roles for access control

Dashboard Performance Advisor
[Screenshots: the Q Replication Dashboard makes performance recommendations based on historical data, and explains them.]

Documentation and Learning Aids
• Extensive documentation available online, integrated with the DB2 Information Centers (z and LUW)
• Comprehensive set of Redbooks, tutorials, whitepapers, samples, and classroom education
• Visit the Q Replication Roadmap; from your browser, search "Q Replication Roadmap":
  http://www.ibm.com/developerworks/data/roadmaps/qrepl-roadmap.html
• Visit IBM developerWorks for samples and tutorials
  – Replication group: www.ibm.com/developerworks/software/exchange/repl
  – Download samples from IBM and fellow users, or share your own code
  – Blogs, message boards, useful links and answers; read or contribute
  – Subscribe to updates; tag, comment on, and recommend
• Visit ChannelDB2 for videos, presentations, and expert advice
  – http://www.channeldb2.com/video/video/listForContributor?screenName=SideshowDave

Deploying an Active/Active Configuration

A Case Study: IBM Jam (DB2 Distributed)
https://www.collaborationjam.com/
Challenge
• The Minijam project, IBM Jam, serves as a catalyst for collaborative innovation™ and culture change within client organizations, opening new business relationships outside of the traditional CIO/CTO office
• IBM Jam required systems that would help in speedy recovery and alleviate the Jam registration process
• The solution also required conflict detection between sites when one was brought up after regular maintenance
• IBM Jam runs IBM's DB2 data server on IBM's AIX operating system
Solution
• IBM InfoSphere Replication Server was selected for its low latency and its ability to provide three-site data synchronization with conflict detection and resolution
Business Benefits
• Conflict detection and resolution, along with low latency and high throughput, helped resolve the Jam registration issues
• Minimizing latency has further improved the user experience of getting all the comments on a particular topic in a timely manner
• Find out more about IBM Jam at http://www.ibm.com/ibm/jam/

The Events/IBM.com Infrastructure
• Highly redundant infrastructure for web hosting
  – Achieves 99.99999% availability
  – 3 DB2 AIX databases in an active/active configuration, synchronized with Replication Server
• Reference: R. R. Scadden, R. J. Bogdany, J. W. Clifford, H. D. Pearthree, and R. A. Locke, "Resilient hosting in a continuously available virtualized environment," IBM Systems Journal, Volume 47, Number 4, 2008. http://www.research.ibm.com/journal/sj47-4.html
A Case Study: Banco do Brasil (DB2 z/OS)
Challenge
• Excessive load on the production system, due to the combination of an increasing OLTP workload and query reporting applications
• Two sysplexes with unbalanced workloads
• Need for a workload balancing solution with low impact on existing applications
• Critical and non-critical applications sharing state-of-the-art infrastructure resources
Solution
• InfoSphere Replication Server's Q Replication
• Use the lightly loaded secondary system to offload query reporting applications, and separate critical and non-critical applications into different DB2 data sharing groups
• Replicate thousands of tables to the secondary system, using uni- and bidirectional replication to integrate the application data
Business Benefits
• Improved application and data availability
• Reduction of unplanned application outages
• Improved SLAs with the lines of business
• Focus infrastructure investments on critical business applications
Cost Benefits
• Better sysplex utilization
• No need for application redevelopment; minimal changes required

Banco do Brasil (DB2 z/OS) Configuration Details
• Scenario: workload balance
• DB2 z/OS to DB2 z/OS, V9 CM
• Configurations: bidirectional and uni-directional
• Number of queues: 80
• 4 different DB2 data sharing groups involved
• 6 Q Captures, 7 Q Applies
• Number of table subscriptions: over 2000
• Replicates over 500 million rows/day at peak periods

Active/Active with Q Replication
Using transaction replay technology:
• Sites can be asymmetric, with different capacities, and at different HW/OS/SW levels
• A workload can be immediately redirected for a disaster or planned maintenance
• Both sites are active
  – Replication is bi-directional, ready for fallback after a failover, from any direction
  – Even if either site is used as read-only or standby for all or a subset of the workloads
  – A routing decision determines the configuration, e.g., West Coast customers to Site A and East Coast customers to Site B, until failover
[Diagram: Site A (West Coast OLTP and query reporting workloads) and Site B (East Coast OLTP workload), each with DB2, Q Capture, and Q Apply, exchanging changes through WebSphere MQ send/receive queues across a network of 1000s of kilometers with sub-second latency.]

Database Constructs that sometimes need attention for Active/Active
• Database SEQUENCES and IDENTITY columns
  – Ideally, segregate between sites, e.g., use an ODD/EVEN scheme (see the sketch after this list)
  – Otherwise, e.g., if the increment cannot be altered:
    • Active/Standby: ALTER the sequence/identity to start from MAX+1 at failover
    • Active/Active: ALTER the sequence/identity to start with a different range (e.g., 0->maxint/2 and maxint/2->maxint)
  – A unique constraint on identity columns allows Q Replication to detect overlaps (follows conflict_action)
• Tables with no unique index
  – If possible, create one at the target to allow automatic load
• Triggers
  – Good practice to drop them during fallback, for faster re-synchronization
  – Exclude triggered statements from replication if the triggered table is replicated
• Immediate/deferred MQTs
  – Refresh them at failover
• GENERATED ALWAYS columns
  – Must be altered at the target to GENERATED WITH DEFAULT
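A hedged sketch of the ODD/EVEN scheme mentioned above, using standard DB2 identity and sequence ALTER syntax; the ORDERS table, ORDER_ID column, and ORDER_SEQ sequence are hypothetical:

    -- Site A issues odd values:
    ALTER TABLE ORDERS ALTER COLUMN ORDER_ID RESTART WITH 1 SET INCREMENT BY 2;
    ALTER SEQUENCE ORDER_SEQ RESTART WITH 1 INCREMENT BY 2;
    -- Site B issues even values:
    ALTER TABLE ORDERS ALTER COLUMN ORDER_ID RESTART WITH 2 SET INCREMENT BY 2;
    ALTER SEQUENCE ORDER_SEQ RESTART WITH 2 INCREMENT BY 2;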
Conflict Detection and Resolution with Q Replication
• Q Replication will detect, resolve, and report conflicts
• Usually not an operational strategy; rather, a way to keep things going until what caused the conflict is understood and the conflict resolution is verified
  – Special case: conflicts are expected during fallback after a failover, for stranded changes that are finally delivered
• Handling conflicts at the row level minimizes the impact of conflicts by "merging" transactions for non-conflicting changes. One conflict in a transaction doesn't invalidate the whole transaction, which could otherwise result in more conflicts later on
• 'Bidirectional' configuration: force or discard the conflicting row
• 'Peer-to-peer' configuration: apply the row with the most recent timestamp

Conflict Detection and Resolution with Q Replication (details)
1. 'Bidirectional' mode - detection by value; resolution Force/Ignore or Ignore/Ignore
   – 2, or up to tens of, sites connected through a central site: a hub-and-spokes model (or tree topology), e.g., A: master; B, C: master or slave
   – Conflict detection: compare the current values at the target with the before values carried over from the source; check the key, changed columns only, or all columns
   – Possibility of merging updates made from different sites to different columns of the same row
   – Conflict action: Force or Ignore (merging updates is not a conflict); the designated site wins in case of conflict; master/master is OK
   – Recapture avoidance: Apply inserts a row into the SIGNAL table with each transaction it applies. If there are only 2 sites, Capture IGNORE TRANSACTION can be used for less overhead
2. 'Peer-to-peer' mode - detection by version and origin; resolution time-based
   – Fully connected topology: each node sends data to each other node (practical limit 3 to 5 sites, e.g., peers A, B, and C)
   – Conflict detection: requires 2 extra columns and triggers on the replicated tables to set the timestamp and node number, and to mark the deletes made by Q Apply. Handles out-of-order arrivals via alternate routes (e.g., a delete at A arrives at C before the insert at B arrives at C, for the same row)
   – Conflict action: the most recent change overwrites the previous change
   – Recapture avoidance: a trigger marks the row with the origin of the change; only local changes are captured
Resolving conflicts at the row level minimizes the impact of conflicts by "merging" transactions for non-conflicting changes. One conflict in a transaction doesn't invalidate the whole transaction, which could result in more conflicts later on. Resolved conflicts are reported in the EXCEPTIONS table; each conflict is reported only once.

How Replication Conflicts are Reported
• Resolved conflicts are recorded in the Q Replication table IBMQREP_EXCEPTIONS; the Q Replication Dashboard displays them, with hyperlinks to the Information Center
• Example of the TEXT column content:
    UPDATE "QALLTYPE"."ALLTYPE2U_trg" SET "VCHAR1" = "UPD1" WHERE "i1" = 2
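A hedged way to review recent resolved conflicts, assuming the default ASN Apply schema (column names should be verified against your version's control-table documentation):

    SELECT EXCEPTION_TIME, RECVQ, SUBNAME, REASON, SQLCODE, TEXT
      FROM ASN.IBMQREP_EXCEPTIONS
      ORDER BY EXCEPTION_TIME DESC;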
Failover for Planned Maintenance
1. Stop the applications at Site A; wait for replication to replicate all changes to Site B (usually a few seconds). Use 'stop after=data_applied'
2. Reroute the applications to Site B
3. Perform maintenance at Site A (Q Capture and Q Apply there are temporarily stopped)
   – Q Capture continues to capture changes at Site B during the maintenance of A and stages them in the transmit or receive queue, ready for a faster restart after the maintenance is completed; it also allows the DB2 logs to be archived at Site B
No downtime for maintenance.
[Diagram: the East Coast and West Coast workloads both run against Site B while DB2 at Site A is being upgraded; the WebSphere MQ queues at both sites hold the staged changes.]

Failover for a Disaster
• For a disaster, applications can be immediately rerouted to Site B
  – On LUW, DB2 client reroute can be used
• Some changes committed at the source DB2 (and acknowledged to the application) are not yet committed at the target DB2:
  1. Transactions committed to DB2, but not yet captured from the DB2 log
  2. Transactions captured and written to MQ, but not yet transmitted
• After a disaster at Site A, some transactions are stranded. They will be delivered if and when Site A can be restarted. Applications are unaware of the failure
[Diagram: Site A fails; the East Coast and West Coast workloads are rerouted to Site B; the two categories of stranded transactions remain at Site A.]

Re-synchronizing DB2 after a Failover
No application interruption is required when bringing a site back after an outage, planned or unplanned. Q Capture can continue capturing changes at the active site and stage them in MQ for a faster restart. The method depends on the duration of the outage (hours, days, or weeks):
1. Restart replication (outage of hours)
   – The usual method; can be used perhaps 95% of the time (see the sample commands below)
   – Load balancing to both servers can be resumed as soon as replication catches up
2. Reload tables (outage of days)
   – Reload the tables at the failed site by exporting the data from the online site
     • Can be done automatically by Q Replication, using the fastest available method: LOAD from CURSOR, UNLOAD/RELOAD. Many loads run in parallel (up to 16 tables)
     • Simply reinitialize the subscriptions with one command
   – Also used as a way to overwrite changes stranded at the inactive site --> no conflicts
3. Restore the database (outage of weeks)
   – Take an online backup of the active site and restore the image on the site to be restored
   – A technique that can also be used for deploying a new site the first time
   – More steps than restart or reload: the replication subscriptions need to be recreated on the restored system (the backup image overwrites the replication control tables)
After an unplanned outage:
• Some transactions might be left stranded on the site that was down (the inactive site)
• Some (but not all) of the stranded transactions might have been re-executed on another site
• Q Replication provides conflict detection/resolution/reporting to keep replication going, minimizing downtime
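For method 1 (restart replication), restarting the programs at the recovered site on LUW might look like the following hedged sketch; SITEADB and the ASN schema are placeholders, and startmode=warmsi tells Q Capture to warm-start from its restart information rather than performing a cold start:

    asnqcap capture_server=SITEADB capture_schema=ASN startmode=warmsi
    asnqapp apply_server=SITEADB apply_schema=ASN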
Advantages of Q Replication for DB2 Disaster Recovery
• Read and write possible at each server; utilize all hardware
• Unlimited distance
• Asymmetric sites: can use different hardware, OS, DBMS, codepages and data formats, and software versions
• No impact on application response time (asynchronous capture)
• Low latency (sub-second) and high throughput (100,000 rows/sec) achievable for 1000s of tables
• RTO of sub-second: the DB2 workload can be rerouted immediately when a failure is detected
  – No shared resources to release; no file system to mount
  – Data is transactionally consistent at the secondary site
• Online upgrade and maintenance
  – Reroute the applications to another site while performing maintenance or upgrades
  – Can add new workloads to replicate without any outage and without rerouting
  – Can replicate ALTER TABLE ADD COLUMN and ALTER TYPE (e.g., for upgrades)
• The same replication infrastructure can also be used for live reporting, feeding the warehouse, and information integration

Do not Miss!
• Monday, 10:15-11:15a, 2027A: Zero Downtime Upgrades Using QREP at Constant Contact, Inc
• Wednesday, 3:15-4:15p, 3728A: Making Data More Highly Available at American Express
• Thursday, 8:15-11:15a, 1568A: Migrating Data between DB2 and Oracle Using IBM InfoSphere Replication Server Bidirectional Replication
• And during InfoSphere Demo Room hours, Pedestal #27: Q Replication in InfoSphere HA Solutions without Hardware Limitations

A Special Invitation for System z Attendees

Information and Analytics Communities
• Online communities, user groups, technical forums, blogs, social networks, and more
  – Find a community that interests you at ibm.com/software/data/community
• Information Champions
  – Recognizing individuals who have made the most outstanding contributions to Information Management communities: ibm.com/software/data/champion

Thank You! Your Feedback is Important to Us
• Access your personal session survey list and complete it via SmartSite
  – From your smart phone or web browser at iodsmartsite.com
  – At any SmartSite kiosk onsite
  – Each completed session survey increases your chance to win an Apple iPod Touch in the daily drawing sponsored by Alliance Tech

Disclaimer
© Copyright IBM Corporation 2010. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM, the IBM logo, ibm.com, WebSphere, DB2, and IBM InfoSphere are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.