Active/Active with InfoSphere Replication Server: Taking DB2 DR from Failover to Continuous Availability
Session Number 1430 (TDZ-1430A)
Tue, Oct 26th 2010, 03:00 PM - 04:00 PM
Mandalay Bay North Convention Center - South Pacific A
Serge Bourbonnais
IBM Silicon Valley Laboratory
0
Abstract
• Q Replication, which is part of InfoSphere Replication Server for DB2 z/OS and for Linux, UNIX, and Windows (LUW), can replicate transactions between databases with low latency (sub-second) at very high throughputs (millions of changes per minute) and over practically unlimited distances for thousands of DB2 tables, while preserving transactional integrity and tolerating system and network outages. Each database is active, allowing for workload distribution. Databases can be configured very differently, and can reside on different types of hardware and different software versions, allowing for immediate failover during outages, as well as maintenance, upgrades, and migrations without any downtime. Replication is multi-directional and can manage conflicting database changes. A replication infrastructure can also feed a warehouse or distribute and consolidate data between large numbers of servers. InfoSphere Replication Server has been used by several thousand customers for nearly two decades now, and has allowed some customers to achieve as much as 99.99999% DB2 availability.
1
Outline
• Business Continuity
• Replication Technologies
• Functional and Technological Overview of Q Replication
• Deploying an Active/Active configuration
– Use-Cases
– Operational Procedures
– Database Constructs that may require special treatment
– Value Proposition
2
Causes of System Unavailability
1. Planned Maintenance
• System and Application Upgrades or Reconfiguration
• Scheduled during 'off-hours'
• Requires careful planning and swift execution; may need to be called off if there is a glitch
2. Component Outages
• Caused by Human Error, Software defects, Disk failure, Subsystem failure, Hardware failure, Power Grid outage
• Data is (generally) recoverable
• But changes might be stranded until the failed component is recovered or the error corrected
3. Data Corruptions
• Caused by Human Errors, Software defects
• Data is not recoverable
• Requires 'un-doing' the mistake or going back to a known-to-be-reliable version of the database
4. Disasters
• Flood, Earthquake, Fire, …, Loss of a site
Addressing each one of these causes requires a data replication strategy.
3
How Much Interruption can your Business Tolerate?
Ensuring Business Continuity:
1. Disaster Recovery (Standby)
– Restore business after a database loss
2. High-Availability
– Meet Service Availability objectives, e.g., 99.9% availability or 8.8 hours of down-time a year
3. Continuous Availability (Active/Active)
– No downtime (planned or not)
Global Enterprises that operate across timezones no longer have any 'off-hours' window. Continuous Availability is required.
What is the cost of 1 hour of downtime during core business hours?
4
Cost of Downtime by Industry (Loss per Hour)
Financial: $8,213,470
Telecommunications: $4,611,604
Information Technology: $3,316,058
Insurance: $2,582,382
Pharmaceuticals: $2,058,710
Energy: $1,468,798
Transportation: $1,463,128
Banking: $1,145,129
Chemicals: $1,071,404
Consumer Products: $989,795
Source: Robert Frances Group 2006, "Picking up the value of PKI: Leveraging z/OS for Improving Manageability, Reliability, and Total Cost of Ownership of PKI and Digital Certificates."
Establishing the System Recovery Objectives
Recovery Time Objective (RTO): How much time is needed to restore business operations?
Recovery Point Objective (RPO): How much data could we really afford to lose?
--> How far does the recovery site need to be?
5
Replication Technologies
Requirements, replicated objects, and example technologies/products:
• Maintain a site copy for Disaster Recovery
– Replicated Objects: Disk Volumes, File System
– Technology: Disk Copy, e.g., XRC (asynch), PPRC (synch) (1)
• Maintain a database copy for High Availability
– Replicated Objects: Database, Log pages
– Technology: DB Recovery, e.g., HADR (LUW only), Log Shipping
• Maintain one or more copies of critical data for Continuous Availability
– Replicated Objects: Tables, Row changes
– Technology: Transaction Replay, e.g., Q Replication
Considerations for Comparing Technologies:
• Recovery Point Objective: 0 data loss --> seconds
• Recovery Time Objective: A few seconds --> 2 hours or more
• Distance required between sites: 10s of kilometers --> Unlimited
• Hardware Utilization: Active/Standby --> Active/Active
• Recovery Requirement: Site --> Database subsystem --> Application recovery
• Impact on applications: Direct overhead (synchronous technologies) --> no impact (asynch technologies)
• CPU Overhead: Negligible (hardware, e.g., PPRC) --> Proportional to the workload (transaction replay technology)
(1) PPRC with Metro Mirror on z/OS provides a Continuous Availability solution for the mainframe
6
Q Replication
• Part of the InfoSphere Replication Server product
• A software-based asynchronous replication solution
– For Relational Databases
– For selected tables/columns/transactions/operations
– Changes are captured from the database recovery log; transmitted as (compact) binary data; and then applied
to the remote database(s) using SQL statements. Technology is transaction-replay
– Each DB2 is ACTIVE and can be used for READ/WRITE operations
• NOT disk level copy (where disk volumes are shadowed)
• NOT log shipping (where remote database is in recovery mode)
• Asynchronous = No impact on application response time
[Diagram: at Site A, Q Capture (logrdr) performs asynchronous change data capture from the active DB2's recovery log and publishes changes over WebSphere MQ, across unlimited distance between data centers, to Site B, where Q Apply agents perform parallel SQL transaction replay into the other active DB2; control tables at each site hold configuration and monitoring data.]
Q Replication also supports several N-node topologies; with conflict detection/resolution.
7
Comparing Q Replication with GDPS Solutions for DB2 z/OS Offsite Disaster Recovery
Requirements covered
• GDPS/XRC (z/OS Global Mirror) / GDPS Global Mirror: Disaster Recovery (site)
• GDPS/PPRC (Metro Mirror): Disaster Recovery (site), Continuous Availability (site)
• Q Replication: Disaster Recovery (DB2), Continuous Availability (DB2), Information Integration (DB2)
Replicated Objects
• GDPS/XRC: Disk volumes
• GDPS/PPRC: Disk volumes
• Q Replication: Relational Tables
Scope of Consistency
• GDPS/XRC: Related disk updates
• GDPS/PPRC: Related disk updates
• Q Replication: DB2 Database transactions
Technology
• GDPS/XRC: Asynchronous disk Replication with Volume Group Consistency; Sysplex (need to timestamp I/O)
• GDPS/PPRC: Synchronous (Metro Mirror) disk Replication with Volume Group Consistency for Parallel Sysplex, zVM, and zLinux images
• Q Replication: Asynchronous; replay database Transactions using SQL
Automation and management
• GDPS/XRC: Tivoli System Automation, NetView; fully automated solution with minimal manual intervention
• GDPS/PPRC: Tivoli System Automation, NetView; fully automated solution with minimal manual intervention
• Q Replication: Replication Center, Web-based Replication Dashboard; manual takeover
Platforms
• GDPS/XRC: z/OS, zLinux, & zVM; GDPS control system on z/OS, but any data for Global Mirror
• GDPS/PPRC: z/OS; GDPS control system on z/OS, but any data for Metro Mirror
• Q Replication: z/OS and LUW
Configurations
• GDPS/XRC: ACTIVE/STANDBY
• GDPS/PPRC: ACTIVE/ACTIVE (z/OS-only), ACTIVE/STANDBY
• Q Replication: ACTIVE/ACTIVE
Data Loss (RPO)
• GDPS/XRC: < 3 seconds
• GDPS/PPRC: 0 (sync)
• Q Replication: < 3 seconds
Time to Recover (RTO)
• GDPS/XRC: < 1 hour. Restart workload to release shared resources.
• GDPS/PPRC: A/S < 1 hour (IPL z/OS, release shared resources, restart apps); A/A < few minutes
• Q Replication: < 30 seconds (detect timeout and redirect)
Distance
• GDPS/XRC: unlimited
• GDPS/PPRC: 10s of kms (possibly up to 100 kms)
• Q Replication: unlimited
Impact on application response time
• GDPS/XRC: no impact (asynch)
• GDPS/PPRC: yes (synch), increases with distance, 1 ms/100 kms
• Q Replication: no impact (async)
CPU Overhead
• GDPS/XRC: System Data Movers on STANDBY, zIIP enabled
• GDPS/PPRC: Negligible (hardware)
• Q Replication: 2-10% at the source database; 10-20% of replicated IUD at the target
8
Comparing Q Replication with HADR for DB2 LUW Offsite Disaster Recovery
Requirements covered
• HADR (log replay): Disaster Recovery (DB2 LUW)
• Q Replication (transaction replay): Disaster Recovery (DB2), Continuous Availability (DB2), Information Integration (DB2)
OS Platforms Supported
• HADR: LUW ESE only
• Q Replication: z/OS and LUW, including pureScale and DPF (ISAS)
Source/Target Mismatch - HW/Middleware
• HADR: Similar hardware and software only
• Q Replication: Sites can be very different (e.g., OS, DB2 versions, codepages, etc.)
Source/Target Mismatch - Data
• HADR: Identical DB2 data at each site
• Q Replication: Transformations allowed
Support for DB2 Data Partitions (warehouse)
• HADR: No
• Q Replication: Yes
Objects Replicated
• HADR: Entire Database
• Q Replication: Selected tables/columns/operations
Operations Replicated
• HADR: All logged operations are replicated (both DML and DDL); non-logged operations are not replicated (e.g., LOBs, LOAD)
• Q Replication: DML operations; non-logged LOBs OK; no DDL except ADD COLUMN; can trigger a refresh when a LOAD is detected
Configurations
• HADR: ACTIVE/STANDBY (2)
• Q Replication: ACTIVE/ACTIVE configurations
Scope of Consistency
• HADR: DB2 Database at a point-in-time
• Q Replication: DB2 Database transactions
Technology
• HADR: Log page shipping and Database Rollforward; Nearsync or Sync (for DR usage, but async exists)
• Q Replication: Replay database Transactions using SQL; Asynchronous
Administration and Automation
• HADR: DB2 Control Center, DB2 Data Studio; can be enabled with TSA or HACMP; manual intervention
• Q Replication: Replication Center, Web-based Replication Dashboard; manual takeover
Administrative Overhead
• HADR: Low - one database to administer; all logged operations are replicated
• Q Replication: Higher - requires administration of each active database
Data Loss Objective (RPO)
• HADR: 0 (in peer state)
• Q Replication: < 1-3 seconds
Recovery Time Objective (RTO) - Detection and Failover
• HADR: 30 seconds
• Q Replication: 2-5 seconds
Distance between sites
• HADR: Shorter distances (100s of kilometers), each 100 kms adding about 1 ms to response time
• Q Replication: unlimited
Impact on application response time
• HADR: yes (use sync or nearsync for DR usage)
• Q Replication: no impact (async)
Upgrade to a new DB2 Release
• HADR: Requires an Application outage
• Q Replication: No Application outage; can skip versions (V8.2 to V9.7)
(2) ACTIVE with RO on standby is available as of V9.7 FP1. You can query the standby in UNCOMMITTED READ only; no locks are obtained at the secondary with HADR.
9
Customer Requirements for Continuous Availability
1. Utilize all Hardware
– Cannot afford to have hardware sitting in standby mode
2. Immediate Failover (RTO near 0)
– One hour of down-time during business hours may translate to millions of dollars in losses for some enterprises
3. Zero-downtime (online) Maintenance
– Eliminate any downtime for maintenance, system and application upgrades
4. Ability to replicate only a subset of the transactions
– Replicate data that is critical for Business Continuity, not necessarily all data
5. Remote recovery site
– Particularly for Financial Institutions, to comply with the US Government's Recommended Practices: "Back-up (1)
(1) "Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System" [Docket No. R-1128] (April 7, 2003)
10
Addressing Customer Requirements for
Continuous Availability
• Requires a shift in strategy:
– From Local to Remote Recovery Site
– From ‘Copy All’ to Copying data that is critical for core
business operations
– From Failover to Active/Active
• Q Replication technology meets these requirements for DB2
• Often combined with other technologies, e.g.,:
• Disk Mirroring for site protection + Q Replication for DB2 applications
11
InfoSphere Replication Server Product Offering
• The InfoSphere Replication Server product offers 2 technologies:
1. SQL Replication (formerly DB2 Data Propagator) - first released in 1994
– Captured changes are staged in relational tables and delivered over database connections
2. Q Replication (including Q Data Event Publishing) - first released in 2004
– Captured changes are staged and delivered via WebSphere MQ
• Common Code, Technology, and Interfaces for z/OS and LUW for both SQL and Q Replication
– Some functions are only available in Q Replication today (e.g., XML data, Replication Dashboard)
• Packaged free-of-charge with Replication Server (restricted licenses):
– WebSphere MQ; InfoSphere Federation Server; and DB2 (on distributed only)
• The DB2 LUW Homogeneous Replication Feature for DB2 <--> DB2 replication
– Contains both Q and SQL replication, but no Federation Server and no Event Publisher capability
– It is included free-of-charge for 2 sites with the DB2 Advanced Enterprise Server Edition (AESE)
• Current version: LUW V9.7 FP3 released Sept/2010; z/OS V10 released Oct/2010
• Any version can run with down-level DB2 (e.g., Replication V9.7 with DB2 V8)
• SQL Replication is particularly suited for very large (100s of servers) consolidation/distribution
• Q Replication is the best-of-breed technology for Continuous Availability and High-Performance
12
InfoSphere Replication Server Today
• High-Performance and Linear Scalability
– Q Replication can replicate 1000s of tables, 100,000s of changes per second, over distances of 1000s of miles with sub-second end-to-end latency (from DB2 commit at source to DB2 commit at the target) -- as reported by our customers.
• Continuous Operations
– Add/remove tables to a live replication configuration
– Add columns or ALTER column types for a table already being replicated
– Automatically re-initialize a subscription when a source table is LOADED
– Perform maintenance (e.g., REORG of a table at the target)
– Without interruption of service (no application outage, no replication suspend) and tolerating hardware, software, and network outages
13
How Customers use Replication Server
Replication Server is used by several thousand customers world-wide; it has its strongest presence in the Financial and Insurance sectors, as well as in the Retail, Manufacturing, Services, and Government sectors. Most customers use Replication Server for several of the following scenarios:
1. Live Reporting
– Offload queries away from the OLTP server to improve response times
2. Data Distribution and Consolidation
– Multiple locations (e.g., hundreds of Points of Sale, or Business Branches)
– Often bidirectional with the assumption of no conflicts (e.g., roll up sales, distribute price lists)
3. Feeding the Warehouse
– DB2 [Staging] ETL
– With row-level transformations (set-level transformations and integration done by the ETL application)
4. Continuous Availability and Disaster Recovery
– Workload Distribution (read/write possible at each site)
– Immediate Failover Capability for Business Continuity
5. Upgrades and Migrations, without downtime (High-Availability)
– Hardware or Software upgrades, including upgrades that require database schema changes
14
SQL Replication
• Formerly known as DPropR or Data Propagator
• Staging is in relational tables; publish once for N subscribers
• Comes free-of-charge with purchase of DB2 distributed (for DB2 homogeneous replication only)
• Highly-Scalable architecture for large Distribution/Consolidation topologies (hundreds of databases)
[Diagram: at the source server (DB2 z/OS, DB2 distributed, DB2 DPF, iSeries), SQL Capture reads the database recovery log and stages changes in staging tables; SQL Apply processes pull the changes over SQL connections into DB2 target servers (z/OS, iSeries, distributed) or, through Federation Server, into non-DB2 targets (Informix, Oracle, Sybase, SQL Server, Teradata); a common ADMIN GUI, APIs, and commands manage the control tables at both ends.]
15
Q Replication
• Designed for High-throughput. Low-latency and Continuous Operations
– Parallel Capture program publishes captured database transactions as compact MQSeries messages, only data is sent
– Apply program applies non-dependent transactions in parallel using multiple database agents
– Capture/Apply program run independently and can be individually stopped/restarted at any time
• Staging/Transport over WebSphere MQ
– Highly resilient and reliable, staging allows for moving data off the DB2 log quickly, even if target database is down
– Each captured database transaction is published in an MQ message (messages are sent at each commit_interval)
– Less CPU at the source compared to SQL Capture (3x less measured for some workloads)
[Diagram: at the source server (DB2 z/OS, DB2 Distributed, DB2 DPF, pureScale, or Oracle), Q Capture (logrdr) reads the database recovery log and publishes changes to WebSphere MQ send queues; MQ channels carry the messages to queue managers (or clients) at the target servers, where Q Apply agents apply them to DB2 (z/OS, Distributed, DPF, pureScale), feed a Data Warehouse via ETL with Data Event Publishing, or reach non-DB2 targets (Informix, Oracle, Sybase, SQL Server, Teradata) through Federation Server; admin and restart queues, control tables, and a common ADMIN GUI/APIs/utilities complete the picture.]
16
How Q Apply Delivers Scalable Performance
• Patented technology that is Best-of-Breed in the Industry
• Q Apply can match the level of parallelism of the applications at the source database
• Non-dependent transactions are applied and committed in parallel
– Using up to num_apply_agents connections to DB2; an agent never waits for any other agent before it commits
– Any 'monster' transaction, i.e., larger than memory_limit, is applied by the browser (temporarily suspending parallelism)
– On z/OS, batches can be limited to a max. number of agents with max_agent_correlid, to prevent page lock contention
– The DONEMSG table keeps track of applied messages; MQ messages are deleted asynchronously. No 2PC with MQ
• Dependent transactions are applied in source commit order
– Q Apply keeps track of DB2 unique indexes, the replication key, and constraints for all subscribed tables and uses them to detect dependencies between transactions
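Because num_apply_agents and memory_limit are kept in the Q Apply control tables, the parallelism settings can be inspected and tuned with plain SQL. A minimal sketch, assuming the default ASN control-table schema, that these parameters live per receive queue in IBMQREP_RECVQUEUES, and a hypothetical queue name Q1 (verify table and column names for your release; new values are generally picked up when Q Apply is restarted or the queue is reinitialized):

-- Inspect the current agent-pool sizing per receive queue (assumed columns)
SELECT RECVQ, NUM_APPLY_AGENTS, MEMORY_LIMIT
  FROM ASN.IBMQREP_RECVQUEUES;

-- Allow more parallel agents and a larger memory budget (MB) for one queue
UPDATE ASN.IBMQREP_RECVQUEUES
   SET NUM_APPLY_AGENTS = 16,
       MEMORY_LIMIT     = 64
 WHERE RECVQ = 'Q1';    -- hypothetical receive queue name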
[Diagram: inside the Q Apply (asnqapp) program, a BROWSER thread reads the WebSphere MQ receive queue(s) into an internal WORKQ (bounded by MEMORY_LIMIT), computes table and transaction dependencies (parent/child), and hands non-dependent transactions to a pool of up to NUM_APPLY_AGENTS agents (optionally capped by MAX_AGENT_CORRELID on z/OS) that replay the SQL operations over parallel DB2 connections against the user tables; applied messages are recorded in the DONEMSG control table, a PRUNING thread deletes them from the internal DONEQ and from MQ asynchronously, and a RETRY queue/agent handles operations that must be retried.]
17
Parallel Q Apply Effect on Latency
• DB2 COMMITS in source order
– Latency increases with workload
• DB2 COMMITS in parallel
– Same workload, but with parallelism enabled
– Positive effect - Latency drops immediately
• Easily monitored
– Q Replication Dashboard
– Historical statistics maintained in tables
• Watch the video demo on ChannelDB2.com
18
Single Capture for Multiple Configurations
• Single Capture reads the DB2 log once for multiple replication configurations, e.g.,:
– Bi-directional Replication between two DB2 z/OS for Workload Balancing and Failover in case of a Disaster
– Unidirectional Replication from DB2 z/OS to DB2 LUW for Live Reporting
• Several other targets and configurations possible with the same Capture program
– Including feeding a Warehouse with Data Event Publishing and replicating from DB2 to non-DB2 targets
• Reduces CPU Overhead on the Server
• Multi-System Databases
– DB2 z/OS Parallel Sysplex with data sharing
– InfoSphere Warehouse with Data Partitions (DPF)
– DB2 pureScale
– Only one Capture/Apply is needed per cluster; each can run on any of the members
[Diagram: a single Q Capture at Mainframe Site 1 feeds bi-directional replication with Mainframe Site 2 and uni-directional replication to a distributed server at Site 3; each site has its own WebSphere MQ queue manager (or client) with send, receive, admin, remote-admin, and restart queues, plus Q Apply agents and control tables.]
19
Q Replication on pureScale
• Q Replication already has a long-established history of providing Continuous Availability for DB2 z/OS Data Sharing across long distances
• Q Replication is IBM's Disaster Recovery Solution for pureScale 9.8
– It is packaged with pureScale; the user only needs to run db2licm to enable the replication license (after purchasing replication)
• Scalable Performance
– Q Replication scales as members are added to a data sharing cluster
• Configuration:
– One Capture and one Apply per Data Sharing group; it can run on any member
20
Continuous Operations
• Q Apply can automatically perform initial table loads while
– replication continues for all tables already being replicated
– Q Apply connects back to the source database to fetch the data (load from cursor)
– changes continue against the source table. They are staged at the target into a spill queue during load
• Columns can also be added or altered for a table already replicated without any interruption
• Administrator can independently start/stop/resume capture or apply for a single queue or a single table for
maintenance. e.g., add index, runstats, or reorg a table
[Diagram: applications continue using the table being added (or modified) for replication; at Site A, captured transactions from the LOG flow through Q Capture and the send queues over a network of 1000s of miles; at Site B, Q Apply loads the new table(s) (or holds them under maintenance) while changes for them are held in SPILL queue(s) during the LOAD of the new copy or on a SPILLSUB command, and existing tables continue to be replicated through the receive queues and apply agents.]
21
Initial Load of Target Tables
Transfer existing data and on-going changes against the source table to the
new target table, without impacting other tables already being replicated or
suspending writes to this table
• Q Apply LOAD choices:
– Internal: Q Apply automatically loads table when a subscription is started:
• Load from cursor – DSNUTILS on z/OS and LUW
• Export/Import - Export/Load – LUW only (not suitable when LOB or
XML columns are involved in a remote configuration)
– External: User uses the LOAD utility of her choice to load the target table
• User tells replication before invoking LOAD and after LOAD is done
(Issue CAPSTART, load the table, issue LOADDONE)
– NO LOAD: Changes are immediately applied on the target
• Useful when tables are created from a backup image
22
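Following up on the external-load option above (issue CAPSTART, load the table, issue LOADDONE), these signals are conventionally sent by inserting rows into the Q Capture SIGNAL control table at the source database. A hedged sketch, assuming the default ASN schema, a hypothetical subscription named MYSUB, and the documented signal columns (verify the exact SIGNAL_SUBTYPE values for your release):

-- Tell Q Capture to activate the subscription before the external LOAD
INSERT INTO ASN.IBMQREP_SIGNAL
       (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
VALUES (CURRENT TIMESTAMP, 'CMD', 'CAPSTART', 'MYSUB', 'P');

-- ... run the LOAD utility of your choice against the target table ...

-- Tell Q Capture that the external load has completed
INSERT INTO ASN.IBMQREP_SIGNAL
       (SIGNAL_TIME, SIGNAL_TYPE, SIGNAL_SUBTYPE, SIGNAL_INPUT_IN, SIGNAL_STATE)
VALUES (CURRENT TIMESTAMP, 'CMD', 'LOADDONE', 'MYSUB', 'P');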
How Q Apply Internal LOAD works
• Q Apply connects back to the source database to fetch the data
• Changes occurring at the source are captured and spilled at the target into an automatically
created MQ queue while the table is being loaded
• Q Apply deals with Referential Integrity and Check Constraints during the load process
• Referential Integrity constraints are dropped during LOAD, and restored when all related tables
have finished loading
• Q Apply invokes CHECKDATA once all tables in a group are loaded
• Subscription state set to ACTIVE (A) once the load is done
• Several tables can be loaded in parallel (default 15); Q Apply throttles LOAD utility
scheduling (OK to start all subscriptions at once)
• Q Capture can automatically trigger a LOAD at the target when data is LOADED at the
source using a LOAD utility
• Best Practices: To reduce spilling of changes during load of very large tables that are
heavily updated, start subscriptions a few at a time
23
Tooling and Administration
• Graphical User Interfaces:
– DB2 Replication Center
• Configuration, Operations
• Generate SQL scripts
– Q Replication Dashboard
• Runs from Web Browser
• Multi-sites monitoring, troubleshooting,
tuning, and reporting
• Scripting for Configuration
– asnclp: interactive mode or script
• Commands for Operations
– Commands for controlling Replication are
issued to the Capture and Apply started
tasks:
• with the asnqccmd and asnqacmd
programs,
• with a MODIFY command on z/OS
• Utilities
– Check table integrity: asntdiff
– Display queued messages: asnqmfmt
– Detect alerts and send email: asnmon
24
C:\asnclp
REPL > CREATE QSUB USING REPLQMAP ...
REPL > CREATE SUBSCRIPTION SET SETNAME ...
REPL > CREATE MEMBER IN SETNAME ...
> asnqacmd apply_server=qtest status show details
> F jobname,status show details
asnclp scripting (some recent V10 enhancements)
• Subscriptions can be created for many tables with just two statements
– 'LIKE' allows wildcards and can be used for the owner (schema) and table name
– Sample script (a single script sets up both sites for bidirectional):
ASNCLP SESSION SET TO Q REPLICATION;
SET RUN SCRIPT NOW STOP ON SQL ERROR OFF;
SET BIDI NODE 1 SERVER DBALIAS SOURCEDB;
SET BIDI NODE 2 SERVER DBALIAS TARGETDB;
SET TABLES (NODE 1 SRC OWNER LIKE "AIRUKU%");
CREATE QSUB SUBTYPE B;    <--- Creates all your subscriptions
• Create all MQ objects with a single command
– Queue manager, queues, channels. Run on each system where MQ is needed:
CREATE MQ SCRIPT RUN NOW
25
A few Q Replication V10 enhancements
that are important for Active/Active
• Improved Schema Evolution:
– Automatically replicate ALTER COLUMN DATA TYPE
• Available in both SQL and Q Replication
– Automatically replicate ALTER TABLE ADD COLUMN
• In the past, required issuing a command to add the column to the
subscription
• STOP/START individual Q Capture send queues
– Useful when a target is down, Capture continues sending changes
on all other queues, resumes where it left off for this queue on
STARTQ
– In the past, stopping a queue was only possible at the Q Apply side
• Controlled stop of Q Capture for planned maintenance
asnqccmd SRCDB stopq=Q1 captureupto='2010-06-05 13.00.00' after=data_applied
ASN7304I "Q Capture" : "ASN" : "WorkerThread" All
26
Monitoring Performance and Operations
• Several dozen data elements are logged by both the Capture and Apply programs into DB2 tables at their respective monitor_interval; they report performance, status, operational, and troubleshooting information
– Data remains available for days or more (until pruned as per monitor_limit). A few examples:
Table IBMQREP_CAPQMON (at the SOURCE database; one row per send queue):
• ROWS_PUBLISHED: Total number of captured rows put to MQ by Q Capture
• TRANS_PUBLISHED: Total number of captured transactions put to MQ by Q Capture
Table IBMQREP_APPLYMON (at the TARGET database; one row per receive queue):
• OLDEST_TRANS: All transactions prior to this timestamp have been applied
• ROWS_APPLIED: Total number of rows applied to the target database
• QDEPTH: Number of messages currently in the MQ receive queue
• END2END_LATENCY: Elapsed time in milliseconds between DB2 commit time at the source and DB2 commit time at the target. The source commit time is from the DB2 commit log record; the target commit time is obtained by getting a timestamp after committing the transaction in DB2
Broken down by component:
• CAPTURE_LATENCY: Time spent in Capture
• QLATENCY: Time spent in MQ (over the network)
• APPLY_LATENCY: Time spent in Apply
• Latency counters, such as END2END_LATENCY report an average for the monitor interval:
– If there are no transactions applied in the monitor interval, latency is 0; Intervals are reported in milliseconds.
• Counters are reset at each monitor interval
27
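Because these counters are kept in ordinary DB2 control tables, latency and backlog can be watched with plain SQL as well as from the Dashboard. A minimal sketch against the Apply monitor table, assuming the default ASN control-table schema and the column names listed above (verify the exact layout for your release; the per-component latency columns can be added if present):

-- Recent end-to-end latency and backlog, newest intervals first
SELECT MONITOR_TIME,
       RECVQ,
       END2END_LATENCY,   -- ms from source commit to target commit (avg per interval)
       QLATENCY,          -- time spent in MQ / on the network
       APPLY_LATENCY,     -- time spent in Q Apply
       ROWS_APPLIED,
       QDEPTH,            -- messages waiting in the receive queue
       OLDEST_TRANS       -- everything committed before this timestamp is applied
  FROM ASN.IBMQREP_APPLYMON
 ORDER BY MONITOR_TIME DESC
 FETCH FIRST 20 ROWS ONLY;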
Q Replication Dashboard
• A Global Live View
– Health and Status of: programs, queues, subscriptions, topology
– Both DB2 z/OS, LUW, and Oracle
• With aggregation
– Rolled-up status of queues, subscriptions, latency, and exceptions
• Drill down
– Details behind the status
• Access to Historical Data
– Reports for any time period
• Support for Alerts
– View exceeded thresholds
• Operations
– Start/Stop/Reinit Subscription
– Start/Stop Send Queue
– Synchronize external LOAD
• And User Roles for Access Control
28
Dashboard Performance Advisor
29
The Q Replication Dashboard makes Performance Recommendations based on Historical Data and explains them.
30
Documentation and Learning Aids
• Extensive Documentation Available on-line
– Integrated with the DB2 Information Centers (z and LUW)
– Comprehensive set of Redbooks, Tutorials, Whitepapers, Samples, and Classroom Education
• Visit the Q Replication Roadmap
– From your browser search "Q Replication Roadmap":
http://www.ibm.com/developerworks/data/roadmaps/qrepl-roadmap.html
• Visit IBM developerWorks for samples and tutorials
– Replication group: www.ibm.com/developerworks/software/exchange/repl
– Download samples from IBM and fellow users, or share your own code
– Blogs, message boards, useful links and answers … read or contribute.
– Subscribe to updates; tag, comment on, and recommend
• Visit ChannelDB2 for videos, presentations, and expert advice
– http://www.channeldb2.com/video/video/listForContributor?screenName=SideshowDave
31
Deploying an Active/Active Configuration
32
A Case Study: IBM Jam (DB2 Distributed)
https://www.collaborationjam.com/
Challenge
• The Minijam project, IBM Jam, serves as a
catalyst for collaborative innovation™ and culture
change within client organizations, opening new
business relationships outside of the traditional
CIO/CTO office.
• IBM Jam required systems that would help in
speedy recovery and alleviate the Jam
registration process.
• The solution also required conflict detection
between sites when one was brought up after
regular maintenance
• IBM Jam runs IBM’s DB2 data server on IBM’s
AIX operating system.
Solution
• IBM InfoSphere Replication Server was
selected for its low-latency and its ability to
provide three-site data synchronization with
conflict detection and resolution.
33
Business Benefits
• Conflict detection and resolution, along with low latency and high throughput, helped resolve Jam registration issues.
• Minimizing latency has further improved
user experience in getting all their
comments on a particular topic in a timely
manner.
• Find out more about IBM Jam by going to
its website at
http://www.ibm.com/ibm/jam/
The Events/IBM.com Infrastructure
• Highly-Redundant Infrastructure for Web Hosting
– Achieves 99.99999% Availability
– 3 DB2 AIX databases in an active/active configuration,
synchronized with Replication Server
• Reference:
– Resilient hosting in a continuously available virtualized
environment R. R. Scadden, R. J. Bogdany, J. W. Clifford,
H. D. Pearthree, and R. A. Locke
IBM Systems Journal Volume 47, Number 4, 2008
http://www.research.ibm.com/journal/sj47-4.html
34
A Case Study: Banco do Brasil (DB2 z/OS)
Challenge
• Excessive load on the production system due to the combination of an increasing OLTP workload and query reporting applications
• Two SYSPLEXes with unbalanced workloads
• Need for a workload balance solution with low impact on existing applications
• Critical and non-critical applications sharing state-of-the-art infrastructure resources
Solution
• InfoSphere Replication Server's Q Replication
• Use the lightly loaded secondary system to offload query reporting applications and to separate critical and non-critical applications into different DB2 data sharing groups.
• Replicate thousands of tables to the secondary system using uni- and bidirectional replication to integrate the application data.
Business Benefits
• Improved application and data availability
• Reduction of unplanned application outages
• Improved SLAs with Lines of Business
• Focus infrastructure investments on critical business applications
Cost Benefits
• Better sysplex utilization
• No need for application redevelopment; minimal changes required
35
Banco do Brasil (DB2 z/OS)
Configuration Details
• Scenario: Workload balance
• DB2 z/OS to DB2 z/OS V9 CM
• Configurations: bidi, uni-directional
• Number of queues: 80
• 4 different DB2 data sharing groups involved
• 6 Q capture
• 7 Q apply
• Number of table subscriptions: Over 2000
• Replicate over 500 million rows/day at peak periods
36
Active/Active with Q Replication
Using Transaction Replay technology:
• Sites can be asymmetric, with different capacities, and at different HW/OS/SW levels
• A workload can be immediately redirected for a disaster or planned maintenance
• Both sites are active
– Replication is bi-directional, ready for fallback after a failover, from any direction
– Even if either site is used for read-only or standby for all or a subset of the workloads
– Routing decision determines the configuration e.g., West-Coast customers to site A and East-Coast customers
to site B, until failover
[Diagram: Site A serves the West-Coast OLTP workload plus a Query Reporting workload, Site B serves the East-Coast OLTP workload; each site runs DB2 with its own Q Capture and Q Apply agents, control tables, and WebSphere MQ queue manager (or client on LUW); send, receive, admin, remote-admin, and restart queues connect the sites over a network of 1000s of kilometers with subsecond latency, supporting bi-directional replication plus configuration and monitoring.]
37
Database Constructs that sometimes
need attention for Active/Active
• Database SEQUENCES and IDENTITY columns
– Ideally, segregate value ranges between sites, e.g., use an ODD/EVEN scheme (see the SQL sketch after this list)
– Otherwise, e.g., if increment cannot be altered
• Active/Standby: ALTER sequence/identity start from MAX+1 at failover
• Active/Active: ALTER sequence/identity to start with a different range (e.g., 0->maxint/2; and maxint/2->maxint)
• Unique constraint on identity columns allows Q Replication to detect overlaps (follows conflict_action)
• Tables with no unique index
– Create one if possible at the target to allow automatic load
• Triggers
– Good practice to drop them during fallback for faster re-synchronization
– Exclude triggered statements from replication, if table is replicated
• Immediate/deferred MQTs
– Refresh them at failover
• GENERATED ALWAYS columns
– Must be altered at the target to GENERATED WITH DEFAULT
38
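For the ODD/EVEN scheme mentioned under SEQUENCES and IDENTITY columns above, the idea is to give each site a disjoint value stream so generated keys never collide. A minimal SQL sketch with a hypothetical sequence APP.ORDER_SEQ and identity column APP.ORDERS.ORDER_ID (standard DB2 ALTER clauses, but verify the exact syntax for your DB2 version and platform):

-- Site A generates odd values:
ALTER SEQUENCE APP.ORDER_SEQ RESTART WITH 1 INCREMENT BY 2;
-- Site B generates even values:
ALTER SEQUENCE APP.ORDER_SEQ RESTART WITH 2 INCREMENT BY 2;

-- Same idea for an identity column (run the even variant at the other site):
ALTER TABLE APP.ORDERS
  ALTER COLUMN ORDER_ID RESTART WITH 1 SET INCREMENT BY 2;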
Conflict Detection and Resolution with Q Replication
• Q Rep will detect, resolve, and report conflicts
• Usually not an operational strategy; rather, a way to keep things going until what caused the conflict is understood and the conflict resolution is verified
– Special case: Conflicts after a failover are expected during fallback, for stranded changes that are finally delivered
• Handling conflicts at the row level minimizes the impact of conflicts by "merging" transactions for non-conflicting changes. One conflict in a transaction doesn't invalidate the whole transaction, which would only result in more conflicts later on
• 'bi-directional' configuration - force or discard the conflicting row
• 'peer-to-peer' - apply the row with the most recent timestamp
39
Conflict Detection and Resolution with Q Replication
1. 'Bidirectional' mode – detection by value – resolution Force/Ignore or Ignore/Ignore
– 2 or up to tens of sites connected through a central site: a hub-and-spokes model (or tree topology), e.g., A: Master; B, C: Master or Slave
– Conflict detection: Compare current values at the target with the before values carried over from the source. Check the key, changed columns only, or all columns
– Possibility of merging updates to different columns of the same row made from different sites
– Conflict action: Force or Ignore (merging updates is not a conflict) - The designated site wins in case of conflict; master/master is OK
– Recapture avoidance: Apply inserts a row into the SIGNAL table with each transaction it applies. If there are only 2 sites, Capture IGNORE TRANSACTION can be used for less overhead.
2. 'Peer-to-peer' mode – detection by version and origin – resolution Time-Based
– Fully connected topology - each node sends data to each other node (practical limit 3 to 5 sites), e.g., Peer A, Peer B, Peer C
– Conflict detection: Requires 2 extra columns and triggers on replicated tables to set the timestamp and node number, and to mark the deletes made by Q Apply. Handles out-of-order arrivals via alternate routes (e.g., a delete at A arrives at C before the insert at B arrives at C, for the same row)
– Conflict action: The most recent change overwrites the previous change
– Recapture avoidance: A trigger marks the row with the origin of the change. Only local changes are captured.
Resolving conflicts at the row level minimizes the impact of conflicts by "merging" transactions for non-conflicting changes. One conflict in a transaction doesn't invalidate the whole transaction, which could result in more conflicts later on. Resolved conflicts are reported in the EXCEPTIONS table. Each conflict is reported only once.
40
How Replication Conflicts are Reported
• Q Replication Dashboard: hyperlinks to the Information Center
• Q Replication Table IBMQREP_EXCEPTIONS: the TEXT column records the conflicting SQL statement, e.g.,
UPDATE "QALLTYPE"."ALLTYPE2U_trg" SET "VCHAR1" = "UPD1" WHERE "i1" = 2
41
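Since resolved conflicts are recorded in a control table, they can also be reviewed with SQL rather than only from the Dashboard. A minimal sketch, assuming the default ASN schema and commonly documented IBMQREP_EXCEPTIONS columns (verify the exact column list for your release):

-- Most recent conflicts / SQL errors recorded by Q Apply
SELECT EXCEPTION_TIME,
       RECVQ,
       SUBNAME,      -- which subscription hit the conflict
       REASON,       -- conflict or error reason code
       SQLCODE,
       IS_APPLIED,   -- whether the row was forced or discarded
       TEXT          -- the SQL statement that conflicted
  FROM ASN.IBMQREP_EXCEPTIONS
 ORDER BY EXCEPTION_TIME DESC
 FETCH FIRST 50 ROWS ONLY;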
Failover for Planned Maintenance
1. Stop Applications at Site A; wait for replication to replicate all changes to Site B (usually a few seconds). Use 'stop after=data_applied'
2. Reroute Applications to Site B
3. Perform maintenance at Site A
– Q Capture continues to capture changes at Site B during maintenance of A and stages them in the xmit or receive queue, ready for a faster restart after the maintenance is completed; it also allows DB2 logs to be archived at Site B
No downtime for maintenance
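One way to confirm that step 1 has drained before rerouting in step 2 is to check, at Site B, that the receive queue is empty and that everything committed before the application quiesce time has been applied. A hedged sketch reusing the monitor columns introduced earlier (default ASN schema assumed; verify column names for your release):

-- At Site B: QDEPTH = 0 and OLDEST_TRANS later than the time applications
-- were stopped at Site A means all Site A changes have been replayed.
SELECT RECVQ, QDEPTH, OLDEST_TRANS, END2END_LATENCY
  FROM ASN.IBMQREP_APPLYMON
 WHERE MONITOR_TIME = (SELECT MAX(MONITOR_TIME)
                         FROM ASN.IBMQREP_APPLYMON);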
[Diagram: while DB2 is upgraded at Site A (its Q Capture and Q Apply are temporarily stopped), both the East-Coast and West-Coast workloads run against Site B; Site B's Q Capture keeps capturing changes and staging them in the WebSphere MQ send queues, ready for delivery to Site A once maintenance completes.]
42
Failover for a Disaster
• For a disaster, applications can be immediately rerouted to Site B
– On LUW, DB2 Client reroute can be used
• Some changes committed at the source DB2 (and acknowledged to the application) are not yet
committed at the target DB2
– 1: Transactions committed to DB2, but not yet captured from the log
– 2: Transactions captured and published, but not yet transmitted by MQSeries
After a disaster at Site A, some transactions are stranded: (1) transactions committed to DB2 but not yet captured from the DB2 log, and (2) transactions captured and written to MQSeries but not yet transmitted. They will be delivered if and when Site A can be restarted. Applications are unaware of the failure.
[Diagram: the East-Coast workload is rerouted to Site B alongside the West-Coast workload; the stranded transactions remain at Site A in its recovery log and send queues until the site is restarted.]
43
Re-synchronizing DB2 after a Failover
No application interruption is required when bringing back a site after an outage, planned or unplanned.
Q Capture can continue capturing changes at the active site and stage them in MQ for a faster restart.
The choice of method depends on the duration of the outage (hours, days, or weeks):
1. Restart Replication
– The usual method; can be used perhaps 95% of the time
– Load balancing to both servers can be resumed as soon as replication catches up
2. Reload tables
– Reload tables at the failed site by exporting the data from the online site
• Can be done automatically by Q Rep, using the fastest available method: LOAD from CURSOR or UNLOAD/RELOAD. Many loads run in parallel (up to 16 tables).
• Simply reinitialize the subscriptions with one command.
– Also used as a way to overwrite changes stranded at the inactive site --> no conflicts
3. Restore Database
– Take an online backup of the active site, and restore the image on the site to be restored
– A technique that can also be used for deploying a new site the first time
– More steps than restart or reload: Replication subscriptions need to be recreated on the restored system (the backup image overwrites the replication control tables)
After an unplanned outage:
• Some transactions might be left stranded on the site that was down (the inactive site)
• Some (but not all) of the stranded transactions might have been re-executed on another site
• Q Replication provides conflict detection/resolution/reporting to keep replication going, minimizing down-time
44
Advantages of Q Replication for DB2
Disaster Recovery
• Read and write possible at each server. Utilize all Hardware
• Unlimited distance
• Asymmetric Sites: can use different hardware, OS, DBMS, codepage and data formats, and
software versions
• No impact on application response time (asynchronous Capture)
• Low-latency (sub-second) and high-throughput (100000 rows/sec) achievable for 1000s of tables
• RTO of sub-second. DB2 workload can be rerouted immediately when a failure is detected
– No shared resources to release; No file system to mount
– Data is transactionally consistent at the secondary site
• Online Upgrade and Maintenance
– Reroute application to another site while performing maintenance or upgrades
– Can add new workloads to replicate without any outage, without rerouting
– Can replicate ALTER TABLE ADD COLUMN and ALTER TYPE (e.g., for upgrades)
• Same Replication infrastructure can also be used for Live Reporting, Feeding the Warehouse,
and Information Integration
45
Do not Miss!
Monday, 10:15-11:15a
– 2027A: Zero Downtime Upgrades Using QREP at Constant Contact, Inc
Wednesday, 3:15-4:15p
– 3728A: Making Data More Highly Available at American Express
Thursday, 8:15-11:15a
– 1568A: Migrating Data between DB2 and Oracle Using IBM InfoSphere
Replication Server Bidirectional Replication
And during InfoSphere Demo Room hours
• Pedestal #27 - Q Replication in InfoSphere HA Solutions without Hardware
Limitations.
46
A Special Invitation for System z Attendees
47
Information and Analytics Communities
• On-line communities, User Groups, Technical Forums, Blogs,
Social networks, and more
– Find a community that interests you at…
• ibm.com/software/data/community
• Information Champions
– Recognizing individuals who have made the most outstanding
contributions to Information Management communities
• ibm.com/software/data/champion
48
Thank You!
Your Feedback is Important to Us
• Access your personal session survey list and complete via SmartSite
– Your smart phone or web browser at: iodsmartsite.com
– Any SmartSite kiosk onsite
– Each completed session survey increases your chance to win
an Apple iPod Touch with daily drawing sponsored by Alliance
Tech
49
Disclaimer
© Copyright IBM Corporation 2010. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES
ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE
INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT
PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM
SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE
RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS
PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR
REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND
CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR
SOFTWARE.
IBM, the IBM logo, ibm.com, WebSphere, DB2, and IBM InfoSphere are trademarks or registered trademarks of
International Business Machines Corporation in the United States, other countries, or both. If these and other IBM
trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols
indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
50