...

InfoSphere CDC Flat file for DataStage Configuration and Best Practices

by user

on
Category: Documents
56

views

Report

Comments

Transcript

InfoSphere CDC Flat file for DataStage Configuration and Best Practices
InfoSphere CDC Flat file for DataStage
Configuration and Best Practices
© 2010 IBM Corporation
Information Management Software
Understanding the Flat File Workflow
Landing Location
2
Information Management Software
1. Source Database
Landing Location
• Configure CDC on the source database where the CDC service
for the database reads the transaction log to capture changes
3
Information Management Software
2. Defining the Replication Definition
Landing Location
• CDC for DataStage transfers the change data according to the
replication definition
• To configure:
• Define the table structure that will be sent to DataStage
• Define the DataStage connection method for Flat Files
• Define single or multiple format to determine how DataStage will
be processing the incoming records
4
Information Management Software
Map Table for Flat File Output (1)
• Map table as usual, select WebSphere DataStage as the target
• Select Flat File for method
• Specify the directory to which the flat files will be written and picked
up by the DataStage job (directory resides on the DS server)
• Initial status of table will be Active (picking up changes from the
moment it was mapped)
5
Information Management Software
Map Table for Flat File Output (2)
6
Information Management Software
Defining the DataStage Record Format (1)
• Standard columns containing information about the change:
• DM_TIMESTAMP - The timestamp obtained from the log of when the operation
occurred (contains the value from the &TIMSTAMP journal control field)
• DM_TXID - Transaction identifier (contains the value from the &CCID journal
control field)
• DM_OPERATION_TYPE contains a single character indicating the type of
operation:
• "I" for an insert.
• "D" for a delete.
• For Single Record Format there is one type that represents the update image
• "U" represents an update.
• For Multiple Record Format there are two separate types that represent before and
after image
• "B" for the row containing the before image of an update.
• "A" for the row containing the after image of an update.
• DM_USER - The user that performed the operation (contains the value from
the &USER journal control field)
7
Information Management Software
Defining the DataStage Record Format (2)
• Single record
• In this format an update operation is sent as a single row
• The before and after image is contained in the same record
• E.g. Updating 3 records
"2010-11-23 21:43:24","0","U","EPANG","1","elaine
"2010-11-23 21:43:24","0","U","EPANG","2","elaine
"2010-11-23 21:43:24","0","U","EPANG","3","abc
","1","update
","2","update
","3","update
“
“
"
• Multiple record format
• An update operation is sent as two rows, the first row being the before image
and the second row containing the after image.
• E.g. Updating 3 records
"2010-11-23
"2010-11-23
"2010-11-23
"2010-11-23
"2010-11-23
"2010-11-23
8
21:46:15","0","B","EPANG","1","update
21:46:15","0","A","EPANG","1","hello
21:46:15","0","B","EPANG","2","update
21:46:15","0","A","EPANG","2","hello
21:46:15","0","B","EPANG","3","update
21:46:15","0","A","EPANG","3","hello
“
“
“
“
“
"
Information Management Software
Naming Convention of Flat Files
• CDC uses the following convention to name the flat files that are
produced during replication.
• [Table].x[Date].[Time][# Records]
•
•
•
•
x = D for completed flat files, @ for currently open flat file
[Date] = Julian date (year, day number within year)
[Time] = hh24mmss when flat file was created (in GMT)
[# Records] = Optionally the number of records can be added
• [Table].STOPPED
• When subscription is stopped, this file is generated
The timestamp format can be configured
using the system parameter
ds_output_timestamp_format. E.g.
ds_output_timestamp_format=“yyyyMM-dd HH:mm:ss.SSS” (to include
milliseconds)
9
Information Management Software
3. Flat Files Become Available for DataStage
Landing Location
• CDC for DataStage server hardens the files and deposits them
in the flat file location.
• While actively mirroring to a file it is not accessible to
DataStage. The process of hardening involves renaming the
file, replacing the ‘@’ with a ‘D’ thus making it available to
Datastage.
• To configure:
• Define the Batch Size Threshold settings to determine how often
CDC hardens the flat files that are made available to DataStage
10
Information Management Software
Set Subscription DataStage Properties
• Right-click on subscription to set properties
• The file will be hardened always at the end of a transaction boundary and when
either of the following thresholds are passed:
• Timing in seconds of flat file closure
• Maximum number of rows per flat file
• Flat file is closed and next one is created/opened when either value is reached
• Closed flat files can be picked up by DataStage for processing as they will contain only
completed transactions
11
Information Management Software
4. Flat Files Read by DataStage Job
Landing Location
• InfoSphere DataStage sequential file reader retrieves the flat files as
part of an InfoSphere DataStage job and transforms them
• The job has three parameters defined in the Management Console
where the *.dsx file is created:
• SPFolderPath – the full path name for the folder that DataStage
searches for the source flat files created by CDC
• SPFileNamePattern – the file name pattern used to identify the source
flat files
• SPEndFileNamePattern – the file name pattern DataStage creates
when subscriptions stop mirroring.
12
Information Management Software
5. Flat Files are Deposited to New Location
Landing Location
• InfoSphere DataStage sequential file reader deposits the
transformed flat files in the new flat file location
• To configure:
• DataStage definition file (*.dsx ) from Management Console or in
DataStage Designer
• Import definition file into DataStage and customize any additional
steps/stages where necessary
13
Information Management Software
Connecting CDC for DataStage with DataStage
• Datastage uses job definitions to describe the sequence of
steps, or stages required to transform data
• DataStage jobs are normally designed and edited in
InfoSphere DataStage Designer
• When using CDC for DataStage you have the option of
generating a job definition within CDC without creating it in
DataStage Designer
14
Information Management Software
Generating an InfoSphere DataStage Definition File
• DataStage definition import file (.dsx) can be generated
automatically
• Right-click on subscription and select Generate InfoSphere
DataStage Job Definition
• Place .dsx file at a location where it can be selected from DataStage
(or copy it to the DS server)
15
Information Management Software
Import .dsx file into DataStage (1)
• DataStage flat file processing job will be generated
automatically
• DS job is already tailored to picking up the flat files from the
specified directory
16
Information Management Software
Import .dsx file into DataStage (2)
17
Information Management Software
Best Practices for Flat Files
18
Information Management Software
Flat Files are Best Suited for…
• Best suited for under a few hundred tables
• Extra memory will need to be allocated with larger numbers of tables
• Very high data volume which requires parallel loading
• Replacement for existing ETL delta extracts
• Data warehouses which benefit from bulk load of changed
data
• Installation on 64 bit systems
19
Information Management Software
Considerations and Limitations
• The Flat File integration option is not suitable when character
columns contain binary data. The UTF-8 files may contain code
points that resolve to special characters, such as quotes, line feed
or carriage returns, that cannot be processed
• Tables are individually replicated, which can break transactional
table dependencies
• Additional processing is required in DataStage to maintain referential
integrity between dependent tables
• Disk staging space
• Managing many files
20
Information Management Software
Initial Synchronization
• DataStage extracts data from source database using standard
ETL functions
• An alternative is to use CDC to perform initial Refresh and
then transition to mirroring mode. This method involves first
creating flat files for the refresh then loading using
DataStage.
21
Information Management Software
Recommended Flat file Storage Option
• Direct attached disk storage is a typical option used for the storage
of CDC flat files.
• Shared Storage Area Network (SAN) is another recommended
option to stage files.
• This allows running CDC DS on a server separated from the
DataStage grid, ensuring CDC has dedicated CPU/Disk capacity.
• The DataStage grid nodes can then read the files on the shared SAN,
allowing for high performance and recoverability.
• Network File System (NFS) is not recommended for high volume
environments.
• CDC is not resilient to file system errors that may occur, and may suffer
from network latency for writing many small changes to the flat files.
22
Information Management Software
Clean-up of Flat files Generated by CDC
• By default, the .dsx file generated by CDC will define that flat
files are removed once CDC has deposited the files into the
DataStage job.
• If additional sequencing of the files is required (i.e. multiple
tables containing foreign key relationships) this logic
requires customization.
• A DataStage expert can modify the .dsx file generated by CDC
to remove the cleanup logic and make adjustments as
appropriate.
23
Information Management Software
Distinguishing Transaction/Record Ordering
• The timestamp field provides second to microsecond accuracy. It
cannot be used alone to uniquely order records if multiple records
are changed at the same time
• You can use the system parameter ds_output_timestamp_format
to format timestamp in milliseconds in the flat files. Note: some
databases like Oracle can not produce millisecond accuracy.
Changing this parameter can not improve upon the accuracy that
the database supports
• For sequencing within a single table:
• Use a combination of the timestamp, flat file number and line number to
uniquely identify changes in commit order
• If you need to sequence across all tables in a subscription, you will
additionally be required to use a derived column on the source to
generate a sequence number
24
Information Management Software
Recovery
• CDC maintains the source database log position in a
‘bookmark’ which is used for restarting replication and/or
recovery from failure
• Flat files – CDC writes the bookmark to internal CDC metadata
when hardening a flat file which has finished writing
• If the network is lost or a system failure occurs the flat file option
provides recoverability and resiliency; CDC will start from the last flat file
that was not yet hardened
• Both options operate independently from DataStage which
periodically picks up the changes and processes the data
• CDC only manages recovery up to the CDC staging mechanism
25
Fly UP