InfoSphere CDC Flat file for DataStage Configuration and Best Practices
by user
Comments
Transcript
InfoSphere CDC Flat file for DataStage Configuration and Best Practices
InfoSphere CDC Flat file for DataStage Configuration and Best Practices © 2010 IBM Corporation Information Management Software Understanding the Flat File Workflow Landing Location 2 Information Management Software 1. Source Database Landing Location • Configure CDC on the source database where the CDC service for the database reads the transaction log to capture changes 3 Information Management Software 2. Defining the Replication Definition Landing Location • CDC for DataStage transfers the change data according to the replication definition • To configure: • Define the table structure that will be sent to DataStage • Define the DataStage connection method for Flat Files • Define single or multiple format to determine how DataStage will be processing the incoming records 4 Information Management Software Map Table for Flat File Output (1) • Map table as usual, select WebSphere DataStage as the target • Select Flat File for method • Specify the directory to which the flat files will be written and picked up by the DataStage job (directory resides on the DS server) • Initial status of table will be Active (picking up changes from the moment it was mapped) 5 Information Management Software Map Table for Flat File Output (2) 6 Information Management Software Defining the DataStage Record Format (1) • Standard columns containing information about the change: • DM_TIMESTAMP - The timestamp obtained from the log of when the operation occurred (contains the value from the &TIMSTAMP journal control field) • DM_TXID - Transaction identifier (contains the value from the &CCID journal control field) • DM_OPERATION_TYPE contains a single character indicating the type of operation: • "I" for an insert. • "D" for a delete. • For Single Record Format there is one type that represents the update image • "U" represents an update. • For Multiple Record Format there are two separate types that represent before and after image • "B" for the row containing the before image of an update. • "A" for the row containing the after image of an update. • DM_USER - The user that performed the operation (contains the value from the &USER journal control field) 7 Information Management Software Defining the DataStage Record Format (2) • Single record • In this format an update operation is sent as a single row • The before and after image is contained in the same record • E.g. Updating 3 records "2010-11-23 21:43:24","0","U","EPANG","1","elaine "2010-11-23 21:43:24","0","U","EPANG","2","elaine "2010-11-23 21:43:24","0","U","EPANG","3","abc ","1","update ","2","update ","3","update “ “ " • Multiple record format • An update operation is sent as two rows, the first row being the before image and the second row containing the after image. • E.g. Updating 3 records "2010-11-23 "2010-11-23 "2010-11-23 "2010-11-23 "2010-11-23 "2010-11-23 8 21:46:15","0","B","EPANG","1","update 21:46:15","0","A","EPANG","1","hello 21:46:15","0","B","EPANG","2","update 21:46:15","0","A","EPANG","2","hello 21:46:15","0","B","EPANG","3","update 21:46:15","0","A","EPANG","3","hello “ “ “ “ “ " Information Management Software Naming Convention of Flat Files • CDC uses the following convention to name the flat files that are produced during replication. • [Table].x[Date].[Time][# Records] • • • • x = D for completed flat files, @ for currently open flat file [Date] = Julian date (year, day number within year) [Time] = hh24mmss when flat file was created (in GMT) [# Records] = Optionally the number of records can be added • [Table].STOPPED • When subscription is stopped, this file is generated The timestamp format can be configured using the system parameter ds_output_timestamp_format. E.g. ds_output_timestamp_format=“yyyyMM-dd HH:mm:ss.SSS” (to include milliseconds) 9 Information Management Software 3. Flat Files Become Available for DataStage Landing Location • CDC for DataStage server hardens the files and deposits them in the flat file location. • While actively mirroring to a file it is not accessible to DataStage. The process of hardening involves renaming the file, replacing the ‘@’ with a ‘D’ thus making it available to Datastage. • To configure: • Define the Batch Size Threshold settings to determine how often CDC hardens the flat files that are made available to DataStage 10 Information Management Software Set Subscription DataStage Properties • Right-click on subscription to set properties • The file will be hardened always at the end of a transaction boundary and when either of the following thresholds are passed: • Timing in seconds of flat file closure • Maximum number of rows per flat file • Flat file is closed and next one is created/opened when either value is reached • Closed flat files can be picked up by DataStage for processing as they will contain only completed transactions 11 Information Management Software 4. Flat Files Read by DataStage Job Landing Location • InfoSphere DataStage sequential file reader retrieves the flat files as part of an InfoSphere DataStage job and transforms them • The job has three parameters defined in the Management Console where the *.dsx file is created: • SPFolderPath – the full path name for the folder that DataStage searches for the source flat files created by CDC • SPFileNamePattern – the file name pattern used to identify the source flat files • SPEndFileNamePattern – the file name pattern DataStage creates when subscriptions stop mirroring. 12 Information Management Software 5. Flat Files are Deposited to New Location Landing Location • InfoSphere DataStage sequential file reader deposits the transformed flat files in the new flat file location • To configure: • DataStage definition file (*.dsx ) from Management Console or in DataStage Designer • Import definition file into DataStage and customize any additional steps/stages where necessary 13 Information Management Software Connecting CDC for DataStage with DataStage • Datastage uses job definitions to describe the sequence of steps, or stages required to transform data • DataStage jobs are normally designed and edited in InfoSphere DataStage Designer • When using CDC for DataStage you have the option of generating a job definition within CDC without creating it in DataStage Designer 14 Information Management Software Generating an InfoSphere DataStage Definition File • DataStage definition import file (.dsx) can be generated automatically • Right-click on subscription and select Generate InfoSphere DataStage Job Definition • Place .dsx file at a location where it can be selected from DataStage (or copy it to the DS server) 15 Information Management Software Import .dsx file into DataStage (1) • DataStage flat file processing job will be generated automatically • DS job is already tailored to picking up the flat files from the specified directory 16 Information Management Software Import .dsx file into DataStage (2) 17 Information Management Software Best Practices for Flat Files 18 Information Management Software Flat Files are Best Suited for… • Best suited for under a few hundred tables • Extra memory will need to be allocated with larger numbers of tables • Very high data volume which requires parallel loading • Replacement for existing ETL delta extracts • Data warehouses which benefit from bulk load of changed data • Installation on 64 bit systems 19 Information Management Software Considerations and Limitations • The Flat File integration option is not suitable when character columns contain binary data. The UTF-8 files may contain code points that resolve to special characters, such as quotes, line feed or carriage returns, that cannot be processed • Tables are individually replicated, which can break transactional table dependencies • Additional processing is required in DataStage to maintain referential integrity between dependent tables • Disk staging space • Managing many files 20 Information Management Software Initial Synchronization • DataStage extracts data from source database using standard ETL functions • An alternative is to use CDC to perform initial Refresh and then transition to mirroring mode. This method involves first creating flat files for the refresh then loading using DataStage. 21 Information Management Software Recommended Flat file Storage Option • Direct attached disk storage is a typical option used for the storage of CDC flat files. • Shared Storage Area Network (SAN) is another recommended option to stage files. • This allows running CDC DS on a server separated from the DataStage grid, ensuring CDC has dedicated CPU/Disk capacity. • The DataStage grid nodes can then read the files on the shared SAN, allowing for high performance and recoverability. • Network File System (NFS) is not recommended for high volume environments. • CDC is not resilient to file system errors that may occur, and may suffer from network latency for writing many small changes to the flat files. 22 Information Management Software Clean-up of Flat files Generated by CDC • By default, the .dsx file generated by CDC will define that flat files are removed once CDC has deposited the files into the DataStage job. • If additional sequencing of the files is required (i.e. multiple tables containing foreign key relationships) this logic requires customization. • A DataStage expert can modify the .dsx file generated by CDC to remove the cleanup logic and make adjustments as appropriate. 23 Information Management Software Distinguishing Transaction/Record Ordering • The timestamp field provides second to microsecond accuracy. It cannot be used alone to uniquely order records if multiple records are changed at the same time • You can use the system parameter ds_output_timestamp_format to format timestamp in milliseconds in the flat files. Note: some databases like Oracle can not produce millisecond accuracy. Changing this parameter can not improve upon the accuracy that the database supports • For sequencing within a single table: • Use a combination of the timestamp, flat file number and line number to uniquely identify changes in commit order • If you need to sequence across all tables in a subscription, you will additionally be required to use a derived column on the source to generate a sequence number 24 Information Management Software Recovery • CDC maintains the source database log position in a ‘bookmark’ which is used for restarting replication and/or recovery from failure • Flat files – CDC writes the bookmark to internal CDC metadata when hardening a flat file which has finished writing • If the network is lost or a system failure occurs the flat file option provides recoverability and resiliency; CDC will start from the last flat file that was not yet hardened • Both options operate independently from DataStage which periodically picks up the changes and processes the data • CDC only manages recovery up to the CDC staging mechanism 25