IBM InfoSphere DataStage Best Practices

Performance Guidelines for IBM InfoSphere DataStage Jobs Containing Sort Operations on Intel® Xeon® Servers

Garrett Drysdale, Intel Corporation
Jantz Tran, Intel Corporation
Sriram Padmanabhan, IBM
Brian Caufield, IBM
Fan Ding, IBM
Ron Liu, IBM
Pin Lp Lv, IBM
Mi Wan Shum, IBM
Jackson Dong Jie Wei, IBM
Samuel Wong, IBM

Contents

Executive Summary
Introduction
Overview of IBM InfoSphere DataStage
Overview of Intel® Xeon® Series X7500 Processors
Sort Operation in IBM InfoSphere DataStage
    Testing Configurations
Summary for Sort Performance Optimizations
Recommendations for Optimizing Sort Performance
    Optimal RMU Tuning
    Final Merge Sort Phase Tuning using Linux Read Ahead
    Using a Buffer Operator to Minimize Latency for Sort Input
    Minimizing I/O for Sort Data Containing Variable Length Fields
    Future Study: Using Memory for RAM Based Scratch Disk
Best Practices
Conclusion
Further Reading
    Contributors
Notices
    Trademarks

Executive Summary

The objective of this document is to communicate best practices for tuning IBM InfoSphere DataStage jobs containing sort operations on Intel® Xeon® servers. Sort operations are I/O intensive and can place a significant load on the temporary, or scratch, file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth leads to under-utilization of the computing capability of the system, which is observed as low CPU utilization. This paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations.
These I/O reductions result in improved performance that can be quite significant on systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors.

Introduction

This whitepaper is the first in a planned series intended to provide IBM InfoSphere DataStage customers with helpful performance tuning guidelines for deployment on Intel® Xeon® processor-based platforms. IBM and Intel began collaborating in 2007 to optimize the performance and ROI of the combination of IBM InfoSphere DataStage and Intel® Xeon® based platforms. Our goal is not only to optimize performance, and therefore reduce the total cost of ownership, of this powerful combination in future versions of IBM InfoSphere DataStage on future Intel® processors, but also to pass along the tuning and configuration guidance we discover along the way.

In our work together, we are striving to understand the execution characteristics of DataStage jobs on Intel® platforms. This information is used to determine the hardware configurations, the operating system settings, and the job design and tuning techniques that optimize performance. Because of the highly scalable capabilities of IBM InfoSphere DataStage, our tests focus on the latest Intel® Xeon® X7560 EX processors, which support 4- and 8-socket systems. Initially, we are testing with four-socket configurations.

We presented information about IBM InfoSphere DataStage on Intel® platforms at the 2009 and 2010 IBM Information on Demand conferences. In 2009, our audience applauded the great scalability of IBM InfoSphere DataStage on Intel® platforms, but asked us to provide more information on the I/O requirements of jobs and how to get the most out of existing platform I/O capability. Since then, we have found ways to increase the overall performance of all jobs in the new Information Server 8.5 version of IBM InfoSphere DataStage, which is now a 64-bit binary on Intel® platforms, and we investigated the I/O requirements of sorting. The focus of this paper is the key information we obtained regarding configuring the platform, the operating system, and DataStage jobs that contain sort operators.

Sort is a crucial operation in data integration software. Sort operations are I/O intensive and can cause significant I/O load on the temporary or scratch file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations; a scratch storage system that cannot write or read data at a high enough bandwidth leads to under-utilization of the computing capability of the system, observed as low CPU utilization. This paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations. These I/O reductions result in improved performance that can be quite significant for systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors. We show such a scenario in this paper. Ideally, the best solution is to upgrade the scratch I/O storage subsystem to match the compute capability of the server.

Overview of IBM InfoSphere DataStage

IBM InfoSphere DataStage is a product for data integration via Extract-Transform-Load capabilities. It provides a designer tool that allows developers to visually create integration jobs.
The term "job" is used within IBM InfoSphere DataStage to describe an extract, transform, and load (ETL) task. Jobs are composed from a rich palette of operators called stages. These stages include:

• Source and target access for databases, applications and files
• General processing stages such as filter, sort, join, union, lookup and aggregations
• Built-in and custom transformations
• Copy, move, FTP and other data movement stages
• Real-time, XML, SOA and message queue processing

Additionally, IBM InfoSphere DataStage allows pre- and post-conditions to be applied to all these stages. Multiple jobs can be controlled and linked by a sequencer, which provides the control logic that can be used to process the appropriate data integration jobs. IBM InfoSphere DataStage also supports a rich administration capability for deploying, scheduling and monitoring jobs.

One of the great strengths of IBM InfoSphere DataStage is that very little consideration of the underlying structure of the system is required when designing jobs. If the system changes, is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design does not necessarily have to change. IBM InfoSphere DataStage learns about the shape and size of the system from the IBM InfoSphere DataStage configuration file, and it can organize the resources needed for a job according to what is defined in that file. When a system changes, the file is changed, not the jobs.

A configuration file defines one or more processing nodes on which the job will run. The processing nodes are logical rather than physical; the number of processing nodes does not necessarily correspond to the number of cores in the system. The following factors affect the optimal degree of parallelism:

• CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by a given system.
• Jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.
• Applications that are disk- or I/O-intensive, such as those that extract data from and load data into databases, benefit from configurations in which the number of logical nodes equals the number of I/O paths being accessed. For example, if a table is partitioned 16 ways inside a database or if a data set is spread across 16 disk drives, one should set up a node pool consisting of 16 processing nodes.

Another great strength of IBM InfoSphere DataStage is that it does not rely on the functions and processes of a database to perform transformations: while IBM InfoSphere DataStage can generate complex SQL and leverages databases, it is designed from the ground up as a multipath data integration engine equally at home with files, streams, databases, and internal caching in single-machine, cluster, and grid implementations. As a result, customers in many circumstances find they do not also need to invest in staging databases to support IBM InfoSphere DataStage.

Overview of Intel® Xeon® Series X7500 Processors

Servers using the Intel® Xeon® series 7500 processor deliver dramatic increases in performance and scalability versus previous generation servers.
The chipset includes new embedded technologies that give professionals in business, information management, creative, and scientific fields the tools to solve problems faster, process larger data sets, and meet bigger challenges. With intelligent performance, a new high-bandwidth interconnect architecture, and greater memory capacity, platforms based on the Intel® Xeon® series 7500 processor are ideal for demanding workloads. A standard four-socket server provides up to 32 processor cores, 64 execution threads and a full terabyte of memory. Eight-socket and larger systems are in development by leading system vendors. The Intel® Xeon® series 7500 processor also includes more than 20 new reliability, availability and serviceability (RAS) features that improve data integrity and uptime. One of the most important is Intel® Machine Check Architecture Recovery, which allows the operating system to take corrective action and continue running when uncorrected errors are detected. These highly scalable servers can be used to support enormous user populations.

Server platforms based on the Intel® Xeon® series 7500 processor deliver a number of additional features that help to improve performance, scalability and energy efficiency:

• Next-generation Intel® Virtualization Technology (Intel® VT) provides extensive hardware assists in processors, chipsets and I/O devices to enable fast application performance in virtual machines, including near-native I/O performance. Intel® VT also supports live virtual machine migration among current and future Intel® Xeon® processor-based servers, so businesses can maintain a common pool of virtualized resources as they add new servers.
• Intel® QuickPath Interconnect (QPI) Technology provides point-to-point links to distributed shared memory. The Intel® Xeon® 7500 series processors with QPI feature two integrated memory controllers and 3 QPI links to deliver scalable interconnect bandwidth, outstanding memory performance and flexibility, and tightly integrated interconnect RAS features. Technical articles on QPI can be found at http://www.intel.com/technology/quickpath/.
• Intel® Turbo Boost Technology boosts performance when it is needed most by dynamically increasing core frequencies beyond rated values for peak workloads.
• Intel® Intelligent Power Technology adjusts core frequencies to conserve power when demand is lower.
• Intel® Hyper-Threading Technology can improve throughput and reduce latency for multithreaded applications and for multiple workloads running concurrently in virtualized environments.

For additional information on the Intel® Xeon® series 7500 processor for mission critical applications, please see http://www.intel.com/pressroom/archive/releases/20100330comp_sm.htm.

Sort Operation in IBM InfoSphere DataStage

A brief overall description of the Sort operation is given here. The Sort operator implements a segmented merge sort and accomplishes sorting in two phases. First, the initial sort phase sorts chunks of data into the correct order and stores them as files on the scratch file system. The sort operator uses a buffer whose size is defined by the RMU (Restrict Memory Usage) parameter. This buffer is divided into two halves: records are inserted into one half until it is full, at which point insertion moves to the other half. The full half is sorted and then written out as a chunk to the scratch file system by a separate writer thread. See the figure below.
Figure 1 - Sort operation overview

The sort buffer is used during both the initial sort phase and the final merge phase of the sort operation. During the final merge phase, a block of data is read from the beginning of each of the temporary sorted files stored on the scratch file system. If the sort buffer is too small, there will not be enough memory to read a chunk of data from each of the temporary sort files produced by the initial sort phase. This condition is detected during the initial sort phase and, if it occurs, a second thread runs to perform pre-merging of the temporary sort files. Pre-merging reduces the number of temporary sort files so that the buffer has sufficient space to load a block of data from each of them during the final merge.
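To make the two phases concrete, the following sketch shows the general segmented merge sort technique in Python. This is purely illustrative and is not DataStage code: buffer_records stands in for the RMU-sized sort buffer, and max_open_runs stands in for the number of temporary files whose blocks fit in the buffer during the final merge.

    import heapq
    import os
    import pickle
    import tempfile

    def spill(sorted_records):
        """Write one sorted run to a temporary 'scratch' file."""
        fd, path = tempfile.mkstemp(prefix="sortrun_")
        with os.fdopen(fd, "wb") as f:
            for rec in sorted_records:
                pickle.dump(rec, f)
        return path

    def read_run(path):
        """Stream records back from a spilled run file."""
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def external_sort(records, buffer_records, max_open_runs):
        # Initial sort phase: fill the buffer, sort it, spill it to scratch.
        runs, chunk = [], []
        for rec in records:
            chunk.append(rec)
            if len(chunk) == buffer_records:
                runs.append(spill(sorted(chunk)))
                chunk = []
        if chunk:
            runs.append(spill(sorted(chunk)))

        # Pre-merge: if more runs exist than the final merge can read at
        # once, merge subsets of runs first. This re-reads and re-writes
        # data -- the extra scratch I/O that the RMU tuning below avoids.
        while len(runs) > max_open_runs:
            subset, runs = runs[:max_open_runs], runs[max_open_runs:]
            runs.append(spill(heapq.merge(*(read_run(p) for p in subset))))

        # Final merge phase: stream a block from each remaining run.
        return heapq.merge(*(read_run(p) for p in runs))

    # Example: ten runs of 100 records, pre-merged down to at most 4 runs.
    result = list(external_sort(iter(range(1000, 0, -1)), 100, 4))

In the real Sort operator, the pre-merge is performed by a second thread and the too-many-runs condition is detected during the initial sort phase, as described above.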
In the following tests, we show several tuning and configuration settings that can be used to reduce the I/O demand placed on the system by sort operations.

Testing Configurations

The testing was done on a single Intel® server with the Intel® Xeon® 7500 series chipset and four Intel® Xeon® X7560 processors. The X7560 processors are based on the Nehalem microarchitecture. The system has 4 sockets, 8 cores per socket, and 2 threads per core using Intel® Hyper-Threading Technology, for a total of 64 threads of execution. Our test configuration uses 64 GB of memory, though the platform has a maximum capacity of 1 TB. The processor operating frequency is 2.26 GHz and each processor has 24 MB of L3 cache shared across its 8 cores.

The system uses 5 Intel® X25-E solid state drives (SSDs) configured in a RAID-0 array using the onboard RAID controller. This storage is used as scratch storage for the sort tests. The bandwidth capability of the 5 SSDs was not sufficient to maximize the CPU utilization of the system given the high performance capabilities of DataStage; this is explained in more detail later. We recommend sizing the I/O subsystem to maximize CPU utilization, although we were not able to do this given the equipment available at the time of data collection.

The operating system is Red Hat* Enterprise Linux* 5.3, 64-bit version. The test environment is a standard Information Server two-tier configuration. The client tier runs just the DataStage client applications. All the remaining Information Server tiers (Services + Repository + Engine) are installed on the single Intel® Xeon® X7560 server.

Test client tier:
• OS: Windows Server 2003
• Processor type: x86-based PC
• Processor speed: 2.4 GHz
• Memory size: 8 GB RAM

Services + Repository + Engine tiers:
• Platform: Red Hat EL 5.3, 64 bit
• Processor: Intel® Xeon® X7560, 4 sockets, 32 cores, 64 threads
• Processor speed: 2.26 GHz
• Memory size: 64 GB RAM
• Metadata repository: DB2/LUW 9.7 GA
• Scratch space: 5 Intel® X25-E SSDs configured as a RAID-0 array using the onboard controller
• IS topology: standalone

Figure 2 - System Test Configuration

The following table lists the specifics of the platform tested:

    OEM                     Intel®
    CPU Model ID            7560
    Platform Name           Boxboro
    Sockets                 4
    Cores per Socket        8
    Threads per Core        2
    CPU Code Name           Nehalem-EX
    CPU Frequency (GHz)     2.24
    QPI (GT/s)              6.4
    Hyper-Threading         Enabled
    Prefetch Settings       Default
    LLC Size (MB)           24
    BIOS Version            R21
    Memory Installed (GB)   64
    DIMM Type               DDR3-1066
    DIMM Size (GB)          4
    Number of DIMMs         16
    NUMA                    Enabled
    OS                      RHEL 5.3, 64 bit

Table 1 – Intel® Platform Tested

Summary for Sort Performance Optimizations

This section provides a brief summary of the recommendations from this performance study. The "Recommendations for Optimizing Sort Performance" section provides more detail for those seeking a deeper technical dive.

Reducing I/O contention is critical to optimizing Sort stage performance. Spreading sort I/O across different physical disks is a simple first step. A sample DataStage configuration file implementing this method is shown below.

    {
        node "node1"
        {
            fastname "DataStage1.ibm.com"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets1" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch1" {pools ""}
        }
        node "node2"
        {
            fastname "DataStage2.ibm.com"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets2" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch2" {pools ""}
        }
    }

In this configuration file, each DataStage processing node has its own scratch space defined in a directory that resides on a separate physical device. This helps prevent contention for I/O subsystem resources among DataStage processing nodes. This is a fairly well known technique and was not studied for this paper.
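As a quick sanity check that the scratch directories in such a configuration really do reside on separate physical devices, the device IDs reported by the operating system can be compared. Below is a minimal sketch in Python; the paths are the ones from the sample configuration file above, and a RAID array correctly appears as a single device.

    import os

    # st_dev identifies the device a path resides on; scratch directories
    # that share a device will contend for the same I/O resources.
    paths = [
        "/opt/IBM/InformationServer/Server/Scratch1",
        "/opt/IBM/InformationServer/Server/Scratch2",
    ]
    devices = {p: os.stat(p).st_dev for p in paths}
    if len(set(devices.values())) < len(paths):
        print("Warning: some scratch directories share a device:", devices)
    else:
        print("Scratch directories are on separate devices.")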
This paper describes additional techniques to achieve optimal performance for DataStage jobs containing Sort operations:

1. Setting the Restrict Memory Usage (RMU) parameter for sort operations to an appropriate value for large data sets will reduce I/O demand on the scratch file system. The recommended RMU size varies with the data set size and node count. The formula is given in the "Optimal RMU Tuning" section, along with a reference table that summarizes suggested RMU sizes for a variety of data set sizes and node counts. The RMU parameter gives users the flexibility of defining the sort buffer size to optimize memory usage of their system.

2. Increasing the default Linux read-ahead value for the disk storage system(s) used for scratch space can increase the performance of the final merge phase of sort operations. The recommended read-ahead value is 512 or 1024 sectors (256 KB or 512 KB) for the scratch file system. See "Final Merge Sort Phase Tuning using Linux Read Ahead" for how to change the read-ahead value in Linux.

3. Sort operations can benefit from having a buffer operator inserted before the sort in the data flow graph. Because sort operations work on large amounts of data, a buffer operator provides extra storage to get the data to the sort operator as fast as possible. See "Using a Buffer Operator to Minimize Latency for Sort Input" for details.

4. Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand during sorting when bounded-length VARCHAR data is involved, potentially improving overall throughput for the job.

Recommendations for Optimizing Sort Performance

We investigated the Sort operation in detail and considered the effect of a number of performance tuning factors on its I/O characteristics. The input data size and the format of the data are critical input factors affecting sort. The layout of the I/O subsystem and the file cache and prefetching characteristics are also important. The RMU buffer size configuration parameter has a significant effect on the behavior of the sort as the input data set size is adjusted. These factors are considered in greater detail below.

In our tests, a job consisting of one sort stage and running on a one-node configuration was capable of sorting and writing data at a rate of 120 MB/s to the scratch file system. Increasing the node count of the job quickly resulted in more I/O requests to the scratch I/O storage array than it was able to service in a timely manner. Due to this limitation of the scratch I/O system, the Intel® Xeon® server CPUs were greatly underutilized; the scratch file system was simply under-configured for a server with such high computational capability. This illustrates the high compute power available on Intel® Xeon® processors and the ability of IBM InfoSphere DataStage to efficiently harness it. Configuring sufficient I/O to feed this powerful combination of hardware and software is of paramount importance for efficient utilization of the system.

For our test system, we chose a configuration for the scratch storage I/O system that was significantly undersized in comparison to the compute capability of the server. While we recommend always configuring for optimal performance, which would include a more capable scratch storage system, customer feedback has indicated that many deployed systems have insufficient bandwidth to the scratch storage system. The tuning and configuration tips in this paper are designed to increase performance on all systems, but they will be especially beneficial for systems constrained by the scratch I/O storage system. In all cases, these tips can reduce the amount of data transferred.

By adjusting the DataStage parallel node count, we were able to match the scratch storage capabilities and prevent the scratch storage system from being saturated. This allowed us to study and develop this tuning guidance in a balanced environment. Using this strategy, we developed several tuning guidelines that reduce the demand for scratch storage I/O, which we used to effectively increase performance. This is likely to be the situation for many customers, as growth in CPU processing performance continues to outpace the I/O capability of storage subsystems.

Several valuable tuning guidelines were discovered, and we present the findings here. While these findings are significant, and we highly recommend them, we also want to make clear that there is no substitute for a high performance scratch storage system capable of supplying sufficient bandwidth and I/Os per second (IOPS) to maintain high CPU utilization. The tuning guidance given here will help even a high performance scratch I/O system deliver better performance to DataStage jobs using sort operations. The remainder of this section describes the tuning results we found to improve sort performance through I/O reduction.

Optimal RMU Tuning

This section describes how to tune the sort buffer size parameter, called RMU, to minimize I/O demand on the scratch I/O system. An RMU value that is too small will result in intermediate merges of temporary files during the initial sort phase. These intermediate merges can significantly increase the I/O demand on the scratch file system.
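A rough, illustrative model of this effect, assuming the worst case in which every record flows through each pre-merge pass once (the numbers here are hypothetical, not measurements from our tests):

    def scratch_traffic_mb(data_mb, premerge_passes):
        # Every record is written once and read once for the initial sort
        # plus final merge; each pre-merge pass adds one more write and
        # one more read of the data it touches (worst case: all of it).
        writes = data_mb * (1 + premerge_passes)
        reads = data_mb * (1 + premerge_passes)
        return writes + reads

    print(scratch_traffic_mb(100_000, 0))  # 200,000 MB moved, no pre-merge
    print(scratch_traffic_mb(100_000, 1))  # 400,000 MB: one pass doubles it

On a scratch array already running at its bandwidth limit, doubling the traffic roughly doubles the time spent on sort I/O.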
Tuning the RMU value appropriately can eliminate all intermediate merge operations and greatly increase the throughput of sort operations on systems with limited bandwidth to the scratch I/O file system. On many systems the scratch disk I/O system is a performance bottleneck because the disks, or the interconnect to them, cannot supply the bandwidth needed to maximize CPU utilization. Eliminating pre-merging reduces the overall I/O demand on the scratch file system, allowing it to complete I/O faster, increasing throughput and decreasing job run time.

Configuration / Job Tuning Recommendations

Given knowledge of the size of the data to be sorted, it is possible to calculate the optimal RMU value that prevents the pre-merge thread from running, thus reducing I/O demand. The RMU formula is:

    RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2

Notes about using the above formula:

1. The total data size is divided by the node count because the data sorted per node decreases with increasing node count. A node in this context refers to one of the parallel instances of the job when it is instantiated.

2. Our tests indicate that the RMU value can span a fairly large range and still provide good performance. Sometimes the amount of data to be sorted is not known precisely. We recommend estimating the input data size to within a factor of two of the actual value; overestimating the data set size by a factor of 2x will still result in an RMU value from the above equation that provides good performance.

3. The default RMU value is 20 MB, which can sort up to 1.6 GB of data per node while avoiding costly pre-merge operations. If your data set size divided by node count is less than 1.6 GB, no change to the RMU is necessary.
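For convenience, the formula (with the 20 MB default as a floor) can be wrapped in a small helper. This is just a calculation aid, not part of DataStage:

    import math

    def recommended_rmu_mb(data_size_gb, node_count, default_mb=20):
        # RMU (MB) >= SQRT(DataSizeToSort (MB) / NodeCount) / 2,
        # never going below the 20 MB default.
        data_size_mb = data_size_gb * 1024
        rmu_mb = math.sqrt(data_size_mb / node_count) / 2
        return max(default_mb, math.ceil(rmu_mb))

    print(recommended_rmu_mb(100, 4))   # 80 MB for 100 GB across 4 nodes
    print(recommended_rmu_mb(1.5, 1))   # 20: under 1.6 GB/node, default is fine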
The following table is a handy reference of minimum RMU settings for different input data set sizes and node counts. The table assumes the user knows the size of the data to be sorted; knowing the precise size may not be feasible, but overestimating it by up to a factor of two will still give good performance. The table contains the word "Default" where the formula yields less than the 20 MB default, indicating that the user should use the default value. It is not necessary to decrease the RMU value below the 20 MB default, though doing so is allowed.

    Data Size                          Minimum RMU (MB) per node count
    to Sort (GB)   1 Node   4 Nodes  8 Nodes  16 Nodes  24 Nodes  32 Nodes  48 Nodes  64 Nodes
    1              Default  Default  Default  Default   Default   Default   Default   Default
    1.5            Default  Default  Default  Default   Default   Default   Default   Default
    3              28       Default  Default  Default   Default   Default   Default   Default
    10             51       25       Default  Default   Default   Default   Default   Default
    30             88       44       31       22        Default   Default   Default   Default
    100            160      80       57       40        33        28        23        20
    300            277      139      98       69        57        49        40        35
    1000           506      253      179      126       103       89        73        63
    3000           876      438      310      219       179       155       126       110
    10000          1600     800      566      400       327       283       231       200

Table 2 – RMU Buffer Size Table

Our test results for a job consisting of one sort stage running with 4 parallel nodes and two different RMU values are shown in Figure 3. Correct sizing of the RMU value resulted in a 36% throughput increase.

    RMU Size   Read Ahead Setting       Run Time
    10 MB      128 KB (Linux default)   4.05 minutes
    30 MB      128 KB (Linux default)   2.97 minutes

Figure 3 - Performance tuning sort with the Sort operator RMU value

In these tests, the I/O bandwidth did not decrease, because the I/O subsystem was delivering the maximum bandwidth it was capable of in both cases. However, because the total quantity of data transferred was much lower, the CPU cores were able to operate at higher utilization and complete the sort in a shorter amount of time. This optimization is very effective for scratch disks that are unable to deliver enough scratch file I/O bandwidth to feed the high performing Intel® Xeon® server and highly efficient IBM InfoSphere DataStage software. The results shown here are for a sort-only job, where we have isolated the effect of the RMU parameter. This optimization will help more complex jobs, but it will only directly affect the performance of the sort operators within the job.

To modify the RMU setting for a Sort stage in a job, open the Sort stage on the DataStage Designer client canvas, click the 'Stage' tab, then 'Properties', click 'Options' in the left window, and select 'Restrict Memory Usage (MB)' from the 'Available properties to add' window to add it.

Figure 4 – Adding RMU Option

Once the Restrict Memory Usage option is added, its value can be set to the one recommended by the formula above.

Figure 5 – Setting RMU Option

Final Merge Sort Phase Tuning using Linux Read Ahead

During testing of the single-node sort job, we found that CPU utilization of the final merge can be improved by changing the scratch disk read-ahead setting in Linux, resulting in substantial throughput improvements of the final merge sort phase.

Configuration / Job Tuning Recommendations

The default Linux file system read-ahead value is 256 sectors. A sector is 512 bytes, so the total default read ahead is 128 KB. Our testing indicated that increasing the read-ahead value to 1024 sectors (512 KB) increased CPU utilization and reduced the final merge time by reducing the amount of time that DataStage had to wait for I/Os from the scratch file system. This resulted in an increase in throughput of the final merge phase of sort of approximately 30%. Test results for a job consisting of one sort stage running with 4 parallel nodes and two different values for the Linux read-ahead setting are shown in Figure 6. Increasing the Linux default read-ahead setting of 128 KB to 512 KB resulted in a 9% improvement in throughput of the job.

    RMU Size   Read Ahead Setting       Run Time
    30 MB      128 KB (Linux default)   2.97 minutes
    30 MB      512 KB                   2.72 minutes

Figure 6 - Performance tuning the Sort operator with the Linux read ahead setting

The current read-ahead setting for a disk device in Linux can be obtained with the hdparm command:

    >hdparm -a /dev/sdb1

To set the read-ahead value for a specific disk device, use the following command (this example sets read ahead to 1024 sectors on disk device /dev/sdb1):

    >hdparm -a 1024 /dev/sdb1

To make the setting persist across reboots, add the same hdparm command to the /etc/init.d/boot.local file. Recommended settings to try are 512 sectors (256 KB) or 1024 sectors (512 KB).
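The kernel also exposes the same setting through sysfs, in kilobytes rather than sectors, which provides a convenient way to confirm that the change took effect. A small sketch follows; the device name sdb is only a placeholder for the device backing the scratch array:

    from pathlib import Path

    def read_ahead_kb(device="sdb"):
        # /sys/block/<device>/queue/read_ahead_kb holds the kernel's
        # current read-ahead size for the device, in KB.
        return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())

    print(read_ahead_kb("sdb"))  # 128 by default; 512 after 'hdparm -a 1024'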
Increasing the read-ahead size results in more data being read from the disk and stored in the OS disk cache. As a result, more read requests from the sort operator get the requested data directly from the OS disk cache instead of waiting the full latency of a data read from the scratch storage system. (Note that the Linux file system cache is controlled by the kernel and uses memory that is not allocated to processes.)

In our tests, the scratch storage system consists of SSDs configured in a RAID-0 array. I/O request latencies are low on this system compared to typical rotating-media storage arrays. Increasing OS read ahead will benefit scratch storage arrays consisting of HDDs even more, and larger read-ahead values than those tested may be more beneficial for HDD arrays. We chose SSDs because they provide higher bandwidth, much improved IOPS (I/Os per second) and much lower latency than an equivalent number of hard disk drives.

Many RAID controllers found in commercial storage systems can also perform read ahead on read requests and store the data in the controller cache. It is good to enable this feature if it is available on the storage array being used for scratch storage. It is still important to increase read ahead in the OS: serving requests from the OS disk cache is faster than waiting for data from the RAID engine.

The results shown here are for a job with a sort operation only. Tuning read ahead will not impact the performance of other operations in the job that do not perform scratch disk I/O.

Using a Buffer Operator to Minimize Latency for Sort Input

The DataStage parallel engine employs buffering automatically as part of its normal operation. Because the initial sort phase has such a high demand for input data, it is especially sensitive to latency spikes in the data source feeding the sort. These latency spikes can occur because data is sourced from local or remote disks, or because of operating system scheduling of operators. By adding an additional buffer in front of the sort, we were able to keep CPU utilization on the core running the sort thread at 100% during the entire initial sort phase, increasing the performance of the initial sort phase by nearly 7%.

Configuration / Job Tuning Recommendations

We recommend adding a buffer of size equal to the RMU value in front of the sort. To do so, open the Sort stage in a DataStage job on the DataStage Designer client canvas, click the 'Input' tab, then 'Advanced'. Select 'Buffer' from the 'Buffering mode' drop-down menu and modify the 'Maximum memory buffer size (bytes)' field.

Figure 7 - Adding a buffer in front of the sort

Minimizing I/O for Sort Data Containing Variable Length Fields

By default, the parallel engine internally handles bounded-length VARCHAR fields (those that specify a maximum length) as essentially fixed-length character strings. If the actual data in the field is shorter than the maximum length, the string is padded to the maximum size. This behavior is efficient for CPU processing of records throughout an entire job flow, but it increases the I/O demands for operations such as Sort. When the environment variable APT_OLD_BOUNDED_LENGTH is set, the data within each VARCHAR field is processed without additional padding, resulting in less data written to disk. This can decrease I/O bandwidth demand and therefore increase performance when running with a scratch disk subsystem of insufficient bandwidth; it can increase job throughput if the scratch file system is not able to keep up with the processing capability of DataStage and the Intel® Xeon® server.
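To see why the padding matters for scratch I/O, consider a back-of-the-envelope estimate. The sketch below uses made-up sizes (a VARCHAR(100) column whose values average 25 characters) and ignores record headers and length prefixes; it is meant only to illustrate the scale of the effect:

    def padded_bytes(values, max_len):
        # Default behavior: bounded-length VARCHARs are carried as
        # fixed-length strings, padded out to the declared maximum.
        return len(values) * max_len

    def actual_bytes(values):
        # With APT_OLD_BOUNDED_LENGTH set, only the actual data is
        # written (plus a small length prefix, ignored here).
        return sum(len(v) for v in values)

    values = ["x" * 25] * 1_000_000       # hypothetical one-million-row column
    print(padded_bytes(values, 100))      # 100,000,000 bytes to scratch
    print(actual_bytes(values))           # 25,000,000 bytes to scratch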
Note that additional CPU cycles are used to process variable-length data when APT_OLD_BOUNDED_LENGTH is set: CPU processing power is traded for a reduction in the amount of I/O required from the scratch file system. Our test results for a job consisting of one sort stage running with 16 parallel nodes using APT_OLD_BOUNDED_LENGTH showed a 25% reduction in the size of temporary sort files and a 26% increase in throughput (a 21% reduction in runtime).

    Normalized Comparison             Default   With APT_OLD_BOUNDED_LENGTH
    Scratch storage space consumed    1.0       0.75x (75% of the original storage space used)
    Runtime                           1.0       0.79x (79% of the original runtime)
    Throughput                        1.0       1.26x (26% increase in job processing rate)

Table 3 – Sort operation performance comparison using APT_OLD_BOUNDED_LENGTH

Please note that the performance benefit of this tuning parameter will vary based on several factors. It applies only to data records that have VARCHAR fields. The actual file size reduction realized on the scratch storage system will depend heavily on the maximum size specified by the VARCHAR fields, the size of the actual data contained in those fields, and whether the VARCHAR fields are a sort key for the records. The amount of performance benefit will depend on how much the total file size is reduced, along with the data request rate of the sort operations compared to the capability of the scratch file system to supply the data. In our test configuration, the 16-node test drove the scratch I/O system to its maximum bandwidth limit. By setting APT_OLD_BOUNDED_LENGTH, the amount of data written to, and subsequently read from, the disk decreased substantially over the length of the job, allowing faster completion.

Configuration / Job Tuning Recommendations

This optimization only affects data sets that use bounded-length VARCHAR data types. APT_OLD_BOUNDED_LENGTH is a user-defined variable for DataStage. The variable can be added either at the project level or at the job level. You can follow the instructions in the IBM InfoSphere DataStage and QualityStage Administrator Client Guide and the IBM InfoSphere DataStage and QualityStage Designer Client Guide to add and set a new variable. We recommend trying this setting if low CPU utilization is observed during sorting or if it is known that the scratch file system is unable to keep up with job demands.

Future Study: Using Memory for RAM Based Scratch Disk

As a future study, we intend to investigate performance when using a RAM-based disk for scratch storage. The memory bandwidth available in the Nehalem-EX test system is greater than 70 GB/s when correctly configured. While SSDs offer some bandwidth improvement over hard disk drives, they cannot begin to match the bandwidth of main memory. The system's PCI Express lanes can reach approximately 35 GB/s of I/O in each direction if all lanes are utilized; however, such an I/O solution would be expensive. The currently available 4-socket Intel® X7560 systems can address 1 TB of memory, and 8-socket systems can address 2 TB. DRAM capacity will continue to rise with new product releases, and IBM X series systems also offer options to increase DRAM capacity beyond the baseline. While DRAM is expensive compared to disk drives on a per-capacity basis, it is more favorable when comparing bandwidth capability in and out of the system. We plan to evaluate the performance and cost-benefit of large in-memory storage compared to disk-based storage solutions and provide the results in the near future.
Best Practices

This paper described the following techniques for achieving optimal performance for DataStage jobs containing Sort operations:

• Set the Restrict Memory Usage (RMU) parameter for sort operations to an appropriate value for large data sets to reduce I/O demand on the scratch file system. The recommended RMU size varies with the data set size and node count. The formula is given in the "Optimal RMU Tuning" section, along with a reference table that summarizes suggested RMU sizes for a variety of data set sizes and node counts. The RMU parameter gives users the flexibility of defining the sort buffer size to optimize memory usage of their system.

• Increase the default Linux read-ahead value for the disk storage system(s) used for scratch space to increase the performance of the final merge phase of sort operations. The recommended read-ahead value is 512 or 1024 sectors (256 KB or 512 KB) for the scratch file system. See "Final Merge Sort Phase Tuning using Linux Read Ahead" for how to change the read-ahead value in Linux.

• Insert a buffer operator before the sort in the data flow graph. Because sort operations work on large amounts of data, a buffer operator provides extra storage to get the data to the sort operator as fast as possible. See "Using a Buffer Operator to Minimize Latency for Sort Input" for details.

• Enable the APT_OLD_BOUNDED_LENGTH setting to decrease I/O demand during sorting when bounded-length VARCHAR data is involved, potentially improving overall throughput for the job.

Conclusion

We have shown how to optimize IBM InfoSphere DataStage sort performance on Intel® Xeon® processors using a variety of tuning options: the Sort buffer RMU size, the Linux read-ahead setting, an additional buffer operator, and the bounded-length VARCHAR handling setting. Our results reinforce the necessity of correctly sizing I/O to optimize server performance. For sort, it is imperative to have sufficient scratch I/O storage performance to keep all sort operators running concurrently in the system fully supplied with data, in order to fully utilize the server.

Powerful mission critical servers like the Intel® Xeon® platforms based on the X7500 series processor running the IBM InfoSphere DataStage parallel engine can efficiently process data at extremely high rates. As a result, I/O and network bandwidth are extremely important for high performance. Network interconnects like 10 Gbit/s Ethernet or 40 Gbit/s Fibre Channel are necessary to fully realize the computation potential of this powerful combination of hardware and software.

In the near future, we plan to analyze the cost and benefit trade-off of using large DRAM capacity as a replacement for disk subsystems for scratch I/O. We will also look at tuning high bandwidth networking solutions to optimize performance.

Further reading

Other documentation you might be interested in:

• IBM InfoSphere Information Server, Version 8.5 Information Center: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp

Contributors

Garrett Drysdale is a Sr. Software Performance Engineer for Intel. Garrett has analyzed and optimized software on Intel® platforms since 1995, spanning the client, workstation, and enterprise server market segments. He currently works with enterprise software developers to analyze and optimize server applications, and with internal design teams to help evaluate the impact of new technologies on software performance for future Intel® platforms.
Garrett has a BSEE from the University of Missouri-Rolla and an MSEE from the Georgia Institute of Technology. His email is [email protected].

Jantz Tran is a Software Performance Engineer for Intel. He has been analyzing and optimizing enterprise software on Intel server platforms for 10 years. Jantz has a BSCE from Texas A&M University. His email is [email protected].

Dr. Sriram Padmanabhan is an IBM Distinguished Engineer and Chief Architect for IBM InfoSphere Servers. Most recently, he led the Information Management Advanced Technologies team, investigating new technical areas such as the impact of Web 2.0 on information access and delivery. He was a Research Staff Member and then a manager of the Database Technology group at the IBM T.J. Watson Research Center for several years. He was a key technologist for DB2's shared-nothing parallel database feature and one of the originators of DB2's multi-dimensional clustering feature. He was also a chief architect for Data Warehouse Edition, which provides integrated warehousing and business intelligence capabilities enhancing DB2. Dr. Padmanabhan has authored more than 25 publications, including a book chapter on DB2 in a popular database textbook, several journal articles, and many papers in leading database conferences. His email is [email protected].

Brian Caufield is a Software Architect for InfoSphere Information Server, responsible for the definition and design of new IBM InfoSphere DataStage features, and he also works with the Information Server Performance Team. Brian represents IBM at the TPC, working to define an industry standard benchmark for data integration. Previously, Brian worked for 10 years as a developer on IBM InfoSphere DataStage, specializing in the parallel engine. His email is [email protected].

Fan Ding is currently a member of the Information Server Performance Team. Prior to joining the team, he worked in Information Integration Federation Server development. Fan has a Ph.D. in Mechanical Engineering and a Master's in Computer Science from the University of Wisconsin. His email is [email protected].

Ron Liu is currently a member of the IBM InfoSphere Information Server Performance Team, with a focus on performance tuning and information integration benchmark development. Prior to his current job, Ron spent 7 years in database server development (federation runtime, wrapper, query gateway, process model, and database security). Ron has a Master of Science in Computer Science and a Bachelor of Science in Physics. His email is [email protected].

Pin Lp Lv is a Software Performance Engineer at IBM. Pin has worked for IBM since 2006. He worked as a software tester for the IBM WebSphere Product Center and RFID teams from September 2006 to March 2009, and joined the IBM InfoSphere Information Server Performance Team in April 2009. Pin has a Master of Science degree in Computer Science from the University of the West of Scotland. His email is [email protected].

Mi Wan Shum is the manager of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. She graduated from the University of Texas at Austin and has years of software development experience at IBM. Her email is [email protected].

Jackson (Dong Jie) Wei is a Staff Software Performance Engineer for IBM. He worked as a DBA at CSRC before joining IBM in 2006. Since then, he has been working on the Information Server product, and in 2009 he began to focus his work on ETL performance.
Jackson is also the technical lead for the IBM China Lab Information Server performance group. He received his bachelor's and master's degrees in Electronic Engineering from Peking University in 2000 and 2003, respectively. His email is [email protected].

Samuel Wong is a member of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab. He graduated from the University of Toronto and has 12 years of software development experience with IBM. His email is [email protected].

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment does so at their own risk.

This document and the information contained herein may be used solely in connection with the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.