IBM® InfoSphere® DataStage®
Best Practices
Performance Guidelines for IBM InfoSphere DataStage Jobs
Containing Sort Operations on Intel® Xeon® Servers
Garrett Drysdale, Intel® Corporation
Jantz Tran, Intel® Corporation
Sriram Padmanabhan, IBM
Brian Caufield, IBM
Fan Ding, IBM
Ron Liu, IBM
Pin Lp Lv, IBM
Mi Wan Shum, IBM
Jackson Dong Jie Wei, IBM
Samuel Wong, IBM
Executive Summary
Introduction
Overview of IBM InfoSphere DataStage
Overview of Intel® Xeon® Series X7500 Processors
Sort Operation in IBM InfoSphere DataStage
    Testing Configurations
Summary for Sort Performance Optimizations
Recommendations for Optimizing Sort Performance
    Optimal RMU Tuning
        Configuration / Job Tuning Recommendations
    Final Merge Sort Phase Tuning using Linux Read Ahead
        Configuration / Job Tuning Recommendations
    Using a Buffer Operator to minimize latency for Sort Input
        Configuration / Job Tuning Recommendations
    Minimizing I/O for Sort Data containing Variable length fields
        Configuration / Job Tuning Recommendations
    Future Study: Using Memory for RAM Based Scratch Disk
Best Practices
Conclusion
Further reading
    Contributors
Notices
    Trademarks
Executive Summary
The objective of this document is to communicate best practices for tuning IBM InfoSphere DataStage jobs containing sort operations on Intel® Xeon® servers. Sort operations are I/O intensive and can place a significant load on the temporary, or scratch, file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth will lead to under-utilization of the computing capability of the system, observed as low CPU utilization.
The paper provides recommendations that reduce the bandwidth demand placed on the scratch storage I/O system by sort operations. These I/O reductions result in improved performance that can be quite significant for systems where the scratch I/O storage system is significantly undersized in comparison to the compute capability of the processors.
Introduction
This whitepaper is the first in a planned series intended to provide IBM InfoSphere DataStage customers with helpful performance tuning guidelines for deployment on Intel® Xeon® processor-based platforms. IBM and Intel began collaborating in 2007 to optimize the performance and ROI of the combination of IBM InfoSphere DataStage and Intel® Xeon® based platforms. Our goal is not only to optimize performance, and thereby reduce the total cost of ownership of this powerful combination in future versions of IBM InfoSphere DataStage on future Intel® processors, but also to pass along the tuning and configuration guidance that we discover along the way.
In our work together, we are striving to understand the execution characteristics of
DataStage jobs on Intel® platforms. This information is used to determine the hardware
configurations, the operating system settings, and the job design and tuning techniques
to optimize performance. Because of the highly scalable capabilities of IBM InfoSphere DataStage, our tests focus on the latest Intel® Xeon® X7560 EX processors, which support 4- and 8-socket configurations. Initially, we are testing with four-socket configurations.
We have presented information about IBM InfoSphere DataStage on Intel® platforms at
the 2009 and 2010 IBM Information on Demand Conferences. In 2009, our audience
applauded the great scalability of IBM InfoSphere DataStage on Intel® platforms, but
asked us to provide more information on the I/O requirements of jobs and how to get the
most out of existing platform I/O capability. Since then, we have found ways to increase the overall performance of all jobs in the new Information Server 8.5 version of IBM InfoSphere DataStage, which is now a 64-bit binary on Intel® platforms, and we have investigated the I/O requirements of sorting.
The focus of the paper is the key information we obtained regarding configuration of the platform, the operating system, and DataStage jobs that contain sort operators. Sort is a crucial operation in data integration software. Sort operations are I/O intensive and can place a significant load on the temporary or scratch file system. To optimize server CPU utilization, the scratch I/O storage system must be capable of providing the disk bandwidth demanded by the sort operations. A scratch storage system that cannot write or read data at a high enough bandwidth will lead to under-utilization of the computing capability of the system. This will be observed as low CPU utilization.
The paper provides recommendations that will reduce the bandwidth demand placed on
the scratch storage I/O system by sort operations. These I/O reductions result in
improved performance that can be quite significant for systems where the scratch I/O
storage system is significantly undersized in comparison to the compute capability of the
processors. We show such a scenario in this paper. Ideally, the best solution is to
upgrade the scratch I/O storage subsystem to match the compute capability of the server.
Overview of IBM InfoSphere DataStage
IBM InfoSphere DataStage is a product for data integration via Extract-Transform-Load
capabilities. It provides a designer tool that allows developers to visually create
integration jobs. The term job is used within IBM InfoSphere DataStage to describe extract, transform, and load (ETL) tasks. Jobs are composed from a rich palette of operators called
stages. These stages include:
• Source and target access for databases, applications and files
• General processing stages such as filter, sort, join, union, lookup and aggregations
• Built-in and custom transformations
• Copy, move, FTP and other data movement stages
• Real-time, XML, SOA and Message queue processing
Additionally, IBM InfoSphere DataStage allows pre- and post-conditions to be applied to
all these stages. Multiple jobs can be controlled and linked by a sequencer. The sequencer
provides the control logic that can be used to process the appropriate data integration
jobs. IBM InfoSphere DataStage also supports a rich administration capability for
deploying, scheduling and monitoring jobs.
One of the great strengths of IBM InfoSphere DataStage is that job designs require very little consideration of the underlying structure of the system and do not typically need to change: if the system changes, is upgraded or improved, or if a job is developed on one platform and implemented on another, the job design does not necessarily have to change. IBM InfoSphere DataStage has the capability to learn about
the shape and size of the system from the IBM InfoSphere DataStage configuration file.
Further, it has the capability to organize the resources needed for a job according to what
is defined in the configuration file. When a system changes, the file is changed, not the
jobs. A configuration file defines one or more processing nodes with which the job will
run. The processing nodes are logical rather than physical. The number of processing
nodes does not necessarily correspond to the number of cores in the system.
The following are factors that affect the optimal degree of parallelism:
• CPU-intensive applications, which typically perform multiple CPU-demanding
operations on each record, benefit from the greatest possible parallelism up to the
capacity supported by a given system.
• Jobs with large memory requirements can benefit from parallelism if they act on data
that has been partitioned and if the required memory is also divided among partitions.
• Applications that are disk- or I/O-intensive, such as those that extract data from and
load data into databases, benefit from configurations in which the number of logical
nodes equals the number of I/O paths being accessed. For example, if a table is
partitioned 16 ways inside a database or if a data set is spread across 16 disk drives, one
should set up a node pool consisting of 16 processing nodes.
Another great strength of IBM InfoSphere DataStage is that it does not rely on the
functions and processes of a database to perform transformations: while IBM InfoSphere
DataStage can generate complex SQL and leverages databases, IBM InfoSphere
DataStage is designed from the ground up as a multipath data integration engine equally
at home with files, streams, databases, and internal caching in single-machine, cluster,
and grid implementations. As a result, customers in many circumstances find they do not
also need to invest in staging databases to support IBM InfoSphere DataStage.
Overview of Intel® Xeon® Series X7500 Processors
Servers using the Intel® Xeon® series 7500 processor deliver dramatic increases in
performance and scalability versus previous generation servers. The chipset includes
new embedded technologies that give professionals in business, information
management, creative, and scientific fields, the tools to solve problems faster, process
larger data sets, and meet bigger challenges.
With intelligent performance, a new high-bandwidth interconnect architecture, and
greater memory capacity, platforms based on the Intel® Xeon® series 7500 processor are
ideal for demanding workloads. A standard four-socket server provides up to 32
processor cores, 64 execution threads and a full terabyte of memory. Eight-socket and
larger systems are in development by leading system vendors. The Intel® Xeon® series
7500 processor also includes more than 20 new reliability, availability and serviceability
(RAS) features that improve data integrity and uptime. One of the most important is
Intel® Machine Check Architecture Recovery, which allows the operating system to take
corrective action and continue running when uncorrected errors are detected. These
highly scalable servers can be used to support enormous user populations.
Server platforms based on the Intel® Xeon® series 7500 processor deliver a number of
additional features that help to improve performance, scalability and energy-efficiency.
• Next-generation Intel® Virtualization Technology (Intel® VT) provides extensive hardware assists in processors, chipsets and I/O devices to enable fast application performance in virtual machines, including near-native I/O performance. Intel® VT also supports live virtual machine migration among current and future Intel® Xeon® processor-based servers, so businesses maintain a common pool of virtualized resources as they add new servers.
• Intel® QuickPath Interconnect Technology provides point-to-point links to distributed shared memory. The Intel® Xeon® 7500 series processors with QPI feature two integrated memory controllers and 3 QPI links to deliver scalable interconnect bandwidth, outstanding memory performance and flexibility, and tightly integrated interconnect RAS features. Technical articles on QPI can be found at http://www.intel.com/technology/quickpath/.
• Intel® Turbo Boost Technology boosts performance when it's needed most by dynamically increasing core frequencies beyond rated values for peak workloads.
• Intel® Intelligent Power Technology adjusts core frequencies to conserve power when demand is lower.
• Intel® Hyper-Threading Technology can improve throughput and reduce latency for multithreaded applications and for multiple workloads running concurrently in virtualized environments.
For additional information on the Intel® Xeon® Series 7500 Processor for mission critical
applications, please see
http://www.intel.com/pressroom/archive/releases/20100330comp_sm.htm.
Sort Operation in IBM InfoSphere DataStage
A brief overall description of the Sort operation is given here. The Sort operator implements a segmented merge sort and accomplishes sorting in two phases.
First, the initial sort phase sorts chunks of data into the correct order and stores this data as files on the scratch file system. The sort operator uses a buffer whose size is defined by the RMU parameter. This buffer is divided into two halves: records are inserted into one half until it is full, at which point inserts move to the other half. The full half is sorted and then written out as a chunk to the scratch file system by a separate writer thread. See the figure below.
Figure 1 - Sort operation overview
The sort buffer is used during both the initial sort phase and the final merge phase of the sort operation. During the final merge phase, a block of data is read from the beginning of each of the temporary sorted files stored on the scratch file system. If the sort buffer is too small, there will not be enough memory to read a chunk of data from each of the temporary sort files produced by the initial sort phase. This condition is detected during the initial sort phase and, if it occurs, a second thread runs to perform pre-merging of the temporary sort files. Pre-merging reduces the number of temporary sort files so that the buffer has sufficient space to load a block of data from each of them during final merging.
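To make the two phases concrete, the following short Python sketch illustrates the general segmented (external) merge sort technique: sort fixed-size chunks, spill each to a scratch file, then stream-merge the spills. This is a minimal illustration of the algorithm only, not DataStage's implementation; the chunk_size parameter stands in for the RMU-sized buffer, and the helper names are our own.

    import heapq
    import os
    import tempfile

    def spill(sorted_chunk, scratch_dir):
        # Write one sorted chunk to its own temporary scratch file.
        f = tempfile.NamedTemporaryFile("w", dir=scratch_dir, delete=False, suffix=".srt")
        with f:
            f.writelines(rec + "\n" for rec in sorted_chunk)
        return f.name

    def external_sort(records, chunk_size, scratch_dir):
        # Phase 1 (initial sort): sort chunk_size records at a time and spill
        # each sorted chunk to the scratch directory.
        spills, chunk = [], []
        for rec in records:
            chunk.append(rec)
            if len(chunk) == chunk_size:
                spills.append(spill(sorted(chunk), scratch_dir))
                chunk = []
        if chunk:
            spills.append(spill(sorted(chunk), scratch_dir))
        # Phase 2 (final merge): read the spill files back and merge them;
        # heapq.merge consumes its inputs lazily, a block at a time.
        files = [open(p) for p in spills]
        try:
            for line in heapq.merge(*files):
                yield line.rstrip("\n")
        finally:
            for f in files:
                f.close()
            for p in spills:
                os.unlink(p)

    # Example: sort 10 single-character records with room for only 3 in "memory".
    print(list(external_sort(iter("datastage!"), 3, tempfile.gettempdir())))

If the number of spill files grew too large to merge in one pass, an implementation would first pre-merge subsets of them, which is exactly the extra I/O that proper RMU sizing avoids.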
In the following tests, we will show several tuning and configuration settings that can be
used to reduce the I/O demand placed on the system by sort operations.
Testing Configurations
The testing was done on a single Intel® server with the Intel® Xeon® 7500 series chipset
and four Intel® Xeon® X7560 processors. The X7560 processors are based on the Nehalem microarchitecture. The system has 4 sockets, 8 cores per socket, and 2 threads per core using Intel® Hyper-Threading Technology, for a total of 64 threads of execution.
Our test configuration uses 64 GB of memory though the platform has a maximum
capacity of 1 TB. The processor operating frequency is 2.26 GHz and each processor has
24 MB of L3 cache shared across the 8 cores.
The system uses 5 Intel® X25-E solid state drives (SSDs) for temporary I/O storage, configured in a RAID-0 array using the onboard RAID controller. This storage is used as scratch storage for the sort tests. The bandwidth capability of the 5 SSDs was not sufficient to maximize the CPU utilization of the system given the high performance capabilities of DataStage; this is explained in more detail later. We recommend sizing the I/O subsystem to maximize CPU utilization, although we were not able to do this given the equipment available at the time of data collection.
The operating system is Red Hat* Enterprise Linux* 5.3, 64 bit version.
The test environment is a standard Information Server two tier configuration. The client
tier is used to run just the DataStage client applications. All the remaining Information
Server tiers are installed on a single Intel® Xeon® X7560 server.
Test Client
    Runs the DataStage client applications
    Platform: Windows Server 2003
    Processor Type: x86-based PC
    Processor Speed: 2.4 GHz
    Memory Size: 8 GB RAM

Information Server (IS) Tiers (Services + Repository + Engine), on one Intel® Xeon® X7560 server
    Platform: Red Hat EL 5.3, 64 bit
    Processor: Intel® Xeon® X7560, 4 sockets, 32 cores, 64 threads
    Processor Speed: 2.26 GHz
    Memory Size: 64 GB RAM
    Metadata Repository: DB2/LUW 9.7 GA
    Scratch Space: 5 Intel® X25-E SSDs configured as a RAID-0 array using the onboard controller
    IS Topology: Standalone
Figure 2 - System Test Configuration
The following table lists the specifics of the platform tested:

OEM                     Intel®
CPU Model ID            7560
Platform Name           Boxboro
Sockets                 4
Cores per Socket        8
Threads per Core        2
CPU Code Name           Nehalem-EX
CPU Frequency (GHz)     2.26
QPI GT/s                6.4
Hyper-Threading         Enabled
Prefetch Settings       Default
LLC Size (MB)           24
BIOS Version            R21
Memory Installed (GB)   64
DIMM Type               DDR3-1066
DIMM Size (GB)          4
Number of DIMMs         16
NUMA                    Enabled
OS                      RHEL 5.3, 64 bit

Table 1 – Intel® Platform Tested
Summary for Sort Performance Optimizations
This section provides a brief summary of the recommendations from this performance study. The "Recommendations for Optimizing Sort Performance" section provides more detail for those seeking a deeper technical dive.
Reducing I/O contention is critical to optimizing Sort stage performance. Spreading sort I/O across different physical disks is a simple first step. A sample DataStage configuration file implementing this approach is shown below.
{
    node "node1"
    {
        fastname "DataStage1.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets1" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "DataStage2.ibm.com"
        pools ""
        resource disk "/opt/IBM/InformationServer/Server/Datasets2" {pools ""}
        resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch2" {pools ""}
    }
}
In this configuration file, each DataStage processing node has its own scratch space defined in a directory that resides on a separate physical device. This helps prevent contention for I/O subsystem resources among DataStage processing nodes. This is a fairly well-known technique and was not studied for this paper.
This paper describes additional techniques to achieve optimal performance for DataStage jobs containing Sort operations:

1. Setting the Restrict Memory Usage (RMU) parameter for sort operations to an appropriate value for large data sets will reduce I/O demand on the scratch file system. The recommended RMU size varies with the data set size and node count. The formula is given in "Optimal RMU Tuning", along with a reference table that summarizes the suggested RMU sizes for a variety of data set sizes and node counts. The RMU parameter provides users with the flexibility of defining the sort buffer size to optimize memory usage of their system.

2. Increasing the default Linux read-ahead value for the disk storage system(s) used for scratch space can increase the performance of the final merge phase of sort operations. The recommended setting for the read-ahead value is 512 or 1024 sectors (256 kB or 512 kB) for the scratch file system. See "Final Merge Sort Phase Tuning using Linux Read Ahead" for information on how to change the read-ahead value in Linux.

3. Sort operations can benefit from having a buffer operator inserted prior to the sort in the data flow graph. Because sort operations work on large amounts of data, a buffer operator provides extra storage to get the data to the sort operator as fast as possible. See "Using a Buffer Operator to minimize latency for Sort Input" for details.

4. Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand during sorting when bounded length VARCHAR data is involved, potentially resulting in improved overall throughput for the job.
Recommendations for Optimizing Sort Performance
We investigated the Sort operation in detail and considered the effect of a number of
performance tuning factors on the I/O characteristics. The input data size and the format
of the data are critical input factors affecting sort. The layout of the I/O subsystem and
the file cache and prefetching characteristics are also important. The RMU buffer size
configuration parameter has a significant effect on the behavior of the sort as the input
data set size is adjusted. These factors are considered in greater detail below.
In our tests, a job consisting of one sort stage and running on a one node configuration
was capable of sorting and writing data at the rate of 120 MB/s to the scratch file system.
Increasing the node count of the job quickly resulted in more I/O requests to the scratch
I/O storage array than it was able to service in a timely manner. Due to the limitation of
the scratch I/O system, the Intel® Xeon® server CPUs were greatly underutilized. The
scratch I/O file system was simply under-configured for a server with such a high
computational capability. This illustrates the high compute power available on the
Intel® Xeon® processors and the ability of IBM InfoSphere DataStage to efficiently
harness this compute power. Configuring sufficient I/O to harness the computational capability of this powerful combination of hardware and software is of paramount importance for efficient utilization of the system.
For our test system, we chose a configuration for the scratch storage I/O system that was
significantly undersized in comparison to the compute capability of the server. While we
recommend always configuring for optimal performance which would include a more
capable scratch storage system, customer feedback has indicated that many deployed
systems have insufficient bandwidth capability to the scratch storage system. The tuning and configuration tips found in this paper are designed to increase performance on all systems, but will be especially beneficial for systems constrained by the scratch I/O storage system. In all cases, the amount of data transferred can be reduced by these tuning and configuration tips.
By adjusting the DataStage parallel node count, we were able to match the scratch
storage capabilities and prevent the scratch storage system from being saturated. This
allowed us to study and develop this tuning guidance in a balanced environment. Using
this strategy, we developed several tuning guidelines to reduce the demand for scratch
storage I/O which we used to effectively increase performance. This is likely to be the
situation for many customers as growth in CPU processing performance continues to
outpace the I/O capability of storage subsystems.
Several valuable tuning guidelines were discovered and we present the findings here.
While these findings are significant, and we highly recommend them, we also want to
make clear that there is no substitute for having a high performance scratch storage
system capable of supplying sufficient bandwidth and I/Os per second (IOPS) to
maintain high CPU utilization. The tuning guidance given here will help even a high
performance scratch I/O system deliver better performance to DataStage jobs using sort
operations.
The remainder of this section describes the tuning results we found to improve sort
performance through I/O reduction.
Optimal RMU Tuning
This section describes how to tune the sort buffer size parameter called RMU to minimize
I/O demand on the scratch I/O system. An RMU value that is too small will result in
intermediate merges of temporary files during the initial sort phase. These intermediate
merges can significantly increase the I/O demand on the scratch file system. Tuning the
RMU value appropriately can eliminate all intermediate merge operations and greatly
increase throughput of sort operations for systems with limited I/O bandwidth to the
scratch I/O file system.
On many systems, the scratch disk I/O system is a performance bottleneck because the disks or the interconnect cannot supply the bandwidth needed to maximize CPU utilization. Eliminating pre-merging reduces the overall I/O demand on the scratch file system, allowing the scratch file system to complete I/O faster, increasing throughput and decreasing job run time.
Configuration / Job Tuning Recommendations
Given knowledge of the size of the data to be sorted, it is possible to calculate the optimal RMU value that prevents the pre-merge thread from running, thus reducing I/O demand. The RMU formula is:

RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2
Notes about using the above formula:

1. The total data size is divided by the node count because the data sorted per node decreases with increasing node count. A node in this context refers to one of the parallel instances of the job when it is instantiated.

2. Our tests indicate that the RMU value can span a fairly large range and still provide good performance. Sometimes the amount of data to be sorted is not known precisely. We recommend estimating the input data size to within a factor of two of the actual value. In other words, overestimating the data set size by a factor of 2x will still result in an RMU value from the above equation that provides good performance.

3. The default RMU value is 20 MB. This value can sort up to 1.6 GB of data per node while avoiding costly pre-merge operations. If your data set size divided by node count is less than 1.6 GB, no change to the RMU is necessary.
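For convenience, the formula can be evaluated in a few lines of code. The following Python sketch (a hypothetical helper of our own, not part of DataStage) computes the minimum RMU for a given data size and node count, falling back to the 20 MB default where the formula yields less:

    import math

    DEFAULT_RMU_MB = 20  # DataStage default sort buffer size

    def min_rmu_mb(data_size_mb, node_count):
        # RMU (MB) >= SQRT( DataSizeToSort (MB) / NodeCount ) / 2
        rmu = math.sqrt(data_size_mb / node_count) / 2
        return max(DEFAULT_RMU_MB, math.ceil(rmu))

    # Example: 100 GB of input sorted across 4 nodes -> 80 MB, matching Table 2.
    print(min_rmu_mb(100 * 1024, 4))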
The following table is a handy reference of minimum RMU settings for different sizes of input data and node counts; data set sizes are given in GB. The table assumes the user knows the data set size to be sorted. Knowing the precise size of the data being sorted may not be feasible; overestimating the data set size by up to a factor of two of the actual size will still result in good performance. The default RMU value is 20 MB. The table contains the word "Default" where the formula results in less than 20 MB, indicating that the user should use the default value. It is not necessary to decrease the RMU value below the 20 MB default, though doing so is allowed.
Data Size to    1 Node   4 Nodes  8 Nodes  16 Nodes  24 Nodes  32 Nodes  48 Nodes  64 Nodes
be Sorted       Min RMU  Min RMU  Min RMU  Min RMU   Min RMU   Min RMU   Min RMU   Min RMU
(GB)            (MB)     (MB)     (MB)     (MB)      (MB)      (MB)      (MB)      (MB)
1               Default  Default  Default  Default   Default   Default   Default   Default
1.5             Default  Default  Default  Default   Default   Default   Default   Default
3               28       Default  Default  Default   Default   Default   Default   Default
10              51       25       Default  Default   Default   Default   Default   Default
30              88       44       31       22        Default   Default   Default   Default
100             160      80       57       40        33        28        23        20
300             277      139      98       69        57        49        40        35
1000            506      253      179      126       103       89        73        63
3000            876      438      310      219       179       155       126       110
10000           1600     800      566      400       327       283       231       200

Table 2 – RMU Buffer Size Table
Figure 3 shows our test results for a job consisting of one sort stage running with 4 parallel nodes at two different RMU values. Correct sizing of the RMU value resulted in a 36% throughput increase. In the tests, the I/O bandwidth did not decrease
because the I/O subsystem was delivering the maximum bandwidth it was capable of in
both cases. However, because the total quantity of data transferred was much lower, the
CPU cores were able to operate at higher CPU utilization and complete the sort in a
shorter amount of time. This optimization is very effective for scratch disks that are
unable to deliver enough scratch file I/O bandwidth to feed the high performing Intel®
Xeon® Server and highly efficient IBM InfoSphere DataStage Software.
The results shown here are for a sort only job where we have isolated the effect of the
RMU parameter. This optimization will help more complex jobs, but will only directly
affect the performance of the sort operators within the job.
RMU Size    Read Ahead Setting        Run Time
10 MB       128 kB (Linux default)    4.05 minutes
30 MB       128 kB (Linux default)    2.97 minutes
Figure 3 - Performance Tuning Sort with Sort Operator RMU value
To modify the RMU setting for a Sort stage in a job, open the Sort stage on the DataStage Designer client canvas, click the 'Stage' tab, then 'Properties', click 'Options' in the left window, and select 'Restrict Memory Usage (MB)' from the 'Available properties to add' window to add it.
Figure 4 – Adding RMU Option
Once the Restrict Memory Usage option is added, its value can be set to the recommended value from the formula above.
Figure 5 – Setting RMU Option
Final Merge Sort Phase Tuning using Linux Read Ahead
During testing of the single node sort job, we found that CPU utilization of final merge
can be improved by changing the scratch disk read ahead setting in Linux, resulting in
substantial throughput improvements of the final merge sort phase.
Configuration / Job Tuning Recommendations
The default Linux file system read ahead value is 256 sectors. A sector is 512 bytes so the
total default read ahead is 128 kB. Our testing indicated that increasing the read ahead
value to 1024 sectors (512 kB) increased CPU utilization and reduced the final merge time
by reducing the amount of time that DataStage had to wait for I/Os from the scratch file
system. This resulted in an increase in throughput of the final merge phase of sort of
approximately 30%.
Test results for a job consisting of one sort stage running with 4 parallel nodes at two different values for the Linux read-ahead setting are shown in Figure 6. Increasing the read-ahead setting from the Linux default of 128 kB to 512 kB resulted in a 9% improvement in throughput of the job.
RMU Size    Read Ahead Setting        Run Time
30 MB       128 kB (Linux default)    2.97 minutes
30 MB       512 kB                    2.72 minutes
Figure 6 - Performance Tuning Sort Operator with Linux Read Ahead Setting
The current read-ahead setting for a disk device in Linux can be obtained using the following command (the value is reported in sectors):

>hdparm -a /dev/sdb1

To set the read-ahead value for a specific disk device, use the following command:

>hdparm -a 1024 /dev/sdb1    (sets read ahead to 1024 sectors on disk device /dev/sdb1)

To make the setting persist across reboots, add the command to a boot script such as /etc/rc.local. Recommended settings to try are 512 sectors (256 kB) or 1024 sectors (512 kB).
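For scripted checks across several scratch devices, the Linux block layer also exposes the read-ahead window (in kB rather than sectors) through sysfs. The following small Python sketch reads it; the device names are placeholders for whatever devices back your scratch array:

    from pathlib import Path

    def read_ahead_kb(device):
        # Current read-ahead window in kB for a block device such as 'sdb',
        # as exposed at /sys/block/<device>/queue/read_ahead_kb.
        return int(Path(f"/sys/block/{device}/queue/read_ahead_kb").read_text())

    for dev in ("sdb", "sdc"):  # placeholder scratch devices
        print(dev, read_ahead_kb(dev), "kB")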
Increasing read ahead size results in more data being read from the disk and stored in
the OS disk cache memory. As a result, more read requests by the sort operator get the
requested data directly from the OS disk cache instead of waiting for the full latency of a
data read from the scratch storage system. (Note that the Linux file system cache is
controlled by the kernel and uses memory that is not allocated to processes.)
In our tests, the scratch storage system consists of SSDs configured in a RAID-0 array. I/O
request latencies are low on this system compared to typical rotating media storage
arrays. Increasing OS read ahead will benefit scratch storage arrays consisting of HDDs
even more. Larger read ahead values than those tested may be more beneficial for HDD
arrays. We chose to use SSDs because they provide higher bandwidth, much improved
IOPS (I/Os per second) and much lower latency than an equivalent number of hard disk
drives.
Many RAID controllers found in commercial storage systems also have the capability to perform read-ahead on read requests and store the data in their cache. It is good to enable this feature if it is available on the storage array being used for scratch storage. It is still important to increase read-ahead in the OS: serving requests from the OS disk cache will be faster than having to wait for data from the RAID engine.
The results shown here are on a job with a sort operation only. Tuning of read ahead will
not impact performance of other operations in the job that are not performing scratch
disk I/O.
Using a Buffer Operator to minimize latency for Sort Input
The DataStage parallel engine employs buffering automatically as part of its normal
operations. Because the initial sort phase has such a high demand for input data, it is
especially sensitive to latency spikes in the data source feeding the sort. These latency
spikes can occur due to data being sourced from local or remote disks, or due to
scheduling of operators by the operating system. By adding an additional buffer in front
of the sort, we were able to maintain the CPU utilization on the core running the sort
thread at 100% during the entire initial sort phase, thus increasing the performance of the
initial sort phase by nearly 7%.
Configuration / Job Tuning Recommendations
We recommend using an additional buffer prior to the sort, sized equal to the RMU value. To add an additional buffer in front of the sort, open the Sort stage in a DataStage job on the DataStage Designer client canvas, click the 'Input' tab, then 'Advanced'. Select 'Buffer' from the 'Buffering mode' drop-down menu and modify the 'Maximum memory buffer size (bytes)' field.
Figure 7 - Adding buffer in front of the sort
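Note that the 'Maximum memory buffer size (bytes)' field takes bytes while RMU is specified in MB. A quick conversion, shown here for a hypothetical 30 MB RMU:

    rmu_mb = 30  # hypothetical RMU value, e.g. taken from Table 2
    max_memory_buffer_size_bytes = rmu_mb * 1024 * 1024
    print(max_memory_buffer_size_bytes)  # 31457280, the value to enter in the field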
Minimizing I/O for Sort Data containing Variable length fields
By default, the parallel engine internally handles bounded length VARCHAR fields
(those that specify a maximum length) as essentially fixed length character strings. If the
actual data in the field is less than the maximum length, the string is padded to the
maximum size. This behavior is efficient for CPU processing of records throughout the
course of an entire job flow but it increases the I/O demands for operations such as Sort.
When the environment variable APT_OLD_BOUNDED_LENGTH is set, the data within each VARCHAR field is processed without additional padding, resulting in a decreased amount of data written to disk. This can decrease I/O bandwidth demand and therefore increase performance when running on a scratch disk subsystem with insufficient bandwidth; job throughput increases if the scratch file system is otherwise unable to keep up with the processing capability of DataStage and the Intel® Xeon® server. Note that additional CPU cycles are used to process variable length data when APT_OLD_BOUNDED_LENGTH is set: the setting trades CPU processing power for a reduction in the amount of I/O required from the scratch file system.
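As a rough illustration of why the setting helps, the following sketch compares the bytes written by a padded, fixed-width representation of bounded VARCHAR data against an unpadded one. The sample values and the VARCHAR(64) bound are hypothetical, and the small per-field length prefix an engine would also store is ignored:

    values = ["smith", "van der berg", "li"]  # hypothetical VARCHAR(64) field data
    max_len = 64                              # declared bound of the field

    padded = len(values) * max_len            # fixed-width representation: 192 bytes
    unpadded = sum(len(v) for v in values)    # actual data only: 19 bytes
    print(padded, unpadded)                   # the shorter the real data, the larger the saving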
In our tests, a job consisting of one sort stage running with 16 parallel nodes and using APT_OLD_BOUNDED_LENGTH showed a 25% reduction in the size of temporary sort files and a 26% increase in throughput (a 21% reduction in runtime).
Normalized Comparison            Default    With APT_OLD_BOUNDED_LENGTH
Scratch Storage Space Consumed   1.0        0.75x (75% of the original storage space used)
Runtime                          1.0        0.79x (79% of the original runtime)
Throughput                       1.0        1.26x (26% increase in job processing rate)

Table 3 – Sort Operation performance comparison using APT_OLD_BOUNDED_LENGTH
Please note that the performance benefit of this tuning parameter will vary based on several factors. It applies only to data records that have VARCHAR fields. The actual file size reduction realized on the scratch storage system depends heavily on the maximum lengths specified for the VARCHAR fields, the size of the actual data contained in those fields, and whether the VARCHAR fields are a sort key for the records. The amount of performance benefit depends on how much the total file size is reduced, along with the data request rate of the sort operations compared to the capability of the scratch file system to supply the data. In our test configuration, the 16 node test drove the scratch I/O system to its maximum bandwidth limit. Setting APT_OLD_BOUNDED_LENGTH substantially decreased the amount of data written to and subsequently read from disk over the length of the job, allowing faster completion.
Configuration / Job Tuning Recommendations
This optimization will only affect data sets that use bounded length VARCHAR data
types. APT_OLD_BOUNDED_LENGTH is a user-defined variable for DataStage. The variable can be added either at the project level or the job level. You can follow the
instructions in the IBM InfoSphere DataStage and QualityStage Administrator Client
Guide and the IBM InfoSphere DataStage and QualityStage Designer Client Guide to add
and set a new variable.
We recommend trying this setting if low CPU utilization is observed during sorting or if
it is known that the scratch file system is unable to keep up with job demands.
Future Study: Using Memory for RAM Based Scratch Disk
As a future study, we intend to investigate performance when using a RAM-based disk for scratch storage. The memory bandwidth available in the Nehalem-EX test system is greater than 70 GB/s when correctly configured. While SSDs offer some bandwidth improvements over hard disk drives, they cannot begin to match the bandwidth of main memory. The system's PCI Express lanes can reach approximately 35 GB/s of I/O in each direction if all PCIe lanes are utilized; however, such an I/O solution would be expensive.
The currently available 4-socket Intel® X7560 systems can address 1 TB of memory and 8-socket systems can address 2 TB. DRAM capacity will continue to rise with new product releases, and IBM X series systems also offer options to increase DRAM capacity beyond the baseline. While DRAM is expensive compared to disk drives on a per-capacity basis, it is more favorable when comparing bandwidth capability in and out of the system. We plan to evaluate the performance and cost-benefit of large in-memory storage compared to disk-based storage solutions and provide the results in the near future.
Best Practices
This paper has described techniques to achieve optimal performance for DataStage jobs containing Sort operations:

• Setting the Restrict Memory Usage (RMU) parameter for sort operations to an appropriate value for large data sets will reduce I/O demand on the scratch file system. The recommended RMU size varies with the data set size and node count. The formula is given in "Optimal RMU Tuning", along with a reference table that summarizes the suggested RMU sizes for a variety of data set sizes and node counts. The RMU parameter provides users with the flexibility of defining the sort buffer size to optimize memory usage of their system.

• Increasing the default Linux read-ahead value for the disk storage system(s) used for scratch space can increase the performance of the final merge phase of sort operations. The recommended setting for the read-ahead value is 512 or 1024 sectors (256 kB or 512 kB) for the scratch file system. See "Final Merge Sort Phase Tuning using Linux Read Ahead" for information on how to change the read-ahead value in Linux.

• Sort operations can benefit from having a buffer operator inserted prior to the sort in the data flow graph. Because sort operations work on large amounts of data, a buffer operator provides extra storage to get the data to the sort operator as fast as possible. See "Using a Buffer Operator to minimize latency for Sort Input" for details.

• Enabling the APT_OLD_BOUNDED_LENGTH setting can decrease I/O demand during sorting when bounded length VARCHAR data is involved, potentially resulting in improved overall throughput for the job.
Conclusion
We have shown how to optimize IBM InfoSphere DataStage sort performance on Intel®
Xeon® processors using a variety of tuning options such as Sort buffer RMU size, Linux
read ahead settings, additional Buffer operator, and configuring the Varchar length
parameter.
Our results reinforce the necessity of correctly sizing I/O to optimize server performance. For sort, it is imperative that scratch I/O storage performance be sufficient to keep all sort operators running concurrently in the system supplied with data, so that the server is fully utilized.
Powerful mission-critical servers like the Intel® Xeon® platforms based on the X7500 series processor running the IBM InfoSphere DataStage parallel engine can efficiently process data at extremely high rates. As a result, I/O and network bandwidth are extremely important for high performance. Network interconnects like 10 Gbit/s Ethernet or 40 Gbit/s Fibre Channel are necessary to fully realize the computational potential of this powerful combination of hardware and software. In the near future, we plan to analyze the cost and benefit trade-off of using large DRAM capacity as a replacement for disk subsystems for scratch I/O. We will also be looking at tuning high bandwidth networking solutions to optimize performance.
Further reading
Other documentation you might find useful:
• IBM InfoSphere Information Server, Version 8.5 Information Center: http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp
Contributors
Garrett Drysdale is a Sr. Software Performance Engineer for
Intel. Garrett has analyzed and optimized software on Intel®
platforms since 1995 spanning client, workstation, and
enterprise server market segments. Garrett currently works
with enterprise software developers to analyze and optimize
server applications, and with internal design teams to assist
in evaluating the impact of new technologies on software
performance for future Intel® platforms. Garrett has a BSEE from the University of Missouri-Rolla and an MSEE from the Georgia Institute of Technology. His email is [email protected].
Jantz Tran is a Software Performance Engineer for Intel. He
has been analyzing and optimizing enterprise software on Intel
server platforms for 10 years. Jantz has a BSCE from Texas A&M
University. His email is [email protected].
Dr. Sriram Padmanabhan is an IBM Distinguished Engineer, and
Chief Architect for IBM InfoSphere Servers. Most recently, he
led the Information Management Advanced Technologies team
investigating new technical areas such as the impact of Web
2.0 information access and delivery. He was a Research Staff
Member and then a manager of the Database Technology group at
IBM T.J. Watson Research Center for several years. He was a
key technologist for DB2’s shared-nothing parallel database
feature and one of the originators of DB2’s multi-dimensional
clustering feature. He was also a chief architect for Data
Warehouse Edition which provides integrated warehousing and
business intelligence capabilities enhancing DB2. Dr.
Padmanabhan has authored more than 25 publications including a
book chapter on DB2 in a popular database text book, several
journal articles, and many papers in leading database
conferences. His email is [email protected].
Brian Caufield is a Software Architect for Infosphere
Information Server responsible for the definition and design
of new IBM InfoSphere DataStage features, and also works with
the Information Server Performance Team. Brian represents IBM
at the TPC, working to define an industry standard benchmark
for data integration. Previously, Brian worked for 10 years
as a developer on IBM InfoSphere DataStage specializing in the
parallel engine. His email is [email protected].
Fan Ding is currently a member of the Information Server
Performance Team. Prior to joining the team, he worked in
Information Integration Federation Server development. Fan has a Ph.D. in Mechanical Engineering and a Master's in Computer Science from the University of Wisconsin. His email is [email protected].
Ron Liu is currently a member of the IBM InfoSphere
Information Server Performance Team with focus on performance
tuning and information integration benchmark development.
Prior to his current job, Ron had 7 years in Database Server
development (federation runtime, wrapper, query gateway,
process model, and database security). Ron has a Master of
Science in Computer Science and Bachelor of Science in
Physics. His email is [email protected].
Pin Lp Lv is a Software Performance Engineer from IBM. Pin has
worked for IBM since 2006. He worked as a software tester for IBM
WebSphere Product Center Team and RFID Team from September
2006 to March 2009, and joined IBM InfoSphere Information Server
Performance Team in April 2009. Pin has a Master of Science degree
in Computer Science from University of West Scotland. His email is
[email protected]
Mi Wan Shum is the manager of the IBM InfoSphere Information
Server performance team at the IBM Silicon Valley Lab. She
graduated from University of Texas at Austin and she has years of
software development experience in IBM. Her email is
[email protected]
Jackson (Dong Jie) Wei is a Staff Software Performance Engineer for
IBM. He once worked as a DBA in CSRC before joining IBM in 2006.
Since then, he has been working on the Information Server product.
In 2009, he began to focus his work on the ETL performance. Jackson
is also the technical lead for the IBM China Lab Information Server
performance group. He received his bachelor's and master's degrees in Electronic Engineering from Peking University in 2000 and 2003, respectively. His email is [email protected].
Samuel Wong is a member of the IBM InfoSphere Information Server performance team at the IBM Silicon Valley Lab.
He graduated from University of Toronto and he has 12 years of
software development experience with IBM. His email is
[email protected]
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and services
currently available in your area. Any reference to an IBM product, program, or service is not
intended to state or imply that only that IBM product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe any IBM
intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in
this document. The furnishing of this document does not grant you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where
such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES
CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
Without limiting the above disclaimers, IBM provides no representations or warranties
regarding the accuracy, reliability or serviceability of any information or recommendations
provided in this publication, or with respect to any results that may be obtained by the use of
the information or observance of any recommendations provided herein. The information
contained in this document has not been submitted to any formal IBM test and is distributed
AS IS. The use of this information or the implementation of any recommendations or
techniques herein is a customer responsibility and depends on the customer’s ability to
evaluate and integrate them into the customer’s operational environment. While each item
may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee
that the same or similar results will be obtained elsewhere. Anyone attempting to adapt
these techniques to their own environment do so at their own risk.
This document and the information contained herein may be used solely in connection with
the IBM products discussed in this document.
This information could include technical inaccuracies or typographical errors. Changes are
periodically made to the information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or changes in the product(s)
and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only
and do not in any manner serve as an endorsement of those Web sites. The materials at
those Web sites are not part of the materials for this IBM product and use of those Web sites is
at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some
measurements may have been made on development-level systems and there is no
guarantee that these measurements will be the same on generally available systems.
Furthermore, some measurements may have been estimated through extrapolation. Actual
results may vary. Users of this document should verify the applicable data for their specific
environment.
Information concerning non-IBM products was obtained from the suppliers of those products,
their published announcements or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility or any other
claims related to non-IBM products. Questions on the capabilities of non-IBM products should
be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to IBM, for the purposes of
developing, using, marketing or distributing application programs conforming to the
application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions.
IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and
other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.