...

Best Practices Deep-dive performance analysis with InfoSphere Optim Performance

by user

on
Category: Documents
16

views

Report

Comments

Transcript

Best Practices Deep-dive performance analysis with InfoSphere Optim Performance
DB2 for Linux, UNIX, and Windows
Best Practices
Deep-dive performance
analysis with InfoSphere
Optim Performance
Manager V5.3
Steve Rees
[email protected]
© 2013 IBM Corporation
Disclaimer/Trademarks
THE INFORMATION CONTAINED IN THIS DOCUMENT HAS NOT BEEN SUBMITTED TO ANY FORMAL IBM TEST AND
IS DISTRIBUTED AS IS. THE USE OF THIS INFORMATION OR THE IMPLEMENTATION OF ANY OF THESE
TECHNIQUES IS A CUSTOMER RESPONSIBILITY AND DEPENDS ON THE CUSTOMER’S ABILITY TO EVALUATE AND
INTEGRATE THEM INTO THE CUSTOMER’S OPERATIONAL ENVIRONMENT. WHILE IBM MAY HAVE REVIEWED
EACH ITEM FOR ACCURACY IN A SPECIFIC SITUATION, THERE IS NO GUARANTEE THAT THE SAME OR SIMILAR
RESULTS WILL BE OBTAINED ELSEWHERE. ANYONE ATTEMPTING TO ADAPT THESE TECHNIQUES TO THEIR
OWN ENVIRONMENTS DO SO AT THEIR OWN RISK.
ANY PERFORMANCE DATA CONTAINED IN THIS DOCUMENT WERE DETERMINED IN VARIOUS CONTROLLED
LABORATORY ENVIRONMENTS AND ARE FOR REFERENCE PURPOSES ONLY. CUSTOMERS SHOULD NOT ADAPT
THESE PERFORMANCE NUMBERS TO THEIR OWN ENVIRONMENTS AS SYSTEM PERFORMANCE STANDARDS.
THE RESULTS THAT MAY BE OBTAINED IN OTHER OPERATING ENVIRONMENTS MAY VARY SIGNIFICANTLY.
USERS OF THIS DOCUMENT SHOULD VERIFY THE APPLICABLE DATA FOR THEIR SPECIFIC ENVIRONMENT.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered
in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current
list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
2
© 2013 IBM Corporation
Agenda
 The latest updates in OPM v5.3
 Recognizing and addressing system-level
bottlenecks with OPM
– Examples: Disk, CPU and Locking
 Drilling down to statement-level problems
using OPM and Optim Query Workload Tuner
3
© 2013 IBM Corporation
Improved in OPM 5.3 – the Overview Dashboard
 Brings together the most critical performance metrics into one place
for at-a-glance monitoring
 Shows the average across the current monitoring interval
– Long intervals
show broad
trends
– Short intervals
show more
immediate,
fine-grained
values
– Baseline feature
enhanced in 5.3
– The previous
"KPI" version of
the Overview is
still available
4
© 2013 IBM Corporation
The Overview Dashboard
5
© 2013 IBM Corporation
The Overview Dashboard
6
© 2013 IBM Corporation
The Overview Dashboard
7
© 2013 IBM Corporation
The Overview Dashboard
8
© 2013 IBM Corporation
Setting a Baseline for Overview
 When performance / behavior represents a "good / typical"
value, you can lock in a baseline
– Future monitor data will be highlighted if it deviates sufficiently from the
baseline
Before baseline is set, no
metrics are flagged
(all bars are blue)
After baseline is set, if things change,
bars change color and
flyover shows current &
baseline values
Grey markers show
value from the baseline
interval, and +/-
9
© 2013 IBM Corporation
Tips for Using Overview Baselines
 Choose well – but you can always change it!
– Getting the right baseline can be an iterative process, depending on
how subsequent monitor data compares
– What initially seemed to be a typical / 'good' period might turn out to be
slightly skewed
– You can even choose a baseline time which is later than the
measurements you want to compare
 Choose a long enough time interval that you get a good
average
– A bit of variation during your baseline is a good thing, since it gives a
non-zero standard deviation, which avoids 'false positives'.
– If the baseline measurement is too stable (typically, this means too
short to capture normal variation), then the comparisons of run data
against the baseline may be too sensitive, and report deviations that
aren't really there.
 Tip: open one browser window for showing the baseline, and
use a another one for viewing other intervals
– You can flip between baseline & current intervals easily, making the
differences really stand out
10
© 2013 IBM Corporation
Performance Overview Report
 Provides good top-level
view of overall system
performance
– Choose metrics and sort orders
that are most relevant for the
database being analyzed
 Report is exportable to
pdf/xls/ppt formats
– Useful for upline reporting
– Drill-down available via other
reports:
11
© 2013 IBM Corporation
OPM 5.3 Highlights
 Support for BLU Acceleration in DB2 LUW 10.5
– New metrics for column-organized tables e.g. physical reads,
rows read, etc.
 Integration of Optim Query Tuner web client
– Available alongside existing link to 'fat client' OQWT
 Additional alerts for HADR and logging
 Improved Overview dashboard for more at-a-glance
information
12
© 2013 IBM Corporation
Sidebar: Optim Query Capture Replay
 Capture SQL activity on your system, and then replay
it exactly & compare the results with the original
–
–
–
–
–
–
On the same database & system, or a different one
To verify software upgrades
To verify hardware migrations
To verify growth capacity by replaying at higher speeds
To verify the impact of performance tuning
To try out new technologies or different configurations
• Compression, MDC, MQT, …
Record
…and BLU Acceleration in DB2 10.5 !
Optim Query
Capture and
Replay
Play
13
© 2013 IBM Corporation
Agenda
 The latest developments in OPM v5.3
 Recognizing and addressing system-level
bottlenecks with OPM
– Examples: Disk, CPU, and Locks
 Drilling down to statement-level problems
using OPM and Optim Query Tuner
14
© 2013 IBM Corporation
System-level Bottlenecks - What Kind are You Seeing?
 What part of the system (a resource, etc.) or other
factor is limiting the system’s performance?
 Understanding what type of bottleneck you’re dealing
with can rule out a lot of possible problems!
– Tip – OPM alerts & differences from the Overview Baseline can
help point us in the right direction early on
Disk?
CPU?
Locks?
15
© 2013 IBM Corporation
The Plan: Start from Overview and Work Out from There
ad
o
l
rk oard
o
W hb
s
Da
dI
n
la d
o
o oar
p
r
ffe shb
u
B Da
/O
D
w
e
rvi
e
Ov
L
SQ
16
ard
o
hb
s
Da
ing rd
k
c
Lo hboa
s
Da
ard
o
hb
s
a
ry r
e
Qu une
tim d T
p
O loa
rk
o
W
© 2013 IBM Corporation
Disk?
Recognizing a System-level Disk Bottleneck
Basic Symptoms:




I/O wait seen in OPM Overview, Tivoli or operating system tools (e.g. vmstat)
I/O or log alerts in OPM Health Summary
Long I/O times (1-5 ms ideal; > 10-15ms is a problem) in OPM I/O dashboard
Often low-to-mid CPU usage seen in OPM Overview or System dashboard
Now:
I/O
Overview
dashboard
@baseline
SQL Response
17
Current
overview
dashboard
showing IO
problems
© 2013 IBM Corporation
Disk?
Drilling down on a sample Tablespace Disk Bottleneck
 Overview dashboard data suggests an IO problem
– % of DB2 time spent waiting on IO increased from 31% to 75%
– Average statement response time increased from 0.2 ms to 0.6 ms
*click!*
18
© 2013 IBM Corporation
Disk?
Drilling down on a sample Tablespace Disk Bottleneck
 Linking to the Buffer Pool and I/O dashboard gives us more
information
– Time-line graphing of IO response times
Check the
value @
baseline…
@Baseline time:
19
© 2013 IBM Corporation
Disk?
Drilling down on a sample Tablespace Disk Bottleneck
 OPM gives us easy drill-down from buffer pool to tablespace to
tables
– Is the heavy I/O due to one particular tablespace?
Compare current
vs. baseline –
no evidence of
one tablespace
swamping
everything
@Baseline time:
20
© 2013 IBM Corporation
Disk?
Drilling down on a sample Tablespace Disk Bottleneck
 What do we know about this problem so far?
– Overview dashboard
• 31% to 75% increase in %time spent on IO
• 3x increase in average SQL statement response time
– Buffer Pool and I/O dashboard
• 8.5k to 42k increase in BP reads per minute
• 98% to 86% reduction in BP data hit ratio
• No apparent tablespace hotspots
 Some possible causes to consider
– One or more bad SQL statement plans?
• More likely if there was one tablespace that was much hotter than the others
– Poor disk configuration?
• Possible, but bufferpool configuration should be checked first
21
© 2013 IBM Corporation
Disk?
Drilling down on a sample Tablespace Disk Bottleneck
 In this case, all the symptoms are explained by a decrease in
bufferpool size
– Next steps would likely include seeing if there was a manual decrease in
BP size, or self-tuning memory drove down the BP size in response to
memory pressures on the system
@Problem time:
@Baseline time:
22
© 2013 IBM Corporation
CPU?
Recognizing a System-level CPU Bottleneck
Basic Symptoms:
• Total CPU utilization near 100% seen in OPM Overview or System
dashboard / Tivoli monitoring / vmstat / etc.
• One process steadily consuming (100/N) % of total CPU time in an Nway SMP, seen in ps / task manager
• CPU utilization significantly higher than baseline in OPM overview
DB2 time: OTHER
CPU Idle
Tx / min
Pkg Cache hit ratio
23
© 2013 IBM Corporation
CPU?
Drilling down on a sample CPU Bottleneck
 A big spike in package cache demand
suggests lots of new incoming SQL
– SQL dashboard shows us top statements by CPU usage
across multiple executions (new in OPM 5)
When package cache h/r
was bad, hot statements
were executed only once
and had literals, vs. the
baseline, which had many
executions and a
parameter style we
should recognize…
@Baseline time
@Baseline time
24
© 2013 IBM Corporation
CPU?
Drilling down on a sample CPU Bottleneck
 In this case, the main clue of what's gone wrong is in the SQL
(which is often a good place to look!)
– Baseline:
Select … where S_W_ID = :L0
– Problem time:
Select … where S_W_ID = 132
 The ":Lxx" parameter format indicates
– That statement concentrator (STMT_CONC) was enabled during the
baseline, to map literal values to host variables
– This works around the use of literal values in simple SQL statements,
and saves lots of CPU and package cache memory
 High CPU, low package cache hit ratio and the return to literals
suggest that STMT_CONC was turned off
25
© 2013 IBM Corporation
Locks?
Recognizing a System-level Locking Bottleneck
Basic Symptoms – a ‘Lazy System’
• High average lock wait time, or large number of applications waiting
on locks, seen in Overview Dashboard
• Low CPU consumption and/or low throughput
Things were fine,
our system was
running well, but
then one day at
about 8:15pm,
throughput briefly
dropped to zero!
This matches well
with a lock wait
spike on the
Overview.
Things didn't go
completely to
zero – something
was still reading a
bunch of rows,
way more than
usual …
26
© 2013 IBM Corporation
Locks?
Drilling down on a Sample Locking Bottleneck
– from the Overview dashboard
 We can also see a big jump in lock wait time in the at-a-glance
Overview dashboard
DB2 Lock Wait time, %
click!
Avg Lock Wait time, ms
27
© 2013 IBM Corporation
Locks?
Drilling down on a Sample Locking Bottleneck
– from the Locking dashboard
The Locking Dashboard gives us the details –
45 Lock Wait Alerts during the spike we're
diagnosing
cli
ck
!
Compare this to the baseline, which had no timeouts
and virtually no blocking time
@Baseline time
double click!
28
© 2013 IBM Corporation
Locks?
Drilling down on a Sample Locking Bottleneck
– from the Locking Dashboard
do
u
cli ble
ck
!
29
We have many lock waits to choose from this one is typical. One piece of key
information here is that the blocking query is
running in RR isolation!
© 2013 IBM Corporation
Queries
Drilling Down – Diagnosing Slow Queries with
OPM
 To this point, we’ve been looking primarily at system-level
bottlenecks
– Disk, CPU, locking
– Root causes here tended to have less to do with individual statements
than with broader forces
• Applications
• Database & system configuration
• External factors
 Individual slow SQL statements can be anything from a minor
irritant to a serious problem, potentially causing system-wide
performance problems
– As with system-level bottlenecks, Optim tools make the job of
diagnosing a slow SQL statement much quicker
30
© 2013 IBM Corporation
Queries
Step 1: Getting Our Bearings at the System Level
 Low throughput has been detected, and 'hovering' around the
Overview dashboard shows a variety of bad news …
Average statement response time
Average CPU run queue length
31
© 2013 IBM Corporation
Queries
Step 1: Getting Our Bearings at the System Level
 Low throughput has been detected, and 'hovering' around the
Overview dashboard shows a variety of bad news …
Time in DB2 executing SQL
Throughput in requests
32
© 2013 IBM Corporation
Queries
Step 1: Getting Our Bearings at the System Level
 Low throughput has been detected, and 'hovering' around the
Overview dashboard shows a variety of bad news …
click
!
@Baseline time
33
© 2013 IBM Corporation
Queries
Step 2: Slow Workload to Slow Statements
 We jump to the SQL statement dashboard from Overview to get
more information on the hottest statement, and use the time
slider to compare with values from the baseline period
@Baseline time
34
@Baseline time
© 2013 IBM Corporation
Queries
Step 2: Slow Workload to Slow Statements
 Along with runtime data, we also see the timeron cost estimate of
the heaviest query
@Baseline time
35
© 2013 IBM Corporation
Queries
Step 2: Slow Workload to Slow Statements
 From system-level information, it seems like one statement has
become very heavy, likely due to a plan change
 We can activate tuning of one or more statements directly from
the SQL dashboard
36
© 2013 IBM Corporation
Queries
Step 3: Analyzing the Slow Statement
 The 'Tune' button switches focus to Optim Query Workload Tuner
and populates the statement into it
k!
c
i
l
c
37
© 2013 IBM Corporation
Queries
Step 3: Analyzing the Slow Statement
click!
38
Sure enough,
the plan shows
the high
estimated
cost we saw
earlier, starting
at the bottom
with an
expensive
index scan
© 2013 IBM Corporation
Queries
Step 3: Analyzing the Slow Statement
click!
Item #1:
Stats Advisor wants
us to make sure stats
stay up to date
39
Item #2:
Stats Advisor says
to be sure runtime
parameter values
don’t change the
plan from what we
get here
Item #3:
Access Path Advisor
indicates there’s a
full index scan, which
might not be optimal
in this case. It also
aligns with the high
cost index scan we
saw in the plan
Item #4:
Index Advisor says it
has some indexes to
recommend
© 2013 IBM Corporation
Queries
Step 3: Analyzing the Slow Statement
!
click
OQWT is strongly
recommending that we
update statistics (and
it tells us exactly where
and how further down)
40
© 2013 IBM Corporation
Queries
Step 4: Fixing the Problem!
Ru
nR
UN
the STAT
n…
S,
Following OQWT's advice,
we run RUNSTATS on the
recommended table +
indexes, and then get
OQWT to re-analyze the
impact on the SQL,
and compare the results
clic
k!
click!
41
© 2013 IBM Corporation
Queries
Step 4: Fixing the Problem!
Updating the stats resulted
in the optimizer choosing
a different join type,
as well as using start / stop
keys on our index scan.
Net: 50x drop in estimated
cost, and a return to good
performance in the
application
42
© 2013 IBM Corporation
Summary
 Performance work is fundamentally quite tricky!
– There are many potential issues on many different levels
– Pulling together and understanding all the relevant data can
seem like a job only for experts – but OPM makes it much
easier
 Before a problem occurs –
– Take OPM "out for a spin" – get familiar with what the various
dashboards and reports can do
– Choose a suitable baseline time for the Overview dashboard,
and adjust the alert thresholds if necessary
43
© 2013 IBM Corporation
Summary
 When a problem occurs …
– Open a pair of browser sessions – one at the baseline time,
and the other showing the problem interval you're
investigating
– The Overview dashboards is usually the best place to start,
followed by Workload and System to see trends over time
– Drill down from there into Bufferpool, Locking & Logging for
system-wide issues
– The SQL Statement dashboard and Optim Query Workload
Tuner are key for identifying SQL problems
44
© 2013 IBM Corporation
Discover more DB2 best practices at:
http://www.ibm.com/developerworks/data/bestpractices/db2luw/
45
© 2013 IBM Corporation
Fly UP