Best Practices Deep-dive performance analysis with InfoSphere Optim Performance
by user
Comments
Transcript
Best Practices Deep-dive performance analysis with InfoSphere Optim Performance
DB2 for Linux, UNIX, and Windows Best Practices Deep-dive performance analysis with InfoSphere Optim Performance Manager V5.3 Steve Rees [email protected] © 2013 IBM Corporation Disclaimer/Trademarks THE INFORMATION CONTAINED IN THIS DOCUMENT HAS NOT BEEN SUBMITTED TO ANY FORMAL IBM TEST AND IS DISTRIBUTED AS IS. THE USE OF THIS INFORMATION OR THE IMPLEMENTATION OF ANY OF THESE TECHNIQUES IS A CUSTOMER RESPONSIBILITY AND DEPENDS ON THE CUSTOMER’S ABILITY TO EVALUATE AND INTEGRATE THEM INTO THE CUSTOMER’S OPERATIONAL ENVIRONMENT. WHILE IBM MAY HAVE REVIEWED EACH ITEM FOR ACCURACY IN A SPECIFIC SITUATION, THERE IS NO GUARANTEE THAT THE SAME OR SIMILAR RESULTS WILL BE OBTAINED ELSEWHERE. ANYONE ATTEMPTING TO ADAPT THESE TECHNIQUES TO THEIR OWN ENVIRONMENTS DO SO AT THEIR OWN RISK. ANY PERFORMANCE DATA CONTAINED IN THIS DOCUMENT WERE DETERMINED IN VARIOUS CONTROLLED LABORATORY ENVIRONMENTS AND ARE FOR REFERENCE PURPOSES ONLY. CUSTOMERS SHOULD NOT ADAPT THESE PERFORMANCE NUMBERS TO THEIR OWN ENVIRONMENTS AS SYSTEM PERFORMANCE STANDARDS. THE RESULTS THAT MAY BE OBTAINED IN OTHER OPERATING ENVIRONMENTS MAY VARY SIGNIFICANTLY. USERS OF THIS DOCUMENT SHOULD VERIFY THE APPLICABLE DATA FOR THEIR SPECIFIC ENVIRONMENT. Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml 2 © 2013 IBM Corporation Agenda The latest updates in OPM v5.3 Recognizing and addressing system-level bottlenecks with OPM – Examples: Disk, CPU and Locking Drilling down to statement-level problems using OPM and Optim Query Workload Tuner 3 © 2013 IBM Corporation Improved in OPM 5.3 – the Overview Dashboard Brings together the most critical performance metrics into one place for at-a-glance monitoring Shows the average across the current monitoring interval – Long intervals show broad trends – Short intervals show more immediate, fine-grained values – Baseline feature enhanced in 5.3 – The previous "KPI" version of the Overview is still available 4 © 2013 IBM Corporation The Overview Dashboard 5 © 2013 IBM Corporation The Overview Dashboard 6 © 2013 IBM Corporation The Overview Dashboard 7 © 2013 IBM Corporation The Overview Dashboard 8 © 2013 IBM Corporation Setting a Baseline for Overview When performance / behavior represents a "good / typical" value, you can lock in a baseline – Future monitor data will be highlighted if it deviates sufficiently from the baseline Before baseline is set, no metrics are flagged (all bars are blue) After baseline is set, if things change, bars change color and flyover shows current & baseline values Grey markers show value from the baseline interval, and +/- 9 © 2013 IBM Corporation Tips for Using Overview Baselines Choose well – but you can always change it! – Getting the right baseline can be an iterative process, depending on how subsequent monitor data compares – What initially seemed to be a typical / 'good' period might turn out to be slightly skewed – You can even choose a baseline time which is later than the measurements you want to compare Choose a long enough time interval that you get a good average – A bit of variation during your baseline is a good thing, since it gives a non-zero standard deviation, which avoids 'false positives'. – If the baseline measurement is too stable (typically, this means too short to capture normal variation), then the comparisons of run data against the baseline may be too sensitive, and report deviations that aren't really there. Tip: open one browser window for showing the baseline, and use a another one for viewing other intervals – You can flip between baseline & current intervals easily, making the differences really stand out 10 © 2013 IBM Corporation Performance Overview Report Provides good top-level view of overall system performance – Choose metrics and sort orders that are most relevant for the database being analyzed Report is exportable to pdf/xls/ppt formats – Useful for upline reporting – Drill-down available via other reports: 11 © 2013 IBM Corporation OPM 5.3 Highlights Support for BLU Acceleration in DB2 LUW 10.5 – New metrics for column-organized tables e.g. physical reads, rows read, etc. Integration of Optim Query Tuner web client – Available alongside existing link to 'fat client' OQWT Additional alerts for HADR and logging Improved Overview dashboard for more at-a-glance information 12 © 2013 IBM Corporation Sidebar: Optim Query Capture Replay Capture SQL activity on your system, and then replay it exactly & compare the results with the original – – – – – – On the same database & system, or a different one To verify software upgrades To verify hardware migrations To verify growth capacity by replaying at higher speeds To verify the impact of performance tuning To try out new technologies or different configurations • Compression, MDC, MQT, … Record …and BLU Acceleration in DB2 10.5 ! Optim Query Capture and Replay Play 13 © 2013 IBM Corporation Agenda The latest developments in OPM v5.3 Recognizing and addressing system-level bottlenecks with OPM – Examples: Disk, CPU, and Locks Drilling down to statement-level problems using OPM and Optim Query Tuner 14 © 2013 IBM Corporation System-level Bottlenecks - What Kind are You Seeing? What part of the system (a resource, etc.) or other factor is limiting the system’s performance? Understanding what type of bottleneck you’re dealing with can rule out a lot of possible problems! – Tip – OPM alerts & differences from the Overview Baseline can help point us in the right direction early on Disk? CPU? Locks? 15 © 2013 IBM Corporation The Plan: Start from Overview and Work Out from There ad o l rk oard o W hb s Da dI n la d o o oar p r ffe shb u B Da /O D w e rvi e Ov L SQ 16 ard o hb s Da ing rd k c Lo hboa s Da ard o hb s a ry r e Qu une tim d T p O loa rk o W © 2013 IBM Corporation Disk? Recognizing a System-level Disk Bottleneck Basic Symptoms: I/O wait seen in OPM Overview, Tivoli or operating system tools (e.g. vmstat) I/O or log alerts in OPM Health Summary Long I/O times (1-5 ms ideal; > 10-15ms is a problem) in OPM I/O dashboard Often low-to-mid CPU usage seen in OPM Overview or System dashboard Now: I/O Overview dashboard @baseline SQL Response 17 Current overview dashboard showing IO problems © 2013 IBM Corporation Disk? Drilling down on a sample Tablespace Disk Bottleneck Overview dashboard data suggests an IO problem – % of DB2 time spent waiting on IO increased from 31% to 75% – Average statement response time increased from 0.2 ms to 0.6 ms *click!* 18 © 2013 IBM Corporation Disk? Drilling down on a sample Tablespace Disk Bottleneck Linking to the Buffer Pool and I/O dashboard gives us more information – Time-line graphing of IO response times Check the value @ baseline… @Baseline time: 19 © 2013 IBM Corporation Disk? Drilling down on a sample Tablespace Disk Bottleneck OPM gives us easy drill-down from buffer pool to tablespace to tables – Is the heavy I/O due to one particular tablespace? Compare current vs. baseline – no evidence of one tablespace swamping everything @Baseline time: 20 © 2013 IBM Corporation Disk? Drilling down on a sample Tablespace Disk Bottleneck What do we know about this problem so far? – Overview dashboard • 31% to 75% increase in %time spent on IO • 3x increase in average SQL statement response time – Buffer Pool and I/O dashboard • 8.5k to 42k increase in BP reads per minute • 98% to 86% reduction in BP data hit ratio • No apparent tablespace hotspots Some possible causes to consider – One or more bad SQL statement plans? • More likely if there was one tablespace that was much hotter than the others – Poor disk configuration? • Possible, but bufferpool configuration should be checked first 21 © 2013 IBM Corporation Disk? Drilling down on a sample Tablespace Disk Bottleneck In this case, all the symptoms are explained by a decrease in bufferpool size – Next steps would likely include seeing if there was a manual decrease in BP size, or self-tuning memory drove down the BP size in response to memory pressures on the system @Problem time: @Baseline time: 22 © 2013 IBM Corporation CPU? Recognizing a System-level CPU Bottleneck Basic Symptoms: • Total CPU utilization near 100% seen in OPM Overview or System dashboard / Tivoli monitoring / vmstat / etc. • One process steadily consuming (100/N) % of total CPU time in an Nway SMP, seen in ps / task manager • CPU utilization significantly higher than baseline in OPM overview DB2 time: OTHER CPU Idle Tx / min Pkg Cache hit ratio 23 © 2013 IBM Corporation CPU? Drilling down on a sample CPU Bottleneck A big spike in package cache demand suggests lots of new incoming SQL – SQL dashboard shows us top statements by CPU usage across multiple executions (new in OPM 5) When package cache h/r was bad, hot statements were executed only once and had literals, vs. the baseline, which had many executions and a parameter style we should recognize… @Baseline time @Baseline time 24 © 2013 IBM Corporation CPU? Drilling down on a sample CPU Bottleneck In this case, the main clue of what's gone wrong is in the SQL (which is often a good place to look!) – Baseline: Select … where S_W_ID = :L0 – Problem time: Select … where S_W_ID = 132 The ":Lxx" parameter format indicates – That statement concentrator (STMT_CONC) was enabled during the baseline, to map literal values to host variables – This works around the use of literal values in simple SQL statements, and saves lots of CPU and package cache memory High CPU, low package cache hit ratio and the return to literals suggest that STMT_CONC was turned off 25 © 2013 IBM Corporation Locks? Recognizing a System-level Locking Bottleneck Basic Symptoms – a ‘Lazy System’ • High average lock wait time, or large number of applications waiting on locks, seen in Overview Dashboard • Low CPU consumption and/or low throughput Things were fine, our system was running well, but then one day at about 8:15pm, throughput briefly dropped to zero! This matches well with a lock wait spike on the Overview. Things didn't go completely to zero – something was still reading a bunch of rows, way more than usual … 26 © 2013 IBM Corporation Locks? Drilling down on a Sample Locking Bottleneck – from the Overview dashboard We can also see a big jump in lock wait time in the at-a-glance Overview dashboard DB2 Lock Wait time, % click! Avg Lock Wait time, ms 27 © 2013 IBM Corporation Locks? Drilling down on a Sample Locking Bottleneck – from the Locking dashboard The Locking Dashboard gives us the details – 45 Lock Wait Alerts during the spike we're diagnosing cli ck ! Compare this to the baseline, which had no timeouts and virtually no blocking time @Baseline time double click! 28 © 2013 IBM Corporation Locks? Drilling down on a Sample Locking Bottleneck – from the Locking Dashboard do u cli ble ck ! 29 We have many lock waits to choose from this one is typical. One piece of key information here is that the blocking query is running in RR isolation! © 2013 IBM Corporation Queries Drilling Down – Diagnosing Slow Queries with OPM To this point, we’ve been looking primarily at system-level bottlenecks – Disk, CPU, locking – Root causes here tended to have less to do with individual statements than with broader forces • Applications • Database & system configuration • External factors Individual slow SQL statements can be anything from a minor irritant to a serious problem, potentially causing system-wide performance problems – As with system-level bottlenecks, Optim tools make the job of diagnosing a slow SQL statement much quicker 30 © 2013 IBM Corporation Queries Step 1: Getting Our Bearings at the System Level Low throughput has been detected, and 'hovering' around the Overview dashboard shows a variety of bad news … Average statement response time Average CPU run queue length 31 © 2013 IBM Corporation Queries Step 1: Getting Our Bearings at the System Level Low throughput has been detected, and 'hovering' around the Overview dashboard shows a variety of bad news … Time in DB2 executing SQL Throughput in requests 32 © 2013 IBM Corporation Queries Step 1: Getting Our Bearings at the System Level Low throughput has been detected, and 'hovering' around the Overview dashboard shows a variety of bad news … click ! @Baseline time 33 © 2013 IBM Corporation Queries Step 2: Slow Workload to Slow Statements We jump to the SQL statement dashboard from Overview to get more information on the hottest statement, and use the time slider to compare with values from the baseline period @Baseline time 34 @Baseline time © 2013 IBM Corporation Queries Step 2: Slow Workload to Slow Statements Along with runtime data, we also see the timeron cost estimate of the heaviest query @Baseline time 35 © 2013 IBM Corporation Queries Step 2: Slow Workload to Slow Statements From system-level information, it seems like one statement has become very heavy, likely due to a plan change We can activate tuning of one or more statements directly from the SQL dashboard 36 © 2013 IBM Corporation Queries Step 3: Analyzing the Slow Statement The 'Tune' button switches focus to Optim Query Workload Tuner and populates the statement into it k! c i l c 37 © 2013 IBM Corporation Queries Step 3: Analyzing the Slow Statement click! 38 Sure enough, the plan shows the high estimated cost we saw earlier, starting at the bottom with an expensive index scan © 2013 IBM Corporation Queries Step 3: Analyzing the Slow Statement click! Item #1: Stats Advisor wants us to make sure stats stay up to date 39 Item #2: Stats Advisor says to be sure runtime parameter values don’t change the plan from what we get here Item #3: Access Path Advisor indicates there’s a full index scan, which might not be optimal in this case. It also aligns with the high cost index scan we saw in the plan Item #4: Index Advisor says it has some indexes to recommend © 2013 IBM Corporation Queries Step 3: Analyzing the Slow Statement ! click OQWT is strongly recommending that we update statistics (and it tells us exactly where and how further down) 40 © 2013 IBM Corporation Queries Step 4: Fixing the Problem! Ru nR UN the STAT n… S, Following OQWT's advice, we run RUNSTATS on the recommended table + indexes, and then get OQWT to re-analyze the impact on the SQL, and compare the results clic k! click! 41 © 2013 IBM Corporation Queries Step 4: Fixing the Problem! Updating the stats resulted in the optimizer choosing a different join type, as well as using start / stop keys on our index scan. Net: 50x drop in estimated cost, and a return to good performance in the application 42 © 2013 IBM Corporation Summary Performance work is fundamentally quite tricky! – There are many potential issues on many different levels – Pulling together and understanding all the relevant data can seem like a job only for experts – but OPM makes it much easier Before a problem occurs – – Take OPM "out for a spin" – get familiar with what the various dashboards and reports can do – Choose a suitable baseline time for the Overview dashboard, and adjust the alert thresholds if necessary 43 © 2013 IBM Corporation Summary When a problem occurs … – Open a pair of browser sessions – one at the baseline time, and the other showing the problem interval you're investigating – The Overview dashboards is usually the best place to start, followed by Workload and System to see trends over time – Drill down from there into Bufferpool, Locking & Logging for system-wide issues – The SQL Statement dashboard and Optim Query Workload Tuner are key for identifying SQL problems 44 © 2013 IBM Corporation Discover more DB2 best practices at: http://www.ibm.com/developerworks/data/bestpractices/db2luw/ 45 © 2013 IBM Corporation