Introduction to troubleshooting InfoSphere MDM Collaboration Server (MDMCS) performance issues Stefan Reil
by user
Comments
Transcript
Introduction to troubleshooting InfoSphere MDM Collaboration Server (MDMCS) performance issues Stefan Reil
Stefan Reil 24.11.2015 Introduction to troubleshooting InfoSphere MDM Collaboration Server (MDMCS) performance issues © 2009 IBM Corporation Introduction to troubleshooting MDMCS performance issues Agenda Architectural overview Tools for performance troubleshooting and MDMCS SQL logging mechanism Locate the performance bottleneck in component stack Client-side checks Server-side checks Database Server-side checks 2 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Architectural overview A short architectural overview is provided, as it will help to understand approaches to narrow down performance bottlenecks. MDMCS includes a web-based application with a three-tier architecture that consists of: A web-based user interface for rendering PIM content in a browser, including static, dynamic or cached data in the client side. This tier runs on a web browser (Microsoft Internet Explorer) A middle tier with the functional modules and services that process user requests and data on the central MDMCS server A database management system (DBMS) that stores the data that is required by the middle tier. This tier runs on a database server (DB2 / Oracle). (Usually Application and Database Server are separate physical servers, but can be run on one physical server as well) 3 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Architectural overview (cont.) The MDMCS application (middle tier) consists of 6 services (implemented as separate JVM processes), where each one is responsible for specific tasks Appsvr Service for UI event processing Workflow Engine Service processes workflow events that are posted to the database Scheduler Service runs all scheduled jobs in the background (imports, exports, reports) Admin Service starts and stops the services Eventprocessor Service is the event dispatcher between all the product modules Queuemanager Service to send documents outside of MDMCS Each service logs debugging or error messages in its own directory under $TOP/logs in a directory with name “servicename_hostname” e.g. appsvr?_<hostname> 4 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Architectural overview (cont.) 5 relevant for scheduled jobs, workflowprocessing relevant for UI interactions Runtime Accumulation Layered view of main performance impacting components User Experience Browser (IE) OS / Hardware IE version & settings, JavaScript logic HW resources (CPU, memory) Network Network speed and latency WAS MDMCS Services / Custom Code OS / Hardware Configuration settings, complexity of data model, custom code logic, concurrent users, concurrent jobs HW resources (CPU, memory,disk I/O) Network Network speed and latency DB2 / Oracle OS / Hardware Statistics, bufferpools, indexes, fragmentation, old version data HW resources (CPU, memory, disk I/O) © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Architectural overview Exemplified information flow for simple UI interaction use case: “Open item in UI” An http request is send via network from client to “MDMCS server” to display item Appsvr_hostname service retrieves all required base information (e.g. spec definitions, item details, container scripts …) from database (if not cached already) using SQL statements. Number and complexity of triggered SQLs depends on data model and custom code complexity. The database server processes the SQL requests. SQL runtime highly affected by up to dateness of statistics, memory (bufferpools), indexes, data fragmentation, old version data ... 6 Web browser renders the received data. Depending on complexity of result page and java script code this might be resource intensive and highly affected by IE version and hardware resources Appsvr_hostname service processes the received data, applies custom code logic (e.g. entry build scripts, value rules ...), generates the response and sends it back to client SQL results are send back to MDMCS server © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Some common tools to diagnose performance issues User Experience Browser (IE) OS / Hardware Stop watch Windows Task Manager, HTTP tracing tools e.g. Fiddler, IE Developer Tool (F12) Windows resource monitor Network Netstat, traceroute OS / Hardware Sumall; PIM Cache Snapshot; Custom code debug statements; Deactivate container scripts; JVM verbose gc; IBM Monitoring and Diagnostic Tools for Java, ... Nmon, vmstat ... Network netstat, traceroute DB2 / Oracle OS / Hardware Db2 clp, db2pd, snapshots / sqlplus, AWR ... WAS MDMCS Services / Custom Code 7 Nmon, vmstat ... © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Background information: MDMCS SQL logging mechanism MDMCS services enabled to log triggered SQL queries along with its runtime in db.log files Configuration settings influence logging behavior: Logging level as defined in $TOP/etc/default/log.xml <appender name="DB" class="org.apache.log4j.RollingFileAppender"> <param name="File" value="/opt/envs/..../logs/${svc_name}/db.log" /> <param name="Append" value="true" /> <param name="maxFileSize" value="10MB" /> <param name="maxBackupIndex" value="2" /> <param name="encoding" value="UTF-8" /> <layout class="org.apache.log4j.PatternLayout"> <param name="ConversionPattern" value="%d [%t] %-5p %c %x- %m%n"/> </layout> </appender> <category name="com.ibm.ccd.common.db" additivity="false"> <priority value="debug" /> <appender-ref ref="DB" /> </category> DB DBappender appenderisisused usedfor for SQL tracking SQL tracking Defines Definessize sizeand andnumber number of oflog logfiles files Defines Defineslevel levelof oflogging: logging: --“debug” mode “debug” modeto totrack trackall allSQLs SQLs --“info” “info”to totrack trackdelayed delayedqueries queriesonly only Delayed SQL threshold value as defined in $TOP/etc/default/common.properties profiling_query_delay_threshold= <milliseconds> It defines a runtime threshold, at which queries are marked as "delayed queries". When using logging level info, only such “delayed queries” will be tracked. 8 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Background information: MDMCS SQL logging mechanism Note: Using debug mode might impact performance, but should be less than 5% (unless IO is very slow) profiling_query_delay_threshold setting is ambiguous. Always use debug mode when investigating performance issues to track all SQLs Run times tracked in the db.log files do not represent the SQL runtime on DB server side. It reflects the time-span between when the SQL was triggered by MDMCS service and when the service was able to pick up the result. Logged runtimes might be higher than actual runtimes on DB server due to: - slow network - constrained CPU and thread resources on MDMCS server side. When in doubt, test runtime of selected SQLs directly on DB server side to check extra delay 9 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Background information: MDMCS SQL logging mechanism Sample db.log content 2015-03-24 08:43:06,360 [jsp_48: classicornew.wpc] DEBUG com.ibm.ccd.common.db.Query - Time: (2453 ms) Dynamic Query [select itd_item_id, itd_blob from itd, ( select itm_id from ( select * from ( SELECT itm_id FROM tctg_itm_item WHERE itm_company_id = 30001 Runtime Runtimeof ofquery queryin in Name Nameof ofthe thethread threadwhich which AND itm_container_id = 25206 milliseconds milliseconds triggered triggeredthe thequery query AND itm_version_id <= 999999999 AND itm_next_version_id >= 999999999 ) AS OM98 fetch first 857 rows only ) AS DUMMY_ALIAS_01 ) AS DUMMY_ALIAS_02 where itd_item_id = itm_id and itd_next_version_id = 999999999 order by itd_item_id] 2015-03-24 08:43:06,360 [jsp_48: classicornew.wpc] WARN com.ibm.ccd.common.db.Query - CWPCM0018W:The query is delayed: ( 2,453 ms)\n Dynamic Query [select itd_item_id, itd_blob from itd, ( select itm_id from ( select * from ( SELECT itm_id The Thequery querywill willbe betracked trackedaasecond secondtime timewith withflag flag“The “Thequery queryisisdelayed” delayed” FROM tctg_itm_item ififthe runtime is above the profiling_query_delay_threshold value the runtime is above the profiling_query_delay_threshold value WHERE itm_company_id = 30001 ... 2015-03-24 08:43:06,927 [jsp_48: classicornew.wpc] DEBUG com.ibm.ccd.common.db.Query - Time: (0 ms) Static Query [name: GET_LKP_TIMESTAMP] [id: 1780] select rtrim(caa_value) from tctg_caa_catalog_attributes where caa_name = 'LAST_MODIFIED' and caa_catalog_id = 25206 and caa_next_version_id = 999999999 Timestamp Timestampwhen whenthread threadreceived receivedquery queryresult result 10 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Sumall: Generate SQL workload/performance statistics Utility Sumall can be used to parse db.log files and generate a statistical summary of logged SQL runtimes. The tool is already bundled with MDMCS starting with release V10.0.0, but depending on release and fixpack, might not provide latest features. It is invoked with: $JAVA_RT com.ibm.ccd.common.wpcsupport.util.Sumall <db.log files> [options] A standalone version of the tool containing latest enhancements is published in technote http://www-01.ibm.com/support/docview.wss?uid=swg24041140 (How to create SQL workload and performance statistics from db.log files) and can be used to parse as well db.log files from MDMCS V9.1.0: java -jar <fullpath>/Sumall.jar <db.log files> [options] To get list of options run: invoke with “h” appended java -jar <fullpath>/Sumall.jar h 11 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Sumall/Sumall.jar: Generate SQL workload/performance statistics Usage: $JAVA_RT com.ibm.ccd.common.wpcsupport.util.Sumall <db.log files> [leng=n characters] [delayed] [printsql] [cvon] [cleanoff] [fromtime="yyyy-MM-dd HH:mm:ss[,SSS]"] [totime="yyyy-MM-dd HH:mm:ss[,SSS]"] [threadname="<threadname>"] [threadlist] [version] [h|help] Options h|help: 12 print usage info leng: Default Value = 70. It defines how many characters from the beginning of the SQL are used for statistical aggregation printsql: Extract SQLs and its runtime in sequential order, print one line per SQL delayed: Only SQLs marked as delayed will be analyzed cvon: Use to turn on calculation of coefficient of variation (cv) It provides a normalized metric how much single runtimes vary compared to average runtime cleanoff: When set, parameter values will not be replaced in Dynamic SQL strings. Depending on leng parameter setting, this might cause Dynamic queries of the same type to be reported on seperate lines, in case individual parameter values are contained in the first n (leng) Characters. This might be desired if you want to investigate Dynamic queries with their actual parameter values fromtime: Analyze only SQLs with a timestamp > fromtime, required input format: yyyy-MM-dd HH:mm:ss totime: Analyze only SQLs with a timestamp < totime, required input format: yyyy-MM-dd HH:mm:ss threadname: SQLs will be analyzed only for the provided thread. Use a threadname as listed by threadlist option threadlist: Print sqlstatistics per thread ordered by total SQLruntime version: Print current version © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Sumall: Sample output “threadname” “threadname”option optioncan canbe beused usedto toextract extractSQL SQL workload triggered by one specific thread workload triggered by one specific threadonly only SQL SQLstatistics statisticsare aregenerated generatedbased basedon onfirst first70 70characters characters (default), but can be varied using “leng” option (default), but can be varied using “leng” option java -jar Sumall.jar db.log* threadname="jsp_57678: moveEntriesToNextStep.wpc" … Output Outputisissorted sortedbased basedon ontotal total runtime in descending order. runtime in descending order.SQLs SQLs with withhighest highestimpact impactare areshown shownatat the thebottom bottom For Foreach eachunique uniqueSQL SQLstatement, statement, 33statistical values statistical valuesare areprinted: printed: --total totalruntime runtime(milliseconds) (milliseconds) --number numberof ofexecutions executions --average runtime average runtime(milliseconds) (milliseconds) 130 130 1,0 GEN_CTG_CSA_CAT_SYS_ATTR_GETBYCATEGORYID] SELECT * FROM tctg_csa_cat_s 134 130 1,0 GEN_CTG_CHI_CATEGORY_HIERARCHY_GETBYCHILDID] SELECT * FROM tctg_chi_ca 134 130 1,0 GEN_CTG_CFP_CAT_FULL_PATHS_GETBYCATEGORYID] SELECT * FROM tctg_cfp_cat 158 1 158,0 GET_EVENT_ENTRY_MATCHING_SIGNATURE] SELECT wee_entry_id FROM twfl_wfe_ 219 235 0,9 GEN_SEC_SCU_USER_GETBYID] SELECT * FROM tsec_scu_user WHERE scu_user_i 223 220 1,0 GEN_SEC_CMP_COMPANY_GETBYID] SELECT * FROM tsec_cmp_company WHERE cmp_ 267 278 1,0 GEN_CTG_ICM_ITEM_CATEGORY_MAP_GETBYITEMVERSION] SELECT * FROM tctg_icm 276 250 1,1 GEN_CTG_ITD_ITEM_DETAIL_GETBYIDVERSION] SELECT * FROM tctg_itd_item_de 431 428 1,0 GEN_CTG_ITM_ITEM_GETBYPRIMARYKEYVERSION] SELECT * FROM tctg_itm_item W ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2015-07-24 10:37:54,467 - 2015-07-24 10:37:57,563 : 0,00 h | 0,00 m | 3,10 s Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 2862 2733 1 0,0 Monitored Monitoredtime timeperiod period(MT) (MT)shows showsthe the duration of the thread based on duration of the thread based onfirst firstand and last lasttriggered triggeredSQL SQL 13 Total Runtime (hours) 0,0 % of MT (%) 92,4 Aggregated Aggregatedstatistics statistics % %of ofMT: MT:ititreflects reflectsthe theSQL SQLfootprint footprintin inrelation relationto to Monitored time period. Indicative especially when Monitored time period. Indicative especially whenstatistics statistics are arecalculated calculatedfor foraasingle singlethread thread Rule Ruleof ofthumb: thumb: -->=80%: >=80%:thread threadperformance performanceisislimited limitedby bySQL SQLperformance performance --<=80%: data processing becomes CPU intensive <=80%: data processing becomes CPU intensive(complex (complex data, data, extensive extensivecustom customcode codelogic logic...) ...) © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Basic health checks to ensure optimal performance Basic healthchecks should be run before starting any performance troubleshooting Is the database configured according to installation requirements? Does the current table and index layout match the default schema? Are table and index statistics up to date or is table and index data very fragmented? Are old versions being deleted regularly? Are MDMCS caches configured sufficiently high? Are there any CPU or memory (also DB bufferpool) related bottlenecks on the application or database server side ? Details to be found at: http://www-01.ibm.com/support/docview.wss?uid=swg21641292 14 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Understand and define the performance problem What exactly is slow General slowness may point to some basic configuration/maintenance deficiencies or resource bottlenecks Only a specific use case ==> precisely define and document the reproduction steps and involved objects Use always the same data set for tests, if possible run tests on isolated environment Clarify what custom logic is involved (container scripts, specs, views) What is the current and expected runtime When did the problem manifest Performance degraded over time (indication for general maintenance tasks) Abrupt change in performance? What has changed ? (New custom code or data model deployment, manipulation of high volume of data ...) Who is affected All users working with the user interface(which one used: classic or new simplified UI), or specific users only (specific permissions, location in a different network, different client workstation hardware and Internet Explorer version …) 15 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack If unclear, do an initial diagnostic data collection session to narrow down the performance bottleneck in the stack before starting specific investigations. 16 Network relevant for scheduled jobs, workflow processing relevant for UI interactions User Experience Browser (IE) OS / Hardware WAS MDMCS Services / Custom Code OS / Hardware Network Overall execution time: stop watch Time iexplore.exe peaks in CPU usage: Windows Task Manager / stop watch Time to receive response from MDMCS server: Http tracing tool like Fiddler, IE Developer tools(F12) Execution time on MDMCS server along with SQL runtime footprint: db.log analysis (captured in debug mode) using Sumall utility General monitoring of CPU or memory usage peaks at MDMCS or DB server: Nmon, vmstat ... DB2 / Oracle OS / Hardware © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack (example) Reported Problem: Rendering all items in a specific step within worklist summary page takes some 35 seconds. To collect: As the use case involves UI interaction, client side checks need to be included: Overall execution time till page is rendered Use stop watch to measure. Track start and stop time for the use case by running the “date” command on MDMCS shell console so log files can be matched more easily Time till when response is received from MDMCS server: Use http tracing tool like Fiddler When does iexplore.exe start to peak in CPU usage and for how long Monitor iexplore.exe in process tab of Windows Task Manager Execution time on MDMCS server and SQL runtime footprint: use pimSupport.sh -t start / stop to capture all log files in debug mode automatically (only use if it is a "short" reproduction usecase, otherwise data volume might be too big) or manually enable debug mode (Technote 1383862) and make sure to archive db.log* files immediately when the reproduction use case is finished to avoid log file wrapping Monitor CPU and memory usage on MDMCS server: Use nmon, vmstat, top ... 17 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack (example) Observations: After clicking “verify attributes” link, iexplore.exe CPU started to peak after some 10 seconds and stayed high for another 25 seconds. Use case completed in total after some 35 seconds As shown by Fiddler, response from MDMCS server was received after some 10 seconds 18 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack (example) Runtime footprint on MDMCS server side First run Sumall with fromtime and threadlist option to identify thread that was serving the client request java -jar Sumall.jar db.log* threadlist fromtime="2014-02-27 14:28:00" … 992 1828 0.5 GEN_SEC_CMP_COMPANY_GETBYID] SELECT * FROM tsec_cmp_company WHERE cmp_ 1566 2304 0.7 OBJECT_ACCESS_FOR_USER_ROLE_ALL_GROUPS] SELECT orm_access, orm_access_ 2920 1600 1.8 GEN_WFL_CEH_COLL_ENTRY_HISTORY_GETBYCAENTRY] SELECT * FROM twfl_ceh_co ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2014-02-27 14:28:35,999 Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 14282 15754 0 0.2 PRINTING THREADSUMMARY Total Count Average (ms) (cnt) (avg ms) 1 1 1.0 30 14 2.1 69 97 0.7 124 19 6.5 5661 9600 0.6 8397 6023 1.4 Total Count Average (ms) (cnt) (avg ms) - 2014-02-27 16:04:49,337 : Total Runtime (hours) 0.0 1.60 h | 96.00 m | 5773.34 s % of MT (%) 0.2 first timestamp last timestamp Thread 2014-02-27 14:28:35 2014-02-27 15:02:30 2014-02-27 14:28:48 2014-02-27 14:32:41 2014-02-27 14:28:59 2014-02-27 14:28:36 first timestamp 2014-02-27 15:02:30 2014-02-27 16:04:49 2014-02-27 16:02:41 2014-02-27 16:04:06 2014-02-27 14:28:46 last timestamp unregistered (formely: jsp_274: QuickLinks.wpc) logoutListener:277 heartbeat_appsvr jvm_maintenance_appsvr Thread-45 jsp_276: ItemListQueryStore.wpc Thread Thread Thread“jsp_276: “jsp_276:ItemListQueryStore.wpc” ItemListQueryStore.wpc”has hasserved servedthe the client clientrequest request(matches (matchesbased basedon ontimestamp timestampand andname) name) 19 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack (example) Calculate the SQL runtime statistics for only the thread jsp_276 java -jar Sumall.jar db.log* threadname="jsp_276: ItemListQueryStore.wpc" sum(ms) cnt avg(ms) QUERY ----------------------------------------------... 8 2 4.0 GEN_SEC_SCU_USER_GETBYID] SELECT * FROM tsec_scu_user WHERE scu_user_i 47 7 6.7 GEN_WFL_CAE_COLL_AREA_ENTRY_GETBYCASTEPPATHID] SELECT * FROM twfl_cae_ 239 400 0.6 Dynamic: [ select distinct csm_spec_id from tctg_csm_category_spec_ma 326 400 0.8 POPULATE_CSM_MAP] SELECT coa_src_container_id FROM twfl_coa_collaborat 349 400 0.9 GEN_CTG_ICM_ITEM_CATEGORY_MAP_GETBYITEMVERSION] SELECT * FROM tctg_icm 366 400 0.9 GEN_CTG_ITM_ITEM_GETBYPRIMARYKEYVERSION] SELECT * FROM tctg_itm_item W 369 400 0.9 GEN_CTG_ITD_ITEM_DETAIL_GETBYIDVERSION] SELECT * FROM tctg_itd_item_de 407 400 1.0 GET_COL_AREA_ENTRY_LOCKED_COUNT_IN_STEP] select count(*) from cae, cel 498 400 1.2 GET_CSM_INFO_FOR_ITEMSET] SELECT distinct csm_category_id, csm_spec_id 529 400 1.3 GET_ITM_INFO_FOR_ITM_LIST] SELECT itm_id, itm_primary_key, itm_version 594 400 1.5 GET_ICM_INFO_FOR_ITEMSET] SELECT * FROM tctg_icm_item_category_map WHE 784 400 2.0 GET_CA_HISTORY_BY_DATES_FOR_PK] SELECT * FROM twfl_ceh_coll_entry_hist 946 400 2.4 GET_ITD_FOR_ITEMSET] SELECT * FROM tctg_itd_item_detail WHERE itd_item 2920 1600 1.8 GEN_WFL_CEH_COLL_ENTRY_HISTORY_GETBYCAENTRY] SELECT * FROM twfl_ceh_co ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2014-02-27 14:28:36,138 Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 8397 6023 1 0.1 2014-02-27 14:28:46,358 : Total Runtime (hours) 0.0 6023 6023queries querieswere weretriggered triggeredwith withaatotal total runtime runtimeof of8.397 8.397seconds seconds 20 - % of MT (%) 82.2 0.00 h | 0.00 m | 10.22 s Thread Threadwas wasactive activeon onMDMCS MDMCS server for 10.22 seconds server for 10.22 seconds Total TotalSQL SQLruntime runtimecontributed contributed to 82% of monitored to 82% of monitoredtimeperiod timeperiod © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Locate performance bottleneck in component stack (example) Overview: User Experience Browser (IE) OS / Hardware Network WAS MDMCS Services / Custom Code Iexplore.exe showing peak CPU usage for some 25 s Time to receive response from MDMCS: ~10 s Time to receive response matches Execution time on MDMCS server ==> network in between client and server does not add any reasonable delay OS / Hardware Execution time on MDMCS server: ~10s SQL runtime footprint to receive data from DB: 8 s Network CPU on MDMCS and DB server not exhausted DB2 / Oracle OS / Hardware 21 Overal execution time: ~35 s (seconds) Main performance bottleneck on client side (25s rendering time) but SQL workload is high as well and could be even more relevant on less performing environments © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues If performance bottleneck is located on Client-side Clean your browser cache before using the user interface the first time after a fix pack, interim fix, or test fix has been applied. Test with using a zoom level of 100% in Microsoft Internet Explorer (IE) Check whether IE browser version is compatible with MDMCS system requirements. In general: latest approved (check system requirements) IE version performs best When working with multiple MDMCS client sessions in parallel, ensure that each IE window uses its own session Run test with recent client hardware (especially new CPU processor) If above checks and tests do not improve rendering performance on client, engage with MDMCS support for more in depth investigations 22 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues If performance bottleneck is seen on MDMCS server side Decide on next steps based on generated SQL footprint statistics What is the SQL runtime footprint in comparison to monitored timeperiod. Is slowness rather due to processing results or retrieving data from DB server ? Large SQL runtime footprint (% of MT > 80) due to slow performing SQLs: Do a SQL runtime differentiation check on both MDMCS server and DB server to evaluate whether it is a DB server side issue, network issue or some JVM issue picking up the results fast enough If it is a DB server side issue, recheck recommended data maintenance, DB configuration, bufferpool hitratios, OS resources … Large SQL runtime footprint due to massive number of SQLs: Check MDMCS cache hitratio (if low, spec information is retrieved over and over again) Check custom code logic whether same data is retrieved repeatedly CPU and memory utilization bottlenecks on MDMCS server Issues with JVM memory usage – rarely seen nowadays (use verbose gc to track memory usage and garbage collection) 23 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Slow processing due to low MDMCS cache hit ratios Spec Specrelated relatedqueries queries(nom, (nom,spa, spa,spc, spc,nod, nod, noa, noh, agn) dominate SQL runtime noa, noh, agn) dominate SQL runtime java -jar Sumall.jar db.log* … 18328 42138 0,4 Dynamic: [SELECT nom_master_spec_id, nom_master_spec_node_id, nom_sub 18553 42162 0,4 Dynamic: [SELECT DISTINCT spa_version_id FROM tctg_spa_spec_attribute 18667 42139 0,4 Dynamic: [SELECT noh_spec_id, noh_parent_node_id, noh_child_node_id, 18742 42149 0,4 GET_MAX_VERSION] SELECT max ( ver_id ) FROM tctg_ver_version 19012 42138 0,5 Dynamic: [SELECT * FROM tctg_spc_specification WHERE (spc_id in (? )) 19907 42138 0,5 Dynamic: [SELECT * FROM tctg_spa_spec_attributes WHERE (spa_spec_id i 20245 414 48,9 GET_CA_HISTORY_BY_DATES_FOR_PK] SELECT * FROM twfl_ceh_coll_entry_hist 24226 42138 0,6 Dynamic: [SELECT distinct nod_id, noh_spec_id, nod_spec_id, nod_name, 28697 42138 0,7 Dynamic: [SELECT noa_node_id, noa_name, noa_value FROM tctg_noa_node_ 96318 54 1783,7 Dynamic: [ SELECT agn_spec_id b_spec_id, agn_node_id b_node_id, cvd_v ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2015-04-09 15:45:27,126 Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 364204 437814 0 6,1 - 2015-04-09 16:04:26,913 : Total Runtime (hours) 0,1 0,32 h | 18,00 m | 1139,79 s % of MT (%) 32,0 Action: Check Cache hit ratio using … SupportUtil –cmd=getRunTimeCacheDetail 24 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Slow processing due low MDMCS cache hit ratios Check Cache hit utilization after MDMCS services have been busy with regular workload: $JAVA_RT com.ibm.ccd.common.wpcsupport.util.SupportUtil –cmd=getRunTimeCacheDetail Cache utilization is shown for each service. Cache utilization might be different in between services For example PIM CACHE SNAPSHOT: Title Current Cache Size --------------appsvr_HOAPMDM001FST---------------Spec Cache 600 Spec Key To Current Start Version Cache 600 Spec Key Version to Start Version Cache 600 Lookup Table Cache 100 Role Cache 2000 Access Cache 5000 Spec Name Cache 5000 Attribute Collections Cache 100 WSDL Cache 0 Workflow Definition Cache 250 Script Cache 2000 Catalog Cache 100 Catalog Definition Cache 100 Node Id Cache 5000 View Cache 500 ... Max Cache Size Cache Hits Cache Requests Hit Percentage 599 600 600 10 24 153 9 19 0 14 147 0 30 989 500 24878675 25305758 0 84691 90825 106736 16364 755635 0 45369 68387 0 203432 1016559 1880926 30335844 35777562 0 84822 91128 109308 16381 759681 0 45415 68663 0 203496 1018396 2286372 82% 71% 0% 100% 100% 98% 100% 99% 0% 100% 100% 0% 100% 100% 82% Increase cache settings whenever “Max Cache Size” has reached “Current Cache Size” and “Cache Requests” are higher than “Cache Hits” 25 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Inefficient custom code affecting item display performance Observation: According Accordingto tohttp httptrace, trace,itittakes takes>>12 12seconds secondsto to receive data on client receive data on client SQL SQLfootprint footprintisisonly onlysome some33 33% %of ofthread threadruntime, runtime, but buthigh highnumber numberof ofexecutions executionsof of22SQL SQLtypes types Performance Performancelimited limitedby bycustom customor orMDMCS MDMCSserver server code code high highCPU CPUusage usageseen seenfor forappsvr* appsvr*service servicewhile while request requestisisprocessed processedon onMDMCS MDMCSserver serverside side java -jar Sumall.jar db.log threadname="jsp_347: classicornew.wpc" ... 64 122 0.5 GEN_CTG_ICM_ITEM_CATEGORY_MAP_GETBYITEMVERSION] SELECT * FROM tctg_icm 1259 2597 0.5 GEN_CTG_AGL_ATTR_GROUP_LOCALE_GETBYATTRGROUPVERSION] SELECT * FROM tct 2538 5180 0.5 GEN_CTG_ATG_ATTR_GROUP_GETBYID] SELECT * FROM tctg_atg_attr_group WHER ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2015-03-23 17:30:34,301 Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 4116 8260 0 0.1 26 - 2015-03-23 17:30:46,637 : Total Runtime (hours) 0.0 0.00 h | 0.00 m | 12.34 s % of MT (%) 33.4 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Inefficient custom code affecting item display performance Using Usingadditional additionalSumall Sumalloptions options“cleanoff” “cleanoff”and and“leng” “leng”reveals, reveals,that thatactually actually the same data is retrieved over and over again and thus increases the same data is retrieved over and over again and thus increasesSQL SQL footprint footprintsignificantly significantly What is triggering What is triggeringthis this?? java -jar Sumall.jar db.log threadname="jsp_347: classicornew.wpc" leng=5000 cleanoff ... 25 1 25.0 RELEASE_MASTER_LOCKS_FOR_THREAD_JVM] UPDATE tutl_lck_lock SET lck_thread_id = '-... 1147 2590 0.4 GEN_CTG_ATG_ATTR_GROUP_GETBYID] SELECT * FROM tctg_atg_attr_group WHERE atg_id = -1 1257 2590 0.5 GEN_CTG_AGL_ATTR_GROUP_LOCALE_GETBYATTRGROUPVERSION] SELECT * FROM tctg_agl_attr... 1391 2590 0.5 GEN_CTG_ATG_ATTR_GROUP_GETBYID] SELECT * FROM tctg_atg_attr_group WHERE atg_id = 6421 ----------------------------------------------sum(ms) cnt avg(ms) QUERY Monitored timeperiod (MT): 2015-03-23 17:30:34,301 Statistical values for all queries Total Count Average Total (ms) (cnt) (avg ms) (minutes) 4116 8260 0 0.1 27 - 2015-03-23 17:30:46,637 : Total Runtime (hours) 0.0 0.00 h | 0.00 m | 12.34 s % of MT (%) 33.4 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Inefficient custom code affecting item display performance Possible diagnostic actions to expose custom code issues: Test runtime with selectively disabled container scripts or suspected value rules Especially review custom code which is run in loops In complex custom code, add logging statements at major codeblocks, containerscripts to see where time is spend 2015-03-24 16:33:00,028 [jsp_93: /ccd_rendering/renderer.jsp] DEBUG com.ibm.ccd.common.interpreter.engine.Script - @@@ Script [ [ContainerA.PostSave]] started at: Tue Mar 24 16:33:00 CET 2015 ... 2015-03-24 16:33:08,343 [jsp_93: /ccd_rendering/renderer.jsp] DEBUG com.ibm.ccd.common.interpreter.engine.Script - @@@ Script [ [ContainerA.PostSave]] ended at: Tue Mar 24 16:33:08 CET 2015 If too many SQLs are triggered by custom code, but you do not know which function calls trigger these, combine timestamp information from custom code debug statements with selectively parsing db.log files with Sumall utility, making use of options fromtime, totime, threadname Eventually profile the respective JVM using some java profiler. E.g. IBM Healthcenter: http://www.ibm.com/developerworks/java/jdk/tools/healthcenter/ 28 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Inefficient custom code affecting item display performance Solution in documented usecase: Move call setAttributeGroupsToProcess(hmAttGroups) out of for loop reduced the number of executions of queries GEN_CTG_ATG_ATTR_GROUP_GETBYID GEN_CTG_AGL_ATTR_GROUP_LOCALE_GETBYATTRGROUPVERSION GEN_CTG_ATG_ATTR_GROUP_GETBYID tremendously and thus reduced total SQL footprint and processing time Original custom code Library 29 Fixed custom code Library © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues When to consider average SQL runtime slow? java -jar Sumall.jar db.log … 8405 7850 1,1 GEN_CTG_CSA_CAT_SYS_ATTR_GETBYCATEGORYID] SELECT * FROM tctg_csa_cat_s 8412 7850 1,1 GEN_CTG_CFP_CAT_FULL_PATHS_GETBYCATEGORYID] SELECT * FROM tctg_cfp_cat 10556 50 211,1 Dynamic: [select count(distinct ita? .ita_item_id) AS DUMMY_ALIAS_? f 12600 12146 1,0 GEN_SEC_CMP_COMPANY_GETBYID] SELECT * FROM tsec_cmp_company WHERE cmp_ 14707 92 159,9 GET_EVENT_ENTRY_MATCHING_SIGNATURE] SELECT wee_entry_id FROM twfl_wfe_ 27968 47450 0,6 Dynamic: [ select distinct csm_spec_id from tctg_csm_category_spec_ma 56704 52561 1,1 GEN_CTG_ICM_ITEM_CATEGORY_MAP_GETBYITEMVERSION] SELECT * FROM tctg_icm 57138 51090 1,1 GEN_CTG_ITM_ITEM_GETBYPRIMARYKEYVERSION] SELECT * FROM tctg_itm_item W 58288 49922 1,2 GEN_CTG_ITD_ITEM_DETAIL_GETBYIDVERSION] SELECT * FROM tctg_itd_item_de ----------------------------------------------sum(ms) cnt avg(ms) QUERY ? ? ? No definite answer, depends on hardware and network and relevance (how often is it triggered) - Static queries are often simple and show average runtimes in the range of 0.4 – 20ms - Dynamic queries are more complex and runtimes of several seconds might be expected - overall use case performance might be affected by even slight changes in average SQL runtime if it involves large volume of executions Recommendation Generate SQL statistics at intervals and different environments: - get familiar with “expected” runtimes - discover changes in frequency of specific SQLs which allows proactive investigations - changes in avg runtimes of frequent SQLs are indicators of changed workload, need for maintenance tasks, resource issues ... 30 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Variation of average runtimes in between environments What is the reason for this difference? Server hardware, network, DB configuration and maintenance, workload ... Check influence of network e.g. by doing SQL runtime differentiation test (next slides) 31 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues SQL runtime differentiation test Extract a set of commonly seen “select” SQLs from db.log files and measure execution time when run against DB server from different locations using native DB clients like sqlplus (Oracle) or db2 clp (DB2) How to generate SQLfile for runtime differentiation test (extract only select SQLs) java java -jar -jar Sumall.jar Sumall.jar db.log db.log printsql printsql leng=500 leng=500 cleanoff cleanoff || grep grep -vi -vi update update || grep -vi insert | grep -vi delete |grep -vi merge |head -500 grep -vi insert | grep -vi delete |grep -vi merge |head -500 >> sql500 sql500 awk awk -f -f gentestsqlfile.awk gentestsqlfile.awk sql500 sql500 >> test500sql.tvf test500sql.tvf Content Content of of gentestsqlfile.awk gentestsqlfile.awk BEGIN{} BEGIN{} {{ sqlstring1 sqlstring1 == substr($0,52); substr($0,52); pos = index($0,"Dynamic:"); pos = index($0,"Dynamic:"); if(pos if(pos >> 0){ 0){ split(sqlstring1,v,"["); split(sqlstring1,v,"["); split(v[2],z,"]"); split(v[2],z,"]"); print print z[1] z[1] ";" ";" ;; }} else else {{ split(sqlstring1,v,"]"); split(sqlstring1,v,"]"); print print v[2] v[2] ";" ";" ;; }} }} END{} END{} 32 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues SQL runtime differentiation test Run the SQLs multiple times (allow population of bufferpools) against the DB server from both, DB and MDMCS server date; time db2 -tvf test500sql.tvf >foo Executions on MDMCS server Tue Mar 24 13:00:37 CET 2015 real 0m3.59s user 0m0.16s sys 0m0.29s Executions directly on DB server Tue Mar 24 13:03:06 CET 2015 real 0m0.93s user 0m0.07s sys 0m0.13s Tue Mar real user sys 24 13:00:41 CET 2015 0m3.18s 0m0.13s 0m0.22s Tue Mar real user sys 24 13:03:07 CET 2015 0m0.83s 0m0.07s 0m0.14s Tue Mar real user sys 24 13:05:44 CET 2015 0m2.39s 0m0.11s 0m0.20s Tue Mar real user sys 24 13:03:08 CET 2015 0m0.93s 0m0.08s 0m0.14s To execute the same set of SQLs on MDMCS server against DB takes more than 3 times longer than when running the SQLs directly on DB server. ==> network adds reasonable delay, so it should be checked (one exemption: if CPU capability of DB server is much higher than on MDMCS server, then difference might be less due to network than CPU difference) 33 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues One simple Network check: Traceroute Latency and bandwidth between the application server and database should be < 0.25 ms Path to destination should be short Check with traceroute, e.g. traceroute 192.168.97.104 traceroute to 192.168.97.104 (192.168.97.104) from 10.252.41.244 (10.252.41.244), 30 hops max outgoing MTU = 1500 1 10.252.40.1 (10.252.40.1) 10 ms 1 ms 4 ms 2 10.211.247.250 (10.211.247.250) 7 ms 5 ms 3 ms 3 df0000001670026.comp.com (192.168.97.104) 4 ms 3 ms 6 ms ==> rather long hop times and 2 routers involved 34 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Database Server-side checks One time task after migration / changing setup Ensure DB configuration matches MDMCS DB setup recommendations and table and index layout conforms with installed version http://www-01.ibm.com/support/docview.wss?uid=swg21641292 Regular tasks Table and index statistics are generated in detail mode at regular intervals Tables and indexes are reorganized when needed (e.g. reorgchk) Monitor Bufferpools configured sufficiently high to ensure high bufferpool hitratio AND low # of physical reads Resource intensive SQLs (e.g. dbsnapshots, AWR reports) and evaluate impact with MDMCS Administrator Free space in tablespaces respectively filesystem to allow growth of data CPU and disk I/O not at its limit 35 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Database Server-side checks If there are SQLs that perform slowly, investigate data access plan. What is the expected resultset ? Does it utilize indexes ? Would an additional index accelerate data access ? You may add at any time new “non unique” indexes for testing purposes. But when found beneficial, recheck with support so helpful indexes can be included in the default schema Use db2bach & db2exfmt to investigate single SQL performance issues on DB2 - Collect runtime info along with database snapshots, which show e.g. bufferpool reads db2batch -d <dbname> -a <schemaname>/<passwd> -f <file containing SQL> -i complete ICM_INFO_dyn.batch -o f -1 p 5 o 5 e yes Note: -r option "-o ... e yes" sets the explain mode, so data accessplan gets collected - Extract the data accessplan db2exfmt -d <dbname> -# 0 -w -1 -g TFIC -n % -s % -o <accessplanoutput> -e <schemaname> Attention: SQLs tracked as “Static Query” in db.log are run as prepared statements using parameter markers. In rare situations it might be observed that running same SQL using literals instead of parameter markers might be faster. http://www.ibm.com/developerworks/data/library/techarticle/dm-1104db2statementconcentratorreoptimization/ 36 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Importance of high bufferpool hitratios on DB server Need to read data from disk severely affects SQL runtime. Tests done in isolated DB server environment with a complex SQL (as reported in some PMR) by using db2batch show this influence. Only parameter changed was bufferpool size which influences number of physical reads (see next slide) 37 SELECT DISTINCT ita1.ita_item_id AS c1 FROM cat cat1 ,icm icm1 ,ita ita1 ,ita ita2 ,ita ita3 WHERE ita1.ita_catalog_id = 7825 AND ita1.ita_version_id <= 999999999 AND ita1.ita_next_version_id >= 999999999 AND cat1.cat_id = icm1.icm_category_id AND ita1.ita_node_id = 11192 AND ita2.ita_node_id = 11198 AND ita2.ita_item_id = ita1.ita_item_id AND ita3.ita_node_id = 15467 AND ita3.ita_item_id = ita1.ita_item_id AND icm1.icm_item_id = ita1.ita_item_id AND cat1.cat_primary_key = '00208' AND cat1.cat_version_id <= 999999999 AND cat1.cat_next_version_id >= 999999999 AND ita1.ita_value_numeric <= 1390950000000 AND ( ( ita2.ita_value_numeric > 1390950000000 AND ita2.ita_version_id <= 999999999 AND ita2.ita_next_version_id >= 999999999 ) OR ( ita2.ita_value_numeric IS NULL AND ita2.ita_version_id <= 999999999 AND ita2.ita_next_version_id >= 999999999 ) ) AND ita3.ita_value_string = 'true' AND ita3.ita_version_id <= 999999999 AND ita3.ita_next_version_id >= 999999999 AND icm1.icm_version_id <= 999999999 AND icm1.icm_next_version_id >= 999999999 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues Importance of high bufferpool hitratios on DB server By lowering bufferpool size from 50 to 30K, physical reads started to show up (6700) and caused SQL runtime to double (13,8 versus 28,2 seconds) Each physical read accounted for an additional delay of 2.2 ms 38 © 2014 IBM Corporation Introduction to troubleshooting MDMCS performance issues 39 © 2014 IBM Corporation