Characterizing Memory Hierarchies of Multicore Processors Using Microbenchmarks
Christopher Celio, Aleah Caulin, Krste Asanovic, Dave Patterson

What does the hardware really look like?

GOAL: design and implement portable software micro-benchmarks to determine the important characteristics of the hardware:
• Number of cache levels
• Data cache sizes at each level in the cache hierarchy
• Access time to each level in the cache hierarchy
• Off-chip bandwidth for unit-stride accesses
• Cache-to-cache transfer latency
• Cache-to-cache transfer bandwidth
• Request bandwidth
• Is a cache inclusive?
• Is a cache shared?

Motivation & Useful Applications

Determining Hardware Specs
• Measure unpublished specs (e.g., cache-to-cache transfer latency)
• Produce realistic specs: published theoretical bandwidth numbers may not be sustainable

Verification of Simulators
• Micro-benchmarks can characterize existing machines to provide a calibration for simulator targets
• Useful for finding bugs that render incorrect simulation results

More Informed Auto-tuners
• Auto-tuners may be restricted to run on a sub-set of the hardware

Cache Size and Access Latency

Microbenchmark
• One thread performs pointer chasing on an array:
  - unit stride (optimal locality)
  - 16-element stride (every access touches a different cache line)
  - randomly sorted array (avoids prefetching)
• Measure the time required to complete a fixed amount of work (pointer chasing for N iterations)

// 'stride' is an input parameter. The 'multiply' instruction is
// removed by enumerating each case as a constant, so that a
// 'shift' is synthesized instead.
if (stride == 16) {
    uint32_t const stride_16 = 16;
    start_time = get_seconds();
    for (uint32_t k = 0; k < g_num_iterations; k++) {
        idx = g_arr_n_ptr[idx * stride_16];
    }
    stop_time = get_seconds();
}
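The initialization for the randomized case is not shown on the poster. A minimal sketch, assuming the array is linked into a single random cycle so that the chase visits every element in unpredictable order (the helper name and the use of rand() are illustrative):

#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: shuffle a visit order with Fisher-Yates, then
// link each element to its successor so that idx = arr[idx] walks one
// random cycle over the whole array, defeating stride prefetchers.
void init_random_chase(uint32_t *arr, uint32_t n)
{
    uint32_t *order = malloc(n * sizeof(uint32_t));
    for (uint32_t i = 0; i < n; i++)
        order[i] = i;
    for (uint32_t i = n - 1; i > 0; i--) {
        uint32_t j = rand() % (i + 1);
        uint32_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (uint32_t i = 0; i < n - 1; i++)
        arr[order[i]] = order[i + 1];
    arr[order[n - 1]] = order[0];   // close the cycle
    free(order);
}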
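The timing helper get_seconds() used by all of the kernels here is likewise left undefined; a minimal sketch, assuming POSIX gettimeofday() (its microsecond resolution is one reason clock precision appears under Challenges below):

#include <sys/time.h>

// Wall-clock time in seconds, at microsecond granularity.
double get_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}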
[Figure: access time vs. working-set size (randomized, cache-line stride, and unit-stride curves); the plateaus mark the L1 cache, L2 cache, L3 cache, and off-chip DRAM]

Derived processor specs:

Processor             L1 cache             L2 cache              L3 cache
Intel i7 Arrandale    32 KB                128 KB                4 MB
                      1.2 ns, 3.5 cycles   3.0 ns, 8.5 cycles    15.8 ns, 44.4 cycles
Intel i7 Bloomfield   32 KB                256 KB                8 MB
                      1.1 ns, 3.8 cycles   2.9 ns, 9.6 cycles    13.6 ns, 45.2 cycles
TilePro64             8 KB                 64 KB                 2-3.5 MB
                      4.3 ns, 3.0 cycles   12.9 ns, 10.0 cycles  70.2 ns, 49.1 cycles

TILE64 pipeline latencies provided by the Tilera architecture manual:

Operation (LD to use)        Latency
L1 hit                       2 cycles
L1 miss, L2 hit              8 cycles
L1/L2 miss, adjacent $ hit   35 cycles

Cache-to-Cache Transfer Latency

Microbenchmark
• One thread writes an allocated array while a second thread waits on a spin lock
• When it is the other thread's turn, it writes to the array
• Forced serialization between array elements is required to measure latency
• Writing privileges alternate between threads many times
• Ensures a "cold start" for each thread: no data in L1
• Measure the time required to complete a fixed amount of work (N iterations of swapping array write privileges)

[Figure: TilePro64 cache-to-cache transfer latency (tile 0,0 to tile 0,1; tile 0,0 to tile 7,6; tile 0,0 to 0,1 with direct messaging), with locking overhead shown for comparison]

Cache-to-Cache Bandwidth

Microbenchmark
• One thread writes an allocated array while a second thread waits on a spin lock
• Array access is unit-strided, and not serialized

//initialization code...
int i = 0;
for (i = 0; i < num_elements - 1; i++)
    array[i] = i + STRIDE;
array[num_elements - 1] = 0;

//after initialization...
tid = pthread_self();
double start_time = get_seconds();
while (count < num_iters) {
    if (tid == waiting_thread)
        while (*mylock == 0);          // spin until granted write privileges
    i = 0;                             // restart the unit-stride sweep
    do {
        array[i] = count;
        i++;
    } while (i < num_elements - STRIDE);
    count++;
    waiting_thread = tid;              // hand write privileges to the other thread
    *mylock = 0;
    *otherLock = 1;
}
double stop_time = get_seconds();

Processor             C2C Write Bandwidth
Intel i7 Arrandale    1789 MB/s
Intel i7 975 Ext. Ed  2150 MB/s
TilePro64             246 MB/s

Request Bandwidth

Microbenchmark
• One thread performs multiple pointer chases in parallel on an array
• Determines the saturation point in requests per second

uint32_t *idx = malloc(num_requests * sizeof(uint32_t));
for (int i = 0; i < num_requests; i++)
    idx[i] = (rand() % g_num_elements);

start_time = get_seconds();
for (uint32_t k = 0; k < g_num_iterations; k++) {
    // issue 'num_requests' independent chases each iteration
    for (uint32_t i = 0; i < num_requests; i++)
        idx[i] = g_arr_n_ptr[idx[i]];
}
stop_time = get_seconds();

Challenges

1. Complex interactions of the memory hierarchy can yield convoluted results:
   • Virtual memory, TLBs
   • Memory pre-fetchers
   • Adaptive cache replacement policies
   • Cache coherence protocols
   • Memory controller interactions
2. Accurate measurements of runtime are limited by the precision of the system clock
3. Hyperthreading needs to be avoided for certain measurements (e.g., cache-to-cache latency)
4. Locking overhead
5. OS scheduling
6. Portability across platforms

Related Work

STREAM:
• Sustainable Memory Bandwidth (MB/s)
• Measures unit-strided access bandwidth from off-chip

Name    Kernel                 Bytes/iter   FLOPS/iter
COPY    a(i) = b(i)            16           0
SCALE   a(i) = q*b(i)          16           1
SUM     a(i) = b(i) + c(i)     24           1
TRIAD   a(i) = b(i) + q*c(i)   24           2

Sustainable Bandwidth (MB/s):

Processor             # Threads   Copy    Scale   Add     Triad
Intel i7 Arrandale    4           7292    7011    7511    7536
Intel i7 975 Ext. Ed  8           12470   12130   12431   12641
TilePro64             56          501     490     544     536
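For concreteness, a minimal sketch of the TRIAD kernel from the table above, reusing the get_seconds() helper sketched earlier (the array length and scalar are illustrative; the real STREAM benchmark adds repetition and validation):

#include <stdio.h>
#include <stdlib.h>

double get_seconds(void);    // timing helper, as sketched earlier

#define N (1 << 24)          // illustrative size: large enough to spill to DRAM

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double q = 3.0;
    for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = get_seconds();
    for (int i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];        // TRIAD: 24 bytes, 2 FLOPS per iteration
    double t1 = get_seconds();

    // 24 bytes move per iteration (read b, read c, write a).
    printf("TRIAD: %.0f MB/s\n", 24.0 * N / (t1 - t0) / 1e6);
    return 0;
}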
Roofline:
• Visual representation of realistic performance and productivity expectations
• Determines hardware limitations for a specific kernel
• Prioritizes optimizations by showing potential benefits
• Uses STREAM to generate the maximum FLOPS/Byte asymptote (e.g., TRIAD performs 2 FLOPS per 24 bytes, about 0.083 FLOPS/Byte, so the Arrandale's 7536 MB/s sustained bandwidth caps it at roughly 0.63 GFLOP/s)

MEMBENCH:
• Single-core focused
• Measures 33 different transfer types:
  - main memory read/write
  - L2 cache read/write
  - video memory write
  - main memory transfer
  - L2 cache transfer
  - video memory transfer

References

MEMBench. http://www.intelligentfirm.com/membench/membench.html

McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.

S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009.

Contact: [email protected]