Characterizing Memory Hierarchies of Multicore Processors Using Microbenchmarks
Christopher Celio, Aleah Caulin, Krste Asanovic, Dave Patterson
What does the hardware really look like?

GOAL: design and implement portable software microbenchmarks to determine the important characteristics of the hardware:
•  Number of cache levels
•  Data cache sizes at each level in the cache hierarchy
•  Access time to each level in the cache hierarchy
•  Off-chip bandwidth for unit-stride accesses
•  Cache-to-cache transfer latency
•  Cache-to-cache transfer bandwidth
•  Request bandwidth
•  Is a cache inclusive?
•  Is a cache shared?
Cache Size and Access Latency
Microbenchmark
•  One thread performs pointer chasing on an array
   •  Unit stride (optimal locality)
   •  16-element stride (different cache lines)
   •  Randomly permuted array (avoids prefetching)
•  Measure the time required to complete a fixed amount of work (pointer chasing for N iterations)
// 'stride' is an input parameter.
// The 'multiply' instruction is removed by enumerating each case;
// for a constant stride, a 'shift' is synthesized instead.
if (stride == 16)
{
    uint32_t const stride_16 = 16;
    start_time = get_seconds();
    for (uint32_t k = 0; k < g_num_iterations; k++)
    {
        idx = g_arr_n_ptr[idx * stride_16];
    }
    stop_time = get_seconds();
}
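The initialization behind the "randomly permuted array" case is not shown in the transcript. Below is a minimal, self-contained sketch of one way it could be set up; the Fisher-Yates shuffle, the sizes, and the get_seconds implementation are assumptions rather than the authors' code, and only the global names mirror the snippet above.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// Assumed timer; the poster's get_seconds() implementation is not shown.
static double get_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static uint32_t *g_arr_n_ptr;
static uint32_t  g_num_elements   = 1 << 20;   // illustrative sizes only
static uint32_t  g_num_iterations = 1 << 24;

int main(void)
{
    g_arr_n_ptr    = malloc(g_num_elements * sizeof(uint32_t));
    uint32_t *perm = malloc(g_num_elements * sizeof(uint32_t));

    // Fisher-Yates shuffle of the indices 0..N-1, then link them into one big
    // cycle so every load depends on the previous one (defeats prefetching).
    for (uint32_t i = 0; i < g_num_elements; i++) perm[i] = i;
    for (uint32_t i = g_num_elements - 1; i > 0; i--) {
        uint32_t j = rand() % (i + 1);
        uint32_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (uint32_t i = 0; i < g_num_elements; i++)
        g_arr_n_ptr[perm[i]] = perm[(i + 1) % g_num_elements];

    // Dependent pointer chase, as in the snippet above.
    uint32_t idx = perm[0];
    double start_time = get_seconds();
    for (uint32_t k = 0; k < g_num_iterations; k++)
        idx = g_arr_n_ptr[idx];
    double stop_time = get_seconds();

    // Average latency per access, in nanoseconds (idx printed so the chase is not optimized away).
    printf("%.2f ns per access (idx=%u)\n",
           (stop_time - start_time) / g_num_iterations * 1e9, idx);
    free(perm);
    free(g_arr_n_ptr);
    return 0;
}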
Cache-to-Cache Transfer Latency
Microbenchmark
•  One thread writes an allocated array while a second thread waits on a spin lock
•  When it is the other thread's turn, it writes to the array
•  Forced serialization between array elements is required to measure latency
•  Writing privileges alternate between threads many times
•  Ensures a "cold start" for each thread – no data in L1
•  Measure the time required to complete a fixed amount of work (N iterations of swapping array write privileges)
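No code for this serialized experiment survives in the transcript. A minimal sketch of the general technique, two threads bouncing one dirty cache line back and forth through a spin flag, follows; all names, sizes, and the flag protocol are assumptions, not the authors' implementation.

// Sketch: two threads alternately write a shared line, passing ownership through a
// flag. Each hand-off forces a cache-to-cache transfer; the total time divided by
// the number of hand-offs approximates the transfer latency.
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define NUM_HANDOFFS 1000000

static volatile int turn = 0;                                   // whose turn it is to write (0 or 1)
static volatile uint64_t shared_line[8] __attribute__((aligned(64)));

static double get_seconds(void)                                 // assumed timer implementation
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *pingpong(void *arg)
{
    int me = (int)(intptr_t)arg;
    for (int i = 0; i < NUM_HANDOFFS; i++) {
        while (turn != me) ;                                    // spin until the other core releases the line
        shared_line[0]++;                                       // dirty the line in this core's cache
        turn = 1 - me;                                          // hand ownership to the other thread
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    double start_time = get_seconds();
    pthread_create(&t0, NULL, pingpong, (void *)0);
    pthread_create(&t1, NULL, pingpong, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    double stop_time = get_seconds();

    printf("%.1f ns per hand-off\n",
           (stop_time - start_time) / (2.0 * NUM_HANDOFFS) * 1e9);
    return 0;
}

For meaningful numbers the two threads must be pinned to distinct physical cores, not hyperthread siblings (see Challenges below).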
Motivation & Useful Applications
Determining Hardware Specs
•  Measure unpublished specs
   •  e.g., cache-to-cache transfer latency
•  Produce realistic specs
   •  Published theoretical bandwidth numbers may not be sustainable

Verification of Simulators
•  Microbenchmarks can characterize existing machines to provide a calibration for simulator targets
•  Useful for finding bugs that produce incorrect simulation results

More Informed Auto-tuners
•  Auto-tuners may be restricted to run on a subset of the hardware
[Figures: measured access latency vs. array size, with plateaus marking the L1 Cache, L2 Cache, and L3 Cache plus locking overhead, for randomized, cache-line stride, and unit-stride access patterns; cache-to-cache transfer latency on the TilePro64 for tile 0,0 to tile 0,1, tile 0,0 to tile 7,6, and tile 0,0 to 0,1 with direct messaging.]

Derived processor specs:

Processor             data cache size (L1 / L2 / L3)   access latency (L1 / L2 / L3)
Intel i7 Arrandale    32 KB / 128 KB / 4 MB            1.2 ns / 3.0 ns / 15.8 ns  (3.5 / 8.5 / 44.4 cycles)
Intel i7 Bloomfield   32 KB / 256 KB / 8 MB            1.1 ns / 2.9 ns / 13.6 ns  (3.8 / 9.6 / 45.2 cycles)
TilePro64             8 KB / 64 KB / 2-3.5 MB          4.3 ns / 12.9 ns / 70.2 ns (3.0 / 10.0 / 49.1 cycles)

TILE64 pipeline latencies provided by the Tilera architecture manual:

Operation (LD to use)         Latency
L1 hit                        2 cycles
L1 miss, L2 hit               8 cycles
L1/L2 miss, adjacent $ hit    35 cycles
Request Bandwidth
Microbenchmark
•  One thread performs multiple pointer chases in parallel on an array
•  Determines saturation of requests per second
uint32_t *idx = malloc(num_requests * sizeof(uint32_t));

// Start each chase at a random element.
for (int i = 0; i < num_requests; i++)
    idx[i] = (rand() % g_num_elements);

start_time = get_seconds();
for (uint32_t k = 0; k < g_num_iterations; k++)
{
    // num_requests independent pointer chases advance in lock-step,
    // keeping multiple memory requests in flight at once.
    for (uint32_t i = 0; i < num_requests; i++)
        idx[i] = g_arr_n_ptr[idx[i]];
}
stop_time = get_seconds();
STREAM:
•  Sustainable Memory Bandwidth (MB/s)
•  Measures unit-strided access bandwidth from off-chip DRAM

Name    Kernel                  Bytes/iter   FLOPS/iter
COPY    a(i) = b(i)             16           0
SCALE   a(i) = q*b(i)           16           1
SUM     a(i) = b(i) + c(i)      24           1
TRIAD   a(i) = b(i) + q*c(i)    24           2

Sustainable Bandwidth (MB/s):

Processor              # Threads   Copy    Scale   Add     Triad
Intel i7 Arrandale     4           7292    7011    7511    7536
Intel i7 975 Ext. Ed   8           12470   12130   12431   12641
TilePro64              56          501     490     544     536
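For reference, the TRIAD row of the kernel table corresponds to a loop like the one below; the array size and the OpenMP pragma are illustrative choices, not the configuration used on the poster.

// Sketch of the STREAM TRIAD kernel: a(i) = b(i) + q*c(i).
// With double-precision arrays this moves 24 bytes and does 2 FLOPs per iteration.
#define N (1 << 24)                  // illustrative size; must exceed the last-level cache

void triad(double *a, const double *b, const double *c, double q)
{
    #pragma omp parallel for         // one unit-stride stream per thread
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];
}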
Related Work

Roofline:
•  Visual representation of realistic performance and productivity expectations
•  Determines hardware limitations for a specific kernel
•  Prioritizes optimizations by showing their potential benefits
•  Uses STREAM to generate the maximum FLOPS/Byte asymptote
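The roofline bound itself is just the minimum of peak compute and arithmetic intensity times sustained bandwidth. A small sketch of that formula follows; the 50 GFLOP/s peak is an illustrative placeholder, and only the 12.6 GB/s figure echoes the i7 975 TRIAD result above.

// Sketch: attainable GFLOP/s under the roofline model is
//   min(peak GFLOP/s, arithmetic intensity (FLOPs/Byte) * sustained GB/s).
#include <stdio.h>

static double roofline_gflops(double peak_gflops, double stream_gbps, double flops_per_byte)
{
    double mem_bound = flops_per_byte * stream_gbps;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    // TRIAD: 2 FLOPs per 24 bytes, i.e. ~0.083 FLOPs/Byte.
    double ai = 2.0 / 24.0;
    printf("bound = %.2f GFLOP/s\n", roofline_gflops(50.0, 12.6, ai));
    return 0;
}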
MEMBENCH:
•  Single-core focused
•  Measures 33 different transfer types, including:
   •  main memory read/write
   •  L2 cache read/write
   •  video memory write
   •  main memory transfer
   •  L2 cache transfer
   •  video memory transfer
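As a rough illustration of the kind of transfer MEMBENCH times (this is not MEMBENCH's code), a main-memory copy-bandwidth measurement could look like the following; the buffer size and timer are assumptions.

// Sketch: time a large memcpy whose working set exceeds the last-level cache,
// then report copy bandwidth.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    size_t bytes = 256UL << 20;              // 256 MB: well beyond any cache level
    char *src = malloc(bytes), *dst = malloc(bytes);
    memset(src, 1, bytes);                   // touch pages so they are mapped
    memset(dst, 0, bytes);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    memcpy(dst, src, bytes);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    // 2x because every byte is read from src and written to dst.
    printf("copy bandwidth: %.0f MB/s\n", 2.0 * bytes / secs / 1e6);
    free(src);
    free(dst);
    return 0;
}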
References
MEMBench. http://www.intelligentfirm.com/membench/membench.html

McCalpin, John D., "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.

S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009.
Challenges
1.  Complex interactions of the memory hierarchy can yield convoluted results:
    •  Virtual memory and TLBs
    •  Memory prefetchers
    •  Adaptive cache replacement policies
    •  Cache coherence protocols
    •  Memory controller interactions
2.  Accurate measurements of runtime are limited by the precision of the system clock
3.  Hyperthreading needs to be avoided for certain measurements (e.g., cache-to-cache latency)
4.  Locking overhead
5.  OS scheduling
6.  Portability across platforms
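Challenges 2, 3, and 5 are commonly addressed in the benchmark harness itself with a high-resolution timer and explicit core pinning. A Linux-specific sketch follows; the helper names are assumptions and the poster does not show this code.

// Sketch: a high-resolution timer and a helper that pins the calling thread to one
// core, so hyperthread sharing and OS migration do not pollute the measurements.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>

static double get_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);     // ~ns resolution, immune to wall-clock steps
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    // Pin the calling thread; choose core_ids on distinct physical cores
    // (not hyperthread siblings) for cache-to-cache experiments.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}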
Cache-to-Cache Bandwidth
Microbenchmark
•  One thread writes an allocated array while a second thread waits on a spin lock
•  Array access is unit-strided and not serialized
// initialization code...
int i = 0;
for (i = 0; i < num_elements - 1; i++)
    array[i] = i + STRIDE;
array[num_elements - 1] = 0;

// after initialization...
tid = pthread_self();
double start_time = get_seconds();
while (count < num_iters)
{
    if (tid == waiting_thread)
        while (*mylock == 0);      // spin until the other thread hands off the array

    i = 0;                         // restart the sweep each hand-off (assumed; not shown on the poster)
    do {
        array[i] = count;          // unit-stride writes, not serialized
        i++;
    } while (i < num_elements - STRIDE);

    count++;
    waiting_thread = tid;          // this thread waits on the next hand-off
    *mylock = 0;
    *otherLock = 1;                // release the other thread
}
double stop_time = get_seconds();

Measured cache-to-cache write bandwidth:

Processor             C2C Write Bandwidth
Intel i7 Arrandale    1789 MB/s
?                     2150 MB/s
?                     246 MB/s
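The thread creation and lock setup are elided above ("initialization code..."). The sketch below shows one way the two threads and spin flags could be wired together and the bandwidth computed; variable names loosely follow the snippet, but the wiring, sizes, and timer are assumptions, not the poster's code.

// Sketch: two threads alternately sweep a shared array, handing write privileges
// back and forth through spin flags, so each sweep starts with the data resident
// in the other core's cache ("cold start").
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define STRIDE 16

static uint32_t *array;
static uint32_t  num_elements = 1 << 18;     // ~1 MB working set (illustrative)
static uint32_t  num_iters    = 1000;        // hand-offs per thread

static volatile int flags[2] = { 1, 0 };     // flags[me] == 1 => my turn to write

static double get_seconds(void)              // assumed timer implementation
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *thread_main(void *p)
{
    int me = (int)(intptr_t)p;
    for (uint32_t count = 0; count < num_iters; count++) {
        while (flags[me] == 0) ;                          // spin until it is my turn
        for (uint32_t i = 0; i < num_elements - STRIDE; i++)
            array[i] = count;                             // unit-stride, unserialized writes
        flags[me] = 0;                                    // hand off to the other thread
        flags[1 - me] = 1;
    }
    return NULL;
}

int main(void)
{
    array = malloc(num_elements * sizeof(uint32_t));
    pthread_t t0, t1;
    double start_time = get_seconds();
    pthread_create(&t0, NULL, thread_main, (void *)0);
    pthread_create(&t1, NULL, thread_main, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    double stop_time = get_seconds();

    double bytes = 2.0 * num_iters * (num_elements - STRIDE) * sizeof(uint32_t);
    printf("C2C write bandwidth: %.0f MB/s\n", bytes / (stop_time - start_time) / 1e6);
    free(array);
    return 0;
}

In a production harness the two spin flags would be padded to separate cache lines and each thread pinned to its own physical core; both refinements are omitted here for brevity.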