Improving communication performance in dense linear algebra via topology-aware collectives
Edgar Solomonik*, Abhinav Bhatele†, and James Demmel*
(*University of California, Berkeley; †Lawrence Livermore National Laboratory)
Supercomputing, November 2011

Outline

- Collective communication
  - Rectangular collectives
- 2.5D algorithms
  - 2.5D matrix multiplication
  - 2.5D LU factorization
- Modelling exascale
  - Multicast performance
  - MM and LU performance

Performance of multicast (BG/P vs Cray)

[Figure: bandwidth of a 1 MB multicast on BG/P, Cray XT5, and Cray XE6, from 8 to 4096 nodes. BG/P sustains far higher bandwidth than either Cray machine as the node count grows.]

Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts:
  - form a spanning tree from a list of nodes
  - route copies of the message down each branch
  - network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts:
  - require the network topology to be a k-ary n-cube
  - form 2n edge-disjoint spanning trees, routed in different dimensional orders and using both directions of the bidirectional network

2D rectangular multicast trees

[Figure: a 4x4 2D torus with four edge-disjoint spanning trees rooted at the same node, shown individually and all combined.] [Watts and van de Geijn 95]
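To make the rectangular scheme concrete, here is a minimal Python sketch (not from the talk) of one way to enumerate the 2n edge-disjoint spanning trees on a k-ary n-cube: one tree per cyclic rotation of the dimension order, in each of the two link directions. The actual DCMF implementation is more involved; this only illustrates the dimension-order idea.

```python
def tree_orders(n):
    """Dimension orders for the 2n edge-disjoint spanning trees of a
    k-ary n-cube: one cyclic rotation of (0, ..., n-1) per dimension,
    in each of the two link directions."""
    dims = list(range(n))
    for r in range(n):
        order = dims[r:] + dims[:r]
        for direction in (+1, -1):
            yield order, direction

# On a 3D torus this yields 6 trees; a rectangular multicast splits the
# message into 6 chunks that travel down the trees contention-free.
for order, direction in tree_orders(3):
    print(order, "+" if direction > 0 else "-")
```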
Another look at that first plot

How much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on the XE6 reach about 1/30th of the link bandwidth.
- Rectangular collectives on BG/P reach about 4X the link bandwidth.
- That is a 120X improvement in communication efficiency.

How can we apply this?

Matrix multiplication

[Figure: blocked 2D layout of matrices A and B.]

2D matrix multiplication

[Figure: 16 CPUs in a 4x4 grid; blocks of A and B are shifted among the processors.] [Cannon 69], [van de Geijn and Watts 97]

3D matrix multiplication

[Figure: 64 CPUs in a 4x4x4 grid, holding 4 copies of the matrices.] [Agarwal et al. 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89]

2.5D matrix multiplication

[Figure: 32 CPUs in a 4x4x2 grid, holding 2 copies of the matrices; the layout interpolates between 2D and 3D.]

Strong scaling matrix multiplication

[Figure: 2.5D MM on BG/P (n = 65,536); percentage of machine peak on 256 to 2048 nodes for 2.5D SUMMA, 2D SUMMA, and ScaLAPACK PDGEMM.]

2.5D MM on 65,536 cores

[Figure: matrix multiplication on 16,384 nodes of BG/P; percentage of machine peak for 2.5D MM (using 16 matrix copies) and 2D MM.] 2.5D MM is 12X faster at n = 8,192 and 2.7X faster at n = 131,072.

Cost breakdown of MM on 65,536 cores

[Figure: execution time on 16,384 nodes of BG/P, normalized by 2D and split into communication, idle, and computation, for n = 8,192 and n = 131,072 in 2D and 2.5D.] 2.5D reduces communication time by 95%.
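The communication savings behind these plots follow the standard 2.5D analysis: with c replicated copies of the matrices, the per-processor communication volume shrinks by a factor of √c relative to 2D. A small Python sketch of this asymptotic model (constants omitted; illustrative, not from the slides):

```python
import math

def words_moved_per_proc(n, p, c):
    """Asymptotic per-processor communication volume for 2.5D matrix
    multiplication with c matrix copies; c = 1 gives the 2D
    (Cannon/SUMMA) cost, c = p**(1/3) the 3D cost. Constants omitted."""
    assert 1 <= c <= p ** (1.0 / 3.0) + 1e-9, "need 1 <= c <= p^(1/3)"
    return n * n / math.sqrt(c * p)

# The setting of the BG/P experiments: 16,384 nodes, up to 16 copies.
n, p = 131072, 16384
for c in (1, 4, 16):
    print(f"c={c:2d}: ~{words_moved_per_proc(n, p, c)/1e6:.0f}M words moved")
```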
2.5D LU factorization

[Figure: steps (A) and (B) of 2.5D LU: the corner block is factored into L00 and U00, and the panels L10..L30 and U01..U03 are computed on replicated layers of the processor grid.]

2.5D LU factorization

Look at how this update is distributed: it is the same 3D update as in 2.5D matrix multiplication.

[Figure: the trailing-matrix update of 2.5D LU, distributed identically to the 2.5D matrix multiplication above.]

2.5D LU factorization

[Figure: steps (C) and (D), completing the distributed L and U factors.] [Solomonik and Demmel, Euro-Par '11, Distinguished Paper]

2.5D LU on 65,536 cores

[Figure: LU on 16,384 nodes of BG/P (n = 131,072); time split into communication, idle, and compute for 2D and 2.5D, without pivoting and with CA-pivoting.] 2.5D LU is 2X faster in both cases.

Rectangular (RCT) vs binomial (BNM) collectives

[Figure: percentage of machine peak on BG/P (n = 131,072, p = 16,384) for MM, LU, and LU+PVT, comparing 2.5D RCT, 2.5D BNM, and 2D BNM.]

A model for rectangular multicasts

t_mcast = m/B_n + 2(d+1)·o + 3L + d·P^{1/d}·(2o + L)

Our multicast model consists of three terms:
1. m/B_n, the bandwidth cost
2. 2(d+1)·o + 3L, the multicast start-up overhead
3. d·P^{1/d}·(2o + L), the path overhead

A model for binomial multicasts

t_bnm = log2(P)·(m/B_n + 2o + L)

- The root of the binomial tree sends log2(P) copies of the message.
- The setup overhead is overlapped with the path overhead.
- We assume no contention.
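Both models are straightforward to evaluate numerically. A minimal Python sketch of the two formulas above; the parameter values chosen below (link bandwidth B_n, overhead o, latency L) are illustrative placeholders, not measured BG/P constants:

```python
import math

def t_rect(m, P, d, Bn, o, L):
    """Rectangular multicast model: bandwidth term, start-up overhead,
    and path overhead, as on the slide."""
    return m / Bn + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

def t_bnm(m, P, Bn, o, L):
    """Binomial multicast model: the root sends log2(P) message copies."""
    return math.log2(P) * (m / Bn + 2 * o + L)

# Illustrative parameters only: d-dimensional torus of P nodes, per-link
# bandwidth Bn in bytes/s, overhead o and latency L in seconds.
P, d, Bn, o, L = 512, 3, 425e6, 1e-6, 2e-6
for m in (1e3, 1e6, 256e6):
    print(f"m={m:9.0f}B  rect={t_rect(m, P, d, Bn, o, L):.2e}s  "
          f"bnm={t_bnm(m, P, Bn, o, L):.2e}s")
```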
Model verification: one dimension

[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P; measured bandwidth vs message size for the rectangular (dput) and binomial protocols, against the t_rect and t_bnm models.]

Model verification: two dimensions

[Figure: the same comparison on 64 (8x8) nodes of BG/P.]

Model verification: three dimensions

[Figure: the same comparison on 512 (8x8x8) nodes of BG/P, together with data from Faraj et al.]

Modelling collectives at exascale (p = 262,144)

[Figure: predicted exascale broadcast bandwidth vs message size for rectangular multicasts on 1D through 6D tori and for a binomial multicast on a 3D torus.]

Modelling matrix multiplication at exascale

[Figure: parallel efficiency of MM strong scaling at exascale, scaling from an xy plane to the full xyz torus, for 2.5D with rectangular collectives (c = z), 2.5D with binomial collectives, and 2D with binomial collectives.]

Modelling LU factorization at exascale

[Figure: the same strong-scaling comparison for LU factorization.]

Conclusion

- Topology-aware scheduling:
  - present in IBM Blue Gene but not in Cray supercomputers
  - avoids network contention/congestion
  - enables optimized communication collectives
  - leads to simple communication performance models
- Future work:
  - an automated framework for topology-aware mapping
  - mapping of tensor computations
  - better models for network contention

Acknowledgements

- Krell CSGF DOE fellowship (DE-AC02-06CH11357)
- Resources at Argonne National Lab and Lawrence Berkeley National Lab (DE-SC0003959, DE-SC0004938, DE-FC02-06-ER25786, DE-SC0001845, DE-AC02-06CH11357)
- Berkeley ParLab funding: Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery (Award #DIG07-10227)
- Released by Lawrence Livermore National Laboratory as LLNL-PRES-514231

Backup slides: reductions

A model for rectangular reductions

t_red = max[m/(8γ), 3m/β, m/B_n] + 2(d+1)·o + 3L + d·P^{1/d}·(2o + L)

- Any multicast tree can be inverted to produce a reduction tree.
- The reduction operator must be applied at each node: each node operates on 2m data, and both the memory-bandwidth and computation costs can be overlapped.
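The reduction model can be evaluated the same way as the multicast models above; a minimal Python sketch, with γ (flop rate) and β (memory bandwidth) named per the slide's notation and all parameter values left to the caller:

```python
def t_red(m, P, d, Bn, o, L, gamma, beta):
    """Rectangular reduction model from the slide: the pipelined
    bandwidth term is the slowest of compute (m/(8*gamma)), memory
    (3*m/beta), and network (m/Bn); the overheads match t_mcast."""
    return (max(m / (8 * gamma), 3 * m / beta, m / Bn)
            + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L))
```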
Rectangular reduction performance on BG/P

[Figure: DCMF Reduce peak bandwidth (at the largest message size) vs node count for the torus rectangle ring, torus binomial, torus rectangle, and torus short binomial protocols.] BG/P rectangular reduction performs significantly worse than multicast.

Performance of custom line reduction

[Figure: bandwidth vs message size of a custom Reduce/Multicast on 8 nodes: MPI Broadcast, custom ring multicast, two custom ring reductions, and MPI Reduce.]

A new LU latency lower bound

[Figure: an n x n matrix whose critical path passes through the d diagonal blocks A00, A11, ..., A(d-1,d-1).]

- The flops lower bound requires d = Ω(√p) blocks/messages.
- The bandwidth lower bound requires d = Ω(√(cp)) blocks/messages.

2.5D LU strong scaling (without pivoting)

[Figure: LU without pivoting on BG/P (n = 65,536); percentage of machine peak on 256 to 2048 nodes for ideal scaling, 2.5D LU, and 2D LU.]

Strong scaling of 2.5D LU with tournament pivoting

[Figure: LU with tournament pivoting on BG/P (n = 65,536); percentage of machine peak on 256 to 2048 nodes for ideal scaling, 2.5D LU, 2D LU, and ScaLAPACK PDGETRF.]
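For reference, the tournament (CA) pivoting used in these runs selects the b pivot rows of each panel through a tournament over row blocks rather than a column-by-column search. Below is a hedged, serial numpy sketch of the selection step only; in the real algorithm each round of the tournament is a step of a parallel reduction tree, and the helper names here are mine, not from the paper.

```python
import numpy as np

def gepp_pivot_rows(A, b):
    """Indices of the b pivot rows that Gaussian elimination with
    partial pivoting would pick from block A (>= b rows and columns)."""
    A = np.array(A, dtype=float)
    rows = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial-pivot choice
        A[[k, p]] = A[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        A[k + 1:, k:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k:])
    return rows[:b]

def tournament_pivots(panel, b):
    """Pair up row blocks, run GEPP on each stacked pair, keep the
    winning rows, and repeat until one set of b pivot rows remains."""
    blocks = [np.arange(i, min(i + b, panel.shape[0]))
              for i in range(0, panel.shape[0], b)]
    while len(blocks) > 1:
        paired = []
        for i in range(0, len(blocks), 2):
            if i + 1 == len(blocks):             # odd block advances as-is
                paired.append(blocks[i])
                continue
            cand = np.concatenate([blocks[i], blocks[i + 1]])
            paired.append(cand[gepp_pivot_rows(panel[cand], b)])
        blocks = paired
    return blocks[0]

rng = np.random.default_rng(0)
print(tournament_pivots(rng.standard_normal((32, 4)), 4))
```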