Reducing communication in dense matrix/tensor computations
Edgar Solomonik
UC Berkeley
Aug 11th, 2011
Outline
- Topology-aware collectives
  - Rectangular collectives
    - Multicasts
    - Reductions
- 2.5D algorithms
  - 2.5D matrix multiplication
  - 2.5D LU factorization
- Tensor contractions
  - Algorithms for distributed tensor contractions
  - A tensor contraction library implementation
- Conclusions and future work
Performance of multicast (BG/P vs Cray)
[Figure: 1 MB multicast on BG/P, Cray XT5, and Cray XE6; bandwidth (MB/sec) vs. #nodes (8 to 4096)]
Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts
  - Form spanning tree from a list of nodes
  - Route copies of message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Require network topology to be a k-ary n-cube
  - Form 2n edge-disjoint spanning trees
    - Route in different dimensional order
    - Use both directions of bidirectional network
2D rectangular multicast trees
[Figure: a 2D 4x4 torus with four edge-disjoint spanning trees from the root (spanning trees 1-4, and all 4 trees combined)]
A model for rectangular multicasts
$t_{mcast} = m/B_n + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

Our multicast model consists of 3 terms:
1. $m/B_n$, the bandwidth cost incurred at the root
2. $2(d+1) \cdot o + 3L$, the start-up overhead of setting up the multicasts in all dimensions
3. $d \cdot P^{1/d} \cdot (2o + L)$, the path overhead, which reflects the time for a packet to get from the root to the farthest destination node
A model for binomial multicasts
$t_{bnm} = \log_2(P) \cdot (m/B_n + 2o + L)$

- The root of the binomial tree sends the entire message $\log_2(P)$ times
- The setup overhead is overlapped with the path overhead
- We assume no contention
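
To make the two models easy to compare, here is a small sketch that evaluates both formulas. The machine parameters ($B_n$, $o$, $L$) and the message/torus sizes below are illustrative assumptions, not measured BG/P or Cray values.

import math

def t_mcast(m, P, d, B_n, o, L):
    # Rectangular multicast model: bandwidth term at the root, start-up
    # overhead per dimension, and path overhead to the farthest node of
    # a d-dimensional torus of P nodes.
    return m / B_n + 2 * (d + 1) * o + 3 * L + d * P**(1.0 / d) * (2 * o + L)

def t_bnm(m, P, B_n, o, L):
    # Binomial multicast model: the root resends the whole message
    # log2(P) times; contention is ignored.
    return math.log2(P) * (m / B_n + 2 * o + L)

# Illustrative (assumed) parameters: 1 MB message on a 512-node 3D torus,
# 400 MB/s per link, 1 us software overhead, 2 us network latency.
m, P, d = 1e6, 512, 3
B_n, o, L = 400e6, 1e-6, 2e-6
print("rectangular: %.3f ms" % (1e3 * t_mcast(m, P, d, B_n, o, L)))
print("binomial:    %.3f ms" % (1e3 * t_bnm(m, P, B_n, o, L)))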
Model verification: one dimension
[Figure: DCMF Broadcast on a ring of 8 nodes of BG/P; bandwidth (MB/sec) vs. msg size (KB) for the $t_{rect}$ model, DCMF rectangle dput, the $t_{bnm}$ model, and DCMF binomial]
Model verification: two dimensions
[Figure: DCMF Broadcast on 64 (8x8) nodes of BG/P; bandwidth (MB/sec) vs. msg size (KB) for the $t_{rect}$ model, DCMF rectangle dput, the $t_{bnm}$ model, and DCMF binomial]
Model verification: three dimensions
[Figure: DCMF Broadcast on 512 (8x8x8) nodes of BG/P; bandwidth (MB/sec) vs. msg size (KB) for the $t_{rect}$ model, data from Faraj et al., DCMF rectangle dput, the $t_{bnm}$ model, and DCMF binomial]
A model for rectangular reductions
$t_{red} = \max[m/(8\gamma),\ 3m/\beta,\ m/B_n] + 2(d+1) \cdot o + 3L + d \cdot P^{1/d} \cdot (2o + L)$

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - each node operates on 2m data
  - both the memory bandwidth and computation cost can be overlapped
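
A minimal extension of the earlier model sketch; interpreting $\gamma$ as the per-node flop rate and $\beta$ as the memory bandwidth (an assumption about the slide's symbols), the max expresses that the compute, memory-traffic, and network-injection costs at each node are overlapped.

def t_red(m, P, d, B_n, o, L, beta, gamma):
    # Rectangular reduction model: same overhead and path terms as the
    # multicast model, but the root-side term is the max of the
    # (overlapped) compute, memory-bandwidth, and injection costs.
    root = max(m / (8 * gamma), 3 * m / beta, m / B_n)
    return root + 2 * (d + 1) * o + 3 * L + d * P**(1.0 / d) * (2 * o + L)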
Rectangular reduction performance on BG/P
[Figure: DCMF Reduce peak bandwidth (largest message size); bandwidth (MB/sec) vs. #nodes (8 to 512) for torus rectangle ring, torus binomial, torus rectangle, and torus short binomial]
BG/P rectangular reduction performs significantly worse than
multicast
Performance of custom line reduction
[Figure: performance of custom Reduce/Multicast on 8 nodes; bandwidth (MB/sec) vs. msg size (KB) for MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 2, Custom Ring Reduce 1, and MPI Reduce]
Another look at that first plot
Just how much better are rectangular algorithms on P = 4096 nodes?
- Binomial collectives on XE6: 1/30th of link bandwidth
- Rectangular collectives on BG/P: 4.3X the link bandwidth
- Over 120X improvement in efficiency!

[Figure: the 1 MB multicast comparison on BG/P, Cray XT5, and Cray XE6 again; bandwidth (MB/sec) vs. #nodes (8 to 4096)]

How can we apply this?
2.5D Cannon-style matrix multiplication
[Figure: block-to-processor assignment of A, B, and C for Cannon-style matrix multiplication in 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4) layouts]
Classification of parallel dense matrix algorithms
algs   c                 memory (M)          words (W)               messages (S)
2D     $1$               $O(n^2/P)$          $O(n^2/\sqrt{P})$       $O(\sqrt{P})$
2.5D   $[1, P^{1/3}]$    $O(cn^2/P)$         $O(n^2/\sqrt{cP})$      $O(\sqrt{P/c^3})$
3D     $P^{1/3}$         $O(n^2/P^{2/3})$    $O(n^2/P^{2/3})$        $O(\log(P))$

NEW: 2.5D algorithms generalize 2D and 3D algorithms
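
A small calculator for the table's asymptotic costs (constants omitted); it only restates the table, so the function name and the example parameters are purely illustrative.

import math

def costs_25d(n, P, c):
    # Asymptotic per-processor costs of 2.5D matrix multiplication,
    # with constants dropped: c = 1 reproduces the 2D row of the table
    # and c = P**(1/3) the 3D row.
    memory   = c * n**2 / P
    words    = n**2 / math.sqrt(c * P)
    messages = math.sqrt(P / c**3)
    return memory, words, messages

# Example: n = 65,536 on P = 2048 processors with increasing replication.
for c in (1, 4, 16):
    M, W, S = costs_25d(65536, 2048, c)
    print("c=%2d  memory=%.2e  words=%.2e  messages=%.1f" % (c, M, W, S))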
Minimize communication with:
- minimal memory (2D)
- as much memory as available (2.5D) - flexible
- as much memory as the algorithm can exploit (3D)

Match the network topology of:
- a $\sqrt{P}$-by-$\sqrt{P}$ grid (2D)
- a $\sqrt{P/c}$-by-$\sqrt{P/c}$-by-$c$ grid, most cuboids (2.5D) - flexible
- a $P^{1/3}$-by-$P^{1/3}$-by-$P^{1/3}$ cube (3D)
2.5D SUMMA-style matrix multiplication
[Figure: mapping of the matrix blocks to a 3D partition of BG/P]
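
The data flow can be sketched serially: the virtual grid is $\sqrt{P/c} \times \sqrt{P/c} \times c$, A and B are replicated across the c layers, each layer performs a distinct 1/c of the SUMMA broadcast steps, and the partial results are summed over layers. The sketch below mimics this with explicit loops and numpy; block and grid sizes are illustrative, and no actual communication is performed.

import numpy as np

def summa_25d(A, B, p, c):
    # Serial simulation of 2.5D SUMMA on a virtual p x p x c grid
    # (P = p*p*c processors); n must be divisible by p, and p by c.
    n = A.shape[0]
    b = n // p
    blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
    C = np.zeros_like(A)
    for layer in range(c):
        # Each layer owns a replica of A and B and performs its own
        # p/c of the p broadcast (outer-product) steps.
        for k in range(layer * (p // c), (layer + 1) * (p // c)):
            for i in range(p):        # A(:,k) broadcast along rows
                for j in range(p):    # B(k,:) broadcast along columns
                    blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    # Accumulating into one C mimics the final reduction over layers.
    return C

# Check against numpy on a small example.
n, p, c = 16, 4, 2
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(summa_25d(A, B, p, c), A @ B)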
2.5D MM strong scaling
[Figure: 2.5D MM on BG/P (n=65,536); percentage of machine peak vs. #nodes (256 to 2048) for 2.5D SUMMA, 2.5D Cannon, 2D MM (Cannon), and ScaLAPACK PDGEMM]
2.5D MM on 65,536 cores
[Figure: 2.5D MM on 16,384 nodes of BG/P; percentage of machine peak vs. matrix dimension n (8192 to 131,072) for 2.5D SUMMA, 2.5D Cannon, 2D Cannon, and 2D SUMMA]
Cost breakdown of MM on 65,536 cores
[Figure: SUMMA (2D vs. 2.5D) on 16,384 nodes of BG/P; execution time normalized by 2D, broken down into communication, idle, and computation, for n = 8192, 32,768, and 131,072]
A new latency lower bound for LU
Reduce latency to $O(\sqrt{P/c^3})$ for LU?
- For block size n/d, LU does:
  - $\Omega(n^3/d^2)$ flops
  - $\Omega(n^2/d)$ words
  - $\Omega(d)$ messages
- Now pick d (= latency cost):
  - $d = \Omega(\sqrt{P})$ to minimize flops
  - $d = \Omega(\sqrt{cP})$ to minimize words
- No dice. But let's minimize bandwidth.

[Figure: the critical path of blocked LU passes through the diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of the n-by-n matrix]
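
The tension between these bounds can be written out explicitly; the following LaTeX sketch just combines the three lower bounds above with the 2.5D per-processor cost targets from the earlier table.

% LU with block size n/d, compared against the 2.5D cost targets:
\begin{align*}
  F = \Omega\!\left(\tfrac{n^3}{d^2}\right) \le O\!\left(\tfrac{n^3}{P}\right)
    &\;\Longrightarrow\; d = \Omega\!\left(\sqrt{P}\right) \\
  W = \Omega\!\left(\tfrac{n^2}{d}\right) \le O\!\left(\tfrac{n^2}{\sqrt{cP}}\right)
    &\;\Longrightarrow\; d = \Omega\!\left(\sqrt{cP}\right) \\
  S = \Omega(d) = \Omega\!\left(\sqrt{cP}\right)
    &\;\gg\; O\!\left(\sqrt{P/c^3}\right) \quad \text{for } c > 1
\end{align*}
% So a bandwidth-optimal LU pays Omega(sqrt(cP)) messages along the
% critical path, which is why 2.5D LU targets bandwidth rather than latency.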
2.5D LU factorization without pivoting
1. Factorize A₀₀ and communicate L₀₀ and U₀₀ among layers.
2. Perform TRSMs to compute a panel of L and a panel of U.
3. Broadcast blocks so all layers own the panels of L and U.
4. Broadcast different subpanels within each layer.
5. Multiply subpanels on each layer.
6. Reduce (sum) the next panels.*
7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if an iteration is done with a subset of layers.

[Figure: panels (A)-(D) illustrating these steps on the replicated processor layers]
2.5D LU factorization with tournament pivoting
1. Factorize each block in the first column with pivoting.
2. Reduce to find the best pivot rows.
3. Pivot rows in the first big block column on each layer.
4. Apply TRSMs to compute the first column of L and the first block of a row of U.
5. Update the corresponding interior blocks S = A - L_{k0}·U₀₁.
6. Recurse to compute the rest of the first big block column of L.
7. Pivot rows in the rest of the matrix on each layer.
8. Perform TRSMs to compute a panel of U.
9. Update the rest of the matrix as before and recurse on the next block panel...

[Figure: the pivoted blocks (PA₀-PA₃), TRSMs, and panel updates illustrated on the replicated processor layers]
2.5D LU strong scaling
[Figure: 2.5D LU on BG/P (n=65,536); percentage of machine peak vs. #nodes (256 to 2048) for 2.5D LU (no-pvt), 2.5D LU (CA-pvt), 2D LU (no-pvt), 2D LU (CA-pvt), and ScaLAPACK PDGETRF]
2.5D LU on 65,536 cores
[Figure: 2.5D LU vs. 2D LU on 16,384 nodes of BG/P (n=131,072); time (sec) broken down into communication, idle, and compute for 2D and 2.5D variants with and without communication-avoiding pivoting]
Bridging dense linear algebra techniques and applications
Target application: tensor contractions in electronic structure
calculations (quantum chemistry)
- Often memory constrained
- Most target tensors are oddly shaped
- Need support for high-dimensional tensors
- Need handling of partial/full tensor symmetries
- Would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)
Decoupling memory usage and topology-awareness
- 2.5D algorithms couple memory usage and virtual topology
  - c copies of a matrix implies c processor layers
- Instead, we can nest 2D and/or 2.5D algorithms
- Higher-dimensional algorithms allow smarter topology-aware mapping
Higher-dimensional distributed MM
4D SUMMA-Cannon
How do we map to a 3D partition without using more memory?
- SUMMA (bcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a sub-routine
  - Minimize inefficient (non-rectangular) communication
  - Allow better overlap
- Treats MM as a 4D tensor contraction

[Figure: MM on 512 nodes of BG/P; percentage of flops peak vs. matrix dimension (4096 to 131,072) for 2.5D MM, 4D MM, 2D MM (Cannon), and PBLAS MM]
Symmetry is a problem
- A fully symmetric tensor of dimension d requires only $n^d/d!$ storage
- Symmetry significantly complicates sequential implementation
  - Irregular indexing makes alignment and unrolling difficult
  - Generalizing over all partial-symmetries is expensive
  - Blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids

[Figure: blocked vs. block-cyclic decomposition of a symmetric matrix over processors P0-P3]
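
For intuition on the $n^d/d!$ figure: the number of distinct entries of a fully symmetric order-d tensor with index range n is the number of non-decreasing index tuples, $\binom{n+d-1}{d}$, of which $n^d/d!$ is the leading term. A quick check (sizes are illustrative):

from itertools import combinations_with_replacement
from math import comb, factorial

n, d = 16, 4
packed = sum(1 for _ in combinations_with_replacement(range(n), d))
assert packed == comb(n + d - 1, d)    # exact count of unique entries (3876)
print(packed, n**d / factorial(d))     # 3876 vs. 2730.7 (the n^d/d! leading term)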
Solving the symmetry problem
- A cyclic decomposition allows balanced and regular blocking of symmetric tensors
- If the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor

[Figure: cyclic decomposition of a symmetric matrix]
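
A small numpy check of the second point, for an order-2 symmetric tensor distributed with the same cyclic phase in both dimensions (the grid size and matrix size are illustrative):

import numpy as np

def cyclic_block(T, phase, offsets):
    # Sub-tensor owned by the virtual processor at the given offsets,
    # under a cyclic distribution with the same phase in every dimension.
    return T[tuple(slice(off, None, phase) for off in offsets)]

n, phase = 16, 4
T = np.random.rand(n, n)
T = T + T.T                              # a symmetric order-2 tensor
for i in range(phase):
    for j in range(phase):
        Bij = cyclic_block(T, phase, (i, j))
        Bji = cyclic_block(T, phase, (j, i))
        # Blocks come in transpose pairs (and diagonal blocks are
        # symmetric), so the cyclic layout preserves the symmetry
        # blockwise while keeping every block the same size.
        assert Bij.shape == (n // phase, n // phase)
        assert np.allclose(Bij, Bji.T)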
A generalized cyclic layout is still challenging
- In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape
Virtual processor grid dimensions
- Our virtual cyclic topology is somewhat restrictive and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
  - If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
  - Allows the physical processor grid to be 'stretchable'
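
One simple way to realize this indirection (a sketch of the idea, not the library's actual mapping logic): pick the smallest virtual phase that is a multiple of both the physical grid dimension and the phase required by the symmetry/contraction constraints.

from math import lcm

def virtual_phase(physical_dim, required_phase):
    # Smallest cyclic phase that folds evenly onto the physical grid
    # dimension while satisfying the required phase.
    return lcm(physical_dim, required_phase)

# Example: two symmetric tensor dimensions mapped onto physical torus
# dimensions of length 4 and 6 must share one cyclic phase.
phase = virtual_phase(6, virtual_phase(4, 1))     # -> 12
# The 4-long physical dimension then carries 12/4 = 3 virtual processors
# per physical processor, i.e. the grid is "stretched" from 4 to 12.
print(phase, phase // 4, phase // 6)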
Constructing a virtual processor grid for MM
Matrix multiply on 2x3 processor grid. Red lines represent
virtualized part of processor grid. Elements assigned to blocks by
cyclic phase.
[Figure: A x B = C blocked over the virtualized processor grid]
Unfolding the processor grid
- Higher-dimensional fully-symmetric tensors can be mapped onto a lower-dimensional processor grid via creation of new virtual dimensions
- Lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- However, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping
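
Unfolding itself is just index arithmetic; the tiny sketch below (illustrative, not the library's interface) serializes a pair of torus coordinates into a single coordinate of the combined dimension.

def unfold(coord1, coord2, p2):
    # Serialize coordinates from two processor-grid dimensions of sizes
    # p1 and p2 into one coordinate of an unfolded dimension of size
    # p1*p2 (row-major order).
    return coord1 * p2 + coord2

# Viewing an 8x8x8 torus as a 64x8 grid: fold the first two dimensions.
assert unfold(3, 5, 8) == 29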
A basic parallel algorithm for symmetric tensor contractions
1. Arrange the processor grid in any k-ary n-cube shape
2. Map (via unfold & virt) both A and B cyclically along the dimensions being contracted
3. Map (via unfold & virt) the remaining dimensions of A and B cyclically
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping
   - Each contraction dimension is represented with a nested call to a local multiply or a parallel algorithm (e.g. Cannon); a serial sketch of this recursion follows the list.
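
A serial stand-in for that recursion: each level blocks one contracted dimension over one processor-grid dimension and sums the per-block partials (standing in for a parallel algorithm such as Cannon), and the base case is the local multiply, played here by numpy's tensordot. Sizes and the example contraction are illustrative.

import numpy as np

def contract_recursive(A, B, contracted_axes, grid, level=0):
    # Recursively contract A and B over the (axis_of_A, axis_of_B) pairs
    # in contracted_axes.  Each level splits one contracted dimension
    # over one processor-grid dimension of the given size and sums the
    # per-block partial results; the base case multiplies locally over
    # all contracted axes (restricted to the current blocks).
    if level == len(grid):
        axA, axB = zip(*contracted_axes)
        return np.tensordot(A, B, axes=(axA, axB))
    (axA, axB), p = contracted_axes[level], grid[level]
    blocksA = np.array_split(A, p, axis=axA)
    blocksB = np.array_split(B, p, axis=axB)
    return sum(contract_recursive(a, b, contracted_axes, grid, level + 1)
               for a, b in zip(blocksA, blocksB))

# C[i,j] = sum_{k,l} A[i,k,l] * B[j,k,l], with k and l each blocked over
# a processor-grid dimension of size 2.
A = np.random.rand(5, 6, 8)
B = np.random.rand(7, 6, 8)
C = contract_recursive(A, B, [(1, 1), (2, 2)], grid=[2, 2])
assert np.allclose(C, np.einsum('ikl,jkl->ij', A, B))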
Tensor library structure
The library supports arbitrary-dimensional parallel tensor
contractions with any symmetries on n-cuboid processor torus
partitions
1. Load tensor data by (global rank, value) pairs
2. Once a contraction is defined, map participating tensors
3. Distribute or reshuffle tensor data/pairs
4. Construct contraction algorithm with recursive function/args
pointers
5. Contract the sub-tensors with a user-defined sequential
contract function
6. Output (global rank, value) pairs on request
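
To illustrate the intended workflow only, here is a toy, entirely hypothetical stand-in for such an interface (the class and function names below are invented for this sketch and are not the library's API); a real implementation would map, redistribute, and contract distributed blocks rather than a local array.

import numpy as np

class ToyTensor:
    # Hypothetical stand-in for a distributed tensor object; data is
    # kept in one local dense array purely for illustration.
    def __init__(self, shape):
        self.shape = shape
        self.data = np.zeros(shape)

    def write_pairs(self, pairs):
        # Step 1: load tensor data as (global rank, value) pairs.
        for rank, value in pairs:
            self.data[np.unravel_index(rank, self.shape)] = value

    def read_pairs(self):
        # Step 6: output (global rank, value) pairs on request.
        return list(enumerate(self.data.ravel()))

def contract(A, B, C, seq_contract):
    # Steps 2-5 collapsed: mapping and redistribution are no-ops in this
    # toy version; the user-defined sequential kernel does the work.
    C.data[:] = seq_contract(A.data, B.data)

# Example: C["ij"] = sum_k A["ik"] * B["kj"] with a user-defined kernel.
A, B, C = ToyTensor((4, 3)), ToyTensor((3, 5)), ToyTensor((4, 5))
A.write_pairs(enumerate(np.random.rand(12)))
B.write_pairs(enumerate(np.random.rand(15)))
contract(A, B, C, lambda a, b: a @ b)
assert np.allclose(C.data, A.data @ B.data)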
Current tensor library status
- Dense and symmetric remapping/repadding/contractions implemented
- Currently functional only for dense tensors, but with full symmetric logic
- Can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- Complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)
Next implementation steps
- Currently integrating the library with an SCF method code that uses dense contractions
- Get symmetric redistribution working correctly
- Automatic unfolding of processor dimensions
- Implement mapping by replication to enable 2.5D algorithms
- Much basic performance debugging/optimization left to do
- More optimization needed for sequential symmetric contractions
Very preliminary contraction library results
Contracts tensors of size 64x64x256x256 in 1 second on 2K nodes
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes (256 to 2048) for no rephase, rephase every contraction, and repad every contraction]
Potential benefit of unfolding
Unfolding smallest two BG/P torus dimensions improves
performance.
[Figure: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes (256 to 2048) for no-rephase 2D vs. no-rephase 3D mappings]
Contributions
- Models for rectangular collectives
- 2.5D algorithms: theory and implementation
- Using a cyclic mapping to parallelize symmetric tensor contractions
- Extending and tuning the processor grid with virtual dimensions
- Automatic mapping of high-dimensional tensors to topology-aware physical partitions
- A parallel tensor contraction algorithm/library without a global address space
Conclusions and references
- Parallel tensor contraction algorithm and library seem to be the first communication-efficient practical approach
- Preliminary results and theory indicate high potential of this tensor contraction library
- Papers:
  - (2.5D) to appear in Euro-Par 2011, Distinguished paper
  - (2.5D + rectangular collective models) to appear in Supercomputing 2011
Backup slides
A new LU latency lower bound
[Figure: the critical path of blocked LU through the diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1} of the n-by-n matrix]

- The flops lower bound requires $d = \Omega(\sqrt{P})$ blocks/messages
- The bandwidth lower bound requires $d = \Omega(\sqrt{cP})$ blocks/messages
Virtual topology of 2.5D algorithms
[Figure: the 2.5D virtual processor grid, with indices i and j ranging over $[0, (P/c)^{1/2})$ and replication index k over $[0, c)$]

- 2D algorithm mapping: $\sqrt{P} \times \sqrt{P}$ grid
- 2.5D algorithm mapping: $\sqrt{P/c} \times \sqrt{P/c} \times c$ grid for any c