Reducing communication in dense matrix/tensor computations
Edgar Solomonik, UC Berkeley
Aug 11th, 2011
Outline

- Topology-aware collectives
  - Rectangular collectives
  - Multicasts
  - Reductions
- 2.5D algorithms
  - 2.5D matrix multiplication
  - 2.5D LU factorization
- Tensor contractions
  - Algorithms for distributed tensor contractions
  - A tensor contraction library implementation
- Conclusions and future work

Performance of multicast (BG/P vs Cray)

(Plot: bandwidth in MB/sec vs. #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6.)

Why the performance discrepancy in multicasts?

- Cray machines use binomial multicasts
  - Form a spanning tree from a list of nodes
  - Route copies of the message down each branch
  - Network contention degrades utilization on a 3D torus
- BG/P uses rectangular multicasts
  - Requires the network topology to be a k-ary n-cube
  - Forms 2n edge-disjoint spanning trees
    - Route in different dimensional orders
    - Use both directions of the bidirectional network

2D rectangular multicast trees

(Figure: a 2D 4x4 torus with spanning trees 1-4 rooted at the same node, and all four trees combined.)
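As an illustration of the 2n edge-disjoint trees, the sketch below enumerates one plausible schedule per tree on a d-dimensional torus: each tree routes the dimensions in a different (rotated) order and uses one direction of the bidirectional links, giving 2d trees (the slide's 2n). This is a sketch of the concept only; the rotation-based schedule and the function name are assumptions, not the DCMF implementation.

    def rectangular_tree_schedules(d):
        """One (dimension order, direction) schedule per spanning tree of a
        rectangular multicast on a d-dimensional torus: d rotations of the
        dimension order, each used with one link direction, giving 2d trees
        (a conceptual sketch, not the actual routing implementation)."""
        rotations = [tuple((dim + r) % d for dim in range(d)) for r in range(d)]
        return [(order, sign) for order in rotations for sign in ("+", "-")]

    for order, sign in rectangular_tree_schedules(3):   # a 3D torus -> 6 trees
        print("route dimensions in order", order, "direction", sign)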
A model for rectangular multicasts

t_mcast = m/B_n + 2(d+1)·o + 3L + d·P^{1/d}·(2o + L)

Our multicast model consists of three terms:

1. m/B_n, the bandwidth cost incurred at the root
2. 2(d+1)·o + 3L, the start-up overhead of setting up the multicasts in all dimensions
3. d·P^{1/d}·(2o + L), the path overhead, which reflects the time for a packet to get from the root to the farthest destination node

A model for binomial multicasts

t_bnm = log2(P) · (m/B_n + 2o + L)

- The root of the binomial tree sends the entire message log2(P) times
- The setup overhead is overlapped with the path overhead
- We assume no contention
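The two models are easy to compare numerically. The sketch below evaluates t_mcast and t_bnm for a 1 MB message on an 8x8x8 torus; the machine parameters are hypothetical placeholders, not measured BG/P values.

    import math

    # Hypothetical machine parameters (placeholders, not measured values)
    B_n = 400e6   # root injection bandwidth, bytes/s
    o   = 2e-6    # per-message software overhead, s
    L   = 2e-6    # network latency, s

    def t_mcast(m, P, d):
        """Rectangular multicast model: m/B_n + 2(d+1)o + 3L + d P^(1/d) (2o + L)."""
        return m / B_n + 2 * (d + 1) * o + 3 * L + d * P ** (1.0 / d) * (2 * o + L)

    def t_bnm(m, P):
        """Binomial multicast model: log2(P) (m/B_n + 2o + L)."""
        return math.log2(P) * (m / B_n + 2 * o + L)

    m, P, d = 1 << 20, 512, 3            # 1 MB message on an 8x8x8 torus
    print("rectangular:", t_mcast(m, P, d), "s")
    print("binomial:   ", t_bnm(m, P), "s")
    print("model-predicted speedup:", t_bnm(m, P) / t_mcast(m, P, d))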
Model verification: one dimension

(Plot: DCMF Broadcast on a ring of 8 nodes of BG/P; bandwidth vs. message size for the t_rect model, DCMF rectangle dput, the t_bnm model, and DCMF binomial.)

Model verification: two dimensions

(Plot: DCMF Broadcast on 64 (8x8) nodes of BG/P; bandwidth vs. message size for the same four curves.)

Model verification: three dimensions

(Plot: DCMF Broadcast on 512 (8x8x8) nodes of BG/P; bandwidth vs. message size for the same four curves, plus data from Faraj et al.)

A model for rectangular reductions

t_red = max[m/(8γ), 3m/β, m/B_n] + 2(d+1)·o + 3L + d·P^{1/d}·(2o + L)

- Any multicast tree can be inverted to produce a reduction tree
- The reduction operator must be applied at each node
  - each node operates on 2m data
  - both the memory-bandwidth and computation costs can be overlapped

Rectangular reduction performance on BG/P

(Plot: DCMF Reduce peak bandwidth at the largest message size vs. #nodes for torus rectangle ring, torus binomial, torus rectangle, and torus short binomial.)

BG/P rectangular reduction performs significantly worse than multicast.

Performance of custom line reduction

(Plot: custom Reduce/Multicast on 8 nodes; bandwidth vs. message size for MPI Broadcast, Custom Ring Multicast, Custom Ring Reduce 2, Custom Ring Reduce 1, and MPI Reduce.)

Another look at that first plot

Just how much better are rectangular algorithms on P = 4096 nodes?

- Binomial collectives on the XE6 achieve 1/30th of the link bandwidth
- Rectangular collectives on BG/P achieve 4.3X the link bandwidth
- Over 120X improvement in efficiency!

(Plot: the 1 MB multicast comparison on BG/P, Cray XT5, and Cray XE6 again.)

How can we apply this?

2.5D Cannon-style matrix multiplication

(Figure: block-to-processor mappings for 2D (P=16, c=1), 2.5D (P=32, c=2), and 3D (P=64, c=4) Cannon-style matrix multiplication.)

Classification of parallel dense matrix algorithms

                   2D               2.5D               3D
    c              1                [1, P^{1/3}]       P^{1/3}
    memory (M)     O(n^2/P)         O(c·n^2/P)         O(n^2/P^{2/3})
    words (W)      O(n^2/sqrt(P))   O(n^2/sqrt(cP))    O(n^2/P^{2/3})
    messages (S)   O(sqrt(P))       O(sqrt(P/c^3))     O(log P)

NEW: 2.5D algorithms generalize 2D and 3D algorithms.

Minimize communication with
- minimal memory (2D)
- as much memory as is available (2.5D): flexible
- as much memory as the algorithm can exploit (3D)

Match the network topology of
- a sqrt(P)-by-sqrt(P) grid (2D)
- a sqrt(P/c)-by-sqrt(P/c)-by-c grid, most cuboids (2.5D): flexible
- a P^{1/3}-by-P^{1/3}-by-P^{1/3} cube (3D)

2.5D SUMMA-style matrix multiplication

(Figure: matrix mapping to a 3D partition of BG/P.)

2.5D MM strong scaling

(Plot: 2.5D MM on BG/P, n=65,536; percentage of machine peak vs. #nodes (256-2048) for 2.5D SUMMA, 2.5D Cannon, 2D MM (Cannon), and ScaLAPACK PDGEMM.)

2.5D MM on 65,536 cores

(Plot: 2.5D MM on 16,384 nodes of BG/P; percentage of machine peak vs. matrix dimension n (8192-131072) for 2.5D SUMMA, 2.5D Cannon, 2D Cannon, and 2D SUMMA.)

Cost breakdown of MM on 65,536 cores

(Bar chart: execution time normalized by 2D SUMMA on 16,384 nodes of BG/P, split into communication, idle, and computation, for 2D and 2.5D at n = 8192, 32768, and 131072.)
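To make the 2.5D schedule concrete, here is a serial emulation (an illustrative sketch, not the BG/P implementation): the inputs are notionally replicated across c layers, each layer performs its 1/c share of the SUMMA-style block updates, and C is obtained by summing (reducing) the per-layer contributions. The function name and the tiny problem sizes are illustrative assumptions.

    import numpy as np

    def matmul_25d_emulated(A, B, p, c):
        """Serial emulation of the 2.5D work partition: A and B are (notionally)
        replicated on c layers, each layer performs 1/c of the SUMMA-style block
        outer-product updates, and C is summed (reduced) over layers.
        Only the schedule is modeled, not the communication."""
        n = A.shape[0]
        pr = int(round((p / c) ** 0.5))          # each layer is a pr x pr grid
        assert pr * pr * c == p and n % pr == 0 and pr % c == 0
        b = n // pr                               # block size
        C = np.zeros((n, n))
        for layer in range(c):                    # each layer owns pr/c of the k-steps
            for k in range(layer * pr // c, (layer + 1) * pr // c):
                for i in range(pr):
                    for j in range(pr):
                        C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                            A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b])
        return C   # in the parallel algorithm this sum is a reduction across layers

    n, p, c = 64, 32, 2                           # P = 32 processors, c = 2 layers
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d_emulated(A, B, p, c), A @ B)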
A new latency lower bound for LU

Can we reduce the latency cost to O(sqrt(P/c^3)) for LU?

- For block size n/d, LU does
  - Ω(n^3/d^2) flops
  - Ω(n^2/d) words
  - Ω(d) messages
- Now pick d (= the latency cost)
  - d = Ω(sqrt(P)) to minimize flops
  - d = Ω(sqrt(cP)) to minimize words

No dice. But let's minimize bandwidth instead.

(Figure: the d diagonal blocks A_00 through A_{d-1,d-1} along the critical path of an n-by-n LU factorization.)

2.5D LU factorization without pivoting

1. Factorize A_00 and communicate L_00 and U_00 among layers.
2. Perform TRSMs to compute a panel of L and a panel of U.
3. Broadcast blocks so all layers own the panels of L and U.
4. Broadcast different subpanels within each layer.
5. Multiply subpanels on each layer.
6. Reduce (sum) the next panels.*
7. Broadcast the panels and continue factorizing the Schur complement...

* All layers always need to contribute to the reduction, even if the iteration is done with a subset of layers.

2.5D LU factorization with tournament pivoting

1. Factorize each block in the first column with pivoting.
2. Reduce to find the best pivot rows.
3. Pivot rows in the first big block column on each layer.
4. Apply TRSMs to compute the first column of L and the first block of a row of U.
5. Update the corresponding interior blocks S = A - L_k0 · U_01.
6. Recurse to compute the rest of the first big block column of L.
7. Pivot rows in the rest of the matrix on each layer.
8. Perform TRSMs to compute a panel of U.
9. Update the rest of the matrix as before and recurse on the next block panel...

2.5D LU strong scaling

(Plot: 2.5D LU on BG/P, n=65,536; percentage of machine peak vs. #nodes (256-2048) for 2.5D LU (no-pvt), 2.5D LU (CA-pvt), 2D LU (no-pvt), 2D LU (CA-pvt), and ScaLAPACK PDGETRF.)

2.5D LU on 65,536 cores

(Bar chart: 2.5D LU vs. 2D LU on 16,384 nodes of BG/P, n=131,072; time in seconds split into communication, idle, and compute for 2D and 2.5D variants, with and without CA pivoting, using binomial and rectangular collectives.)
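The numerical steps of the non-pivoted factorization can be sketched serially as blocked right-looking LU: factorize the diagonal block, solve triangular systems for the L and U panels, and update the Schur complement. This is only a sketch of the local numerics (it uses explicit inverses for brevity and assumes a matrix that is safe to factorize without pivoting); it does not model the 2.5D data movement.

    import numpy as np

    def lu_nopivot_blocked(A, b):
        """Blocked right-looking LU without pivoting, mirroring the slide's steps
        serially: factorize the diagonal block, TRSM the L and U panels, update
        the Schur complement, recurse on the trailing matrix."""
        A = A.copy()
        n = A.shape[0]
        for k in range(0, n, b):
            e = k + b
            # 1. factorize the diagonal block A_kk = L_kk U_kk (unblocked)
            for i in range(k, e):
                A[i+1:e, i] /= A[i, i]
                A[i+1:e, i+1:e] -= np.outer(A[i+1:e, i], A[i, i+1:e])
            Lkk = np.tril(A[k:e, k:e], -1) + np.eye(b)
            Ukk = np.triu(A[k:e, k:e])
            # 2. panels of L and U (triangular solves, written as inverses here)
            A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(Ukk)
            A[k:e, e:] = np.linalg.inv(Lkk) @ A[k:e, e:]
            # 3. Schur complement update
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return np.tril(A, -1) + np.eye(n), np.triu(A)

    n, b = 64, 16
    M = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant: safe without pivoting
    L, U = lu_nopivot_blocked(M, b)
    assert np.allclose(L @ U, M)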
Bridging dense linear algebra techniques and applications

Target application: tensor contractions in electronic structure calculations (quantum chemistry)

- Often memory constrained
- Most target tensors are oddly shaped
- Need support for high-dimensional tensors
- Need handling of partial/full tensor symmetries
- Would like to use communication-avoiding ideas (blocking, 2.5D, topology-awareness)

Decoupling memory usage and topology-awareness

- 2.5D algorithms couple memory usage and virtual topology
  - c copies of a matrix implies c processor layers
- Instead, we can nest 2D and/or 2.5D algorithms
- Higher-dimensional algorithms allow smarter topology-aware mapping

4D SUMMA-Cannon

How do we map to a 3D partition without using more memory?

- SUMMA (broadcast-based) on 2D layers
- Cannon (send-based) along the third dimension
- Cannon calls SUMMA as a subroutine
  - Minimizes inefficient (non-rectangular) communication
  - Allows better overlap
- Treats MM as a 4D tensor contraction (illustrated below)

(Plot: MM on 512 nodes of BG/P; percentage of flops peak vs. matrix dimension (4096-131072) for 2.5D MM, 4D MM, 2D MM (Cannon), and PBLAS MM.)
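The "MM as a 4D tensor contraction" view can be illustrated by folding each matrix index into a (coarse, fine) pair, so the contraction has two contracted indices that could be mapped to different torus dimensions. A minimal numpy sketch of the index folding only (the splitting factors are arbitrary); it does not show the distributed algorithm.

    import numpy as np

    # Fold each matrix index into a (coarse, fine) pair: C[i1,i2,j1,j2] =
    # sum over (k1,k2) of A[i1,i2,k1,k2] * B[k1,k2,j1,j2]. The two contracted
    # indices k1 and k2 could then be mapped to different torus dimensions.
    n1, n2 = 4, 8                       # arbitrary splitting factors, n = n1*n2
    n = n1 * n2
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    A4 = A.reshape(n1, n2, n1, n2)      # A[i1, i2, k1, k2]
    B4 = B.reshape(n1, n2, n1, n2)      # B[k1, k2, j1, j2]
    C4 = np.einsum('abkl,klcd->abcd', A4, B4)
    assert np.allclose(C4.reshape(n, n), A @ B)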
Symmetry is a problem

- A fully symmetric tensor of dimension d requires only ~n^d/d! storage
- Symmetry significantly complicates the sequential implementation
  - Irregular indexing makes alignment and unrolling difficult
  - Generalizing over all partial symmetries is expensive
- Blocked or block-cyclic virtual processor decompositions give irregular or imbalanced virtual grids

(Figure: blocked and block-cyclic decompositions of a symmetric tensor over processors P0-P3.)

Solving the symmetry problem

- A cyclic decomposition allows balanced and regular blocking of symmetric tensors (see the counting sketch below)
- If the cyclic phase is the same in each symmetric dimension, each sub-tensor retains the symmetry of the whole tensor

(Figure: cyclic decomposition of a symmetric tensor.)
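A small counting sketch of the load-balance argument: distribute the unique elements (i <= j) of a symmetric matrix over a processor grid under a blocked layout and under a cyclic layout, and compare per-processor counts. The sizes and the helper name are illustrative; the padding the library would use to make the cyclic blocks fully regular is omitted.

    import numpy as np
    from itertools import product

    def unique_counts(n, pr, pc, layout):
        """Per-processor count of the unique (i <= j) elements of an n x n
        symmetric matrix under a blocked or a cyclic layout on a pr x pc grid."""
        counts = np.zeros((pr, pc), dtype=int)
        for i, j in product(range(n), repeat=2):
            if i > j:
                continue                          # store the upper triangle once
            if layout == "blocked":
                owner = (i // (n // pr), j // (n // pc))
            else:                                 # cyclic: same phase in both dims
                owner = (i % pr, j % pc)
            counts[owner] += 1
        return counts

    print(unique_counts(16, 4, 4, "blocked"))     # per-processor counts: 0 to 16
    print(unique_counts(16, 4, 4, "cyclic"))      # per-processor counts: 6 to 10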
A generalized cyclic layout is still challenging

- To retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase
- The contracted dimensions of A and B must be mapped with the same phase
- And yet the virtual mapping needs to be mapped to a physical topology, which can be any shape

Virtual processor grid dimensions

- Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted
- Virtual processor grid dimensions serve as a new level of indirection
  - If a tensor dimension must have a certain cyclic phase, adjust the physical mapping by creating a virtual processor dimension
  - This allows the physical processor grid to be 'stretchable'

Constructing a virtual processor grid for MM

(Figure: matrix multiplication on a 2x3 processor grid; red lines mark the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase.)

Unfolding the processor grid

- Higher-dimensional fully symmetric tensors can be mapped onto a lower-dimensional processor grid via the creation of new virtual dimensions
- Lower-dimensional tensors can be mapped onto a higher-dimensional processor grid by unfolding (serializing) pairs of processor dimensions
- However, when possible, replication is better than unfolding, since unfolded processor grids can lead to an unbalanced mapping

A basic parallel algorithm for symmetric tensor contractions

1. Arrange the processor grid in any k-ary n-cube shape.
2. Map (via unfolding and virtualization) both A and B cyclically along the dimensions being contracted.
3. Map (via unfolding and virtualization) the remaining dimensions of A and B cyclically.
4. For each tensor dimension contracted over, recursively multiply the tensors along the mapping.

- Each contraction dimension is represented with a nested call to a local multiply or a parallel algorithm (e.g. Cannon), as sketched below.
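The recursive structure of step 4 can be sketched serially: peel off one contracted dimension per level of recursion, where the parallel library would instead invoke a distributed algorithm (e.g. Cannon) for the torus dimension that index is mapped to, bottoming out in a local multiply. A hypothetical serial sketch, not the library's interface:

    import numpy as np

    def contract(A, B, nc):
        """Contract the last nc dims of A with the first nc dims of B, peeling
        one contracted dimension per recursion level. In the parallel library
        each level would instead be a distributed algorithm over the torus
        dimension that index is mapped to; the recursion bottoms out in a
        local multiply (here, an outer product)."""
        if nc == 0:
            return np.multiply.outer(A, B)
        # loop over one contracted index: A's last axis pairs with B's (nc-1)-th axis
        return sum(contract(A[..., t], np.take(B, t, axis=nc - 1), nc - 1)
                   for t in range(A.shape[-1]))

    A = np.random.rand(3, 5, 4, 2)                # free dims (3, 5), contracted (4, 2)
    B = np.random.rand(4, 2, 6)                   # contracted (4, 2), free dim (6,)
    assert np.allclose(contract(A, B, nc=2), np.tensordot(A, B, axes=2))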
Tensor library structure

The library supports arbitrary-dimensional parallel tensor contractions, with any symmetries, on n-cuboid processor torus partitions:

1. Load tensor data as (global rank, value) pairs.
2. Once a contraction is defined, map the participating tensors.
3. Distribute or reshuffle the tensor data/pairs.
4. Construct the contraction algorithm with recursive function/argument pointers.
5. Contract the sub-tensors with a user-defined sequential contract function.
6. Output (global rank, value) pairs on request.

Current tensor library status

- Dense and symmetric remapping/repadding/contractions implemented
- Currently functional only for dense tensors, but with full symmetric logic
- Can perform automatic mapping with physical and virtual dimensions, but cannot unfold processor dimensions yet
- Complete library interface implemented, including basic auxiliary functions (e.g. map/reduce, sum, etc.)

Next implementation steps

- Currently integrating the library with an SCF method code that uses dense contractions
- Get symmetric redistribution working correctly
- Automatic unfolding of processor dimensions
- Implement mapping by replication to enable 2.5D algorithms
- Much basic performance debugging/optimization left to do
- More optimization needed for sequential symmetric contractions

Very preliminary contraction library results

The library contracts tensors of size 64x64x256x256 in 1 second on 2K nodes.

(Plot: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes (256-2048) for no rephase, rephase every contraction, and repad every contraction.)

Potential benefit of unfolding

Unfolding the smallest two BG/P torus dimensions improves performance.

(Plot: strong scaling of a dense 64x64x256x256 contraction on BG/P; percentage of machine peak vs. #nodes for no-rephase 2D and no-rephase 3D.)

Contributions

- Models for rectangular collectives
- 2.5D algorithms: theory and implementation
- Using a cyclic mapping to parallelize symmetric tensor contractions
- Extending and tuning the processor grid with virtual dimensions
- Automatic mapping of high-dimensional tensors to topology-aware physical partitions
- A parallel tensor contraction algorithm/library without a global address space

Conclusions and references

- The parallel tensor contraction algorithm and library appear to be the first communication-efficient practical approach
- Preliminary results and theory indicate the high potential of this tensor contraction library
- Papers:
  - 2.5D algorithms: to appear in Euro-Par 2011 (Distinguished Paper)
  - 2.5D algorithms and rectangular collective models: to appear in Supercomputing 2011

Backup slides

A new LU latency lower bound

(Figure: the d diagonal blocks A_00 through A_{d-1,d-1} along the critical path of an n-by-n LU factorization.)

- The flops lower bound requires d = Ω(sqrt(P)) blocks/messages
- The bandwidth lower bound requires d = Ω(sqrt(cP)) blocks/messages

Virtual topology of 2.5D algorithms

(Figure: the sqrt(P/c) x sqrt(P/c) x c processor grid with indices i, j, k.)

- 2D algorithm mapping: a sqrt(P) x sqrt(P) grid
- 2.5D algorithm mapping: a sqrt(P/c) x sqrt(P/c) x c grid, for any c