APE group
Does HPC really fit GPUs?
Davide Rossetti, INFN – Roma, [email protected]
Incontro di lavoro della CCR, Napoli, 26-27 January 2010

A case study: Lattice QCD
(V. Lubicz – CSN4 talk, September 2009)

A case study: Lattice QCD
● The most valuable product is the gauge configuration
  o Different types: Nf, schemes
  o Different sizes
● A grid-enabled community (www.ildg.org)
  o Storage sites
  o Production sites
  o Analysis sites
● Gauge configuration production is really expensive!

HPC in INFN
● Focus on compute-intensive physics (excluding LHC computing): LQCD, astro, nuclear and medical
● Needs for 2010-2015:
  o ~0.01-1 Pflops for a single research group
  o ~0.1-10 Pflops nationwide
● This translates to:
  o Big infrastructure (cooling, power, …)
  o High procurement costs (€/Gflops)
  o High maintenance costs (W/Gflops)

LQCD on GPU?
● The story begins with video games (Egri, Fodor et al., 2006)
● Wilson-Dirac operator at 120 Gflops (K. Ogawa, 2009)
● Domain wall fermions (Tsukuba/Taiwan, 2009)
● Definitive work: the QUDA library (M. A. Clark et al., 2009)
  o Double, single and half precision
  o Half-precision solver with reliable updates, > 100 Gflops
  o MIT/X11 open source license

INFN on GPUs
● 2D spin models (Di Renzo et al., 2008)
● LQCD staggered fermions on Chroma (Cossu, D'Elia et al., Ge+Pi, 2009)
● Bio-computing on GPU (Salina, Rossi et al., ToV, 2010?)
● Gravitational wave analysis (Bosi, Pg, 2010?)
● Geant4 on GPU (Caccia, Rm, 2010?)

How many GPUs?
Raw estimate of the memory footprint:
● Full solver on the GPU
● Gauge field + 15 fermion fields
● No symmetry tricks
● No half-precision tricks

Lattice size   Single prec. memory (GiB)   Double prec. memory (GiB)   # GTX280   # Tesla C1060   # Tesla C2070
24³×48         1                            2.1                         3          1               1
32³×64         3.3                          6.7                         4-8        2               1-2
48³×96         17                           34                          17-35      5-9             3-6
64³×128        54                           108                         55-110     14-28           9-18

If one GPU is not enough
Multi-GPU, the Fastra II* approach:
● Stick 13 GPUs together
● 12 Tflops @ 2 kW
● CPU threads feed the GPU kernels
● Embarrassingly parallel → great!
● The full problem fits → good!
● Enjoy the warm weather
* University of Antwerp, Belgium

Multi-GPUs need scaling!
Seems easy:
1. Put 1-2-4 GPUs in a 1-2U system (or buy Tesla M1060)
2. Stack many of them
3. Add an interconnect (IB, Myrinet 10G, custom) and plug it all in accurately :)
4. Simply write your program in C+MPI+CUDA/OpenCL(+threads)
[Slide figure: the parallelism hierarchy – multi-node parallelism, multi-GPU management, single-GPU kernel]

Some near-term solutions for LQCD
Two INFN-approved projects:
● QUonG: a cluster of GPUs with the custom 3D torus network APEnet+ (talk by R. Ammendola)
● Aurora: dual Xeon 5500 custom blade with IB and a 3D first-neighbour network

INFN assets
● 20 years of experience in high-speed 3D torus interconnects (APE100, APEmille, apeNEXT, APEnet)
● 20 years writing parallel codes
● Control over HW architecture vs. algorithms

Wish list for multi-GPU computing
Open the GPU to the world:
● Provide APIs to hook inside your drivers
● Allow PCIe-to-PCIe DMAs, or better still …
● … add a high-speed data I/O port toward an external device (FPGA, custom ASIC)
● Promote the GPU from simple accelerator to main computing engine!
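Step 4 of the scaling recipe above ("simply write your program in C+MPI+CUDA") is also where the wish list bites: without PCIe-to-PCIe DMAs, any boundary data that has to cross the network must first be staged through host memory on both ends of the transfer. Below is a minimal sketch of that pattern, assuming a 1D ring of MPI ranks with one GPU each; the buffer names and the halo size are illustrative choices, not taken from the talk.

```c
/* halo_exchange.c - staged GPU -> host -> network -> host -> GPU halo swap.
 * Illustrative sketch only: 1D ring of MPI ranks, one GPU per rank,
 * no error checking.  Build roughly as: mpicc halo_exchange.c -lcudart
 */
#include <mpi.h>
#include <cuda_runtime.h>

/* ~ one single-precision spinor on a 32x32x64 boundary slice (6 MiB) */
#define HALO_BYTES (6 * 1024 * 1024)

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(0);                      /* one GPU per node assumed */

    void *d_send, *d_recv;                 /* halo buffers in GPU memory     */
    void *h_send, *h_recv;                 /* staging buffers in host memory */
    cudaMalloc(&d_send, HALO_BYTES);
    cudaMalloc(&d_recv, HALO_BYTES);
    cudaMallocHost(&h_send, HALO_BYTES);   /* pinned, for faster PCIe copies */
    cudaMallocHost(&h_recv, HALO_BYTES);

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* 1. GPU -> host over PCIe (the hop a PCIe-to-PCIe DMA would remove) */
    cudaMemcpy(h_send, d_send, HALO_BYTES, cudaMemcpyDeviceToHost);

    /* 2. host -> network -> host */
    MPI_Sendrecv(h_send, HALO_BYTES, MPI_BYTE, next, 0,
                 h_recv, HALO_BYTES, MPI_BYTE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3. host -> GPU over PCIe */
    cudaMemcpy(d_recv, h_recv, HALO_BYTES, cudaMemcpyHostToDevice);

    cudaFreeHost(h_send); cudaFreeHost(h_recv);
    cudaFree(d_send);     cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Every exchange pays two extra PCIe hops (steps 1 and 3) on top of the network transfer; a direct GPU-to-NIC path of the kind asked for above would remove them.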
[Slide figure: GPU ↔ PCI Express ↔ host main memory (DRAM)]

In conclusion
● GPUs are good at small scales
● Scaling from a single GPU to multi-GPU to multi-node, the hierarchy deepens:
  o Programming complexity increases
  o Watch the GPU → network latency
● Please, help us link your GPU to our 3D network!

Game over :)
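For reference, the figures in the "How many GPUs?" table earlier follow from a few lines of arithmetic. The sketch below is not from the talk; it assumes four uncompressed SU(3) links per site (72 reals) plus 15 Wilson-like spinor fields (24 reals per site each), matching the slide's "gauge field + 15 fermion fields, no symmetry or half-precision tricks" counting.

```c
/* footprint.c - back-of-envelope solver memory footprint per lattice size.
 * Counting assumption: 4 uncompressed SU(3) links (4 x 18 = 72 reals/site)
 * plus 15 spinor fields (15 x 24 = 360 reals/site), no compression tricks.
 */
#include <stdio.h>

int main(void)
{
    const int dims[][2] = { {24, 48}, {32, 64}, {48, 96}, {64, 128} };  /* L^3 x T */
    const double reals_per_site = 4 * 18 + 15 * 24;                     /* = 432   */

    for (int i = 0; i < 4; i++) {
        double L = dims[i][0], T = dims[i][1];
        double sites  = L * L * L * T;
        double sp_gib = sites * reals_per_site * 4 / (1024.0 * 1024.0 * 1024.0); /* float  */
        double dp_gib = sites * reals_per_site * 8 / (1024.0 * 1024.0 * 1024.0); /* double */
        printf("%2.0f^3 x %3.0f : %6.1f GiB single, %6.1f GiB double\n",
               L, T, sp_gib, dp_gib);
    }
    return 0;
}
```

The output agrees with the table to within rounding; dividing by the on-board memory of each card (1 GiB on a GTX280, 4 GiB on a Tesla C1060, 6 GiB on a C2070) then gives rough estimates of the GPU counts in the remaining columns.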