Advanced Hardware and Software Technologies for Ultra-long FFT’s HPEC 2005
by user
Comments
Transcript
Advanced Hardware and Software Technologies for Ultra-long FFT’s HPEC 2005
Advanced Hardware and Software Technologies for Ultra-long FFT’s Hahn Kim, Jeremy Kepner, M. Michael Vai, Crystal Kahn HPEC 2005 21 September 2005 MIT Lincoln Laboratory HPEC 2005-1 21 Sept. 2005 This work is sponsored by the Department of the Air Force under contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government Outline • Background • FPGA-Based Hardware Technology • Parallel Software Technology • Conclusions HPEC 2005-2 21 Sept. 2005 MIT Lincoln Laboratory Introduction Real-time Embedded ISR Radar ELINT SIGINT 1 Gigapoint complex FFT 150M ops ~1M words 1 Kilopoint complex FFT 50K ops ~1K words Non Real-Time Data Analysis HSI Monte Carlo IMINT HPEC 2005-3 21 Sept. 2005 MIT Lincoln Laboratory Motivation for Real-time, Ultra-long FFT’s Wide-Band Digital Receiver RF Receiver ADC 1 G-pt FFT Signal Processing Application Channelized Narrow-Band Digital Receiver ADC ADC • ... Analog Channelizer ... RF Receiver 100 M-pt FFT Signal Processing Application 100 M-pt FFT Can use FFT to detect weak signals – Reduce noise floor – Longer intervals result in higher gain • HPEC 2005-4 21 Sept. 2005 Real-time, ultra-long FFT processor is a critical enabling component MIT Lincoln Laboratory FFT Technology Space 10 GSPS Sustained Data Rate ’04 RFEL HyperSpeed IP (FPGA) 32 K-pt @ 3.2 GSPS ’04: LL SYCO (FPGA) 8 K-pt @ 1.2 GSPS ≥ 1G-pt @ 1.2 GSPS 1 GSPS ’99: LL Discoverer II (ASIC) 128 pt @ 0.5 GSPS ’03: Xilinx IP (FPGA) 16 K-pt @ 0.2 GSPS ’02: Transtech DSP Quixilica IP (FPGA) 128 M-pt @ 0.128 GSPS 8 K-pt @ 0.16 GSPS ’03: DSP Architectures DSP-24 (ASIC) 64 M-pt 1 K-pt @ 0.06 GSPS @ 0.064 GSPS ’01: CRI Pathfinder II (ASIC) 1 M-pt @ 0.04 GSPS 100 MSPS 10 MSPS ’03: Analog Devices ADSP-21262 (DSP) 1 K-pt @ 0.02 GSPS ’04 RFEL HyperLength IP (FPGA) 256 M-pt @ 0.01 GSPS OFDM Communication Standards Req. 100 10K 64 G-pt @ 0.04 GSPS ’04 LL pMatlab (128 Proc. w/ GigE) 4 G-pt @ 0.01 GSPS State-of-The-Art FFT Technology 1M 100M 10G FFT Length (Points) 1T 100T Objective: Extend state-of-the-art to ultra-long FFT’s HPEC 2005-5 21 Sept. 2005 MIT Lincoln Laboratory Outline • Background • FPGA-Based Hardware Technology • Parallel Software Technology • Conclusions HPEC 2005-6 21 Sept. 2005 MIT Lincoln Laboratory Ultra-long FFT Implementation Challenges Memory Requirement (Bytes) 80G 8G 1 G-pt 0.8G 100 M-pt 80M 8M 0.8M 80K 8K 8 K-pt FPGA Embedded Memory Only 10K • • 100K Off-Chip Memory Required 1M 10M 100M FFT Length (Points) 1G 10G Beyond ~32 K-pt FFT, off-chip memory for FIFO and twiddle factors is required – Full duplex memory access is a challenge Lincoln architecture selected for ultra-long FFT – HPEC 2005-7 21 Sept. 2005 32 K-pt “A Systolic FFT Architecture for Real Time FPGA Systems,” HPEC 2004. MIT Lincoln Laboratory Real-Time 1G-Point FFT Architecture • MN-pt FFT can be implemented with an M-pt FFT and an N-pt FFT – E.g. 1 G-pt FFT ⇒ M = 32 K-pt, N = 32 K-pt – Each 32 K-pt FFT fits into an FPGA Corner Turn 32K FFT Corner Turn Twiddle Factors 32K FFT Corner Turn …9876543210 … 9 .. 8 7 3 6 2 5 1 4 0 Corner Turn Example (for an 8-Point FFT): … • …. 9 8 ……… 3210 7654 Corner Turn (Matrix Transpose) … .. 8 .. 9 .. .. .. .. 40 51 62 73 Lincoln has developed a corner turn architecture that operates at 1 GSPS HPEC 2005-8 21 Sept. 2005 MIT Lincoln Laboratory Real-Time FFT Architecture Mem Mem Mem Mem Mem Mem Mem Mem Real-Time Corner Turn Mem Mem Mem Mem Mem Mem Mem Mem 32K FFT Real-Time Corner Turn Mem Mem Mem Mem Cache and MUX Mem Mem Mem Mem DMUX and Switch Mem Mem Mem Mem FPGA Mem Mem Mem Mem Cache and MUX Mem Mem Mem Mem DMUX and Switch Mem Mem Mem Mem FPGA Mem Mem Mem Mem Cache and MUX ADC 1 GSPS HOLD DMUX and Switch 1 Gigapoint FFT Configuration Mem Mem Mem Mem 32K FFT Real-Time Corner Turn Reconfigurable architecture allows for multiple implementations: e.g. 1 G-pt @ 1 GSPS or 100 M-pt @ 100 MSPS X 10 channels HPEC 2005-9 21 Sept. 2005 MIT Lincoln Laboratory Real-time Example FFT Implementation Current FFT Capabilities Future Ultra-long FFT Capabilities • 6” – Provides on-board memory banks for performing real-time corner turns – Performs form-factor optimization 13” 13” • • Symbiotic Communications (SYCO) real-time processor – – – – 8 K-pt FFT 450 GOPS @ 130 Watts 208 FFT butterflies No on-board memory • “Rapid Prototyping of a Realtime Range Compression Processor,” HPEC 2005 Poster Session HPEC 2005-10 21 Sept. 2005 Processor enhancement * Initial estimate based on double-precision operation Develop a universal FFT architecture – 100M-Pt FFT X 10 channels – 1G-Pt FFT MIT Lincoln Laboratory Outline • Background • FPGA-Based Hardware Technology • Parallel Software Technology • Conclusions HPEC 2005-11 21 Sept. 2005 MIT Lincoln Laboratory Motivation for Out-of-Core Technology • Data are often larger than memory on a single processor. Memory Disk Can address memory limitation with: 1. Multiple CPUs 2. Using disk as memory Hughes, G.F. “Wise Drives.” IEEE Spectrum. Aug 2002. • HPEC 2005-12 21 Sept. 2005 Out-of-core technology uses memory as a “window” into data stored on disk. MIT Lincoln Laboratory MatlabMPI & pMatlab Software Layers Application Vector/Matrix Vector/Matrix Parallel Library Output Analysis Input Comp Comp Conduit Task Library Library Layer Layer (pMatlab) (pMatlab) Kernel Kernel Layer Layer Messaging (MatlabMPI) Math (Matlab) User Interface Hardware Interface Parallel Hardware •• Can Can build build aa parallel parallel library library with with aa few few messaging messaging primitives primitives •• MatlabMPI MatlabMPI provides provides this this messaging messaging capability: capability: MMPI_Send(dest PI_Send(dest,co ,comm m,tag,X) m,tag,X);; XX == MPI_Recv(source,com MPI_Recv(source,com m,tag) m,tag);; HPEC 2005-13 21 Sept. 2005 •• Can Can build build aa application application with with aa few few parallel parallel structures structures and and functions functions •• pMatlab pMatlab provides provides parallel parallel arrays arrays and and functions functions XX == ones(n ones(n,mapX); ,mapX); YY == zeros(n,mapY); zeros(n,mapY); Y(: Y(:,,::))==fffftt(X) (X);; MIT Lincoln Laboratory Out-of-Core Memory Management: pMatlab eXtreme Virtual Memory (XVM) Virtual Memory Out-of-Core + Transparent to user (ease of use) – Disk access patterns are often sub-optimal for most algorithms + Provides control of disk at object level (performance) – Exposes swapping mechanism to user Goal of pMatlab XVM is to add out-ofcore capability to pMatlab that is: 1. 1. 2. 2. 3. 3. Easy Easy to to use use Has Has high high performance performance Can Can transparently transparently distribute distribute data data pMatlab + Provides mechanism for transparently distributing data across multiple CPUs HPEC 2005-14 21 Sept. 2005 MIT Lincoln Laboratory Data Organization MN 1. Data starts as a vector with length MN 2. Divide into M vectors with length N N N 3. Reorganize as a MxN matrix 4. Distribute rows across processors HPEC 2005-15 21 Sept. 2005 M P0 P1 MIT Lincoln Laboratory Hierarchical Matrices and Maps Out-of-core matrices 32K Matrix 32K Global matrix 32K 8K P0 2K P1 + P3 P0 Global map P2 P1 Core block P2 Core block P3 (in-core) (out-of-core) + Out-of-core map HPEC 2005-16 21 Sept. 2005 • • Hierarchical matrices span processors and the storage hierarchy User constructs hierarchy by specifying global and out-of-core maps MIT Lincoln Laboratory Ultra-long FFT FFT 0 Ultra-long FFT × twiddle corner turn FFT 1 2 3 0 Software-based Implementations 1 2 3 MATLAB® pMatlab C/MPI pMatlab XVM Testbed Lincoln HPC cluster (LLGrid) • • • • 80 dual CPU nodes 2.8-3.06 GHz Pentium 4 Xeon 4 GB memory Gigabit Ethernet HPEC 2005-17 21 Sept. 2005 MIT Lincoln Laboratory Scalability 1 Gigapoint FFT • • • HPEC 2005-18 21 Sept. 2005 pMatlab XVM supports ultra-long FFT’s with little degradation in performance Maximum problem size = size of available disk space 1 TB represents a 64 G-pt double-precision, complex FFT MIT Lincoln Laboratory Outline • Background • FPGA-Based Hardware Technology • Parallel Software Technology • Conclusions HPEC 2005-19 21 Sept. 2005 MIT Lincoln Laboratory System Development Methodologies System Prototyping Modify word size, precision, etc. N Bit-true simulation Satisfy requirements? Y Initial design Architecture Analysis 1 Gigapoint Performance Verification Compare HPEC 2005-20 21 Sept. 2005 MIT Lincoln Laboratory Summary • Presented FPGA architecture for real-time, ultra-long FFT’s – Can implement 1 G-pt FFT with smaller FFT’s – Use SYCO processor to implement FFT’s – Developed real-time corner turn architecture • Future – Develop a universal ultra-long FFT architecture – Allows multiple configurations in same hardware 1 G-Pt FFT 100 M-Pt FFT X 10 channels – Application-specific precision and dynamic analysis HPEC 2005-21 21 Sept. 2005 MIT Lincoln Laboratory Summary • Presented parallel software architecture for ultra-long FFT’s – Added out-of-core capability to pMatlab – Supports ultra-long FFT’s with little degradation in performance – Demonstrated 64 G-pt FFT (1 TB) • Future – Expand cluster to enable 64 T-pt FFT (1 PB) HPEC 2005-22 21 Sept. 2005 MIT Lincoln Laboratory Acknowledgements • • • • • • • • HPEC 2005-23 21 Sept. 2005 Ken Senne Gerry Banner Leslie Alger Bob Bond Cy Chan Hector Chan Tim Currie Preston Jackson • • • • • • • • Andy McCabe Peter Michaleas Michael Moore Charles Rader Albert Reuther Jonathan Scalera Nadya Travinin Edmund Wong MIT Lincoln Laboratory