...

Advanced Hardware and Software Technologies for Ultra-long FFT’s HPEC 2005

by user

on
Category: Documents
12

views

Report

Comments

Transcript

Advanced Hardware and Software Technologies for Ultra-long FFT’s HPEC 2005
Advanced Hardware and Software
Technologies for Ultra-long FFT’s
Hahn Kim, Jeremy Kepner, M. Michael Vai, Crystal Kahn
HPEC 2005
21 September 2005
MIT Lincoln Laboratory
HPEC 2005-1
21 Sept. 2005
This work is sponsored by the Department of the Air Force under contract FA8721-05-C-0002. Opinions,
interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by
the United States Government
Outline
•
Background
•
FPGA-Based Hardware Technology
•
Parallel Software Technology
•
Conclusions
HPEC 2005-2
21 Sept. 2005
MIT Lincoln Laboratory
Introduction
Real-time Embedded ISR
Radar
ELINT
SIGINT
1 Gigapoint
complex FFT
150M ops
~1M words
1 Kilopoint
complex FFT
50K ops
~1K words
Non Real-Time Data Analysis
HSI Monte Carlo IMINT
HPEC 2005-3
21 Sept. 2005
MIT Lincoln Laboratory
Motivation for Real-time, Ultra-long FFT’s
Wide-Band Digital Receiver
RF Receiver
ADC
1 G-pt FFT
Signal Processing
Application
Channelized Narrow-Band Digital Receiver
ADC
ADC
•
...
Analog
Channelizer
...
RF Receiver
100 M-pt FFT
Signal Processing
Application
100 M-pt FFT
Can use FFT to detect weak signals
– Reduce noise floor
– Longer intervals result in higher gain
•
HPEC 2005-4
21 Sept. 2005
Real-time, ultra-long FFT processor is a critical enabling
component
MIT Lincoln Laboratory
FFT Technology Space
10 GSPS
Sustained Data Rate
’04 RFEL HyperSpeed IP (FPGA)
32 K-pt @ 3.2 GSPS
’04: LL SYCO (FPGA)
8 K-pt @ 1.2 GSPS
≥ 1G-pt @ 1.2 GSPS
1 GSPS
’99: LL Discoverer II (ASIC)
128 pt @ 0.5 GSPS
’03: Xilinx IP (FPGA)
16 K-pt @ 0.2 GSPS
’02: Transtech DSP Quixilica IP (FPGA) 128 M-pt
@ 0.128 GSPS
8 K-pt @ 0.16 GSPS
’03: DSP Architectures DSP-24 (ASIC)
64 M-pt
1 K-pt @ 0.06 GSPS
@ 0.064 GSPS
’01: CRI Pathfinder II (ASIC)
1 M-pt @ 0.04 GSPS
100 MSPS
10 MSPS
’03: Analog Devices ADSP-21262 (DSP)
1 K-pt @ 0.02 GSPS
’04 RFEL HyperLength IP (FPGA)
256 M-pt @ 0.01 GSPS
OFDM
Communication
Standards Req.
100
10K
64 G-pt @ 0.04 GSPS
’04 LL pMatlab (128 Proc. w/ GigE)
4 G-pt @ 0.01 GSPS
State-of-The-Art
FFT Technology
1M
100M
10G
FFT Length (Points)
1T
100T
Objective: Extend state-of-the-art to ultra-long FFT’s
HPEC 2005-5
21 Sept. 2005
MIT Lincoln Laboratory
Outline
•
Background
•
FPGA-Based Hardware Technology
•
Parallel Software Technology
•
Conclusions
HPEC 2005-6
21 Sept. 2005
MIT Lincoln Laboratory
Ultra-long FFT Implementation
Challenges
Memory Requirement (Bytes)
80G
8G
1 G-pt
0.8G
100 M-pt
80M
8M
0.8M
80K
8K
8 K-pt
FPGA
Embedded
Memory
Only
10K
•
•
100K
Off-Chip
Memory
Required
1M
10M
100M
FFT Length (Points)
1G
10G
Beyond ~32 K-pt FFT, off-chip memory for FIFO and twiddle factors is
required
–
Full duplex memory access is a challenge
Lincoln architecture selected for ultra-long FFT
–
HPEC 2005-7
21 Sept. 2005
32 K-pt
“A Systolic FFT Architecture for Real Time FPGA Systems,” HPEC 2004.
MIT Lincoln Laboratory
Real-Time 1G-Point FFT Architecture
•
MN-pt FFT can be implemented with an M-pt FFT and an N-pt FFT
– E.g. 1 G-pt FFT ⇒ M = 32 K-pt, N = 32 K-pt
– Each 32 K-pt FFT fits into an FPGA
Corner
Turn
32K
FFT
Corner
Turn
Twiddle
Factors
32K
FFT
Corner
Turn
…9876543210
… 9 .. 8 7 3 6 2 5 1 4 0
Corner Turn Example
(for an 8-Point FFT):
…
•
…. 9 8
………
3210
7654
Corner
Turn
(Matrix Transpose)
…
.. 8
.. 9
.. ..
.. ..
40
51
62
73
Lincoln has developed a corner turn architecture that operates at 1 GSPS
HPEC 2005-8
21 Sept. 2005
MIT Lincoln Laboratory
Real-Time FFT Architecture
Mem
Mem
Mem
Mem
Mem
Mem
Mem
Mem
Real-Time
Corner Turn
Mem
Mem
Mem
Mem
Mem
Mem
Mem
Mem
32K
FFT
Real-Time
Corner Turn
Mem
Mem
Mem
Mem
Cache and MUX
Mem
Mem
Mem
Mem
DMUX and Switch
Mem
Mem
Mem
Mem
FPGA
Mem
Mem
Mem
Mem
Cache and MUX
Mem
Mem
Mem
Mem
DMUX and Switch
Mem
Mem
Mem
Mem
FPGA
Mem
Mem
Mem
Mem
Cache and MUX
ADC
1 GSPS
HOLD DMUX and Switch
1 Gigapoint FFT Configuration
Mem
Mem
Mem
Mem
32K
FFT
Real-Time
Corner Turn
Reconfigurable architecture allows for multiple implementations:
e.g. 1 G-pt @ 1 GSPS or 100 M-pt @ 100 MSPS X 10 channels
HPEC 2005-9
21 Sept. 2005
MIT Lincoln Laboratory
Real-time Example FFT Implementation
Current FFT Capabilities
Future Ultra-long FFT
Capabilities
•
6”
– Provides on-board
memory banks for
performing real-time
corner turns
– Performs form-factor
optimization
13”
13”
•
•
Symbiotic Communications
(SYCO) real-time processor
–
–
–
–
8 K-pt FFT
450 GOPS @ 130 Watts
208 FFT butterflies
No on-board memory
•
“Rapid Prototyping of a Realtime Range Compression
Processor,” HPEC 2005 Poster
Session
HPEC 2005-10
21 Sept. 2005
Processor enhancement
* Initial estimate based on double-precision operation
Develop a universal FFT
architecture
– 100M-Pt FFT X 10
channels
– 1G-Pt FFT
MIT Lincoln Laboratory
Outline
•
Background
•
FPGA-Based Hardware Technology
•
Parallel Software Technology
•
Conclusions
HPEC 2005-11
21 Sept. 2005
MIT Lincoln Laboratory
Motivation for Out-of-Core Technology
•
Data are often larger than memory on a single processor.
Memory
Disk
Can address memory limitation with:
1. Multiple CPUs
2. Using disk as memory
Hughes, G.F. “Wise Drives.” IEEE Spectrum. Aug 2002.
•
HPEC 2005-12
21 Sept. 2005
Out-of-core technology uses memory as a “window” into
data stored on disk.
MIT Lincoln Laboratory
MatlabMPI & pMatlab Software Layers
Application
Vector/Matrix
Vector/Matrix
Parallel
Library
Output
Analysis
Input
Comp
Comp
Conduit
Task
Library
Library Layer
Layer (pMatlab)
(pMatlab)
Kernel
Kernel Layer
Layer
Messaging (MatlabMPI)
Math (Matlab)
User
Interface
Hardware
Interface
Parallel
Hardware
•• Can
Can build
build aa parallel
parallel library
library with
with aa
few
few messaging
messaging primitives
primitives
•• MatlabMPI
MatlabMPI provides
provides this
this
messaging
messaging capability:
capability:
MMPI_Send(dest
PI_Send(dest,co
,comm m,tag,X)
m,tag,X);;
XX == MPI_Recv(source,com
MPI_Recv(source,com m,tag)
m,tag);;
HPEC 2005-13
21 Sept. 2005
•• Can
Can build
build aa application
application with
with aa few
few
parallel
parallel structures
structures and
and functions
functions
•• pMatlab
pMatlab provides
provides parallel
parallel arrays
arrays
and
and functions
functions
XX == ones(n
ones(n,mapX);
,mapX);
YY == zeros(n,mapY);
zeros(n,mapY);
Y(:
Y(:,,::))==fffftt(X)
(X);;
MIT Lincoln Laboratory
Out-of-Core Memory Management:
pMatlab eXtreme Virtual Memory (XVM)
Virtual Memory
Out-of-Core
+ Transparent to user (ease of use)
– Disk access patterns are often
sub-optimal for most algorithms
+ Provides control of disk at
object level (performance)
– Exposes swapping mechanism
to user
Goal of pMatlab XVM is to add out-ofcore capability to pMatlab that is:
1.
1.
2.
2.
3.
3.
Easy
Easy to
to use
use
Has
Has high
high performance
performance
Can
Can transparently
transparently distribute
distribute data
data
pMatlab
+ Provides mechanism for
transparently distributing data
across multiple CPUs
HPEC 2005-14
21 Sept. 2005
MIT Lincoln Laboratory
Data Organization
MN
1. Data starts as a vector
with length MN
2. Divide into M vectors
with length N
N
N
3. Reorganize as a
MxN matrix
4. Distribute rows
across processors
HPEC 2005-15
21 Sept. 2005
M
P0
P1
MIT Lincoln Laboratory
Hierarchical Matrices and Maps
Out-of-core
matrices
32K
Matrix
32K
Global matrix
32K
8K
P0
2K
P1
+
P3
P0
Global
map
P2
P1
Core block
P2
Core block
P3
(in-core)
(out-of-core)
+
Out-of-core
map
HPEC 2005-16
21 Sept. 2005
•
•
Hierarchical matrices span processors
and the storage hierarchy
User constructs hierarchy by specifying
global and out-of-core maps
MIT Lincoln Laboratory
Ultra-long FFT
FFT
0
Ultra-long FFT
× twiddle
corner
turn
FFT
1
2
3
0
Software-based
Implementations
1
2
3
MATLAB®
pMatlab
C/MPI
pMatlab XVM
Testbed
Lincoln HPC cluster (LLGrid)
•
•
•
•
80 dual CPU nodes
2.8-3.06 GHz Pentium 4 Xeon
4 GB memory
Gigabit Ethernet
HPEC 2005-17
21 Sept. 2005
MIT Lincoln Laboratory
Scalability
1 Gigapoint FFT
•
•
•
HPEC 2005-18
21 Sept. 2005
pMatlab XVM supports ultra-long FFT’s with little degradation in
performance
Maximum problem size = size of available disk space
1 TB represents a 64 G-pt double-precision, complex FFT
MIT Lincoln Laboratory
Outline
•
Background
•
FPGA-Based Hardware Technology
•
Parallel Software Technology
•
Conclusions
HPEC 2005-19
21 Sept. 2005
MIT Lincoln Laboratory
System Development Methodologies
System Prototyping
Modify word size,
precision, etc.
N
Bit-true
simulation
Satisfy
requirements?
Y
Initial design
Architecture Analysis
1 Gigapoint
Performance Verification
Compare
HPEC 2005-20
21 Sept. 2005
MIT Lincoln Laboratory
Summary
•
Presented FPGA architecture for real-time, ultra-long FFT’s
– Can implement 1 G-pt FFT with smaller FFT’s
– Use SYCO processor to implement FFT’s
– Developed real-time corner turn architecture
•
Future
– Develop a universal ultra-long FFT architecture
– Allows multiple configurations in same hardware
1 G-Pt FFT
100 M-Pt FFT X 10 channels
– Application-specific precision and dynamic analysis
HPEC 2005-21
21 Sept. 2005
MIT Lincoln Laboratory
Summary
•
Presented parallel software architecture for ultra-long FFT’s
– Added out-of-core capability to pMatlab
– Supports ultra-long FFT’s with little degradation in
performance
– Demonstrated 64 G-pt FFT (1 TB)
•
Future
– Expand cluster to enable 64 T-pt FFT (1 PB)
HPEC 2005-22
21 Sept. 2005
MIT Lincoln Laboratory
Acknowledgements
•
•
•
•
•
•
•
•
HPEC 2005-23
21 Sept. 2005
Ken Senne
Gerry Banner
Leslie Alger
Bob Bond
Cy Chan
Hector Chan
Tim Currie
Preston Jackson
•
•
•
•
•
•
•
•
Andy McCabe
Peter Michaleas
Michael Moore
Charles Rader
Albert Reuther
Jonathan Scalera
Nadya Travinin
Edmund Wong
MIT Lincoln Laboratory
Fly UP