High-Performance FPGA-Based QR Decomposition

by user

on 15 сентября 2016

Category: Documents

>> Downloads: 2

views

Report

Comments

Description

Download High-Performance FPGA-Based QR Decomposition

Transcript

High-Performance FPGA-Based QR Decomposition

High-Performance FPGA-Based
QR Decomposition
Huy Nguyen, James Haupt, Michael Eskowitz, Birol Bekirov,
Jonathan Scalera, Thomas Anderson,
Michael Vai, and Kenneth Teitelbaum
HPEC 2005
September 20, 2005
* This work is sponsored by the Department of the Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations,
conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
HPEC 2005 -1
HTN 10/28/2005
MIT Lincoln Laboratory
Outline
HPEC 2005 -2
HTN 10/28/2005
•
QR Computation
•
C-Language Implementation on FPGA
•
Pipelined Linear Array on FPGA
•
Summary
MIT Lincoln Laboratory
Adaptive Beamforming in Radar Processing
Channels X
A/D
Subband
Filter Bank
Adaptive
Beamform
4
16
Real-Time Frontend Processing
GMTI
processing
Wideband
Active
Radar
Steering
vectors
V
Weights W
Beams Y
Beamforming
Adaptive Weight
Computation
Double
r = QR(X)
backW = inv(XH * X) * V
substitution
= inv(r) * (inv(r)H * V)
Y = WH * X
Adaptive Beamforming
Airborne
Electronically
Steered
Array
Backend
Processing
& Recording
Processor
Subband Filtering and
Adaptive Beamforming
HPEC 2005 -3
HTN 10/28/2005
Non-adaptive beam patterns
Wideband Interference
Adaptive beam patterns
Interference Cancellation
Target
Interference
MIT Lincoln Laboratory
Adaptive Weight Computation
McWhirter Array
Input sample vectors
x0 (n)
x1 (n)
x3 (n)
x2 (n )
0
x
(
n
)
x
(
n
)
x
(
n
)
x0,0 (n)
0,1
0, 2
0, 3
r00 (n)
r02 (n)
r01(n)
x1,1 (n)
c0 (n) s (n)
0
c1 (n)
x1, 2 (n)
r22 (n)
- QR, Back-substitution, Beamforming can be
performed using the same array
r13 (n)
x2,3 (n)
- Givens rotation in “voltage-domain” requires
smaller word size than “power-domain”
r23 (n)
c (n )
2
Results ri j(n)
s2 (n)
computed and
stored at array nodes
x3,3 (n)
%
r33 (n)
Z
AMF / ACE
Challenges
Weights, Beam = Z
⎧
rij (0) = ⎨ α i = j
⎩ 0 i≠ j
- Entire array does not fit in FPGA
rii (n) = rii2 (n − 1) + xi ,i (n)
- Efficiency at most 50%
2
rii (n − 1)
rii (n)
x (n)
si (n) = i ,i
ri ,i (n)
sv = s when control = 1
x _ out = x _ in + s • conj( sv )
Z = xH r-1 r-1H v
ci (n) =
rij (n) = ci r (n − 1) + si∗ xi , j (n)
xi +1, j (n ) = − si rij (n − 1) + ci xi , j (n )
HPEC 2005 -4
HTN 10/28/2005
- Systolic, high-speed neighbor communication
x1,3 (n)
x2,2 (n)
s1 (n)
0
r03 (n)
r12 (n)
r11(n)
0
Desirable Properties of McWhirter Array
x _ out = x _ in + s • conj( s )
AMF = vH r-1 r-1H v
ACE = xH r-1 r-1H x
- AMF: Adaptive Matched Filter
- ACE: Adaptive Coherence Estimate
- QR and BackSolve need different dynamic ranges
- 1/sqrt in boundary nodes difficult to implement
- Op-count higher than Squared Givens and
House-Holder methods
MIT Lincoln Laboratory
Computation Platform
• On-board computation eliminates I/O latency
• FPGAs enable high computational throughput
QR
BF
FFT
FFT
PPF
PPF
AD
MEMIO
AD
AD
Single Board
Computer
Processing 200 Gops
Internal bw 15 GBps
Digital i/o
5 GBps dplx
SBC comm 50 MBps dplx
Scalable
x4
Processor Board
A/D Modules
HPEC 2005 -5
HTN 10/28/2005
10,000
Very Large Scale Integration (VLSI)
1,000
SBC
Comm
CTRL
AD
and power efficiency
GOPS / Liter
QR
BF
Field Programmable Gate Arrays (FPGA)
100
Programmable Processors
10
1
0.1
0.001
0.01
0.1
1
10
GOPS / Watt
100
Linear Array
MicroBlaze Processor
MIT Lincoln Laboratory
1000
Outline
•
QR Computation
•
C-Language Implementation on FPGA
– Soft-macro Embedded Processor inside FPGA
– Measured Performance
HPEC 2005 -6
HTN 10/28/2005
•
Pipelined Linear Array on FPGA
•
Summary
MIT Lincoln Laboratory
Software Approach to QR Decomposition
- C program runs on Xilinx’s 32-bit MicroBlaze processor macro core
- Software performs Givens-based QR and back substitution to compute the weights
- QR program code resident on FPGA (450 lines of C code)
- Maximum clock rate 100 MHz on Virtex-II
- Floating-point library and floating-point unit available
- Development time 2 – 3 months, flexible, suitable for rapid prototyping
Hardware domain
HDL
Beam
former
Synthesize
Place & Route
HPEC 2005 -7
HTN 10/28/2005
QR
C Code
Macro
μBlaze
100 FLOPS peak
10 MFLOPS measured
μBlaze
One MicroBlaze takes
about 10% of Virtex-II 8000
Software domain
Beam
former
Compile
Link
Flt-pt
Library
.exe
Debugger
MIT Lincoln Laboratory
Measured Performance
•
•
•
Optimization results in 100x speedup for the µBlaze
Effective “ops”
– Inclusion of floating-point unit
per sec
– 1/sqrt with Newton’s method rather than library call
– In-line code roll up
# Processors to meet
Time per QR
system requirements
– Integer-to-fltpt conversion and vice versa
– Next power of 2
Performance measurement
– C code with compiler optimization ON
– Execution time measured on µBlaze,
G4, and Pentium 4
– Effectively 10 to 20 clocks per “op”
– Performance scales with clock rate
– Multiple µBlaze processors needed to
meet system time budget
Training
5 x chans
µBlaze
100 MHz
G4
733 MHz
Pentium 4
3.6 GHz
4
Chans
(4 x 20)
759 µs,
5 MFLOPS,
3 units
106 µs,
38 MFLOPS,
1 unit
17 µs,
200 FLOPS,
1 unit
8
Chans
(8 x 40)
3,833 µs,
7.5 MFLOPS,
12 units
602 µs,
48 MFLOPS,
3 units
90 µs,
300 FLOPS,
1 unit
16
Chans
(16 x 80)
22,180 µs,
9.8 MFLOPS,
NA
3,890 µs,
56 MFLOPS,
12 units
550 µs,
400 FLOPS,
2 units
Verification was simple, most time spent on speed optimization
– Excellent visibility into variable space inside the FPGA
– Would be excellent wrapper for verifying / customizing
high-performance IP cores
HPEC 2005 -8
HTN 10/28/2005
MIT Lincoln Laboratory
*QR program not hand-coded for G4 Altivec or Pentium MMX or MicroBlaze
Outline
•
QR Computation
•
C-Language Implementation on FPGA
•
Pipelined Linear Array on FPGA
•
HPEC 2005 -9
HTN 10/28/2005
–
Linear array for Givens computation
–
Weight computation
–
Threshold Floating-point
–
1/sqrt() using polynomial approximation
–
Pipelining
Summary
MIT Lincoln Laboratory
Linear Array Structure
- Linear Array is obtained by folding McWhirter array [Walke et al]
- One row of schedule is implemented in the FPGA
- 100% efficiency, less hardware resources
1, 1 1, 2
2, 2
1, 3
1, 4
1, 5
1, 1
3, 4
2, 5
2, 3
2, 4 2, 5
4, 4
1, 2
3, 5
3, 3
3, 4 3, 5
2, 2
4, 5
1, 3
4, 4 4, 5
5, 5
2, 3
1, 4
5, 5
3, 3
1, 5
2, 4
50%
idle
Highlights
3 nodes
5 nodes
Time frame
k
- Floating-point
- No weight computation
1, 3
1, 4
1, 5
1, 1
3, 4
2, 5
2, 2
2, 3
2, 4 2, 5
4, 4
1, 2
3, 5
3, 3
3, 4 3, 5
2, 2
4, 5
1, 3
4, 4 4, 5
5, 5
2, 3
1, 4
Legend
5, 5
3, 3
1, 5
2, 4
Sample k-1
Sample k
Sample k+1
HPEC 2005 -10
HTN 10/28/2005
Linear Array
- Squared Givens
- Difficult op: Division
1, 1 1, 2
Triangular Array
Walke et al
Time frame
k+1
This Presentation
- Givens rotation
- Difficult op: 1/sqrt()
- “Threshold-fltpt”
- Weight computation
MIT Lincoln Laboratory
Weight Computation
- Weights, beam outputs, AMF, ACE can be computed by adding a Beamforming mode
and two additional output nodes
- In Beamforming mode, array r is used in computation but not updated
x0 (n) x1(n) x2 (n) x3 (n)
1, 1
1, 2
c0 (n)
s0 (n)
1, 3
2, 3
2, 2
c1(n)
s1(n)
1, 4
2, 4
3, 3
c2 (n)
s2 (n)
Array ri j(n)
stored at array
nodes
3, 4
0
0
0
1, 5
1, 6
1, 7
2, 5
3, 5
2, 6
3, 6
5-node Implementation
Output
nodes
2, 7
3, 7
Z(k-1)
c3 (n)
4, 4
s3 (n)
4, 5
5, 5
4, 6
5, 6
4, 7
5, 7
Z AMF/ACE
sv = s when control = 1
x _ out = x _ in + s • conj( sv )
H
-1
Z=x r r
-1H
v
Weights, Beam = Z
HPEC 2005 -11
HTN 10/28/2005
x _ out = x _ in + s • conj( s )
H -1
AMF = v r r
-1H
v
ACE = xH r-1 r-1H x
Z(k)
3, 6
3, 7
1, 1
3, 4
2, 5
1, 6
1, 7
4, 4
1, 2
3, 5
4, 6
4, 7
2, 2
4, 5
1, 3
2, 6
2, 7
5, 5
2, 3
1, 4
5, 6
5, 7
3, 3
1, 5
2, 4
3, 6
3, 7
1, 1
3, 4
2, 5
1, 6
1, 7
4, 4
1, 2
3, 5
4, 6
4, 7
2, 2
4, 5
1, 3
2, 6
2, 7
5, 5
2, 3
1, 4
5, 6
5, 7
3, 3
1, 5
2, 4
5 channels,
4 steering vectors,
4 weight vectors
Input to Array
X(k)
X(k+1)
X(1)
…
X(k)
X(k+1)
…
X(N)
0
V(1)
[Identity
Matrix]
V(2)
[I Matrix]
V(3)
[I Matrix]
V(4)
[I Matrix]
0
Time
QR
mode
Beam
mode
W(1)
W(2)
W(3)
W(4)
Weights out
MIT Lincoln Laboratory
Threshold Floating Point
r_real
r_sq
1/sqrt()
x_real
c
r
exponent
normalization
and realignment
1/r
r_old
s_real
Fixed-pt
r_sq
Input
XMax 16
Growth
3 16
6
s_QR
s_Beam
c
r_im_old
r_real_old r_im_old
r_real_old
s_im s_real
s_real s_im
x_im
c
c
-
s_im
x_out_real
19
*29* Exp 3b
38
Threshold floating-point shifts the window by 9
bits when register content crosses threshold
value(s).
•
The number of thresholds is determined by
shift size. Binary threshold (1/2 word size)
results in normalizing logic simpler than
regular floating-point implementation.
•
Word size is a function of dynamic range and
application required precision. Most of the
time, it is smaller than fixed-pt implementation.
MIT Lincoln Laboratory
Exp 3b
32
19
*29*
*29* Exp 3b
20
2 x 20
19
*29* Exp 3b
19
*29* Exp 3b
x_out_im
•
Input
XMax 16
Guard
3
r_im
-
Dual-use variable ‘s’ has different dynamic
ranges in QR mode and Beam mode.
HPEC 2005 -12
HTN 10/28/2005
x_real
x_im
Threshold fltpt
32
1/r
•
-
• red = value stored from previous QR iteration
• blue = node outputs
r_sq_old
x, r, v
Internal Node
s_real x_im
x_real
x_real
r_im_old
s_im
s_im
x_im
s_real
r_real_oldc
c
Boundary Node
x_im
x_real
Polynomial Approximation to 1/sqrt()
- First-order approximation works well for 1/sqrt in [1, 2]
- Match with double precision computation within +/- lsb/2
- 16-bit output: need two 256 x 18 tables, one 18 x 18 multiplier, and one 18b adder
- 24-bit output: need two 4k x 26 tables, one 26 x 26 multiplier, and one 26b adder
- Table sizes appropriate for FPGA implementation
- No iteration required
Polynomial Interpolation
Normalized to [1, 2]
Y’ = A * X’ + B
X = 1. x1x2 … xmxm+1 … x31x32
X’’
X’
A B
Y =
[A, B] = polyfit(X, 1/sqrt(X))
1
X
X’
Y’ = A * X’ + B
X
HPEC 2005 -13
HTN 10/28/2005
MIT Lincoln Laboratory
Pipelining Illustrated for 5 Levels
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7 3,6 1,11 3,4 2,5
5 clks
per step 1,7 1,6 4,4 1,2 3,5
4,7
2,7
5,7
3,7
1,7
4,7
50
2,7
clks
5,7
3,7
Non-pipelined Schedule
1,7
- Takes 5 clocks to compute
4,7
- Wait until done before
2,7
issuing new input sample
5,7
- 2 input samples in 50 clks
3,7
1,7
Pipelined Schedule
4,7
- Introduce one input every clk
2,7
- Spread dependent
5,7
computation out (> 5 clks)
3,7
- Find schedule that is 100% efficient 1,7
4,7
2,7
5,7
HPEC 2005 -14
HTN 10/28/2005
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
MIT Lincoln Laboratory
2,5
3,5
1,3
1,4
2,4
Pipelining Illustrated for 5 Levels
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7 3,6 1,11 3,4 2,5
5 clks
per step 1,7 1,6 4,4 1,2 3,5
50
clks
Non-pipelined Schedule
- Takes 5 clocks to compute
- Wait until done before
issuing new input sample
- 2 input samples in 50 clks
Pipelined Schedule
- Introduce one input every clk
- Spread dependent
computation out (> 5 clks)
HPEC 2005 -15
HTN 10/28/2005
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
2,2
5,5
3,3
1,12
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
MIT Lincoln Laboratory
2,5
3,5
1,3
1,4
2,4
Pipelining Illustrated for 5 Levels
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7 3,6 1,11 3,4 2,5
5 clks
per step 1,7 1,6 4,4 1,2 3,5
4,7
2,7
5,7
3,7
1,7
4,7
50
2,7
clks
5,7
3,7
Non-pipelined Schedule
1,7
- Takes 5 clocks to compute
4,7
- Wait until done before
2,7
issuing new input sample
5,7
- 2 input samples in 50 clks
3,7
1,7
Pipelined Schedule
4,7
- Introduce one input every clk
2,7
- Spread dependent
5,7
computation out (> 5 clks)
3,7
- Find schedule that is 100% efficient 1,7
4,7
2,7
5,7
HPEC 2005 -16
HTN 10/28/2005
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
2,2
5,5
3,3
1,12
4,4
2,2
5,5
3,3
1,13
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
MIT Lincoln Laboratory
2,5
3,5
1,3
1,4
2,4
Pipelining Illustrated for 5 Levels
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7 3,6 1,11 3,4 2,5
5 clks
per step 1,7 1,6 4,4 1,2 3,5
4,7
2,7
5,7
3,7
1,7
4,7
50
2,7
clks
5,7
3,7
Non-pipelined Schedule
1,7
- Takes 5 clocks to compute
4,7
- Wait until done before
2,7
issuing new input sample
5,7
- 2 input samples in 50 clks
3,7
1,7
Pipelined Schedule
4,7
- Introduce one input every clk
2,7
- Spread dependent
5,7
computation out (> 5 clks)
3,7
- Find schedule that is 100% efficient 1,7
4,7
2,7
5,7
HPEC 2005 -17
HTN 10/28/2005
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
2,2
5,5
3,3
1,12
4,4
2,2
5,5
3,3
1,13
4,4
2,2
5,5
3,3
1,14
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
MIT Lincoln Laboratory
2,5
3,5
1,3
1,4
2,4
Pipelining Illustrated for 5 Levels
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,1
4,4
2,2
5,5
3,3
1,1
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7 3,6 1,11 3,4 2,5
5 clks
per step 1,7 1,6 4,4 1,2 3,5
4,7
2,7
5,7
3,7
1,7
4,7
50
2,7
clks
5,7
3,7
Non-pipelined Schedule
1,7
- Takes 5 clocks to compute
4,7
- Wait until done before
2,7
issuing new input sample
5,7
- 2 input samples in 50 clks
3,7
1,7
Pipelined Schedule
4,7
- Introduce one input every clk
2,7
- Spread dependent
5,7
computation out (> 5 clks)
3,7
- Find schedule that is 100% efficient 1,7
- 10 input samples in 50 clks
4,7
- Non-trivial when number of channel 2,7
is different than pipe depth
5,7
HPEC 2005 -18
HTN 10/28/2005
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
2,2
5,5
3,3
1,12
4,4
2,2
5,5
3,3
1,13
4,4
2,2
5,5
3,3
1,14
4,4
2,2
5,5
3,3
1,15
4,4
2,2
5,5
3,3
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
3,6
1,6
4,6
2,6
5,6
1,16
4,4
2,2
5,5
3,3
1,17
4,4
2,2
5,5
3,3
1,18
4,4
2,2
5,5
3,3
1,19
4,4
2,2
5,5
3,3
1,110
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
3,4
1,2
4,5
2,3
1,5
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
2,5
3,5
1,3
1,4
2,4
Array 100% efficient
after 10 * 5 = 50 clocks
3,7
1,7
4,7
2,7
5,7
3,6
1,6
4,6
2,6
5,6
1,111
4,4
2,2
5,5
3,3
3,4
1,2
4,5
2,3
1,5
MIT Lincoln Laboratory
2,5
3,5
1,3
1,4
2,4
Array Performance
Training 5 x channels
Weight computation for 1 Beam
Virtex-II 8000
80 MHz clock (not fully optimized)
No pipeline
Pipeline by 16
4
Chans
(4 x 20)
300 ops per sample
20 samples for QR
6 samples flushing per beam
1 Msps
300 MOPS
26 us refresh
25% Resources
[16 Msps estimated]
[4.8 GOPS]
[1.7 us]
35% Resources
8
Chans
(8 x 40)
900 ops per sample
40 samples per QR
10 samples flushing per beam
0.5 Msps
500 MOPS
90 us refresh
35% Resources
[8 Msps]
[8 GOPS]
[6 us]
45% Resources
16
Chans
(16 x 80)
3060 ops per sample
80 samples per QR
18 samples flushing per beam
0.3 Msps
900 MOPS
333 us refresh
60% Resources
[6 Msps]
[14.4 GOPS]
[21 us]
70% Resources
*Op counts are for 5-chan, 9-chan, and 17-chan arrays with 1 weight computation node
*Large multipliers are implemented with 17-bit built-in multipliers and logic slices
•
•
•
Non-pipelined refresh rate is 26 us for 4 channels, and 333 us for 16 channels
Non-pipelined 16-channel array is 1.7x faster than 3.6 GHz Pentium4, 12x 700 MHz G4
Pipelined version expected to be 16 times faster
HPEC 2005 -19
HTN 10/28/2005
MIT Lincoln Laboratory
Outline
HPEC 2005 -20
HTN 10/28/2005
•
QR Computation
•
C-Language Implementation on FPGA
•
Pipelined Givens Linear Array on FPGA
•
Summary
MIT Lincoln Laboratory
Li
FP
1
0.01
100
Programmable Processors
10
1
0.1
1
10
Throughput (GOPS)
100
*6 copies of MicroBlaze can be used
in parallel if application permits
•
•
•
Field Programmable Gate Arrays (FPGA)
8
4
0.1
0.001
GOPS / Liter
ne
ar
Ar
ra
y
Li n
A
FP
G
16 channels
Better
Very Large Scale Integration (VLSI)
Pi
pe
l
in
e
d
ea
rA
rra
y
3 GH
z
m4
MHz
10
10,000
1,000
GA
100
Pen
ti u
1,000
G4 700
Time per weight refresh (us)
10,000
6 copies
MicroBlaze 100 MHz
QR & Weight Computation Summary
1000
0.1
0.001
1000
1
10
100
GOPS / Watt
*Assume 6 MicroBlaze processors in one FPGA V2-8000
*Assume minimum volume for packaging and cooling:
- FPGA, G4: 10 cm x 10 cm x 3 cm per chip ~ .3 L
- Pentium 4: 16 cm x 20 cm x 3 cm per chip ~ 1 L
0.01
0.1
Linear array throughput increases with # channels due to available capacity on chip
Linear array has best computational density (GOPS/L) and efficiency (GOPS/W)
All processors are close in computational density and efficiency
HPEC 2005 -21
HTN 10/28/2005
MIT Lincoln Laboratory
*QR program not hand-coded for G4 Altivec or Pentium MMX or MicroBlaze
Summary
•
QR Decomposition is widely used in scientific applications.
Radar adaptive beam-forming requires high-speed QR
•
Software-based MicroBlaze Embedded Microprocessor
– Single-chip implementation with multiple cores can outperforms G4
– Similar computational density and efficiency as G4 and Pentium4
*QR program not hand-coded for G4 Altivec or Pentium MMX or MicroBlaze
•
Linear array for Givens computation
– Two orders of magnitude better than programmable processors in
computational density
– One order of magnitude better than programmable processors in
computational efficiency
HPEC 2005 -22
HTN 10/28/2005
MIT Lincoln Laboratory