GPU VSIPL: High Performance VSIPL Implementation for GPUs
Andrew Kerr, Dan Campbell*, Mark Richards, Mike Davis
[email protected], [email protected], [email protected], [email protected]
High Performance Embedded Computing (HPEC) Workshop, 24 September 2008
Distribution Statement (A): Approved for public release; distribution is unlimited.
This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C7724. The opinions expressed are those of the authors.

Signal Processing on Graphics Processors
• GPUs' original role: turn 3-D polygons into 2-D pixels…
• …which also makes them a cheap & plentiful source of FLOPs
  • Leverages volume & competition in the entertainment industry
• Their primary role is highly parallel and very regular
• Typically a <$500 drop-in addition to a standard PC
• Outstripping CPU capacity, and growing more quickly
  • Peak theoretical throughput ~1 TFLOP
  • Power draw: GTX 280 = 200 W, Q6600 = 100 W
• The market applications still improve with more parallelism, so growth continues

GPU/CPU Performance Growth
[Chart: CPU & GPU capacity growth in GFLOPS (log scale) for ATI, NVIDIA, Intel x86, and a "Moore's Law" trend line; data labels range from 12 to 1171 GFLOPS, with the GPU curves far above the CPU curves.]

GPGPU (Old) Concept of Operations
• "A = B + C" as a fragment program:

    void main(float2 tc0 : TEXCOORD0,
              out float4 col : COLOR,
              uniform samplerRECT B,
              uniform samplerRECT C)
    {
        col = texRECT(B, tc0) + texRECT(C, tc0);
    }

• Arrays are stored as textures
• Render a polygon with the same pixel dimensions as the output texture
• Execute with a fragment program to perform the desired calculation
• Move data from the output buffer to the desired texture
• Now we have compute-centric programming models…
• …but they require expertise to fully exploit

VSIPL – Vector Signal Image Processing Library
• Portable API for linear algebra, image & signal processing
• Originally sponsored by DARPA in the mid ’90s
• Targeted embedded processors – portability was the primary aim
• Open standard, Forum-based
• Initial API approved April 2000
• Functional coverage
  • Vector, Matrix, Tensor
  • Basic math operations, linear algebra, solvers, FFT, FIR/IIR, bookkeeping, etc.

VSIPL & GPU: Well Matched
• VSIPL is great for exploiting GPUs
  • High-level API with good coverage for dense linear algebra
  • Allows non-experts to benefit from hero programmers
  • Explicit memory access controls
  • API precision flexibility
• GPUs are great for VSIPL
  • Improve prototyping by speeding algorithm testing
  • A cheap addition gives more engineers access to HPC
  • Large speedups without explicit parallelism at the application level

GPU-VSIPL Implementation
• Full, compliant implementation of the VSIPL Core Lite profile
• Fully encapsulated CUDA backend
• Leverages the CUFFT library
• All VSIPL functions accelerated
• Core Lite profile:
  • Single-precision floating point, some basic integer
  • Vector & scalar, complex & real support
  • Basic elementwise, FFT, FIR, histogram, RNG, and support functions
  • Full list: http://www.vsipl.org/coreliteprofile.pdf
• Also some matrix support, including vsip_fftm_f

CUDA Programming & Optimization
[Diagram: CUDA programming model: a grid of thread blocks; each block contains threads, datapaths, a register file, and shared memory; blocks read and write device memory; the CPU dispatches kernels and transfers data from host memory.]
CUDA optimization considerations (illustrated in the sketch below):
• Maximize occupancy to hide memory latency
• Keep lots of threads in flight
• Carefully manage memory access to allow coalescing & avoid conflicts
• Avoid slow operations (e.g., integer multiply for indexing)
• Minimize synchronization barriers
• Careful loop unrolling
• Hoist loop invariants
• Reduce register use for greater occupancy
• See "GPU Performance Assessment with the HPEC Challenge" – Thursday PM
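As a concrete illustration of the points above (not taken from the presentation, and not GPU VSIPL's internal code), here is a minimal CUDA sketch of the same elementwise A = B + C that the older fragment-program example computed. One thread handles one element, and consecutive threads touch consecutive addresses so that global-memory loads and stores can coalesce; the block size of 256 is an assumed, typical choice.

    // Minimal CUDA sketch of elementwise A = B + C (illustrative only).
    // One thread per element; consecutive threads access consecutive
    // addresses, so global-memory reads and writes can be coalesced.
    __global__ void vadd_kernel(const float *B, const float *C, float *A, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            A[i] = B[i] + C[i];
    }

    // Host-side launch for device pointers dB, dC, dA of length n.
    // A block size of 256 keeps many threads in flight to hide memory
    // latency; the grid size rounds up so every element is covered.
    void vadd(const float *dB, const float *dC, float *dA, int n)
    {
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vadd_kernel<<<blocks, threadsPerBlock>>>(dB, dC, dA, n);
    }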
GPU VSIPL Speedup: Unary
[Chart: speedup of nVidia 8800GTX vs. Intel Q6600 (log scale, 0.1x to 1000x) as a function of vector length for vcos, vexp, vlog, vsqrt, vmag, vsq, and cvconj; labeled speedups range from 20x to 320x.]

GPU VSIPL Speedup: Binary
[Chart: speedup of nVidia 8800GTX vs. Intel Q6600 (log scale, 0.01x to 100x) as a function of vector length for vmul, vadd, vdiv, vsub, cvadd, cvsub, cvmul, and cvjmul; labeled speedups range from 25x to 40x.]

GPU VSIPL Speedup: FFT
[Chart: speedup of nVidia 8800GTX vs. Intel Q6600 (log scale, 0.01x to 100x) as a function of vector length for real-to-complex and complex-to-complex FFTs; labeled speedups are 39x and 83x.]

GPU VSIPL Speedup: FIR
[Chart: runtime (ms) and speedup for FIR filtering with filter length 1024, nVidia 8800GTX vs. Intel Q6600, over input vector lengths 32768 to 2097152; series are GPU (r), GPU (c), TASP (r), TASP (c), Speedup (r), and Speedup (c), with a labeled peak speedup of 157x.]

Application Example: Range Doppler Map
• Simple range/Doppler data visualization application demo
• Intro app for a new VSIPL programmer
• 59x speedup, TASP to GPU-VSIPL
• No changes to source code

    Section                9800GX2 time (ms)   Q6600 time (ms)   Speedup
    Admit                       8.88                  0               0
    Baseband                   67.77               1872.3            28
    Zeropad                    23.18                110.71            5
    Fast time FFT              47.25               5696.3           121
    Multiply                    8.11                 33.92            4
    Fast time FFT⁻¹            48.59               5729.04          118
    Slow time FFT, 2x CT       12.89               3387             263
    log10 |·|²                 22.2                 470.15           21
    Release                    54.65                  0               0
    Total                     293.52              17299.42           59

GPU-VSIPL: Future Plans
• Expand matrix support
• Move toward the full Core profile
  • More linear algebra/solvers
• VSIPL++
• Double precision support

Conclusions
• GPUs are fast, cheap signal processors
• VSIPL is a portable, intuitive means to exploit GPUs
• GPU-VSIPL allows easy access to GPU performance without becoming an expert CUDA/GPU programmer
• 10-100x speed improvement possible with no code change
• Not yet released, but unsupported previews may show up at: http://gpu-vsipl.gtri.gatech.edu
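To make the "no code change" point concrete, the following is a minimal sketch of an ordinary VSIPL program written against the standard VSIPL C API; it is not taken from the presentation, and the specific functions, sizes, and values are illustrative assumptions. Because the CUDA backend is fully encapsulated behind the API, the same source should link against either the TASP reference implementation or GPU VSIPL unchanged.

    /* Minimal sketch of a plain VSIPL program using the standard C API.
       Nothing here is GPU-specific: linked against GPU VSIPL the same
       calls run on the GPU; linked against TASP VSIPL, on the CPU. */
    #include <vsip.h>

    int main(void)
    {
        vsip_init((void *)0);
        {
            const vsip_length n = 1 << 20;

            /* Vector views with library-managed storage. */
            vsip_vview_f *a = vsip_vcreate_f(n, VSIP_MEM_NONE);
            vsip_vview_f *b = vsip_vcreate_f(n, VSIP_MEM_NONE);
            vsip_vview_f *c = vsip_vcreate_f(n, VSIP_MEM_NONE);

            vsip_vfill_f(1.0f, a);         /* a[i] = 1.0       */
            vsip_vramp_f(0.0f, 0.5f, b);   /* b[i] = 0.5 * i   */

            vsip_vadd_f(a, b, c);          /* c = a + b (elementwise)     */
            vsip_vsqrt_f(c, a);            /* a = sqrt(c), reusing view a */

            vsip_valldestroy_f(a);
            vsip_valldestroy_f(b);
            vsip_valldestroy_f(c);
        }
        vsip_finalize((void *)0);
        return 0;
    }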