R-Stream: Compiler Technology for Next Generation HPEC
Reservoir Labs Inc.
Role in Tool Chain
R-Stream is a source-to-source compiler intended to augment an existing single-processor tool chain.
[Figure: tool chain. Labels: C + first-class arrays; StreamIt; Machine model for target architecture; R-Stream Compiler; C + architecture-specific library; C + Streaming Virtual Machine API. Machine model labels: Thread Proc, Proc0, Proc1, Cache, Mem0, Mem1, DMA2, GlobalMem.]

Compiler Tech. for HPEC
R-Stream compiler technology automatically maps applications to HPEC architectures with:
• Multiple processor cores
• Distributed on-chip memories w/ DMA
• Reconfigurable processors and memories
R-Stream optimizes the whole application, e.g. reducing memory traffic between kernels, unlike using a library alone.
R-Stream maps one C program to multiple targets, for faster, cheaper, more reliable development than mapping by hand.

Early Results
[Figure: example application stages]
• Time Delay & Equalization: FIR, 19 Mflop
• Adaptive Beamforming: Matrix Multiply, 6 Mflop
• Pulse Compression: FIR, 9 Mflop
• STAP weights: LQ, QR, 60 Kflop
• AB weights: LQ, QR, 60 Kflop
[Figure: schedule with overlapping pipelined iterations (iteration i); rows labeled Space/time, BF+PC+DF, Target detect. Table column label: M-Chip (Not actual).]

Prototype 2.0 Mapper
1. Transform loops for locality, determine granularity
• Goal is maximizing data that can live in local memory or local memories
• Interchange and partially fuse parallel outer loops
• Classify communications as local memory, inter-processor, or global memory
• Single-processor grains contain local memory communication
• Multi-processor grains contain communication between local memories
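The loop-transformation step of the mapper (fusing loops so that data stays in local memory) can be illustrated with a small sketch. This is not R-Stream output; the two kernels, the `tile` parameter, and the function names are hypothetical and chosen only to show why fusing a producer loop with its consumer and tiling the result shrinks the intermediate data to something that could fit in a local memory.

```python
def unfused(xs, scale, offset):
    """Two separate kernels: the intermediate array is as large as the input."""
    tmp = [x * scale for x in xs]          # kernel 1 writes a full-size buffer
    out = [t + offset for t in tmp]        # kernel 2 reads it back
    return out, len(tmp)                   # len(tmp) = intermediate footprint


def fused_tiled(xs, scale, offset, tile):
    """Fused and tiled: each tile's intermediate data is at most `tile` words."""
    out = []
    peak = 0
    for start in range(0, len(xs), tile):
        block = xs[start:start + tile]
        tmp = [x * scale for x in block]   # small enough for a local memory
        peak = max(peak, len(tmp))
        out.extend(t + offset for t in tmp)
    return out, peak


xs = list(range(10))
full_result, full_footprint = unfused(xs, 3, 1)
tiled_result, tiled_footprint = fused_tiled(xs, 3, 1, tile=4)
```

Both versions compute the same values; only the size of the live intermediate buffer differs, which is the quantity the locality transformation is trying to bound.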
Compiler Structure
Key technologies + robust infrastructure + modular interfaces
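[Table: architecture parameters, six columns; column headings not recoverable here]
(unlabeled row): 4, 16, 8, 8
FP ALUs: 8, 16, 2, 1, 8, 6
Frequency: 500, 1000, 500, 420, 1000, 250
Gflops: 16.0, 64.0, 4.0, 6.7, 64.0, 12.0
Local Memory Size (words) and Global Memory BW (bytes/ns), cell positions not recoverable: 32768, 65536, 1.6, 24576, 0.262, 8192, 4, 512 (n per proc), 1, 0.250, 0.037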
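The FP ALUs, Frequency, and Gflops rows above are mutually consistent under a simple peak-rate model. The sketch below assumes (this is not stated in the source) that the frequency is in MHz, that each FP ALU retires one floating-point operation per cycle, and that the three rows are listed in the same column order. Under those assumptions it derives the processor count each column would imply.

```python
# Values copied from the table rows; column order is assumed to match.
FP_ALUS = [8, 16, 2, 1, 8, 6]
FREQ_MHZ = [500, 1000, 500, 420, 1000, 250]
GFLOPS = [16.0, 64.0, 4.0, 6.7, 64.0, 12.0]


def implied_processors(alus, mhz, gflops):
    """Processor count implied by peak Gflops = processors * ALUs * GHz.

    Assumes one flop per ALU per cycle (an assumption, not from the source).
    """
    return gflops * 1000.0 / (alus * mhz)


implied = [
    round(implied_processors(a, f, g))
    for a, f, g in zip(FP_ALUS, FREQ_MHZ, GFLOPS)
]
```

Every column comes out within rounding distance of a whole number, which suggests the Gflops row is a peak figure of this form, but the mapping of columns to architecture names remains unknown.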
[Figure: schedule over time showing overlapped iterations (Input [i], Output [i], Input [i+1], Output [i+1], Input [i+2], Output [i+2]) with DMA loads and DMA stores, across DMA engines (DMA2 to DMA11), local memories (LM0 to LM4, LM10), and global bandwidth (GBW0, GBW1); stages Delay/equal. and Space/time]
[Plots: Stream Proc. Occupancy vs. Local memory size (KWords); Occupancy vs. Global mem. BW (words/p-flop); Occupancy vs. Number of processors]
[Figure labels: Characterize Architecture; Characterize Application; Select Morph]
[Figure: compiler structure. Stage labels: Loop Transforms and Granularity Selection; Multiprocessor Scheduling; Memory Allocation and DMA Insertion; Performance Estimation; Resource Dep. Analysis; Convert to Serial IR; Code Generation; SVM Output; Custom Output]
Supports Diverse Architectures
R-Stream prototype supports a large class of architectures
via a flexible machine model, including:
• MIT RAW
• ISI / Raytheon Monarch
• UT Austin TRIPS
• Stanford Smart Memories
Innovative 3.0 Technology
[Figure: compiler structure. Stage labels: Convert to Parallel IR; Data Dep. Analysis; Mapping]
[Plots: Occupancy vs. Number of processors; Occupancy vs. Global memory bandwidth]
[Figure: schedule over time across thread processors (ThrP0, ThrP1), DMA engines, and local memories (LM5 to LM9, LM11)]
3. Memory allocation and DMA insertion
• Tile parallel outer loop(s) around inner loop nests
• Inner loop nest produces and consumes blocks of data
• Memory allocator places these blocks in 2D space
• Tiles alternate between half-buffers within local memory
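The half-buffer scheme in the memory allocation step (tiles alternating between two halves of a local memory) is a form of double buffering. The sketch below is a sequential simulation under assumed names (`process_tiles`, `halves`, `log` are all hypothetical): on real hardware the load of tile k+1 would be a DMA transfer running concurrently with the computation on tile k, whereas here the two are simply logged in order.

```python
def process_tiles(data, tile, kernel):
    """Process `data` tile by tile using two alternating half-buffers."""
    halves = [None, None]                  # two half-buffers in one local memory
    out, log = [], []
    n_tiles = (len(data) + tile - 1) // tile

    # Prologue: bring the first tile into half-buffer 0.
    halves[0] = data[0:tile]
    log.append(("load", 0, 0))

    for k in range(n_tiles):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < n_tiles:
            # On hardware this DMA load would overlap the compute below.
            halves[nxt] = data[(k + 1) * tile:(k + 2) * tile]
            log.append(("load", k + 1, nxt))
        out.extend(kernel(x) for x in halves[cur])
        log.append(("compute", k, cur))
    return out, log


result, log = process_tiles(list(range(10)), tile=4, kernel=lambda x: 2 * x)
compute_halves = [half for op, _, half in log if op == "compute"]
```

A half-buffer is only overwritten after the computation that read it has finished, which is the invariant that makes the overlap safe.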
[Plot: Occupancy vs. Local memory size]
[Figure: compiler structure. Stage labels: StreamIt to C Converter; Extended EDG Front End; Scalar Analysis and Optimization; Morph Selection; Binary Executable]
[Figure: schedule over time for stream processors StrP0 to StrP9 and DMA, with stages Delay/equal., Space/time, BF+PC+DF, and Target detect across iterations i and i+1; memory allocation plotted over local memory address space and time]
[Table residue: column labels Imagine, RAW, Smart Mem., TRIPS; row labels Stream processors, Thread processors, Global Memory BW (words/p-flop); unplaced values 2.3, 0.016, 0.001, 0.100, 64000, 4, 4, 4]
• Target Parameter Estimation: MLE, Spline Interpolation, 80 Kflop
• Target Detection: CFAR, 3-D Grouping, 300 Kflop
Early results show efficient mappings over a wide range of architectural parameters.
• Space-time Adaptive Processing: Matrix Multiply, 6 Mflop (example shown)
• Doppler Filtering: DFT, 8 Mflop
[Figure label: Single Processor Compiler]
2. Multiprocessor scheduling
• Modulo scheduling with parallel loops and chunks of code as “operations” and processors as “ALUs”
• Overlaps computation and DMA
• Smooth spectrum from time to space multiplexed
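The multiprocessor scheduling step treats chunks of code as "operations" and processors as "ALUs" in a modulo schedule. A minimal sketch of that idea follows, using a greedy placement into a modulo reservation table. It is an illustration under stated assumptions, not R-Stream's scheduler: the stage names echo the poster, but the durations, the function name `modulo_schedule`, and the greedy strategy are all hypothetical.

```python
def modulo_schedule(ops, n_procs, ii):
    """Greedy modulo scheduler.

    ops: list of (name, duration, predecessor_names) in topological order.
    ii:  initiation interval (a new iteration starts every `ii` time units).
    Returns (start, proc) for one iteration; iteration k of an operation
    begins at start[name] + k * ii on processor proc[name].
    """
    table = [[None] * ii for _ in range(n_procs)]   # modulo reservation table
    start, proc, finish = {}, {}, {}
    for name, dur, preds in ops:
        if dur > ii:
            raise ValueError(f"{name} is longer than the initiation interval")
        earliest = max((finish[p] for p in preds), default=0)
        for t in range(earliest, earliest + ii):
            slots = [(t + d) % ii for d in range(dur)]
            free = [p for p in range(n_procs)
                    if all(table[p][s] is None for s in slots)]
            if free:
                p = free[0]
                for s in slots:
                    table[p][s] = name
                start[name], proc[name], finish[name] = t, p, t + dur
                break
        else:
            raise ValueError(f"no free slot for {name} at II={ii}")
    return start, proc


# Hypothetical durations; only the stage names come from the poster.
STAGES = [
    ("delay_eq", 2, []),
    ("beamform", 1, ["delay_eq"]),
    ("pulse_comp", 1, ["beamform"]),
    ("doppler", 2, ["pulse_comp"]),
]
start, proc = modulo_schedule(STAGES, n_procs=2, ii=3)
start_1p, proc_1p = modulo_schedule(STAGES, n_procs=1, ii=6)
```

With two processors and an initiation interval of 3, the stages are split across processors and successive iterations overlap (space multiplexing); with one processor and an interval of 6, the same stages share a single processor in turn (time multiplexing). Varying `n_procs` and `ii` moves along the spectrum the third bullet describes.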
R-Stream prototype 3.0, currently in development, will produce even more efficient mappings for a wider range of applications by leveraging:
• SRE-based internal representation to eliminate false dependences
• Affine partitioning framework to discover maximum degrees of parallelism in application
• Unified/constraint-based mapping to avoid phase-ordering problems
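To make the notion of a false dependence concrete, the sketch below uses scalar expansion, a simpler and well-known technique than the SRE-based representation named above (which the poster does not detail). In the first function, every iteration writes the same temporary `t`, so iterations appear ordered even though no value flows between them. Giving each iteration its own storage removes that artificial ordering. Function names are hypothetical.

```python
def with_shared_temp(a, b):
    """Every iteration reuses `t`: storage reuse creates false dependences."""
    out = [0] * len(a)
    t = 0
    for i in range(len(a)):
        t = a[i] + b[i]        # overwrites the previous iteration's t
        out[i] = t * t         # the only true dependence is within the iteration
    return out


def with_expanded_temp(a, b):
    """Each iteration owns t[i], so iterations can run in any order."""
    n = len(a)
    t = [0] * n
    out = [0] * n
    for i in reversed(range(n)):   # reversed order to show independence
        t[i] = a[i] + b[i]
        out[i] = t[i] * t[i]
    return out


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
shared = with_shared_temp(a, b)
expanded = with_expanded_temp(a, b)
```

Once the false dependences are gone, the remaining true dependences are what a parallelizing framework such as affine partitioning has to respect.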