R-Stream: Compiler Technology for Next Generation HPEC
Reservoir Labs Inc.

Role in Tool Chain

R-Stream is a source-to-source compiler intended to augment an existing single-processor tool chain.

[Figure: tool chain. Labels: C + first-class arrays; StreamIt; Machine model for target architecture; R-Stream Compiler; C + architecture-specific library; C + Streaming Virtual Machine API]

Compiler Tech. for HPEC

R-Stream compiler technology automatically maps applications to HPEC architectures with:
• Multiple processor cores
• Distributed on-chip memories w/ DMA
• Reconfigurable processors and memories

R-Stream optimizes the whole application, e.g. reducing memory traffic between kernels, unlike using a library alone.

R-Stream maps one C program to multiple targets, for faster, cheaper, more reliable development than mapping by hand.

[Figure: machine model. Labels: Thread Proc; Proc0; Proc1; Cache; Mem0; Mem1; DMA2; GlobalMem]

Early Results

[Figure: application kernels. STAP weights (LQ, QR) 60 Kflop; AB weights (LQ, QR) 60 Kflop; Time Delay & Equalization (FIR) 19 Mflop; Adaptive Beamforming (Matrix Multiply) 6 Mflop; Pulse Compression (FIR) 9 Mflop. Schedule labels: Overlapping pipelined iterations; Delay/equal.; Space/time; BF+PC+DF; Target detect]

[Table: architecture parameters, six columns; one column label is M-Chip (Not actual). FP ALUs: 8, 16, 2, 1, 8, 6. Frequency: 500, 1000, 500, 420, 1000, 250. Gflops: 16.0, 64.0, 4.0, 6.7, 64.0, 12.0. Further rows: Local Memory Size (words); Global Memory BW (bytes/ns)]

Compiler Structure

Key technologies + Robust infrastructure + Modular interfaces

Prototype 2.0 Mapper

1. Transform loops for locality, determine granularity
• Goal is maximizing data that can live in local memory or local memories
• Interchange and partially fuse parallel outer loops
• Classify communications as local memory, inter-processor, or global memory
• Single-processor grains contain local memory communication
• Multi-processor grains contain communication between local memories
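The loop transformation step can be illustrated with a small hand-written example. This is only a sketch of the general idea (tiling two loops and fusing just their tile loops so the intermediate data fits in a small buffer), not R-Stream output; the function names, the `TILE` size and the scale/offset kernels are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64 /* hypothetical tile size, assumed small enough for local memory */

/* Unfused reference: tmp[] is a full-size intermediate array, so every
   element makes a round trip through the large (global) memory. */
void scale_offset_unfused(const float *in, float *tmp, float *out,
                          size_t n, float gain, float bias)
{
    assert(n == 0 || (in && tmp && out));
    for (size_t i = 0; i < n; i++)
        tmp[i] = gain * in[i];
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + bias;
}

/* Partially fused: the two loops share one outer loop over tiles, while the
   inner loops stay separate. The intermediate shrinks to TILE elements. */
void scale_offset_fused(const float *in, float *out,
                        size_t n, float gain, float bias)
{
    float tmp[TILE]; /* stands in for a buffer held in local memory */
    assert(n == 0 || (in && out));
    for (size_t t = 0; t < n; t += TILE) {
        size_t len = (n - t < TILE) ? n - t : TILE;
        for (size_t i = 0; i < len; i++)
            tmp[i] = gain * in[t + i];
        for (size_t i = 0; i < len; i++)
            out[t + i] = tmp[i] + bias;
    }
}
```

Both functions compute the same result; the fused version only changes where the intermediate values live. The outer tile loop carries no dependence between iterations, so it is also the loop that could be split across processors.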
[Figure: local memory half-buffers over time. Labels: Input [i], Input [i+1], Input [i+2]; Output [i], Output [i+1], Output [i+2]; DMA load; DMA store; Time]

[Figure: morph selection. Labels: Characterize Architecture; Characterize Application; Select Morph]

[Figure: schedule and memory allocation trace over time. Resources shown include DMA2 to DMA11, LM0 to LM4, LM10, GBW0 and GBW1; blocks are named after the data they hold (mv-inputs, mv-equalized, mv-beams, mv-compressed, mv-dopplers, mv-uncluttered)]

[Plot axes: Stream Proc. Occupancy; Local memory size (KWords); Global mem. BW (words/p-flop)]
[Figure: Occupancy vs. Number of processors (1, 2, 4, 8, 16, 32)]

[Figure: compiler structure. Labels: Multiprocessor Scheduling; Memory Allocation and DMA Insertion; Performance Estimation; Loop Transforms and Granularity Selection; Resource Dep. Analysis; Convert to Serial IR; SVM Output; Code Generation; Custom Output; Mapping; Data Dep. Analysis; Convert to Parallel IR; Innovative 3.0 Technology]

Supports Diverse Architectures

R-Stream prototype supports a large class of architectures via a flexible machine model, including:
• MIT RAW
• ISI / Raytheon Monarch
• UT Austin TRIPS
• Stanford Smart Memories
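The poster does not show how the machine model is represented. As a rough sketch only, a target could be described by a handful of parameters like those in the architecture table (FP ALUs, frequency, local memory size, global memory bandwidth). The struct, its field names, the MHz unit and the assumption of one flop per FP ALU per cycle are all hypothetical, not R-Stream's actual format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical machine description; field names are illustrative only. */
typedef struct {
    int    processors;              /* number of processor cores */
    int    fp_alus_per_processor;   /* floating-point ALUs in each core */
    double frequency_mhz;           /* clock rate, assumed to be in MHz */
    long   local_memory_words;      /* local memory size per processor */
    double global_bw_bytes_per_ns;  /* global memory bandwidth */
} MachineModel;

/* Peak rate, assuming every FP ALU retires one flop per cycle. */
double peak_gflops(const MachineModel *m)
{
    assert(m != NULL);
    return m->processors * m->fp_alus_per_processor * m->frequency_mhz / 1000.0;
}
```

For example, a target with an assumed 16 processors, 1 FP ALU each and a 420 MHz clock gives 16 × 1 × 420 / 1000 = 6.72 Gflops under these assumptions, in line with the 6.7 Gflops entry in the table.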
[Figure: Occupancy vs. Global memory bandwidth (words/p-flop); Occupancy vs. Local memory size]

2. Multiprocessor scheduling
• Modulo scheduling with parallel loops and chunks of code as “operations” and processors as “ALUs”
• Overlaps computation and DMA
• Smooth spectrum from time to space multiplexed

[Figure: schedule. Labels: Time; StrP0 to StrP9; ThrP2, ThrP3; DMA; Delay/equal.; Space/time; BF+PC+DF; Target detect; iterations i-1, i+1]

3. Memory allocation and DMA insertion
• Tile parallel outer loop(s) around inner loopnests
• Inner loopnest produces and consumes blocks of data
• Memory allocator places these blocks in 2D space
• Tiles alternate between half-buffers within local memory

[Figure: local memory address space vs. time. Labels: Local memories; DMA0; DMA1; DMA load]

Morph Selection

[Table labels: Imagine; RAW; Smart Mem.; TRIPS; Ex. Shown; Stream processors; Thread processors; Global memory BW (words/p-flop)]

[Figure: tool chain and compiler structure. Labels: StreamIt to C Converter; Extended EDG Front End; Scalar Analysis and Optimization; Single Processor Compiler; Binary Executable]

Early results show efficient mappings over a wide range of architectural parameters.

[Figure: application kernels. Space-time Adaptive Processing (Matrix Multiply) 6 Mflop; Doppler Filtering (DFT) 8 Mflop; Target Detection (CFAR, 3-D Grouping) 300 Kflop; Target Parameter Estimation (MLE, Spline Interpolation) 80 Kflop]

R-Stream prototype 3.0, currently in development, will produce even more efficient mappings for a wider range of applications by leveraging:
• SRE-based internal representation to eliminate false dependences
• Affine partitioning framework to discover maximum degrees of parallelism in application
• Unified/constraint-based mapping to avoid phase ordering
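The half-buffer scheme from the memory allocation and DMA insertion step can be sketched in plain C. This is a sequential simulation under stated assumptions, not generated code: `dma_load` and `dma_store` are hypothetical stand-ins built on `memcpy`, `HALF` is an arbitrary half-buffer size, and the kernel (doubling each element) is a placeholder. On a real target the load of the next tile would run asynchronously while the current tile is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 256 /* hypothetical half-buffer size in words */

/* Stand-ins for DMA transfers between global and local memory. */
static void dma_load(float *local, const float *global, size_t words)
{
    memcpy(local, global, words * sizeof *local);
}

static void dma_store(float *global, const float *local, size_t words)
{
    memcpy(global, local, words * sizeof *global);
}

/* Length of tile t when n words are split into HALF-sized tiles. */
static size_t tile_len(size_t n, size_t t)
{
    size_t start = t * HALF;
    return (n - start < HALF) ? n - start : HALF;
}

/* Doubles in[0..n) into out[0..n), alternating between two half-buffers. */
void process_tiled(const float *in, float *out, size_t n)
{
    float local_in[2][HALF], local_out[2][HALF];
    size_t tiles = (n + HALF - 1) / HALF;

    assert(n == 0 || (in && out));
    if (tiles == 0)
        return;

    dma_load(local_in[0], in, tile_len(n, 0)); /* prologue: fetch first tile */
    for (size_t t = 0; t < tiles; t++) {
        size_t cur = t % 2, nxt = (t + 1) % 2;
        size_t len = tile_len(n, t);

        /* Issue the load for tile t+1 into the other half-buffer
           before computing on tile t. */
        if (t + 1 < tiles)
            dma_load(local_in[nxt], in + (t + 1) * HALF, tile_len(n, t + 1));

        for (size_t i = 0; i < len; i++)
            local_out[cur][i] = 2.0f * local_in[cur][i];

        dma_store(out + t * HALF, local_out[cur], len);
    }
}
```

Because tile t only touches half-buffer `t % 2`, the transfer for tile t+1 never overwrites data still in use, which is what lets computation and DMA overlap once the transfers are asynchronous.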