RAMP Gold Hardware and Software Architecture (Jan 2009), Zhangxi Tan et al.
RAMP Gold Hardware and Software Architecture
Zhangxi Tan, Yunsup Lee, Andrew Waterman, Rimas Avizienis, David Patterson, Krste Asanovic
UC Berkeley, Jan 2009

Par Lab Research Overview
- Goal: make it easy to write correct programs that run efficiently on manycore
[Diagram: the ParLab software stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) sit on design patterns/motifs; the productivity layer (Composition & Coordination Language (C&CL), C&CL compiler/interpreter, parallel libraries and frameworks, sketching, autotuners, schedulers, communication & synchronization primitives, efficiency languages and their compilers) and correctness tools (type systems, static verification, directed testing, dynamic checking, debugging with replay) sit above OS libraries & services, the legacy OS, and a hypervisor, targeting Multicore/GPGPU and the ParLab Manycore/RAMP; diagnosing power/performance spans the stack.]

RAMP Gold: A ParLab manycore emulator
- Leverage the RAMP FPGA emulation infrastructure to build prototypes of proposed architectural features
  - Fast enough to run real apps; "tapeout" every day
- Single-socket tiled manycore target
  - SPARC v8 today, but the design is ISA-neutral
  - Shared memory, distributed coherent cache
  - Multiple on-chip networks and memory channels
- Split functional/timing model, both in hardware
  - Host multithreading of both functional and timing models
[Diagram: the ParLab manycore target mapped onto RAMP Gold; architectural state feeds the functional model pipeline and timing state feeds the timing model pipeline.]

Host multithreading
- Single hardware pipeline with multiple copies of CPU state
- Virtualize the pipeline with fine-grained multithreading to emulate more target CPUs with high efficiency (i.e., MIPS/FPGA) and to hide emulation latencies (a minimal RTL sketch follows below)
- Note: this multithreads the host, not the target
[Diagram: target CPUs 1..64 are multiplexed onto one functional CPU model on the FPGA; per-thread PCs (PC1..PC64) and register files (GPR1..GPR64) feed a shared thread-select, I$, decode, ALU, and D$ pipeline.]

RAMP Gold v1 model
[Diagram: each functional pipeline stage is paired with a timing-model counterpart. Thread selection (static round robin); instruction fetch (tag/data read request to the host I$/IMMU; I$ timing); decode and register file access (partial decode of the 32-bit instruction, 32-bit multithreaded register file; decode timing); execution (execution unit 1: simple ALU; execution unit 2: complex ALU; special-register ops; execution timing); memory (host D$/DMMU; D$ and memory timing); write back/exception (commit timing). Timing state, special registers, and functional status sit beside the pipeline; the host memory interface carries 128-bit data; the timing model drives the functional pipeline controls.]

Balancing Functional and Timing Models
- A single functional model supports multiple timing models on the FPGA
[Diagram: several timing models share one functional model and its per-thread register files, backed by a host DRAM cache.]

RAMP Gold Implementation
- Single-FPGA implementation: low-cost Xilinx ML505 board
  - 64~128 cores, 2GB DDR2, FP, timing model, 100~130 MIPS
  - A 64-core functional model demo
- Multi-FPGA implementation: BEE3 (four Xilinx Virtex-5 LX155Ts)
  - 1K~2K cores, 64GB DDR2, FP, timing model
  - Higher emulation capacity and memory bandwidth

RAMP Gold Prototyping, Version 1
- Single-FPGA implementation on Xilinx ML505
  - 64~128 cores, integer only, 2GB of memory, running at 100 MHz
- Simple timing model: single in-order issue CPU with an ideal shared memory (2-cycle access latency)
- RAMP performance counter support
- Full verification environment
  - Software simulator (C-gold)
  - RTL/netlist verification
  - HW on-chip verification
- BSD license: everything is built from scratch!
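To make the host-multithreading idea concrete, here is a minimal SystemVerilog sketch of a round-robin thread-select stage with per-thread PC state. It is an illustration under assumptions, not the RAMP Gold RTL: the module name, NTHREADS, pc_file, and the whole port list are invented for this sketch.

    // Minimal sketch: one physical pipeline time-multiplexed over NTHREADS
    // copies of target-CPU state. All names and widths here are assumptions.
    module thread_select #(
      parameter NTHREADS = 64,
      parameter TIDW     = $clog2(NTHREADS)
    ) (
      input  logic            clk, rst,
      input  logic            retire_valid,   // a thread is retiring this cycle
      input  logic [TIDW-1:0] retire_tid,     // ...which thread
      input  logic [31:0]     next_pc,        // ...and its next PC
      output logic [TIDW-1:0] fetch_tid,      // thread entering fetch
      output logic [31:0]     fetch_pc
    );
      // Per-thread architectural PC: one copy per emulated target CPU.
      // On the FPGA this state lives in LUTRAM/BRAM rather than flip-flops.
      logic [31:0] pc_file [NTHREADS];

      // Static round-robin selection: a free-running thread counter, so an
      // independent thread enters the pipeline every cycle and long host
      // latencies hide behind the other threads (zero-overhead switch).
      always_ff @(posedge clk) begin
        if (rst) fetch_tid <= '0;
        else     fetch_tid <= fetch_tid + 1'b1; // wraps: NTHREADS is a power of 2
        if (retire_valid) pc_file[retire_tid] <= next_pc;
      end
      assign fetch_pc = pc_file[fetch_tid];
    endmodule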
CPU Functional Model (1)
- 64 HW threads, full 32-bit SPARC v8 CPU
  - The same binary runs on both Sun boxes and RAMP
- Optimized for emulation throughput (MIPS/FPGA)
  - 1-cycle host access latency for most instructions
  - Microcode operation for complex and new instructions, e.g. traps and active messages
- Designed for the FPGA fabric for optimal performance
  - "Deep" pipeline: 11 physical stages, no bypassing network
  - DSP-based ALU
  - ECC/parity-protected RAMs, cache lines, etc.
  - Double-clocked BRAM/LUTRAM
  - Fine-tuned FPGA resource mapping

CPU Functional Model (2): Status
- Coded in SystemVerilog
- Passed the verification suite donated by SPARC International
- Verified against our C functional simulator
- Mapped and tested on HW at 100 MHz; maximum frequency > 130 MHz
- FPGA resource consumption (XC5VLX50T), 1 CPU + SRAM controller + memory network:

    Resource   Used    % of device
    LUT        2,719    9%
    Register   4,852   16%
    BRAM          13   22%
    DSP            2    4%

Verification/Testing Flow
[Diagram: app source files (.S or .C) are compiled by the GNU SPARC v8 compiler/linker with a customized linker script (.lds) into ELF binaries. A frontend test server on a Solaris/Linux machine (which handles syscalls, with standard ELF/DRAM/BRAM loaders and a libbfd-based disassembler) drives three backends over frontend links: the C functional simulator (C implementation), Modelsim SE/Questasim 6.4a RTL/netlist simulation (RTL sources and netlists (.sv, .v), the Xilinx Unisim library, the SystemVerilog DPI interface, host dynamic simulation libraries (.so)), and the FPGA target (Xilinx ML505, BEE3). A checker compares HW state dumps and simulation logs against the reference result.]

GCC Toolchain
- SPARC cross-compiler with newlib; newlib is linked statically
  - Built with binutils-2.18, gcc-4.3.2/gmp-4.3.2/mpfr-2.3.2, newlib-1.16.0
  - sparc-elf-{gcc, g++, ld, nm, objdump, ranlib, strip, ...}
- newlib is a C library intended for use on embedded systems
  - The C functions in newlib narrow down to 19 system calls: _exit, close, environ, execve, fork, fstat, getpid, isatty, kill, link, lseek, open, read, sbrk, stat, times, unlink, wait, write
- Now we can compile our C and C++ source code (with standard C functions) to a SPARC executable:
  sparc-elf-gcc -o hello hello.c -lc -lsys -mcpu=v8

Frontend Machine
- Multiple backends; a narrow interface makes it easy to support a new backend
  - C-gold functional simulator (link: function calls)
  - Modelsim simulator (link: DPI; see the sketch below)
  - Actual H/W, Xilinx ML505 or BEE3 (link: gigabit Ethernet)
- Host/target interface
  - CPU reset
  - Memory interface: {read,write}_{signed,unsigned}_{8,16,32,64}
- Executes system calls received from the backend
  - Signaled by the backend proxy kernel
  - Maps a Solaris system call to a Linux system call and executes it
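Since the Modelsim backend talks to the rest of the flow through the SystemVerilog DPI, a commit-by-commit comparison against the C reference can be phrased roughly as below. This is a hedged sketch: c_gold_step() and every port name are assumptions for illustration, not the actual RAMP Gold verification interface.

    // Assumed DPI hook into the C-gold reference model: advance thread `tid`
    // by one instruction and return its expected architectural effects.
    import "DPI-C" function int c_gold_step(input  int tid,
                                            output int exp_pc,
                                            output int exp_val);

    // Compare each committing instruction in the RTL against the reference.
    module cgold_check(
      input logic        clk,
      input logic        commit_valid,
      input logic [5:0]  tid,       // committing host thread
      input logic [31:0] pc,        // its PC
      input logic [31:0] wb_value   // its write-back value
    );
      int exp_pc, exp_val;
      always_ff @(posedge clk) begin
        if (commit_valid) begin
          void'(c_gold_step(tid, exp_pc, exp_val));
          assert (pc == exp_pc && wb_value == exp_val)
            else $error("thread %0d diverged from C-gold at pc %h", tid, pc);
        end
      end
    endmodule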
Proxy Kernel
- How do we support I/O? The target doesn't have any peripherals (e.g., disks)
  - It would be a pain to program a system that can't read or write anything, and even more painful to make peripherals work with the actual H/W
- A minimal kernel acts as a proxy for the system calls invoked by newlib
  - The proxy kernel sends the arguments and the system call number to the frontend machine
  - The frontend machine performs the actual system call and returns the results to the proxy kernel
  - Finally the PC moves back to the application, and everybody is happy

C-gold Functional Simulator
- Baseline functional model used to verify our functional model written in SystemVerilog
  - Full 32-bit SPARC v8
  - Includes an IEEE 754-compatible FPU
  - New instruction introduced to support active messages: SENDAM
- Written from scratch, no junk in it
- Very fast, ~25 MIPS
- Easy to understand; easy to add/modify modules for experiments
- Flexible parameters (number of target threads, host threads, ...)

Code & Cross-Compile
[Diagram: the same flow as in the Verification/Testing Flow slide, annotated with a system-call round trip: the application's write(...) traps into the proxy kernel on the target, a syscall request travels over the frontend links to the test server on the Solaris/Linux machine, and the syscall result returns to the proxy kernel.]

On-the-fly DEMO

Plans for the Next Version
- Try out several research projects
  - Performance counters
  - Virtual local stores
- Enhance the functional and timing models
  - Add FPUs
  - Memory/cache timing models

Acknowledgement
- Special thanks to Prof. Kurt Keutzer for help from Synopsys!

Backup Slides

Pipeline Architecture
[Diagram: the physical pipeline. Thread selection (static round robin; special registers pc/npc, WIM, PSR, and thread control registers; a microcode ROM issuing micro-instructions). Instruction Fetch 1 (issue tag/data read address over the 128-bit memory interface) and Instruction Fetch 2 (I-cache of nine 18kb BRAMs; tag compare; synthesized 32-bit instruction; memory request on a cache miss). Decode (resolve branch, decode register file address, ALU control/exception detection). Register File Access 1-3 (32-bit multithreaded register file in four 36kb BRAMs, a pipelined 2-cycle read; LUT ROM, LUTRAM, DSP, and BRAM at 2x clock). Execution (simple ALU on 1 DSP; MUL/DIV/SHF on 4 DSPs; LD/ST decoding, unaligned-address detection and store preparation; special-register handling for RDPSR/RDWIM). Memory 1-2 (D-cache of nine 18kb BRAMs; issue load; tag and 128-bit data read and select; 128-bit read-and-modify data; generate microcode request). Write Back/Exception (load align and write back; trap/IRQ handling).]
- Single-issue, in-order pipeline (integer only)
- 11 pipeline stages (no forwarding) -> 7 logical stages; extra pipeline stages for routing
- Static thread scheduling, zero-overhead context switch
- Avoid complex operations by falling back to "microcode", e.g. traps and ST (see the ROM sketch below)
- 32-bit I/O bus (threaded) with interrupt support
- Physical implementation
  - Manually controlled BRAM mapping; LUTRAM mapping by a memory compiler
  - All BRAM/LUTRAM/DSP blocks double-clocked or in DDR mode
  - ECC/parity-protected BRAMs (deep-submicron effects on FPGAs)
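A hedged sketch of the microcode fallback named above: when decode flags a complex operation, fetch switches that thread to a small on-chip ROM and streams synthesized instructions until the sequence ends. The encoding, the "last micro-op" marker, and all names are assumptions, not the actual RAMP Gold microcode.

    // Sketch: substitute ROM-resident micro-instructions for complex ops.
    // (Per-thread upc/active state omitted; the real pipeline interleaves threads.)
    module ucode_fetch #(parameter UDEPTH = 64) (
      input  logic        clk,
      input  logic        ucode_req,    // decode requests a microcode sequence
      input  logic [5:0]  ucode_entry,  // ROM entry point for the flagged op
      input  logic [31:0] icache_inst,  // normal instruction from the I-cache
      output logic [31:0] inst          // instruction actually issued to decode
    );
      logic [31:0] rom [UDEPTH];        // maps to LUT ROM or BRAM on the FPGA
      logic [5:0]  upc;
      logic        active;
      initial $readmemh("ucode.hex", rom);  // assumed init image

      always_ff @(posedge clk) begin
        if (ucode_req) begin active <= 1'b1; upc <= ucode_entry; end
        else if (active) begin
          upc <= upc + 1'b1;
          if (rom[upc][31]) active <= 1'b0;  // assumed "last micro-op" marker bit
        end
      end
      assign inst = active ? rom[upc] : icache_inst;
    endmodule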
Implementation Challenges
- CPU state storage
  - Minimize FPGA resource consumption: where does state live? how large is it? does it fit on the FPGA?
  - E.g., mapping the ALU to DSPs
- Host cache & TLB
  - Is a cache needed at all? What architecture and capacity?
  - Bandwidth requirements and R/W access ports; host multithreading amplifies the requirements

State Storage
- Complete 32-bit SPARC v8 ISA with traps/exceptions
- All CPU state (integer only) is stored in SRAMs on the FPGA
- Per-context register file -- BRAM
  - 3 register windows stored in BRAM chunks of 64
  - 8 (global) + 3*16 (register windows) = 56 registers
- 6 special registers
  - pc/npc -- LUTRAM
  - PSR (Processor State Register) -- LUTRAM
  - WIM (Window Invalid Mask) -- LUTRAM
  - Y (high 32 bits of MUL/DIV results) -- LUTRAM
  - TBR (Trap Base Register) -- BRAM (packed with the regfile)
- Buffers for host multithreading (LUTRAM)
- Maximum 64 threads per pipeline on Xilinx Virtex-5, bounded by LUTRAM depth (6-input LUTs)

Mapping the SPARC ALU to DSPs
- Xilinx DSP48E advantages (a hedged sketch follows after the cache-memory slide below)
  - 48-bit add/sub/logic/mux plus a pattern detector
  - Easy to generate ALU flags: < 10 LUTs for the C and V (carry and overflow) flags
  - Pipelined access at over 500 MHz

DSP Advantage
- Instruction coverage (two double-clocked DSPs per pipeline)
  - 1-cycle ALU (1 DSP): LD/ST address calculation; bit-wise logic (and, or, ...); SETHI (value bypass); JMPL, RETT, CALL (address calculation); SAVE/RESTORE (add/sub); WRPSR, RDPSR, RDWIM (XOR op)
  - Long-latency ALU instructions (1 DSP): shift/MUL (2 cycles)
- Saves 5%~10% of the logic for the 32-bit datapath

Host Cache/TLB
- Purpose: accelerating emulation performance! (A separate model is still needed for the target cache.)
- Per-thread (partitioned) cache, also sketched below
  - Split I/D direct-mapped, write-allocate, write-back cache
  - Block size: 32 bytes (the BEE3 DDR2 controller heartbeat)
  - 64-thread configuration: 256B I$ and 256B D$ per thread; sizes double in the 32-thread configuration
  - Non-blocking cache, 64 outstanding requests (max)
  - Physical tags, indexed by virtual or physical address (cache size < page size)
  - 67% BRAM usage
- Per-thread TLB
  - Split I/D direct-mapped TLBs: 8-entry ITLB, 8-entry DTLB
  - Currently a dummy: static translation for the Solaris virtual address layout

Cache-Memory Architecture
[Diagram: the host D-cache datapath. Data: four RAMB36SDP blocks (512x72 each, ECC-protected) forming a 128-bit path; tags: one 512x36 RAMB18SDP (parity-protected). The memory-controller side has a memory command FIFO, refill index/refill data, and victim-data write back; the pipeline side prepares the LD/ST address, looks up by index, compares and writes back tags over 64-bit paths, does read-and-modify and load select/routing, and a cache FSM reports hit/exception/replay to the pipeline state control across Memory Stage 1, Memory Stage 2, load align/sign, and the exception/write-back stage.]
- Cache controller
  - Non-blocking pipelined access (3 stages) matches the CPU pipeline
  - Decoupled access/refill allows pipelined, out-of-order memory accesses
  - Tells the pipeline to "replay" an instruction on a miss
  - The 128-bit refill/write-back datapath fills one block at 2x clock rate
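Two hedged sketches for the backup material above. First, the DSP ALU mapping: the add/sub datapath is arranged so synthesis can pack it into one DSP48E, with the C and V flags computed in a handful of LUTs outside the slice. The attribute, port names, and flag equations are illustrative assumptions, not the actual RAMP Gold code.

    // Sketch: 32-bit SPARC-style add/sub intended to map onto one DSP48E;
    // flag generation stays in fabric LUTs. Names and attributes are assumed.
    (* use_dsp48 = "yes" *)  // synthesis hint; the exact attribute is tool-specific
    module alu_dsp(
      input  logic [31:0] a, b,
      input  logic        sub,   // 1 = subtract (invert B, carry-in 1)
      output logic [31:0] y,
      output logic        c, v   // carry/borrow and signed overflow
    );
      logic [32:0] sum;
      logic [31:0] bx;
      assign bx  = sub ? ~b : b;
      assign sum = {1'b0, a} + {1'b0, bx} + {32'd0, sub};
      assign y   = sum[31:0];
      assign c   = sub ? ~sum[32] : sum[32];              // borrow for subtracts
      assign v   = (a[31] == bx[31]) && (y[31] != a[31]); // signed overflow
    endmodule

Second, the per-thread host cache lookup: with 64 threads, a 256 B per-thread D$, and 32 B blocks, each thread owns 8 sets, so the set index is just the thread id concatenated with three address bits. That yields 512 sets, matching the 512-entry tag/data arrays in the cache-memory slide. Again, signal names and the tag width are assumptions.

    // Sketch: index/tag split for the partitioned per-thread direct-mapped D$.
    module dcache_index(
      input  logic [5:0]  tid,        // 64 host threads
      input  logic [31:0] addr,       // target memory address
      output logic [8:0]  set_index,  // 64 threads x 8 sets = 512 sets total
      output logic [23:0] tag         // physical tag (width assumed)
    );
      assign set_index = {tid, addr[7:5]};  // 32 B blocks: addr[4:0] is the offset
      assign tag       = addr[31:8];        // cache size < page size, so the
                                            // virtual index equals the physical one
    endmodule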
Example: A Distributed-Memory, Non-Cache-Coherent System
- Eight multithreaded SPARC v8 pipelines in two clusters
  - Each thread emulates one independent node in the target system: 512 nodes/FPGA
  - Predicted emulation performance: ~1 GIPS/FPGA (assuming 10% I$ misses, 30% D$ misses, 30% LD/ST), 2x a naïve manycore implementation
- Memory subsystem
  - Total memory capacity 16 GB: 32MB/node x 512 nodes
  - One DDR2 memory controller per cluster; per-FPGA bandwidth: 7.2 GB/s
  - The memory space is partitioned to emulate the distributed memory system (see the address-mapping sketch at the end)
  - 144-bit wide, credit-based memory network
- Inter-node communication (under development)
  - Two-level tree network with DMA to provide all-to-all communication

Project Status
- Done with the RTL implementation
  - ~12,000 lines of synthesizable SystemVerilog code
  - FPGA resource utilization per pipeline on a Xilinx V5 LX110T: ~4% logic (LUTs), ~10% BRAM
  - At most 10 pipelines fit, but we back off to 8 or fewer
- Built the RTL verification infrastructure
  - SPARC v8 certification test suite (donated by SPARC International) + SystemVerilog
  - Can also run larger programs, but very slowly (~0.3 KIPS)

Project Status (cont.)
- Passed all SPARC v8 integer diagnostics in pre-synthesis RTL simulation
- Runs single-threaded Solaris apps (5 syscalls supported so far)
- Working on HW verification after synthesis and P&R
  - Synthesized with an alpha version of Synplify
  - Will support Mentor Graphics Precision 2008a Update 2 in late November
- Planned release in Jan 09
  - 64/128 emulated CPUs on a $500 Xilinx ML505 board, plus the cost of 2GB of DDR2 DRAM
  - Source code will be available under a BSD license

Thank you
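Addendum: the address-mapping sketch referenced in the distributed-memory example. With 512 nodes of 32 MB each, a node's (thread's) private window of the 16 GB host memory space can be selected by simple concatenation. This is an assumption for illustration; the slides only state that memory is partitioned, not how.

    // Hypothetical sketch of partitioning host memory among emulated nodes:
    // node_id picks one of 512 private 32 MB windows (2^9 x 2^25 B = 16 GB).
    module node_addr_map(
      input  logic [8:0]  node_id,     // 0..511 (one per emulated node/thread)
      input  logic [24:0] local_addr,  // 32 MB per node => 25 offset bits
      output logic [33:0] host_addr    // 16 GB host space => 34 bits
    );
      assign host_addr = {node_id, local_addr};  // window base + local offset
    endmodule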