RAMP Gold
Hardware and Software Architecture
Zhangxi Tan, Yunsup Lee, Andrew Waterman, Rimas Avizienis,
David Patterson, Krste Asanovic
UC Berkeley
Jan 2009
Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Par Lab software stack diagram: applications (Personal Health, Image
Retrieval, Hearing/Music, Speech, Parallel Browser) built on Design
Patterns/Motifs; a productivity layer with the Composition & Coordination
Language (C&CL), the C&CL Compiler/Interpreter, Parallel Libraries, and
Parallel Frameworks; an efficiency layer with Efficiency Languages,
Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch.
Primitives, and Efficiency Language Compilers; OS Libraries & Services,
Legacy OS, and Hypervisor above Multicore/GPGPU and ParLab Manycore/RAMP
hardware; spanned vertically by Correctness tools (Static Verification,
Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay)
and by Diagnosing Power/Performance.]
2
RAMP Gold : A ParLab manycore emulator
Leverage the RAMP FPGA emulation infrastructure to build prototypes of
proposed architectural features
  Fast enough to run real apps
  "Tapeout" every day
Single-socket tiled manycore target
  SPARC v8 -> ISA neutral
  Shared memory, distributed coherent cache
  Multiple on-chip networks and memory channels
Split functional/timing model, both in hardware
Host multithreading of both functional and timing models
[Diagram: ParLab manycore target emulated by RAMP Gold; timing state and
architectural state feed a timing model pipeline and a functional model
pipeline.]
3
Host multithreading
Single hardware pipeline with multiple copies of CPU state
Virtualize the pipeline with fine-grained multithreading to emulate more
target CPUs with high efficiency (i.e., MIPS/FPGA)
  Hide emulation latencies
Multithreads the host, not the target (see the software sketch below)
[Diagram: a target model of 64 CPUs (CPU 1 ... CPU 64) mapped onto one
functional CPU model on the FPGA: per-thread PCs and GPRs, a round-robin
thread select, I$, instruction register, decode, X/Y registers, ALU, D$.]
4
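
To make the idea concrete, here is a minimal C sketch of fine-grained host
multithreading (illustrative only, not the actual RTL or simulator code):
one emulation loop interleaves 64 copies of per-CPU state in round-robin
order, so a long host latency for one thread is hidden behind the others.

#include <stdint.h>
#include <stdio.h>

#define NTHREADS 64   /* target CPUs emulated by one host pipeline */

/* Per-target-CPU architectural state (integer-only sketch). */
typedef struct {
    uint32_t pc, npc;
    uint32_t gpr[32];
} cpu_state_t;

static cpu_state_t cpus[NTHREADS];

/* Stand-in for the functional model: a real emulator would fetch and
 * decode a SPARC v8 instruction here; we just advance the delayed PC. */
static void step(cpu_state_t *s)
{
    s->pc = s->npc;
    s->npc += 4;
}

/* One host cycle retires one instruction from one thread: static
 * round-robin thread select.  Because consecutive pipeline slots belong
 * to different threads, the real pipeline needs no bypass network, and
 * long host latencies (e.g. a D$ miss) hide behind the other threads. */
int main(void)
{
    for (int t = 0; t < NTHREADS; t++) {
        cpus[t].pc  = 0;
        cpus[t].npc = 4;
    }
    for (long c = 0; c < 1000; c++)
        step(&cpus[c % NTHREADS]);
    printf("thread 0 pc after 1000 host cycles: %u\n", cpus[0].pc);
    return 0;
}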
RAMP Gold v1 model
Timing
State
Special
Registers
Thread
Selection
Static Thread
Selection
(Round Robin)
PC
Tag/Data read request
Instruction
Fetch
Instruction
Fetch
I$ timing
Host memory
interface
Inst / Partial
Decode result
Decode
Decode
Register File
Access
Regfile
Decoding
32-bit Instruction
Host I$/IMMU
Decode Timing
Memory
Timing Model
32-bit
Multithreade
d Register
File
text
imm
Execution
text
Decode ALU
control/Exception
Detection
OP1
OP2
Execution
Timing
pc
Execution Unit 1
(Simple ALU)
Execution Unit 2
(Complex ALU)
Special Register
OPs
Mem Address
D$ Timing
Memory
Memory
Tag/Data read request
Functional
Status
Host D$/
DMMU
Write Back
/ Exception
Commit Timing
Write Back/
Exception
Host memory
interface
128-bit data
Functional
Pipeline Controls
Timing Model
Functional Model
5
Balancing Func. and Timing Models
[Diagram: several timing models sharing a single functional model (with
per-thread GPRs) over a host DRAM cache.]
Single functional model supports multiple timing models on FPGAs
6
RAMP Gold Implementation
Single FPGA implementation
  Low-cost Xilinx ML505 board
  64~128 cores, 2GB DDR2, FP, timing model, 100~130 MIPS
  A 64-core functional model demo
Multi-FPGA implementation
  BEE3 : 4 Xilinx Virtex-5 LX155T
  1K~2K cores, 64GB DDR2, FP, timing model
  Higher emulation capacity and memory bandwidth
7
RAMP Gold Prototyping
Version 1
Single FPGA implementation on Xilinx ML505
64~128 cores, integer only, 2G byte memory, running
at 100 MHz
Simple timing model
RAMP performance counter support
Full verification environment
Single in-order issue CPU with an ideal shared memory (2cycle access latency)
Software simulator (C-gold)
RTL/netlist verification
HW on-chip verification
BSD license : Everything is built from scratch!
8
CPU Functional Model (1)
64 HW threads, full 32-bit SPARC v8 CPU
  The same binary runs on both Sun boxes and RAMP
Optimized for emulation throughput (MIPS/FPGA)
  1-cycle host access latency for most instructions
  Microcode operation for complex and new instructions
    E.g. traps, active messages
Designed for the FPGA fabric for optimal performance
  "Deep" pipeline : 11 physical stages, no bypassing network
  DSP-based ALU
  ECC/parity-protected RAM/cache lines, etc.
  Double-clocked BRAM/LUTRAM
  Fine-tuned FPGA resource mapping
9
CPU Functional Model (2)
Status
  Coded in SystemVerilog
  Passed the verification suite donated by SPARC International
  Verified against our C functional simulator
  Mapped and tested on HW @ 100 MHz
    Maximum frequency > 130 MHz
FPGA resource consumption (XC5VLX50T)
  1 CPU + SRAM controller + memory network

                LUT     Register   BRAM   DSP
  Used          2,719   4,852      13     2
  Utilization   9%      16%        22%    4%
10
Verification/Testing Flow
App Source
Files
(.S or .C)
GNU SPARC v8
Compiler/
Linker
Frontend Test Server
ELF
Binaries
ELF to
DRAM
Loader
Customized
Linker Script
(.lds)
Standard
ELF Loader
ELF to
BRAM
Loader
Solaris/Linux Machine
(Handle Syscall)
Disassembler C
implementation
Frontend Links
Frontend Links
Frontend Links
FPGA Target
Xilinx ML505, BEE3
C Functional
Simulator
HW state
dumps
Reference Result
RTL src files /
netlist
(.sv, .v)
Host dynamic
simulation
libraries (.so)
Xilinx Unisim Library
Systemverilog DPI interface
libbfd
Modelsim SE/Questasim 6.4a
Simulation
logs
Checker
11
GCC Toolchain
SPARC cross-compiler with newlib
  Link newlib statically
  Built with (binutils-2.18, gcc-4.3.2/gmp-4.3.2/mpfr-2.3.2, newlib-1.16.0)
  sparc-elf-{gcc, g++, ld, nm, objdump, ranlib, strip, …}
newlib is a C library intended for use on embedded systems
  C functions in newlib are narrowed down to 19 system calls
  _exit, close, environ, execve, fork, fstat, getpid, isatty, kill,
  link, lseek, open, read, sbrk, stat, times, unlink, wait, write
Now we can compile our C/C++ source code (with standard C functions) into
a SPARC executable (example below)
  sparc-elf-gcc -o hello hello.c -lc -lsys -mcpu=v8
12
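
As a concrete usage example (ordinary C, nothing RAMP-specific assumed
beyond the toolchain above), printf in the program below bottoms out in
newlib's write()/sbrk() stubs, which the proxy kernel described later
forwards to the frontend:

/* hello.c -- plain C; printf reduces to newlib's write()/sbrk() stubs */
#include <stdio.h>

int main(void)
{
    printf("hello from SPARC v8\n");
    return 0;   /* returning runs newlib's _exit stub */
}

Build it with the sparc-elf-gcc command above; sparc-elf-objdump -d hello
disassembles the resulting ELF binary.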
Frontend Machine
Multiple backends
  Narrow interface to support a new backend (sketched below)
  C-gold functional simulator (link: function calls)
  Modelsim simulator (link: DPI)
  Actual H/W (Xilinx ML505, BEE3) (link: gigabit Ethernet)
Host/Target interface
  CPU reset
  Memory interface
    {read,write}_{signed,unsigned}_{8,16,32,64}
  Execute system calls received from the backend
    Signaled by the backend proxy kernel
    Map a Solaris system call to a Linux system call and execute
13
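
A hedged C sketch of what this narrow backend interface might look like
(all names and the toy in-memory backend are invented for illustration;
the real links are function calls, SystemVerilog DPI, and gigabit
Ethernet):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical rendering of the host/target interface: every backend
 * supplies the same few operations, so adding a backend means filling
 * in one small table. */
typedef struct {
    void     (*cpu_reset)(void);
    uint8_t  (*read_unsigned_8)(uint64_t addr);
    uint32_t (*read_unsigned_32)(uint64_t addr);
    void     (*write_unsigned_32)(uint64_t addr, uint32_t data);
    /* ... remaining {read,write}_{signed,unsigned}_{8,16,32,64} ... */
} backend_ops_t;

/* Toy in-memory backend standing in for C-gold. */
static uint8_t mem[1 << 20];
static void     toy_reset(void)            { memset(mem, 0, sizeof mem); }
static uint8_t  toy_r8(uint64_t a)         { return mem[a]; }
static uint32_t toy_r32(uint64_t a)        { uint32_t v; memcpy(&v, &mem[a], 4); return v; }
static void     toy_w32(uint64_t a, uint32_t v) { memcpy(&mem[a], &v, 4); }

int main(void)
{
    backend_ops_t be = { toy_reset, toy_r8, toy_r32, toy_w32 };
    be.cpu_reset();
    be.write_unsigned_32(0x100, 0xdeadbeef);          /* frontend smoke test */
    printf("0x%08x\n", be.read_unsigned_32(0x100));
    return 0;
}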
Proxy Kernel
How do we support I/O?
  The target doesn't have any peripherals (e.g. disks)
  It would be a pain to program a system which can't read or write
  anything…
  It would be even more painful to make peripherals work with the actual
  H/W
A minimal kernel acts as a proxy for system calls invoked by newlib
(sketched below)
  The proxy kernel sends the arguments and the system call number to the
  frontend machine
  The frontend machine does the actual system call and returns the results
  back to the proxy kernel
  Finally the PC is moved back to the application and everybody is happy
14
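
A hedged sketch of the protocol from both sides (the mailbox layout, the
syscall number, and all function names are invented for illustration; the
real transport is the frontend link, and the real frontend runs
concurrently rather than being called inline):

#include <stdint.h>
#include <unistd.h>

/* Invented mailbox shared between target and frontend in this sketch. */
static volatile struct {
    uint32_t pending;   /* set by target, cleared by frontend */
    uint32_t num;       /* Solaris system call number */
    uint64_t arg[3];    /* 32-bit on the real target; widened so this
                           sketch also runs on a 64-bit host */
    int32_t  result;
} mbox;

#define TARGET_SYS_WRITE 4   /* placeholder, not the real Solaris number */

/* Frontend side: map the Solaris call to a Linux one and execute it. */
static void serve_once(void)
{
    if (!mbox.pending) return;
    if (mbox.num == TARGET_SYS_WRITE)
        mbox.result = (int32_t)write((int)mbox.arg[0],
                                     (const void *)(uintptr_t)mbox.arg[1],
                                     (size_t)mbox.arg[2]);
    mbox.pending = 0;   /* PC returns to the application; everybody happy */
}

/* Target side: newlib's write() stub lands here in the proxy kernel. */
static int32_t proxy_syscall(uint32_t num, uint64_t a0, uint64_t a1, uint64_t a2)
{
    mbox.num = num;
    mbox.arg[0] = a0; mbox.arg[1] = a1; mbox.arg[2] = a2;
    mbox.pending = 1;
    serve_once();       /* the real proxy kernel spins on `pending` here */
    return mbox.result;
}

int main(void)
{
    const char msg[] = "forwarded by proxy kernel\n";
    proxy_syscall(TARGET_SYS_WRITE, 1, (uint64_t)(uintptr_t)msg,
                  sizeof msg - 1);
    return 0;
}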
C-gold Functional Simulator
Baseline functional model used to verify our SystemVerilog functional
model
  Full 32-bit SPARC v8
  Includes an IEEE 754-compatible FPU
  New instruction introduced to support active messages
    SENDAM
Written from scratch, no junk in it
  Very fast, ~25 MIPS
  Easy to understand
  Easy to add/modify modules for experiments
  Flexible parameters (number of target threads, host threads, …)
15
Code & CrossCompile
App Source
Files
(.S or .C)
GNU SPARC v8
Compiler/
Linker
syscall
res
Frontend Test Server
ELF
Binaries
ELF to
DRAM
Loader
Customized
Linker Script
(.lds)
Standard
ELF Loader
ELF to
BRAM
Loader
Solaris/Linux Machine
(Handle Syscall)
Disassembler C
implementation
Frontend Links
Frontend Links
Proxy Kernel
write(…)
Frontend Links
syscall req
FPGA Target
Xilinx ML505, BEE3
C Functional
Simulator
HW state
dumps
Reference Result
Checker
RTL src files /
netlist
(.sv, .v)
Host dynamic
simulation
libraries (.so)
Xilinx Unisim Library
Systemverilog DPI interface
libbfd
Modelsim SE/Questasim 6.4a
Simulation
logs
16
On-the-fly DEMO
17
Plans for the Next Version
Try out several research projects
Performance Counter
Virtual Local Stores
Enhance the functional model and timing model
Add FPUs
Memory/cache timing models
18
Acknowledgement
Special thanks to Prof. Kurt Keutzer for help from Synopsys!
19
Backup Slides
20
Pipeline Architecture
Single-issue in-order pipeline (integer only)
  11 pipeline stages (no forwarding) -> 7 logical stages
  Extra pipeline stages for routing
Static thread scheduling, zero-overhead context switch
Avoid complex operations with "microcode"
  E.g. traps, ST
32-bit I/O bus (threaded) with interrupt support
ECC/parity-protected BRAMs
  Deep submicron effect on FPGAs
All BRAM/LUTRAM/DSP blocks in double-clocked or DDR mode
Physical implementation: manually-controlled BRAM mapping; LUTRAM mapping
by memory compiler; LUT ROM, DSP (clk x2), BRAM (clk x2), LUT RAM (clk x2)
[Pipeline diagram: Thread Selection (static round-robin selection; special
registers: pc/npc, wim, psr, thread control registers; microcode ROM) ->
Instruction Fetch 1 (issue address request: tag/data read request to the
I-cache, nine 18kb BRAMs, over a 128-bit memory interface) -> Instruction
Fetch 2 (tag compare; 32-bit synthesized instruction or micro inst.;
memory request under cache miss) -> Decode (resolve branch, decode
register file address; generate microcode request) -> Register File Access
1 & 2 (32-bit multithreaded register file, four 36kb BRAMs, 2-cycle
pipelined read) -> Register File Access 3 (decode ALU control / exception
detection; OP1, OP2, pc, imm) -> Execution (MUL/DIV/SHF on 4 DSPs; simple
ALU on 1 DSP; LD/ST decoding; unaligned address detection / store
preparation; special register handling, RDPSR/RDWIM) -> Memory 1 (issue
load; tag/data read request to the D-cache, nine 18kb BRAMs) -> Memory 2
(tag / 128-bit data; read & select; 128-bit read & modify data) -> Write
Back / Exception (load align / write back; trap/IRQ handling).]
21
Implementation Challenges
CPU state storage
  Where?
  How large? Does it fit on the FPGA?
Minimize FPGA resource consumption
  E.g. mapping the ALU to DSPs
Host cache & TLB
  Need a cache?
  Architecture and capacity
  Bandwidth requirement and R/W access ports
    Host multithreading amplifies the requirement
22
State storage
Complete 32-bit SPARC v8 ISA w. traps/exceptions
All CPU states (integer only) are stored in SRAMs on FPGA
Per context register file -- BRAM
3 register windows stored in BRAM chunks of 64
8 (global) + 3*16 (reg window) = 54
6 special registers
pc/npc -- LUTRAM
PSR (Processor state register) -- LUTRAM
WIM (Register Window Mask) -- LUTRAM
Y (High 32-bit result for MUL/DIV) -- LUTRAM
TBR (Trap based registers) -- BRAM (packed with regfile)
Buffers for host multithreading (LUTRAM)
Maximum 64 threads per pipeline on Xilinx Virtex5
Bounded by LUTRAM depth (6-input LUTs)
23
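
To illustrate the arithmetic, a hedged C sketch of per-thread register-file
addressing (not the actual RTL mapping; the window-overlap direction is
simplified): 8 globals + 3 windows x 16 = 56 live registers fit in one
64-entry BRAM chunk per thread.

#include <stdint.h>
#include <stdio.h>

#define NWIN 3   /* register windows per thread (from the slide above) */

/* Layout per window w inside a thread's 64-entry chunk: entries
 * 8+16w..15+16w hold the in/out overlap slot, 16+16w..23+16w the locals;
 * max used entry is 55 < 64. */
static uint32_t regfile_addr(uint32_t thread, uint32_t cwp, uint32_t reg)
{
    uint32_t base = thread * 64;                 /* one chunk per thread */
    if (reg < 8)  return base + reg;             /* %g0-%g7 */
    if (reg < 16) return base + 8 + cwp * 16 + (reg - 8);        /* outs */
    if (reg < 24) return base + 8 + cwp * 16 + 8 + (reg - 16);   /* locals */
    /* ins overlap with the adjacent window's outs */
    return base + 8 + ((cwp + 1) % NWIN) * 16 + (reg - 24);
}

int main(void)
{
    /* thread 5: window 1's %o0 (r8) and window 0's %i0 (r24) alias */
    printf("t5 w1 %%o0 -> %u\n", regfile_addr(5, 1, 8));
    printf("t5 w0 %%i0 -> %u\n", regfile_addr(5, 0, 24));  /* same entry */
    return 0;
}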
Mapping SPARC ALU to DSP
Xilinx DSP48E advantages
  48-bit add/sub/logic/mux + pattern detector
  Easy to generate ALU flags: < 10 LUTs for carry (C) and overflow (V)
  Pipelined access at over 500 MHz
24
DSP advantage
Instruction coverage (two double-clocked DSPs / pipeline)
  1-cycle ALU (1 DSP)
    LD/ST (address calculation)
    Bit-wise logic (and, or, …)
    SETHI (value bypass)
    JMPL, RETT, CALL (address calculation)
    SAVE/RESTORE (add/sub)
    WRPSR, RDPSR, RDWIM (XOR op)
  Long-latency ALU instructions (1 DSP)
    Shift/MUL (2 cycles)
5%~10% logic savings for the 32-bit data path
25
Host Cache/TLB
Accelerates emulation performance!
  (A separate model is still needed for the target cache)
Per-thread cache (partitioned)
  Split I/D direct-mapped write-allocate write-back cache
  Block size: 32 bytes (BEE3 DDR2 controller heartbeat)
  64-thread configuration: 256B I$, 256B D$ (address split sketched below)
    Size doubled in the 32-thread configuration
  Non-blocking cache, 64 outstanding requests (max)
  Physical tags, indexed by virtual or physical address
    $ size < page size
  67% BRAM usage
Per-thread TLB
  Split I/D direct-mapped TLB: 8-entry ITLB, 8-entry DTLB
  Currently a dummy
    Static translation for the Solaris virtual address layout
26
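
For concreteness, a hedged model of the per-thread direct-mapped lookup
implied by these numbers (256 B per-thread D$ with 32 B blocks gives 8
sets per thread; names and the allocate-on-miss details are illustrative):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS    64
#define BLOCK_BYTES 32u   /* BEE3 DDR2 controller heartbeat */
#define SETS        8u    /* 256 B per-thread D$ / 32 B blocks */

typedef struct { uint32_t tag; bool valid; } line_t;

/* One direct-mapped D$ partition per host thread (illustrative model;
 * the real cache also tracks dirty bits and 64 outstanding misses). */
static line_t dcache[NTHREADS][SETS];

/* Direct-mapped lookup with allocate-on-miss.  Physical tags; keeping
 * the per-thread cache smaller than a page lets it be indexed by virtual
 * or physical address without aliasing. */
static bool dcache_access(uint32_t thread, uint32_t paddr)
{
    uint32_t set = (paddr / BLOCK_BYTES) % SETS;
    uint32_t tag = paddr / (BLOCK_BYTES * SETS);
    line_t *l = &dcache[thread][set];
    if (l->valid && l->tag == tag)
        return true;        /* hit: pipeline proceeds */
    l->valid = true;        /* miss: refill and replay the instruction */
    l->tag = tag;
    return false;
}

int main(void)
{
    printf("%d", dcache_access(0, 0x1000));   /* 0: cold miss */
    printf("%d", dcache_access(0, 0x1000));   /* 1: hit */
    printf("%d\n", dcache_access(0, 0x1100)); /* 0: 256 B apart, same set */
    return 0;
}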
Cache-Memory Architecture
Memory
Controller
Memory request
address
Refill
Index
128-bit
data
Victim data
write back
Refill
Data (ECC)
512x72x4
Tag (Parity)
512 x 36
RAMB18SDP
128-bit data
RAMB36SDP (x72)
RAMB36SDP (x72)
RAMB36SDP (x72)
Mem ops
RAMB36SDP (x72)
Memory
Command FIFO
Lookup
Index
64-bit data
+ Tag
Tag
Write
Back
64-bit data
Read & Modify
Prepare LD/ST
address
Load Select / Routing
Cache FSM
(Hit, exception, etc)
Cache
replay?
Integer Pipeline
Pipeline State
Control
Memory Stage (1)
Memory Stage (2)
Load Align/Sign
Pipeline Register
Exception/Write
Back Stage
Pipeline Register
Cache controller
Non-blocking pipelined access (3-stages) matches CPU pipeline
Decoupled access/refill: allow pipelined, OOO mem accesses
Tell the pipeline to “replay” inst. on miss
128-bit refill/write back data path
fill one block at 2x clk rate
27
Example: a distributed-memory non-cache-coherent system
Eight multithreaded SPARC v8 pipelines in two clusters
  Each thread emulates one independent node in the target system
  512 nodes/FPGA
  Predicted emulation performance: ~1 GIPS/FPGA (10% I$ miss, 30% D$ miss,
  30% LD/ST)
    2x compared to a naïve manycore implementation
Memory subsystem
  Total memory capacity 16 GB, 32MB/node (512 nodes)
  One DDR2 memory controller per cluster
    Per-FPGA bandwidth: 7.2 GB/s
  Memory space is partitioned to emulate the distributed memory system
  144-bit wide credit-based memory network
Inter-node communication (under development)
  Two-level tree network with DMA to provide all-to-all communication
28
Project Status
Done with RTL implementation
  ~12,000 lines of synthesizable SystemVerilog code
  FPGA resource utilization per pipeline on Xilinx V5 LX110T
    ~4% logic (LUT), ~10% BRAM
    Max 10 pipelines, but backed off to 8 or fewer
Built RTL verification infrastructure
  SPARC v8 certification test suite (donated by SPARC International) +
  SystemVerilog
  Can be used to run more programs, but very slowly (~0.3 KIPS)
29
Project Status
Passed all SPARC v8 integer diagnostics in pre-synthesis RTL simulation
  Runs single-threaded Solaris apps (5 syscalls supported so far)
Working on HW verification after synthesis and P&R
  Synthesized with an alpha version of Synplify
  Will support Mentor Graphics Precision 2008a Update 2 in late Nov
Planned release in Jan 09
  64/128 emulated CPUs on a Xilinx ML505 board @ $500 + 2GB DDR2 DRAM cost
  Source code will be available under the BSD license
30
Thank you
31