RAMP Gold Hardware and Software Architecture (Jan 2009), Zhangxi Tan et al.
RAMP Gold Hardware and Software Architecture
Zhangxi Tan, Yunsup Lee, Andrew Waterman, Rimas Avizienis, David Patterson, Krste Asanovic
UC Berkeley, Jan 2009

Par Lab Research Overview
- Goal: make it easy to write correct programs that run efficiently on manycore
[Diagram: the ParLab software stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) sit on design patterns/motifs; the productivity layer (Composition & Coordination Language (C&CL), C&CL compiler/interpreter, parallel libraries and frameworks, sketching, autotuners, schedulers, communication & synchronization primitives, efficiency languages and their compilers) and correctness tools (type systems, static verification, directed testing, dynamic checking, debugging with replay) sit above OS libraries & services, the legacy OS, and a hypervisor, targeting Multicore/GPGPU and the ParLab Manycore/RAMP; diagnosing power/performance spans the stack.]

RAMP Gold: A ParLab manycore emulator
- Leverage the RAMP FPGA emulation infrastructure to build prototypes of proposed architectural features
  - Fast enough to run real apps; "tapeout" every day
- Single-socket tiled manycore target
  - SPARC v8 today, but the design is ISA-neutral
  - Shared memory, distributed coherent cache
  - Multiple on-chip networks and memory channels
- Split functional/timing model, both in hardware
  - Host multithreading of both functional and timing models
[Diagram: the ParLab manycore target mapped onto RAMP Gold; architectural state feeds the functional model pipeline and timing state feeds the timing model pipeline.]

Host multithreading
- Single hardware pipeline with multiple copies of CPU state
- Virtualize the pipeline with fine-grained multithreading to emulate more target CPUs with high efficiency (i.e., MIPS/FPGA) and to hide emulation latencies (a minimal RTL sketch follows below)
- Note: this multithreads the host, not the target
[Diagram: target CPUs 1..64 are multiplexed onto one functional CPU model on the FPGA; per-thread PCs (PC1..PC64) and register files (GPR1..GPR64) feed a shared thread-select, I$, decode, ALU, and D$ pipeline.]

RAMP Gold v1 model
[Diagram: each functional pipeline stage is paired with a timing-model counterpart. Thread selection (static round robin); instruction fetch (tag/data read request to the host I$/IMMU; I$ timing); decode and register file access (partial decode of the 32-bit instruction, 32-bit multithreaded register file; decode timing); execution (execution unit 1: simple ALU; execution unit 2: complex ALU; special-register ops; execution timing); memory (host D$/DMMU; D$ and memory timing); write back/exception (commit timing). Timing state, special registers, and functional status sit beside the pipeline; the host memory interface carries 128-bit data; the timing model drives the functional pipeline controls.]

Balancing Functional and Timing Models
- A single functional model supports multiple timing models on the FPGA
[Diagram: several timing models share one functional model and its per-thread register files, backed by a host DRAM cache.]

RAMP Gold Implementation
- Single-FPGA implementation: low-cost Xilinx ML505 board
  - 64~128 cores, 2GB DDR2, FP, timing model, 100~130 MIPS
  - A 64-core functional model demo
- Multi-FPGA implementation: BEE3 (four Xilinx Virtex-5 LX155Ts)
  - 1K~2K cores, 64GB DDR2, FP, timing model
  - Higher emulation capacity and memory bandwidth

RAMP Gold Prototyping, Version 1
- Single-FPGA implementation on Xilinx ML505
  - 64~128 cores, integer only, 2GB of memory, running at 100 MHz
- Simple timing model: single in-order issue CPU with an ideal shared memory (2-cycle access latency)
- RAMP performance counter support
- Full verification environment
  - Software simulator (C-gold)
  - RTL/netlist verification
  - HW on-chip verification
- BSD license: everything is built from scratch!
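To make the host-multithreading idea concrete, here is a minimal SystemVerilog sketch of a round-robin thread-select stage with per-thread PC state. It is an illustration under assumptions, not the RAMP Gold RTL: the module name, NTHREADS, pc_file, and the whole port list are invented for this sketch.

    // Minimal sketch: one physical pipeline time-multiplexed over NTHREADS
    // copies of target-CPU state. All names and widths here are assumptions.
    module thread_select #(
      parameter NTHREADS = 64,
      parameter TIDW     = $clog2(NTHREADS)
    ) (
      input  logic            clk, rst,
      input  logic            retire_valid,   // a thread is retiring this cycle
      input  logic [TIDW-1:0] retire_tid,     // ...which thread
      input  logic [31:0]     next_pc,        // ...and its next PC
      output logic [TIDW-1:0] fetch_tid,      // thread entering fetch
      output logic [31:0]     fetch_pc
    );
      // Per-thread architectural PC: one copy per emulated target CPU.
      // On the FPGA this state lives in LUTRAM/BRAM rather than flip-flops.
      logic [31:0] pc_file [NTHREADS];

      // Static round-robin selection: a free-running thread counter, so an
      // independent thread enters the pipeline every cycle and long host
      // latencies hide behind the other threads (zero-overhead switch).
      always_ff @(posedge clk) begin
        if (rst) fetch_tid <= '0;
        else     fetch_tid <= fetch_tid + 1'b1; // wraps: NTHREADS is a power of 2
        if (retire_valid) pc_file[retire_tid] <= next_pc;
      end
      assign fetch_pc = pc_file[fetch_tid];
    endmodule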
CPU Functional Model (1)
- 64 HW threads, full 32-bit SPARC v8 CPU
  - The same binary runs on both Sun boxes and RAMP
- Optimized for emulation throughput (MIPS/FPGA)
  - 1-cycle host access latency for most instructions
  - Microcode operation for complex and new instructions, e.g. traps and active messages
- Designed for the FPGA fabric for optimal performance
  - "Deep" pipeline: 11 physical stages, no bypassing network
  - DSP-based ALU
  - ECC/parity-protected RAMs, cache lines, etc.
  - Double-clocked BRAM/LUTRAM
  - Fine-tuned FPGA resource mapping

CPU Functional Model (2): Status
- Coded in SystemVerilog
- Passed the verification suite donated by SPARC International
- Verified against our C functional simulator
- Mapped and tested on HW at 100 MHz; maximum frequency > 130 MHz
- FPGA resource consumption (XC5VLX50T), 1 CPU + SRAM controller + memory network:

    Resource   Used    % of device
    LUT        2,719    9%
    Register   4,852   16%
    BRAM          13   22%
    DSP            2    4%

Verification/Testing Flow
[Diagram: app source files (.S or .C) are compiled by the GNU SPARC v8 compiler/linker with a customized linker script (.lds) into ELF binaries. A frontend test server on a Solaris/Linux machine (which handles syscalls, with standard ELF/DRAM/BRAM loaders and a libbfd-based disassembler) drives three backends over frontend links: the C functional simulator (C implementation), Modelsim SE/Questasim 6.4a RTL/netlist simulation (RTL sources and netlists (.sv, .v), the Xilinx Unisim library, the SystemVerilog DPI interface, host dynamic simulation libraries (.so)), and the FPGA target (Xilinx ML505, BEE3). A checker compares HW state dumps and simulation logs against the reference result.]

GCC Toolchain
- SPARC cross-compiler with newlib; newlib is linked statically
  - Built with binutils-2.18, gcc-4.3.2/gmp-4.3.2/mpfr-2.3.2, newlib-1.16.0
  - sparc-elf-{gcc, g++, ld, nm, objdump, ranlib, strip, ...}
- newlib is a C library intended for use on embedded systems
  - The C functions in newlib narrow down to 19 system calls: _exit, close, environ, execve, fork, fstat, getpid, isatty, kill, link, lseek, open, read, sbrk, stat, times, unlink, wait, write
- Now we can compile our C and C++ source code (with standard C functions) to a SPARC executable:
  sparc-elf-gcc -o hello hello.c -lc -lsys -mcpu=v8

Frontend Machine
- Multiple backends; a narrow interface makes it easy to support a new backend
  - C-gold functional simulator (link: function calls)
  - Modelsim simulator (link: DPI; see the sketch below)
  - Actual H/W, Xilinx ML505 or BEE3 (link: gigabit Ethernet)
- Host/target interface
  - CPU reset
  - Memory interface: {read,write}_{signed,unsigned}_{8,16,32,64}
- Executes system calls received from the backend
  - Signaled by the backend proxy kernel
  - Maps a Solaris system call to a Linux system call and executes it
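Since the Modelsim backend talks to the rest of the flow through the SystemVerilog DPI, a commit-by-commit comparison against the C reference can be phrased roughly as below. This is a hedged sketch: c_gold_step() and every port name are assumptions for illustration, not the actual RAMP Gold verification interface.

    // Assumed DPI hook into the C-gold reference model: advance thread `tid`
    // by one instruction and return its expected architectural effects.
    import "DPI-C" function int c_gold_step(input  int tid,
                                            output int exp_pc,
                                            output int exp_val);

    // Compare each committing instruction in the RTL against the reference.
    module cgold_check(
      input logic        clk,
      input logic        commit_valid,
      input logic [5:0]  tid,       // committing host thread
      input logic [31:0] pc,        // its PC
      input logic [31:0] wb_value   // its write-back value
    );
      int exp_pc, exp_val;
      always_ff @(posedge clk) begin
        if (commit_valid) begin
          void'(c_gold_step(tid, exp_pc, exp_val));
          assert (pc == exp_pc && wb_value == exp_val)
            else $error("thread %0d diverged from C-gold at pc %h", tid, pc);
        end
      end
    endmodule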
Proxy Kernel
- How do we support I/O? The target doesn't have any peripherals (e.g., disks)
  - It would be a pain to program a system that can't read or write anything, and even more painful to make peripherals work with the actual H/W
- A minimal kernel acts as a proxy for the system calls invoked by newlib
  - The proxy kernel sends the arguments and the system call number to the frontend machine
  - The frontend machine performs the actual system call and returns the results to the proxy kernel
  - Finally the PC moves back to the application, and everybody is happy

C-gold Functional Simulator
- Baseline functional model used to verify our functional model written in SystemVerilog
  - Full 32-bit SPARC v8
  - Includes an IEEE 754-compatible FPU
  - New instruction introduced to support active messages: SENDAM
- Written from scratch, no junk in it
- Very fast, ~25 MIPS
- Easy to understand; easy to add/modify modules for experiments
- Flexible parameters (number of target threads, host threads, ...)

Code & Cross-Compile
[Diagram: the same flow as in the Verification/Testing Flow slide, annotated with a system-call round trip: the application's write(...) traps into the proxy kernel on the target, a syscall request travels over the frontend links to the test server on the Solaris/Linux machine, and the syscall result returns to the proxy kernel.]

On-the-fly DEMO

Plans for the Next Version
- Try out several research projects
  - Performance counters
  - Virtual local stores
- Enhance the functional and timing models
  - Add FPUs
  - Memory/cache timing models

Acknowledgement
- Special thanks to Prof. Kurt Keutzer for help from Synopsys!

Backup Slides

Pipeline Architecture
[Diagram: the physical pipeline. Thread selection (static round robin; special registers pc/npc, WIM, PSR, and thread control registers; a microcode ROM issuing micro-instructions). Instruction Fetch 1 (issue tag/data read address over the 128-bit memory interface) and Instruction Fetch 2 (I-cache of nine 18kb BRAMs; tag compare; synthesized 32-bit instruction; memory request on a cache miss). Decode (resolve branch, decode register file address, ALU control/exception detection). Register File Access 1-3 (32-bit multithreaded register file in four 36kb BRAMs, a pipelined 2-cycle read; LUT ROM, LUTRAM, DSP, and BRAM at 2x clock). Execution (simple ALU on 1 DSP; MUL/DIV/SHF on 4 DSPs; LD/ST decoding, unaligned-address detection and store preparation; special-register handling for RDPSR/RDWIM). Memory 1-2 (D-cache of nine 18kb BRAMs; issue load; tag and 128-bit data read and select; 128-bit read-and-modify data; generate microcode request). Write Back/Exception (load align and write back; trap/IRQ handling).]
- Single-issue, in-order pipeline (integer only)
- 11 pipeline stages (no forwarding) -> 7 logical stages; extra pipeline stages for routing
- Static thread scheduling, zero-overhead context switch
- Avoid complex operations by falling back to "microcode", e.g. traps and ST (see the ROM sketch below)
- 32-bit I/O bus (threaded) with interrupt support
- Physical implementation
  - Manually controlled BRAM mapping; LUTRAM mapping by a memory compiler
  - All BRAM/LUTRAM/DSP blocks double-clocked or in DDR mode
  - ECC/parity-protected BRAMs (deep-submicron effects on FPGAs)
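A hedged sketch of the microcode fallback named above: when decode flags a complex operation, fetch switches that thread to a small on-chip ROM and streams synthesized instructions until the sequence ends. The encoding, the "last micro-op" marker, and all names are assumptions, not the actual RAMP Gold microcode.

    // Sketch: substitute ROM-resident micro-instructions for complex ops.
    // (Per-thread upc/active state omitted; the real pipeline interleaves threads.)
    module ucode_fetch #(parameter UDEPTH = 64) (
      input  logic        clk,
      input  logic        ucode_req,    // decode requests a microcode sequence
      input  logic [5:0]  ucode_entry,  // ROM entry point for the flagged op
      input  logic [31:0] icache_inst,  // normal instruction from the I-cache
      output logic [31:0] inst          // instruction actually issued to decode
    );
      logic [31:0] rom [UDEPTH];        // maps to LUT ROM or BRAM on the FPGA
      logic [5:0]  upc;
      logic        active;
      initial $readmemh("ucode.hex", rom);  // assumed init image

      always_ff @(posedge clk) begin
        if (ucode_req) begin active <= 1'b1; upc <= ucode_entry; end
        else if (active) begin
          upc <= upc + 1'b1;
          if (rom[upc][31]) active <= 1'b0;  // assumed "last micro-op" marker bit
        end
      end
      assign inst = active ? rom[upc] : icache_inst;
    endmodule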
Implementation Challenges
- CPU state storage
  - Minimize FPGA resource consumption: where does state live? how large is it? does it fit on the FPGA?
  - E.g., mapping the ALU to DSPs
- Host cache & TLB
  - Is a cache needed at all? What architecture and capacity?
  - Bandwidth requirements and R/W access ports; host multithreading amplifies the requirements

State Storage
- Complete 32-bit SPARC v8 ISA with traps/exceptions
- All CPU state (integer only) is stored in SRAMs on the FPGA
- Per-context register file -- BRAM
  - 3 register windows stored in BRAM chunks of 64
  - 8 (global) + 3*16 (register windows) = 56 registers
- 6 special registers
  - pc/npc -- LUTRAM
  - PSR (Processor State Register) -- LUTRAM
  - WIM (Window Invalid Mask) -- LUTRAM
  - Y (high 32 bits of MUL/DIV results) -- LUTRAM
  - TBR (Trap Base Register) -- BRAM (packed with the regfile)
- Buffers for host multithreading (LUTRAM)
- Maximum 64 threads per pipeline on Xilinx Virtex-5, bounded by LUTRAM depth (6-input LUTs)

Mapping the SPARC ALU to DSPs
- Xilinx DSP48E advantages (a hedged sketch follows after the cache-memory slide below)
  - 48-bit add/sub/logic/mux plus a pattern detector
  - Easy to generate ALU flags: < 10 LUTs for the C and V (carry and overflow) flags
  - Pipelined access at over 500 MHz

DSP Advantage
- Instruction coverage (two double-clocked DSPs per pipeline)
  - 1-cycle ALU (1 DSP): LD/ST address calculation; bit-wise logic (and, or, ...); SETHI (value bypass); JMPL, RETT, CALL (address calculation); SAVE/RESTORE (add/sub); WRPSR, RDPSR, RDWIM (XOR op)
  - Long-latency ALU instructions (1 DSP): shift/MUL (2 cycles)
- Saves 5%~10% of the logic for the 32-bit datapath

Host Cache/TLB
- Purpose: accelerating emulation performance! (A separate model is still needed for the target cache.)
- Per-thread (partitioned) cache, also sketched below
  - Split I/D direct-mapped, write-allocate, write-back cache
  - Block size: 32 bytes (the BEE3 DDR2 controller heartbeat)
  - 64-thread configuration: 256B I$ and 256B D$ per thread; sizes double in the 32-thread configuration
  - Non-blocking cache, 64 outstanding requests (max)
  - Physical tags, indexed by virtual or physical address (cache size < page size)
  - 67% BRAM usage
- Per-thread TLB
  - Split I/D direct-mapped TLBs: 8-entry ITLB, 8-entry DTLB
  - Currently a dummy: static translation for the Solaris virtual address layout

Cache-Memory Architecture
[Diagram: the host D-cache datapath. Data: four RAMB36SDP blocks (512x72 each, ECC-protected) forming a 128-bit path; tags: one 512x36 RAMB18SDP (parity-protected). The memory-controller side has a memory command FIFO, refill index/refill data, and victim-data write back; the pipeline side prepares the LD/ST address, looks up by index, compares and writes back tags over 64-bit paths, does read-and-modify and load select/routing, and a cache FSM reports hit/exception/replay to the pipeline state control across Memory Stage 1, Memory Stage 2, load align/sign, and the exception/write-back stage.]
- Cache controller
  - Non-blocking pipelined access (3 stages) matches the CPU pipeline
  - Decoupled access/refill allows pipelined, out-of-order memory accesses
  - Tells the pipeline to "replay" an instruction on a miss
  - The 128-bit refill/write-back datapath fills one block at 2x clock rate
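Two hedged sketches for the backup material above. First, the DSP ALU mapping: the add/sub datapath is arranged so synthesis can pack it into one DSP48E, with the C and V flags computed in a handful of LUTs outside the slice. The attribute, port names, and flag equations are illustrative assumptions, not the actual RAMP Gold code.

    // Sketch: 32-bit SPARC-style add/sub intended to map onto one DSP48E;
    // flag generation stays in fabric LUTs. Names and attributes are assumed.
    (* use_dsp48 = "yes" *)  // synthesis hint; the exact attribute is tool-specific
    module alu_dsp(
      input  logic [31:0] a, b,
      input  logic        sub,   // 1 = subtract (invert B, carry-in 1)
      output logic [31:0] y,
      output logic        c, v   // carry/borrow and signed overflow
    );
      logic [32:0] sum;
      logic [31:0] bx;
      assign bx  = sub ? ~b : b;
      assign sum = {1'b0, a} + {1'b0, bx} + {32'd0, sub};
      assign y   = sum[31:0];
      assign c   = sub ? ~sum[32] : sum[32];              // borrow for subtracts
      assign v   = (a[31] == bx[31]) && (y[31] != a[31]); // signed overflow
    endmodule

Second, the per-thread host cache lookup: with 64 threads, a 256 B per-thread D$, and 32 B blocks, each thread owns 8 sets, so the set index is just the thread id concatenated with three address bits. That yields 512 sets, matching the 512-entry tag/data arrays in the cache-memory slide. Again, signal names and the tag width are assumptions.

    // Sketch: index/tag split for the partitioned per-thread direct-mapped D$.
    module dcache_index(
      input  logic [5:0]  tid,        // 64 host threads
      input  logic [31:0] addr,       // target memory address
      output logic [8:0]  set_index,  // 64 threads x 8 sets = 512 sets total
      output logic [23:0] tag         // physical tag (width assumed)
    );
      assign set_index = {tid, addr[7:5]};  // 32 B blocks: addr[4:0] is the offset
      assign tag       = addr[31:8];        // cache size < page size, so the
                                            // virtual index equals the physical one
    endmodule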
Example: A Distributed-Memory, Non-Cache-Coherent System
- Eight multithreaded SPARC v8 pipelines in two clusters
  - Each thread emulates one independent node in the target system: 512 nodes/FPGA
  - Predicted emulation performance: ~1 GIPS/FPGA (assuming 10% I$ misses, 30% D$ misses, 30% LD/ST), 2x a naïve manycore implementation
- Memory subsystem
  - Total memory capacity 16 GB: 32MB/node x 512 nodes
  - One DDR2 memory controller per cluster; per-FPGA bandwidth: 7.2 GB/s
  - The memory space is partitioned to emulate the distributed memory system (see the address-mapping sketch at the end)
  - 144-bit wide, credit-based memory network
- Inter-node communication (under development)
  - Two-level tree network with DMA to provide all-to-all communication

Project Status
- Done with the RTL implementation
  - ~12,000 lines of synthesizable SystemVerilog code
  - FPGA resource utilization per pipeline on a Xilinx V5 LX110T: ~4% logic (LUTs), ~10% BRAM
  - At most 10 pipelines fit, but we back off to 8 or fewer
- Built the RTL verification infrastructure
  - SPARC v8 certification test suite (donated by SPARC International) + SystemVerilog
  - Can also run larger programs, but very slowly (~0.3 KIPS)

Project Status (cont.)
- Passed all SPARC v8 integer diagnostics in pre-synthesis RTL simulation
- Runs single-threaded Solaris apps (5 syscalls supported so far)
- Working on HW verification after synthesis and P&R
  - Synthesized with an alpha version of Synplify
  - Will support Mentor Graphics Precision 2008a Update 2 in late November
- Planned release in Jan 09
  - 64/128 emulated CPUs on a $500 Xilinx ML505 board, plus the cost of 2GB of DDR2 DRAM
  - Source code will be available under a BSD license

Thank you
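Addendum: the address-mapping sketch referenced in the distributed-memory example. With 512 nodes of 32 MB each, a node's (thread's) private window of the 16 GB host memory space can be selected by simple concatenation. This is an assumption for illustration; the slides only state that memory is partitioned, not how.

    // Hypothetical sketch of partitioning host memory among emulated nodes:
    // node_id picks one of 512 private 32 MB windows (2^9 x 2^25 B = 16 GB).
    module node_addr_map(
      input  logic [8:0]  node_id,     // 0..511 (one per emulated node/thread)
      input  logic [24:0] local_addr,  // 32 MB per node => 25 offset bits
      output logic [33:0] host_addr    // 16 GB host space => 34 bits
    );
      assign host_addr = {node_id, local_addr};  // window base + local offset
    endmodule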