POWER8 Scale Out, OpenPOWER and CAPI
Georgia IBM POWER User Group
16 APR 2015
JT Kellington
POWER8 Scale Out,
OpenPOWER and CAPI
POWER8 Scale Out
Power April 2014 Announcements
• New POWER8 Scale Out Servers
– IBM POWER8 2U 2 socket server: Power S822
– IBM POWER8 4U 1 socket server: Power S814
– IBM POWER8 4U 2 socket server: Power S824
• New POWER8 Linux Servers
– IBM POWER8 Linux 2U 1 socket server: Power S812L
– IBM POWER8 Linux 2U 2 socket server: Power S822L
• New Virtualization Management
– Enhanced HMC Functionality
– IBM PowerKVM – Kernel Virtual Machine
• New Linux Distro Offering
– Canonical Ubuntu
– Available on Linux Power servers with PowerKVM
© 2015 IBM Corporation
Power April 2014 Announcements
• New I/O Options
– Ethernet
• New IBM i Releases
– IBM i 7.2 (1st new version in 4 years)
– IBM i 7.1 TR8
• POWER8 Hardware support
– IBM BLU Acceleration Solution - Power Systems Edition
– IBM PowerVP – Virtualization Performance
– IBM PowerSC – Security and Compliance
– IBM PowerVM
– IBM PowerVC
Innovation Drives Performance
[Chart: relative % of improvement (0% to 100%), split into gain by technology scaling and gain by innovation, across the 180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm and 22 nm process nodes]
POWER8:
The First Processor Designed for Big Data
IBM 22nm Technology
• Silicon-on-Insulator
• 15 metal layers
• Deep trench eDRAM
POWER8 Processor
Compute
• 12 cores (thread strength optimized)
• SMT8, 16-wide execution
• 2X internal data flows
• Transactional Memory
Cache
• 64KB L1 + 512KB L2 / core
• 96MB L3 + up to 128MB L4 / socket
• 2X bandwidths
System Interfaces
• 230 GB/s memory bandwidth / socket
• Up to 48x Integrated PCI gen 3 / socket
• CAPI (over PCI gen 3)
• Robust, Large SMP Interconnect
• On chip Energy Mgmt, VRM / core
POWER8 Memory Organization (Max Config shown)
[Diagram: POWER8 DCM attached to eight memory buffers, each with 16MB of cache in front of 128 GB of DRAM chips]
• Up to 1 TB / Socket
• First P8 Systems: 512 GB / Socket
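The maximum configuration is simple arithmetic over the memory buffers. The sketch below reproduces it in Python; the constant and function names are illustrative and not from the slides, and the count of eight buffers is read off the diagram.

```python
# Per-socket memory totals for the max configuration shown on the slide.
# Names are illustrative; the count of eight buffers is read off the diagram.
BUFFERS_PER_SOCKET = 8
DRAM_PER_BUFFER_GB = 128
CACHE_PER_BUFFER_MB = 16


def socket_totals(buffers=BUFFERS_PER_SOCKET):
    """Return (DRAM in GB, buffer cache in MB) for one socket."""
    return buffers * DRAM_PER_BUFFER_GB, buffers * CACHE_PER_BUFFER_MB


dram_gb, cache_mb = socket_totals()
print(f"{dram_gb} GB DRAM ({dram_gb // 1024} TB), {cache_mb} MB buffer cache per socket")
```

The 128 MB cache total agrees with the "up to 128MB L4 / socket" figure on the POWER8 processor slide.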
POWER8 Performance
[Charts comparing POWER5 through POWER8: IO bandwidth (scale-out systems), memory BW per socket, per core performance gains (mixed workloads) and per socket performance gains (SMT8)]
POWER8 Scale-Out Systems
Power Systems scale-out portfolio
Power Systems S812L
•1-socket, 2U
•Linux Only
•KVM and PowerVM
Power Systems S822L
•2-socket, 2U
•Linux Only
•KVM and PowerVM
Power Systems S822
•2-socket, 2U
•All Operating Systems
•PowerVM only
Power Systems S814
•1-socket, 4U
•All Operating Systems
•PowerVM only
Power Systems S824
•2-socket, 4U
•All Operating Systems
•PowerVM only
Power Systems S824L
•2-socket, 4U
•Linux Only
•Bare metal
POWER8 2U Scale Out Comparison
                             Power 730                  Power S822
Processor                    POWER7+                    POWER8
Sockets                      2                          2
Cores                        8 / 12 / 16                12 / 20
Maximum Memory               512 GB @ 1066 MHz          512 GB / 1 TB @ 1600 MHz
Memory Cache                 No                         Yes
Memory Bandwidth             68 GB/sec                  192 GB/sec
Memory DRAM Spare            No                         Yes
IO Expansion Slots           Dual GX++                  4 PCIe x16 G3
PCIe slots                   5 PCIe x8 LP               4 / 5 PCIe x8 LP, 2 / 4 PCIe x16 LP
PCIe Hot Plug Support        No                         Yes
IO bandwidth                 60 GB/sec                  192 GB/sec
Ethernet ports               Four 1 Gbt                 Four 1 Gbt
SFF                          6                          12
Easy Tier Support            No                         Yes
Integrated split backplane   Yes ( 3 + 3 )              Yes ( 6 + 6 )
Service Processor            Generation 1               Generation 2
POWER8 4U Scale Out Comparison
                             Power 720                  Power System S814
Processor                    POWER7+                    POWER8
Sockets                      1                          1
Cores                        4/6/8                      6/8
Maximum Memory               512 GB @ 1066 MHz          512 GB @ 1600 MHz
Memory Cache                 No                         Yes
Memory Bandwidth             136 GB/sec                 192 GB/sec
Memory DRAM Spare            No                         Yes
IO Expansion Slots           Dual GX++                  4 PCIe x16 G3
PCIe slots                   5 PCIe x8 FH / HL,         5 PCIe x8 FH / HL,
                             4 PCIe x8 HH / HL (opt)    2 PCIe x16 FH / FL
CAPI (Capable slots)         N/A                        One
PCIe Hot Plug Support        No                         Yes
IO bandwidth                 40 GB/sec                  96 GB/sec
Ethernet ports               Quad 1 Gbt                 Quad 1 Gbt (x8 Slot)
SFF bays                     6                          12
Easy Tier Support            No                         Yes
Integrated split backplane   Yes ( 3 + 3 )              Yes ( 6 + 6 )
Service Processor            Generation 1               Generation 2
POWER8 4U Comparison
                             Power 740                  Power Systems S824
Processor                    POWER7+                    POWER8
Sockets                      2                          2
Cores                        16                         24
Maximum Memory               1 TB @ 1066 MHz            1 TB (2 TB ) @ 1600 MHz
Memory Cache                 No                         Yes
Memory Bandwidth             68 GB/sec                  192 GB/sec
Memory DRAM Spare            No                         Yes
IO Drwr Expansion Slots      Dual GX++                  4 PCIe x16 G3
PCIe slots                   5 PCIe x8 FH / HL,         7 PCIe x8 FH / HL,
                             4 PCIe x8 HH / HL (opt)    4 PCIe x16 FH / FL
PCIe Hot Plug Support        No                         Yes
IO bandwidth                 60 GB/sec                  192 GB/sec
Ethernet ports               Quad 1 Gbt                 Quad 1 Gbt
SFF bays                     6                          12
Integrated split backplane   Yes ( 3 + 3 )              Yes ( 6 + 6 )
Easy Tier                    No                         Yes
Service Processor            Generation 1               Generation 2
Performance / Benchmarks
POWER8 System Performance
[Chart comparing P8 S824, P5+ 595 and P4 690]
Power 740 vs Power S824
[Charts comparing P 740+ and P8 S824 on four measures: Performance, Max Watts, Performance per KW and Performance per BTU]
• ~2x Better Performance
• 50% more Cores
• More Internal Storage
• More I/O Slots
• Higher Perf Memory
• Greater Energy Efficiency
• Better Thermal Characteristics
SAP Sales & Distribution 2-Tier ERP 6 Benchmark
24 Core Systems
2x+ Better Performance than nearest Intel competition
[Chart: IBM S824 compared with Fujitsu RX300 S8, HP ProLiant BL460c and Cisco UCS C240 M3]
Siebel CRM Release 8.1.1.4 Benchmark
>3x
[Charts: Performance Leadership and Per Core Performance, each comparing IBM Power S824 (6-core), Oracle SPARC T4-2 (16-core) and Cisco UCS B200 M3 (16-core)]
eBS 12.1.3 Payroll Benchmark
2x +
[Charts: Performance Leadership and Per Core Performance, each comparing IBM Power S824 (12-core), Cisco UCS B200 M3 (24-core) and Oracle SPARC X3-2L (16-core)]
Operating Systems
POWER8 AIX Levels
[Timeline chart, 11 / 2012 through 3Q / 2014: service pack levels for AIX 6.1 TL7, TL8 and TL9 and for AIX 7.1 TL1, TL2 and TL3. Chart callouts: SP3 + APAR IV56366 and SP3 + APAR IV56367]
Legend:
• P7 or P6 Modes with Virtual I/O
• P7 or P6 Modes with Full I/O Support
• P8, P7 or P6 Modes with Full I/O Support
Why AIX……
• Best Performance and Scalability
– Scales to 256 Cores
– #1 SAP System performance
– #1 SAP per Core performance
• Most Available
– AIX & Power # 1 in availability (ITIC 2013 report)
• Most Secure
– CAPP/OSPP/EAL4+ Security Certification
– 0 reported security breaches with SAP and IBM DB2 or Oracle DB on AIX & Power
• Self Tuning (Dynamic System Optimization)
– Monitors and adjusts optimizations as needed
– Cache & Memory affinity
– Shared memory & Data Stream Pre-fetch optimization
• Minimize Memory requirements
– Active Memory Expansion
Investment being made into AIX……
• Hot patching of AIX Kernel
– Apply fix to “Live” AIX Kernel
– No reboot of the partition required
– No recycling of the applications
• CAPI Enablement
– Support of CAPI resources
• SRIOV Enhancements
– FCoE & Fibre Channel
• Performance improvements
– Pthreads Trans Memory
• Future Considerations
– AME Enhancements
– Larger Max memory
– Split Core support
– DSO Enhancements
IBM i Levels
IBM i 7.1 TR8
POWER7
Max Scale = 32 cores (SMT4)
Max Partition = 64 cores (SMT4)
Threads = ST, SMT2, SMT4 up to 256 threads in single partition
POWER8
Max Scale = 32 cores (SMT8)
Max Partition = 64 cores (SMT4)
Threads = ST, SMT2, SMT4, SMT8 up to 256 threads / single partition
IBM i 7.2
POWER7
Max Scale = 32 cores (SMT4)
Max Partition = 96 cores (SMT4)
Threads = ST, SMT2, SMT4 up to 384 threads in single partition
POWER8
Max Scale = 48 cores (SMT8)
Max Partition = 96 cores (SMT8)
Threads = ST, SMT2, SMT4, SMT8 up to 768 threads / single partition
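The thread limits follow from the maximum partition size multiplied by the SMT mode. The sketch below checks that arithmetic against the slide's figures; the dictionary layout and function name are illustrative only.

```python
# Maximum hardware threads in a partition = cores x threads per core (SMT mode).
# Values are taken from the IBM i Levels slide; the dictionary layout is illustrative.
MAX_PARTITION = {
    ("IBM i 7.1 TR8", "POWER7"): (64, 4),
    ("IBM i 7.1 TR8", "POWER8"): (64, 4),
    ("IBM i 7.2", "POWER7"): (96, 4),
    ("IBM i 7.2", "POWER8"): (96, 8),
}


def max_threads(cores, smt):
    """Threads available when every core runs in the given SMT mode."""
    return cores * smt


for (release, cpu), (cores, smt) in MAX_PARTITION.items():
    print(f"{release} on {cpu}: {cores} cores x SMT{smt} = {max_threads(cores, smt)} threads")
```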
IBM i 7.2 and POWER8 Highlights
• Enhancing Systems of Engagement and Systems of Record:
– POWER8 enables new levels of performance, reliability and scalability, making it simpler to integrate systems of engagement and systems of record on a single system and single architecture
– IBM i 7.2 locks down business data, increases security and improves performance, minimizing risk as you extend business systems to customers through mobile and cloud. And, combined with new encrypt/decrypt capabilities in POWER8, ensuring your data is protected has never been easier
• Key Capabilities:
– Powerful new features of DB2® for i ensure security of the data in a modern environment of mobile, social and network access
– IBM Navigator for i extends system management capabilities to manage and monitor performance services
– Integrated Security SSO application suite extended to include FTP and Telnet authentication with Kerberos
– PowerHA SystemMirror for i Express Edition introduces HyperSwap and improves system resiliency to ensure continual access for customers and employees
– Analytics: combined value of DB2 WebQuery & Cognos on Linux on Power
– Free Format RPG provides game changing enhancements for developers, making extension to mobile and social easier.
POWER8 Linux Distros
[Timeline starting 2Q / 2014]
RHEL 6: RHEL 6.5 (P7 Mode in P8)
RHEL 7: RHEL 7.0 - POWER8 Support; RHEL 7.1 – LE KVM Support
SLES 11: SLES 11 + SP3 (P7 Mode in P8)
SLES 12: POWER8 LE KVM
Ubuntu (LE): 14.04.00/01 (P8 Support)
Virtualization
Power System Software
An intelligent IT infrastructure for Cloud, Big Data,
Analytics & Mobile
Simplified Virtualization and Cloud Management
Expanded choice and enhanced value for the industry’s most scalable & flexible virtualization
infrastructure for UNIX, Linux and IBM i
New
PowerKVM: Open Virtualization for scale-out Linux Systems
• Kernel-Based Virtual Machine(KVM) Open Source Hypervisor for virtualizing Linux
guest VMs on POWER8 Linux Scale-out servers
• Exploit existing Linux admin skills and tools
• Leverage Power systems performance and resiliency
PowerVM: Virtualization without Limits
• Delivers higher levels of utilization
• Simplified virtualization user experience with new performance views & capacity data
PowerVP: Virtualization Performance
• Improved memory and shared processor affinity to optimize performance and
service levels
PowerVC (Virtualization Center): Increase IT productivity and agility
• Built on OpenStack
• Improved scalability, active directory support and shared storage pools enabling faster
integration with clients existing infrastructure
SmartCloud Entry for Power Systems*
• Extended capability to enable customization & quicker deployment of
OpenStack-based cloud solutions
Power Systems Performance Monitoring
HMC Past
• Disjoint set of tools
• Multiple agents need to be installed in OS
• Minimal or Lack of Visualization
HMC in 2Q-2014
• Integrated Visual Monitor in HMC
• Standard set of Interfaces for
external APIs to consume data
Performance Monitoring – Metrics & Dashboard
Performance metric indicators & utilization dashboard
• Processor, memory & I/O
Server & LPAR level information
Basic trend data collection and visualization
• Identify bottlenecks
• Early problem detection
REST based API to access:
• All platform (PHYP & VIOS) metrics for Tivoli
• Third Party tools
Provides full PowerVM performance and capacity metrics via a single touch-point (HMC).
Power Virtualization Options
PowerKVM
Initial Offering: Q2 2014
PowerKVM provides an Open Source choice for Power Virtualization for Linux workloads. Best for clients that have Linux centric admins.
PowerVM
Initial Offering: 2004
PowerVM is Power Virtualization that will continue to be enhanced to support AIX, IBM i Workloads as well as Linux Workloads
PowerVM vs PowerKVM Comparison
                                      PowerVM                       PowerKVM
GA Availability                       2004                          Q2 2014
Supported Hardware                    All P6, P7, P7+, P8 Systems   PowerLinux P8 Systems
Supported OS                          AIX, IBM i & Linux            Linux
Workload Mobility                     Supports AIX, IBM i & Linux   Linux
Basic Virtualization Management       IVM / HMC / FSM               Virtman/libvirt
Advanced Virtualization Management    PowerVC/VMControl             PowerVC, Vanilla OpenStack
Admin Type                            Power Centric                 Linux/x86 Centric
Established Security Track Record
on Power                              Yes                           No
Open Source Hypervisor                No                            Yes
PowerKVM Positioning
• First release available in 2014
• Focus: New Linux workloads for Power Systems
• Seamless transition for existing Linux admins to adopt Power Linux Virtualization without any training
• No HMC or other traditional IBM consoles
– Normal Linux management and OpenStack options
• PowerKVM only supports Linux guest VMs
• Cloud potential: Have many more small VMs than traditional Power Virtualization
• POWER8 PowerLinux hardware only
• Live Workload mobility support between PowerKVM servers
• Open Source Hypervisor: Hardware is abstracted by firmware
• Managed by OpenStack(PowerVC) or by off the shelf OpenStack or local Linux Tools
POWER8 Scale Out,
OpenPOWER and CAPI
OpenPOWER
The Era of Heterogeneous Computing is Coming…
Microprocessors and technology alone are no longer driving Cost/performance improvements
Processors
Semiconductor Technology
Without Price Increases
2 socket systems
2 socket sys @ constant cost
System stack innovations are
required to drive cost/performance
System Stack
Applications and services
Systems Management &
Cloud Deployment
Systems Acceleration &
HW/SW Optimization
Firmware, Operating System
and Hypervisor
Processors
Semiconductor Technology
Some Example Use Cases
Workload Acceleration
Services Delivery Model
Advanced Memories
Optimized System Design
Custom SOC’s
OpenPOWER Extends Moore’s Law to the System
OpenPOWER will enable data
centers to rethink their approach to
technology.
Member companies may use
POWER for custom open servers
and components for Linux based
cloud data centers.
OpenPOWER ecosystem partners
can optimize the interactions of
server building blocks –
microprocessors, networking, I/O &
other components – to tune
performance.
How will the OpenPOWER Foundation
benefit clients?
– OpenPOWER technology creates
greater choice for customers
– Open and collaborative development
model on the Power platform will
create more opportunity for
innovation
– New innovators will broaden the
capability and value of the Power
platform
What does this mean to the industry?
– Game changer on the competitive
landscape of the server industry
– Will enable and drive innovation in
the industry
– Provide more choice in the industry
Platinum Members
Fueling an Open Development Community
Implementation / HPC / Research
System / Software / Integration
I/O / Storage / Acceleration
Boards / Systems
Chip / SOC
Complete member list at www.openpowerfoundation.org
OpenPOWER: Growing Fast
System/Software/Services
I/O, Storage, Acceleration
Boards/Systems
Chip/SOC
***Chart from April 2014!!!
“POWER” Built for Open Innovation
POWER Processors have a Leadership Set of Differentiated Interfaces
[Diagram: POWER8/8+ processors (PowerCore) with DMI memory interface control to server class memory, CAPI to IBM & partner devices, and NVLINK to GPU/other devices]
Innovation with OpenPOWER is taking place on all interfaces and with custom SOC Designs
Redesigning the Computer
[Diagram: Services on top of Targeted Software Acceleration Packs, Transparent Tooling and Middleware Like Abstraction, running on CPU’s + FPGA or GPU]
CPU’s
• Strong Cores for Serial Codes
• Runs Traditional & Legacy Software
• Runs OS (Security, Virtualization, etc)
FPGA or GPU
• Extreme Parallelism available
• Targeted Software Accelerator packs
• IP Base Libraries
• Customer IP
• Reconfigurable Nature fights Commoditization
Greater robustness is achieved by mating of specializations….
When to Use FPGAs
• Transistor Efficiency & Extreme Parallelism
– Bit-level operations
– Variable-precision floating point
• Power-Performance Advantage
– >2x compared to Multicore (MIC) or GPGPU
– Unused LUTs are powered off
• Technology Scaling better than CPU/GPU
– FPGAs are not frequency or power limited yet
– 3D has great potential
• Dynamic reconfiguration
– Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected ...
– allows network as well as compute specialization
When to Use GPGPUs
• Extreme FLOPS & Parallelism
– Double-precision floating point leadership
– Hundreds of GPGPU cores
• Programming Ease & Software Group Interest
– CUDA & extensive libraries
– OpenCL
– IBM Java (coming soon)
• Bandwidth Advantage on Power
– Start w/PCIe gen3 x16 and then move to NVLink
• Leverage existing GPGPU eco-system and development base
– Lots of existing use-Cases to build on
– Heavy HPC investment in GPGPU
Power8 Invents CAPI
• Coherent Attached Processor Proxy (CAPP) in processor
– Unit on processor that extends coherency to an attached device
– On processor directory responds on behalf of off-chip device (Filtering snoops)
• Coherency protocol tunneled over standard PCIe
– Eliminates the need for special I/Os and protocol logic
– CAPI utilizes standard Posted Write and Non-posted Reads
– Reduces the complexity and bandwidth requirements of the attached device
• Enables attached device to be a peer to the processor
– Simplifies programming model between application
– Enables device to use same effective address as application running in processor
– Eliminates the cumbersome I/O Device Driver requirements
– Pinned memory not required
[Diagram: Power Processor containing the CAPP, connected by CAPI over PCIe to a Coherently Attached Device]
Why CAPI is Better than Traditional PCIe
[Diagram: CAPI FPGA holding the IBM Supplied POWER Service Layer and Function 0 through Function n, attached over PCIe to the CAPP in the Power Processor]
Typical I/O Model Flow
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion
Flow with a Coherent Model
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
Advantages of Coherent Attachment Over I/O Attachment
• Virtual Addressing & Data Caching
– Shared Memory
– Lower latency for highly referenced data
• Easier, More Natural Programming Model
– Traditional thread level programming
– Long latency of I/O typically requires restructuring of application
• Enables Applications Not Possible on I/O
– Pointer chasing, etc…
Workloads to Innovate
• Start with what FPGAs are good at: Embarrassingly Parallel
Problems
• Combine with CAPI strengths:
– Ease of programming
– Lack of device driver
– Shared memory & caching (host to accelerator communication)
• What do you get:
– Bitwise data manipulation (e.g. Deep Compression)
– Pattern recognition
– Encryption
– Monte Carlo: statistical modeling for complex predictions
– Image Analytics & Biometrics: facial recognition, feature detection (e.g. cancer)
– Network Packet Processing & Inspection
– Bioinformatics (e.g. Sequence alignment)
– Reverse time migration (Oil & Gas)
– Ensemble Calculations of Numerical Weather Prediction
– Machine Learning
Example: File System Acceleration with CAPI-FPGA
• Compression
– IBM Gzip offers best combination of performance and compression rate
• De-Duplication
– Signature calculation is easy to integrate with compression datapath
• Crypto
– Crypto acceleration on P8
– FPGA is also a good fit, especially if crypto algorithm is non-standard
• Content analytics for real-time tagging
– IBM CAPI/FPGA accelerated text analytics
– IBM CAPI/FPGA accelerated image analytics
• Power 8 / CAPI benefits
– Very strong memory & I/O bandwidth
– Seamless integration with CAPI shared memory interface (the accelerator is just like another core)
– Variety of accelerator partners through OpenPOWER (Altera, Xilinx, NVIDIA, ...)
IBM Accelerated GZIP Compression
What it is:
• An FPGA-based low-latency GZIP Compressor & Decompressor with single-thread throughput of ~2GB/s and a compression rate significantly better than low-CPU overhead compressors like snappy.
IBM Accelerated Text Processing
What it is:
• A compiler/runtime system for accelerating text analytics on a shared-memory CPU-FPGA
[Diagram: AQL (rule language, SQL-like syntax) → systemT optimizer → compiled operator graph → systemT runtime (Java + FPGA), producing annotations on sample text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source”]
Results
• Big Speedup vs. Multithread SW
To appear @:
• Hot Chips 2014
FPGA Image & Video Processing
Goal: Extract relevant information from input image to enable object recognition
Motivations: Real-time, low-power, onboard image processing solution
Approach: Design fully-pipelined FPGA architectures → streaming application
Information Extraction: Edge Detection, Feature Extraction, Segmentation
Object Recognition: Template Matching
Information located where pixels change color (edges, blobs)
• Intrinsic properties of objects
• Object boundaries
• Sobel and Canny: extract contours/edges
• SURF: extract scale & rotation-invariant features
Applications requiring edge detection & feature extraction span a wide range of domains
• Computer/Machine Vision: Tracking, Object Recognition & Navigation
• General image proc.: Compression
• Quality Control: Unsupervised Defect Identification
• Medical Imaging: Analysis + Diagnosis & Computer Guided Surgery
Custom Hardware Mapping
Theory
• 2D convolution with Gaussian Filter: blur
• 2D convolution with Gaussian 1st derivative: extract edges
• 2D convolution with Gaussian 2nd derivative: extract features
[Figure: Gaussian 1st derivative and 2nd derivative filters in X and Y]
Hardware Design
FPGA acceleration results from:
• Parallel 2D convolution: process all pixels inside filter in parallel
• Parallel 2D convolution in x, y, z direction
• Parallel 2D convolution for all filter scales (total of 33 filters)
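As a software point of reference for the edge-extraction step, here is a minimal pure-Python sketch of a 3x3 Sobel filter, one of the edge detectors named in these slides. It is not the FPGA design; the function names and the tiny test image are illustrative assumptions. The inner multiply-add loop is the part an FPGA can evaluate for all filter taps at once.

```python
# Sobel edge detection as a plain 2D filter (software reference, not FPGA code).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def convolve2d(image, kernel):
    """Valid-mode 2D correlation of a list-of-lists image with a 3x3 kernel.

    The kernel is not flipped; for Sobel that only changes the sign of the result.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            # These nine multiply-adds are independent and can run in parallel.
            acc = 0
            for dr in range(3):
                for dc in range(3):
                    acc += kernel[dr][dc] * image[r + dr - 1][c + dc - 1]
            row.append(acc)
        out.append(row)
    return out


def sobel_magnitude(image):
    """Approximate gradient magnitude |Gx| + |Gy| per interior pixel."""
    gx = convolve2d(image, SOBEL_X)
    gy = convolve2d(image, SOBEL_Y)
    return [[abs(a) + abs(b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


# 5x6 test image: dark left half, bright right half (a vertical edge).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)
```

The response is zero in the flat regions and large only in the two columns that straddle the edge.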
Results & Conclusions
Performance: OpenCL vs. VHDL performance table
            OpenCL performance                              VHDL performance
            Stratix 4               Stratix 5               Stratix 5
Apps.       Frames/sec  Max freq.   Frames/sec  Max freq.   Frames/sec  Max freq.
Sobel       475         170         909         300         870         300
Canny       470         170         890         300         823         309
SURF        392         170         870         300         804         283
Productivity: OpenCL vs. VHDL productivity table (Sobel, Canny, & SURF)
VHDL development time       6 months
OpenCL development time     1 month
IBM Accelerated Image Processing
What it is:
• A real-time multi-HD stream Harris-Laplace feature detection algorithm implemented in an FPGA
Performance:
• 166M pixels per second (i.e. multi-stream HD video)
To appear:
• IBM Journal of Research & Development
CAPI Attached Flash Optimization
– Attach TMS Flash to POWER8 via CAPI coherent Attach
– Issues Read/Write Commands from applications to eliminate 97% of code pathlength
– Saves 20-30 cores per 1M IOPs
[Diagram, conventional path: Application → Read/Write Syscall → FileSystem (strategy(), iodone()) → LVM (strategy(), iodone()) → Disk and Adapter DD (pin buffers, translate, map DMA, start I/O; then interrupt, unmap, unpin, iodone scheduling)]
[Diagram, CAPI path: Application → Posix Async I/O Style API (aio_read(), aio_write()) → User Library → Shared Memory Work Queue]
20K instructions reduced to <500
Flash as Slow Memory
[Diagram labels: client, server, network, flash, Memory, CAPI, Conventional PCIe I/O, acceptable latency]
Monte-Carlo CAPI Acceleration
Running
1 million iterations
At least
250x Faster
with CAPI FPGA +
POWER8 core
Full execution of a Heston
model pricing for a single
security:
1. SOBOL sequence
generator (pRNG)
2. Inverse Normal to create
the non-linear distribution
3. Path-generation
4. Pay-off function
Easier to Code:
Reduces C code writing by 40x compared to non-CAPI FPGA
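The four stages listed above can be sketched in software. The Python below is a simplified stand-in, not the FPGA design: it swaps the Heston model for a constant-volatility (Black-Scholes style) single-step path, and it uses a base-2 van der Corput sequence, which is the first dimension of a Sobol sequence, in place of a full SOBOL generator. All names and parameter values are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def van_der_corput(i):
    """Radical inverse of i in base 2: the first dimension of a Sobol sequence."""
    value, denom = 0.0, 1.0
    while i > 0:
        denom *= 2.0
        i, bit = divmod(i, 2)
        value += bit / denom
    return value


def price_call(s0, strike, rate, vol, years, n_paths):
    """Quasi-Monte Carlo price of a European call under constant volatility."""
    drift = (rate - 0.5 * vol * vol) * years
    diffusion = vol * math.sqrt(years)
    total = 0.0
    for i in range(1, n_paths + 1):
        u = van_der_corput(i)                        # 1. sequence generator
        z = STD_NORMAL.inv_cdf(u)                    # 2. inverse normal transform
        s_t = s0 * math.exp(drift + diffusion * z)   # 3. path generation (one step)
        total += max(s_t - strike, 0.0)              # 4. pay-off function
    return math.exp(-rate * years) * total / n_paths


print(round(price_call(100.0, 100.0, 0.05, 0.2, 1.0, 1 << 14), 3))
```

Each iteration is independent of the others, which is why this kind of workload maps well onto many parallel engines in an FPGA.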
POWER8-based Network Acceleration
Faster workloads with less infrastructure
IBM Power Systems and Mellanox® Technologies partnering to simultaneously accelerate the network and compute for NoSQL workloads.
• 10x higher throughput: Dramatically less data center infrastructure exploiting high speed networks with Remote DMA
• 10x lower latency: Dramatically faster responsiveness to customers leveraging POWER8 high throughput low latency I/O
[Diagram: RDMA data movement among Eastern sites (New York, Boston, Washington D.C.) and a Central site (Chicago)]
GPU Acceleration Example: Espresso
• We’re only just discovering how to make this data useful
Large global retailers
collect petabytes of data
Transactions generate
tens of millions of filing
cabinets of paper
How does a retailer
translate all of this data to
business value?
Group customers in
segments with similar
behavior
Customize products and
marketing programs
• Impossible to make this much data useful through human inspection
IBM Power Systems GPU Acceleration of Java Applications
• Now possible on today's Big Data and Java Workload Acceleration
– Use of segmentation or clustering in the retail industry
• Look for non-obvious patterns in the sales data and react quickly
• Analyze across tens of thousands of dimensions quickly and accurately
• Lends itself nicely to a bit of computer science known as "k-means clustering"
– Outcome could lead to new products, revised products and
advertising, launching new campaigns….wherever the data
leads you….
Imagine generating 100 times more ideas for new products and
campaigns – who can get you there?
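For readers unfamiliar with k-means clustering, here is a minimal pure-Python sketch of the basic (Lloyd's) algorithm. It is a plain software illustration, not the GPU-accelerated Java implementation; the customer data and all names are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]
centers, groups = kmeans(customers, centroids=[customers[0], customers[3]])
print(centers)
print(groups)
```

The distance calculations in the assignment step are independent per point, which is the part that parallel hardware such as a GPU can spread across many cores.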
GPU Espresso Demo
• IBM and NVIDIA are demonstrating segmentation
using GPU accelerated machine learning for
clustering using Hadoop / Mahout
– OpenPOWER initiative with NVIDIA
– First product implementing GPU acceleration for
Java
• Best-in-class ingredients
– IBM POWER8 – Designed for Big Data
– IBM Java
– NVIDIA CUDA GPU acceleration
– Ubuntu Little Endian Linux for POWER
• Achieving 8X performance improvement
OpenPOWER innovations benefit Clients
• Altera FPGA acceleration and IBM CAPI: Monte Carlo 250x faster than POWER8 core alone, reduced C code 40x over non-CAPI FPGA
• US Dept of Energy $325M super computing contract awarded to IBM, Mellanox, and NVIDIA
– DoE systems for science and stockpile stewardship
– Sierra and Summit systems to be >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia
• Data Engine for NoSQL: 24:1 server consolidation, 3x lower cost per user, 40TB CAPI-attached flash
• CAPI dev kit with FPGA card from Nallatech
• NVIDIA acceleration built into IBM Power S824L
• Tyan OpenPOWER Customer Reference System
• 8x faster than x86 Ivy Bridge on pattern extraction
• 82x faster for Cognos BI and DB2 BLU
University Research on Power8 Accelerators
• Photodynamic Therapy @ University of Toronto
• fMRI @ Western University
• Genomics @ University of Illinois Urbana-Champaign & Rice & Delft
• Seismic @ University of Texas
• Data Analytics @ North Carolina State University
• Financial Risk @ University of Florida
• The list is growing rapidly…
POWER8 Scale Out,
OpenPOWER and CAPI
What is CAPI?
What’s in a name?
FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
– It’s a re-programmable chip
– It can run fast (clock frequencies of 250 – 500 MHz or more)
– It has Industry Standard Interfaces like PCI-E Gen3
– The Major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation members
Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer friendly language.
*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit
[Diagram: FPGA attached over PCIE, loaded from an FPGA Library of gzip, Encrypt and Monte Carlo functions]
Why is an Accelerator Faster?
Question:
The POWER8 Processor runs at ~3 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?
Answer:
The FPGA is better for certain algorithms, such as those that are numerically intensive or have parallelism.
The POWER8 processor has a finite set of instructions to implement the algorithm in SW.
The FPGA is customized logic built for specific processing of an algorithm.
Why is an Accelerator Faster?
Example 1: Numerical Intensive Algorithm
[Diagram: in SW, Main (n,a,v,w) calls Integral (), Sigma (), Sin () and Cos () in turn on the variables; on the FPGA, the sin, cos, x, +, ∑ and ∫ operations are laid out as dedicated logic. Both paths end at "Done!"]
Why is an Accelerator Faster?
Example 2: Parallelism
Monte Carlo Risk Analysis to determine probability of financial success:
Given current finances, run 100 scenarios
[Diagram: in SW, Main (Vars) runs the Monte Carlo routine on the variables; on the FPGA, a variable distributor feeds Engine 1 through Engine 50 and a results accumulator collects their output. Counters shown: 5, 10, 50, 100]
So what is new?
Accelerators on FPGAs
have been around for a
long time….
So what is new?
Coherency makes the
accelerator a peer to
the POWER8 cores
What was done before CAPI?
Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator.
The device driver performed a memory mapping operation.
• 3 versions of the data (not coherent).
• 1000s of instructions in the device driver.
[Diagram: variables, input data and output data are held in the application's virtual address space, in the device driver storage area and in the FPGA on PCIE, all backed by the memory subsystem; the App and DD run on POWER8 cores]
CAPI Coherency
With CAPI, the FPGA shares memory with the cores
• 1 coherent version of the data.
• No device driver call/instructions.
[Diagram: variables, input data and output data are held once in the application's virtual address space; the FPGA reaches them through the PSL over PCIE and the memory subsystem; the App runs on POWER8 cores]
CAPI vs. I/O Device Driver: Data Prep
Typical I/O Model Flow: Total ~13µs for data prep
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Ret. From DD Completion
• Before acceleration: 7.9µs (instruction counts shown: 300 and 10,000)
• Acceleration: application dependent, but equal to the coherent model
• After acceleration: 4.9µs (instruction counts shown: 1,000, 3,000 and 1,000)
Flow with a Coherent Model: Total 0.36µs
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
• Shared Mem. Notify Accelerator: 400 Instructions, 0.3µs
• Acceleration: application dependent, but equal to the I/O model
• Shared Memory Completion: 100 Instructions, 0.06µs
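The totals on this slide can be checked by adding up the per-stage figures. The Python sketch below does that with the latencies and instruction counts shown on the slide; the variable and function names are illustrative, and the acceleration time itself is left out because the slide treats it as equal in both models.

```python
# Data-prep overhead around the accelerator call, in microseconds (from the slide).
IO_MODEL_US = {"before acceleration": 7.9, "after acceleration": 4.9}
COHERENT_US = {"notify accelerator": 0.3, "completion": 0.06}

# Instruction counts as listed on the slide for each model.
IO_MODEL_INSTRUCTIONS = [300, 10_000, 1_000, 3_000, 1_000]
COHERENT_INSTRUCTIONS = [400, 100]


def overhead(stages):
    """Total data-prep latency for one accelerator invocation."""
    return sum(stages.values())


io_us = overhead(IO_MODEL_US)
capi_us = overhead(COHERENT_US)
print(f"I/O model: {io_us:.2f} us, coherent model: {capi_us:.2f} us")
print(f"Latency ratio: {io_us / capi_us:.1f}x")
print(f"Instructions: {sum(IO_MODEL_INSTRUCTIONS)} vs {sum(COHERENT_INSTRUCTIONS)}")
```

The I/O model stages sum to 12.8µs, which the slide rounds to ~13µs, against 0.36µs for the coherent model.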
CAPI Differentiation
CAPI vs. I/O or Socket FPGA Solution
IBM Innovation: FPGA is a peer to the processor (caching and translations by PSL)
Customer Impact: Simple Programming paradigm; Higher performance
IBM Innovation: Architecture allows for any kind of FPGA or even an ASIC
Customer Impact: Flexible solutions; Connection to Flash, FC, EN….
IBM Innovation: Virtualization in the Architecture
Customer Impact: Applications can share Accelerator
[Diagrams: I/O Paradigm and CAPI Paradigm]
POWER8 Processor
Let’s take a closer look at how IBM Engineers made CAPI work
Technology
• 22 nm SOI, eDRAM, 15 ML, 650 mm²
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 execution pipes
• 2x internal data flows/queues
• Enhanced prefetching
• 64 KB data cache, 32 KB instruction cache
Accelerators
• Crypto and memory expansion
• Transactional memory
• VMM assist
• Data move/VM mobility
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP interconnect
• CAPI
Energy Management
• On-chip power management microcontroller
[Die diagram: POWER8 Scale-Out Dual Chip Module, showing cores, L2 caches, L3 banks, chip interconnect, memory bus, SMP interconnect, PCIe and CAPI.]
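A few aggregate figures follow directly from the per-core numbers on this slide. The short calculation below is a derived illustration, not part of the original deck. It assumes the 12-core configuration shown and treats the shared L3 as evenly divided, which gives only an average per core.

```python
# Per-core and per-module figures quoted on the slide.
CORES = 12
THREADS_PER_CORE = 8           # SMT8
L1_DATA_KB, L1_INSTR_KB = 64, 32
L2_PER_CORE_KB = 512
L3_SHARED_MB = 96

hardware_threads = CORES * THREADS_PER_CORE
l1_total_kb = CORES * (L1_DATA_KB + L1_INSTR_KB)
l2_total_mb = CORES * L2_PER_CORE_KB / 1024
l3_per_core_mb = L3_SHARED_MB / CORES   # average share of the shared L3

print(f"Hardware threads:    {hardware_threads}")
print(f"Total L1 (D + I):    {l1_total_kb} KB")
print(f"Total L2:            {l2_total_mb} MB")
print(f"Average L3 per core: {l3_per_core_mb} MB")
```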
How CAPI Works
• Acceleration portion (on the CAPI Developer Kit Card): data or compute intensive, storage or external I/O algorithm
• Application portion (on the POWER8 Processor): data set-up, control
• The card and the processor are connected over PCIe and share the same memory space
• Accelerator is a peer to POWER8 Core
CAPI technology connections
• Proprietary hardware to enable coherent acceleration
• Operating system enablement
– Ubuntu LE
– libcxl function calls
• Customer application and accelerator
• Application sets up data and calls the accelerator functional unit (AFU)
• AFU reads and writes coherent data across the PCIe and communicates with the application
– PSL cache holds coherent data for quick AFU access
[Diagram: the FPGA (AFU plus IBM-supplied PSL) connects over PCIe to the CAPP on the POWER8 Processor, next to coherent memory and a POWER8 core running the OS and App.]
CAPI solution flow
1. Connect to accelerator
– App: open device (cxl_afu_open_dev)
– AFU: reserved for work
2. Set up and attach
– App: set Work Element Descriptor (WED) at AddrX; the WED may contain addresses of other data structures
– App: attach device (cxl_afu_attach)
– OS / IBM-supplied PSL: reset AFU; PSL_WED_Ax is set to AddrX; AFU_CNTL_An[E] is set
3. Start accelerator
– CTL interface: jea gets AddrX, jcom gets start
– AFU: fetches AddrX (the WED) and starts operation; it understands the WED content and any other addressed data structures
4. AFU continues to work using the CMD, Buffer and Resp interfaces
5. If required, App can read or write AFU registers (MMIO interface)
6. Completion
– AFU finishes (mechanism is user defined); CTL interface: de-assert RUNNING, assert DONE
– App knows AFU is finished (mechanism is user defined)
– App can start again from top or free AFU: free device (cxl_afu_free)
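The open, attach, work, complete, free sequence can be sketched as a small state machine. This is a toy model in Python, not the libcxl API and not real AFU behavior: `ToyAfu`, `WorkElementDescriptor`, the field names `src`, `dst` and `done`, and the "double every value" workload are all hypothetical. The comments only indicate which step of the flow each method stands for.

```python
from dataclasses import dataclass, field


@dataclass
class WorkElementDescriptor:
    """Hypothetical WED layout; the real content is defined by each AFU."""
    src: list
    dst: list = field(default_factory=list)
    done: bool = False              # user-defined completion flag


class ToyAfu:
    """Toy model of an AFU; the states loosely follow the slide's steps."""

    def __init__(self):
        self.state = "free"
        self.wed = None

    def open_dev(self):             # step 1, cf. cxl_afu_open_dev
        if self.state != "free":
            raise RuntimeError("AFU already reserved")
        self.state = "reserved"

    def attach(self, wed):          # step 2, cf. cxl_afu_attach
        if self.state != "reserved":
            raise RuntimeError("open the device before attaching")
        self.wed = wed              # models the WED address being handed over
        self.state = "running"      # models the AFU being enabled

    def run(self):                  # steps 3-4: fetch the WED and do the work
        if self.state != "running":
            raise RuntimeError("AFU not started")
        self.wed.dst[:] = [x * 2 for x in self.wed.src]
        self.wed.done = True        # step 6: user-defined completion signal
        self.state = "done"

    def free(self):                 # cf. cxl_afu_free
        self.wed = None
        self.state = "free"


afu = ToyAfu()
afu.open_dev()
wed = WorkElementDescriptor(src=[1, 2, 3])
afu.attach(wed)
afu.run()
result = list(wed.dst)
finished = wed.done
afu.free()
print(result, finished, afu.state)
```

The application learns about completion by reading a flag in the shared descriptor, which mirrors the slide's point that the completion mechanism is user defined.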
POWER8 with CAPI Cards
[Photos: front view showing the POWER8 modules; side view showing the CAPI Dev Kit cards.]
Basic concepts of CAPI
CAPI vs. CAPI Solutions
CAPI: Platform for Innovation
• CAPI is a platform to enable acceleration
• CAPI provides an infrastructure to improve performance of an application through FPGA acceleration
– Enables customer-defined acceleration within the processor complex
• CAPI allows implementation of a wide range of accelerators to optimally address many different customer challenges
– Each implementation is a unique CAPI Solution
CAPI Solution: Specific Customer Solution
• A CAPI Solution is a specific implementation of an algorithm that uses an FPGA + application
• A CAPI Solution requires logic designers and programmers to implement the solution
• CAPI Solution Examples:
– Flash Appliance (IBM Data Engine for NoSQL)
– Monte Carlo Algorithm
Why Accelerate on CAPI?
• Reasons to consider CAPI Acceleration
– Higher Performance
▪ If your customer has a complex application running on a core, consider CAPI for better performance
▪ If your customer already does I/O attached FPGA acceleration, CAPI will simplify their software and provide better performance
– Lower IT Costs
▪ By moving workload to CAPI, your customer will need fewer cores
▪ In some cases, such as the IBM Data Engine for NoSQL, CAPI can do the same work with far less infrastructure
– Lower Power
▪ Running acceleration on an FPGA can result in lower power consumption vs. running the application as software on a core
Note:
When considering CAPI for a particular solution, we compare it to:
1. The same solution running as software –OR–
2. The same solution running on an IO attached FPGA
CAPI ecosystem partners and consumers
CAPI-APPS for Clients
• IBM CAPI Solutions (IBM Data Engine for NoSQL): Have a client who wants their IBM Application to be accelerated on CAPI (ex: DB2, CPLEX, Streams)? Contact: Jonathan Dement ([email protected])
• Partner Solutions: Have a client or partner who wants to create a CAPI-App and sell it to others? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile ([email protected]) about the opportunity
Clients with their Own Proprietary Solutions
• Have a client or partner who wants to create a proprietary CAPI Solution? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile ([email protected]).
Why tell Bruce Wile about the opportunity?
• Depending on the size of the opportunity, we will engage the CAPI Customer Enablement Team
Two Paths into CAPI
• CAPI Developer Kit: Clients create their own, proprietary business solution.
• CAPI Market Solutions (CAPI App Solutions): IBM & Partners create business solutions for the CAPI Market. Clients buy pre-packaged solutions from the CAPI Market.
CAPI Solutions
CAPI App Solutions
Open Development Driving CAPI Solutions
Implementation / HPC / Research
System / Software / Integration
I/O / Storage / Acceleration
Boards / Systems
Chip / SOC
© 2014 OpenPOWER Foundation
Complete member list at www.openpowerfoundation.org
Potential Markets for CAPI Solutions
[Diagram: the CAPI Market at the center, surrounded by market segments and example workloads.]
Market segments:
• Big Data / Database / Compute
• Weather
• Social / Media
• Medicine
• Finance / Insurance
• Visual / Biometric Analysis
• Oil & Gas
• Manufacturing / EDA
Example workloads:
• Edge of Network; JPEG & Video processing
• Network Packet Processing
• Database Acceleration/KVS
• Machine Learning
• Bitwise Data Manipulation
• Compression/Encryption
• Ensemble Calculations of Numerical Weather Prediction
• Radiation Therapy
• Pharmaceuticals
• Public Health Image Analysis
• Genomics
• Reverse Time Migration
• Database Acceleration & Fast Storage
• Data Analytics
• Pattern Recognition
• Risk Analysis
• Monte Carlo
• Pattern Analysis
• Retail Security
• Facial Recognition
• Deep Computation and Critical Runtime Jobs
• Fluid Dynamics
• 3D Modeling CAD
• Pipeline Analysis & Flow
• Specialized Algorithms
CAPI Availability
• See: http://www.ibm.com/support/customercare/sas/f/capi/home.html
• CAPI Developer Kit
– Procure through Nallatech
– For customers considering creating their own CAPI Solution
– CAPI Decision and Process Guide
– Requires POWER8 Server
– Available now
– See www.nallatech.com/capi
• First CAPI Solution:
IBM Data Engine for NoSQL
– Procure through IBM
– GA in early 2015
CAPI Developer Kit
CAPI Developer Kit – FPGA Card
• 2 Banks of SDRAM
• Dual 10G SFP+
• Altera Stratix V FPGA
• PCI-E Gen3
• Complete Datasheet
CAPI Developer Kit
IBM POWER8™ Server
CAPI Developer Kit
http://www.ibm.com/support/customercare/sas/f/capi/home.html
© Copyright International Business Machines Corporation 2015
Printed in the United States of America September 2015
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks or registered trademarks licensed by Power.org in the United States and/or other countries: Power ISA.
Information on the list of U.S. trademarks licensed by Power.org may be found at www.power.org/about/brand-center/.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document
are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction
could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not
affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied
license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document
was obtained in specific environments, and is presented as an illustration. The results obtained in other operating
environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations
or warranties of accuracy or completeness are made.
Note: This document contains information on products in the design, sampling and/or initial production phases
of development. This information is subject to change without notice. Verify with your IBM field applications
engineer that you have the latest version of this document before finalizing a design.
You may use this documentation solely for developing technology products compatible with Power Architecture®. You may not modify or distribute this documentation. No license,
express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. In no event will IBM be
liable for damages arising directly or indirectly from any use of the information contained in this document.
IBM Systems and Technology Group
2070 Route 52, Bldg. 330
Hopewell Junction, NY 12533-6351
The IBM home page can be found at ibm.com®.
Version 1.0
29 September 2014—IBM Confidential