POWER8 Scale Out, OpenPOWER and CAPI
Georgia IBM POWER User Group, 16 APR 2015
JT Kellington

POWER8 Scale Out, OpenPOWER and CAPI: POWER8 Scale Out

Power April 2014 Announcements
• New POWER8 Scale Out Servers
– IBM POWER8 2U 2 socket server: Power S822
– IBM POWER8 4U 1 socket server: Power S814
– IBM POWER8 4U 2 socket server: Power S824
• New POWER8 Linux Servers
– IBM POWER8 Linux 2U 1 socket server: Power S812L
– IBM POWER8 Linux 2U 2 socket server: Power S822L
• New Virtualization Management
– Enhanced HMC Functionality
– IBM PowerKVM: Kernel Virtual Machine
• New Linux Distro Offering
– Canonical Ubuntu
– Available on Linux Power servers with PowerKVM
© 2015 IBM Corporation

Power April 2014 Announcements (continued)
• New I/O Options
– Ethernet
• New IBM i Releases
– IBM i 7.2 (1st new version in 4 years)
– IBM i 7.1 TR8
• POWER8 Hardware support
– IBM BLU Acceleration Solution, Power Systems Edition
– IBM PowerVP: Virtualization Performance
– IBM PowerSC: Security and Compliance
– IBM PowerVM
– IBM PowerVC

Innovation Drives Performance
[Chart: relative % of improvement (0% to 100%) at each process node (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm), split into gain by technology scaling and gain by innovation]

POWER8: The First Processor Designed for Big Data
POWER8 Processor
IBM 22nm Technology
• Silicon-on-Insulator
• 15 metal layers
• Deep trench eDRAM
Compute
• 12 cores (thread strength optimized)
• SMT8, 16-wide execution
• 2X internal data flows
• Transactional Memory
Cache
• 64KB L1 + 512KB L2 / core
• 96MB L3 + up to 128MB L4 / socket
• 2X bandwidths
System Interfaces
• 230 GB/s memory bandwidth / socket
• Up to 48x Integrated PCI gen 3 / socket
• CAPI (over PCI gen 3)
• Robust, Large SMP Interconnect
• On chip Energy Mgmt, VRM / core

POWER8 Memory Organization (Max Config shown)
[Diagram: POWER8 DCM connected to eight memory buffers (16MB each), each with 128 GB of DRAM chips]
• Up to 1 TB / Socket
• First P8 Systems: 512 GB / Socket

POWER8 Performance
[Charts: IO Bandwidth (scale-out systems), Memory BW per Socket, Per Core Performance Gains (mixed workloads), and per Socket Performance Gains (SMT8), comparing processor generations from POWER5 to POWER8]

POWER8 Scale-Out Systems

Power Systems scale-out portfolio
System | Form factor | Operating systems | Virtualization
Power Systems S822 | 2-socket, 2U | All Operating Systems | PowerVM only
Power Systems S822L | 2-socket, 2U | Linux Only | KVM and PowerVM
Power Systems S812L | 1-socket, 2U | Linux Only | KVM and PowerVM
Power Systems S824L | 2-socket, 4U | Linux Only | Bare metal
Power Systems S814 | 1-socket, 4U | All Operating Systems | PowerVM only
Power Systems S824 | 2-socket, 4U | All Operating Systems | PowerVM only

POWER8 2U Scale Out Comparison
Feature | Power 730 | Power S822
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 8 / 12 / 16 | 12 / 20
Maximum Memory | 512 GB @ 1066 MHz | 512 GB / 1 TB @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 68 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 LP | 4 / 5 PCIe x8 LP, 2 / 4 PCIe x16 LP
PCIe Hot Plug Support | No | Yes
IO bandwidth | 60 GB/sec | 192 GB/sec
Ethernet ports | Four 1 Gbt | Four 1 Gbt
SFF bays | 6 | 12
Easy Tier Support | No | Yes
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Service Processor | Generation 1 | Generation 2

POWER8 4U Scale Out Comparison
Feature | Power 720 | Power System S814
Processor | POWER7+ | POWER8
Sockets | 1 | 1
Cores | 4 / 6 / 8 | 6 / 8
Maximum Memory | 512 GB @ 1066 MHz | 512 GB @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 136 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL, 4 PCIe x8 HH / HL (opt) | 5 PCIe x8 FH / HL, 2 PCIe x16 FH / FL
CAPI (Capable slots) | N/A | One
PCIe Hot Plug Support | No | Yes
IO bandwidth | 40 GB/sec | 96 GB/sec
Ethernet ports | Quad 1 Gbt | Quad 1 Gbt (x8 Slot)
SFF bays | 6 | 12
Easy Tier Support | No | Yes
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Service Processor | Generation 1 | Generation 2

POWER8 4U Comparison
Feature | Power 740 | Power Systems S824
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 16 | 24
Maximum Memory | 1 TB @ 1066 MHz | 1 TB (2 TB) @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 68 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Drwr Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL, 4 PCIe x8 HH / HL (opt) | 7 PCIe x8 FH / HL, 4 PCIe x16 FH / FL
PCIe Hot Plug Support | No | Yes
IO bandwidth | 60 GB/sec | 192 GB/sec
Ethernet ports | Quad 1 Gbt | Quad 1 Gbt
SFF bays | 6 | 12
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Easy Tier | No | Yes
Service Processor | Generation 1 | Generation 2

Performance / Benchmarks

POWER8 System Performance
[Chart comparing P8 S824, P5+ 595 and P4 690]

Power 740 vs Power S824
[Charts: Performance, Max Watts, Performance per KW, and Performance per BTU, each comparing P 740+ and P8 S824]
• ~2x Better Performance
• 50% more Cores, More Internal Storage, More I/O Slots, Higher Perf Memory
• Greater Energy Efficiency
• Better Thermal Characteristics

SAP Sales & Distribution 2-Tier ERP 6 Benchmark
• 24 Core Systems
• 2x+ Better Performance than nearest Intel competition
[Chart: IBM S824, Fujitsu RX300 S8, HP ProLiant BL460c, Cisco UCS C240 M3]

Siebel CRM Release 8.1.1.4 Benchmark
• Leadership Performance and Per Core Performance: >3x
[Charts: IBM Power S824 6-core, Oracle SPARC T4-2 16-core, Cisco UCS B200 M3 16-core]

eBS 12.1.3 Payroll Benchmark
• Performance Leadership and Per Core Performance: 2x+
[Charts: IBM Power S824 12-core, Cisco UCS B200 M3 24-core, Oracle SPARC X3-2L 16-core]

Operating Systems

POWER8 AIX Levels
[Timeline chart, 11/2012 through 3Q/2014: service pack levels for AIX 6.1 TL7, TL8 and TL9 and AIX 7.1 TL1, TL2 and TL3 (including SP3 + APAR IV56366 and SP3 + APAR IV56367), grouped into three support levels: P7 or P6 Modes with Virtual I/O; P7 or P6 Modes with Full I/O Support; P8, P7 or P6 Modes with Full I/O Support]

Why AIX……
• Best Performance and Scalability
– Scales to 256 Cores
– #1 SAP System performance
– #1 SAP per Core performance
• Most Available
– AIX & Power #1 in availability (ITIC 2013 report)
• Most Secure
– CAPP/OSPP/EAL4+ Security Certification
– 0 reported security breaches with SAP and IBM DB2 or Oracle DB on AIX & Power
• Self Tuning (Dynamic System Optimization)
– Monitors and adjusts optimizations as needed
– Cache & Memory affinity
– Shared memory & Data Stream Prefetch optimization
• Minimize Memory requirements
– Active Memory Expansion

Investment being made into AIX……
• Hot patching of AIX Kernel
– Apply fix to “Live” AIX Kernel
– No reboot of the partition required
– No recycling of the applications
• CAPI Enablement
– Support of CAPI resources
• SRIOV Enhancements
– FCoE & Fibre Channel
• Performance improvements
– Pthreads Trans Memory
• Future Considerations
– AME Enhancements
– Larger Max memory
– Split Core support
– DSO Enhancements

IBM i Levels
Release | Processor | Max Scale | Max Partition | Threads
IBM i 7.1 TR8 | POWER7 | 32 cores (SMT4) | 64 cores (SMT4) | ST, SMT2, SMT4; up to 256 threads in single partition
IBM i 7.1 TR8 | POWER8 | 32 cores (SMT8) | 64 cores (SMT4) | ST, SMT2, SMT4, SMT8; up to 256 threads in single partition
IBM i 7.2 | POWER7 | 32 cores (SMT4) | 96 cores (SMT4) | ST, SMT2, SMT4; up to 384 threads in single partition
IBM i 7.2 | POWER8 | 48 cores (SMT8) | 96 cores (SMT8) | ST, SMT2, SMT4, SMT8; up to 768 threads in single partition

IBM i 7.2 and POWER8 Highlights
• Enhancing Systems of Engagement and Systems of Record:
– POWER8 enables new levels of performance, reliability and scalability, making it simpler to integrate systems of engagement and systems of record on a single system and single architecture
– IBM i 7.2 locks down business data, increases security and improves performance, minimizing risk as you extend business systems to customers through mobile and cloud. Combined with new encrypt/decrypt capabilities in POWER8, ensuring your data is protected has never been easier
• Key Capabilities:
– Powerful new features of DB2® for i ensure security of the data in a modern environment of mobile, social and network access
– IBM Navigator for i extends system management capabilities to manage and monitor performance services
– Integrated Security SSO application suite extended to include FTP and Telnet authentication with Kerberos
– PowerHA SystemMirror for i Express Edition introduces HyperSwap and improves system resiliency to ensure continual access for customers and employees
– Analytics: combined value of DB2 WebQuery & Cognos on Linux on Power
– Free Format RPG provides game changing enhancements for developers, making extension to mobile and social easier.
POWER8 Linux Distros
Timeline from 2Q / 2014:
• RHEL 6: RHEL 6.5, P7 Mode in P8
• RHEL 7: RHEL 7.0, POWER8 Support; RHEL 7.1, LE KVM Support
• SLES 11: SLES 11 + SP3, P7 Mode in P8
• SLES 12: POWER8 LE KVM
• Ubuntu (LE): 14.04.00/01, P8 Support

Virtualization

Power System Software
An intelligent IT infrastructure for Cloud, Big Data, Analytics & Mobile
Simplified Virtualization and Cloud Management
Expanded choice and enhanced value for the industry’s most scalable & flexible virtualization infrastructure for UNIX, Linux and IBM i
New PowerKVM: Open Virtualization for scale-out Linux Systems
• Kernel-Based Virtual Machine (KVM) Open Source Hypervisor for virtualizing Linux guest VMs on POWER8 Linux Scale-out servers
• Exploit existing Linux admin skills and tools
• Leverage Power systems performance and resiliency
PowerVM: Virtualization without Limits
• Delivers higher levels of utilization
• Simplified virtualization user experience with new performance views & capacity data
PowerVP: Virtualization Performance
• Improved memory and shared processor affinity to optimize performance and service levels
PowerVC (Virtualization Center): Increase IT productivity and agility
• Built on OpenStack
• Improved scalability, Active Directory support and shared storage pools enabling faster integration with clients’ existing infrastructure
SmartCloud Entry for Power Systems*
• Extended capability to enable customization & quicker deployment of OpenStack-based cloud solutions

Power Systems Performance Monitoring
HMC Past
• Disjoint set of tools
• Multiple agents need to be installed in OS
• Minimal or Lack of Visualization
HMC in 2Q-2014
• Integrated Visual Monitor in HMC
• Standard set of Interfaces for external APIs to consume data

Performance Monitoring: Metrics & Dashboard
• Performance metric indicators & utilization dashboard
– Processor, memory & I/O
– Server & LPAR level information
• Basic trend data collection and visualization
– Identify bottlenecks
– Early problem detection
• REST based API to access:
– All platform (PHYP & VIOS) metrics for Tivoli, Third Party tools
• Provides full PowerVM performance and capacity metrics via a single touch-point (HMC).

Power Virtualization Options
PowerKVM (Initial Offering: Q2 2014)
PowerKVM provides an Open Source choice for Power Virtualization for Linux workloads. Best for clients that have Linux centric admins.
PowerVM (Initial Offering: 2004)
PowerVM is Power Virtualization that will continue to be enhanced to support AIX, IBM i Workloads as well as Linux Workloads.

PowerVM vs PowerKVM Comparison
Feature | PowerVM | PowerKVM
GA Availability | 2004 | Q2 2014
Supported Hardware | All P6, P7, P7+, P8 Systems | PowerLinux P8 Systems
Supported OS | AIX, IBM i & Linux | Linux
Workload Mobility | Supports AIX, IBM i & Linux | Linux
Basic Virtualization Management | IVM / HMC / FSM | Virtman/libvirt
Advanced Virtualization Management | PowerVC/VMControl | PowerVC, Vanilla OpenStack
Admin Type | Power Centric | Linux/x86 Centric
Established Security Track Record on Power | Yes | No
Open Source Hypervisor | No | Yes

PowerKVM Positioning
• First release available in 2014
• Focus: New Linux workloads for Power Systems
• Seamless transition for existing Linux admins to adopt Power Linux Virtualization without any training
• No HMC or other traditional IBM consoles
– Normal Linux management and OpenStack options
• PowerKVM only supports Linux guest VMs
• Cloud potential: Have many more small VMs than traditional Power Virtualization
• POWER8 PowerLinux hardware only
• Live Workload mobility support between PowerKVM servers
• Open Source Hypervisor: Hardware is abstracted by firmware
• Managed by OpenStack (PowerVC) or by off the shelf OpenStack or local Linux Tools

POWER8 Scale Out, OpenPOWER and CAPI: OpenPOWER

The Era of Heterogeneous Computing is Coming…
Microprocessors and technology alone are no longer driving Cost/performance improvements
[Chart labels: Processors; Semiconductor Technology; Without Price Increases; 2 socket systems; 2 socket sys @ constant cost]

System stack innovations are required to drive cost/performance
System Stack
• Applications and services
• Systems Management & Cloud Deployment
• Systems Acceleration & HW/SW Optimization
• Firmware, Operating System and Hypervisor
• Processors
• Semiconductor Technology
Some Example Use Cases
• Workload Acceleration
• Services Delivery Model
• Advanced Memories
• Optimized System Design
• Custom SOCs

OpenPOWER Extends Moore’s Law to the System
OpenPOWER will enable data centers to rethink their approach to technology. Member companies may use POWER for custom open servers and components for Linux based cloud data centers. OpenPOWER ecosystem partners can optimize the interactions of server building blocks (microprocessors, networking, I/O & other components) to tune performance.
How will the OpenPOWER Foundation benefit clients?
– OpenPOWER technology creates greater choice for customers
– Open and collaborative development model on the Power platform will create more opportunity for innovation
– New innovators will broaden the capability and value of the Power platform
What does this mean to the industry?
– Game changer on the competitive landscape of the server industry
– Will enable and drive innovation in the industry
– Provide more choice in the industry
Platinum Members
[Member logos]

Fueling an Open Development Community
[Member logos grouped by layer: Implementation / HPC / Research; System / Software / Integration; I/O / Storage / Acceleration; Boards / Systems; Chip / SOC]
Complete member list at www.openpowerfoundation.org

OpenPOWER: Growing Fast
[Member logos grouped by layer: System/Software/Services; I/O, Storage, Acceleration; Boards/Systems; Chip/SOC]
***Chart from April 2014!!!
“POWER” Built for Open Innovation
POWER Processors have a Leadership Set of Differentiated Interfaces
[Diagram: POWER8/8+ Processors (PowerCore) with DMI Memory Interface Control to Server Class Memory, CAPI to IBM & Partner Devices, and NVLINK to GPU/Other]
Innovation with OpenPOWER is taking place on all interfaces and with custom SOC Designs

Redesigning the Computer
[Diagram labels: Targeted Software Acceleration Packs; Transparent Tooling; Middleware Like Abstraction; Services]
CPUs + FPGA or GPU
CPUs
• Strong Cores for Serial Codes
• Runs Traditional & Legacy Software
• Runs OS (Security, Virtualization, etc)
FPGA or GPU
• Extreme Parallelism available
• Targeted Software Accelerator packs
• IP Base Libraries
• Customer IP
• Reconfigurable Nature fights Commoditization
Greater robustness is achieved by mating of specializations….

When to Use FPGAs
• Transistor Efficiency & Extreme Parallelism
– Bit-level operations
– Variable-precision floating point
• Power-Performance Advantage
– >2x compared to Multicore (MIC) or GPGPU
– Unused LUTs are powered off
• Technology Scaling better than CPU/GPU
– FPGAs are not frequency or power limited yet
– 3D has great potential
• Dynamic reconfiguration
– Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected ...
– allows network as well as compute specialization

When to Use GPGPUs
• Extreme FLOPS & Parallelism
– Double-precision floating point leadership
– Hundreds of GPGPU cores
• Programming Ease & Software Group Interest
– CUDA & extensive libraries
– OpenCL
– IBM Java (coming soon)
• Bandwidth Advantage on Power
– Start w/PCIe gen3 x16 and then move to NVLink
• Leverage existing GPGPU eco-system and development base
– Lots of existing use-cases to build on
– Heavy HPC investment in GPGPU

Power8 Invents CAPI
• Coherent Attached Processor Proxy (CAPP) in processor
– Unit on processor that extends coherency to an attached device
– On processor directory responds on behalf of off-chip device (Filtering snoops)
• Coherency protocol tunneled over standard PCIe (CAPI over PCIe)
– Eliminates the need for special I/Os and protocol logic
– CAPI utilizes standard Posted Write and Non-posted Reads
– Reduces the complexity and bandwidth requirements of the attached device
• Enables attached device to be a peer to the processor
– Simplifies programming model between application and attached device
– Enables device to use same effective address as application running in processor
– Eliminates the cumbersome I/O Device Driver requirements
– Pinned memory not required
[Diagram: Power Processor with CAPP, connected over PCIe to a Coherently Attached Device]

Why CAPI is Better than Traditional PCIe
[Diagram: CAPI FPGA containing the IBM Supplied POWER Service Layer and Function 0, Function 1, Function 2 ... Function n, attached over PCIe to the CAPP in the Power Processor]
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion
Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion

Advantages of Coherent Attachment Over I/O Attachment
• Virtual Addressing & Data Caching
– Shared Memory
– Lower latency for highly referenced data
• Easier, More Natural Programming Model
– Traditional thread level programming
– Long latency of I/O typically requires restructuring of application
• Enables Applications Not Possible on I/O
– Pointer chasing, etc…

Workloads to Innovate
• Start with what FPGAs are good at: Embarrassingly Parallel Problems
• Combine with CAPI strengths:
– Ease of programming
– Lack of device driver
– Shared memory & caching (host to accelerator communication)
• What do you get:
– Bitwise data manipulation (e.g. Deep Compression)
– Pattern recognition
– Encryption
– Monte Carlo Statistical modeling for complex predictions
– Image Analytics & Biometrics: Facial recognition, Feature detection (e.g. cancer)
– Network Packet Processing & Inspection
– Bioinformatics (e.g. Sequence alignment)
– Reverse time migration (Oil & Gas)
– Ensemble Calculations of Numerical Weather Prediction
– Machine Learning

Example: File System Acceleration with CAPI-FPGA
• Compression
– IBM Gzip offers best combination of performance and compression rate
• De-Duplication
– Signature calculation is easy to integrate with compression datapath
• Crypto
– Crypto acceleration on P8
– FPGA is also a good fit, especially if crypto algorithm is non-standard
• Content analytics for real-time tagging
– IBM CAPI/FPGA accelerated text analytics
– IBM CAPI/FPGA accelerated image analytics
• POWER8 / CAPI benefits
– Very strong memory & I/O bandwidth
– Seamless integration with CAPI shared memory interface (acc. is just like another core)
– Variety of accelerator partners through OpenPOWER (Altera, Xilinx, NVIDIA, ...)
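As a rough software analogue of the De-Duplication bullet above (signature calculation sharing the compression datapath), the following Python sketch hashes and compresses fixed-size blocks in a single pass. It is not IBM's FPGA design: the 4 KB block size, the SHA-256 signature, and the names `store_blocks` and `restore` are assumptions chosen only for illustration.

```python
import gzip
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size; the slides do not specify one


def store_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Walk the data once: sign every block, compress only blocks not seen before."""
    store = {}   # signature -> gzip-compressed block (one copy per unique block)
    recipe = []  # ordered signatures needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signature = hashlib.sha256(block).hexdigest()
        if signature not in store:
            # Compression happens in the same pass as the signature calculation.
            store[signature] = gzip.compress(block)
        recipe.append(signature)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by decompressing blocks in recipe order."""
    return b"".join(gzip.decompress(store[sig]) for sig in recipe)
```

For example, four blocks of which three are identical would leave only two compressed entries in `store` while `recipe` still lists four signatures. A hardware datapath would stream the same block through both units concurrently rather than calling them one after the other.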
IBM Accelerated GZIP Compression
What it is: An FPGA-based low-latency GZIP Compressor & Decompressor with single-thread throughput of ~2GB/s and a compression rate significantly better than low-CPU overhead compressors like snappy.

IBM Accelerated Text Processing
What it is: A compiler/runtime system for accelerating text analytics on a shared-memory CPU-FPGA
• AQL: rule language, SQL-like syntax
• Big Speedup vs. Multithread SW
[Diagram: AQL, systemT optimizer, compiled operator graph, systemT runtime (Java + FPGA), results annotations. Sample input text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source”]
To appear @: Hot Chips 2014

FPGA Image & Video Processing
[Pipeline diagram: Information Extraction (Edge Detection, Feature Extraction, Segmentation), then Object Recognition (Template Matching)]
Goal, Motivations, Approach:
• Extract relevant information from input image to enable object recognition
• Information located where pixels change color (edges, blobs)
• Intrinsic properties of objects
• Object boundaries
• Design fully-pipelined FPGA architectures for streaming application
• Real-time, low-power, onboard image processing solution
• Sobel and Canny: extract contours/edges
• SURF: extract scale & rotation-invariant features
Applications requiring edge detection & feature extraction span a wide range of domains:
• Computer/Machine Vision: Tracking, Object Recognition & Navigation
• General image proc.: Compression
• Quality Control: Unsupervised Defect Identification
• Medical Imaging: Analysis + Diagnosis & Computer Guided Surgery

Hardware Design
Theory
• 2D convolution with Gaussian Filter: blur
• 2D convolution with Gaussian 1st derivative: extract edges
• 2D convolution with Gaussian 2nd derivative: extract features
[Figure: Gaussian, 1st derivative and 2nd derivative filter plots over X and Y]
Custom Hardware Mapping: FPGA acceleration results from:
• Parallel 2D convolution: process all pixels inside filter in parallel
• Parallel 2D convolution in x, y, z direction
• Parallel 2D convolution for all filter scales
• Total of 33 filters

Results & Conclusions
Performance: OpenCL vs. VHDL performance table (column groups on the slide: OpenCL performance, VHDL performance)
Apps. | Stratix 4 Frames/sec | Stratix 4 Max freq. | Stratix 5 Frames/sec | Stratix 5 Max freq. | Stratix 5 Frames/sec | Stratix 5 Max freq.
Sobel | 475 | 170 | 909 | 300 | 870 | 300
Canny | 470 | 170 | 890 | 300 | 823 | 309
SURF | 392 | 170 | 870 | 300 | 804 | 283
Productivity: OpenCL vs. VHDL productivity table (Sobel, Canny, & SURF)
• VHDL development time: 6 months
• OpenCL development time: 1 month

IBM Accelerated Image Processing
What it is: A real-time, multi-HD-stream Harris-Laplace feature detection algorithm implemented in an FPGA
Performance: 166M pixels per second (i.e. multi-stream HD video)
To appear: IBM Journal of Research & Development

CAPI Attached Flash Optimization
– Attach TMS Flash to POWER8 via CAPI coherent Attach
– Issues Read/Write Commands from applications to eliminate 97% of code pathlength
– Saves 20-30 cores per 1M IOPs
Conventional path: Application → Read/Write Syscall → FileSystem (strategy ( ), iodone ( )) → LVM (strategy ( ), iodone ( )) → Disk and Adapter DD (Pin buffers, Translate, Map DMA, Start I/O; Interrupt, unmap, unpin, iodone scheduling)
CAPI path: Application → Posix Async I/O Style API (aio_read(), aio_write()) → User Library → Shared Memory Work Queue
20K instructions reduced to <500

Flash as Slow Memory
[Diagram labels: client, server, network, flash, Memory, CAPI, Conventional PCIe I/O, acceptable latency]

Monte-Carlo CAPI Acceleration
Running 1 million iterations: At least 250x Faster with CAPI FPGA + POWER8 core
Full execution of a Heston model pricing for a single security:
1. SOBOL sequence generator (pRNG)
2. Inverse Normal to create the non-linear distribution
3. Path-generation
4. Pay-off function
Easier to Code: Reduces C code writing by 40x compared to non-CAPI FPGA

POWER8-based Network Acceleration
Faster workloads with less infrastructure
IBM Power Systems and Mellanox® Technologies partnering to simultaneously accelerate the network and compute for NoSQL workloads.
• 10x higher throughput: Dramatically less data center infrastructure exploiting high speed networks with Remote DMA
• 10x lower latency: Dramatically faster responsiveness to customers leveraging POWER8 high throughput low latency I/O
[Diagram: data moved by RDMA between Eastern sites (New York, Boston, Washington D.C.) and a Central site (Chicago)]

GPU Acceleration Example: Espresso
• We’re only just discovering how to make this data useful
– Large global retailers collect petabytes of data
– Transactions generate tens of millions of filing cabinets of paper
– How does a retailer translate all of this data to business value?
– Group customers in segments with similar behavior
– Customize products and marketing programs
• Impossible to make this much data useful through human inspection

IBM Power Systems GPU Acceleration of Java Applications
• Now possible on today's Big Data and Java Workload Acceleration
– Use of segmentation or clustering in the retail industry
  • Look for non-obvious patterns in the sales data and react quickly
  • Analyze across tens of thousands of dimensions quickly and accurately
  • Lends itself nicely to a bit of computer science known as "k-means clustering"
– Outcome could lead to new products, revised products and advertising, launching new campaigns….wherever the data leads you….
Imagine generating 100 times more ideas for new products and campaigns. Who can get you there?
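For readers unfamiliar with the "k-means clustering" mentioned above, here is a minimal, CPU-only Python sketch of the standard assign-then-update iteration. It is not the GPU-accelerated Hadoop / Mahout code from the demo; the function name `kmeans`, the fixed iteration count, and the seeded random initialisation are assumptions for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centroid
        # (squared Euclidean distance).
        assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments
```

In a retail setting each point would be one customer's feature vector, possibly with tens of thousands of dimensions, and each resulting cluster a customer segment. The distance computations in the assignment step are independent of one another, which is what makes the algorithm a natural fit for GPU parallelism.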
GPU Espresso Demo
• IBM and NVIDIA are demonstrating segmentation using GPU accelerated machine learning for clustering using Hadoop / Mahout
– OpenPOWER initiative with NVIDIA
– First product implementing GPU acceleration for Java
• Best-in-class ingredients
– IBM POWER8: Designed for Big Data
– IBM Java
– NVIDIA CUDA GPU acceleration
– Ubuntu Little Endian Linux for POWER
• Achieving 8X performance improvement

OpenPOWER innovations benefit Clients
• Altera FPGA acceleration and IBM CAPI: Monte Carlo 250x faster than POWER8 core alone, reduced C code 40x over non-CAPI FPGA
• US Dept of Energy: $325M super computing contract awarded to IBM, Mellanox, and NVIDIA
– DoE systems for science and stockpile stewardship
– Sierra and Summit systems to be >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia
• Data Engine for NoSQL: 24:1 server consolidation, 3x lower cost per user, 40TB CAPI-attached flash
• CAPI dev kit with FPGA card from Nallatech
• NVIDIA acceleration built into IBM Power S824L
• Tyan OpenPOWER Customer Reference System
• 8x faster than x86 Ivy Bridge on pattern extraction
• 82x faster for Cognos BI and DB2 BLU
© 2014 OpenPOWER Foundation

University Research on Power8 Accelerators
• Photodynamic Therapy @ University of Toronto
• fMRI @ Western University
• Genomics @ University of Illinois Urbana-Champaign & Rice & Delft
• Seismic @ University of Texas
• Data Analytics @ North Carolina State University
• Financial Risk @ University of Florida
• The list is growing rapidly…

POWER8 Scale Out, OpenPOWER and CAPI: What is CAPI?

What’s in a name?
© 2015 IBM Corporation 65 FPGA as an Accelerator • FPGA: Field Programmable Gate Array – – – – It’s a re-programmable chip It can run fast (cycle times of 250 – 500 Mhz or more) It has Industry Standard Interfaces like PCI-E Gen3 The Major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation members PCIE gzip Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer friendly language. © 2015 IBM Corporation *RTL = Register Transfer Level **VHDL = VHSIC*** Hardware Description Language ***VHSIC = Very High Speed Integrated Circuit FPGA Encrypt Monte Carlo FPGA Library 66 When to Use FPGAs • Transistor Efficiency & Extreme Parallelism – Bit-level operations – Variable-precision floating point • Power-Performance Advantage – >2x compared to Multicore (MIC) or GPGPU – Unused LUTs are powered off • Technology Scaling better than CPU/GPU – FPGAs are not frequency or power limited yet – 3D has great potential • Dynamic reconfiguration – Flexibility for application tuning at run-time vs. compile-time • Additional advantages when FPGAs are network connected ... – allows network as well as compute specialization © 2015 IBM Corporation Why is an Accelerator Faster? PCIE FPGA Question: The POWER8 Processor runs at ~3Ghz while our FPGA runs at 250Mhz. So why would an accelerator be better? Answer: The FPGA is better for certain algorithms, such as those that are numerical intensive or have parallelism. The POWER8 processor has a finite set of instructions to implement the algorithm in SW. The FPGA is customized logic built for specific processing of an algorithm. 68 © 2015 IBM Corporation Why is an Accelerator Faster? PCIE FPGA Example 1: Numerical Intensive Algorithm Integral () Variables sin Sigma () cos x+ Sin () ∑ Cos () + Main (n,a,v,w) SW © 2015 IBM Corporation ∫ Done! Done! FPGA 69 Why is an Accelerator Faster? 
PCIE FPGA Example 2: Parallelism Monte Carlo Risk Analysis to determine probability of financial success: Given current finances, run 100 scenarios Variable distributor Engine 1 Engine 2 Engine 3 Engine 4 Engine 5 Engine 6 Engine 7 Engine8 Engine 9 Monte Variables Main (Vars) Engine 50 Variables Results Accumulator SW © 2015 IBM Corporation 10 5 50 100 FPGA 70 So what is new? Accelerators on FPGAs have been around for a long time…. So what is new? Coherency makes the accelerator a peer to the POWER8 cores © 2015 IBM Corporation 71 What was done before CAPI? Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation. Device Driver Storage Area Virt Addr Variables Variables Input Data Input Data Memory Subsystem Output Data 3 versions of the data (not coherent). 1000s of instructions in the device driver. PCIE FPGA Output Data Input Variables Data POWER8 Core DD App © 2015 IBM Corporation POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core 72 CAPI Coherency With CAPI, the FPGA shares memory with the cores Virt Addr 1 coherent version of the data. No device driver call/instructions. PCIE PSL Memory Subsystem FPGA Output Data Input Variables Data POWER8 Core App © 2015 IBM Corporation POWER8 Core POWER8 Core POWER8 Core POWER8 Core POWER8 Core 73 CAPI vs. I/O Device Driver: Data Prep Typical I/O Model Flow: Total ~13µs for data prep Copy or Pin Source Data DD Call 300 Instructions MMIO Notify Accelerator 10,000 Instructions 7.9µs Acceleration Application Dependent, but Equal to below Poll / Interrupt Completion Copy or Unpin Result Data Ret. From DD Completion 1,000 Instructions 3,000 Instructions 1,000 Instructions 4.9µs Flow with a Coherent Model: Total 0.36µs Shared Mem. Notify Accelerator 400 Instructions 0.3µs © 2015 IBM Corporation Acceleration Application Dependent, but Equal to above Shared Memory Completion 100 Instructions 0.06µs 74 CAPI Differentiation CAPI vs. 
I/O or Socket FPGA Solution IBM Innovation Customer Impact FPGA is a peer to the processor -- Caching and translations by PSL Simple Programming paradigm Higher performance Architecture allows for any kind of FPGA or even an ASIC Flexible solutions Connection to Flash, FC, EN…. Virtualization in the Architecture Applications can share Accelerator I/O Paradigm © 2015 IBM Corporation CAPI Paradigm POWER8 Processor Let’s take a closer look at how IBM Engineers made CAPI work Technology • Cores Core L3 L3 Bank Bank Bank Chip Interconnect Bank Bank L2 L2 Core Core L2 Core L2 Core Core L2 L2 L3 L3 L3 Bank Bank Bank L3 L3 L3 Bank Bank Bank L2 L2 Core Core L2 SMP Bank Core Chip Interconnect PCIe L3 SMP L3 CAPI L3 • 512 KB SRAM L2 / core • 96 MB eDRAM shared L3 Core Memory Bus L3 CAPI L2 SMP Interconnect L2 Caches Memory SMP Interconnect L2 SMP Core PCIe • Crypto and memory expansion • Transactional memory • VMM assist • Data move/VM mobility Core SMP Accelerators POWER8 Scale-Out Dual Chip Module Memory Bus • 12 cores (SMT8) • 8 dispatch, 10 issue, 16 execution pipes • 2x internal data flows/queues • Enhanced prefetching • 64 KB data cache, 32 KB instruction cache 22 nm SOI, eDRAM, 15 ML 650 mm2 • Up to 230 GB/s sustained bandwidth Bus Interfaces • Durable open memory attach interface • Integrated PCIe Gen3 • SMP interconnect • CAPI Energy Management • © 2015 IBM Corporation On-chip power management microcontroller 76 How CAPI Works CAPI Developer Kit Card Acceleration Portion: Data or Compute Intensive, Storage or External I/O Algorithm Algo rithm PCIe Sharing the same memory space Accelerator is a peer to POWER8 Core Application Portion: Data Set-up, Control POWER8 Processor © 2015 IBM Corporation CAPI technology connections FPGA AFU IBM Supplied PSL • Proprietary hardware to enable coherent acceleration • Operating system enablement – Ubuntu LE – Libcxl function calls • Customer application and accelerator PCIe • Application sets up data and calls the accelerator 
functional unit (AFU)
• AFU reads and writes coherent data across the PCIe and communicates with the application
  – PSL cache holds coherent data for quick AFU access
[Figure labels: CAPP, Memory (Coherent), POWER8 Core, OS, App, POWER8 Processor.]

CAPI solution flow
[Columns: App, OS, IBM Supplied PSL, AFU.]
1. Connect to accelerator: the App opens the device (cxl_afu_open_dev). The AFU is reserved for work.
2. The App sets the Work Element Descriptor (WED) at AddrX; the WED may contain addresses of other data structures. The AFU understands the WED content and any other addressed data structures.
3. Start accelerator: the App attaches the device (cxl_afu_attach). The AFU is reset, PSL_WED_Ax is set to AddrX, and AFU_CNTL_An[E] is set. On the CTL interface, jea gets AddrX and jcom gets start. The AFU fetches AddrX (the WED) and starts operation.
4. The AFU continues to work using the CMD, Buffer and Resp interfaces.
5. If required, the App can read or write AFU registers through the MMIO interface.
6. The AFU finishes (mechanism is user defined): on the CTL interface it de-asserts RUNNING and asserts DONE. The App knows the AFU is finished (mechanism is user defined) and can start again from the top or free the AFU (free device: cxl_afu_free).

POWER8 with CAPI Cards
[Figure labels: Front View, POWER8 Modules; Side View, CAPI Dev Kit Cards.]

Basic concepts of CAPI
CAPI vs.
CAPI Solutions

Platform for Innovation
• CAPI is a platform to enable acceleration
• CAPI provides an infrastructure to improve performance of an application through FPGA acceleration
  – Enables customer-defined acceleration within the processor complex
• CAPI allows implementation of a wide range of accelerators to optimally address many different customer challenges
  – Each implementation is a unique CAPI Solution

Specific Customer Solution
• A CAPI Solution is a specific implementation of an algorithm that uses an FPGA + application
• A CAPI Solution requires logic designers and programmers to implement the solution
• CAPI Solution Examples:
  – Flash Appliance (IBM Data Engine for NoSQL)
  – MonteCarlo Algorithm

Why Accelerate on CAPI?
• Reasons to consider CAPI Acceleration
  – Higher Performance
    • If your customer has a complex application running on a core, consider CAPI for better performance
    • If your customer already does I/O attached FPGA acceleration, CAPI will simplify their software and provide better performance
  – Lower IT Costs
    • By moving workload to CAPI, your customer will need fewer cores
    • In some cases, such as the IBM Data Engine for NoSQL, CAPI can do the same work with far less infrastructure
  – Lower Power
    • Running acceleration on an FPGA can result in lower power consumption vs. running the application as software on a core
Note: When considering CAPI for a particular solution, we compare it to:
1. The same solution running as software, OR
2. The same solution running on an IO attached FPGA

CAPI ecosystem partners and consumers

CAPI-APPS For Clients
Have a client who wants their IBM Application to be accelerated on CAPI? (ex: DB2, CPLEX, Streams) Contact: Jonathan Dement ([email protected])

IBM CAPI Solutions
Have a client or partner who wants to create a CAPI-App and sell it to others?
Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile ([email protected]) about the opportunity.

Partner Solutions
IBM Data Engine for NoSQL

Clients with their Own Proprietary Solutions
Have a client or partner who wants to create a proprietary CAPI Solution? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile ([email protected]).

Why tell Bruce Wile about the opportunity? Depending on the size of the opportunity, we will engage the CAPI Customer Enablement Team.

Two Paths into CAPI
[Figure labels: CAPI, CAPI Developer Kit, CAPI Market Solutions, CAPI App Solutions.]
• Clients create their own, proprietary business solution.
• IBM & Partners create business solutions for the CAPI Market. Clients buy pre-packaged solutions from the CAPI Market.

CAPI Solutions
[Figure label: CAPI App Solutions.]

Open Development Driving CAPI Solutions
• Implementation / HPC / Research
• System / Software / Integration
• I/O / Storage / Acceleration
• Boards / Systems
• Chip / SOC
© 2014 OpenPOWER Foundation. Complete member list at www.openpowerfoundation.org

Potential Markets for CAPI Solutions
Market segments named on the slide:
• Big Data / Database / Compute
• Weather
• Social / Media
• Medicine
• Finance / Insurance
• Visual / Biometric Analysis
• Oil & Gas
• Retail
• Security
• Manufacturing / EDA
Example workloads named on the slide:
• Edge of Network; JPEG & Video processing
• Network Packet Processing
• Database Acceleration/KVS
• Machine Learning
• Bitwise Data Manipulation
• Compression/Encryption
• Ensemble Calculations of Numerical Weather Prediction
• Radiation Therapy
• Pharmaceuticals
• Public Health
• Image Analysis
• Genomics
• Reverse Time Migration
• Database Acceleration & Fast Storage
• Data Analytics
• Pattern Recognition
• Risk Analysis
• Monte Carlo
• Pattern Analysis
• Facial Recognition
• Deep Computation and Critical Runtime Jobs
• Fluid Dynamics
• 3D Modeling
• CAD
• Pipeline Analysis & Flow
• Specialized Algorithms

CAPI Availability
• See: http://www.ibm.com/support/customercare/sas/f/capi/home.html
• CAPI Developer Kit
  – Procure through Nallatech
  – For customers considering creating their own CAPI Solution
  – CAPI Decision and Process Guide
  – Requires POWER8 Server
  – Available now
  – See www.nallatech.com/capi
• First CAPI Solution: IBM Data Engine for NoSQL
  – Procure through IBM
  – GA in early 2015

CAPI Developer Kit

CAPI Developer Kit – FPGA Card
• 2 Banks of SDRAM
• Dual 10G SFP+
• Altera Stratix V FPGA
• PCI-E Gen3
• Complete Datasheet

CAPI Developer Kit
• IBM POWER8™ Server
• http://www.ibm.com/support/customercare/sas/f/capi/home.html

© Copyright International Business Machines Corporation 2015

Printed in the United States of America September 2015

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks licensed by Power.org in the United States and/or other countries: Power ISA. Information on the list of U.S. trademarks licensed by Power.org may be found at www.power.org/about/brand-center/.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice.
The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

Note: This document contains information on products in the design, sampling and/or initial production phases of development. This information is subject to change without notice. Verify with your IBM field applications engineer that you have the latest version of this document before finalizing a design.

You may use this documentation solely for developing technology products compatible with Power Architecture®. You may not modify or distribute this documentation. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Systems and Technology Group
2070 Route 52, Bldg. 330
Hopewell Junction, NY 12533-6351

The IBM home page can be found at ibm.com®.

Version 1.0, 29 September 2014, IBM Confidential
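
Worked check of the "CAPI vs. I/O Device Driver: Data Prep" figures

As a cross-check of the totals quoted on the "CAPI vs. I/O Device Driver: Data Prep" slide, the following is a minimal Python sketch that sums the per-phase times stated there and compares the two flows. The variable and key names are illustrative assumptions, not taken from the slides; the only inputs are the four times the slide gives (7.9µs and 4.9µs for the I/O model, 0.3µs and 0.06µs for the coherent model). The acceleration time itself is left out because the slide describes it as equal in both flows.

```python
# Data-prep overhead quoted on the slide, in microseconds.
# Key names are illustrative; only the four numbers come from the slide.
io_model_us = {"before_acceleration": 7.9, "after_acceleration": 4.9}
coherent_model_us = {"notify_accelerator": 0.3, "completion": 0.06}

io_total_us = sum(io_model_us.values())
coherent_total_us = sum(coherent_model_us.values())

# The acceleration time is the same in both flows, so only the overhead is compared.
overhead_ratio = io_total_us / coherent_total_us

print(f"I/O model data prep:      {io_total_us:.2f} us")
print(f"Coherent model data prep: {coherent_total_us:.2f} us")
print(f"Overhead ratio:           {overhead_ratio:.1f}x")
```

The I/O model phases sum to 12.8µs, which the slide rounds to ~13µs, and the coherent phases sum to the 0.36µs the slide states. On these numbers the data-prep overhead of the I/O model is roughly 36 times that of the coherent model; this says nothing about end-to-end speedup, which depends on the acceleration time.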