ELASTIC CLOUD STORAGE (ECS) V2.2 PERFORMANCE
An Analysis of Single-site Performance with Sizing Guidelines
ABSTRACT
This white paper examines object performance of the latest Dell EMC Elastic Cloud
Storage software. Performance results are from testing done on an ECS Appliance
model U3000 using the S3 protocol. Included are recommendations for sizing an ECS
appliance for performance.
September 2016
WHITE PAPER
The information in this publication is provided “as is.” Dell EMC Corporation makes no representations or warranties of any kind with
respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular
purpose.
Use, copying, and distribution of any Dell EMC software described in this publication requires an applicable software license.
Dell, Dell EMC, EMC, and the EMC logo are registered trademarks or trademarks of Dell EMC Corporation in the United States and other
countries. All other trademarks used herein are the property of their respective owners. © Copyright 2016 EMC Corporation. All rights
reserved. Published in the USA. 01/16, H12545.1
EMC believes the information in this document is accurate as of its publication date. The information is subject to change without
notice.
EMC is now part of the Dell group of companies.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
  AUDIENCE
  SCOPE
ECS ARCHITECTURE
TEST METHODOLOGY AND ENVIRONMENT
  Test Framework
  S3 Protocol
  Numbers of Objects
  Clients
  ECS Appliance
  Connectivity
RESULTS
  Data Presentation
  Data Boundaries
  Metrics
  Read Results
  Write Results
SIZING
  General Guidelines
  Capacity
  Examples
  Workload
    Object Size
    Thread Count
  Examples
  Mixed Workloads
CONCLUSION
EXECUTIVE SUMMARY
Traditional SAN and NAS storage platforms, while critical for enterprise applications, were never designed for modern cloud
applications and the demands of cloud. The unabated growth in unstructured content is driving the need for a simpler storage
architecture that can efficiently manage billions and trillions of files and accelerate the development of cloud, mobile and big data
applications while reducing escalating overhead and storage costs. In order to fulfill these requirements, IT organizations and service
providers have begun to evaluate and utilize low cost, commodity and open-source infrastructures. These commodity components and
standard open technologies reduce storage costs but the individual components provide lower performance and reliability – and also
require the operational expertise to ensure serviceability of the infrastructure.
Dell EMC® Elastic Cloud Storage (ECS™) provides a complete software-defined object storage platform designed for today’s scale-out
cloud storage requirements. ECS offers all the cost advantages of commodity infrastructure with the enterprise-grade reliability and
serviceability of an integrated Dell EMC-supported solution. ECS benefits include:

• Cloud-scale Economics: 65% TCO savings versus public cloud services
• Simplicity & Scale: Single global namespace, unlimited apps, users and files
• Multi-site Efficiency: Low storage overhead across geo-distributed locations
• Faster App Development: API accessible storage accelerates cloud apps & analytics
• Turnkey Cloud: Multi-tenancy, self-service access and metering capabilities
The goal of this paper is to provide a high-level overview of ECS performance and an analysis of performance data to allow consumers to understand and set expectations for ECS system performance. It is also intended to be a guide on how to properly size an ECS deployment.
AUDIENCE
This white paper is intended to support the field and customers in understanding ECS performance and sizing considerations.
SCOPE
ECS version 2.2 introduced another data service for file access using NFS version 3; however, performance testing using file-based protocols, including the ECS Hadoop Distributed File System (ECS HDFS) protocol, is outside of the scope of this document.
Performance of multi-site configurations is also outside of the scope of this document. We intend to revisit these topics in future
versions of this paper.
ECS Architecture
ECS is available for purchase in two forms - bundled as an appliance or as software only for deployment on customer-owned
commodity hardware. ECS is built using a layered architecture as seen in Figure 1 below. Each layer is completely abstracted and
independently scalable to allow for high availability, no single points of failure and limitless scaling capabilities. The software contains
four layers which are deployed using Docker containers. The four layers are data services, provisioning, storage engine and fabric.
These layers sit on top of the underlying operating system which is by default SUSE Linux Enterprise Server (SLES) 12.
The ECS portal and provisioning layer provides a Web-based GUI for management, licensing, provisioning, and reporting of ECS
nodes. There is also a dashboard, which provides overall system level health and performance, and alert management. System
administrators interact with ECS using this layer.
The data services layer provides the object and file protocol access to the system. It is responsible for receiving and responding to client requests and for extracting and bundling information to and from the shared common storage engine. This is the layer where clients
interact with the system over supported protocols.
The storage engine is responsible for accessing and protecting data. It is the core of ECS and contains the main components that are
responsible for processing the requests, storing and retrieval of data, data protection, and replication. The commodity disks managed
by this layer are mounted using the XFS filesystem. These filesystems are filled with 10GB files which act as containers. Both
metadata and data are stored in chunks of 128MB inside the container files. Metadata is triple mirrored and objects less than or equal
to 128 MB are triple mirrored initially then converted to erasure-coded (EC) form later. Objects greater than 128MB are erasure-coded
"in line" as they are written to disk. Erasure-coding is done using Reed-Solomon coding.
The Fabric layer provides clustering, system health, software management, configuration management, upgrade capabilities and
alerting. It is responsible for keeping the services running and managing the resources such as disks, containers, firewall and network.
It tracks and reacts to environment changes such as failure detection and provides alerts relating to the health of the system. This layer
interacts directly with the operating system.
For additional information, please review ECS Overview and Architecture white paper at http://www.emc.com/collateral/whitepapers/h14071-ecs-architectural-guide-wp.pdf.
Figure 1 ECS Architecture
Test Methodology and Environment
TEST FRAMEWORK
A currently internal-only Dell EMC tool named Mongoose was built to create S3 workflows and was used to generate the workloads for the tests reported in this paper. It is similar to other tools used to measure object performance, such as Cloud Object Storage Benchmark (COSBench) and The Grinder, and there are plans to release the tool to the public sometime in 2016. Mongoose allows Dell EMC test engineers to perform S3 operations via HTTP such as CREATE, READ, UPDATE, and DELETE with varied object sizes and thread counts, the two primary influences on system performance.
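Mongoose itself is not publicly available, but the general shape of such a workload generator can be illustrated. The following is a minimal sketch, not Mongoose and not tuned for benchmarking; the endpoint, port, bucket name, credentials, object size, and thread count are placeholder assumptions, and the boto3 library is used for the S3 calls.

# Minimal S3 CRUD load-loop sketch (illustrative only; all values are placeholders).
import concurrent.futures
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ecs-node.example.com:9020",   # assumed ECS S3 data endpoint
    aws_access_key_id="OBJECT_USER",                    # placeholder object user
    aws_secret_access_key="SECRET_KEY",                 # placeholder secret key
)
BUCKET = "perf-test"            # assumed pre-created bucket
OBJECT_SIZE = 10 * 1024         # 10KB objects
THREADS = 64                    # concurrency under test
OBJECTS_PER_THREAD = 100
payload = b"x" * OBJECT_SIZE

def worker(thread_id: int) -> None:
    for i in range(OBJECTS_PER_THREAD):
        key = f"t{thread_id}/obj{i}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)      # CREATE
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()     # READ
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)      # UPDATE (overwrite)
        s3.delete_object(Bucket=BUCKET, Key=key)                  # DELETE

with concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
    list(pool.map(worker, range(THREADS)))

Varying OBJECT_SIZE and THREADS in a sketch like this is the same dimension of control Mongoose exercises in the tests described below.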
S3 PROTOCOL
Performance of object storage platforms differs from traditional architectures such as NAS and SAN in many ways. Because of this, how we execute performance tests and characterize their results must also differ from traditional approaches. Object storage platforms are generally IP-addressable storage accessed via a REST API, and ECS natively supports Amazon's S3, OpenStack Swift, and Dell EMC's Atmos object protocols. The Content-addressable Storage (CAS) object protocol is also supported by ECS; however, its core behavior differs enough from the previous three to require separate analysis in a future paper.
The three REST APIs are all similar in overall approach and capabilities, but there are some differences in semantics, function and
limitations. None of those differences are material to ECS performance, so we focus on the S3 API in most of our testing and in this
paper. All three APIs are implemented as components that conceptually sit on top of a common object engine. Due to that common
data path, differing only in the API-specific layer, we expect all three APIs to perform similarly when exercising like functionality. This is
validated with our testing. With this confirmed similarity of performance between the three APIs, when using this document as a
performance reference, one can use data for any of the APIs to represent the system performance generically, independent of which of
the three APIs is in use. Note that this similarity in performance holds only when the APIs are being used in ways that exercise the same underlying functionality. The APIs have optional functionality that, when used, can cause performance differences.
NUMBERS OF OBJECTS
Currently the response time on an empty ECS system is much lower than on one fully populated with objects. For example, on an empty system 7,000 TPS may be seen, but on a filled system we may only see around 3,000 TPS. This is still quite fast, however, and equates to 10.8 million transactions (objects) per hour. What this means is users will initially see growing response times as the number of objects increases over time on an ECS system. The low initial response time occurs because object metadata (index) initially fits entirely in memory, which allows requests to be served quickly. A BPlusTree stores the index on disk and is initially shallow enough that even when disk access is required for a lookup, no more than one or two disk operations are required. As the number of objects increases over time, the number of calls to disk to service requests increases as well. Our testing has shown that the initial decline in performance does plateau. It is because of this behavior that we test on systems that are filled up to the point where performance has plateaued, so we can report realistic numbers rather than the highest seen on empty systems. We do this by performing tests under a single namespace inside a single bucket that is prepopulated with a few hundred thousand objects.
CLIENTS
A one-to-one ratio of clients to ECS nodes is sufficient for our ECS test system to reach maximum performance. This is because the
Mongoose application is able to perform tests using a varied number of threads. Each of the eight clients has the following specs:

• Intel server with dual, four-core Intel 2.5 GHz Ivy Bridge (E5-2609) processors
• 64GB RAM
• One configured 10GbE port
• SLES 12 with 3.12.39-47-default kernel
ECS APPLIANCE
The ECS appliance models consist of varying quantities of the same nodes, disks, and disk enclosures; what distinguishes them is the number of nodes and disk enclosures and the number of disks they contain. Details on the breakdown of hardware for each of the
available ECS appliance models can be found in the ECS Specification Sheet here: http://www.emc.com/collateral/specificationsheet/h13117-emc-ecs-appliance-ss.pdf.
The ECS U3000 appliance has the following hardware.

• Two sets of four Phoenix blades, each set in a shared 2U chassis and each blade with dual, four-core Intel 2.5 GHz Ivy Bridge (E5-2609) processors, 64GB RAM, a single DOM for OS boot, a single 6 Gbit/s SAS interface (for Disk Array Enclosure connectivity) and two 10GbE ports
• Two 10GbE top-of-rack (TOR) switches which uplink to the customer network and are used both for internal node-to-node and external node-to-client communication
• One 1GbE switch used for internal and external out-of-band management
• Eight Disk Array Enclosures (DAE) each containing sixty 6TB SATA drives. Each DAE is connected to one of the nodes in the system via a single 6 Gbit/s SAS connection. That is, each of the eight nodes is directly connected to one DAE.
Performance testing was done on an ECS appliance model U3000 installed with ECS software version 2.2. The object space was
configured using a single virtual datacenter, storage pool, replication group, namespace and an S3 bucket created without encryption or
file capabilities enabled. Also note we do plan to publish similar data using generation 2 hardware in the first half of 2016.
CONNECTIVITY
Each of the eight clients used in the tests was connected directly to a 10GbE TOR data switch in the ECS appliance rack. Only one of the two TOR switches is required and used during testing. The added complexity of using a load balancer can be avoided since Mongoose is able to spread the workload evenly across all nodes. A load balancer that is properly configured and sized should not add delay to the transactions, so by directly connecting clients to one of the appliance's TOR switches the tests are kept focused on the transactions and not on 3rd-party networking components. Figure 2 shows a simplified view of the basic components and connectivity.
Figure 2 Lab Connectivity
Results
DATA PRESENTATION
Charts in this document are in a form designed to maximize the information density. We refer to charts of both transactional rate
throughput and bandwidth as "butterfly" charts because of their resemblance to spread wings. An annotated example is presented in
Figure 3 below.
Data for the charts below in Figures 4 and 5 come from a test we have named the max test, because it is most often used as a test of
cluster peak performance. In this test, we write a series of homogeneous workloads, and each workload is a datapoint in the curve on
these charts.
There is a defined sequence of object sizes in our max test, small to large, with enough granularity in the sizes to show the shape of the
curve. We start with writes, and first write the smallest object size with a fixed thread count, long enough (generally around five
minutes) to determine the steady state throughput for that thread count and object size. A brief delay is inserted between object sizes
so we can clearly delineate where workflows start and stop. Then the next object size is written, and so on, determining the throughput
for each object size. After the write workloads are complete we begin reading objects starting again at the smallest size first and
moving to larger and larger sizes. Since transactions for a range of small objects all perform the same, for that same object range we
will use the same thread count. But as larger and larger objects are tested, the thread count is periodically reduced. The appropriate
thread counts are determined empirically, and are chosen to be large enough to achieve peak performance, but not unnecessarily large
beyond that, since that would just grow queuing delays in the system. The tests in this paper used a range of threads from 2,560 for the smallest objects down to 320 threads for the largest objects.
Note that the two curves in each of Figures 3-5 represent the same sequence of data points. They differ only in their units. The curves that generally start in the upper left and trend downward to the right represent the objects per second, and are measured on the left y-axis. To obtain the other curves, the objects/second data points are multiplied by the size of the objects, resulting in lines representing the bandwidth, generally in MB/second. Those curves start at the origin and generally trend to the upper right, and are read on the right y-axis.
Data was formed from running the max test a sufficient number of times to ensure consistent results were obtained.
DATA BOUNDARIES
There are two components to each transaction on the system: metadata operations and data movement. Metadata
operations include both reading and writing or updating the metadata associated with an object and we refer to these operations as
transactional overhead. Every transaction on the system will have some form of metadata overhead and the size of an object along
with the transaction type determine the associated level of resource utilization. Data movement is required for each transaction also
and refers to the receipt or delivery of client data. We refer to this component in terms of throughput or bandwidth, generally in
megabytes per second (MB/s).
For some range of small object sizes we have found that the transactional overhead is the dominant fraction of the overall response time and the time to move the data payload is insignificant. We see a small amount of curvature in performance charts in this area, but not much. This is shown in Figure 3 as the small-file constant-TPS region. We quote the low end of that range as the boundary of performance rather than using the highest numbers seen, or the peak. The ceiling we use is 10% less than peak, or 90%. The lower dashed line on the left side of Figure 3, labelled 'simplified small file TPS', shows the point we actually use to report as the peak TPS.
Similarly, for some range of large object sizes we have found an area where bandwidth is similarly constant, so we quote the low end of that range as the boundary there also. Again, the peak ceiling we use is 10% less than peak, or 90%.
Figure 3 Butterfly Chart
The red vertical lines in Figure 3 represent the small and large file boundaries used in testing. Outside of that range of object sizes there is not much curvature to the lines in Figure 3, again because the TPS or throughputs are so similar in those areas. The results outside of these boundaries are close because the impact of one of the two components is drastically more significant as compared to the other. For small objects the transactional overhead is similar and is the largest, or limiting, factor to performance. Since every transaction has a metadata component, and the metadata operations have a minimum amount of time to complete, for objects of similar sizes that component, the transactional overhead, will be similar. On the other end of the spectrum, for large objects, the transactional overhead is minimal compared to the amount of time required to physically move the data between the client and the ECS system. For larger objects the major factor in response time is the throughput, which is why at certain large object sizes the throughput potential is similar.
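As an illustration of how such a boundary could be read off max-test results, the short sketch below (using made-up datapoints rather than the measured values) finds the object sizes whose TPS is within 10% of the peak and quotes the conservative low end of that plateau:

# Hypothetical max-test datapoints: object size in bytes -> steady-state TPS.
points = {10_000: 3100, 100_000: 3000, 1_000_000: 2200, 10_000_000: 600}
peak_tps = max(points.values())
# Keep only the sizes whose TPS is within 10% of the peak (the constant-TPS plateau).
plateau_sizes = [size for size, tps in points.items() if tps >= 0.9 * peak_tps]
small_file_boundary = max(plateau_sizes)                  # largest size still on the plateau
quoted_tps = min(points[s] for s in plateau_sizes)        # conservative figure to report
print(small_file_boundary, quoted_tps)                    # -> 100000, 3000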
METRICS
Although many people ask for the number of input/output operations per second (IOPS) that are possible on an ECS system, there is
no meaningful way for us to provide that in comparison to other storage systems. This is because object storage systems are built on
REST APIs that do not define discrete I/O operations at the same level as other filesystems. For example a storage system that
implements a filesystem following POSIX API semantics has a usage model that retains state between I/O operations to the same file. This retention of state makes possible a sequence of logical I/O operations that can be counted discretely. Conversely, when using a REST API to object storage, the same file would be read with a single API call. It is this difference in operations and how they are counted that makes it impossible to use a common metric.
It is because of the two components mentioned above, transactional overhead and data throughput, that in object storage systems
small object performance must be quoted in transactions per second (TPS) and large object performance quoted in data throughput.
For transactions with object sizes in the small object region, the performance for a given level of concurrency will be nearly constant.
Whether the transactions are 1KB or 10KB, the throughput in objects per second will be nearly the same. However, if we provide an
estimated throughput in units of bandwidth (MB/s or KB/s) based on a stated object size of 10KB, and it is revealed later that the actual
object size is 1KB, then the estimated throughput will be 10X the actual throughput. This error is avoided by consistent use of TPS or
objects/second when referring to small object throughput.
For large objects, data throughput in bandwidth, generally in MB/s, is used to quote performance, as in this case moving the data accounts for the most significant portion of the transaction.
It is clear the charts below follow the same butterfly pattern as referenced above in Figure 3. This will always hold true when varying the object size from small to large due to the nature of the transactions. On the left-hand side of each chart we see the effect of transactional overhead for small objects, and on the right-hand side we see the effect of throughput for large objects. To put this into an equation, response time, or total time to complete a transaction, is the result of the time to perform the transaction's required metadata operations, or the transaction's transactional overhead, plus the time it takes to move the data between the client and ECS, the transaction's data throughput.
Total response time = transactional overhead time + data movement time
Data movement time has much less of an impact on smaller objects as compared to transactional time, and the opposite is true for large objects. The larger the object, the more data movement time dominates the transactional time. Hence bandwidth becomes a more important metric to measure large object performance.
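A small numeric sketch makes the two regimes of this equation concrete. The 5 ms overhead and 100 MB/s per-thread movement rate below are invented for illustration only; they are not measured ECS values.

# total response time = transactional overhead time + data movement time
def response_time_s(object_mb: float, overhead_s: float, thread_mbps: float) -> float:
    return overhead_s + object_mb / thread_mbps

# Assume a hypothetical 5 ms metadata overhead and 100 MB/s of per-thread data movement.
for size_mb in (0.01, 0.1, 100, 200):                    # 10KB, 100KB, 100MB, 200MB
    t = response_time_s(size_mb, 0.005, 100)
    print(f"{size_mb:>7} MB: {1/t:7.1f} objects/s per thread, {size_mb/t:6.1f} MB/s per thread")
# Small objects: overhead dominates, so objects/s is nearly flat and TPS is the useful metric.
# Large objects: data movement dominates, so MB/s is nearly flat and bandwidth is the useful metric.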
READ RESULTS
Figure 4 Read Performance: U3000 Read Performance on ECS v2.2. Left y-axis: transactions per second; right y-axis: bandwidth in MB/s; x-axis: object sizes with the thread counts used for each (2,560 threads for 10KB and 100KB, 1,280 for 1MB, 640 for 10MB, and 320 for 100MB and 200MB).
Figure 4 above shows the read results in transactions per second and MB/s for the U3000 using varying object sizes. The horizontal axis also lists the number of threads used to generate the load for each object size.
The maximum TPS seen on the read chart is 3,129. For large objects, 3,959 MB/s is the highest rate seen. The rates for both 100MB and 200MB objects are very close, which points to the effect of data throughput at this larger size. As object size grows beyond 200MB, a similar throughput rate of close to 4,000 MB/s can be expected. Using 320 threads at these larger object sizes, the ECS system is fully saturated.
WRITE RESULTS
Figure 5 Write Performance: U3000 Write Performance on ECS v2.2. Left y-axis: transactions per second; right y-axis: bandwidth in MB/s; x-axis: object sizes with the thread counts used for each (2,560 threads for 10KB and 100KB, 1,280 for 1MB, 640 for 10MB, and 320 for 100MB and 200MB).
Figure 5 above shows the write results in transactions per second and MB/s for the U3000 using varying object sizes. The horizontal axis also lists the number of threads used to generate the load for each object size.
Write performance for a sustained workload of 10KB object writes, using 2,560 threads, is close to 2,000 TPS. Again, the thread count was chosen for the test at small object sizes because at that quantity of threads the system is saturated. The slope of the TPS line is steeper than with reads above, which indicates that the data payload is becoming more impactful, most likely due to triple mirroring, since reads only need to retrieve a single copy.
Objects of 200MB and greater, in this data, perform at a throughput rate of 1,850 MB/s. This is the maximum throughput seen for
sustained large object write workloads on the U3000 appliance and what can be expected as object sizes continue to grow. The
throughput increase from 100MB to 200MB is due to an optimization employed for objects greater than 128MB, where we erasure code
the object during ingest, rather than writing three copies and converting to EC later.
Sizing
GENERAL GUIDELINES
Sizing an ECS deployment purely for capacity determines the minimum number of disks needed to meet a raw storage requirement. From the number of disks required, either the minimum number of DAE required, if planning for a single site, or the minimum number of drives required per site can be determined. Additional considerations must be made to account for configurations using either four or eight DAE per appliance along with the number of drives each DAE contains.
When sizing for workload, the result is a minimum required number of ECS nodes. The number of nodes required is the end result of workload sizing calculations because it is the ECS system itself, tested as a whole, that is being sized, and not specifically the amount of network, CPU, memory or disk resources. As with DAE, appliances are sold with four or eight nodes, so the number of nodes required to support a workload may need to be rounded up to the next multiple of four.
As mentioned throughout this paper, object size and thread count, along with transaction type, are the primary factors which dictate system performance. Because of this, in order to size for workload, object sizes and their transaction types, along with related application thread counts, should be cataloged for all workloads that will utilize the ECS system.
CAPACITY
Sizing for capacity requires an understanding of how ECS stores and protects data. For complete details of how ECS stores and
replicates data and related storage overhead see the Overview and Architectural guide referenced above. Table 1 contains storage
overhead calculations.
Table 1 STORAGE OVERHEAD

NUMBER OF SITES   DEFAULT EC (12+4)   COLD ARCHIVE EC (10+2)
1                 1.33                1.2
2                 2.67                2.4
3                 2.00                1.8
4                 1.77                1.6
5                 1.67                1.5
6                 1.60                1.44
7                 1.55                1.40
8                 1.52                1.37
ECS storage overhead varies based on the number of sites in use in a replication group and the EC option configured. Overhead ranges from as low as 1.2X for a single site configured with 10+2 cold archive EC, up to 2.67X for two sites configured with the default 12+4 EC. Although the maximum overhead is 2.67X, this number decreases significantly as the number of sites in a replication group is increased (when data is written at all of the sites). As data is written its overhead is initially 3X, but once a chunk is filled to 128MB it is sealed to prevent further writes; at this point the data is erasure-coded and two of the three mirrored copies are discarded.
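The factors in Table 1 can be reproduced, to within rounding, from the EC scheme and the site count. The helper below is a sketch of that arithmetic, not an ECS tool; it assumes the overhead is the EC expansion (data plus coding fragments over data fragments) multiplied by N/(N-1) when N ≥ 2 sites hold the data.

def storage_overhead(num_sites: int, data_frags: int, coding_frags: int) -> float:
    # EC expansion: (12+4)/12 = 1.33 for the default scheme, (10+2)/10 = 1.2 for cold archive.
    ec = (data_frags + coding_frags) / data_frags
    # Geo factor: a single site keeps one protected copy; with N >= 2 sites the
    # effective footprint works out to N/(N-1) times the single-site footprint.
    geo = 1.0 if num_sites == 1 else num_sites / (num_sites - 1)
    return ec * geo

for sites in range(1, 9):
    print(sites, round(storage_overhead(sites, 12, 4), 2), round(storage_overhead(sites, 10, 2), 2))
# The output tracks Table 1; small differences (e.g. 1.78 vs 1.77) are rounding in the published table.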
To determine the minimum number of drives needed to meet a raw data requirement, the raw storage requirement is multiplied by the storage overhead from Table 1 above. Because disks are installed inside DAE in equal proportions, the minimum number of drives required must be divided by the number of drives that will be populated in each DAE. The minimum number of drives per DAE in a generation 1 (G1) system is fifteen and in a generation 2 (G2) system is ten, so either of those numbers can be used, depending upon the generation of the appliance, to determine the minimum number of DAE required to meet the capacity requirement. Similarly, if it is known that the DAEs will be fully populated then the total minimum number of drives can be divided by sixty. The total minimum number of drives can also be divided by forty-five, thirty, twenty, fifteen, or ten if it is known how the DAE will be populated.
After the minimum number of DAE has been determined, rounding up to a multiple of four is required. That is because DAE are deployed in appliances in allotments of either four or eight only. After the number of required DAE has been determined, the available appliance options can be reviewed to determine what is most suitable for the deployment.
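The procedure above can be written as a short calculation. The sketch below assumes decimal units (petabytes to terabytes), a Table 1 overhead factor, and a chosen drives-per-DAE population; the function name and structure are illustrative only.

import math

def capacity_sizing(raw_data_pb: float, overhead: float, drive_tb: float, drives_per_dae: int):
    raw_storage_tb = raw_data_pb * overhead * 1000       # data requirement * protection overhead, PB -> TB
    drives = math.ceil(raw_storage_tb / drive_tb)         # minimum number of disks
    daes = math.ceil(drives / drives_per_dae)              # minimum number of DAE at that population
    daes_rounded = math.ceil(daes / 4) * 4                 # DAE are deployed in groups of four or eight
    return drives, daes, daes_rounded

# Worked example from the text: 2 PB of data, single site, 12+4 EC (1.33x),
# G1 hardware with 6TB drives and fifteen drives per DAE.
print(capacity_sizing(2, 1.33, 6, 15))   # -> (444, 30, 32)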
EXAMPLES
To meet a requirement for 2 petabytes of raw data at a single site, 2.66 petabytes (2 petabytes * 1.33 overhead) of storage is required if 12+4 EC is used, and 2.4 petabytes (2 petabytes * 1.2 overhead) is needed for 10+2 EC. Additionally, a conversion of the final number from base10 to base2, or petabytes to pebibytes, can be performed if desired. Specifically sizing for a G1 system, which uses 6 terabyte drives and 12+4 EC, 444 disks are required. The 444 disks are determined by dividing the total capacity needed, 2.66PB, by the capacity of the G1 drives, 6TB. Disks can be populated in DAE in batches of fifteen in G1 hardware, so the maximum number of DAE required, determined by using the minimum of fifteen disks per DAE, is thirty. This is derived by dividing the required number of disks, 444, by 15, the minimum number of drives per DAE. Since DAE are populated in appliances in multiples of four or eight, the maximum number of DAE required is thirty-two, or two more than the minimum of thirty, to accommodate the requirement that DAE be deployed in blocks of four. To meet the 2 petabyte requirement a variety of hardware appliances can be chosen, ranging from eight U300 (four nodes/four DAE each with 15 disks) to a single U3000 (eight nodes/eight DAE each with 60 drives). For sizing for capacity only, the number of ECS nodes used isn't important, just the disk space required.
The difference in sizing for capacity for G2 hardware is that 8TB disks are used rather than 6TB disks, and they can be deployed in the DAE in lots of 10, 20, 30, 45, or 60 per DAE, as opposed to 15, 30, 45, or 60 as with G1 appliances.
For an ECS three-site cold storage deployment with a 1.5PB raw storage requirement on G2 hardware, 1.5PB is first multiplied by 1.8, because in Table 1, 1.8 is the storage overhead for three sites using cold storage 10+2 EC. This results in a need for 2.7PB of disk space. Since G2 hardware uses 8TB drives, 2.7PB is divided by 8TB to determine the minimum number of disks needed to support the capacity requirement, or 338 drives. Dividing 338 drives by ten, the minimum number of drives possible in a G2 DAE, we know at most 34 DAE are required. If in the end it was decided to only populate each DAE with ten drives, it would be necessary to round 34 DAE up to 36 DAE because appliances come with either four or eight DAE only. Often for cold storage, however, dense configurations are possible, so for this example the minimum required number of drives, 338, could be divided by 60, the number that represents a fully populated DAE. That division yields 6, after rounding up from 5.63. Six fully populated DAE are all that is needed to meet the raw storage requirement in this case. In this scenario three sites are planned, and since each site will have at minimum one ECS appliance, breaking down the storage requirement into the minimum DAE doesn't make sense. Dividing the minimum number of drives required by three, for the number of sites deployed, yields 113 after rounding up. With this, each site must have a minimum of 113 drives. Again, since each appliance has four or eight DAE, 113 can be divided by 4 to determine the minimum drives required in each DAE, or 29 after rounding up. Again, in G2 appliances DAE can be populated with 10, 20, 30, 45, or 60 drives, so in this case an appliance with four DAE each populated with 30 drives at each site will meet the 1.5PB raw storage requirement.
WORKLOAD
To size for a specific workload the number of nodes deployed is critical, and the two most important variables to consider are object size and thread count. The type of operation(s) performed is also important and is used to look up datapoints for the given object size on the performance charts provided.
OBJECT SIZE
As mentioned previously we refer to the two components of a transaction as transactional overhead and data movement or throughput.
For small objects the transactional overhead is high because each object requires at a minimum a query and/or possibly an update of
metadata and the impact of the movement of data for small sizes is minimal in comparison. At the other end of the spectrum, for large
objects, the movement of data and associated throughput is the most critical factor of the time to complete the transaction and the
transactional overhead makes up a much smaller portion of the related work.
Since the major influencing factor to performance of each of these components is dramatically different between small and large
objects, sizing for workload must consider the size of objects as either being small or large. At each object size, for each transaction
type, the influence by each of the transaction's components varies, but it is because the limiting factor is so different between small and
large objects that it is impossible to use an average of both small and large object sizes to approximate performance. It is possible to
lump small objects together and approximate performance for a group of small objects, and it is possible to lump large objects together and approximate expected performance for a group of large objects as well. Again, this is because the major influence on response time among small objects is similar, and the major influence on response time among large objects is similar as well. For workloads with object sizes that vary widely, lump small object sizes together and lump large object sizes together, but do not count on being able to estimate across-the-board object sizes using the provided performance data.
THREAD COUNT
Performance data was created using sufficient threads to saturate the ECS system during a sustained workload.
For application thread counts, only those which interact with the ECS system need to be considered. For example, if an application uses five hundred threads but each thread only interacts with the ECS system a quarter of the time, consider the thread count to be one hundred twenty-five as opposed to five hundred. All threads in the performance tests are dedicated one hundred percent of the time to performing the transactions.
Thread counts ranging from 2,560 (or 320 per node) down to 320 (or 40 per node) were chosen for the extremes of small and large objects, as those counts were found to saturate an ECS U3000 appliance at those sizes. Use of additional threads may not result in an increase in TPS or throughput and may even cause a decrease in performance due to queuing delays. Sizing for thread count requires considering that the data provided in this paper is from a saturated system. The appliance was saturated in order to determine the maximum performance possible. ECS deployments should be sized to allow for enough headroom for projected growth as well as to ensure consistent performance during times of hardware downtime.
EXAMPLES
Assume a workload with 100KB objects needs to be sized to support 5,000 object CREATEs per second. Using the write chart above and looking up from 100KB, a fully saturated U3000 appliance can perform 1,700 TPS. Because the system was saturated during testing, objects on the system could be created using more than 2,560 threads, but using more would not necessarily increase the TPS above 1,700. In order to reach the approximately 5,000 TPS required, three U3000 appliances (or some combination of twenty-four ECS nodes), each node driven by 320 threads as tested, are required. It is expected that, based on the performance data provided, three U3000 systems can deliver approximately 5,100 TPS (3 x 1,700 TPS) when 7,680 threads (2,560 per U3000) are evenly spread across their nodes, which meets the 5,000 CREATE TPS goal.
Assume a similar workload against three U3000 appliances with object sizes set to 100KB, but an application which can only put to work 160 threads dedicated to writing the objects on the ECS system, as opposed to 7,680. With fewer threads the same TPS cannot be achieved. It is not possible for fewer than 7,680 threads to come close to the 5,000 CREATE transactions per second requirement. We cannot say for sure with the data presented in this paper what TPS you'll see using 160 threads; we only know that using 2,560 threads per U3000, at that object size for write transactions, you will see 1,700 TPS on one U3000 system, and approximately 5,100 for all three U3000, and that the systems will be saturated. In order to get to a 5,000 TPS rate an application must be able to put to work 7,680 threads against three U3000 appliances, or a combination of appliances that consists of twenty-four ECS nodes each with fully populated DAE (as tested).
To meet a requirement for 5,000 TPS for an object size of 100KB, roughly 7,680 threads are required and need to be spread evenly across three U3000 appliances, or twenty-four ECS nodes. At that number of threads and ECS nodes it is expected that the ECS system will be saturated. To run the workload with additional headroom, 7,680 dedicated threads are still required but more than twenty-four nodes should be deployed. Using thirty-two nodes, each with 240 threads per node as opposed to 320, in this scenario for example, will allow for some breathing room on the appliances.
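These examples follow a simple pattern that can be sketched as a short calculation. The per-appliance TPS figure and the 2,560-thread saturation point come from the write results above, and the effective-thread adjustment reflects the guidance in the Thread Count section; the function names are illustrative only.

import math

def appliances_needed(target_tps: float, tps_per_saturated_appliance: float) -> int:
    return math.ceil(target_tps / tps_per_saturated_appliance)

def effective_threads(application_threads: int, fraction_of_time_on_ecs: float) -> float:
    # e.g. 500 application threads talking to ECS 25% of the time act like 125 dedicated threads.
    return application_threads * fraction_of_time_on_ecs

appliances = appliances_needed(5000, 1700)     # 100KB writes: ~1,700 TPS per saturated U3000
print(appliances, appliances * 2560)           # -> 3 appliances, 7,680 threads to saturate them
print(effective_threads(500, 0.25))            # -> 125.0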
MIXED WORKLOADS
In the real world not all workloads line up as we have tested here, with 100% READ or 100% WRITE transactions. Using one transaction type for an entire test, however, allows us to single out how the system performs specifically for that type of transaction at a given size and with the load of a specific number of threads. Without testing for a specific combination of object sizes and transaction types, it is only possible to estimate performance of mixed transaction type workloads using data from each set of results as if the transactions were run on separate systems. It cannot be implied, for example, that for a 50/50 split of reads and writes, for a given object size and given number of threads, you can take 50% of the read results and 50% of the write results and add them together. Specific testing of mixed workloads on a common system must be completed to accurately determine expectations.
Again for mixed workloads where object sizes vary from small to large it is not possible to use an average object size to project system
performance. This is because the limiting factors vary significantly between large and small objects. For small objects transactional
overhead or metadata operations impact expected performance the most and for large objects the movement of data impacts expected
performance the most. Transactional overhead and data throughput are two entirely different constraints and therefore averaging
results of both large and small object sizes is not possible. Determine worst case small object performance using the largest of the
small object sizes, and determine worst case large object performance using the smallest of the large object sizes, and size for each
independently in order to best determine the number of ECS nodes to meet this type of workload requirement.
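For widely varying object sizes, that guidance can be sketched as sizing each group on its own worst-case datapoint. The default figures below are the saturated U3000 write numbers quoted in this paper and are placeholders only; a real estimate should use datapoints for the actual boundary object sizes in the workload, and, as noted above, only testing the mixed workload on one system confirms combined behavior.

import math

NODES_PER_U3000 = 8

def nodes_for_small(target_tps: float, tps_per_appliance: float = 1700) -> int:
    # size on the largest of the small object sizes (worst-case small-object TPS)
    return math.ceil(target_tps / tps_per_appliance) * NODES_PER_U3000

def nodes_for_large(target_mbps: float, mbps_per_appliance: float = 1850) -> int:
    # size on the smallest of the large object sizes (worst-case large-object bandwidth)
    return math.ceil(target_mbps / mbps_per_appliance) * NODES_PER_U3000

print(nodes_for_small(3000), nodes_for_large(2500))   # each requirement sized independently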
Conclusion
In conclusion, the performance data provided for S3 read and write transactions enables estimation of TPS or MB/s for workloads for a
single-site ECS deployment. Swift and Atmos protocols are expected to behave similarly for read and write transactions because the
APIs are similar enough in how they behave for these basic operations and on ECS they share a common underlying storage layer.
The maximum throughput seen on the U3000 appliance is 3,959 MB/s for read transactions of 200MB objects. Because read
transactions are less resource intensive than write transactions we know the max throughput on a U3000 appliance is approximately
4,000 MB/s. Similarly, for smaller objects 3,129 TPS is possible. Using these high numbers with the lower highs seen for write performance, 1,971 TPS and 1,850 MB/s, a range of performance can be seen in comparing read and write workflows. Understanding
which factors influence the range of performance, mainly object size, transaction type, and thread count, assists in projecting expected
performance of ECS systems with specific workloads.
Sizing for capacity is a simple math problem with these variables: raw storage required, erasure-coding scheme chosen, and the number of sites deployed. The end result is the number of disks required to meet the raw capacity plus overhead. The minimum number of disks required allows for the determination of the available appliances that can meet the need.
Sizing for performance, although it also requires knowing only a few key variables, is much more complicated. This is because real world workloads often do not consist of only a few simple and well-known, similarly sized objects, transaction types and thread counts that line up easily to simple performance charts. That being said, however, where workloads can be categorized and broken down into transaction types, object sizes, and thread counts, and with a general understanding of how they interact with ECS, some basic calculations can be performed and an approximation of the ECS node resources required can be made.