ELASTIC CLOUD STORAGE (ECS) V2.2 PERFORMANCE
An Analysis of Single-site Performance with Sizing Guidelines

ABSTRACT

This white paper examines object performance of the latest Dell EMC Elastic Cloud Storage software. Performance results are from testing done on an ECS Appliance model U3000 using the S3 protocol. Included are recommendations for sizing an ECS appliance for performance.

September 2016

WHITE PAPER

The information in this publication is provided "as is." Dell EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any Dell EMC software described in this publication requires an applicable software license. Dell, Dell EMC, EMC, and the EMC logo are registered trademarks or trademarks of Dell EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners.

© Copyright 2016 EMC Corporation. All rights reserved. Published in the USA. 01/16, H12545.1

EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice. EMC is now part of the Dell group of companies.

TABLE OF CONTENTS

EXECUTIVE SUMMARY
  AUDIENCE
  SCOPE
ECS ARCHITECTURE
TEST METHODOLOGY AND ENVIRONMENT
  Test Framework
  S3 Protocol
  Numbers of Objects
  Clients
  ECS Appliance
  Connectivity
RESULTS
  Data Presentation
  Data Boundaries
  Metrics
  Read Results
  Write Results
SIZING
  General Guidelines
  Capacity
  Examples
  Workload
    Object Size
    Thread Count
  Examples
  Mixed Workloads
CONCLUSION
EXECUTIVE SUMMARY

Traditional SAN and NAS storage platforms, while critical for enterprise applications, were never designed for modern cloud applications and the demands of cloud. The unabated growth in unstructured content is driving the need for a simpler storage architecture that can efficiently manage billions and trillions of files and accelerate the development of cloud, mobile and big data applications while reducing escalating overhead and storage costs. In order to fulfill these requirements, IT organizations and service providers have begun to evaluate and utilize low-cost, commodity and open-source infrastructures. These commodity components and standard open technologies reduce storage costs, but the individual components provide lower performance and reliability, and they also require the operational expertise to ensure serviceability of the infrastructure.

Dell EMC® Elastic Cloud Storage (ECS™) provides a complete software-defined object storage platform designed for today’s scale-out cloud storage requirements. ECS offers all the cost advantages of commodity infrastructure with the enterprise-grade reliability and serviceability of an integrated Dell EMC-supported solution.
ECS benefits include:

- Cloud-scale Economics: 65% TCO savings versus public cloud services
- Simplicity & Scale: Single global namespace; unlimited apps, users and files
- Multi-site Efficiency: Low storage overhead across geo-distributed locations
- Faster App Development: API-accessible storage accelerates cloud apps and analytics
- Turnkey Cloud: Multi-tenancy, self-service access and metering capabilities

The goal of this paper is to provide a high-level overview of ECS performance and an analysis of performance data to allow consumers to understand and set expectations for ECS system performance. It is also intended to be a guide on how to properly size an ECS deployment.

AUDIENCE

This white paper is intended to support the field and customers in understanding ECS performance and sizing considerations.

SCOPE

ECS version 2.2 introduced an additional data service for file access using NFS version 3; however, performance testing of file-based protocols, including the ECS Hadoop Distributed File System (ECS HDFS) protocol, is outside the scope of this document. Performance of multi-site configurations is also outside the scope of this document. We intend to revisit these topics in future versions of this paper.

ECS Architecture

ECS is available for purchase in two forms: bundled as an appliance, or as software only for deployment on customer-owned commodity hardware. ECS is built using a layered architecture, as seen in Figure 1 below. Each layer is completely abstracted and independently scalable to allow for high availability, no single points of failure and limitless scaling capabilities. The software contains four layers which are deployed using Docker containers. The four layers are data services, provisioning, storage engine and fabric. These layers sit on top of the underlying operating system, which is by default SUSE Linux Enterprise Server (SLES) 12.

The ECS portal and provisioning layer provides a web-based GUI for management, licensing, provisioning, and reporting of ECS nodes. There is also a dashboard which provides overall system-level health, performance, and alert management. System administrators interact with ECS using this layer.

The data services layer provides the object and file protocol access to the system. It is responsible for receiving and responding to client requests, and it extracts and bundles information to and from the shared common storage engine. This is the layer where clients interact with the system over supported protocols.

The storage engine is responsible for accessing and protecting data. It is the core of ECS and contains the main components responsible for processing requests, storing and retrieving data, data protection, and replication. The commodity disks managed by this layer are mounted using the XFS filesystem. These filesystems are filled with 10GB files which act as containers. Both metadata and data are stored in chunks of 128MB inside the container files. Metadata is triple mirrored, and objects less than or equal to 128MB are triple mirrored initially and then converted to erasure-coded (EC) form later. Objects greater than 128MB are erasure-coded "in line" as they are written to disk. Erasure coding is done using Reed-Solomon coding.

The fabric layer provides clustering, system health, software management, configuration management, upgrade capabilities and alerting. It is responsible for keeping the services running and managing resources such as disks, containers, firewall and network.
It tracks and reacts to environment changes such as failure detection and provides alerts relating to the health of the system. This layer interacts directly with the operating system.

For additional information, please review the ECS Overview and Architecture white paper at http://www.emc.com/collateral/whitepapers/h14071-ecs-architectural-guide-wp.pdf.

Figure 1: ECS Architecture

Test Methodology and Environment

TEST FRAMEWORK

A currently internal-only Dell EMC tool named Mongoose was built to create S3 workflows and was used to generate the workloads for the tests reported in this paper. It is similar to other tools used to measure object performance, such as Cloud Object Storage Benchmark (COSBench) and The Grinder, and there are plans to release the tool to the public sometime in 2016. Mongoose allows Dell EMC test engineers to perform S3 operations via HTTP, such as CREATE, READ, UPDATE, and DELETE, with varied object sizes and thread counts, the two primary influences on system performance.

S3 PROTOCOL

Performance of object storage platforms differs from that of traditional architectures such as NAS and SAN in many ways. Because of this, how we execute performance tests and characterize their results must also differ from traditional approaches. Object storage platforms are generally IP-addressable storage accessed via a REST API, and ECS natively supports Amazon's S3, OpenStack Swift, and Dell EMC's Atmos object protocols. The Content-Addressable Storage (CAS) object protocol is also supported by ECS; however, its core behavior differs enough from the previous three to require separate analysis in a future paper.

The three REST APIs are all similar in overall approach and capabilities, but there are some differences in semantics, function and limitations. None of those differences are material to ECS performance, so we focus on the S3 API in most of our testing and in this paper. All three APIs are implemented as components that conceptually sit on top of a common object engine. Due to that common data path, differing only in the API-specific layer, we expect all three APIs to perform similarly when exercising like functionality. This is validated by our testing. With this confirmed similarity of performance between the three APIs, when using this document as a performance reference one can use data for any of the APIs to represent the system performance generically, independent of which of the three APIs is in use. Note that this similarity in performance holds only when the APIs are being used in ways that exercise the same underlying functionality. The APIs have optional functionality that, when used, can cause performance differences.
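To make the tested operations concrete, the following minimal sketch issues the same kind of S3 CREATE and READ calls described above against an ECS S3 endpoint using the AWS SDK for Python (boto3). This is not the Mongoose tool; the endpoint URL, credentials, bucket and key names are placeholder assumptions.

```python
import boto3

# Placeholder ECS S3 endpoint and object-user credentials (assumed values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://ecs-node.example.com:9020",
    aws_access_key_id="OBJECT_USER",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "perf-test"            # hypothetical bucket, created without encryption or file access
payload = b"x" * 10 * 1024      # a 10KB object, one of the small sizes used in testing

# CREATE: a single PUT writes the entire object; there is no stateful open/write/close sequence.
s3.put_object(Bucket=bucket, Key="obj-0001", Body=payload)

# READ: a single GET retrieves the entire object.
data = s3.get_object(Bucket=bucket, Key="obj-0001")["Body"].read()
assert len(data) == len(payload)
```

Because each transaction is a single, self-contained HTTP request, a load generator scales the offered load simply by varying the object size and the number of concurrent threads issuing such requests.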
NUMBERS OF OBJECTS

Currently the response time on an empty ECS system is much lower than on one fully populated with objects. For example, on an empty system 7,000 TPS may be seen, but on a filled system we may only see around 3,000 TPS. This is still quite fast, however, and equates to 10.8 million transactions (objects) per hour. What this means is that users will see response times grow as the number of objects on an ECS system increases over time. The low initial response time occurs because object metadata (the index) initially fits entirely in memory, which allows requests to be served quickly. A B+ tree stores the index on disk and is initially shallow enough that, even when disk access is required for a lookup, no more than one or two disk operations are needed.

As the number of objects increases over time, the number of calls to disk required to service requests increases as well. Our testing has shown that this initial decline in performance does plateau. It is because of this behavior that we test on systems that are filled to the point where performance has plateaued, so we can report realistic numbers rather than the highest seen on empty systems. We do this by performing tests under a single namespace inside a single bucket that is prepopulated with a few hundred thousand objects.

CLIENTS

A one-to-one ratio of clients to ECS nodes is sufficient for our ECS test system to reach maximum performance. This is because the Mongoose application is able to perform tests using a varied number of threads. Each of the eight clients has the following specs:

- Intel server with dual four-core Intel 2.5 GHz Ivy Bridge (E5-2609) processors
- 64GB RAM
- One configured 10GbE port
- SLES 12 with 3.12.39-47-default kernel

ECS APPLIANCE

ECS appliances consist of varying quantities of the same node, disk, and disk enclosure; what distinguishes the models is the number of nodes and disk enclosures and the number of disks they contain. Details on the breakdown of hardware for each of the available ECS appliance models can be found in the ECS Specification Sheet at http://www.emc.com/collateral/specificationsheet/h13117-emc-ecs-appliance-ss.pdf. The ECS U3000 appliance has the following hardware:

- Two sets of four Phoenix blades, each set in a shared 2U chassis, with each blade having dual four-core Intel 2.5 GHz Ivy Bridge (E5-2609) processors, 64GB RAM, a single DOM for OS boot, a single 6 Gbit/s SAS interface (for Disk Array Enclosure connectivity) and two 10GbE ports
- Two 10GbE top-of-rack (TOR) switches which uplink to the customer network and are used both for internal node-to-node and external node-to-client communication
- One 1GbE switch used for internal and external out-of-band management
- Eight Disk Array Enclosures (DAE), each containing sixty 6TB SATA drives. Each DAE is connected to one of the nodes in the system via a single 6 Gbit/s SAS connection; that is, each of the eight nodes is directly connected to one DAE.

Performance testing was done on an ECS appliance model U3000 installed with ECS software version 2.2. The object space was configured using a single virtual datacenter, storage pool, replication group, namespace and an S3 bucket created without encryption or file capabilities enabled. Also note that we plan to publish similar data using generation 2 hardware in the first half of 2016.

CONNECTIVITY

Each of the eight clients used in the tests was connected directly to a 10GbE TOR data switch in the ECS appliance rack. Only one of the two TOR switches is required and used during testing. The added complexity of using a load balancer can be avoided since Mongoose is able to spread the workload evenly across all nodes. A load balancer that is properly configured and sized should not add delay to the transactions, so by directly connecting clients to one of the appliance's TOR switches the tests are kept focused on the transactions and not on third-party networking components. Figure 2 shows a simplistic view of the basic components and connectivity.

Figure 2: Lab Connectivity

Results

DATA PRESENTATION

Charts in this document are in a form designed to maximize information density. We refer to charts of both transactional throughput and bandwidth as "butterfly" charts because of their resemblance to spread wings. An annotated example is presented in Figure 3 below.
Data for the charts below in Figures 4 and 5 comes from a test we have named the max test, because it is most often used as a test of cluster peak performance. In this test we run a series of homogeneous workloads, and each workload is a datapoint in the curve on these charts. There is a defined sequence of object sizes in our max test, small to large, with enough granularity in the sizes to show the shape of the curve. We start with writes, and first write the smallest object size with a fixed thread count, long enough (generally around five minutes) to determine the steady-state throughput for that thread count and object size. A brief delay is inserted between object sizes so we can clearly delineate where workflows start and stop. Then the next object size is written, and so on, determining the throughput for each object size. After the write workloads are complete we begin reading objects, starting again at the smallest size and moving to larger and larger sizes.

Since transactions for a range of small objects all perform the same, for that object range we use the same thread count. But as larger and larger objects are tested, the thread count is periodically reduced. The appropriate thread counts are determined empirically, and are chosen to be large enough to achieve peak performance but not unnecessarily large beyond that, since that would just grow queuing delays in the system. The tests in this paper used a range of threads from 2,560 for the smallest objects down to 320 for the largest objects.

Note that the two curves in Figures 3-5 represent the same sequence of data points; they differ only in their units. The curves that generally start in the upper left and trend downward to the right represent objects per second, and are measured on the left y-axis. To obtain the other curves, the objects-per-second data points are multiplied by the size of the objects, resulting in lines representing the bandwidth, generally in MB/second. Those curves start at the origin, generally trend to the upper right, and are read on the right y-axis. Data was formed by running the max test a sufficient number of times to ensure consistent results were obtained.
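The relationship between the two curves can be expressed as a short sketch. The snippet below walks a max-test-style sequence of object sizes and thread counts and derives the bandwidth curve from the measured transaction rate; run_workload is a stand-in for an actual load generator (such as Mongoose or COSBench), and the sizes and thread counts mirror those listed above.

```python
# Schematic of the max test sweep; run_workload() is a placeholder for a real load generator.
SIZES_AND_THREADS = [          # (object size in bytes, thread count), as tested in this paper
    (10 * 1024, 2560),         # 10KB
    (100 * 1024, 2560),        # 100KB
    (1 * 1024**2, 1280),       # 1MB
    (10 * 1024**2, 640),       # 10MB
    (100 * 1024**2, 320),      # 100MB
    (200 * 1024**2, 320),      # 200MB
]

def run_workload(operation, object_size, threads, duration_s=300):
    """Drive a homogeneous workload to steady state and return objects per second."""
    raise NotImplementedError("stand-in for the load generator")

def max_test(operation):
    """Sweep object sizes small to large, recording both butterfly-chart curves."""
    points = []
    for size, threads in SIZES_AND_THREADS:
        tps = run_workload(operation, size, threads)   # left y-axis: objects/second
        bandwidth_mb_s = tps * size / 1024**2          # right y-axis: MB/s, derived from TPS
        points.append((size, threads, tps, bandwidth_mb_s))
    return points
```

Because the bandwidth curve is derived by multiplying TPS by object size, a nearly flat TPS curve at small sizes produces a bandwidth curve that rises roughly in proportion to object size.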
DATA BOUNDARIES

There are two components to each transaction on the system: metadata operations and data movement. Metadata operations include reading and writing or updating the metadata associated with an object, and we refer to these operations as transactional overhead. Every transaction on the system will have some form of metadata overhead, and the size of an object, along with the transaction type, determines the associated level of resource utilization. Data movement is also required for each transaction and refers to the receipt or delivery of client data. We refer to this component in terms of throughput or bandwidth, generally in megabytes per second (MB/s).

For small object sizes we have found a range where the transactional overhead is the dominant fraction of the overall response time and the time to move the data payload is insignificant. We see only a slight curve in the performance charts in this area. This is shown in Figure 3 as the small-file constant-TPS region. We quote the low end of that range as the performance boundary rather than using the highest numbers seen, or the peak; the ceiling we use is 10% less than peak, or 90% of peak. The lower dashed line on the left side of Figure 3, labelled 'simplified small file TPS', shows the value we actually report as the peak TPS. Similarly, for large object sizes we have found a range where the bandwidth is similar as well, so we quote the low end of that range as the boundary there also. Again, the ceiling we use is 10% less than peak, or 90% of peak. The red vertical lines in Figure 3 represent the small- and large-file boundaries used in testing. Outside of that range of object sizes there is not much curvature to the lines in Figure 3, again because the TPS or throughputs are so similar in those areas.

Figure 3: Butterfly Chart

The results outside of these boundaries are close because the impact of one of the two components is drastically more significant than the other. For small objects the transactional overhead is similar and is the largest, or limiting, factor in performance. Since every transaction has a metadata component, and the metadata operations have a minimum amount of time to complete, for objects of similar small sizes that component, the transactional overhead, will be similar. On the other end of the spectrum, for large objects, the transactional overhead is minimal compared to the amount of time required to physically move the data between the client and the ECS system. For larger objects the major factor in response time is the throughput, which is why at certain large object sizes the throughput potential is similar.

METRICS

Although many people ask for the number of input/output operations per second (IOPS) that are possible on an ECS system, there is no meaningful way for us to provide that in comparison to other storage systems. This is because object storage systems are built on REST APIs that do not define discrete I/O operations at the same level as other filesystems. For example, a storage system that implements a filesystem following POSIX API semantics has a usage model that retains state between I/O operations to the same file. This retention of state makes possible a sequence of logical I/O operations which can be counted discretely. Conversely, when using a REST API to object storage, the file would be read with a single API call. It is this difference in operations and how they are counted that makes it impossible to use a common metric.

It is because of the two components mentioned above, transactional overhead and data throughput, that in object storage systems small object performance must be quoted in transactions per second (TPS) and large object performance in data throughput. For transactions with object sizes in the small object region, the performance for a given level of concurrency will be nearly constant. Whether the transactions are 1KB or 10KB, the throughput in objects per second will be nearly the same. However, if we provide an estimated throughput in units of bandwidth (MB/s or KB/s) based on a stated object size of 10KB, and it is revealed later that the actual object size is 1KB, then the estimated throughput will be 10X the actual throughput. This error is avoided by consistent use of TPS or objects/second when referring to small object throughput. For large objects, data throughput in bandwidth, generally in MB/s, is used to quote performance, as in this case moving the data accounts for the most significant portion of the transaction. It is clear the charts below follow the same butterfly pattern referenced above in Figure 3.
This will always hold true when varying the object size from small to large due to the nature of the transactions. On the left-hand side of each chart we see the effect of transactional overhead for small objects, and on the right-hand side we see the corresponding effect of throughput for large objects. To put this into an equation: response time, or the total time to complete a transaction, is the time to perform the transaction's required metadata operations (the transaction's transactional overhead) plus the time it takes to move the data between the client and ECS (the transaction's data movement).

Total response time = transactional overhead time + data movement time

Data movement time has much less of an impact for smaller objects compared to transactional time, and the opposite is true for large objects. The larger the object, the more data movement time dominates the transactional time; hence bandwidth becomes the more important metric for measuring large object performance.

READ RESULTS

Figure 4: U3000 Read Performance on ECS v2.2. Transactions per second (left axis) and bandwidth in MB/s (right axis) versus object size; thread counts were 2,560 for 10KB and 100KB objects, 1,280 for 1MB, 640 for 10MB, and 320 for 100MB and 200MB objects.

Figure 4 above shows the read results in transactions per second and MB/s for the U3000 using varying object sizes. The horizontal axis also lists the number of threads used to generate the load for each object size. The maximum TPS seen on the read chart is 3,129. For large objects, 3,959 MB/s is the highest rate seen. The rates for 100MB and 200MB objects are very close, which points to the effect of data throughput at these larger sizes. As object size grows beyond 200MB, a similar throughput rate of close to 4,000 MB/s can be expected. Using 320 threads at these larger object sizes, the ECS system is fully saturated.

WRITE RESULTS

Figure 5: U3000 Write Performance on ECS v2.2. Transactions per second (left axis) and bandwidth in MB/s (right axis) versus object size; thread counts were 2,560 for 10KB and 100KB objects, 1,280 for 1MB, 640 for 10MB, and 320 for 100MB and 200MB objects.

Figure 5 above shows the write results in transactions per second and MB/s for the U3000 using varying object sizes. The horizontal axis also lists the number of threads used to generate the load for each object size. Write performance for a sustained workload of 10KB object writes, using 2,560 threads, is close to 2,000 TPS. Again, this thread count was chosen for small object sizes because at that quantity of threads the system is saturated. The slope of the TPS line is steeper than for reads, which indicates that the data payload is becoming more impactful, most likely due to triple mirroring, since reads only need to retrieve a single copy. Objects of 200MB and greater, in this data, perform at a throughput rate of 1,850 MB/s. This is the maximum throughput seen for sustained large object write workloads on the U3000 appliance and what can be expected as object sizes continue to grow. The throughput increase from 100MB to 200MB is due to an optimization employed for objects greater than 128MB, where we erasure code the object during ingest rather than writing three copies and converting to EC later.
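A quick back-of-the-envelope check using the peaks reported above shows why the two regions are quoted in different units. The figures come from Figures 4 and 5; only the unit conversion is added here.

```python
# Read peaks reported above for the U3000: roughly 3,129 TPS for small objects
# and roughly 3,959 MB/s for large objects.
small_obj_mb = 10 / 1024       # a 10KB object expressed in MB
large_obj_mb = 200.0           # a 200MB object

# At the small-object peak the implied data rate is tiny compared to the ~4,000 MB/s
# ceiling, so metadata work (transactional overhead) is the limiting factor.
print(3129 * small_obj_mb)     # ~30.6 MB/s moved while sustaining 3,129 TPS

# At the large-object peak the implied transaction rate is tiny compared to the
# ~3,100 TPS ceiling, so data movement (bandwidth) is the limiting factor.
print(3959 / large_obj_mb)     # ~19.8 objects/second while sustaining 3,959 MB/s
```

Quoting small objects in MB/s, or large objects in TPS, would describe the component that is nowhere near its limit, which is why the charts are read in TPS on the left wing and MB/s on the right wing.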
Sizing

GENERAL GUIDELINES

Sizing an ECS deployment purely for capacity determines the minimum number of disks needed to meet a raw storage requirement. From the number of disks required, either the minimum number of DAE required, if planning for a single site, or the minimum number of drives required per site can be determined. Additional considerations must be made to account for configurations using either four or eight DAE per appliance, along with the number of drives each DAE contains.

When sizing for workload, the result is a minimum required number of ECS nodes. The number of nodes is the end result of workload sizing calculations because it is the ECS system as a whole, as tested, that is being sized, and not specifically the amount of network, CPU, memory or disk resources. As with DAE, appliances are sold with four or eight nodes, so the number of nodes required to support a workload may need to be rounded up to the next multiple of four.

As mentioned throughout this paper, object size and thread count, along with transaction type, are the primary factors which dictate system performance. Because of this, in order to size for workload, the object sizes and transaction types, along with related application thread counts, should be cataloged for all workloads that will utilize the ECS system.

CAPACITY

Sizing for capacity requires an understanding of how ECS stores and protects data. For complete details of how ECS stores and replicates data, and the related storage overhead, see the Overview and Architecture guide referenced above. Table 1 contains the storage overhead calculations.

Table 1: Storage overhead

  Number of sites    Default EC (12+4)    Cold archive EC (10+2)
  1                  1.33                 1.2
  2                  2.67                 2.4
  3                  2.00                 1.8
  4                  1.77                 1.6
  5                  1.67                 1.5
  6                  1.60                 1.44
  7                  1.55                 1.40
  8                  1.52                 1.37

ECS storage overhead varies based on the number of sites in the replication group and the EC option configured. Overhead ranges from as low as 1.2X for a single site configured with 10+2 cold archive EC, up to 2.67X for two sites configured with the default 12+4 EC. Although the maximum overhead is 2.67, this number decreases significantly as the number of sites in a replication group is increased (when data is written on all the sites). As data is written, its overhead is initially 3X, but once a chunk is filled to 128MB it is sealed to prevent further writes; at this point the data is erasure-coded and two of the three mirrored copies are discarded.

To determine the minimum number of drives needed to meet a raw data requirement, the raw storage requirement is multiplied by the storage overhead from Table 1 above. Because disks are installed inside DAE in equal proportions, the minimum number of drives required must be divided by the number of drives that will be populated in each DAE. The minimum number of drives per DAE in a generation 1 (G1) system is fifteen, and in a generation 2 (G2) system it is ten, so either of those numbers can be used, depending upon the generation of the appliance, to determine the minimum number of DAE required to meet the capacity requirement. Similarly, if it is known that the DAEs will be fully populated, then the total minimum number of drives can be divided by sixty. The total minimum number of drives can also be divided by forty-five, thirty, twenty, fifteen, or ten if it is known how the DAE will be populated. After the minimum number of DAE has been determined, rounding up to a multiple of four is required, because DAE are deployed in appliances in allotments of either four or eight only. After the number of required DAE has been determined, the available appliance options can be reviewed to determine what is most suitable for the deployment.
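The capacity procedure above reduces to a short calculation. The sketch below looks up the Table 1 overhead factor and derives the minimum drive and DAE counts; the function name and structure are ours, but the values are taken directly from Table 1 and the single-site example worked in the next section.

```python
import math

# Overhead factors from Table 1, indexed by EC scheme and number of sites.
OVERHEAD = {
    "12+4": {1: 1.33, 2: 2.67, 3: 2.00, 4: 1.77, 5: 1.67, 6: 1.60, 7: 1.55, 8: 1.52},
    "10+2": {1: 1.20, 2: 2.40, 3: 1.80, 4: 1.60, 5: 1.50, 6: 1.44, 7: 1.40, 8: 1.37},
}

def min_drives_and_daes(raw_tb, sites, scheme, drive_tb, drives_per_dae):
    """Return (minimum drives, minimum DAE rounded up to a multiple of four)."""
    disk_space_tb = raw_tb * OVERHEAD[scheme][sites]    # raw requirement plus protection overhead
    drives = math.ceil(disk_space_tb / drive_tb)        # smallest whole number of drives
    daes = math.ceil(drives / drives_per_dae)           # DAE needed at the chosen population
    return drives, math.ceil(daes / 4) * 4              # appliances hold four or eight DAE

# Single-site example from the next section: 2 PB raw, 12+4 EC, G1 6TB drives, 15 drives per DAE.
print(min_drives_and_daes(2000, 1, "12+4", 6, 15))      # -> (444, 32)
```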
EXAMPLES

To meet a requirement for 2 petabytes of raw data at a single site, 2.66 petabytes (2 petabytes x 1.33 overhead) of storage is required if 12+4 EC is used, and 2.4 petabytes (2 petabytes x 1.2 overhead) is needed for 10+2 EC. Additionally, a conversion of the final number from base 10 to base 2, or petabytes to pebibytes, can be performed if desired.

Specifically sizing for a G1 system, which uses 6 terabyte drives and 12+4 EC, 444 disks are required. The 444 disks are determined by dividing the total capacity needed, 2.66PB, by the capacity of G1 drives, 6TB. Disks can be populated in DAE in batches of fifteen in G1 hardware, so the maximum number of DAE required, determined by using the minimum of fifteen disks per DAE, is thirty. This is derived by dividing the required number of disks, 444, by 15, the minimum number of drives per DAE. Since DAE are populated in appliances in multiples of four or eight, the number of DAE required becomes thirty-two, two more than the minimum of thirty, to accommodate the four-DAE block deployment requirement. To meet the 2 petabyte requirement a variety of hardware appliances can be chosen, ranging from eight U300 (four nodes/four DAE each with 15 disks) to a single U3000 (eight nodes/eight DAE each with 60 drives). When sizing for capacity only, the number of ECS nodes used isn't important, just the disk space required.

The difference in sizing for capacity for G2 hardware is that 8TB disks are used rather than 6TB disks, and they can be deployed in the DAE in lots of 10, 20, 30, 45, and 60 per DAE, as opposed to 15, 30, 45, and 60 as with G1 appliances. For an ECS three-site cold storage deployment with a 1.5PB raw storage requirement on G2 hardware, 1.5PB is first multiplied by 1.8, because in Table 1, 1.8 is the storage overhead for three sites using cold storage 10+2 EC. This results in a need for 2.7PB of disk space. Since G2 hardware uses 8TB drives, 2.7PB is divided by 8TB to determine the minimum number of disks needed to support the capacity requirement, or 338 drives. Dividing 338 drives by ten, the minimum number of drives possible in a G2 DAE, we know at most 34 DAE are required. If in the end it were decided to populate each DAE with only ten drives, it would be necessary to round 34 DAE up to 36 DAE because appliances come with either four or eight DAE only. Often for cold storage, however, dense configurations are possible, so for this example the minimum required number of drives, 338, could be divided by 60, the number that represents a fully populated DAE. That division yields 6, after rounding up from 5.63. Six fully populated DAE are all that is needed to meet the raw storage requirement in this case.

In this scenario three sites are planned, and since each site will have at minimum one ECS appliance, breaking down the storage requirement into the minimum number of DAE doesn't make sense. Dividing the minimum number of drives required by three, for the number of sites deployed, yields 113 after rounding up, so each site must have a minimum of 113 drives. Again, since each appliance has four or eight DAE, 113 can be divided by 4 to determine the minimum number of drives required in each DAE, or 29. Since G2 DAE can be populated with 10, 20, 30, 45, or 60 drives, in this case an appliance with four DAE, each populated with 30 drives, at each site will meet the 1.5PB raw storage requirement.
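For the multi-site case, the per-site split adds a step beyond the helper shown earlier. The arithmetic below mirrors the three-site cold archive example just described; the values and allowed DAE populations come from the text above.

```python
import math

# Three-site cold archive example on G2 hardware (8TB drives).
raw_pb = 1.5
overhead = 1.8                                    # Table 1: 3 sites, 10+2 cold archive EC
disk_space_tb = raw_pb * 1000 * overhead          # 2,700 TB of disk space required
min_drives = math.ceil(disk_space_tb / 8)         # 338 drives of 8TB each

# Split across the three sites, then across the four DAE of one appliance per site.
drives_per_site = math.ceil(min_drives / 3)       # 113 drives per site
drives_per_dae = math.ceil(drives_per_site / 4)   # 29 drives per DAE

# G2 DAE can hold 10, 20, 30, 45 or 60 drives, so round up to the next allowed population.
allowed_populations = [10, 20, 30, 45, 60]
population = min(p for p in allowed_populations if p >= drives_per_dae)
print(min_drives, drives_per_site, population)    # 338 drives, 113 per site, 30 drives per DAE
```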
WORKLOAD

To size for a specific workload, the number of nodes deployed is critical, and the two most important variables to consider are object size and thread count. The type of operation(s) performed is also important and is used to look up datapoints for the given object size on the performance charts provided.

OBJECT SIZE

As mentioned previously, we refer to the two components of a transaction as transactional overhead and data movement, or throughput. For small objects the transactional overhead is high, because each object requires at a minimum a query and possibly an update of metadata, and the impact of the movement of data at small sizes is minimal in comparison. At the other end of the spectrum, for large objects, the movement of data and the associated throughput is the most critical factor in the time to complete the transaction, and the transactional overhead makes up a much smaller portion of the related work.

Since the major influencing factor for each of these components is dramatically different between small and large objects, sizing for workload must consider the size of objects as either small or large. At each object size, for each transaction type, the influence of each of the transaction's components varies, but it is because the limiting factor is so different between small and large objects that it is impossible to use an average of both small and large object sizes to approximate performance. It is possible to lump small objects together and approximate performance for a group of small objects, and it is possible to lump large objects together and approximate expected performance for a group of large objects as well. Again, this is because the major influence on response time is similar among small objects, and the major influence on response time is similar among large objects as well. For workloads with object sizes that vary widely, lump small object sizes together and lump large object sizes together, but do not count on being able to estimate across-the-board object sizes using the provided performance data.

THREAD COUNT

Performance data was created using sufficient threads to saturate the ECS system during a sustained workload. For application thread counts, only those threads which interact with the ECS system need to be considered. For example, if an application uses five hundred threads but each thread only interacts with the ECS system a quarter of the time, consider the thread count to be one hundred twenty-five as opposed to five hundred. All threads in the performance tests are dedicated one hundred percent of the time to performing the transactions. Thread counts ranging from 2,560 (or 320 per node) for the smallest objects down to 320 (or 40 per node) for the largest objects were chosen, as those counts were found to saturate an ECS U3000 appliance at those sizes. Use of additional threads may not result in an increase in TPS or throughput and may even cause a decrease in performance due to queuing delays. Sizing for thread count requires considering that the data provided in this paper is from a saturated system; the appliance was saturated in order to determine the maximum performance possible.
ECS deployments should be sized with enough headroom to allow for projected growth as well as to ensure consistent performance during periods of hardware downtime.

EXAMPLES

Assume a workload with 100KB objects needs to be sized to support 5,000 CREATE operations per second. Using the write chart above and looking up 100KB, a fully saturated U3000 appliance can perform 1,700 TPS. Because the system was saturated during testing, objects could be created using more than 2,560 threads, but using more would not necessarily increase the TPS above 1,700. In order to reach the approximately 5,000 TPS required, three U3000 appliances (or some combination of twenty-four ECS nodes), with each node driven by 320 threads as tested, are required. It is expected, based on the performance data provided, that three U3000 systems can deliver approximately 5,100 TPS (3 x 1,700 TPS) when 7,680 threads (2,560 per U3000) are evenly spread across their nodes, which meets the 5,000 CREATE TPS goal.

Now assume a similar workload against three U3000 appliances with object sizes of 100KB, but an application which can only put to work 160 threads dedicated to writing objects to the ECS system, as opposed to 7,680. With fewer threads the same TPS cannot be achieved; it is not possible for fewer than 7,680 threads to come close to the 5,000 CREATE transactions per second requirement. We cannot say for certain from the data presented in this paper what TPS you will see using 160 threads; we only know that using 2,560 threads per U3000, at that object size for write transactions, you will see 1,700 TPS on one U3000 system, and approximately 5,100 TPS for all three U3000, and that the systems will be saturated. In order to reach a 5,000 TPS rate, an application must be able to put to work 7,680 threads against three U3000 appliances, or a combination of appliances consisting of twenty-four ECS nodes, each with fully populated DAE (as tested).

To meet a requirement of 5,000 TPS for an object size of 100KB, roughly 7,680 threads are required and need to be spread evenly across three U3000 appliances, or twenty-four ECS nodes. At that number of threads and ECS nodes the ECS system is expected to be saturated. To run the workload with additional headroom, 7,680 dedicated threads are still required but more than twenty-four nodes should be deployed. Using thirty-two nodes, each with 240 threads per node as opposed to 320, in this scenario for example, will allow for some breathing room on the appliances.
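The lookup-and-divide logic of this example can be written down directly. The sketch below assumes the per-appliance figure read from the write chart (about 1,700 TPS for 100KB CREATE operations at 2,560 threads) and rounds the appliance count up, since appliances are deployed whole.

```python
import math

# Figures taken from the write chart for 100KB CREATE operations on a saturated U3000.
TPS_PER_U3000 = 1700
THREADS_PER_U3000 = 2560              # 320 threads per node, 8 nodes per appliance

def size_for_create_tps(required_tps):
    """Minimum saturated U3000 appliances, nodes and threads for a target CREATE TPS."""
    appliances = math.ceil(required_tps / TPS_PER_U3000)
    nodes = appliances * 8
    threads = appliances * THREADS_PER_U3000
    expected_tps = appliances * TPS_PER_U3000
    return appliances, nodes, threads, expected_tps

# Example from the text: 5,000 CREATE operations per second of 100KB objects.
print(size_for_create_tps(5000))      # -> (3, 24, 7680, 5100)
```

The same lookup applies to any other object size and transaction type on the charts; for headroom, keep the thread count fixed and spread it over more nodes, as described above.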
MIXED WORKLOADS

In the real world, not all workloads line up as tested here, with 100% READ or 100% WRITE transactions. Using one transaction type for an entire test, however, allows us to single out how the system performs for that type of transaction at a given size and under the load of a specific number of threads. Without testing a specific combination of object sizes and transaction types, it is only possible to estimate the performance of mixed-transaction workloads using data from each set of results as if the transactions were running on separate systems. It cannot be assumed, for example, that for a 50/50 split of reads and writes, at a given object size and a given number of threads, you can take 50% of the read results and 50% of the write results and add them together. Specific testing of mixed workloads on a common system must be completed to accurately determine expectations.

Likewise, for mixed workloads where object sizes vary from small to large, it is not possible to use an average object size to project system performance. This is because the limiting factors vary significantly between large and small objects: for small objects, transactional overhead or metadata operations impact expected performance the most, and for large objects, the movement of data impacts expected performance the most. Transactional overhead and data throughput are two entirely different constraints, and therefore averaging results across both large and small object sizes is not possible. Determine worst-case small object performance using the largest of the small object sizes, determine worst-case large object performance using the smallest of the large object sizes, and size for each independently in order to best determine the number of ECS nodes needed to meet this type of workload requirement.

Conclusion

In conclusion, the performance data provided for S3 read and write transactions enables estimation of TPS or MB/s for workloads on a single-site ECS deployment. The Swift and Atmos protocols are expected to behave similarly for read and write transactions because the APIs are similar enough in how they behave for these basic operations, and on ECS they share a common underlying storage layer.

The maximum throughput seen on the U3000 appliance is 3,959 MB/s for read transactions of 200MB objects. Because read transactions are less resource intensive than write transactions, we know the maximum throughput on a U3000 appliance is approximately 4,000 MB/s. Similarly, for smaller objects 3,129 TPS is possible. Using these high numbers together with the lower highs seen for write performance, 1,971 TPS and 1,850 MB/s, a range of performance can be seen when comparing read and write workflows. Understanding which factors influence the range of performance, mainly object size, transaction type, and thread count, assists in projecting expected performance of ECS systems with specific workloads.

Sizing for capacity is a simple math problem with three variables: the raw storage required, the erasure-coding scheme chosen, and the number of sites deployed. The end result is the number of disks required to meet the raw capacity plus overhead. The minimum number of disks required allows for the determination of the available appliances that can meet the need. Sizing for performance, although it also requires knowing only a few key variables, is much more complicated. This is because real-world workloads often do not consist of only a few simple, well-known, similarly sized objects, transaction types and thread counts that line up easily to simple performance charts. That being said, where workloads can be categorized and broken down into transaction types, object sizes, and thread counts, and with a general understanding of how they interact with ECS, some basic calculations can be made and an approximation of the ECS node resources required can be determined.