Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. It also provides common ML algorithms that are optimized to run efficiently against extremely large data in a distributed environment.
SageMaker real-time inference is ideal for workloads that have real-time, interactive, low-latency requirements. With SageMaker real-time inference, you can deploy REST endpoints that are backed by a specific instance type with a certain amount of compute and memory. Deploying a SageMaker real-time endpoint is only the first step in the path to production for many customers. We want to be able to maximize the performance of the endpoint to achieve a target transactions per second (TPS) while adhering to latency requirements. A large part of performance optimization for inference is making sure you select the proper instance type and count to back an endpoint.
This post describes the best practices for load testing a SageMaker endpoint to find the right configuration for the number and size of instances. This can help us understand the minimum provisioned instance requirements to meet our latency and TPS requirements. From there, we dive into how you can track and understand the metrics and performance of the SageMaker endpoint using Amazon CloudWatch metrics.
We first benchmark the performance of our model on a single instance to identify the TPS it can handle within our acceptable latency requirements. Then we extrapolate the findings to decide on the number of instances we need in order to handle our production traffic. Finally, we simulate production-level traffic and set up load tests for a real-time SageMaker endpoint to confirm that our endpoint can handle the production-level load. The entire set of code for the example is available in the following GitHub repository.
Overview of solution
For this post, we deploy a pre-trained Hugging Face DistilBERT model from the Hugging Face Hub. This model can perform a number of tasks, but we send a payload specifically for sentiment analysis and text classification. With this sample payload, we strive to achieve 1000 TPS.
Deploy a real-time endpoint
This post assumes you're familiar with how to deploy a model. Refer to Create your endpoint and deploy your model to understand the internals behind hosting an endpoint. For now, we can quickly point to this model in the Hugging Face Hub and deploy a real-time endpoint with the following code snippet:
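The full notebook is in the GitHub repository mentioned above; the following is a minimal sketch of what the deployment could look like with the SageMaker Python SDK. The model ID, framework versions, and role lookup shown here are illustrative assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

# Point to a pre-trained DistilBERT model on the Hugging Face Hub (model ID is an assumption)
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.12",  # versions are illustrative; use a combination supported in your Region
    pytorch_version="1.9",
    py_version="py38",
)

# Deploy a real-time endpoint backed by a single ml.m5.12xlarge instance
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.12xlarge",
)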
Let's test our endpoint quickly with the sample payload that we want to use for load testing:
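A quick smoke test might look like the following sketch (the sample text and payload shape are assumptions based on the Hugging Face text classification task):
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"inputs": "I am super happy right now."}  # sample sentiment-analysis payload

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or paste your endpoint name directly
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))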
Note that we're backing the endpoint with a single Amazon Elastic Compute Cloud (Amazon EC2) instance of type ml.m5.12xlarge, which contains 48 vCPUs and 192 GiB of memory. The number of vCPUs is a good indication of the concurrency the instance can handle. In general, it's recommended to test different instance types to make sure we have an instance whose resources are properly utilized. To see a full list of SageMaker instances and their corresponding compute power for real-time inference, refer to Amazon SageMaker Pricing.
Metrics to track
Before we can get into load testing, it's essential to understand what metrics to track in order to understand the performance breakdown of your SageMaker endpoint. CloudWatch is the primary logging tool that SageMaker uses to help you understand the different metrics that describe your endpoint's performance. You can utilize CloudWatch logs to debug your endpoint invocations; all logging and print statements you have in your inference code are captured here. For more information, refer to How Amazon CloudWatch works.
There are two different types of metrics CloudWatch covers for SageMaker: instance-level and invocation metrics.
Instance-level metrics
The first set of parameters to consider is the instance-level metrics: CPUUtilization and MemoryUtilization (for GPU-based instances, GPUUtilization). For CPUUtilization, you may see percentages above 100% at first in CloudWatch. It's important to realize that for CPUUtilization, the sum of all the CPU cores is displayed. For example, if the instance behind your endpoint contains 4 vCPUs, the range of utilization is up to 400%. MemoryUtilization, on the other hand, is in the range of 0–100%.
Specifically, you can use CPUUtilization to get a deeper understanding of whether you have sufficient or even an excess amount of hardware. If you have an under-utilized instance (less than 30%), you could potentially scale down your instance type. Conversely, if you are around 80–90% utilization, it would benefit you to pick an instance with greater compute/memory. From our tests, we suggest around 60–70% utilization of your hardware.
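For reference, instance-level metrics are published to the /aws/sagemaker/Endpoints namespace in CloudWatch. A minimal sketch of pulling CPUUtilization with Boto3 might look like the following (the endpoint name is a placeholder, and AllTraffic is the SageMaker Python SDK's default variant name):
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",  # instance-level metrics namespace
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint-name"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average", "Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])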
Invocation metrics
As suggested by the name, invocation metrics are where we can track the end-to-end latency of any invokes to your endpoint. You can utilize the invocation metrics to capture error counts and the kinds of errors (5xx, 4xx, and so on) that your endpoint may be experiencing. More importantly, you can understand the latency breakdown of your endpoint calls. A lot of this can be captured with the ModelLatency and OverheadLatency metrics, as illustrated in the following diagram.
The ModelLatency metric captures the time that inference takes within the model container behind a SageMaker endpoint. Note that the model container also includes any custom inference code or scripts that you have passed for inference. This unit is captured in microseconds as an invocation metric, and generally you can graph a percentile across CloudWatch (p99, p90, and so on) to see if you're meeting your target latency. Note that several factors can impact model and container latency, such as the following:
- Custom inference script – Whether you've implemented your own container or used a SageMaker-based container with custom inference handlers, it's best practice to profile your script to catch any operations that are specifically adding a lot of time to your latency.
- Communication protocol – Consider REST vs. gRPC connections to the model server within the model container.
- Model framework optimizations – This is framework specific; for example, with TensorFlow, there are a number of environment variables you can tune that are TF Serving specific. Make sure to check which container you're using and whether there are any framework-specific optimizations you can add within the script or as environment variables to inject into the container.
OverheadLatency is measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This part is largely outside of your control and falls under the time taken by SageMaker overheads.
End-to-end latency as a whole depends on a variety of factors and isn't necessarily the sum of ModelLatency plus OverheadLatency. For example, if your client is making the InvokeEndpoint API call over the internet, from the client's perspective the end-to-end latency would be internet + ModelLatency + OverheadLatency. As such, when load testing your endpoint in order to accurately benchmark the endpoint itself, it's recommended to focus on the endpoint metrics (ModelLatency, OverheadLatency, and InvocationsPerInstance). Any issues related to end-to-end latency can then be isolated separately.
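Invocation metrics live in the AWS/SageMaker namespace, and ModelLatency is reported in microseconds. A minimal sketch of pulling latency percentiles with Boto3 might look like the following (names are placeholders):
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",   # invocation metrics namespace
    MetricName="ModelLatency",   # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint-name"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    ExtendedStatistics=["p99", "p90"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"])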
A few questions to consider for end-to-end latency:
- Where is the client that's invoking your endpoint?
- Are there any intermediary layers between your client and the SageMaker runtime?
Auto scaling
We don't cover auto scaling in this post specifically, but it's an important consideration in order to provision the correct number of instances based on the workload. Depending on your traffic patterns, you can attach an auto scaling policy to your SageMaker endpoint. There are different scaling options, such as TargetTrackingScaling, SimpleScaling, and StepScaling. This allows your endpoint to scale in and out automatically based on your traffic pattern.
A common option is target tracking, where you can specify a CloudWatch metric or a custom metric that you have defined and scale out based on that. A frequent usage of auto scaling is tracking the InvocationsPerInstance metric. After you've identified a bottleneck at a certain TPS, you can often use that as a metric to scale out to a greater number of instances to be able to handle peak loads of traffic. To get a deeper breakdown of auto scaling SageMaker endpoints, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.
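As an illustration, attaching a target tracking policy on the built-in InvocationsPerInstance metric with Application Auto Scaling might look roughly like the following sketch (the target value, capacities, and endpoint/variant names are assumptions):
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint-name/variant/AllTraffic"  # placeholder endpoint/variant

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

# Target tracking on the built-in InvocationsPerInstance metric
autoscaling.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # illustrative invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)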
Load testing
Although we utilize Locust to demonstrate how we can load test at scale, if you're trying to right size the instance behind your endpoint, SageMaker Inference Recommender is a more efficient option. With third-party load testing tools, you have to manually deploy endpoints across different instances. With Inference Recommender, you can simply pass an array of the instance types you want to load test against, and SageMaker will spin up jobs for each of those instances.
Locust
For this example, we use Locust, an open-source load testing tool that you can implement using Python. Locust is similar to many other open-source load testing tools, but has a few specific benefits:
- Easy to set up – As we demonstrate in this post, we'll pass a simple Python script that can easily be refactored for your specific endpoint and payload.
- Distributed and scalable – Locust is event-based and uses gevent under the hood. This is very useful for testing highly concurrent workloads and simulating thousands of concurrent users. You can achieve high TPS with a single process running Locust, but it also has a distributed load generation feature that enables you to scale out to multiple processes and client machines, as we explore in this post.
- Locust metrics and UI – Locust also captures end-to-end latency as a metric. This can help supplement your CloudWatch metrics to paint a full picture of your tests. This is all captured in the Locust UI, where you can track concurrent users, workers, and more.
To further understand Locust, check out their documentation.
Amazon EC2 setup
You can set up Locust in whatever environment works for you. For this post, we set up an EC2 instance and install Locust there to conduct our tests. We use a c5.18xlarge EC2 instance. The client-side compute power is also something to consider. When you run out of compute power on the client side, this is often not captured and is mistaken for a SageMaker endpoint error. It's important to place your client in a location with enough compute power to handle the load that you're testing at. For our EC2 instance, we use an Ubuntu Deep Learning AMI, but you can utilize any AMI as long as you can properly set up Locust on the machine. To understand how to launch and connect to your EC2 instance, refer to the tutorial Get started with Amazon EC2 Linux instances.
The Locust UI is accessible via port 8089. We can open this by adjusting our inbound security group rules for the EC2 instance. We also open up port 22 so we can SSH into the EC2 instance. Consider scoping the source down to the specific IP address you're accessing the EC2 instance from.
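If you prefer to script this change rather than use the console, a minimal sketch with Boto3 might look like the following (the security group ID and source CIDR are placeholders):
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 8089,
            "ToPort": 8089,
            "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "Locust UI access"}],
        }
    ],
)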
After you're connected to your EC2 instance, we set up a Python virtual environment and install the open-source Locust API via the CLI:
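The exact commands depend on your AMI, but on an Ubuntu-based instance the setup might look roughly like this:
python3 -m venv locust-env          # create an isolated virtual environment
source locust-env/bin/activate
pip install --upgrade pip
pip install locust boto3            # Locust for load generation, Boto3 for invoking the endpoint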
We're now ready to work with Locust for load testing our endpoint.
Locust testing
All Locust load tests are conducted based on a Locust file that you provide. This Locust file defines a task for the load test; this is where we define our Boto3 invoke_endpoint API call, as sketched after this paragraph.
When adapting the script, modify the invoke endpoint call parameters to suit your specific model invocation. The InvokeEndpoint API call inside the Locust task is our load test run point. The Locust file we're using is locust_script.py.
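The full locust_script.py is in the GitHub repository; the following is a minimal sketch of its general shape, assuming the endpoint name is passed through an ENDPOINT_NAME environment variable (a hypothetical convention for this sketch) and the same JSON payload we used earlier:
import json
import os
import time

import boto3
from locust import User, task, between


class SageMakerUser(User):
    wait_time = between(0, 1)

    def __init__(self, environment):
        super().__init__(environment)
        self.client = boto3.client("sagemaker-runtime")
        self.endpoint_name = os.environ["ENDPOINT_NAME"]  # assumed to be set by the launcher script
        self.payload = json.dumps({"inputs": "I am super happy right now."})

    @task
    def invoke_endpoint(self):
        # Load test run point: each task execution is one InvokeEndpoint call
        start = time.time()
        exception = None
        try:
            response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                ContentType="application/json",
                Body=self.payload,
            )
            response["Body"].read()
        except Exception as e:  # surface 4xx/5xx errors in Locust's statistics
            exception = e
        # Report the call back to Locust so latency and failures show up in its stats and UI
        self.environment.events.request.fire(
            request_type="sagemaker",
            name="InvokeEndpoint",
            response_time=(time.time() - start) * 1000,
            response_length=0,
            response=None,
            context={},
            exception=exception,
        )
Firing the request event is what makes each InvokeEndpoint call appear in Locust's statistics, because we're using a custom Boto3 client rather than Locust's built-in HTTP client.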
Now that we have our Locust script ready, we want to run distributed Locust tests to stress test our single instance and find out how much traffic it can handle.
Locust distributed mode is a little more nuanced than a single-process Locust test. In distributed mode, we have one primary process and multiple worker processes. The primary process instructs the workers on how to spawn and control the concurrent users that are sending requests. In our distributed.sh script, we see that by default 240 users are distributed across the 60 workers. Note that the --headless flag in the Locust CLI removes the UI feature of Locust.
./distributed.sh huggingface-pytorch-inference-2022-10-04-02-46-44-677 #to execute Distributed Locust test
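The actual distributed.sh is in the GitHub repository; as a rough sketch under the assumptions above (60 workers, 240 users, headless mode), such a script might look something like this:
#!/bin/bash
# Usage: ./distributed.sh <endpoint-name>
export ENDPOINT_NAME=$1
WORKERS=60
USERS=240

# Primary (master) process coordinates the workers and aggregates statistics
locust -f locust_script.py --headless --master --expect-workers $WORKERS \
    -u $USERS -r 60 --run-time 5m &

# Worker processes generate the actual load
for ((i = 0; i < WORKERS; i++)); do
    locust -f locust_script.py --worker --master-host=localhost &
done
wait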
We first run the distributed test on a single instance backing the endpoint. The idea here is that we want to fully maximize a single instance to understand the instance count we need to achieve our target TPS while staying within our latency requirements. Note that if you want to access the UI, change the Locust_UI environment variable to True, take the public IP of your EC2 instance, and map port 8089 to the URL.
The following screenshot shows our CloudWatch metrics.
Eventually, we notice that although we initially achieve a TPS of 200, we start seeing 5xx errors in our EC2 client-side logs, as shown in the following screenshot.
We can also verify this with our instance-level metrics, specifically CPUUtilization.
Here we notice CPUUtilization at nearly 4,800%. Our ml.m5.12xlarge instance has 48 vCPUs (48 * 100 = 4,800). This is saturating the entire instance, which also helps explain our 5xx errors. We also see an increase in ModelLatency.
It seems as if our single instance is getting toppled and doesn't have the compute to sustain a load past the 200 TPS that we're observing. Our target TPS is 1000, so let's try to increase our instance count to 5. This might have to be even more in a production setting, because we were observing errors at 200 TPS after a certain point.
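One way to move to 5 instances without redeploying is to update the variant's desired instance count; a minimal sketch is below (the endpoint name is a placeholder, and AllTraffic is the SageMaker Python SDK's default variant name):
import boto3

sm_client = boto3.client("sagemaker")
sm_client.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint-name",  # placeholder
    DesiredWeightsAndCapacities=[
        {"VariantName": "AllTraffic", "DesiredInstanceCount": 5}
    ],
)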
We can see in both the Locust UI and CloudWatch logs that we have a TPS of nearly 1000 with 5 instances backing the endpoint.
If you start experiencing errors even with this hardware setup, make sure to monitor CPUUtilization to understand the full picture behind your endpoint hosting. It's crucial to understand your hardware utilization to see whether you need to scale up or even down. Sometimes container-level problems lead to 5xx errors, but if CPUUtilization is low, it indicates that it's not your hardware but something at the container or model level that may be leading to these issues (the proper environment variable for the number of workers not being set, for example). On the other hand, if you notice your instance is getting fully saturated, it's a sign that you need to either increase the current instance fleet or try out a larger instance with a smaller fleet.
Although we increased the instance count to 5 to handle 1000 TPS, we can see that the ModelLatency metric is still high. This is due to the instances being saturated. In general, we suggest aiming to utilize the instance's resources between 60–70%.
Clean up
After load testing, make sure to clean up any resources you won't use, via the SageMaker console or through the delete_endpoint Boto3 API call. In addition, make sure to stop your EC2 instance or whatever client setup you have so as not to incur any further charges there as well.
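A minimal cleanup sketch might look like the following (all resource names and IDs are placeholders; delete the endpoint config and model as well if you no longer need them):
import boto3

sm_client = boto3.client("sagemaker")
sm_client.delete_endpoint(EndpointName="my-endpoint-name")                  # stops billing for the endpoint
sm_client.delete_endpoint_config(EndpointConfigName="my-endpoint-config")
sm_client.delete_model(ModelName="my-model-name")

ec2 = boto3.client("ec2")
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])                     # stop the Locust client instance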
Summary
In this post, we described how you can load test your SageMaker real-time endpoint. We also discussed which metrics you should evaluate when load testing your endpoint to understand your performance breakdown. Make sure to check out SageMaker Inference Recommender to further understand instance right-sizing and more performance optimization techniques.
About the Authors
Marc Karp is an ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.