Hi all! My name is Kirill, I'm CTO at Adapty. Most of our architecture is on AWS, and today I'm going to talk about how we cut server costs by 3x by using Spot Instances in production, and how to autoscale them. First there will be an overview of how it works, and then detailed instructions for getting started.
What are Spot Instances?
Here are some screenshots that show the price history for Spot Instances.
m5.large in eu-west-1 (Ireland). The price has been mostly stable for 3 months, saving 2.9x at the moment.
m5.large in the us-east-1 (N. Virginia) region. Price fluctuates continuously over the course of 3 months, with current savings ranging from 2.3x to 2.8x depending on availability zone.
t3.small in us-east-1 (N. Virginia). Price stable for 3 months, currently saving 3.4x.
Service architecture
The basic architecture of the service, which we will talk about in this article, is shown in the diagram below.
Application Load Balancer → EC2 Target Group → Elastic Container Service
An Application Load Balancer (ALB) receives requests and forwards them to an EC2 Target Group (TG). The TG is responsible for opening ports on the instances for the ALB and binding them to the ports of the Elastic Container Service (ECS) containers. ECS is the AWS analogue of Kubernetes, and it manages the Docker containers.
Several containers with the same ports can run on one instance, so we cannot assign ports statically. When ECS starts a new task (the Kubernetes equivalent of a pod), it checks for free ports on the instance and assigns one of them to the task being launched. The TG also regularly checks the instance and the API running on it with a health check, and if it detects problems, it stops sending requests there.
EC2 Auto Scaling Groups + ECS Capacity Providers
The diagram above does not show the EC2 Auto Scaling Groups (ASG) service. As the name suggests, it is responsible for scaling instances. Until recently, however, AWS had no built-in way to manage the number of running machines from ECS. ECS could scale the number of tasks, for example by CPU usage, RAM, or request count, but if the tasks occupied all free instances, new machines were not launched automatically.
This changed with the arrival of ECS Capacity Providers (ECS CP). Now each service in ECS can be associated with an ASG, and if the tasks no longer fit on the running instances, new ones will be launched (within the configured ASG limits). This also works in the opposite direction: if ECS CP sees instances idling without tasks, it will instruct the ASG to shut them down. ECS CP can also be given a target instance utilization percentage, so that a certain number of machines is always free for quick task scaling; I will come back to this a little later.
EC2 Launch Templates
The last service I will describe before moving on to building this infrastructure is EC2 Launch Templates. It lets you create a template from which all machines will be launched, so that you do not have to repeat the configuration from scratch every time. Here you can select the machine type, security group, disk image, and many other options. You can also specify user data that will be passed to all launched instances. User data can run scripts; for example, you can edit the contents of the ECS agent configuration file.
One of the most important configuration options for this article is user data.
About the disk - AWS recently
Service creation
We now proceed directly to creating the service described above. Along the way, I will mention a few useful points that were not covered earlier. Overall this is a step-by-step guide, but I will not cover some very basic or, on the contrary, very specific cases. All actions are performed in the AWS web console, but they can also be reproduced programmatically with CloudFormation or Terraform. At Adapty, we use Terraform.
EC2 Launch Template
In this service, the configuration of the machines that will be used is created. Templates are managed in the EC2 -> Instances -> Launch templates section.
Amazon machine image (AMI) - specify the disk image from which all instances will be launched. For ECS, in most cases you should use the optimized image from Amazon: it is updated regularly and contains everything needed to run ECS. To find out the current image ID, see the Amazon ECS-optimized AMIs page in the AWS documentation.
Instance type - the type of machine to launch. Choose the one that best suits your workload.
Key pair (login) - specify the certificate with which you can connect to the instance via SSH, if necessary.
Network settings - the network configuration. Networking platform should in most cases be a Virtual Private Cloud (VPC). Security groups - the security groups for your instances. Since a balancer will sit in front of the instances, I recommend specifying a group here that allows incoming connections only from the balancer. That is, you will have two security groups: one for the balancer, allowing inbound connections from anywhere on ports 80 (HTTP) and 443 (HTTPS), and one for the machines, allowing inbound connections on any port from the balancer's group. Outbound TCP connections in both groups must be open to all ports and all addresses. You can restrict outbound ports and addresses, but then you have to constantly watch out for attempts to reach something on a closed port.
Storage (volumes) - the disk parameters for the machines. The disk size cannot be smaller than the one specified in the AMI; for ECS Optimized that is 30 GiB.
advanced details - specify additional parameters.
Purchasing options - whether to buy Spot Instances. We do want to, but we will not check this box here; we will configure it in the Auto Scaling Group, where there are more options.
IAM instance profile - the role with which the instances will be launched. For instances to work in ECS, they need permissions, which usually live in the ecsInstanceRole role. In some accounts it already exists; if not, it can be created by following the AWS documentation.
Then come many more parameters; in most cases you can leave the default values, but each of them has a clear description. I always enable the EBS-optimized instance and T2/T3 Unlimited options if the instance type supports them.
User data - specify the user data. We will edit the /etc/ecs/ecs.config file, which contains the ECS agent configuration.
An example of what user data might look like:
```bash
#!/bin/bash
echo ECS_CLUSTER=DemoApiClusterProd >> /etc/ecs/ecs.config
echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
echo ECS_CONTAINER_STOP_TIMEOUT=1m >> /etc/ecs/ecs.config
echo ECS_ENGINE_AUTH_TYPE=docker >> /etc/ecs/ecs.config
echo 'ECS_ENGINE_AUTH_DATA={"registry.gitlab.com":{"username":"username","password":"password"}}' >> /etc/ecs/ecs.config
```
ECS_CLUSTER=DemoApiClusterProd - indicates that the instance belongs to the cluster with this name, i.e. that cluster will be able to place its tasks on this server. We have not created the cluster yet, but we will use this name when we do.
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true - indicates that when a Spot Instance interruption notice is received, all tasks on it should be moved to the Draining status.
ECS_CONTAINER_STOP_TIMEOUT=1m - gives all tasks 1 minute after receiving the SIGTERM signal before they are killed.
ECS_ENGINE_AUTH_TYPE=docker - indicates that the docker scheme is used as the authentication mechanism.
ECS_ENGINE_AUTH_DATA=... - connection parameters for the private container registry where your Docker images are stored. If the registry is public, nothing needs to be specified.
In this article I will use a public image from Docker Hub, so the ECS_ENGINE_AUTH_TYPE and ECS_ENGINE_AUTH_DATA parameters are not necessary.
Good to know: it is recommended to update the AMI regularly, because new versions include updated versions of Docker, Linux, the ECS agent, etc. To not forget about this, you can subscribe to AWS notifications about new ECS-optimized AMI releases.
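Since everything here can be reproduced with Terraform (which, as mentioned, we use at Adapty), below is a minimal sketch of such a Launch Template as a Terraform resource. The resource names and variables (var.ecs_ami_id, var.instances_sg_id, var.ssh_key_name) are illustrative assumptions, not part of the original setup:

```hcl
# Sketch: Launch Template for ECS instances (names and variables are illustrative)
resource "aws_launch_template" "ecs" {
  name          = "demo-ecs-launch-template"
  image_id      = var.ecs_ami_id   # current ECS-optimized AMI ID
  instance_type = "m5.large"
  key_name      = var.ssh_key_name
  ebs_optimized = true

  vpc_security_group_ids = [var.instances_sg_id]

  # ecsInstanceRole gives the instance the permissions it needs to work in ECS
  iam_instance_profile {
    name = "ecsInstanceRole"
  }

  # The disk cannot be smaller than the AMI requires: 30 GiB for ECS Optimized
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 30
    }
  }

  # The same user data as in the example above; the API expects it base64-encoded
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo ECS_CLUSTER=DemoApiClusterProd >> /etc/ecs/ecs.config
    echo ECS_ENABLE_SPOT_INSTANCE_DRAINING=true >> /etc/ecs/ecs.config
    echo ECS_CONTAINER_STOP_TIMEOUT=1m >> /etc/ecs/ecs.config
  EOF
  )
}
```

Updating the AMI then comes down to changing image_id and applying, which creates a new template version.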
EC2 Auto Scaling Group
The Auto Scaling Group is responsible for launching and scaling instances. Groups are managed in the EC2 -> Auto Scaling -> Auto Scaling Groups section.
launch template - select the template created in the previous step. Leave the default version.
Purchase options and instance types - specify the purchasing options and instance types for the cluster. Adhere to launch template uses the instance type from the Launch Template; Combine purchasing options and instance types lets you configure them flexibly. We will use the latter.
On-Demand base - the number of regular, non-Spot instances that will always be running.
On-Demand percentage above base - the ratio of regular to Spot Instances above the base: 50-50 means they are split equally; with 20-80, four Spot Instances will be launched for each regular one. For this example I will use 50-50, but in reality we most often use 20-80, and in some cases 0-100.
Instance types - here you can specify additional instance types that will be used in the cluster. We have never used this, because I don't really understand the point. Perhaps it relates to the limits for specific instance types, but those are easily increased through support. If you know what it's for, I'd be glad to read about it in the comments :)
Network - network settings, select VPC and subnets for machines, in most cases you should select all available subnets.
Load balancing - balancer settings; we will set the balancer up separately, so we do not touch anything here. Health checks will also be configured later.
Group size - the limits on the number of machines in the cluster and the desired number of machines at start. The number of machines in the cluster will never go below the specified minimum or above the maximum, even if the metrics call for scaling.
Scaling policies - scaling parameters, but we will scale based on running ECS tasks, so we will configure the scaling later.
Instance scale-in protection - protects instances from removal when scaling down. We enable it so that the ASG does not delete a machine that still has running tasks. The ECS Capacity Provider will remove the protection from instances that have no tasks.
Add tags - you can specify tags for the instances (the Tag new instances checkbox must be checked). I recommend specifying the Name tag; then all instances launched within the group will have the same name, which makes them easy to find in the console.
After creating the group, open it and go to the Advanced configurations section, since not all options are visible in the console at creation time.
Termination policies - rules that are applied, in order, when deleting instances. We usually use the ones shown below. First, instances with the oldest Launch Template version are deleted (for example, if we updated the AMI and created a new version, but not all instances have switched to it yet). Then the instances closest to the next billing hour are selected. And finally, the oldest ones by launch date.
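As a rough Terraform sketch (resource names and variables are illustrative, and aws_launch_template.ecs is assumed to be defined as in the previous section), the group with the purchase options and termination policies described above could look like this:

```hcl
# Sketch: Auto Scaling Group mixing On-Demand and Spot Instances
resource "aws_autoscaling_group" "ecs" {
  name                  = "demo-ecs-asg"
  min_size              = 1
  max_size              = 10
  desired_capacity      = 2
  vpc_zone_identifier   = var.subnet_ids   # all available subnets
  protect_from_scale_in = true             # Instance scale-in protection

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1   # always-on non-Spot machines
      on_demand_percentage_above_base_capacity = 50  # 50/50 On-Demand vs Spot
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.ecs.id
        version            = "$Default"
      }
    }
  }

  # Applied in order when scaling in
  termination_policies = [
    "OldestLaunchTemplate",
    "ClosestToNextInstanceHour",
    "OldestInstance",
  ]

  tag {
    key                 = "Name"
    value               = "demo-ecs-instance"
    propagate_at_launch = true
  }
}
```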
Good to know: to update all machines in the cluster, it is convenient to use
Application Load Balancer and EC2 Target Group
The balancer is created in the EC2 → Load Balancing → Load Balancers section. We will use the Application Load Balancer; a comparison of the different load balancer types can be found in the AWS documentation.
Listeners - it makes sense to create ports 80 and 443 and later redirect 80 to 443 with balancer rules.
Availability Zones - in most cases, we select all access zones.
Configure Security Settings - here you specify the SSL certificate and security policy for the balancer; the default ELBSecurityPolicy-2016-08 policy is usually fine. After creating the balancer, you will see its DNS name, to which you need to point a CNAME for your domain. For example, this is how it looks in Cloudflare.
Security Group - create or select a security group for the balancer, I wrote more about this a little higher in the EC2 Launch Template → Network settings section.
Target group - we create a group that is responsible for routing requests from the balancer to the machines and for checking their availability, so that they can be replaced in case of problems. Target type must be Instance; Protocol and Port can be anything. If you use HTTPS between the balancer and the instances, you need to upload a certificate to them. In this example we will not do that and will simply leave port 80.
Health checks - service health check parameters. In a real service this should be a dedicated request that exercises important parts of the business logic; for this example I will leave the default settings. You can also set the request interval, timeout, success codes, etc. In our example we will specify Success codes 200-399, because the Docker image we will use returns a 304 code.
Register Targets - machines for the group are selected here, but in our case, ECS will deal with this, so we just skip this step.
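The two security groups, the balancer, and the Target Group with its health check can be sketched in Terraform roughly as follows (names, the health check path, and the variables are illustrative; the HTTPS listener with the certificate is omitted):

```hcl
# Balancer group: inbound 80/443 from anywhere, outbound TCP open
resource "aws_security_group" "alb" {
  name   = "demo-alb-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Instance group: inbound on any port, but only from the balancer's group
resource "aws_security_group" "instances" {
  name   = "demo-instances-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 0
    to_port         = 65535
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "api" {
  name               = "demo-api-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids
}

# Port 80 here is a placeholder: ECS registers each task's dynamic port itself
resource "aws_lb_target_group" "api" {
  name        = "demo-api-tg"
  target_type = "instance"
  protocol    = "HTTP"
  port        = 80
  vpc_id      = var.vpc_id

  health_check {
    path    = "/"         # a real service should use a dedicated endpoint
    matcher = "200-399"   # the demo image returns 304
  }
}

# Redirect HTTP to HTTPS, as recommended above
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.api.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```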
Good to know: at the level of the balancer, you can enable logs that will be saved in S3 in a certain
ECS Task Definition
In the previous steps, we created everything related to the service infrastructure, now we move on to describing the containers that we will run. This is done in the ECS → Task Definitions section.
Launch type compatibility - select EC2.
Task execution IAM role - choose ecsTaskExecutionRole. It is used to write logs, access secret variables, and so on.
In the Container Definitions section, click Add Container.
Image - link to the image with the project code, in this example I will use a public image from Docker Hub
Memory Limits - memory limits for the container. Hard limit - a hard limit: if the container exceeds the specified value, the docker kill command is executed and the container dies immediately. Soft limit - a soft limit: the container may exceed the specified value, but the parameter is taken into account when placing tasks on machines. For example, if a machine has 4 GiB of RAM and the container's soft limit is 2048 MiB, then at most 2 tasks with this container can run on that machine. In reality, a machine with 4 GiB of RAM reports slightly less than 4096 MiB available, which you can see on the ECS Instances tab of the cluster. The soft limit cannot be greater than the hard limit. Keep in mind that if a task contains several containers, their limits are summed.
Port mappings - in Host port specify 0, which means the port will be assigned dynamically and tracked by the Target Group. Container port - the port your application listens on; it is often set in the launch command, or assigned in your application code, Dockerfile, etc. For our example we use 3000, because that is the port specified in the image's Dockerfile.
health check - container health check parameters, not to be confused with the one configured in the Target Group.
Environment - environment settings. CPU units - similar to Memory limits, but for the processor. Each processor core is 1024 units, so if a server has a dual-core processor and a container is given 512 units, then 4 tasks with this container can be launched on one server. Unlike memory, the available CPU units correspond exactly to the number of cores; there is no overhead.
Command - the command to start the service inside the container, with all arguments separated by commas. It could be gunicorn, npm, etc. If not specified, the CMD directive from the Dockerfile is used. We specify npm,start.
Environment variables - container environment variables. These can be plain text or secret variables from AWS Systems Manager Parameter Store or Secrets Manager.
Storage and Logging - here we configure logging to CloudWatch Logs (the AWS log service). To do this, just enable the Auto-configure CloudWatch Logs checkbox. After the Task Definition is created, a log group will automatically be created in CloudWatch. By default, logs in it are stored indefinitely; I recommend changing the Retention period from Never Expire to a suitable period. This is done in CloudWatch Log groups: click on the current period and select a new one.
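A hedged Terraform sketch of such a Task Definition (the family name, image, region, and log group are placeholders, not the actual values from this example):

```hcl
# Sketch: EC2 Task Definition with dynamic host ports and CloudWatch logging
resource "aws_ecs_task_definition" "api" {
  family             = "demo-api"
  network_mode       = "bridge"                        # needed for dynamic host ports
  execution_role_arn = var.ecs_task_execution_role_arn # ecsTaskExecutionRole

  container_definitions = jsonencode([
    {
      name              = "api"
      image             = "docker.io/example/demo-api:latest" # placeholder image
      cpu               = 512   # half a core in CPU units
      memory            = 1024  # hard limit, MiB: exceeding it kills the container
      memoryReservation = 512   # soft limit, MiB: used for task placement
      command           = ["npm", "start"]
      portMappings = [
        {
          containerPort = 3000
          hostPort      = 0     # 0 = dynamic port, tracked by the Target Group
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/demo-api"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "api"
        }
      }
    }
  ])
}
```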
ECS Cluster and ECS Capacity Provider
Go to the ECS → Clusters section to create a cluster. We choose EC2 Linux + Networking as a template.
Cluster name - very important: here we use the same name as specified in the ECS_CLUSTER parameter of the Launch Template, in our case DemoApiClusterProd. Check the Create an empty cluster checkbox. Optionally, you can enable Container Insights to view service metrics in CloudWatch. If you did everything correctly, you will see the machines created by the Auto Scaling group in the ECS Instances section.
Go to the tab Capacity Providers and create a new one. Let me remind you that it is needed in order to manage the creation and shutdown of machines, depending on the number of running ECS tasks. It is important to note that a provider can only be associated with one group.
auto scaling group - select the previously created group.
managed scaling - enable so that the provider can scale the service.
Target capacity % - what percentage of machines occupied by tasks we want. If you specify 100%, all machines will always be filled with running tasks. If you specify 50%, half of the machines will always be free. In that case, during a sharp spike in load, new tasks will immediately land on the free machines, without having to wait for instances to be deployed.
Managed termination protection - enable this; it allows the provider to remove the scale-in protection from instances. This happens when a machine has no active tasks, and it makes it possible to maintain the Target capacity %.
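In Terraform, the cluster and the Capacity Provider described above could be sketched like this (the provider name and the aws_autoscaling_group.ecs reference are illustrative):

```hcl
resource "aws_ecs_cluster" "prod" {
  name = "DemoApiClusterProd"   # must match ECS_CLUSTER in user data
}

resource "aws_ecs_capacity_provider" "spot" {
  name = "DemoCapacityProvider"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.ecs.arn

    # Lets the provider lift scale-in protection from machines without tasks
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 50   # keep half of the machines free for load spikes
    }
  }
}

# Associate the provider with the cluster and make it the default strategy
resource "aws_ecs_cluster_capacity_providers" "prod" {
  cluster_name       = aws_ecs_cluster.prod.name
  capacity_providers = [aws_ecs_capacity_provider.spot.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 1
  }
}
```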
ECS Service and scaling setup
The last step :) To create a service, you need to go to the previously created cluster on the Services tab.
launch type - you need to click on Switch to capacity provider strategy and select the previously created provider.
Task Definition - select the previously created Task Definition and its revision.
Service name - in order not to get confused, we always specify the same as the Task Definition.
service type - always Replica.
Number of tasks — the desired number of active tasks in the service. This parameter is controlled by scaling, but it still needs to be specified.
Minimum healthy percent and Maximum percent - determine the behavior of tasks during deployment. The default values of 100 and 200 mean that during deployment the number of tasks doubles and then returns to the desired value. If you run 1 task with min=0 and max=100, then during deployment it is killed first and a new one is started afterwards, i.e. there will be downtime. If 1 task runs with min=50 and max=150, the deployment will not happen at all, because 1 task cannot be halved or increased by one and a half times.
Deployment type - leave Rolling update.
Placement Templates - rules for placing tasks on machines. The default is AZ Balanced Spread, which means each new task is placed on a new instance until machines in all availability zones are up. We usually use BinPack - CPU and Spread - AZ: with this policy, tasks are packed as densely as possible onto one machine by CPU, and if a new machine needs to be created, it is created in a new availability zone.
load balancer type - select Application Load Balancer.
Service IAM role - choose ecsServiceRole.
Load balancer name - select the previously created balancer.
Health check grace period - a pause before performing health checks after rolling out a new task, we usually set 60 seconds.
Container to load balance - in the Target group name item, select the group created earlier, and everything will be automatically filled.
Service Auto Scaling — service scaling parameters. Select Configure Service Auto Scaling to adjust your service's desired count. Set the minimum and maximum number of tasks when scaling.
IAM role for Service Auto Scaling - choose AWSServiceRoleForApplicationAutoScaling_ECSService.
Automatic task scaling policies — Rules for scaling. There are 2 types:
- Target tracking - tracking a target metric (CPU/RAM usage or the number of requests per task). For example, if we want the average CPU load to be 85%: when it rises above that, new tasks are added until it returns to the target value. If the load is lower, tasks are removed instead, unless scale-in protection is enabled (Disable scale-in).
- Step scaling - a reaction to an arbitrary event. Here you can configure a reaction to any event (CloudWatch Alarm): when it occurs, you can add or remove a specified number of tasks, or set an exact number of tasks.
A service can have several scaling rules, this can be useful, the main thing is to make sure that they do not conflict with each other.
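Putting the service and its scaling together, a Terraform sketch might look like the following (service and policy names are illustrative, and the referenced resources are assumed to be defined as in the earlier steps):

```hcl
resource "aws_ecs_service" "api" {
  name                               = "demo-api"
  cluster                            = "DemoApiClusterProd"
  task_definition                    = aws_ecs_task_definition.api.arn
  desired_count                      = 2
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
  health_check_grace_period_seconds  = 60

  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot.name
    weight            = 1
  }

  # BinPack - CPU and Spread - AZ, as described above
  ordered_placement_strategy {
    type  = "binpack"
    field = "cpu"
  }
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 3000
  }
}

# Scale the desired task count by average CPU (target tracking)
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/DemoApiClusterProd/demo-api"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 85   # average CPU utilization, %
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```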
Conclusion
If you followed the instructions and used the same Docker image, your service should return a page like this.
- We have created a template from which all the machines in the service are launched. We also learned how to update the machines when the template changes.
- We have set up the processing of the Spot instance stop signal, so within a minute after receiving it, all running tasks are removed from the machine, so nothing is lost or interrupted.
- We raised the balancer to evenly distribute the load across the machines.
- We have created a service that runs on Spot Instances, due to this, the cost of machines is reduced by about 3 times.
- We set up autoscaling in both directions to handle the increase in loads, but at the same time not pay for downtime.
- We use the Capacity Provider so that the application manages the infrastructure (machines) and not vice versa.
- We are great.
If you have predictable load spikes, for example, you are advertising in a large email campaign, you can set up scaling on a schedule.
You can also scale based on data from different parts of your system. For example, we have a function
I'd be glad if you share interesting cases of using Spot Instances and ECS in the comments, or anything else related to scaling.
Soon there will be articles about how we process thousands of analytical events per second on a predominantly serverless stack (including the costs) and how services are deployed with GitLab CI and Terraform Cloud.
Subscribe to us, it will be interesting!
Do you use spot instances in production?
- Yes - 22.2% (6 votes)
- No - 66.7% (18 votes)
- I learned about them from this article and plan to use them - 11.1% (3 votes)
27 users voted. 5 users abstained.
Source: habr.com