6 CloudFormation Lessons I Learned for Life

I started working with CloudFormation 4 years ago. Since then, I have broken many infrastructures, including ones that were already in production. But every time I messed something up, I learned something new. Here are some of the most important lessons from that experience.

Lesson 1: Test Changes Before Deploying Them

I learned this lesson soon after I started working with CloudFormation. I don't remember exactly what I broke, but I clearly remember that I used the command aws cloudformation update-stack. This command simply rolls out the template without giving you a chance to review the changes that will be deployed. I don't think I need to explain why you should test all changes before deploying them.

After this failure, I immediately changed the deployment pipeline, replacing the update command with create-change-set:

# OPERATION is either "UPDATE" or "CREATE"
# TPL_PATH must be a file:// URI (for example file://template.yaml),
# because --template-body expects the template content, not a bare path
changeset_id=$(aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "$OPERATION" \
    --parameters "$PARAMETERS" \
    --output text \
    --query Id)

aws cloudformation wait \
    change-set-create-complete --change-set-name "$changeset_id"
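
The comment above says OPERATION is either "UPDATE" or "CREATE", but the script has to pick one. A minimal sketch of one way to do that is to check whether the stack already exists. The function name detect_operation is hypothetical, and the sketch assumes that any failure of describe-stacks means the stack is missing, which is not true for credential or network errors:

```shell
# Hypothetical helper: prints "UPDATE" if the stack exists, "CREATE" otherwise.
# Assumption: any describe-stacks failure means "stack does not exist"; a real
# script should tell that apart from credential or network errors.
detect_operation() {
    if aws cloudformation describe-stacks --stack-name "$1" >/dev/null 2>&1; then
        echo "UPDATE"
    else
        echo "CREATE"
    fi
}
```

It could be used as OPERATION=$(detect_operation "$STACK_NAME") before creating the changeset.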

Creating a changeset does not affect the existing stack in any way. Unlike the update command, the changeset approach does not trigger an actual deployment. Instead, it produces a list of changes that you can review before deploying. You can view the changes in the AWS console, but if you prefer to automate everything you can, check them in the CLI:

# this command is presented only for demonstrational purposes.
# the real command should take pagination into account
aws cloudformation describe-change-set \
    --change-set-name "$changeset_id" \
    --query 'Changes[*].ResourceChange.{Action:Action,Resource:ResourceType,ResourceId:LogicalResourceId,ReplacementNeeded:Replacement}' \
    --output table

This command should produce output similar to the following:

--------------------------------------------------------------------
|                         DescribeChangeSet                        |
+---------+--------------------+----------------------+------------+
| Action  | ReplacementNeeded  |      Resource        | ResourceId |
+---------+--------------------+----------------------+------------+
|  Modify | True               |  AWS::ECS::Cluster   |  MyCluster |
|  Replace| True               |  AWS::RDS::DBInstance|  MyDB      |
|  Add    | None               |  AWS::SNS::Topic     |  MyTopic   |
+---------+--------------------+----------------------+------------+

Pay special attention to changes that replace or delete a resource, that is, where ReplacementNeeded is True or where Action shows the resource being removed (the API reports this as Remove). These are the most dangerous changes and usually result in data loss.
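
Spotting such rows by eye gets harder as the list grows, so the check can be scripted. The following is a sketch only: the function names are hypothetical, and it assumes the API reports deletions as Action "Remove" and replacements through the Replacement field ("True" or "Conditional"). It asks the CLI for tab-separated text instead of a table and filters it with awk:

```shell
# Keeps only rows that remove a resource or may replace it.
# Input: tab-separated "Action<TAB>Replacement<TAB>ResourceType<TAB>LogicalResourceId".
find_dangerous_changes() {
    awk -F '\t' '$1 == "Remove" || $2 == "True" || $2 == "Conditional"'
}

# Hypothetical wrapper: returns non-zero if the change set has dangerous changes.
# Like the command above, it ignores pagination.
check_change_set() {
    dangerous=$(aws cloudformation describe-change-set \
        --change-set-name "$1" \
        --query 'Changes[*].ResourceChange.[Action,Replacement,ResourceType,LogicalResourceId]' \
        --output text | find_dangerous_changes)
    if [ -n "$dangerous" ]; then
        printf 'Dangerous changes detected:\n%s\n' "$dangerous" >&2
        return 1
    fi
}
```

A pipeline could call check_change_set "$changeset_id" and stop, or demand an extra confirmation, when it fails.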

Once the changes have been reviewed, they can be rolled out:

aws cloudformation execute-change-set --change-set-name "$changeset_id"

operation_lowercase=$(echo "$OPERATION" | tr '[:upper:]' '[:lower:]')
aws cloudformation wait "stack-${operation_lowercase}-complete" \
    --stack-name "$STACK_NAME"
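
To make the review step part of the script, execution can be gated on an explicit confirmation. This is a rough sketch with hypothetical function names; it reads the answer from standard input and treats anything other than "y" or "yes" as a refusal:

```shell
# Hypothetical helper: asks for confirmation on stdin; succeeds only on "y" or "yes".
confirm_deploy() {
    printf 'Execute change set %s? [y/N] ' "$1" >&2
    read -r answer || return 1
    case "$answer" in
        y|Y|yes|YES) return 0 ;;
        *) return 1 ;;
    esac
}

# Runs the change set only after confirmation.
execute_if_confirmed() {
    if confirm_deploy "$1"; then
        aws cloudformation execute-change-set --change-set-name "$1"
    else
        echo "Deployment cancelled." >&2
        return 1
    fi
}
```

Defaulting to "no" means that pressing Enter by accident does not start a deployment.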

Lesson 2: Use a Stack Policy to Prevent Replacement or Deletion of Stateful Resources

Sometimes just seeing the changes is not enough. We are all human and we all make mistakes. Shortly after we started using changesets, a teammate of mine unknowingly ran a deployment that caused a database to be replaced. Nothing terrible happened because it was a testing environment.

Even though our scripts displayed the list of changes and asked for confirmation, the Replace change was overlooked because the list was too long to fit on the screen. And because it was a routine update in a testing environment, nobody paid much attention to the changes.

There are resources that you never want to replace or delete. These are stateful services such as an RDS database instance or an Elasticsearch cluster. It would be nice if AWS automatically refused to deploy when the operation would require removing such a resource. Luckily, CloudFormation has a built-in way to do this. It is called a stack policy, and you can read more about it in the documentation. The following script sets one:

STACK_NAME=$1
RESOURCE_ID=$2

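# Once a stack policy is set, every update is denied unless it is explicitly
# allowed, so the policy needs an Allow statement next to the Deny
POLICY_JSON=$(cat <<EOF
{
    "Statement" : [{
        "Effect" : "Deny",
        "Action" : [
            "Update:Replace",
            "Update:Delete"
        ],
        "Principal": "*",
        "Resource" : "LogicalResourceId/$RESOURCE_ID"
    }, {
        "Effect" : "Allow",
        "Action" : "Update:*",
        "Principal": "*",
        "Resource" : "*"
    }]
}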
EOF
)

aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
    --stack-policy-body "$POLICY_JSON"
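
A stack usually has more than one stateful resource, and set-stack-policy replaces the whole policy each time it is called. As a sketch under those assumptions, the hypothetical helper below builds a single policy with one Deny statement per logical resource ID, followed by an Allow statement for everything else (without it, a stack with a policy denies all updates by default):

```shell
# Hypothetical helper: prints a stack policy that protects every logical
# resource ID passed as an argument and allows all other updates.
build_stack_policy() {
    printf '{\n  "Statement": [\n'
    for resource_id in "$@"; do
        printf '    {"Effect": "Deny", "Action": ["Update:Replace", "Update:Delete"], "Principal": "*", "Resource": "LogicalResourceId/%s"},\n' "$resource_id"
    done
    printf '    {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}\n'
    printf '  ]\n}\n'
}

# Applies the generated policy: apply_stack_policy STACK_NAME RESOURCE_ID...
apply_stack_policy() {
    stack_name=$1
    shift
    aws cloudformation set-stack-policy --stack-name "$stack_name" \
        --stack-policy-body "$(build_stack_policy "$@")"
}
```

For example, apply_stack_policy "$STACK_NAME" MyDB MyCache would protect two resources, where MyDB and MyCache stand for logical IDs from your own template.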

Lesson 3: Use UsePreviousValue when updating the stack with secret parameters

When you create an RDS MySQL instance, AWS requires you to provide a MasterUsername and MasterUserPassword. Since it's better not to store secrets in source code and I wanted to automate absolutely everything, I implemented a "smart mechanism": before deployment, credentials are fetched from S3, and if none are found, new credentials are generated and stored in S3.

These credentials are then passed as parameters to the cloudformation create-change-set command. While I was experimenting with the script, the connection to S3 was lost, and my "smart mechanism" took that as a signal to generate new credentials.

If I had been using this script in a production environment and the connectivity issue had recurred, it would have updated the stack with new credentials. In my case nothing bad happened, but I abandoned this approach and switched to a different one: I provide credentials only once, when creating the stack. Later, when the stack needs to be updated, instead of specifying the secret value of the parameter, I simply use UsePreviousValue=true:

aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "UPDATE" \
    --parameters "ParameterKey=MasterUserPassword,UsePreviousValue=true"
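
If the same script handles both stack creation and updates, the choice between passing the secret and reusing it can be made in one place. The helper below is a hypothetical sketch: it prints a --parameters entry in the CLI shorthand syntax and assumes the secret contains no commas or other characters that the shorthand syntax treats specially:

```shell
# Hypothetical helper: builds one --parameters entry for a secret parameter.
# On CREATE the value is passed in; on UPDATE the stored value is reused.
secret_parameter() {
    key=$1 operation=$2 value=${3:-}
    case "$operation" in
        CREATE)
            if [ -z "$value" ]; then
                echo "A value for $key is required when creating the stack" >&2
                return 1
            fi
            printf 'ParameterKey=%s,ParameterValue=%s\n' "$key" "$value"
            ;;
        UPDATE)
            printf 'ParameterKey=%s,UsePreviousValue=true\n' "$key"
            ;;
        *)
            echo "Unknown operation: $operation" >&2
            return 1
            ;;
    esac
}
```

The secret is required only on CREATE, so an update never depends on being able to fetch or generate it.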

Lesson 4: Use Rollback Configuration

Another team I worked with used a CloudFormation feature called rollback configuration. I hadn't come across it before and quickly realized that it would make deploying my stacks even better. Now I use it every time I deploy my code to Lambda or ECS with CloudFormation.

How it works: you specify a CloudWatch alarm in the --rollback-configuration parameter when you create the changeset. Later, when you execute the changeset, AWS monitors the alarm for the monitoring time you specified (one minute in the example below). It rolls back the deployment if the alarm changes state to ALARM during this time.

Below is an excerpt from a CloudFormation template in which I create a CloudWatch alarm that tracks a custom metric: the number of errors in the CloudWatch logs (the metric is generated via a MetricFilter):

Resources:
  # this metric tracks number of errors in the cloudwatch logs. In this
  # particular case it's assumed logs are in json format and the error logs are
  # identified by level "error". See FilterPattern
  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref LogGroup
      FilterPattern: !Sub '{$.level = "error"}'
      MetricTransformations:
      - MetricNamespace: !Sub "${AWS::StackName}-log-errors"
        MetricName: Errors
        MetricValue: 1
        DefaultValue: 0

  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      Namespace: !Sub "${AWS::StackName}-log-errors"
      MetricName: Errors
      Statistic: Maximum
      ComparisonOperator: GreaterThanThreshold
      Period: 60 # in seconds, i.e. 1 minute
      EvaluationPeriods: 1
      Threshold: 0
      TreatMissingData: notBreaching
      ActionsEnabled: true

Now the alarm can be used as a rollback trigger when creating the changeset:

ALARM_ARN=$1

ROLLBACK_TRIGGER=$(cat <<EOF
{
  "RollbackTriggers": [
    {
      "Arn": "$ALARM_ARN",
      "Type": "AWS::CloudWatch::Alarm"
    }
  ],
  "MonitoringTimeInMinutes": 1
}
EOF
)

aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "$TPL_PATH" \
    --change-set-type "UPDATE" \
    --rollback-configuration "$ROLLBACK_TRIGGER"
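
The script above takes the alarm ARN as an argument. One possible way to obtain it, sketched here with hypothetical function names, is to look the alarm up by name. This assumes the alarm follows the "<stack name>-errors" naming from the template excerpt and already exists from an earlier deployment:

```shell
# Hypothetical helper: prints the ARN of the "<stack name>-errors" alarm.
alarm_arn_for_stack() {
    aws cloudwatch describe-alarms \
        --alarm-names "$1-errors" \
        --query 'MetricAlarms[0].AlarmArn' \
        --output text
}

# Builds the JSON for --rollback-configuration from an alarm ARN and an
# optional monitoring time in minutes (defaults to 1).
rollback_configuration() {
    printf '{"RollbackTriggers": [{"Arn": "%s", "Type": "AWS::CloudWatch::Alarm"}], "MonitoringTimeInMinutes": %s}\n' "$1" "${2:-1}"
}
```

With these, ROLLBACK_TRIGGER=$(rollback_configuration "$(alarm_arn_for_stack "$STACK_NAME")") would produce the same JSON shape as the heredoc above.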

Lesson 5: Make sure you deploy the latest version of the template

It's easy to deploy a CloudFormation template that isn't up to date, and doing so can cause a lot of damage. It happened to us once: a developer didn't pull the latest changes from Git and unknowingly deployed a previous version of the stack. This resulted in downtime for the application that used this stack.

Something as simple as adding a check to see if a branch is up to date before deploying would be fine (assuming git is your version control tool):

git fetch
HEADHASH=$(git rev-parse HEAD)
UPSTREAMHASH=$(git rev-parse master@{upstream})

if [[ "$HEADHASH" != "$UPSTREAMHASH" ]] ; then
   echo "Branch is not up to date with origin. Aborting"
   exit 1
fi
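
The check above compares commits only, so local edits that were never committed would still be deployed. As a complementary sketch (the function name is hypothetical), the working tree can be checked for uncommitted or untracked files with git status --porcelain, which prints nothing when the tree is clean:

```shell
# Hypothetical helper: fails if the working tree has uncommitted or untracked changes.
ensure_clean_worktree() {
    if [ -n "$(git status --porcelain)" ]; then
        echo "Working tree has uncommitted changes. Aborting" >&2
        return 1
    fi
}
```

Calling ensure_clean_worktree || exit 1 before the deployment covers that case.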

Lesson 6: Don't Reinvent the Wheel

It may seem that deploying with CloudFormation is easy. You just need a bunch of bash scripts that execute AWS CLI commands.

4 years ago I started with simple scripts that called the aws cloudformation create-stack command. Soon the script was no longer simple. Each lesson learned made it more and more complex. It became not only hard to maintain, but also full of bugs.

Now I work in a small IT department. Experience shows that each team has its own way of deploying CloudFormation stacks. And that's bad. It would be better if everyone used the same approach. Fortunately, there are many tools available to help you deploy and configure CloudFormation stacks.

These lessons will help you avoid mistakes.

Source: habr.com
