Pitfalls of Terraform
Let's highlight a few pitfalls, including those related to loops, if statements, and deployment techniques, as well as broader issues that affect Terraform as a whole:

  • the count and for_each options have limitations;
  • zero downtime deployment restrictions;
  • even a good plan can fail;
  • refactoring can have its pitfalls;
  • eventual consistency is consistent... eventually.

The count and for_each options have limitations

The examples in this chapter make extensive use of the count parameter and the for_each expression in loops and conditional logic. They work well, but they have two important limitations that you need to be aware of.

  • count and for_each cannot refer to any resource output variables.
  • count and for_each cannot be used in module configuration.

Let's consider each limitation in turn.

count and for_each cannot refer to any resource output variables

Imagine you need to deploy multiple EC2 servers and, for some reason, you don't want to use an ASG. Your code might look like this:

resource "aws_instance" "example_1" {
   count             = 3
   ami                = "ami-0c55b159cbfafe1f0"
   instance_type = "t2.micro"
}

Since the count parameter is set to a static value, this code will work without problems: when you run the apply command, it will create three EC2 servers. But what if you wanted to deploy one server per Availability Zone (AZ) within your current AWS Region? You could have your code load the list of AZs from the aws_availability_zones data source and then loop through each one, creating an EC2 server in it, using the count parameter and array access by index:

resource "aws_instance" "example_2" {
   count                   = length(data.aws_availability_zones.all.names)
   availability_zone   = data.aws_availability_zones.all.names[count.index]
   ami                     = "ami-0c55b159cbfafe1f0"
   instance_type       = "t2.micro"
}

data "aws_availability_zones" "all" {}

This code will also work fine, since the count parameter can refer to data sources without any problems. But what happens if the number of servers you need to create depends on the output of some resource? To demonstrate this, the easiest way is to take the random_integer resource, which, as you might guess from the name, returns a random integer:

resource "random_integer" "num_instances" {
  min = 1
  max = 3
}

This code generates a random number from 1 to 3. Let's see what happens if we try to use the result output of this resource in the count parameter of the aws_instance resource:

resource "aws_instance" "example_3" {
   count             = random_integer.num_instances.result
   ami                = "ami-0c55b159cbfafe1f0"
   instance_type = "t2.micro"
}

If you run terraform plan on this code, you get the following error:

Error: Invalid count argument

   on main.tf line 30, in resource "aws_instance" "example_3":
   30: count = random_integer.num_instances.result

The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the resources that the count depends on.

Terraform requires that count and for_each be computed at plan time, before any resources are created or modified. This means that count and for_each can refer to literals, variables, data sources, and even lists of resources (as long as their length can be determined at plan time), but not to computed resource outputs.
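
Where you control the value, the usual workaround is to feed count something that is known at plan time, such as an input variable. Here is a minimal sketch, assuming a hypothetical variable named num_instances in place of the random_integer output:

variable "num_instances" {
  description = "How many EC2 servers to create (hypothetical variable)"
  type        = number
  default     = 3
}

resource "aws_instance" "example_4" {
  # Variables are known at plan time, so this count is allowed
  count         = var.num_instances
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}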

count and for_each cannot be used in module configuration

Someday you might be tempted to add the count parameter to the module configuration:

module "count_example" {
     source = "../../../../modules/services/webserver-cluster"

     count = 3

     cluster_name = "terraform-up-and-running-example"
     server_port = 8080
     instance_type = "t2.micro"
}

This code tries to use count on the module block to create three copies of the webserver-cluster module. Or you might want to make inclusion of the module optional, depending on some Boolean condition, by setting its count parameter to 0. This looks quite reasonable, but terraform plan produces the following error:

Error: Reserved argument name in module block

   on main.tf line 13, in module "count_example":
   13: count = 3

The name "count" is reserved for use in a future version of Terraform.

Unfortunately, as of the Terraform 0.12.6 release, using count or for_each on a module is not supported. According to the Terraform 0.12 release notes (http://bit.ly/3257bv4), HashiCorp plans to add this feature in the future, so depending on when you read this book, it may already be available. To find out for sure, check the Terraform changelog.
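
As it happens, the feature did land: Terraform 0.13 and later support count and for_each on module blocks. Here is a minimal sketch of the "optional module" pattern mentioned above, assuming Terraform 0.13+ and a hypothetical Boolean variable enable_cluster:

variable "enable_cluster" {
  description = "Whether to deploy the webserver cluster (hypothetical flag)"
  type        = bool
  default     = true
}

module "optional_count_example" {
  source = "../../../../modules/services/webserver-cluster"

  # Zero or one copy of the module, depending on the flag
  # (supported in Terraform 0.13 and later)
  count = var.enable_cluster ? 1 : 0

  cluster_name  = "terraform-up-and-running-example"
  server_port   = 8080
  instance_type = "t2.micro"
}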

Zero Downtime Deployment Limitations

Using the create_before_destroy block in conjunction with an ASG is a great solution for zero-downtime deployments, with one caveat: autoscaling rules are not supported. Or, to be more precise, every deployment resets the ASG size back to min_size, which can be a problem if you were using autoscaling rules to increase the number of running servers.

For example, the webserver-cluster module contains a pair of aws_autoscaling_schedule resources that increase the number of servers in the cluster from two to ten at 9 am. If you deploy at, say, 11 am, the new ASG will boot up with just two servers rather than ten and will stay that way until 9 am the next day.

This limitation can be circumvented in several ways.

  • Change the recurrence setting in aws_autoscaling_schedule from 0 9 * * * ("run at 9 am") to something like 0-59 9-17 * * * ("run every minute from 9 am to 5 pm"; see the sketch after this list). If the ASG already has ten servers, rerunning this autoscaling rule won't change anything, which is what we want. But if the ASG was deployed very recently, this rule ensures that it reaches ten servers within a minute at most. This is not exactly an elegant approach, and big jumps from ten servers down to two and back can also cause problems for users.
  • Create a custom script that uses the AWS API to determine the number of active servers in the ASG, call it via an external data source (see "External Data Source" on page 249), and set the ASG's desired_capacity parameter to the value returned by the script. That way, each new ASG will always start with the same capacity as the old one. The downside is that the custom script makes the code more complicated and harder to maintain.
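
Here is a minimal sketch of the first workaround; the schedule below assumes a hypothetical ASG reference aws_autoscaling_group.example from the webserver-cluster module:

resource "aws_autoscaling_schedule" "scale_out_during_business_hours" {
  scheduled_action_name = "scale-out-during-business-hours"
  min_size              = 2
  max_size              = 10
  desired_capacity      = 10

  # Run every minute from 9 am to 5 pm instead of once at 9 am,
  # so a freshly deployed ASG catches up to ten servers within a minute
  recurrence = "0-59 9-17 * * *"

  autoscaling_group_name = aws_autoscaling_group.example.name
}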

Of course, Terraform should ideally have built-in support for zero-downtime deployments, but as of May 2019, the HashiCorp team has no plans to add this functionality.

A good plan can fail

Sometimes the plan command produces a perfectly valid deployment plan, but the apply command returns an error. Try, for example, adding an aws_iam_user resource with the same name you used for the IAM user you created earlier in Chapter 2:

resource "aws_iam_user" "existing_user" {
   # ΠŸΠΎΠ΄ΡΡ‚Π°Π²ΡŒΡ‚Π΅ сюда имя ΡƒΠΆΠ΅ ΡΡƒΡ‰Π΅ΡΡ‚Π²ΡƒΡŽΡ‰Π΅Π³ΠΎ ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»Ρ IAM,
   # Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΏΠΎΠΏΡ€Π°ΠΊΡ‚ΠΈΠΊΠΎΠ²Π°Ρ‚ΡŒΡΡ Π² использовании ΠΊΠΎΠΌΠ°Π½Π΄Ρ‹ terraform import
   name = "yevgeniy.brikman"
}

Now, if you run the plan command, Terraform will display a seemingly reasonable deployment plan:

Terraform will perform the following actions:

   # aws_iam_user.existing_user will be created
   + resource "aws_iam_user" "existing_user" {
       + arn           = (known after apply)
       + force_destroy = false
       + id            = (known after apply)
       + name          = "yevgeniy.brikman"
       + path          = "/"
       + unique_id     = (known after apply)
     }

Plan: 1 to add, 0 to change, 0 to destroy.

If you run the apply command, you get the following error:

Error: Error creating IAM User yevgeniy.brikman: EntityAlreadyExists:
User with name yevgeniy.brikman already exists.

   on main.tf line 10, in resource "aws_iam_user" "existing_user":
   10: resource "aws_iam_user" "existing_user" {

The problem, of course, is that an IAM user with that name already exists. And this can happen not just with IAM users but with almost any resource. Perhaps someone created the resource manually or via the command line, but whatever the case, matching IDs lead to conflicts. There are many variations of this error that often take Terraform newbies by surprise.

The key point is that the terraform plan command only takes into account resources recorded in the Terraform state file. If resources were created in some other way (for example, manually, by clicking around the AWS console), they will not be in the state file, and therefore Terraform will not take them into account when executing the plan command. As a result, a plan that seems correct at first glance will fail when applied.

There are two lessons to be learned from this.

  • If you have already started working with Terraform, do not use anything else. If part of your infrastructure is managed by Terraform, you can no longer change it manually. Otherwise, you not only risk getting weird Terraform bugs, but also negate many of the benefits of IaC because the code is no longer an accurate representation of your infrastructure.
  • If you already have some infrastructure, use the import command. If you start using Terraform with existing infrastructure, you can add it to the state file using the terraform import command; this way, Terraform will know what infrastructure it should manage. The import command takes two arguments. The first is the address of the resource in your configuration files, which uses the same syntax as resource references: <PROVIDER>_<TYPE>.<NAME> (like aws_iam_user.existing_user). The second argument is the ID of the resource to be imported. For example, the ID of an aws_iam_user resource is the user name (such as yevgeniy.brikman), and the ID of an aws_instance resource is the EC2 server ID (like i-190e22e5). How to import a resource is usually described in the documentation at the bottom of its page.

    The following import command synchronizes the aws_iam_user resource you added to your Terraform configuration with the IAM user you created back in Chapter 2 (substitute your own user name for yevgeniy.brikman, of course):

    $ terraform import aws_iam_user.existing_user yevgeniy.brikman

    Terraform will use the AWS API to find your IAM user and create an association in the state file between it and the aws_iam_user.existing_user resource in your Terraform configuration. From now on, when you run the plan command, Terraform will know that the IAM user already exists and will not try to create it again.

    It should be noted that if you already have a lot of resources that you want to import into Terraform, manually writing the code and importing each one in turn can be a hassle. Therefore, you should look into a tool like Terraforming (http://terraforming.dtan4.net/), which can automatically import code and state from your AWS account.

Refactoring can have its pitfalls

Refactoring is a common practice in programming: you change the internal structure of the code while leaving its external behavior unchanged, to make the code clearer, cleaner, and easier to maintain. Refactoring is an indispensable technique that should be applied regularly. But when it comes to Terraform, or any other IaC tool, you have to be very careful about what you mean by the "external behavior" of a piece of code, or you'll run into unforeseen problems.

For example, a common type of refactoring is renaming variables or functions to make them more understandable. Many IDEs have built-in support for refactoring and can automatically rename variables and functions throughout a project. In a general-purpose programming language this is a trivial procedure that you don't have to think about, but in Terraform you should be extremely careful with it, or you may experience outages.

For example, the webserver-cluster module has an input variable cluster_name:

    variable "cluster_name" {
       description = "The name to use for all the cluster resources"
       type          = string
    }

Imagine that you started using this module to deploy a microservice called foo. Later, you might want to rename your service to bar. This change may seem trivial, but in reality it can cause outages.

The fact is that the webserver-cluster module uses the cluster_name variable in a number of resources, including the name parameter of two security groups and the ALB:

    resource "aws_lb" "example" {
       name                    = var.cluster_name
       load_balancer_type = "application"
       subnets = data.aws_subnet_ids.default.ids
       security_groups      = [aws_security_group.alb.id]
    }

If you change the name parameter of a resource, Terraform will delete the old version of the resource and create a new one to replace it. But if that resource is an ALB, then between the deletion of the old version and the creation of the new one you will have no mechanism for redirecting traffic to your web servers. Similarly, if a security group is deleted, your servers will start rejecting all network traffic until the new group is created.

Another kind of refactoring that you might be interested in is changing a Terraform identifier. Let's take the aws_security_group resource in the webserver-cluster module as an example:

    resource "aws_security_group" "instance" {
      # (...)
    }

The identifier for this resource is called instance. Imagine that during refactoring you decide to change it to the more understandable (in your opinion) name cluster_instance:

    resource "aws_security_group" "cluster_instance" {
       # (...)
    }

What will eventually happen? That's right: an outage.

Terraform associates each resource identifier with an ID from the cloud provider. For example, an iam_user identifier maps to an AWS IAM user ID, and an aws_instance identifier maps to an AWS EC2 server ID. If you change a resource identifier (say, from instance to cluster_instance, as in the case of aws_security_group), it looks to Terraform as though you deleted the old resource and added a new one. If you apply these changes, Terraform will delete the old security group and create a new one, and in the meantime your servers will reject all network traffic.

Here are the four main lessons you should take away from this discussion.

  • Always use the plan command. It can reveal all of these snags. Carefully review its output and pay attention to situations where Terraform plans to delete resources that most likely should not be deleted.
  • Create before you delete. If you want to replace a resource, think carefully about whether you need to create the replacement before deleting the original. If the answer is yes, create_before_destroy can help (see the sketch after this list). The same result can be achieved manually in two steps: first add the new resource to the configuration and run the apply command, then remove the old resource from the configuration and run the apply command again.
  • Changing identifiers requires changing state. If you want to change the identifier associated with a resource (for example, rename an aws_security_group from instance to cluster_instance) while avoiding deletion of the resource and creation of a new one, you must update the Terraform state file accordingly. Never do this manually; use the terraform state command instead. When renaming identifiers, run the terraform state mv command, which has the following syntax:

    terraform state mv <ORIGINAL_REFERENCE> <NEW_REFERENCE>

    ORIGINAL_REFERENCE is an expression that refers to the resource in its current form, and NEW_REFERENCE is the place you want to move it to. For example, if you are renaming an aws_security_group from instance to cluster_instance, you would run the following command:

    $ terraform state mv \
        aws_security_group.instance \
        aws_security_group.cluster_instance

    This tells Terraform that the state previously associated with aws_security_group.instance should now be associated with aws_security_group.cluster_instance. If, after renaming the identifier and running this command, terraform plan shows no changes, then you did everything right.
  • Some parameters cannot be changed. The parameters of many resources are immutable. If you try to change them, Terraform will delete the old resource and create a new one to replace it. The documentation page for each resource usually notes what happens when you change a parameter, so be sure to check it. Always use the plan command and consider whether the create_before_destroy strategy applies.
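
For reference, here is a minimal sketch of the create_before_destroy setting mentioned in the second lesson; the resource shown is illustrative:

resource "aws_security_group" "instance" {
  # (...)

  # Tell Terraform to create the replacement resource first
  # and delete the old one only after the new one exists
  lifecycle {
    create_before_destroy = true
  }
}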

Eventual Consistency Is Consistent... Eventually

The APIs of some cloud providers, such as AWS, are asynchronous and eventually consistent. Asynchrony means that the API can return a response immediately, without waiting for the requested action to complete. Eventual consistency means that changes take time to propagate throughout the system; while this is happening, your responses may be inconsistent, depending on which replica of the data source responds to your API calls.

Imagine, for example, that you make an API call to AWS asking it to create an EC2 server. The API will return a "success" response (201 Created) almost instantly, without waiting for the server creation to complete. If you try to connect to the server right away, it will almost certainly fail, because AWS is still provisioning resources at that point, or, alternatively, the server hasn't booted yet. Moreover, if you make another call to get information about this server, you might get an error (404 Not Found). The thing is, the information about the EC2 server may still be propagating through AWS; it may take a few seconds before it's available everywhere.

Whenever you use an asynchronous, eventually consistent API, you are supposed to retry your request periodically until the action completes and propagates through the system. Unfortunately, the AWS SDK doesn't provide any good tools for this, and the Terraform project used to be plagued by bugs like 6813 (https://github.com/hashicorp/terraform/issues/6813):

$ terraform apply
aws_subnet.private-persistence.2: InvalidSubnetID.NotFound:
The subnet ID 'subnet-xxxxxxx' does not exist

In other words, you create a resource (say, a subnet) and then try to fetch some information about it (like the ID of the newly created subnet), and Terraform cannot find it. Most of these bugs (including 6813) have already been fixed, but they still show up from time to time, especially when Terraform adds support for a new type of resource. This is annoying, but in most cases harmless: when you rerun terraform apply, everything should work, because by then the information will have propagated throughout the system.

This excerpt is from Yevgeniy Brikman's book "Terraform: Up and Running".

Source: habr.com
