Switched from Terraform to CloudFormation - and regretted it

Representing infrastructure as code in a repeatable text format is a simple best practice that spares you from configuring systems by clicking around with a mouse. The practice is called Infrastructure as Code, and so far there are two popular tools for it, especially on AWS: Terraform and CloudFormation.

Comparing experience with Terraform and CloudFormation

Before coming to Twitch (aka Amazon Jr.), I worked at a startup and used Terraform for three years. At the new place I also used Terraform heavily, and then the company pushed through a transition to everything Amazon-style, including CloudFormation. I worked hard on developing best practices for both, and I have used both tools in very complex, organization-wide workflows. Later, after thinking carefully about the implications of moving from Terraform to CloudFormation, I became convinced that Terraform was probably the better choice for the organization.

Terraform the Terrible

Beta Software

At the time of writing, Terraform hasn't even released version 1.0, which is a good reason not to use it. It has changed a lot since I first tried it, but back then terraform apply often broke after a few updates or simply after a couple of years of operation. I would say that "everything is different now", but ... that's what everyone says, isn't it? There have been changes that break compatibility with previous versions, although they were reasonable ones, and it even feels like the syntax and the resource abstractions are now what they should be. The tool does seem to have gotten better, but... :-0

AWS, on the other hand, has done a good job of maintaining backward compatibility. This is probably because its services are often thoroughly tested inside the organization and only then published, under a new name. So "well done" is an understatement. Maintaining backward compatibility for the APIs of a system as varied and complex as AWS is incredibly difficult. Anyone who has had to maintain public APIs that are used this widely will understand how hard it is to do so for so many years. In my experience, CloudFormation's behavior has never changed over the years.

Foot, meet bullet

As far as I know, it is not possible to delete a resource that belongs to someone else's CloudFormation stack from your own CF stack. The same is not true of Terraform, which lets you import existing resources into your stack. The feature is amazing, one might say, but with great power comes great responsibility. Once a resource has been added to your stack, you can delete or change it for as long as you work with that stack. One day this backfired. Someone at Twitch, with no ill intent, accidentally imported somebody else's AWS security group into their own Terraform stack. A few commands later ... the security group (along with the incoming traffic) was gone.

Terraform the Great

Recovery from incomplete states

Sometimes CloudFormation cannot fully transition from one state to another. When that happens, it tries to return to the previous state. Unfortunately, this is not always feasible. Debugging what happened afterwards can be scary: you never know whether CloudFormation will tolerate being tampered with, even when the goal is to repair it. It cannot really tell whether a return to the previous state is possible and, by default, hangs for hours waiting for a miracle.

Terraform, on the other hand, tends to recover from failed transitions much more gracefully and offers advanced debugging tools.

Clearer view of state changes

“Okay, load balancer, you are going to change. But how?”

- a concerned engineer, ready to hit the accept button.

Sometimes I need to do some work on a load balancer in a CloudFormation stack, such as adding a port number or changing a security group. CloudFormation shows changes poorly. On pins and needles, I recheck the YAML file ten times to make sure I haven't deleted anything necessary or added anything extra.

Terraform is much more transparent in this regard. Sometimes it is even too transparent (read: it gets on your nerves). Fortunately, the latest version includes an improved display of changes, so now you can see exactly what is changing.

Flexibility

Write software backwards.

To put it bluntly, the most important hallmark of long-lived software is the ability to adapt to change. Write any software backwards. My most common mistake was to take a "simple" service and then cram everything into a single CloudFormation or Terraform stack. And of course, months later it turned out that I had misunderstood everything and the service was not simple at all! Now I had to somehow break the big stack into small components. With CloudFormation, this is only possible by first recreating the existing stack, which is not something I do with my databases. Terraform, on the other hand, let me dissect the stack and split it into smaller, more understandable parts.

Modules in git

Sharing Terraform code across multiple stacks is much easier than sharing CloudFormation code. With Terraform, you can put code into a git repository and refer to it using semantic versioning. Anyone with access to the repository can reuse the shared code. CloudFormation's equivalent is S3, but it doesn't have the same benefits, and there is no reason to abandon git in favor of S3 at all.

As the organization grew, the ability to share common stacks became critical. With Terraform, all of this is easy and natural, while CloudFormation makes you jump through hoops before anything like it works.

Operations as code

"Let's just script it and be done."

- an engineer, 3 years before reinventing the Terraform wheel.

When it comes to software development, a Go or Java program is not just code.

Code as Code

There is also, after all, the infrastructure it runs on.

Infrastructure as Code

But where does it come from? How do you monitor it? Where does your code live? Do developers need access permissions?

Operations as code

Being a software developer is not just about writing code.

AWS is not all there is: you probably use other vendors too, such as SignalFx, PagerDuty, or GitHub. Maybe you have an internal Jenkins server for CI/CD, or an internal Grafana dashboard for monitoring. Infrastructure as code is chosen for a variety of reasons, and every one of them matters just as much for everything else that surrounds the software.

When I was at Twitch, we were spinning up services across a mix of in-house systems and Amazon's AWS. We churned out and maintained a lot of microservices, which drove up our operational costs. The discussions went something like this:

  • Me: Damn, it takes too many steps to spin up one microservice. I have to use this thing to create an AWS account (we moved to 2 accounts per microservice), then this one to set up alerts, then this one for the code repository, and this one for the email list, and this one...
  • Lead: Script it and be done.
  • Me: Fine, but the script itself will change. We will need a way to check that all these internal Amazon gizmos are up to date.
  • Lead: Sounds good. We will write a script for that too.
  • Me: Great! And the script will probably also need to take parameters. Will it accept them?
  • Lead: Of course it will, it has no choice!
  • Me: The process may change, and backward compatibility will be lost. Some kind of semantic versioning is required.
  • Lead: Great idea!
  • Me: The tools can be changed by hand, through the user interface. We need a way to detect and fix that.

…3 years later:

  • Lead: And we ended up with Terraform.

The moral of the story: even if you are head over heels into everything Amazon, you are still using things that do not come from AWS, and those services have state that needs a configuration language to keep it in sync.

CloudFormation Lambda vs Terraform git modules

Lambda is CloudFormation's answer to the question of custom logic. With Lambda you can create macros or custom resources. This approach introduces complexity that Terraform's semantically versioned git modules do not have. For me, the most pressing issue was managing permissions for all those custom Lambdas (across dozens of AWS accounts). Another important issue was the "chicken or egg" problem around the Lambda code: the function is itself infrastructure and code, and it too needs to be monitored and updated. The final nail in the coffin was how hard it was to version changes to the Lambda code semantically; we also had to make sure that a stack's behavior did not change between runs unless we explicitly asked for it.

I remember once wanting to create a canary deployment for an Elastic Beanstalk environment with a classic load balancer. The easiest way would be to create a second EB deployment next to the production environment and take one more step: attach the canary deployment's auto scaling group to the production environment's LB. Since Terraform exposes the Beanstalk ASG as an output, this takes 4 extra lines of Terraform code. When I asked whether there was a comparable solution in CloudFormation, I was pointed to an entire git repository with a deployment pipeline and more, all to do what 4 miserable lines of Terraform code could do.

Better drift detection

Make sure reality matches expectations.

Drift detection is a very powerful feature of operations as code, because it helps make sure that reality matches expectations. It is available in both CloudFormation and Terraform. But as the production stack grew, drift detection in CloudFormation produced more and more false positives.

Terraform gives you much more advanced lifecycle hooks for drift detection. For example, you can set ignore_changes in a lifecycle block for the ECS task definition if you want to ignore changes to a particular task definition without ignoring changes to the entire ECS deployment.
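
To make the idea concrete, here is a toy Python model of drift detection with an ignore list. It is only a sketch of the concept, not how Terraform is implemented: the function detect_drift, the attribute names, and the values are all invented for illustration.

```python
from typing import Any


def detect_drift(
    desired: dict[str, Any],
    actual: dict[str, Any],
    ignore_changes: frozenset[str] = frozenset(),
) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (desired, actual)} for every attribute that differs.

    Attributes listed in ignore_changes are skipped, loosely mirroring the
    intent of Terraform's ignore_changes lifecycle setting.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        if key in ignore_changes:
            continue
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical ECS service: a deploy pipeline outside the stack bumps the
# task definition, and someone also scaled the service by hand.
desired = {"cluster": "prod", "desired_count": 2, "task_definition": "web:41"}
actual = {"cluster": "prod", "desired_count": 3, "task_definition": "web:57"}

all_drift = detect_drift(desired, actual)
filtered_drift = detect_drift(
    desired, actual, ignore_changes=frozenset({"task_definition"})
)
```

With the ignore list in place, the expected task definition change no longer shows up as drift, while the manual change to desired_count is still reported.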

CDK and the Future of CloudFormation

CloudFormation is difficult to manage at a large, cross-infrastructure scale. Many of these difficulties are acknowledged, and the tool needs things like aws-cdk, a framework for defining cloud infrastructure in code and running it through AWS CloudFormation. It will be interesting to see what the future holds for aws-cdk, but it will have a hard time competing with Terraform's other strengths; for CloudFormation to catch up, sweeping changes will be required.

How to keep Terraform from disappointing you

This is “infrastructure as CODE”, not “as text”.

My first impression of Terraform was rather bad. I guess I just didn't understand the approach. At first, almost all engineers unwittingly treat it as a text format that has to be converted into the desired infrastructure. DO NOT DO IT THIS WAY.

The truisms of good software development apply to Terraform as well.

I have seen many best practices for good code ignored in Terraform. You spent years learning to become a good programmer. Don't throw that experience away just because you are working with Terraform. The truisms of good software development apply to Terraform as well.

How can you not document the code?

I have come across huge Terraform stacks with absolutely no documentation. How can you write pages of code with no documentation at all? Add documentation that explains your Terraform code (the emphasis here is on the word "code"), why a given section matters, and what it does.

Would you deploy services that were one big main() function?

I have seen very complex Terraform stacks presented as a single module. Why don't we deploy software this way? Why do we split large functions into smaller ones? The same answers apply to Terraform. If your module is too large, break it into smaller modules.

Doesn't your company use libraries?

I've seen engineers spin up a new project with Terraform by mindlessly copy-pasting huge chunks from other projects and then tinkering with them until things start working. Would you work that way with production code at your company? There is a reason we use libraries. Yes, not everything has to be a library, but where would we be without shared libraries at all?!

Don't you use PEP8 or gofmt?

Most languages have a standard, accepted formatting scheme. In Python, it is PEP8. In Go, it is gofmt. Terraform has its own: terraform fmt. Use it and enjoy!

Would you start using React without knowing JavaScript?

Terraform modules can simplify some part of the complex infrastructure you are building, but that does not mean you can avoid digging into it altogether. Want to use Terraform properly without understanding resources? You are doomed: time will pass, and you will never master Terraform.

Are you coding with singletons or dependency injection?

Dependency injection is a recognized best practice in software development and is preferred over singletons. How is this useful in Terraform? I've seen Terraform modules that depend on remote state. Instead of writing modules that read from remote state, write a module that accepts parameters, and then pass those parameters to the module.
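
The software side of this analogy can be sketched in a few lines of Python. This is only an illustration of the two styles, not Terraform code: REMOTE_STATE, NetworkInputs, and both functions are hypothetical names made up for the example.

```python
from dataclasses import dataclass

# Stand-in for shared remote state that several stacks read from (invented data).
REMOTE_STATE = {
    "network": {"vpc_id": "vpc-123", "subnet_ids": ["subnet-a", "subnet-b"]}
}


def service_plan_singleton(name: str) -> dict:
    """Singleton style: the module reaches into global state on its own."""
    network = REMOTE_STATE["network"]  # hidden dependency, invisible to the caller
    return {
        "name": name,
        "vpc_id": network["vpc_id"],
        "subnets": list(network["subnet_ids"]),
    }


@dataclass(frozen=True)
class NetworkInputs:
    vpc_id: str
    subnet_ids: tuple[str, ...]


def service_plan_injected(name: str, network: NetworkInputs) -> dict:
    """Injection style: everything the module needs arrives as a parameter."""
    return {
        "name": name,
        "vpc_id": network.vpc_id,
        "subnets": list(network.subnet_ids),
    }


# The caller decides where the values come from: remote state, a test
# fixture, or the outputs of another module.
test_network = NetworkInputs(vpc_id="vpc-test", subnet_ids=("subnet-x",))
plan = service_plan_injected("billing", test_network)
```

The injected version can be exercised against a throwaway network without touching the shared state, which is the same property that makes a parameterized Terraform module easier to reuse and test.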

Do your libraries do ten things well, or do one thing great?

Libraries work best when they focus on a single task that they perform well. Instead of writing large Terraform modules that try to do everything at once, build them from pieces that each do one thing well, and then combine them as needed.

How do you make backwards-incompatible changes to libraries?

A shared Terraform module, like a regular library, needs some way to tell its users about changes that break backwards compatibility. It is annoying when such changes happen in libraries, and it is just as annoying when they happen in Terraform modules. I recommend using git tags and semver for Terraform modules.
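
As a rough illustration of what semver-style git tags buy the users of a module, here is a small Python sketch. The tag list and helper names are hypothetical, and the logic is deliberately simplified: it only handles plain MAJOR.MINOR.PATCH tags and ignores pre-release suffixes and the special rules for 0.x versions.

```python
import re

_SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> tuple[int, int, int]:
    """Parse a tag such as 'v1.10.1' into (1, 10, 1)."""
    match = _SEMVER_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_breaking_upgrade(current: str, candidate: str) -> bool:
    """True if moving from current to candidate crosses a major version."""
    return parse_tag(candidate)[0] > parse_tag(current)[0]


def newest_compatible(current: str, tags: list[str]) -> str:
    """Newest tag with the same major version that is not older than current."""
    cur = parse_tag(current)
    compatible = [
        tag
        for tag in tags
        if parse_tag(tag)[0] == cur[0] and parse_tag(tag) >= cur
    ]
    return max(compatible, key=parse_tag, default=current)


# Hypothetical tags on a shared module repository.
tags = ["v1.2.0", "v1.10.1", "v1.9.3", "v2.0.0"]
safe_target = newest_compatible("v1.2.0", tags)
needs_review = is_breaking_upgrade(safe_target, "v2.0.0")
```

A consumer pinned to v1.2.0 can move to the newest 1.x tag routinely, while the jump to v2.0.0 is flagged as something to read the changelog for first.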

Does your production service run on your laptop or in a data center?

HashiCorp has tools like Terraform Cloud for running your Terraform. These centralized services make it easy to manage, audit, and approve Terraform changes.

Don't you write tests?

Engineers agree that code needs to be tested, yet they often forget about testing when working with Terraform. For infrastructure, that is fraught with nasty surprises. My advice is to create "test" or "example" stacks that use your modules and can be properly deployed for testing during CI/CD.

Terraform and microservices

Microservice companies live and die by how quickly they can spin up, update, and tear down microservice stacks.

The most common downside of microservice architectures, and one there is no getting away from, lies in operations rather than in code. If you treat Terraform merely as a way to automate the infrastructure side of a microservice architecture, you are depriving yourself of the real benefits of the system. These days, everything is code.

Source: habr.com
