Patterns in Terraform to combat chaos and manual routine. Maxim Kostrikin (Ixtens)

At first glance, the Terraform developers offer quite convenient best practices for working with AWS infrastructure. But there is a catch. Over time, the number of environments grows, and each one acquires its own quirks. A near-copy of the application stack appears in a neighboring region. Then the Terraform code has to be carefully copied and edited to fit the new requirements, or you end up making a snowflake.

My talk is about Terraform patterns for fighting chaos and manual routine on large, long-running projects.

I'm 40, and I've been in IT for 20 years. I have been working at Ixtens for 12 years. We do ecommerce-driven development. And I have been practicing DevOps for 5 years.

My story is about my experience on a project at a company whose name I will not mention, hiding behind a non-disclosure agreement.

The numbers on the slide are there to give a sense of the project's scale. And everything I say from here on relates to Amazon.

I joined this project 4 years ago. The infrastructure refactoring was in full swing because the project had grown, and the patterns that had been used no longer fit. Given all the planned growth of the project, we had to come up with something new.

Thanks to Matvey, who told us yesterday what happened at Dodo Pizza. The same thing happened to us 4 years ago.

Developers came and began writing infrastructure code.

The most obvious reason this was required was time to market. We had to make sure that the DevOps team was not a bottleneck during rollouts. Among other things, Terraform and Puppet were used at the very first level.

Terraform is an open source project from HashiCorp. For those who don't know what it is at all, the next few slides are for you.

Infrastructure as code means that we describe our infrastructure and ask a robot to make sure we get the resources we described.

For example, we need a virtual machine. We describe it and add a few required parameters.
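As a minimal sketch of such a description (Terraform 0.12+ syntax; the region, AMI ID, instance type, and resource name are placeholder assumptions, not values from the talk), a single virtual machine could be declared like this:

```hcl
# Provider configuration: which cloud and which region to talk to.
provider "aws" {
  region = "us-east-1" # assumed region
}

# One virtual machine with its two required parameters.
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"              # assumed size
}
```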

After that, we configure access to Amazon in the console and ask for Terraform plan. Terraform plan says: "OK, for your resource, I can do these things." One resource will be added, and no changes to existing resources are expected.

Once you are happy with everything, you can run Terraform apply. Terraform will create the instance, and you will get a virtual machine in your cloud.

Then our project evolves. We make some changes: we ask for more instances and add Route 53 records.

And we repeat: plan, see what changes are planned, apply. This is how our infrastructure grows.

Terraform uses state files. It saves everything it changes in Amazon to a file where each resource you described is mapped to the corresponding resources created in Amazon. So when you change the description of a resource, Terraform knows exactly what needs to change in Amazon.

These state files were originally just files. We stored them in Git, which was extremely inconvenient: someone constantly forgot to commit changes, and there were many conflicts.

Now you can use a backend: you tell Terraform which bucket and which key to save the state file under. Terraform itself takes care of fetching the state file, doing all the magic, and putting the final result back.
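A backend declaration of this kind might look roughly as follows. The bucket name, key, and region are hypothetical; only the shape of the block is the point.

```hcl
terraform {
  # Store the state file in S3 instead of on local disk or in Git.
  backend "s3" {
    bucket = "example-terraform-state"      # hypothetical bucket name
    key    = "testing/vm/terraform.tfstate" # hypothetical key inside the bucket
    region = "us-east-1"                    # assumed region
  }
}
```

Note that the values in this block are literals; this restriction matters later in the talk.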

Our infrastructure is growing. Here is our code. And now we don't just want to create a virtual machine; we want a test environment.

Terraform lets you create a module, that is, describe the same thing in a separate folder.

Then, for example, in testing, we call this module and get the same result as if we had run Terraform apply in the module itself. Here is the code for testing.

For production, we can pass in some changes: in testing we don't need large instances, but in production large instances come in handy.
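As a rough sketch of the idea (Terraform 0.12+ syntax; the folder layout, module name, variable name, AMI ID, and instance types are all assumptions for illustration), a module and its two callers might look like this. The three parts live in three different folders, marked by the comments.

```hcl
# --- modules/app/main.tf: the shared description ---
variable "instance_type" {
  type    = string
  default = "t3.small" # assumed default size
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = var.instance_type
}

# --- testing/main.tf: a small instance is enough ---
module "app" {
  source        = "../modules/app"
  instance_type = "t3.small"
}

# --- production/main.tf: same module, larger instance ---
# (a separate folder, so the repeated module name does not clash)
module "app" {
  source        = "../modules/app"
  instance_type = "m5.xlarge"
}
```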

Now back to the project. It was a difficult task: the planned infrastructure was very large. We had to lay out all the code so that it would be convenient for everyone, both those who maintain the code and those who make changes. The plan was that any developer could go and change the infrastructure as needed for their part of the platform.

This is the directory tree that HashiCorp recommends if you have a large project and it makes sense to split the whole infrastructure into small pieces and describe each piece in a separate folder.

With an extensive resource library, you can call roughly the same thing in testing and in production.

In our case, this did not quite fit, because a test stack for developers or testers had to be easier to obtain. We did not want to walk through the folders, apply them in the right order, and worry about the database coming up before the instance that uses it. So all testing was launched from one folder. The same modules were called there, but everything happened in a single run.

Terraform takes care of all dependencies. It always creates resources in an order that lets you, for example, take the IP address of a freshly created instance and put it into a Route 53 record.
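The dependency is expressed simply by referring to one resource from another. A minimal sketch, assuming a hosted zone ID passed in as a variable and a placeholder AMI and domain name:

```hcl
variable "zone_id" {
  type        = string
  description = "ID of an existing Route 53 hosted zone (assumed to exist)"
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"
}

resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.example.com" # placeholder record name
  type    = "A"
  ttl     = 300

  # This reference makes Terraform create the instance first,
  # then the record that points at its address.
  records = [aws_instance.app.private_ip]
}
```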

In addition, the platform is very large. Running a test stack, whether for an hour or for 8 hours, is quite expensive.

So we automated it. A Jenkins job let you launch a stack. You had to give it the pull request with the changes the developer wanted to test and specify all the necessary options, components, and sizes. For performance testing, the developer could take more instances. To just check that some form opens, the minimal configuration was enough. You also indicated whether a cluster was needed, and so on.

Then Jenkins ran a shell script that slightly modified the code in the Terraform folder: it removed unnecessary files and added the necessary ones. After that, a single run of Terraform apply brought the stack up.

And then there were other steps that I do not want to go into.

Because testing needed a few more options than production, we had to make copies of the modules so that we could add the features needed only in testing to those copies.

So in testing you seemingly want to test the changes that will eventually go to production, but in fact one thing was tested and something slightly different was used in production. There was also a gap in the process: in production, all changes were applied by the operations team. Sometimes changes that were supposed to move from testing to production stayed behind in the other copy.

There was another problem: a new service would be added that differed slightly from an existing one. Instead of modifying the existing module, you had to copy it and add the necessary changes.

In fact, Terraform is not a real programming language. It is a declaration. If we need to declare something, we declare it. And it all works.

At some point, while discussing one of my pull requests, a colleague said that we should not breed snowflakes. I wondered what he meant. There is this scientific fact that no two snowflakes in the world are identical: they all differ, if only slightly. As soon as I heard this, I felt the full weight of the Terraform code. When we had to move from one Terraform version to the next, there was a breaking change, that is, the code was no longer compatible with the next version. I had to make a pull request that touched almost half of the files in the infrastructure to bring it to the next version of Terraform.

And once such snowflakes appeared, all our Terraform code turned into a big, big pile of snow.

For an outside developer who is not in operations, it does not matter much: he made a pull request, his resource came up, and that's it, it's no longer his concern. But the DevOps team, which makes sure everything is OK, has to make all these changes. And the cost of these changes grew enormously with each additional snowflake.

There is a story about how a student at a seminar draws two perfect circles with chalk on a blackboard. And the teacher is surprised how he managed to draw so smoothly without a compass. The student replies: “It’s very simple, I turned a meat grinder for two years in the army.”

Of the four years I've been on this project, I spent about two doing Terraform. And, of course, I have some tricks and tips on how to simplify Terraform code, work with it as with a programming language, and reduce the burden on the developers who must keep this code up to date.

The first thing I would like to start with is symlinks. Terraform has a lot of repetitive code. For example, the provider declaration is the same at almost every point where we create a piece of infrastructure. It is logical to put it in a separate folder and make symlinks to this file wherever the provider is required.

For example, in production you use assume role, which lets you get access rights to an external Amazon account. By changing one file, every folder in the resource tree gets the required rights, so that Terraform knows which part of Amazon to access.
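A shared provider file of this kind might look roughly like the sketch below. The file location, region, account ID, and role name are hypothetical; `assume_role` with `role_arn` is the AWS provider's own block.

```hcl
# common/provider.tf (hypothetical location), symlinked into each folder, e.g.
#   ln -s ../../common/provider.tf provider.tf
provider "aws" {
  region = "us-east-1" # assumed region

  # Work in another AWS account by assuming a role there.
  assume_role {
    role_arn = "arn:aws:iam::123456789012:role/terraform" # placeholder ARN
  }
}
```

Changing the role in this one file then changes it for every folder that links to it.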

Where do symlinks not work? As I said, Terraform has state files, and they are very, very cool. But Terraform initializes the backend at the very first step, and it cannot use any variables in the backend parameters: they always have to be written out as literal text.

As a result, when someone creates a new resource, he copies part of the code from other folders, and he can get the key or the bucket wrong. For example, he builds something in the sandbox and then copies it to production, and the production code ends up using the sandbox bucket. Of course, this will be found quickly and can be fixed somehow, but it is still a waste of time and, to some extent, resources.

What can we do about it? Before working with Terraform, you need to initialize it. During initialization, Terraform downloads all plugins. At some point, they were split out of the monolith into a more microservice-like architecture. So you always need to run Terraform init to pull in all the modules and all the plugins.

Here you can use a shell script, which, first, can obtain all the variables: a shell script has no such limits. And second, the path. If we always use the folder's path in the repository as the key of the state file, this kind of error is ruled out.
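One way to sketch this is with Terraform's partial backend configuration: the backend block is left empty, and a wrapper script supplies the values at init time. The variable names `BUCKET` and `REGION` and the exact key layout are assumptions; the talk does not show the actual script. The shell commands are given as comments to keep the example in one language.

```hcl
terraform {
  # Left empty on purpose: bucket, key, and region are passed by the
  # init script, so nobody copies a wrong literal from another folder.
  backend "s3" {}
}

# A wrapper script could then run, from inside the folder being initialized:
#
#   terraform init \
#     -backend-config="bucket=${BUCKET}" \
#     -backend-config="region=${REGION}" \
#     -backend-config="key=$(git rev-parse --show-prefix)terraform.tfstate"
#
# `git rev-parse --show-prefix` prints the current folder's path relative to
# the repository root (with a trailing slash), so the state key always
# mirrors the folder's location in the repository.
```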

Where do we get the data? From a JSON file. Terraform lets you write infrastructure not only in HCL (HashiCorp Configuration Language) but also in JSON.

JSON is easy to read from a shell script. So you can put a configuration file with the bucket name in one place and use this bucket both in the Terraform code and in the initialization shell script.
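The talk relies on Terraform reading JSON-syntax configuration directly. As an alternative sketch of the same "one shared JSON file" idea, and not necessarily what the project did, Terraform 0.12+ can also decode an ordinary JSON file with the built-in `jsondecode` and `file` functions. The file name and its keys are hypothetical.

```hcl
# Assumed contents of ../config.json (hypothetical file and keys):
#   { "state_bucket": "example-terraform-state", "region": "us-east-1" }

locals {
  # Parse the shared JSON file into a Terraform object.
  config = jsondecode(file("${path.module}/../config.json"))
}

# Values from the file can be used anywhere expressions are allowed.
provider "aws" {
  region = local.config.region
}

output "state_bucket" {
  value = local.config.state_bucket
}
```

A shell script can read the same file with any JSON tool, so both sides share one source of truth for the bucket name.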

Why is it important to know the Terraform bucket? Because there is such a thing as remote state. When I bring up a resource, for example when I tell Amazon "Please start an instance", I need to specify a lot of required parameters, including identifiers.

These identifiers are kept in the state of some other folder. I can say: "Terraform, please go to the state file of that resource and get me these identifiers." This gives a kind of unification between different regions or environments.
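A minimal sketch of such a lookup, in Terraform 0.12+ syntax. The bucket, key, and the output name `private_subnet_id` are assumptions; the other folder must actually declare that output for this to work.

```hcl
# Read the outputs that another folder saved in its state file.
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"          # hypothetical bucket
    key    = "production/vpc/terraform.tfstate" # hypothetical key
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  # The identifier comes from the VPC folder's state, not from a literal.
  subnet_id = data.terraform_remote_state.vpc.outputs.private_subnet_id
}
```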

It is not always possible to use a remote state file. For example, you created a VPC manually, and the Terraform code that creates VPCs produces one so different that aligning the two would take a very long time. In that case you can use the following trick.

Make a module that looks as if it creates the VPC and returns identifiers, but in fact contains just a file with hardcoded values, which can be used to create that same instance.
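Such a stand-in module can consist of outputs only. A sketch, with a hypothetical module path and placeholder identifiers:

```hcl
# modules/vpc_static/outputs.tf (hypothetical): creates nothing,
# only exposes the IDs of a VPC that was built by hand.
output "vpc_id" {
  value = "vpc-0123456789abcdef0" # placeholder ID
}

output "private_subnet_ids" {
  value = [
    "subnet-0123456789abcdef0", # placeholder IDs
    "subnet-0fedcba9876543210",
  ]
}
```

Callers use `module.<name>.vpc_id` exactly as they would with a module that really creates the VPC, so the hardcoded values can later be replaced without touching the callers.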

The state file does not always have to be saved in the cloud. For example, when testing modules, you can initialize a backend that simply saves the file on disk for the duration of the test.
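For a test run, this could be as simple as a local backend. The file name is an assumption.

```hcl
terraform {
  # Keep the state on local disk; nothing is written to S3 during tests.
  backend "local" {
    path = "test.tfstate" # hypothetical throwaway state file
  }
}
```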

Now a little about testing. What can be tested in Terraform? Probably a lot, but I will talk about these 4 things.

HashiCorp has its own idea of how Terraform code should be formatted, and Terraform fmt formats the code you edit according to that idea. So the tests must check that the formatting matches what HashiCorp prescribes, so that you don't have to fix bracket placement by hand, and so on.

The next one is Terraform validate. It does a little more than a syntax check along the lines of "are all the brackets paired". What is important here? Our infrastructure is very finely divided: it has a lot of different folders, and Terraform validate has to be run in each of them.

Accordingly, to speed up testing, we run several processes at once using parallel.

Parallel is a very cool thing, use it.

But every time Terraform is initialized, it goes to HashiCorp and asks: "What are the latest plugins? And the plugin I have in the cache, is it the right one or not?" That slowed down every step.

If you tell Terraform where the plugins are, it says: "OK, this is probably the freshest there is. I won't go anywhere; I'll start validating your Terraform code right away."

To fill the folder with the necessary plugins, we have very simple Terraform code that only needs to be initialized. Here, of course, you must list all the providers that take part in your code in any way; otherwise Terraform will say: "I don't know this provider, because it is not in the cache."
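A sketch of such a "provider list" configuration. The syntax shown is from Terraform 0.13+ (older versions declared versions inside `provider` blocks); the folder name, the choice of providers, and the version constraints are all assumptions.

```hcl
# cache/main.tf (hypothetical folder): lists every provider used anywhere
# in the repository, so that one `terraform init` here downloads them all.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0" # assumed constraint
    }
    local = {
      source  = "hashicorp/local"
      version = "~> 2.0" # assumed constraint
    }
  }
}

# Other folders can then be initialized against the downloaded plugins, e.g.
#   terraform init -plugin-dir=/path/to/plugins
# (the directory layout Terraform expects depends on its version).
```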

The next one is Terraform plan. As I said, development is cyclical. We write code with changes, and then we need to find out what changes are planned for the infrastructure.

When the infrastructure is very, very large, you can change one module to fix some test environment or some specific region and break a neighboring one. Therefore, Terraform plan should be run for the entire infrastructure and show what changes are planned.

You can do it the smart way. For example, we wrote a Python script that resolves dependencies. Depending on what was changed, a Terraform module or just a specific component, it makes plans for all dependent folders.

Terraform plan should be run on request. At least that's what we do.

Tests are, of course, good to run on every change, on every commit, but plans are quite expensive. So in the pull request we say: "Please give me the plans." The robot starts and posts, as comments or attachments, all the plans that result from your changes.

A plan is rather expensive. It takes time because Terraform goes to Amazon and asks: "Does this instance still exist? Does this autoscale group have exactly the same parameters?" To speed it up, you can use the -refresh=false parameter. This means that Terraform will download the state from S3 and trust that it exactly matches what is in Amazon.

Such a Terraform plan is much faster, but the state must match your infrastructure, so somewhere, at some point, Terraform refresh must run. Terraform refresh does exactly that: it makes the state match what is in the real infrastructure.

And I must say something about security. This is where I should have started. Wherever you run Terraform against your infrastructure, there is a vulnerability: you are essentially executing code. If a pull request contains malicious code, it can be executed in an environment that has too much access. So be careful where you run Terraform plan.

The next thing I would like to talk about is user-data testing.

What is user-data? In Amazon, when we create an instance, we can send a kind of note along with it, in the instance metadata. Cloud-init is usually present on these instances, and when an instance starts, cloud-init reads this note and says: "OK, today I am a load balancer." And it acts according to these instructions.

But unfortunately, when we run Terraform plan and Terraform apply, user-data looks like this jumble of digits. Terraform just shows you a hash. All you can see in the plan is whether the hash will change or stay the same.

And if you don't pay attention to this, a broken text file may go to Amazon, into the real infrastructure.

Alternatively, you can point Terraform not at the entire infrastructure but only at the template, and say in the code: "Please print this template for me." As a result, you get a printout of what your data will look like in Amazon.
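A minimal sketch of "print the template", using the built-in `templatefile` function from Terraform 0.12+ (the project may well have used the older `template_file` data source instead). The template file name and the `role` variable are assumptions.

```hcl
variable "role" {
  type    = string
  default = "load-balancer" # assumed role name
}

locals {
  # Render user_data.tpl (assumed to exist next to this file and to
  # reference ${role}) with the given variables.
  user_data = templatefile("${path.module}/user_data.tpl", {
    role = var.role
  })
}

# Shows the rendered text, not a hash, after `terraform apply`.
output "user_data" {
  value = local.user_data
}
```

Because this configuration creates no cloud resources, applying it is cheap and safe to do in tests.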

Another option is to use a module to generate user-data. You apply this module, get a file on disk, and compare it with a reference. Then, if some junior decides to tweak user-data a little, your tests will say: "OK, there are changes here and here; that is fine."
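One way to sketch this is with the `local_file` resource from the hashicorp/local provider. The paths, template name, and variable are assumptions; the comparison itself is done outside Terraform.

```hcl
variable "role" {
  type    = string
  default = "load-balancer" # assumed role name
}

# Write the rendered user-data to disk so a test can inspect it.
resource "local_file" "user_data" {
  content  = templatefile("${path.module}/user_data.tpl", { role = var.role })
  filename = "${path.module}/rendered/user_data.txt"
}

# After `terraform apply`, a test can compare the result with a reference:
#   diff rendered/user_data.txt expected/user_data.txt
```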

The next thing I would like to talk about is automating Terraform apply.

Of course, it is rather scary to run Terraform apply automatically, because who knows what changes have arrived and how harmful they might be to a live infrastructure.

For a test environment, this is all fine. A job that creates a test environment is what all developers need. And the phrase "it worked for me" is then not a funny meme but proof that a person took the trouble to bring up a stack and run some tests on it. He made sure that everything was fine there and said: "OK, the code I am releasing has been tested."

In production, sandbox, and other more business-critical environments, automatic apply is safe for some kinds of resources, because nothing dies as a result. These are autoscale groups, security groups, roles, Route 53 records, and the list can be quite long. But keep an eye on what is going on and read the reports of the automated applies.

Where automatic apply is dangerous or scary, for example for persistent resources such as databases, you instead get reports that some piece of infrastructure has unapplied changes. An engineer then runs the apply jobs under supervision or does it from his own console.

Amazon has a feature called termination protection. In some cases it can protect you from changes you do not want. Terraform goes to Amazon and says: "I need to kill this instance to make another one." And Amazon says: "Sorry, not today. We have termination protection."
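In the AWS provider, termination protection on an EC2 instance is switched on with the `disable_api_termination` argument. A minimal sketch with placeholder values:

```hcl
resource "aws_instance" "database" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "m5.large"              # assumed size

  # EC2 termination protection: API calls that try to terminate this
  # instance are rejected until the flag is turned off again.
  disable_api_termination = true
}
```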

And the icing on the cake is code optimization. When we work with Terraform code, we have to pass a very large number of parameters to a module. These are the parameters needed to create some resource. The code turns into long lists of parameters that have to be passed from module to module, especially if the modules are nested.

This is very hard to read and very hard to review. Very often it turns out that some parameters slip through review and are not quite the ones needed. And fixing it later costs time and money.

Therefore, I suggest using a complex parameter that contains a whole tree of values. That is, you need some folder that holds all the values you would like to have in a given environment.

By calling this module, you get a tree that is generated in one common module, that is, a module that works the same way for the entire infrastructure.

In this module, you can do some calculations using locals, a fairly fresh Terraform feature. Then, in a single output, you return a complex parameter, which may include maps, lists, and so on.
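A sketch of such a configuration module and its use, in Terraform 0.12+ syntax. All names, sizes, and the folder layout are assumptions made for illustration; the two parts live in different folders, marked by the comments.

```hcl
# --- modules/config/main.tf (hypothetical shared module) ---
variable "environment" {
  type = string
}

locals {
  # Per-environment values kept in one place.
  sizes = {
    testing = {
      instance_type  = "t3.small"
      instance_count = 1
    }
    production = {
      instance_type  = "m5.xlarge"
      instance_count = 3
    }
  }

  # The tree of values computed for the requested environment.
  settings = {
    name_prefix    = "app-${var.environment}"
    instance_type  = local.sizes[var.environment].instance_type
    instance_count = local.sizes[var.environment].instance_count
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

# One output carries the whole tree.
output "settings" {
  value = local.settings
}

# --- testing/main.tf (caller) ---
module "config" {
  source      = "../modules/config"
  environment = "testing"
}

module "app" {
  source   = "../modules/app" # assumed to declare a variable "settings"
  settings = module.config.settings
}
```

The caller passes one value instead of a long list, and nested modules can forward the same object unchanged.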

That is the end of my best finds. And I would like to tell a story about Columbus. When he was looking for money for his expedition to reach India (as he thought then), no one believed him; everyone thought it was impossible. Then he said: "Stand this egg up so that it does not fall." All the bankers, very rich and probably smart people, tried to stand the egg one way or another, and it fell every time. Then Columbus took the egg and pressed it down a little. The shell cracked, and the egg stayed put. They said: "Oh, that's too easy!" And Columbus answered: "Yes, it is too simple. And when I open the way to India, everyone will use this trade route."

What I have just told you is probably quite simple and trivial. Once you learn these things and start using them, they become a matter of course. So use them. And if all this is already routine for you, then at least you now know how to stand an egg so that it does not fall.

To sum up:

  • Try to avoid snowflakes. The fewer snowflakes you have, the fewer resources you will need to make any change across your entire large infrastructure.
  • Constant change. When the code changes, you need to bring your infrastructure in line with those changes as soon as possible. There should be no situation where someone comes two or three months later to look at Elasticsearch, runs Terraform plan, and finds a lot of changes he did not expect. It then takes a lot of time to put everything back in order.
  • Tests and automation. The more of your code is covered by tests, the more confident you can be that you are doing everything right. And automatic delivery will multiply that confidence.
  • The code for the test and production environments should be almost the same. Almost, because production is still a little different, and there will always be nuances that go beyond the test environment. But give or take, this can be achieved.
  • And if you have a lot of Terraform code and keeping it up to date takes a lot of time, it is never too late to refactor and bring it into good shape.

Topics left out of this talk:

  • Immutable infrastructure. AMI delivery on schedule.
  • A structure for Route 53 when you have a lot of records and want them to be in a consistent order.
  • Fighting API rate limits. This is when Amazon says: "That's it, I can't accept any more requests, please wait." And half of the office waits until it can launch its infrastructure.
  • Spot instances. Amazon is not cheap, and spot instances let you save quite a lot. That could be a whole talk of its own.
  • Security and IAM roles.
  • Searching for lost resources: when you have instances of unknown origin in Amazon, they eat money. Even if such instances cost $100-150 per month, that is more than $1,000 per year. Finding such resources pays off.
  • And reserved instances.

That's all for me. Terraform is very cool, use it. Thank you!

Questions

Thanks for the report! You have a state file in S3, but how do you solve the problem that several people can take this state file and try to deploy?

First, we are not in a hurry. Second, there are flags by which we signal that we are working on some piece of code. Even though the infrastructure is very large, that does not mean someone is constantly changing something. During the active phase, when we kept state files in Git, this was a problem: someone would produce his own state file, and we had to merge them by hand in order to continue. Now there is no such problem; in general, Terraform has solved it. And if something is changing constantly, you can use locks, which prevent what you described.

Are you using open source or enterprise?

No enterprise, that is, everything that you can go and download for free.

My name is Stanislav. I wanted to add a small note. You talked about the Amazon feature that makes an instance unkillable. Terraform itself has this too: in the lifecycle block, you can forbid changes or forbid destruction.

I was limited on time. Good point.
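A sketch of what Stanislav describes, with placeholder values. `prevent_destroy` and `ignore_changes` are standard arguments of Terraform's `lifecycle` block; which attributes to ignore is an assumption here.

```hcl
resource "aws_instance" "database" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "m5.large"              # assumed size

  lifecycle {
    # Any plan that would destroy this resource fails with an error.
    prevent_destroy = true

    # Differences in these attributes are not treated as changes
    # (the choice of user_data is only an example).
    ignore_changes = [user_data]
  }
}
```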

I also wanted to ask two things. First, you talked about testing. Have you used any testing tools? I heard about the Test Kitchen plugin. Perhaps there is something else. And I would like to ask about Local Values. How do they basically differ from Input Variables? And why can't I parametrize something only through Local Values? I tried to deal with this topic, but somehow I did not figure it out myself.

We can talk in more detail outside this hall. Our testing tools are entirely homemade. There is nothing special to test there. In general, there are setups where automated tests bring up the infrastructure somewhere, check that it is OK, and then destroy everything, with a report that your infrastructure is still in good shape. We don't have that, because the test stacks run every day, and that is enough. If something starts to break, it will break there without us checking it somewhere else.

Regarding Local Values, let's continue the conversation outside the audience.

Hello! Thanks for the report! Very informative. You said that you have a lot of the same type of code to describe the infrastructure. Have you considered generating this code?

Great question, thanks! The point is that when we use infrastructure as code, we assume that we look at the code and understand what infrastructure lies behind it. If the code is generated, then we have to imagine what code will be generated in order to understand what infrastructure will result. Or we generate the code and commit it, and in effect get the same thing. So we went the way we did: what we wrote is what we got. Besides, generators appeared a little later, after we had started, and it was too late to change.

Have you heard of jsonnet?

No.

Look, this is really cool stuff. I see a specific case where you can apply it and generate a data structure.

Generators are good when they fit you, like in the joke about the shaving machine: the first time everyone's face is different, but after that everyone has the same face. Generators are very cool. But, unfortunately, our faces are a little different. That is the problem.

Just look. Thank you!

My name is Maxim, I'm from Sberbank. You mentioned that you tried to make Terraform resemble a programming language. Isn't it easier to use Ansible?

These are very different things. Ansible can create resources in Amazon, and so can Puppet. But Terraform is purpose-built for exactly this.

Do you only have Amazon?

It's not that we only have Amazon. We have almost only Amazon. But the key feature is that Terraform remembers. Suppose you say "Bring me up 5 instances", they come up, and then you say "Now I need 3". Terraform will say: "OK, I'll kill 2." And Ansible will say: "OK, here are 3 more." 8 in total.

Hello! Thanks for your report! It was very interesting to hear about Terraform. I just want to make a small comment about the fact that Terraform still does not have a stable release, so be very careful with Terraform.

A spoon is most welcome at dinner time. That is, if you need a solution now, you sometimes set aside the fact that it is unstable and so on. It works, and it helped us.

Here is the question. You use a remote backend, S3. Why don't you use the official backend?

Official?

Terraform Cloud.

When did it appear?

About 4 months ago.

If it had appeared 4 years ago, then, probably, I would have answered your question.

There is already a built-in function and locks, and you can store a state file. Try it. But I haven't tested either.

We are on a big train that is moving at high speed. And you can’t just take and throw out a few cars.

You were talking about snowflakes, why didn't you use branch? Why didn't it work out that way?

We have such an approach that the entire infrastructure is in one repository. Terraform, Puppet, all the scripts that somehow relate to this, they are all in one repository. This way we can ensure that incremental changes are tested one by one. If it were a bunch of branches, then such a project would be nearly impossible to maintain. Six months pass, and they diverge so much that it's just some kind of punishment. This is what I wanted to run away from before refactoring.

So it doesn't work?

It doesn't work at all.

On branches, I cut the slide about folders. That is, if you make a folder for each test stack, for example team A has its own folder and team B has its own folder, that does not work either. We made unified test environment code that was flexible enough to suit everyone. That is, we maintained one codebase.

Hello! My name is Yura! Thanks for the report! A question about modules. You say you are using modules. How do you resolve the issue when changes made in one module are not compatible with another person's change? Do you version the modules somehow, or try to build one do-everything module that meets both requirements?

This is the big snow pile problem. This is what we suffer from: some innocuous change can break some part of the infrastructure, and it becomes noticeable only much later.

So it hasn't been solved yet?

You make universal modules. Avoid snowflakes. And everything will work out. The second half of the report is about how to avoid it.

Hello! Thanks for the report! I would like to clarify something. A large topic stayed behind the scenes, and it is the one I came for. How are Puppet and role distribution integrated?

User-data.

So you just drop a file and somehow execute on it?

User-data is a note. When we clone an image, a daemon starts up there and, trying to figure out what it is, reads the note saying that it is a load balancer.

That is, is it some kind of separate process that is given away?

We didn't invent it. We use it.

Hello! I just have a question about User-data. You said that there are problems there, that someone might send something to the wrong place. Is there some way to store User-data in the same Git, so that it is always clear what the User-data refers to?

We generate User-data from a template. That is, a certain number of variables are substituted into it, and Terraform generates the final result. Therefore you can't just look at the template and say what will happen, because all the problems come from the developer thinking that he is passing a string in this variable while an array is used there. He passes it, and bang: such-and-such, such-and-such, next line, and everything breaks. If this is a new resource and a person brings it up and sees that something is not working, it is quickly resolved. But if an existing autoscale group has been updated, then at some point the instances in the autoscale group begin to be replaced. And bang, something stops working. It hurts.

It turns out that the only solution is to test?

Yes, you see the problem, and you add test steps for it. That is, output can also be tested. It may not be very convenient, but you can also put in some markers and check that the User-data is pinned down here.
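A minimal sketch of such an output check, in Python rather than Terraform. Here `string.Template` stands in for Terraform's template rendering, and the template text, the `/etc/node-role` path, and the role names are all hypothetical. The point is only that a list passed where a string was expected renders without complaint, and a check on the rendered output catches it before instances start being replaced.

```python
import re
from string import Template

# Hypothetical user-data template: writes the node role to a file on boot.
USER_DATA = Template("#!/bin/sh\necho role=$role > /etc/node-role\n")

# The marker we expect in valid output: exactly one simple role name.
ROLE_LINE = re.compile(
    r"^echo role=[a-z][a-z0-9-]* > /etc/node-role$", re.MULTILINE
)


def render(role):
    """Render user-data; any value is silently converted to text."""
    return USER_DATA.substitute(role=role)


def user_data_is_valid(text):
    """Check the rendered output rather than the template."""
    return ROLE_LINE.search(text) is not None


good = render("load-balancer")
bad = render(["load-balancer", "cache"])  # a list where a string was expected

print(user_data_is_valid(good), user_data_is_valid(bad))
```

The broken rendering contains the list's text representation instead of a role name, so the check fails on it while the template itself still looks fine.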

My name is Timur. It's very cool that there are reports about how to properly organize Terraform.

I haven't even started.

I think that at the next conference, maybe there will be. I have a simple question. Why are you hardcoding the values in a separate module rather than using tfvars? Is a module with values better than tfvars?

That is, I should write here (slide: Production/environment/settings.tf): domain = variable domain, vpcnetwork = variable vpcnetwork, and from tfvars get the same thing?

We do exactly that. We refer to the setting source module, for example.

In fact, this is a kind of tfvars. Tfvars is very handy in a testing environment. I have tfvars for large instances and for small ones. I throw one file into the folder and get what I want. But when we look at infrastructure, we want to be able to see and immediately understand everything. With tfvars it turns out that you need to look here, and then also look in tfvars.

It turns out that everything was in one place?

Yes, tfvars is for when you have one piece of code that is used in several different places with different nuances. Then you throw in a tfvars file and get your nuances. What we have is infrastructure as code in its purest form: you look and you understand.

Hello! Have you come across situations where the cloud provider interferes with what you have done with Terraform? Let's say we edit the meta-data. There are ssh keys there, and Google constantly slips its own meta-data, its own keys, in there. And Terraform always reports that there are changes. After each run, even if nothing has changed, it always says that it will update this field now.

Not with keys, but yes, part of the infrastructure is affected by this kind of thing: Terraform cannot change anything there, and we can't change anything by hand either. For now we live with it.

So you ran into this but didn't come up with anything, and it just keeps doing it on its own?

Unfortunately yes.

Hello! My name is Stanislav Starkov, Mail.ru Group. How do you solve the problem with generating a tag on ..., how do you pass it inside? As I understand it, through User-data, to specify the host name and set Puppet on it? And the second part of the question. How do you solve this issue in SG, i.e. when you generate SG, a hundred instances of the same type, how do you name them correctly?

Those instances that are very important to us, we name nicely. Those that don't need it get a suffix saying that this is an autoscale group. In theory such an instance can be killed, and you get a new one.

As for the problem with the tag, there is no such problem, but there is such a task. We use tags very, very heavily, because the infrastructure is large and expensive. We need to look at what the money is spent on, so tags allow us to sort out what went where. And, accordingly, to find the places where a lot of money is being spent.
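As a rough illustration of sorting spending by tag, here is a small Python sketch. The resource records, tag keys, and cost figures are invented for the example and do not come from the project described in the talk.

```python
from collections import defaultdict

# Invented sample data: resources with tags and a monthly cost.
resources = [
    {"id": "i-0001", "tags": {"team": "checkout", "env": "prod"}, "monthly_cost": 300.0},
    {"id": "i-0002", "tags": {"team": "checkout", "env": "test"}, "monthly_cost": 120.0},
    {"id": "i-0003", "tags": {"team": "search", "env": "prod"}, "monthly_cost": 80.0},
    {"id": "i-0004", "tags": {}, "monthly_cost": 45.5},
]


def cost_by_tag(items, key):
    """Sum monthly cost per value of one tag; untagged resources get their own bucket."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(key, "untagged")] += item["monthly_cost"]
    return dict(totals)


by_team = cost_by_tag(resources, "team")
# List the most expensive bucket first.
for name, total in sorted(by_team.items(), key=lambda kv: kv[1], reverse=True):
    print(name, total)
```

The "untagged" bucket is the useful part of the design: it shows how much spending cannot be attributed to anyone yet.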

What else was the question about?

When SG creates a hundred instances, do they need to be distinguished somehow?

No, they don't. Each instance has an agent that tells me there is a problem. If the agent reports, then the instance is known and, at the very least, we have its IP address. That is already enough to go and look. Secondly, we use Consul for discovery where there is no Kubernetes. And Consul also shows the IP address of the instance.

That is, you go by the IP and not by the host name?

It is impossible to navigate by host name, there are too many of them. There are instance identifiers - AE, etc. You can find one somewhere and paste it into the search.

Hello! I realized that Terraform is a good thing, tailored to the clouds.

Not only.

Here is the question that interests me. Suppose you decide to move, say, to bare metal en masse with all your instances. Will there be any problems? Or will you have to use other products, for example the same Ansible that was mentioned here?

Ansible is about something a bit different. That is, Ansible runs once the instance has already started, and Terraform works before the instance has started. Switching to bare metal is not planned.

Not now, but business will come and say: "Come on."

Switching to another cloud - yes, but there is a slightly different feature here. You need to write Terraform code in such a way that you can switch to some other cloud with less bloodshed.

Initially, the task was for our entire infrastructure to be agnostic, i.e. any cloud should be fine, but at some point the business gave up and said: "OK, in the next N years we won't go anywhere, you can use services from Amazon".

Terraform allows you to create Front-End jobs, configure PagerDuty, Datadog, etc. It has a lot of tails. It can control practically the entire world.

Thanks for the report! I have also been working with Terraform for 4 years now. During a gradual transition to Terraform, to infrastructure as a declarative description, we ran into a situation where someone had done something by hand, and then you try to make a plan and get some error. How do you deal with such problems? How do you find the lost resources that were indicated?

Mostly with our hands and eyes. If we see something strange in the report, we analyze what is happening there, or we just kill it. In general, pull requests are a common thing.

If there is an error, do you roll back? Have you tried doing this?

No, this is a decision of a person at the moment when he sees the problem.

Source: habr.com