Production readiness checklist

This translation was prepared for students of the course "DevOps practices and tools", which starts today!

Have you ever released a new service to production? Or maintained one? If so, what guided you? What is good for production and what is bad? How do you train new team members to release services or maintain existing ones?

When it comes to production practices, most companies end up with a "Wild West" approach. Each team settles on its own tools and best practices through trial and error. This often hurts not only the success of projects, but also the engineers themselves.

Trial and error creates an environment in which blaming and blame-shifting are common. In such an environment it becomes increasingly difficult to learn from mistakes and avoid repeating them.

Successful organizations:

  • recognize the need for production guidelines,
  • study best practices,
  • start discussing production readiness while new systems or components are still being developed,
  • enforce compliance with their production readiness rules.

Preparing for production includes a "review" process. The review can take the form of a checklist or a set of questions. Reviews can be done manually, automatically, or both. Instead of static lists of requirements, you can create checklist templates that can be adapted to specific needs. This gives engineers a way to inherit knowledge while keeping enough flexibility where it is needed.
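
One way to picture an adaptable checklist template is as plain data that a team copies and extends. The sketch below is only an illustration: the names ChecklistItem, ChecklistTemplate, adapt, and unanswered are hypothetical and do not come from any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    section: str
    question: str
    required: bool = True


@dataclass(frozen=True)
class ChecklistTemplate:
    name: str
    items: tuple = ()

    def adapt(self, name, extra=(), drop_sections=()):
        """Return a new checklist derived from this template.

        The base template is left untouched, so other teams can keep using it.
        """
        kept = tuple(i for i in self.items if i.section not in drop_sections)
        return ChecklistTemplate(name=name, items=kept + tuple(extra))


def unanswered(template, answers):
    """Return required questions that are not yet answered with True."""
    return [
        item.question
        for item in template.items
        if item.required and not answers.get(item.question, False)
    ]
```

A team could keep one base template per organization and call adapt for each kind of service, dropping sections that do not apply and adding its own questions.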

When should you check a service for production readiness?

It is useful to check the readiness for production not only immediately before the release, but also when transferring to another operations team or a new employee.

Run the check when you:

  • Release a new service to production.
  • Hand over operation of a production service to another team, such as SRE.
  • Hand over operation of a production service to new employees.
  • Organize technical support for the service.

Production readiness checklist

Some time ago I published an example production readiness checklist. Although the list was created while working with Google Cloud customers, it is useful and applicable outside of Google Cloud as well.

Design and Development

  • Develop a reproducible build process that does not require access to external services and is not affected by failures of external systems.
  • During the design and development period, define and establish SLOs for your services.
  • Document availability expectations for external services you depend on.
  • Avoid a single point of failure by removing dependencies on a single global resource. Replicate the resource, or use a fallback when the resource is unavailable (for example, a hard-coded value).
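
The fallback idea from the last item can be sketched as follows. This is a minimal illustration, not a prescribed implementation: get_limit, the fetch callable, and the value 100 are all assumptions made for the example.

```python
import logging

DEFAULT_LIMIT = 100  # hard-coded fallback; the value is purely illustrative


def get_limit(fetch, default=DEFAULT_LIMIT):
    """Read a value from a shared global resource, or fall back on failure.

    `fetch` is any callable that reads the shared resource and may raise.
    """
    try:
        return fetch()
    except Exception as exc:  # broad on purpose: any failure means "unavailable"
        logging.warning("global resource unavailable, using fallback: %s", exc)
        return default
```

The service keeps working with a conservative value while the shared resource is down, at the cost of possibly serving a stale setting.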

Configuration Management

  • Static, small, and non-secret configuration can be passed through command line options. For everything else, use configuration storage services.
  • Dynamic configuration should have fallback settings in case the configuration service is unavailable.
  • The development environment configuration should not be tied to the production configuration. Otherwise, the development environment may gain access to production services, which can cause privacy issues and data leaks.
  • Document what can be configured dynamically and describe fallback behavior if the configuration delivery system is unavailable.
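
Fallback settings for dynamic configuration might look like the sketch below. It assumes a hypothetical loader callable that fetches settings from a configuration service and raises when the service is unreachable; DynamicConfig is an illustrative name, not a real library class.

```python
class DynamicConfig:
    """Serve settings from a remote source, keeping the last good copy."""

    def __init__(self, loader, defaults):
        self._loader = loader              # callable returning a dict; may raise
        self._defaults = dict(defaults)    # static fallback settings
        self._current = dict(defaults)     # used until the first successful load

    def refresh(self):
        """Try to reload; keep the previous settings if the source is down."""
        try:
            fresh = self._loader()
        except Exception:
            return False
        # Remote values override the defaults; missing keys fall back to them.
        self._current = {**self._defaults, **fresh}
        return True

    def get(self, key):
        return self._current[key]
```

Before the first successful load the service runs on the static defaults, and after an outage it keeps the last values it managed to fetch. Both behaviors are worth writing down, as the last item suggests.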

Release Management

  • Document the release process in detail. Describe how releases affect your SLOs (for example, temporary latency increases due to cache misses).
  • Document canary releases.
  • Develop a review plan for canary releases and, if possible, automatic rollback mechanisms.
  • Ensure that rollbacks can use the same processes as the deployment.
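
A review plan for canary releases usually boils down to a rule that compares the canary with the current version. The sketch below shows one possible rule based on error rates. The function name, the thresholds (max_ratio, min_requests), and the 0.1% floor are assumptions chosen for illustration, not recommended values.

```python
def canary_verdict(baseline_errors, baseline_total,
                   canary_errors, canary_total,
                   max_ratio=2.0, min_requests=100):
    """Return 'wait', 'promote', or 'rollback' for a canary release."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # A small absolute floor keeps a zero-error baseline from forcing
    # a rollback on the very first canary error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed else "promote"
```

An automated pipeline could call such a function periodically during the canary window and trigger the regular deployment process with the previous version when the verdict is "rollback".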

Observability

  • Ensure that the metrics required for your SLOs are collected.
  • Make sure you can distinguish between client and server data. This is important for troubleshooting.
  • Tune your alerts to reduce toil. For example, remove alerts triggered by routine operations.
  • If you use Stackdriver, please include GCP platform metrics in your dashboards. Set up alerts for GCP dependencies.
  • Always propagate the incoming trace context. Even if your service does not participate in tracing, this allows downstream services to debug problems in production.
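
Passing the trace along can be as simple as copying the trace headers from the incoming request to every outgoing one. The sketch below assumes the W3C Trace Context header names (traceparent and tracestate); your tracing system may use different headers, and the helper outgoing_headers is hypothetical.

```python
TRACE_HEADERS = ("traceparent", "tracestate")  # W3C Trace Context header names


def outgoing_headers(incoming, extra=None):
    """Build headers for a downstream call, copying trace context through.

    `incoming` is a mapping of the request headers received by this service.
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in incoming.items()}
    headers = dict(extra or {})
    for name in TRACE_HEADERS:
        if name in lowered:
            headers[name] = lowered[name]
    return headers
```

A service that does not record spans itself forwards the values unchanged, so the services behind it still see one continuous trace.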

Security

  • Make sure all external connections are encrypted.
  • Make sure your production projects have the correct IAM setup.
  • Use networks to isolate groups of VM instances.
  • Use a VPN to securely connect to remote networks.
  • Document and monitor user access to data. Ensure that all user access to data is audited and logged.
  • Make sure debugging endpoints are restricted by ACLs.
  • Sanitize user input. Set payload size limits for user input.
  • Make sure your service can selectively block incoming traffic from individual users. This lets you stop abuse without affecting other users.
  • Avoid external endpoints that initiate a large number of internal operations.
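
Two of the items above, payload size limits and per-user blocking, can be combined in a small request gate. This is a minimal sketch: RequestGate and the 64 KiB limit are assumptions for the example, and a real service would typically enforce this in its load balancer or framework middleware.

```python
MAX_PAYLOAD_BYTES = 64 * 1024  # illustrative limit, tune for your service


class RequestGate:
    """Decide whether a request may proceed before doing any real work."""

    def __init__(self, max_bytes=MAX_PAYLOAD_BYTES):
        self.max_bytes = max_bytes
        self.blocked = set()

    def block(self, user_id):
        self.blocked.add(user_id)

    def unblock(self, user_id):
        self.blocked.discard(user_id)

    def check(self, user_id, payload: bytes):
        """Return an HTTP-style status code for the request."""
        if user_id in self.blocked:
            return 403  # Forbidden: this user is blocked, others are unaffected
        if len(payload) > self.max_bytes:
            return 413  # payload exceeds the configured size limit
        return 200
```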

Capacity Planning

  • Document how your service scales. For example: number of users, size of incoming payload, number of incoming messages.
  • Document the resource requirements for your service. For example: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPU or TPU.
  • Document resource limits: resource type, region, etc.
  • Document quota limits for creating new resources. For example, the limit on the number of GCE API requests if you use the API to create new instances.
  • Consider running load tests to analyze performance degradation.
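
Documented scaling figures can be turned into a rough capacity estimate. The sketch below assumes you know the peak request rate and the throughput a single instance sustained in load tests. The function name, the 70% target utilization, and the one spare instance are illustrative assumptions.

```python
import math


def required_instances(peak_rps, rps_per_instance,
                       target_utilization=0.7, spare=1):
    """Estimate how many instances are needed to serve the peak load.

    Instances are planned to run at `target_utilization` of their measured
    throughput, and `spare` extra instances cover the loss of a node.
    """
    if rps_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    usable = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable) + spare
```

Comparing this number with the documented resource and quota limits shows early whether the service can actually be scaled to the expected load.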

That's all. See you in class!

Source: habr.com
