Production Readiness Checklist

This translation was prepared for students of the DevOps Practices and Tools course.

Have you ever released a new service to production, or maintained one? If so, what guided you? What is good practice for production and what is not? How do you train new team members to release or maintain existing services?

When it comes to production practices, most companies end up with a “Wild West” approach: each team determines its own tools and best practices through trial and error. This often hurts not only the success of projects, but also the engineers themselves.

Trial and error creates an environment where blame and finger-pointing are common. In such an environment, it becomes increasingly difficult to learn from mistakes and avoid repeating them.

Successful organizations:

recognize the need for production guidelines,
study best practices,
start the production readiness discussion when new systems or components are being designed,
ensure compliance with production readiness rules.

Preparing for production includes a review process. A review can take the form of a checklist or a set of questions, and it can be carried out manually, automatically, or both. Instead of static lists of requirements, you can create checklist templates that teams adapt to their specific needs. This gives engineers a way to inherit knowledge while leaving enough flexibility where it is needed.

When should you check a service for production readiness?

It is useful to run a production readiness check not only immediately before a release, but also when handing the service over to another operations team or a new employee.

Check when you:

Release a new service to production.
Transfer the operation of a production service to another team, such as SRE.
Hand over a production service to new employees.
Organize technical support.

Production Readiness Checklist

Some time ago I published an example production readiness checklist. Although the list grew out of work with Google Cloud customers, it is useful and applicable outside of Google Cloud as well.

Design and development

Design a reproducible build process that does not require access to external services and is not affected by outages of external systems.
During design and development, define and set SLOs for your services.
Document the expectations for the availability of external services you depend on.
Avoid a single point of failure by removing dependencies on one global resource. Replicate the resource or use the fallback option when the resource is unavailable (for example, a hard-coded value).
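
The fallback option in the last item can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the names `get_with_fallback`, `fetch_rate_limit_from_global_store`, and `DEFAULT_RATE_LIMIT` are hypothetical, and the outage is simulated.

```python
from typing import Callable

# Hypothetical hard-coded fallback, used only when the global resource is down.
DEFAULT_RATE_LIMIT = 100


def get_with_fallback(fetch: Callable[[], int], fallback: int) -> int:
    """Return the value from `fetch`, or `fallback` if the resource is unavailable."""
    try:
        return fetch()
    except (ConnectionError, TimeoutError):
        # The dependency is unavailable: degrade gracefully instead of failing.
        return fallback


def fetch_rate_limit_from_global_store() -> int:
    """Stand-in for a call to a single global resource (simulated outage)."""
    raise ConnectionError("global store unreachable")


rate_limit = get_with_fallback(fetch_rate_limit_from_global_store, DEFAULT_RATE_LIMIT)
```

Here the service keeps running with the hard-coded value while the global resource is unreachable. Whether a stale or default value is acceptable depends on the resource.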

Configuration management

Static, small, and non-secret configurations can be passed through command line options. For the rest, use configuration storage services.
Dynamic configuration must have fallback settings in case the configuration service is unavailable.
Keep the development environment configuration separate from the production configuration. Otherwise, the development environment may end up accessing production services, which can cause privacy problems and data leaks.
Document what can be configured dynamically and describe fallback behavior if the configuration delivery system is unavailable.
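
One possible shape for dynamic configuration with fallback settings is shown below. It is a sketch under assumptions: `DynamicConfig`, `BACKUP_SETTINGS`, and the setting names are invented for illustration, and the configuration service is represented by any callable that may raise `OSError`.

```python
from typing import Callable, Dict

# Hypothetical backup settings shipped together with the service.
BACKUP_SETTINGS: Dict[str, str] = {"feature_x": "off", "batch_size": "50"}


class DynamicConfig:
    """Keeps the last successfully fetched settings and falls back to backups."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch
        self._settings = dict(BACKUP_SETTINGS)
        self.using_backup = True

    def refresh(self) -> None:
        try:
            fetched = self._fetch()
        except OSError:
            # Configuration service unavailable: keep the settings we already have.
            return
        # Backup values fill in any key the service did not return.
        self._settings = {**BACKUP_SETTINGS, **fetched}
        self.using_backup = False

    def get(self, key: str) -> str:
        return self._settings[key]
```

The `using_backup` flag makes the fallback state observable, so it can be exported as a metric or documented as part of the fallback behavior.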

Release management

Document the release process in detail. Describe how releases affect SLOs (for example, a temporary increase in latency due to cache misses).
Document canary releases.
Develop a canary release analysis plan and, if possible, automatic rollback mechanisms.
Make sure rollbacks can use the same processes as deployment.
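
As a rough illustration of canary analysis feeding an automatic rollback decision, the sketch below compares the canary's error rate with the baseline's. The thresholds (`max_increase`, `min_requests`) and the names `Stats` and `should_roll_back` are assumptions made for this example. Real canary analysis usually looks at more signals, such as latency.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: Stats, canary: Stats,
                     max_increase: float = 0.01, min_requests: int = 100) -> bool:
    """Return True when the canary's error rate exceeds the baseline's by more than max_increase."""
    if canary.requests < min_requests:
        # Not enough canary traffic yet to judge either way.
        return False
    return canary.error_rate - baseline.error_rate > max_increase
```

A deployment pipeline could call `should_roll_back` periodically during the canary phase and trigger the regular deployment process with the previous version when it returns `True`.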

Observability

Make sure that you are collecting the metrics required for your SLOs.
Make sure you can distinguish between client-observed and server-observed data. This is important for troubleshooting.
Tune alerts to reduce toil. For example, remove alerts triggered by routine operations.
If you use Stackdriver, include the GCP platform metrics in your dashboards. Configure alerts for GCP dependencies.
Always propagate the incoming trace context. Even if your service does not participate in the trace, this allows downstream services to debug production problems.
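
Propagating the trace context can be as simple as copying the trace headers from the incoming request to every outgoing request. The sketch below assumes HTTP headers are available as plain dictionaries. It uses the W3C Trace Context header names (`traceparent`, `tracestate`) and the `X-Cloud-Trace-Context` header used by Google Cloud. The function name `propagate_trace` is hypothetical, and in practice a tracing library would normally do this for you.

```python
from typing import Dict, Mapping

# W3C Trace Context header names, plus the header Google Cloud uses.
TRACE_HEADERS = ("traceparent", "tracestate", "x-cloud-trace-context")


def propagate_trace(incoming: Mapping[str, str], outgoing: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `outgoing` with trace headers copied from `incoming`."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in incoming.items()}
    result = dict(outgoing)
    for name in TRACE_HEADERS:
        if name in lowered:
            result[name] = lowered[name]
    return result
```

Only the trace headers are forwarded, so other headers of the incoming request do not leak to downstream services.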

Security

Make sure all external connections are encrypted.
Make sure your production projects have the correct IAM setup.
Use networks to isolate groups of virtual machine instances.
Use a VPN to securely connect to remote networks.
Document and monitor user access to data. Make sure that all user access to data is audited and logged.
Ensure that debug endpoints are restricted by ACLs.
Sanitize user input. Configure payload size limits for user input.
Make sure your service can selectively block incoming traffic for individual users. This lets you stop abuse without affecting other users.
Avoid external endpoints that initiate a large number of internal operations.
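
Two items from this list, payload size limits and selective blocking of individual users, can be sketched together. This is an illustration under assumptions: `RequestGate` and `MAX_PAYLOAD_BYTES` are invented names, the limit is arbitrary, and the block list lives in memory, whereas a real service would likely keep it in shared, dynamically updated configuration.

```python
from typing import Set

# Hypothetical limit; choose one that fits your service.
MAX_PAYLOAD_BYTES = 1024 * 1024


class RequestGate:
    """Rejects oversized payloads and traffic from individually blocked users."""

    def __init__(self, max_payload_bytes: int = MAX_PAYLOAD_BYTES) -> None:
        self.max_payload_bytes = max_payload_bytes
        self.blocked_users: Set[str] = set()

    def block(self, user_id: str) -> None:
        self.blocked_users.add(user_id)

    def unblock(self, user_id: str) -> None:
        self.blocked_users.discard(user_id)

    def check(self, user_id: str, payload: bytes) -> int:
        """Return an HTTP status code: 403 if blocked, 413 if too large, else 200."""
        if user_id in self.blocked_users:
            return 403
        if len(payload) > self.max_payload_bytes:
            return 413
        return 200
```

Calling `check` before any expensive work keeps a single abusive user from consuming resources that other users need.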

Capacity planning

Document how your service scales. For example: number of users, size of incoming payload, number of incoming messages.
Document resource requirements for your service. For example: the number of allocated virtual machine instances, the number of Spanner instances, specialized hardware such as GPUs or TPUs.
Document resource constraints: resource type, region, etc.
Document quota limits for creating new resources. For example, limiting the number of GCE API requests if you use the API to create new instances.
Consider running load tests to analyze performance degradation.
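
Documented scaling characteristics can be turned into a simple estimate. The sketch below assumes the service scales linearly with queries per second, which is a simplification. The function names, the 60% target utilization, and the numbers in the usage note are made up for illustration.

```python
def required_instances(peak_qps: int, qps_per_instance: int,
                       target_utilization_percent: int = 60) -> int:
    """Instances needed to serve peak_qps while staying at the target utilization."""
    usable_qps_x100 = qps_per_instance * target_utilization_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(peak_qps * 100) // usable_qps_x100)


def fits_quota(instances: int, quota: int) -> bool:
    """Check the estimate against a documented quota for new resources."""
    return instances <= quota
```

For example, `required_instances(4500, 200)` gives 38 instances, and `fits_quota(38, 24)` shows that a quota of 24 instances would have to be raised before the service can reach that size.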

