👘 💤 🗞️ Defender programmer stronger than entropy 👼 🏫 🚣🏻

© Dragon Ball. Goku.

The programmer-defender at any time and anywhere in the code expects the appearance of potential problems and writes the code in such a way as to protect themselves from them in advance. And if the problem cannot be defended, then at least make sure that its consequences and impact on users are minimal.

I recall the FlashForward effect from Hollywood blockbusters, when the main character sees the impending catastrophe and remains extremely calm, because he knows in advance that it will happen and has protection from it. The idea behind defensive programming is to defend against issues that are difficult or impossible to foresee. The security programmer expects errors to occur anywhere in the system and at any given time to prevent them before they cause damage. However, the goal is not to create a system that never crashes, it is still impossible. The goal is to create a system that crashes gracefully in the event of any unforeseen problem.

Let's understand in more detail what is included in the concept of "fall gracefully."

Fall fast. In the event of an unexpected error, all operations should be completed immediately, especially if subsequent calculations are difficult or may lead to data corruption.
Fall neatly. If an error occurs, the program should free all resources, remove locks, delete temporary and half-recorded files, close connections. Wait for the completion of critical operations, the interruption of which may lead to unpredictable results. Or a safe way to crash these operations.
Fall clearly and beautifully. If something is broken, the error message should be simple, concise and contain important details from the context of the system where the error occurred. This will help the team that is responsible for the system to figure out the problem as quickly as possible and fix it.

But you may have a question.

Why waste time on problems that may arise in the future? Now they are not there, the code works just perfect. In addition, problems may never happen at all. After all, professionals are not involved in engineering for the sake of engineering ( YAGNI - You aren't gonna need it)!

The main thing is pragmatism

Andrew Hunt in his book "Programmer-pragmatist" gives the following definition of defensive programming - " pragmatic paranoia ."

Protect your code from:

own mistakes;
other people's mistakes;
errors and failures in other systems with which your is integrated;
errors of iron, environments and platforms on which your application works.

Let us discuss several tactical and strategic methods of defensive programming, following which will create a reliable and predictable system that is resistant to arbitrary failures.

Some tips may seem “captain's”, but in practice, many developers do not even follow them. But if you adhere to simple practices and approaches, this will significantly increase the stability of your system.

Do not trust anyone

User data is unreliable by default. Users often misunderstand what seems obvious to us (as system developers). Expect incorrect data at the input and always check it.

Also check the amount of input. It may be that the user sends too many of them. At the same time, from the point of view of business logic, this is the correct scenario. But it can lead to too long processing. What can be done with this? For example, run it asynchronously, if the amount of input data exceeds a certain threshold and the specifics of the business allows you to process the data in the background.

Application settings (for example, configuration files) are also subject to the appearance of incorrect data in them. Often, program settings are stored in JSON, YAML, XML, INI, and other formats. Since all these are text files, you should expect that sooner or later someone will change something in them, and your program will start to work incorrectly. It can be either an end user or someone from your team.

Databases, files, centralized storages of configs, registry - all these places can be accessed by other people, and sooner or later they will change something there ( Murphy's law ).

Garbage inlet → garbage inlet

Inputs that have passed validation and are being processed should be clean if you want your code to do exactly what you expect from it.

However, it’s good practice to do additional data validation checks, including when they have already started to be processed. In critical places (billing, authorization, personal and confidential data, etc.) this is an almost mandatory requirement. This is necessary so that in case of bugs in the code or problems with the input data validator, stop the execution flow as quickly as possible. It is difficult to make a quality validation with checking all possible error scenarios, so you can use simpler methods to validate that the program is still executing correctly - assertions and exceptions.

Healthy paranoidness is a characteristic feature of all professional developers. But it is very important to seek the optimal balance and understand when the solution is already good enough.

Separate configs by surroundings

A common cause of problems is the insufficient separation of configs between environments or the absence of such a separation.

This can lead to many problems, for example:

the test environment starts reading and / or writing data from production, databases, queues and other resources;
test environment uses external integrations and services with a production account;
mixing statistics, metrics, errors from different environments;
security breach (developers, testers and other team members gain access to production resources);
hard-to-investigate bugs on production (for example, part of the messages in the queue is lost due to the test environment beginning to read it).

These are just examples, a complete list of problems that may be caused by insufficiently responsible separation of configs is almost endless and depends on the specifics of the project.

Responsible separation of configuration data by environment can significantly reduce the likelihood of immediately a whole class of problems associated with:

security
reliability;
support and deployment (DevOps engineers will thank you).

In addition, it is good practice to store secret data (keys, tokens, passwords) in a separate place specially designed for storing and processing secrets. Such systems securely encrypt data, have flexible means for managing access rights, and also allow you to quickly change keys if they have been compromised. In this case, you do not need to make changes to the code and re-deploy the application. This is especially important for systems that work with financial transactions, confidential or personal data.

Remember the cascading effect

A common cause of the fall of large and complex systems is the cascading effect. The breakdown or degradation of the functionality of one of the parts of the system occurs, and one by one the other subsystems associated with it begin to fail. Cascaded until the entire system becomes completely inaccessible.

A few protective tricks:

Use progressive (exponential) timeouts with a random element
set reasonable values for connection timeout and socket timeout;
foresee a fallback in advance in case of failure of individual services. It’s better to temporarily degrade some of the functionality, disable the services altogether, but do not risk breaking the entire system. But foresee that in this case the user sees an understandable and not frightening message, and the support and development team as soon as possible finds out about the problem.

Report problems quickly

All systems fail. Sometimes something strange happens in them that the creators expect "once every 10 years." Integrations and external APIs periodically become unavailable or respond incorrectly. Making fallback for all such cases is often difficult, long, or simply impossible. Foresee this situation in advance and report it as quickly as possible. Logging to the ERROR level or to the monitoring system - for granted. Adding additional validation to healthcheck is even better. To send a message from the code to Slack, Telegram, PagerDuty or another service that will instantly notify your team about the problem is ideal.

But it’s important to clearly understand when it makes sense to send messages directly. Only if an error, a suspicious or atypical situation is associated with business processes and it is important that a particular person or group of people in a team receive a notification as quickly as possible and can respond.

All other technical problems and deviations should be handled by standard means - monitoring, alerting, logging.

Cache frequently used and / or recent data

Programs and people have one thing in common - they tend to reuse data that is often used or recently encountered. In highly loaded systems, you should always remember this and cache data in the hottest places in the system.

The caching strategy is highly dependent on the specifics of the project and the data. If the data is mutable, there is a need for cache invalidation. Therefore, consider in advance how you will do this. And also think about what risks may occur if outdated data appears in the cache, the cache goes down, etc.

Replace expensive operations with cheap ones

Working with strings is one of the most common operations in any program. And if this is not done optimally, it can be an expensive operation. In different programming languages, the specifics of working with strings may vary, but you should always remember about it.

In large applications with a large code base, code written many years ago is often found that works without errors, but is not optimal in terms of performance. Often, a banal change in the data structure from an array / list to a hash table gives a serious boost (even if only in a local place in the code).

Sometimes you can improve performance by rewriting the algorithm to use bitwise operations. But even in those rare cases when it is justified, the code is very complex. Therefore, when making a decision, consider the readability of the code and the fact that it will need to be supported. The same goes for other tricky optimizations: almost always, such code becomes difficult to read and very difficult to maintain. If you still decide on tricky optimizations, do not forget to write comments describing what you want this code to do and why it is written that way.

At the same time, optimization should be treated with healthy pragmatism:

if it takes you, as a developer, for a few seconds or minutes - it makes sense to do it right away;
if more, then it is reasonable to do it right away only when you are 100% sure of its necessity. In all other cases, it makes sense to postpone it, write in TODO code, collect more information, consult with colleagues, etc.

Premature optimization is the root of all evil (Donald Knuth)

Rewrite in a lower level language

This is an extreme measure. Low-level languages are almost always faster compared to higher-level languages. But this solution has a price - developing such a program is longer and more difficult. Sometimes, rewriting critical parts of the system in a low-level language, you can achieve a serious increase in productivity. But there are side effects - usually such solutions lose in cross-platform and their support is more expensive. Therefore, make a decision carefully.

There is safety in numbers

In conclusion, I would like to note one more important thing, perhaps the most important one. The measures that we considered in the previous paragraphs will work only if all team members adhere to them and everyone has an understanding of who is responsible for what and what needs to be done in case of a critical situation. It is important after fixing the problem to hold a meeting (Post Mortem) with all interested people and find out why this problem arose and what can be done to prevent this problem from happening in the future. In many cases, both technical and process changes are required. With each new Post Mortem, your system will become more reliable, the team will be more experienced and cohesive, and the entropy in the universe will be slightly less;)

The article partially uses materials from Why Defensive Programming is the Best Way for Robust Coding (Ravi Shankar Rajan).

Defender programmer stronger than entropy