What is wrong with data validation and what does the Liskov substitution principle have to do with it?





If you sometimes catch yourself asking “is this data already valid by the time it reaches this method?” and then choose between “what if it blows up” and “better to check just in case”, then welcome ...



Correction: as lorc and 0xd34df00d pointed out, what is discussed below is known as dependent types; you can read about them here. The original text with my thoughts on the subject follows below.



During development, we often need to verify that data is valid for some algorithm. Formally, it looks like this: we receive some data structure, check its value against a certain range of acceptable values, and pass it on. Later, the same data structure may be subjected to the same check. If the structure has not changed in the meantime, re-checking its validity is obviously wasted work.
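As a rough illustration, here is a minimal Swift sketch with hypothetical names (Report, isValid): every layer that is unsure whether the data has already been validated tends to repeat the same check.

class Report {
    let rows: [Int]
    init(rows: [Int]) { self.rows = rows }
}

// Hypothetical validity rule: a report must be non-empty.
func isValid(_ report: Report) -> Bool {
    return !report.rows.isEmpty
}

func render(_ report: Report) {
    guard isValid(report) else { return }   // first check
    // ... draw the report ...
}

func export(_ report: Report) {
    guard isValid(report) else { return }   // the same check again, "just in case"
    // ... write the report to disk ...
}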



Although validation can indeed be expensive, the problem is not only performance. Far more unpleasant is the extra responsibility: the developer is never sure whether the structure needs to be checked again. The opposite mistake is just as likely: the check is omitted entirely on the incorrect assumption that the structure was already validated earlier.



As a result, bugs creep into methods that expect an already validated structure and misbehave when given a structure whose value falls outside the range of acceptable values.



There is a deeper, less obvious problem here. A valid data structure is, in effect, a subtype of the original structure. From this point of view, a method that accepts only valid objects is equivalent to the following code in a fictional language:



class Parent { ... }

class Child : Parent { ... }

...

void processValidObject(Parent parent) {
    if (parent is Child) {
        // process
    } else {
        // error
    }
}





The problem should now be much clearer: this is a textbook violation of the Liskov substitution principle. Why violating the substitution principle is bad is explained, for example, here.



The problem of passing around invalid objects can be solved by introducing a subtype of the original data structure. For example, objects can be created through a factory that, given the original structure, returns either an object of the valid subtype or null. If we then change the signatures of the methods that expect a valid structure so that they accept only the subtype, the problem disappears. Besides the confidence that the system behaves correctly, the number of validations per square centimeter of code goes down. Another plus is that we shift the responsibility for validating data from the developer to the compiler.
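Continuing the hypothetical Report sketch from above, here is one possible shape of this factory in Swift: the subtype can only be obtained through a single method that performs the validation, and downstream code accepts only the subtype.

class Report {
    let rows: [Int]
    init(rows: [Int]) { self.rows = rows }
}

// ValidReport can only be created via the factory method, so holding
// a ValidReport is proof that validation has already happened.
final class ValidReport: Report {
    private init(validated rows: [Int]) { super.init(rows: rows) }

    static func make(from report: Report) -> ValidReport? {
        guard !report.rows.isEmpty else { return nil }   // the single validation point
        return ValidReport(validated: report.rows)
    }
}

// The signature now states the requirement; no check inside.
func export(_ report: ValidReport) {
    // ... write the report to disk ...
}

A caller writes if let valid = ValidReport.make(from: report) { export(valid) }, and an attempt to pass a plain Report to export simply does not compile.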



Swift solves the null-checking problem at the syntax level. The idea is to split types into non-nullable and nullable (optional) ones, and it is done as syntactic sugar, so the programmer does not have to declare a new type. Declaring a variable as ClassName guarantees that it is not nil, while ClassName? means it may be nil. There is also covariance between the two: an object of type ClassName can be passed to methods that accept ClassName?.
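A tiny illustration (the describe function is made up for the example):

func describe(_ name: String?) {             // accepts an optional
    print(name ?? "no name")
}

var definitelyAName: String = "report.pdf"   // non-optional: cannot be nil
var maybeAName: String? = nil                // optional: may be nil

describe(definitelyAName)   // a non-optional value fits where an optional is expected
describe(maybeAName)
// definitelyAName = nil    // does not compile: the type rules out nil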



This idea can be extended to user-defined ranges of acceptable values. Encoding the fact that a value lies within its acceptable range directly in the type eliminates the problems described above. It would be nice to have language-level support for such a tool, but the same behavior can also be implemented in "regular" OO languages such as Java or C# using inheritance and a factory.
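As a sketch of what storing the acceptable range in the type could look like (Swift again; the same shape is expressed with a class and a static factory in Java or C#), consider a hypothetical Percentage type that admits only values from 0 to 100:

struct Percentage {
    let value: Int

    // The only way to obtain a Percentage is through this failable initializer,
    // so every Percentage in the program is known to lie within 0...100.
    init?(_ value: Int) {
        guard (0...100).contains(value) else { return nil }
        self.value = value
    }
}

func applyDiscount(_ discount: Percentage) {
    // No range check needed here: the type carries the guarantee.
}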



The situation with data validation is yet another confirmation that entities in OOP come not from the real world but from engineering needs.



UPD: as rightly noted in the comments, it is worth creating subtypes only when doing so buys additional reliability and reduces the number of identical validations.



The article also lacked an example, so here is one. Suppose file paths arrive at the input of our system. In some cases the system works with all files, and in others only with files we have access to. We then pass the paths on to different subsystems, which likewise work with both accessible and inaccessible files, and those subsystems pass them further still, where again it is unclear whether a given file is accessible. As a result, an access check appears at every dubious point, or, conversely, gets forgotten. The system grows more complicated because of this pervasive ambiguity and the scattered checks, and the checks themselves hit the disk and are generally heavy. You could cache the result of the check in a Boolean field, but that does not remove the need to perform the check in the first place. I suggest shifting the responsibility for checking from the developer to the compiler.
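Applied to this example, a possible Swift sketch (the type names are hypothetical; FileManager.isReadableFile(atPath:) stands in for the heavy access check):

import Foundation

class AnyFile {
    let path: String
    init(path: String) { self.path = path }
}

// Can only be produced by the factory, so it certifies that the
// access check has already been performed, exactly once.
final class AccessibleFile: AnyFile {
    private init(checked path: String) { super.init(path: path) }

    static func make(from file: AnyFile) -> AccessibleFile? {
        // The one place where the expensive disk check happens.
        guard FileManager.default.isReadableFile(atPath: file.path) else { return nil }
        return AccessibleFile(checked: file.path)
    }
}

// Subsystems declare in their signatures whether they need access,
// instead of re-checking (or forgetting to check) inside.
func index(_ file: AccessibleFile) { /* ... */ }
func preview(_ file: AnyFile) { /* works with any file, accessible or not */ }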


