Kaizen in software development - from my own experience

The term "kaizen" was popularized by Toyota, and a lot has been written about it in the thick books of the "Toyota Way" series.



Kaizen is also called the "process of continuous improvement." It is usually associated with industrial car production and assembly lines, or at least with process management. Little is said about it in connection with development, yet kaizen is very well suited to software development.



Below are several cases that gradually led the author to an understanding of kaizen in development.



A series of troubles



The first case happened right on New Year's Eve, at 20:00. A hard drive failed in the server, so we had to drop the salad preparation and rush to Moscow, to the traffic exchange site, to replace the broken drive.



After the hard drive, the motherboard burned out. We replaced everything, and then decided to figure out why this kept happening.



Sysadmins we knew made it clear that server hardware has to be chosen very carefully: do not buy from just anywhere, and build in redundancy for everything, at every level.

We decided to replace the server and to choose the supplier very carefully. We studied the server hardware, asked for recommendations, and chose a server that then ran without a single stop or interruption for 7 years. It is still running now, although 5 years have passed since the author left that job.



Some time later, a fire broke out at the hosting site. The server miraculously survived. Everyone was badly shaken, because the whole business could have been wiped out.



After that, a mirror of the website and the database was set up at a separate, completely independent location at the other end of the city. And once it was even used.



There was also a case when traffic started arriving at a newly launched website and the website suddenly stopped completely: it simply could not withstand the load.



An investigation showed that the outsourcing company had built the website in such a way that it could not handle more than 200 visitors a day. Funny and sad.



After that, we decided to abandon outsourcing and form our own development team.



With the team in place, we ran into another problem: fixing any bug caused an avalanche of new bugs. Almost any change nearly brought down the whole website.



Each fix led to a great many new problems. When we analyzed the situation, we realized that everything had to change fundamentally, all of the internals. So the whole website was completely refactored and its entire architecture was turned upside down. Only after that did the situation change radically and the problems disappear.



Eliminate the root cause



All these solutions had one thing in common: each was aimed at making sure that the underlying root problem never arises again, that it is eliminated completely. Entirely. The same problem must never happen again.



Do you understand?



It is elementary: the server broke down, so we realized that we had to choose the right hardware, hardware that would not fail.



The hosting site caught fire, so we set up a mirror to rule out a similar situation in the future.



Back then, we did not even know there was such a word as kaizen.



5 whys



The root cause of a problem does not always lie on the surface; sometimes you have to dig for it.



A good example is given in one of the "Toyota Way" books. At a factory, it was discovered that one of the machines stood idle for a large part of the day.



Why are there breaks in its work? It turned out that the machine is stopped for cleaning.

There are metal shavings and dirt around the machine. Naturally, if shavings pile up, they have to be removed, otherwise it is impossible to work. So everything is fine, right?



But kaizen says: you have to dig to the root cause.



Why do the shavings pile up? The answer comes right away: they pile up because they have nowhere to go. The machine has no device to remove and collect them. If it had such a device, the machine would not have to be stopped.



Well then, let's come up with a solution that removes the shavings from the machine so that it does not have to stop for cleaning at all. That solution is purely technical and quite simple.



There is a very simple technique for determining the root cause: the well-known “5 whys” method.

Kaizen recommends using it to get to the root causes.



Work through the causes of the problem in sequence, one after another:





With the help of the “5 whys” we find the root cause, come up with a solution to eliminate it, assign an owner and a deadline, and check progress toward the result every week.

Just keep in mind that any problem can be solved in an expensive way or in a cheap way.



Kaizen says: first choose the cheapest way. It is usually the simplest and the best.



Kaizen in software development



And now a few recent examples from the life of a software development team.



Failed jobs



The team deploys its work to Prod through Jenkins. For this purpose, Jenkins is essentially a scheduler like crontab that can run jobs on a schedule. And the team had such jobs.



One day it was discovered that the Prod jobs had failed 5 times in a row with the Failure status. And no one had paid attention, even though every failure on Prod should really set off a general alarm.



The team then looked for the cause using the “5 whys” method:





The solution was obvious: for test jobs, failure notifications are no longer sent to anyone except the owner of the job, and even then only if the owner wants them.



In addition, it was officially established that any notification from a job is an emergency that everyone must respond to.
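
The routing rule described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual Jenkins configuration: the `Job` fields, the "prod" and "test" labels, and the `recipients` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    environment: str                 # "prod" or "test" (hypothetical labels)
    owner: str
    owner_wants_alerts: bool = True  # hypothetical opt-in flag for test jobs


def recipients(job: Job, team: List[str]) -> List[str]:
    """Return who should be notified when the job fails."""
    if job.environment == "prod":
        # Every Prod failure is an emergency: alert the whole team.
        return list(team)
    # Test job: at most the owner, and only if the owner asked for it.
    return [job.owner] if job.owner_wants_alerts else []


team = ["alice", "bob", "carol"]
print(recipients(Job("nightly-load", "prod", "alice"), team))
print(recipients(Job("experiment", "test", "bob"), team))
print(recipients(Job("scratch", "test", "carol", owner_wants_alerts=False), team))
```

With test noise removed, any notification that does reach the whole team comes from Prod and can be treated as an emergency.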



A script that stopped short



The second example is a problem with the QlikView application.



One day the team was told that their QlikView reports looked wrong somehow. Everything seemed to work, but the data was off. The team started to investigate.



It turned out that the load script was not running to the end. Why not? Because the script contained a lot of code, and somewhere in the middle there was a debugging Exit Script statement that someone had simply forgotten to remove and nobody had noticed. As a result, the load script stopped partway through and the old data was used.



Why did nobody notice? Because testing could not reveal it, owing to the architecture. The application was split into two independent parts: the back end (the load script) and the front end. There was a lot of data, so the team tried not to reload it more often than necessary, to avoid losing a lot of time.



The front end was deliberately decoupled from the load script. It simply took a data file and displayed it. Nothing indicated that the data file was old, so it was impossible to tell that something had gone wrong.



What did the team come up with to avoid a similar situation in the future?



The team asked themselves: how can we be sure to notice such a situation next time? How can we make it obvious that the load script did not run to the end?



The team decided to set a marker at the beginning of the script and delete it at the very end. That is, if the marker has not been deleted, the script did not run to the end. The front end checks for the marker and, if it is present, displays a red banner covering half the page with a notification about the problem.



This does not completely rule out such problems, but at least they now become known immediately. A cheap, simple solution.
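
The marker idea can be sketched as follows. This is a minimal Python illustration rather than the team's QlikView script: the marker file name and the `run_load` and `banner_text` helpers are assumptions, and a step returning False stands in for a stray early exit such as a forgotten Exit Script.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional


def run_load(marker: Path, steps: Iterable[Callable[[], bool]]) -> None:
    """Set the marker, run the load, and clear the marker only on full completion."""
    marker.write_text("load in progress")  # set at the very start
    for step in steps:
        if step() is False:                # stand-in for an early exit
            return                         # the marker stays in place
    marker.unlink()                        # reached only if every step ran


def banner_text(marker: Path) -> Optional[str]:
    """Text for the front end: a warning if the last load did not finish."""
    if marker.exists():
        return "WARNING: the last data load did not finish; data may be stale"
    return None


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "load.incomplete"  # hypothetical marker file
    run_load(marker, [lambda: True, lambda: False, lambda: True])
    print(banner_text(marker))              # the warning text
    run_load(marker, [lambda: True, lambda: True])
    print(banner_text(marker))              # None
```

An unhandled exception in a step has the same effect as an early exit: the marker is never removed, so the front end still shows the warning.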



Data Loss on Reboots



A monitoring web service received data from production environments. From time to time the service had to be extended and fixed, and every change required a restart. Although a restart took only a couple of seconds, production data could arrive during that window and would be lost for certain. Losing it was unacceptable, so the team decided to analyze the problem more deeply.



The “5 whys” questions made it clear that the root cause of the problem was the architecture: it simply did not allow any other way of working. No matter how the service was tuned, no matter what was done with it, everything still came down to a restart.



A new architecture solved the problem once and for all: the service was split into two independent parts, one for receiving data and one for processing it. The parts were physically separated, so the handler could safely be turned off while the receiver kept working and saving everything that arrived. Most importantly, the receiver was built so that it never needed a restart. Handlers could be safely turned off, modified, and restarted without any worry that data might be lost.
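
The split can be sketched as follows. The source does not say how the two parts exchanged data, so the file-based spool, the offset file, and the `Receiver` and `Handler` classes below are assumptions chosen to keep the example small.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class Receiver:
    """Does one thing only: append every incoming record to a spool file."""

    def __init__(self, spool: Path) -> None:
        self.spool = spool

    def receive(self, record: dict) -> None:
        with self.spool.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


class Handler:
    """Processes spooled records; can be stopped and restarted at any time."""

    def __init__(self, spool: Path, offset_file: Path) -> None:
        self.spool = spool
        self.offset_file = offset_file

    def process_new(self) -> List[dict]:
        # Number of records already processed by any earlier handler instance.
        done = int(self.offset_file.read_text()) if self.offset_file.exists() else 0
        lines = self.spool.read_text(encoding="utf-8").splitlines() if self.spool.exists() else []
        new = [json.loads(line) for line in lines[done:]]
        self.offset_file.write_text(str(len(lines)))
        return new


with tempfile.TemporaryDirectory() as tmp:
    spool, offset = Path(tmp) / "spool.jsonl", Path(tmp) / "offset"
    receiver = Receiver(spool)
    receiver.receive({"value": 1})
    print(Handler(spool, offset).process_new())  # [{'value': 1}]
    # The handler is "down" for an upgrade; the receiver keeps accepting data.
    receiver.receive({"value": 2})
    receiver.receive({"value": 3})
    print(Handler(spool, offset).process_new())  # [{'value': 2}, {'value': 3}]
```

The receiver stays deliberately trivial, which is what makes it realistic to never restart it; all logic that changes often lives in the handler.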



DevOps seems to be there, and yet it isn't



DevOps is a magical thing. It seems to be in place, and yet sometimes it turns out that it is not.



One of the developers deployed his work to the test environment. Even though DevOps practices were in use, it “suddenly” turned out that the test environment was connected to the production database. Part of the data was irretrievably lost.



We started to investigate. It turned out that the developer had not noticed that he was using the production connection string.



The root cause was that the developer had to change the connection string manually for different environments and servers.



What does kaizen say? That's right! We must come up with a solution that eliminates the problem completely, i.e. removes the need to change the string manually.



And such a solution was implemented: the connection string is now selected automatically, depending on the server the application is running on. The problem was resolved once and for all.
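
One possible shape of such a selection is sketched below. The host names, the connection strings, and the `connection_string` helper are hypothetical; the source says only that the string is chosen according to the server the code runs on.

```python
import socket
from typing import Optional

# Hypothetical host names and connection strings, for illustration only.
CONNECTION_STRINGS = {
    "prod-app-01": "Server=prod-db;Database=app",
    "test-app-01": "Server=test-db;Database=app_test",
}


def connection_string(hostname: Optional[str] = None) -> str:
    """Pick the connection string for the machine the code is running on."""
    host = hostname or socket.gethostname()
    try:
        return CONNECTION_STRINGS[host]
    except KeyError:
        # Fail loudly instead of silently falling back to some default database.
        raise RuntimeError(f"No connection string configured for host {host!r}") from None


print(connection_string("test-app-01"))
```

Refusing to start on an unknown host, rather than defaulting to any database, keeps a misconfigured machine from quietly connecting to production.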



I think these examples have already made it clear that the essence of continuous improvement can be expressed in one simple phrase: make it impossible for the problem to occur again.



Key results - an integral part of kaizen



The result of kaizen is an implementation, not an idea.



Coming up with a solution is not that hard; implementing it is much harder.



For each decision made, it is important to define key results, that is, to understand who needs to do what and by what date.



How do you understand that you have achieved a successful result?



Take the connection string example. What tangible result counts as success here? Success is achieved when:





Both steps must be completed by specific people by a specific date. Only when both are done can success be considered achieved.



Key results are success criteria; kaizen does not work without them. Success is implementation.



Only an implemented solution will keep the problem from coming back, so when you talk about kaizen, always keep in mind that everything you come up with has to be implemented.



Where else can this be applied



As you have probably seen from the examples, kaizen can and should be used in incident analysis. In fact, that is what it was created for.



Incidents in tech support, infrastructure, and product development are a perfect fit.



As for coding, you can analyze your own code and see how to change it so that the problems you find are removed for good.



And the well-known Agile retrospective is kaizen too: at these meetings the sprint's problems are analyzed, the “5 whys” questions are asked, and steps are taken to prevent the problems from recurring. Kaizen in its purest form!



The kaizen technique works well in software development. It is very easy to use and is also well suited to personal matters.



The secret to success is simple: eliminate problems one by one, and eventually none will remain. This is continuous improvement.



Toyota itself uses kaizen in production with overwhelming success. Its results speak for themselves: only 3 defects per 1,000,000 parts.



Why not apply it yourself?



Now you are officially leveled up. If you hear the term, you will know what it means.

Good luck in your work.


