Dances with support: types and forms of support. Support systems working in battle

Hello! My name is Alexander Afenov and I am the team leader of the Order Processing team at Lamoda. Today I want to tell you about how we rake support.



First, let's talk about how it is embedded in our processes and how in general we plan our work, sprints and iterations.



Then I’ll tell you where the support can come from and what types it is divided into.

I will share the experience of how we within the team deal with each type of support.



In the end, we consider the pros and cons of the practices we use and summarize.





My team now has two systems. The first is a big and scary thing called Order Processing . This is a system that automates the life cycle of an order from the creation process to delivery (or return).



The service is spinning on PHP 7, wrapped in Docker and orchestrated by Kubernetes, but at the same time it is implemented on the Zend1 framework and pieces of Symfony 2. Those who program on PHP now may have shuddered. For the rest, I will explain that Zend1 is a framework that had an end of life a year and a half ago. It is no longer supported and does not even have security patches.



The project is large (more than 150 thousand lines of code), and it does a lot of not its job. For example, not only processes orders, but for some reason sends mail, sms, pushes, transfers data to other systems. Therefore, we are cutting it into separate microservices.



The first thing we brought out of the monolith is the so-called Refund tool . He is the second service run by my team and is responsible for the automatic return of money to the client (more in the report of my colleague) .



Despite the fact that the Refund tool has a modern technological stack, it still generates a bunch of support due to the legacy of Order processing.



This happens due to the fact that we took a certain business process, which used to be built on a lot of excel files, and transferred it to a new system that works through Kafka, and also interacts with a couple of systems. Of course, when introducing a new system and changing the business process, we got support. And for many years of working with this, we have gained some experience.



I believe that people are divided into two categories: those with a production system that generates support, and damn liars. Therefore, I will share experiences that can be useful in optimizing your processes. If the proposed solutions (in combination or separately) suit you, then more time will appear for the development of the functionality of your system, analysis of the technical backlog, and not for working with support.



Talking about the tools used will require context, so first we will talk about the most basic.



image



Processes and roles



How do we work with support and what place does it take in our sprints?

Proportional buckets is what I call our buckets .

image



We take 10% from the technical backlog so that it does not stagnate and does not accumulate indefinitely.



Approximately 20% of the sprint is taken to lay down on some risks. For example, someone was doing a task, but this person was "hit by a bus." The next one will have to re-examine the context. As a result, we will not get into the assessment, and everything will be bad.



Next, we lay the planned support. That is, we already know that something is wrong. This is something not very burning, and we will repair it.



But the most interesting thing is an unplanned support. That is, we assume that during the iteration period something may break, and we will spend time on repairs.



The remaining 30% are projects.



You probably noticed that it turns out more than 100%. This is due to the fact that we always try to do more than we really can. Sometimes we succeed, sometimes not very.



Main support parameters



We give each support ticket an assessment according to the following parameters:



  1. Criticality for business. How important is this to them and how much does this break the business process?
  2. SLA How long should we take the problem to work and solve it?
  3. Priority


If something went wrong with the users of our systems, they report this to the support service and clarify that an incident is blocking partially or completely the business process. Support immediately brings a new ticket to the responsible system and puts it a priority .

Criticality and priority are different terms.

Priority Types



Blocker - something broke absolutely everything, stopped the business. Orders are not created, they do not go into delivery, payments are not accepted, and so on.



Major is something less important and can be repaired for a longer time, as there are workarounds, alternative paths.



Trivial For example, someone writes that our buttons are in an unpleasant color and should be repainted. There is a high probability that such a ticket will never be made.



There is also such a thing as a service level agreement , which is established by the support service together with the team and the business owner of the system. They look at which line of business has broken down as part of a specific complaint. If, for example, orders have ceased to be created (the main bread of the online store), then this problem will have a high priority, which we call P1. P is priority, unity is the most important.



P1 is a type of SLA, which means that we must take the problem to work within half an hour and solve it in a couple of hours at most.



P2 is something less significant that we should take to work within a couple of hours and decide during the day.



P3-P4 is something that has broken and does not require urgent repair. You can someday do it, take it to the next iteration.



And here we come to the priority set by the team. This can be a technical expert, senior, support engineer - anyone who deals with the problem.



Suppose we currently have 4 tasks with Major business priority. The person from the team, due to his expertise, puts a certain numerical value, which we call detailed priority . Based on it, the support board will be sorted in the future. That is, at the top there will be the highest priority tasks for the business, which are still sorted inside by the understanding of the team about how important this really is and how quickly you can do it.



Among the main parameters, it seems that one of the most important is missing - a normal description . Quite often, we have support tasks started from the Sentry system, where errors, exceptions, etc. fall. A person sees that there is some small problem, and creates a puzzle in Jira. Since our systems are integrated with each other, a task appears in the task tracker, in the description of which there is only a link to Sentry, and in the title there is an error text. All.



How should one who gets this task work with this? Not very clear. If you added a good description to this task, it would greatly help and save time.



Who will rake everything?



And when all this is done, the question arises: who will rake this beautifully sorted backlog? The answer is: support engineer.

You can listen in more detail about who the support engineer is and what he does in my talk “Technical Mortgage: What and Who Should Team Lead” with TeamLeadConf 2018.

A support engineer is a guy who takes and fixes the highest priority tasks from a support backlog. Since everything is beautifully sorted, we believe that at the top is the most important, urgent and “baking”. If there are no tasks, then he can do technical backlog.



What else is he doing?



1. Tries to isolate and eliminate root cause , that is, the root cause of the support. When regular tickets of the same type arrive to you regularly, it’s worth considering why this happens. Most likely, somewhere there is a problem that can be eliminated, and thereby stop the flow of similar tasks.



2. It sets the tasks for correction and monitoring .

If the support engineer cannot solve the problem in a day or two at most, then he sets up a separate task for it, which goes into the development backlog. Then it is evaluated by the team and gets into the iteration as a planned support.



Monitoring plays an important role for us. We hang monitoring not only on the metrics that we are used to monitoring on an ongoing basis, but also add them to localize the longest-running problems. In my opinion, it would be better if we had unnecessary monitoring, which we then drank, than the problem will be constantly repeated in the form of more and more tickets.



3. Looking for reasons for automation .

Example : we transfer data to our system, which automates the work of the delivery service. Sometimes it turns out that even with the use of the dead letter channel and forwarding, we cannot deliver information there. As a result, such orders hang somewhere, and they need to be resent.



This is a typical support that occurs several times every week. To solve this problem, we decided to make a separate page with the “resend order list” button. We no longer have this support. That is, they thought, automated, gave it to the support service.

The role of a support engineer is transferred every week to another person - this is a prerequisite. Doing such work longer is stress, demotivation, and decay, because you are constantly repairing something and don’t bring anything new to the system.


Regularity as a source of grace



It seems obvious, but it’s often forgotten about it anyway. In order for everything to work, it is necessary that our processes are put on stream and regularly observed.



Backlog Inspection



Where will we get a beautifully sorted support backlog if no one is looking there?



In a good way, you need to run into it once a month and close tasks with the trivial status (which you most likely will never get to). Be honest with yourself and the customer. If the backlog due to such tasks will grow infinitely, then later you will have to panic to try to close them. It's not very good.



Detailed priority affixing



This is the very process in which we evaluate how critical a task is. It will then be the correct sorting, and the support engineer will take the correct task from above.



Battle for priority



For example, they set a task for you and say: “Guys, the monthly report is not uploaded. We need to have it in a week, but it does not work. Please fix it. Priority P1. You need to decide within 2-3 hours. ”



And you ask: “Seriously? What are you guys talking about? After all, there is a week to fix it. Let's downgrade to P2, and we'll have a couple of days. ”



Sometimes people think that we will not take on the task, so they put a special high priority. But it happens and vice versa. For example, they write to us that orders are not created and put P2 priority. This problem is much more serious, therefore it is worth raising the priority to P1. It is useful to knowingly bargain in both directions.



Establishment of new tasks



Earlier, I mentioned the Sentry system, which includes tasks that are already being blazed by customers. However, we ourselves anticipate the problems that arise and throw tasks into this backlog ourselves.



SLA Performance Monitoring



To do this, we have schedules that show that we have tasks, the time for which will soon expire. It seems that these puzzles make sense in the first place.



Support Engineer Support



Being a support engineer is a rather depressing process, so a person should help. How can we make his life easier?



Transferring the role to the next of the team



We need to keep a schedule of who will do this next week. However, boundary moments do occur. For example, a person took a task on Friday and did not have time to finish it. He may spend time next week, but it's better to hand over the task to a new support engineer. If you drag on backlog raking for two weeks, the person will most likely be pretty demotivated. You will see this at the next personal meeting :)



Help find the source of the problem



People like to just rake tasks, but they do not focus on finding the very root cause. It is worth asking the question: “If you closed the task, then why did the problem initially arise?”. This practice will allow you to find the cause, eliminate it, and possibly get rid of the flow of such support in the future.



The need for a “fresh look”



If a person for a certain period of time could not achieve a visible result, then this task should be transferred to another. Someone else will be able to look at the problem from the other side, which may lead to the solution of the problem in another way.



But this approach can hide some interesting psychological aspects. That is, taking a task from one person and giving it to another, you risk saying that he knows better, so he will cope. Such things are best presented in a different way. Focus on the fact that we all need to solve problems with the system, and not prove to each other which of us is cooler.



Development of tools for automation



Those who are often support engineers understand that they already “bake” from performing the same typical tasks. Recently, one of our developers has got his own mini framework on Go. He goes to different databases, collects data, pushes something in Kafka. Thus, he was able to automate this task as much as possible and make life easier for others.



image

Support Sources



There are so many support that we sometimes don’t think about where it comes from so often?



Stabilization of new systems and processes



If you have brought something new, then most likely it will be used incorrectly. You will run into new problems for yourself, and your support backlog will immediately be replenished with tickets or tickets.



Support for older systems



For example, our monolith. He cannot stand still, as we always add, rewrite, something in him. Of course, this leads to the creation of a new support.



The technical error



For example, the network disconnected. You seem to be not to blame, but they will certainly come to you and ask why orders were not created. It will be necessary to repair, fix, modify something. Manual intervention will be required, and, therefore, new tickets in the backlog are provided.



Human factor



We had a case when someone was able to produce a message in RabbitMQ that our consumer hung up, and everything stopped working. This has never happened over the past 7 years, but here it somehow managed :)



The human factor that led to the failure



Someone with the words "I'll fix it now" pulled out the hard drive from the server on which billing was spinning. As a result, we got what we got. This is not the Lmoda experience, but a real case from my practice.



image



Support Types



Analytics Request



When they regularly ask about the status of something in the database, they ask to upload, collect a report for a certain period, and so on. This is a little annoying, so you have a good reason to think about automation and just provide a user interface or study the structure of the company.



For example, I did not immediately find out that most of the order data we have is stored in the Oracle database of the D&A department, and everything can be obtained from there. Such a support is either automated via interfaces or transferred to the analytics department.





Data Change Requests



Situations are different and unpredictable. Let's say our client was going to pay for his order with a card. When the courier arrived, he changed his mind and decided to do this in cash. Or, for example, somewhere an automated problem arose that needs to be changed by hand. We need to correct this data.



To do this, we try to make new API handles, make interfaces, and to the maximum drop these tasks from development and from our Operations team. This is a dangerous practice, and we get rid of it through the interface and API improvements.



Repair of business processes



If there is a direct need to edit something in the database, then there is a buggy business process. This could happen either because of an IT-related reason or something goes wrong in the business. Both there and there are adjustments required. In this case, you need to go to the business customer and discuss whether it can be done differently, or request development to repair the business process.





Feature X stopped working



This is my favorite type of support, because it is the most understandable. That is, we had some kind of thing in the prod, but it broke and needs to be fixed. Find out in which release died and for what reason. Repair and close the ticket. Everything is simple.



But there is another support - feature X does not work . It may look like the same thing, but said in other words. However, it is not.



In such a situation, they come to you and say that this thing does not work. You spend a day or two sorting it out. Only later did you realize that this never worked here . It simply was not in your system.



In another way, I call this type of support “fox” when someone tricky wants to slip a feature request under the guise of a support task. This is a regular story that is very painful. If you do not stop such moments, it turns out that your support engineer or you yourself are introducing new functionality, and the real problems from the support backlog remain unresolved.



image



Major incident



This is just the story of the coffin and the peat fire, when something broke in IT systems so badly that a specific business process arose.



Case study from our practice: we, due to a mistake in the code and imperfect auto tests, began to send a certain note about the status of the order to the external courier service, because of which people could not pick up their order at the pick-up point. It has affected thousands of customers. We had to take all orders back, spend money on it. We could not sell them, and customer loyalty was lost. This is a big incident that has hurt the business.



It is worth working with such things in a special way, and I will tell you how we do it.



How to find out that something is happening?



The most common option in the industry is to learn from users . He, of course, is the worst, because it means that they already “bake”. They see that nothing works for you, and they urgently need to be repaired.



Perhaps you will find out from the support service , which monitors the monitoring and notifies the person on duty on the system.



But the best part is to find out independently by the presence of monitoring . That is, you can use a system that tracks dynamics by metrics. For example, if something hung up in red on a suspended TV, then, say, Rabbit died.



It's cool, but it seems that this is not enough. Many problems can be caught even on the way, if you monitor certain trends . We had such a case when we noticed on the monitor that in the morning the memory consumption on the Rabbit MQ cluster began to grow. We understood that we have only 16 gigabytes there, and with such dynamics in a few hours the memory will end and everything will fall.



image



Since we saw such a trend, we got excited in time. It turned out that we had a shovel-plugin hanging, and memory was flowing. The problem was resolved, while major was averted.



If something has already happened, and you find out about it, you need to somehow localize and put out . Of course, depending on the size of the team and what it is doing, you can do different things. But I believe that mobilization is very important.



Suppose you have 5 people in a team. One of them was engaged in the analysis of why nothing is working now, while the remaining four continue to saw their features. As a result, you have broken systems and a bunch of new features. It's cool, but sometimes it's worth it to mobilize and arrange a brainstorm. Examination of each team member can reduce the time during which nothing works.



And after everything was put out, the retrospective stage begins.



image



Retrospective



At Lamoda, at this meeting, we have a person in charge of every system involved in the incident. If necessary, someone from the business is present, sometimes CTO even comes. And at this meeting, we make a description of what exactly happened and in which system it blazed.



Then comes the most fun moment - the analysis of the sequence of actions. That is, what we did between detection and extinguishing to the nearest minute, if possible.



image



We see that at 13.05 a complaint was received at the call center. By approximately 13.13 it became clear that it was massive. And only at 14.50 the team took the task to work. Before that, for some reason, 1.5 hours was simple, although the task was set, and the team was notified. It seems that this gap in 1.5 hours could be reduced and thereby save millions of rubles on this incident.



Why did this problem arise?



The guys just got up and went to dinner without taking laptops and phones, so they were not available. It seems that something is worth fixing in this moment. For example, take laptops with you, leave the duty officer in case there was a major release.



It is also seen that the fix was in production at 16.55, which is also strange, since before that more than 40 minutes have passed. When everything seems to be checked, you can roll, but we did not.



Why is this happening?



We have been running integration tests for a very long time, about half an hour. And in this situation it turns out that you need to make a decision. Or you roll out the corrected system and put out the problem, or maybe create a new one. Or you are waiting for the super-long CI / CD processes to scroll and the release will go into battle. Obviously, in our case, we need to reduce the time it takes to run all the tests so that we can run the tests faster and roll faster, when everything is checked.



The next point is the impact assessment . Impact is the impact of this incident on a business. Namely, how much the company lost in orders, in rubles, in parrots - it does not matter. This is the figure with which you can then come to business and show what happens if IT does not give time for stabilization, test automation and the like.



Then the wording of preventive actions . This is important because you need to somehow guarantee yourself, the management and anyone else, that this same thing will not shoot tonight, tomorrow and so on. That is, you need not to repeat the same incident. To do this, we formulate preventive actions. That is, how you will achieve this, how to protect yourself from it or anticipate it.



It is also necessary to put down deadlines / fix versions .



And then check the execution. Plans are good, but avoiding the problem is much more important.



In my opinion, it is also important to remember that the usual support is no worse than Major Incident. You can take a serious bug with which you fussed a day or more, walk along the same steppes and understand why it happened, why it took so long to decide how to avoid it in the future. Using the same pattern, you can more effectively rake an ordinary support, which does not apply to Major Incident.



Underwater Pros and Cons



The presence of support quite understandably affects the workflow . It distracts and infuriates, because, as a rule, everyone wants to cut new features. However, instead, you have to rake in endless tons of support, these Augean stables.



On the other hand, when a person is engaged in regular support, he manages to feel the most different parts of the system, because support is structured by criticality, but not by business processes. This allows you to quickly increase expertise in the system .



Where to get it all the time?



The initial answer is this - it is not clear where. And this is true, because it is very dependent on each particular business.



Lamoda was organized by several managing directors, one of whom oversaw IT. He was able to explain to other managing directors responsible for other parts of the business that if we create a service provider and an online store with our own automation and internal development, it is important to learn how to negotiate much in advance about how the business interacts with the IT department. He also managed to convey that he should not try to fill the sprints with new features for 80%, leaving no time for stabilization, support or raking up problems. He was able to do this and achieve a result.



However, I understand that not everyone has this far. That’s why I believe that if you apply a retrospective approach to serious support, you can show the business how much money goes into the pipe simply because IT does not have enough resources and time to solve internal problems and raking support.



image



Sweet spot



I believe that if you apply these methods, it will become easier to anticipate the appearance of problems and eliminate their root causes, and in general, stop stepping on the same rake, then achieve a very interesting state. This is a condition in which every new type of support will be new to you. It may sound ironic, funny or whatever else, but it’s really worth it because you will understand that you don’t solve the same problem over and over again. And I hope you succeed.



All Articles