We all love to talk about our successes and rarely about our failures. But the experience gained from mistakes is often worth more than the profit from a successfully finished job. So today I want to talk about exactly such cases. Let's go...
Pot, do not cook!
This story happened a few years ago, at the very beginning of my career as a 1C developer.
Our company took on a project to optimize a very heavily loaded information base for a very large and respected client. The client's security service was so paranoid that remote access to the servers from outside did not exist and could not exist. To connect our office directly to the bases, a separate local network with a hardware VPN was set up, and workstations with strictly agreed software were installed. Without local administrator rights, of course.
Like any project of this kind, it began with data collection. The plan was to collect various metrics for a month and only then start optimizing the information base itself. How much time it took to set up the MCC in that bureaucratic environment is material for a separate story. But at some point it happened: the MCC was configured and started. After that, the expert who led the project (Dima, hi!) got into his luxurious car and set off to travel around our vast country, and then further, to neighboring countries. At that time I still knew little and could do little, but I was already considered a fairly responsible developer. So before leaving, Dmitry entrusted me with a very important and serious task: twice a day, at peak load, I had to go to that secret computer, start the measurements in the MCC, and turn them off an hour later. The instructions were extremely simple and clear:
- Look, press this green "play" button here, the charts start running, wait an hour, then press this "stop" button. That's all.
What could be easier, right? Did I spend 5 years at university for nothing?
All week I strictly performed this ritual, morning and evening. And everything was fine until the last day. On Friday after lunch I started the data collection as usual, and then... Well, you know how it goes. Friday evening: urgent things to finish, tasks to close, then after work take my wife to my mother-in-law's, stop by one store on the way, then another, and so on. In short, I left work having completely forgotten about that ill-fated MCC.
Saturday morning began with a phone call. ALL the client's 1C bases were down. Achtung and disaster! Our expert was somewhere between Dzheyrakh and Pasanauri, out of network coverage. The client's chief admin was also at some country house and unreachable. We tried to work out the cause over the phone. Somehow it emerged that the disk had run out of space, so the 1C server agent service had stopped. At this point I began to suspect something...
As you remember, there was no remote access. The computer was not only isolated from the Internet but also outside our local network. Nothing for it: I headed to work. While I was getting ready and driving, the admins figured out that all the space had been eaten by the MCC logs and did what seemed most reasonable to them: they killed it through the Task Manager. Next step. You can't just delete the logs from the disk, because we would lose the measurement data. Somehow they found enough space on a network share and copied the files there. Work seemed to resume.
Sunday morning began with a phone call. ALL the client's 1C bases were down. Achtung and disaster, take two! The cause of the panic was the same: the disk space had run out. But how? Hadn't the MCC been turned off? I rushed to work again and moved the logs off to free up space. And they just kept growing and growing, damn them! On pain of the most perverse executions, the administrators forbade me to start or configure anything at all. For the rest of Sunday I sat at the computer and copied logs to the share so that the bases would not stop again.
Only late at night did Dima get in touch and say that all I had to do was delete one small file on the 1C server. A couple of weeks later I read about that file in one well-known "desk" book, but that day the exhausted, tortured man simply went home to sleep.
On Monday morning our account was blocked until Dmitry returned from vacation, and about me it was said quite clearly: "We never want to see HIM here again!"
That is how my first optimization project ended.
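In hindsight, the Friday part of this story comes down to a collector started by hand with nothing to stop it except my memory. Below is a minimal sketch in Python of the kind of guard that would have helped, assuming the collector can be started and stopped from a script. The names `run_collection`, `should_stop`, and `free_bytes_on`, the thresholds, and the `start`/`stop` callables are all hypothetical; none of this is an MCC feature. It also would not have addressed the logs that kept growing on Sunday after the MCC itself was killed.

```python
import shutil
import time


def free_bytes_on(path):
    """Return the free bytes on the volume that holds `path`."""
    return shutil.disk_usage(path).free


def should_stop(elapsed_s, free_bytes, max_duration_s, min_free_bytes):
    """Stop when the time budget is spent or the disk is nearly full."""
    return elapsed_s >= max_duration_s or free_bytes < min_free_bytes


def run_collection(start, stop, log_dir, max_duration_s=3600,
                   min_free_bytes=10 * 1024**3, poll_s=30):
    """Hypothetical wrapper: `start` and `stop` are caller-supplied callables
    that begin and end the measurement; they are assumptions, not a real API."""
    start()
    began = time.monotonic()
    try:
        while not should_stop(time.monotonic() - began,
                              free_bytes_on(log_dir),
                              max_duration_s, min_free_bytes):
            time.sleep(poll_s)
    finally:
        stop()  # runs even if the polling loop raises
```

The point of the design is that the stop condition no longer depends on a person coming back to press a button: the measurement ends after the agreed hour, or earlier if free space drops below the chosen floor.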
Twice in the same crater
A large holding. 18 information bases with identical configurations, located all over the country. The update takes place once a week and follows the same ritual: prepare the delivery file in advance, upload it to the cloud, make sure it has downloaded in all branches (even in 2018, the Internet in some regions is slower than a typical 1C:ERP), check that backups were created everywhere (this is not formally our responsibility, but bitter experience taught us to play it safe), then run the update script manually at each branch and make sure it finished without errors. Often it turns out at the last moment that one more task has to be included in the delivery, even a minor fix, because the next update is only in a week.
So it was that time. An experienced developer with many years behind him, in a hurry, made a mistake in one line while transferring a task to the production environment. The error turned out to be critical and was discovered only after all the databases had been updated.
Well, what to do? The developer quickly fixes the code. He doesn't let anyone test it:
- Come on, it's trivial... I can't make a mistake twice in the same line, can I?
An hour later, 18 branches were updated for the third time.
Developer who could
A story told by a colleague over Skype.
[Colleague]: Once upon a time there was a "Developer who could!". He had a development work order. He wanted to open the test base but missed and opened production...
[Colleague]: But could this stop the "Developer who could"? No!
[I]: But didn't he realize during the update that people were working in there? )))
[Colleague]: Moreover, he sees that the configuration is on vendor support... But do you think that could stop the "Developer who could"? No!
[Colleague]: He removes the configuration from support (!) and hacks in his modification, bypassing all the repositories...
[I]: Come on! Finish the story. A dynamic update? )))
[Colleague]: He updates... The system says: "There are 18 active sessions in the database!". But could this stop the "Developer who could"? No, and no again!
[Colleague]: He updates the database and hands the task over for testing...
[Colleague]: The consultant cannot find the work order... and only then, after a long while, does he realize that he missed.
[Colleague]: I had to scold him...
[Colleague]: I'm calling him... and I'm laughing into the phone...
[Colleague]: I just don't understand... HOW???
Transport collapse
A story told by a colleague and written down from his words.
It happened in a large logistics company. Most business processes are concentrated in one information base. As of 2012 it had about 3,000 concurrent users from all regions of the country.
I was given a simple task. For it, I created my own information register, to which data is written when certain documents are posted. Although there are few document types, the number of these documents per day is enormous. In theory, the write operation I added should not have loaded the system heavily. But the implementation had one nuance: when the record set was written, the "Overwrite" property was set to "False". That is, each document posting appended entries to the register. The task required this, and it had practically no effect on performance, because the selection conditions always matched 1-10 entries.
Functional testing was successful. We posted several dozen documents, made sure the register entries were correct, noticed nothing suspicious, and sent the change to production.
On that unfortunate Friday morning we updated the production base, and users got to work. 3,000 people cheerfully posted documents, and the register began to fill with data. After checking that everything was going well, a few hours later we went home with a calm soul (we work in a different time zone from the main users of the information base).
It should be noted that the servers the information base was running on were among the most powerful in Russia used for 1C. But after a few hours "something went wrong" (c).
Users began to notice a drop in system performance. All operations slowed down. Responses to any action took longer and longer. The load on the hardware grew steadily. While the IT department was figuring out what was happening, work in the system almost stopped. The hardware could not cope; the disk queues were longer than those in Russian post offices. Had the hardware been weaker, the problem would have been detected almost immediately. But the most powerful servers heroically resisted my crooked hands for half a day.
According to MSSQL, the heaviest query had suddenly become a read query against my register, although I was not doing any reads. The problem was quickly found in the 1C code: I had forgotten to set the selection on the record set. Had the "Overwrite" property been set to "True", I would have found the error immediately, because each write would have cleared the entire register. But in our case that did not happen. With a few dozen documents we, of course, noticed no performance loss. But once the register began to fill with tens and hundreds of thousands of rows, the system had to check the entire register for matching records on every write.
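To make the scaling visible, here is a toy model in Python. It is not 1C code and does not reproduce the platform's actual algorithm; it only assumes, as described above, that an append without a selection has to look at every existing row, while an append with a selection looks only at the rows matching it. The class `ToyRegister` and the function `simulate` are hypothetical names invented for this sketch.

```python
from collections import defaultdict


class ToyRegister:
    """Toy stand-in for an information register; counts rows examined."""

    def __init__(self):
        self.rows_by_doc = defaultdict(list)  # rows grouped by selection key
        self.total_rows = 0
        self.rows_examined = 0

    def append(self, doc_id, entries, use_selection):
        if use_selection:
            # Selection set: only rows of this document are checked.
            self.rows_examined += len(self.rows_by_doc[doc_id])
        else:
            # Selection forgotten: every existing row is checked.
            self.rows_examined += self.total_rows
        self.rows_by_doc[doc_id].extend(entries)
        self.total_rows += len(entries)


def simulate(documents, entries_per_doc, use_selection):
    """Post `documents` distinct documents and return total rows examined."""
    reg = ToyRegister()
    for doc_id in range(documents):
        entries = [(doc_id, i) for i in range(entries_per_doc)]
        reg.append(doc_id, entries, use_selection)
    return reg.rows_examined


if __name__ == "__main__":
    for n in (10, 1_000, 10_000):
        print(n, simulate(n, 5, False), simulate(n, 5, True))
```

In this model, posting 10 documents with 5 entries each examines 225 rows without a selection, which nobody would notice in a functional test. The count grows roughly with the square of the number of documents, which is why the problem surfaced only under production volume.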
By that time, according to some users, a transport collapse had already occurred, because trucks were not receiving documents from 1C and could not leave the unloading points.
So, by "just" forgetting to put a selection on the record set, I brought down one of the largest 1C databases in Russia.