In 2005, a substation blew in Moscow, and everyone in the country suddenly realized that backup server rooms had to be built. Banks built them in 2006-2007, retail starting in 2008. Why so late? Because IT costs didn't seem as important then as they do now. I recall how a customer of mine was nearly fired over a two-day downtime across the whole network, on top of other retail peculiarities. It turns out that many people in IT no longer believe it was ever like that.
And yet it was. Here is the first story from the "come on, that doesn't happen" series: about a store whose Internet access was backed up by an analog modem. At the time there was only one provider with fiber at that location, and the alternatives were a satellite dish or dial-up. A satellite ground station is expensive, so they chose a modem on a voice line. At some point an excavator rips up the Golden Telecom trunk. They happened to be relocating their fiber line just then, so there was temporarily no redundancy. For three days the store runs on the modem: 48 kilobits per second during the day, 96 kilobits per second at night. As you can tell from the speeds, it was a very good Zyxel; it even held the connection on a noisy line.
I have been working at CROC for 15 years; I am now Director of Business Development in Retail. Historically, most of my projects have also been in retail. With the customers' consent, I am publishing stories, from me and my colleagues, about what IT in retail looked like 12 years ago. The security people have changed some unimportant details so as not to offend specific people.
The Internet was needed to send reports (in case of an outage it was immediately decided these could wait, not a priority) and to update prices across the entire product database of a rather large store. The transaction there is long, and if it breaks on a timeout, you start all over. On the first day we tried to push the prices through, but nothing came of it. One team worked on the IT side, while a second dictated the changes over the phone and recorded them by hand: several people with printouts in the office and in the store. A little later they wrote a script for importing XLS files and started emailing chunks of the database to the home computer of one of the admins, who drove them to the store by car. In the contract with the provider, by the way, an outage that was not the operator's fault meant the first three days without penalties, and after that 1/724 of the annual fee per hour. When asked about compensating the losses, they waved it off: "Well, next month the service is free!"
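To give a feel for that workaround, here is a minimal latter-day sketch of such an import script - not the original one. I am assuming the delivered chunk is an .xls file with the item code in the first column and the new price in the second; apply_price() is a hypothetical stand-in for whatever wrote to the store's local database, and the xlrd library is used purely for illustration.

```python
# Hypothetical sketch of the "import a price chunk from XLS" idea; the file
# layout and apply_price() are assumptions, not the customer's real code.
import xlrd  # classic reader for legacy .xls files


def apply_price(item_code: str, price: float) -> None:
    # Stand-in for a short write to the store's local back-office database.
    print(f"UPDATE prices SET price = {price} WHERE item_code = '{item_code}'")


book = xlrd.open_workbook("prices_chunk.xls")
sheet = book.sheet_by_index(0)
for row in range(1, sheet.nrows):  # row 0 is assumed to be a header
    code = str(sheet.cell_value(row, 0))
    price = float(sheet.cell_value(row, 1))
    apply_price(code, price)
```

The point, as I understand it, is that the data travels as a file and is applied locally row by row, so nothing depends on keeping a flaky dial-up session alive for the hours the full price transaction would have needed.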
At the time, the board of directors looked at papers describing possible accidents and deemed them unlikely. Few believed that IT outages were a systemic phenomenon rather than one-off incidents tied to specific names. Only a few cases of direct losses convinced anyone to listen to these strange people in glasses.
The second unbelievable story is about a mistake in IT. A large retailer starts trying to understand the secret of one partner's success. The thing is, everyone runs on the same model, yet his business clearly follows a different path. Everything comes quickly and easily to him, and he has scaled his business roughly 20 times over the past eight years; in that market, that was about ten times more than logic would suggest. So they start parsing the transactions, and it emerges that the partner's card in SAP contains defaults entered ten years earlier for a test client, with slightly different coefficients for plans and rewards. As a result, the man earns about twice as much as the other partners. The retailer ran an internal investigation and found that it was not a conspiracy but a banal accident. And on that accident a man had built a big business over ten years.
At one point, a store director stole. More precisely, he manually adjusted cash register data. Back then that was possible, and manual input took priority over automatically generated data. For three months he was on probation in the position and systematically created, at the end of each day, a small delta between the report and the money actually taken at the tills. On the 28th day of the last month of his probation he handed in his resignation (unexpectedly for the office), and by procedure there were only three days left for the store handover. Naturally, the inventory was done without him. Losses were estimated at 700 thousand rubles net (the price of a fully loaded Hyundai Sonata at the time), and it would have been much worse if the tax office had inspected the store at that moment. In the end they gathered evidence against him and went to court. The judge says:
- Bring us a floppy disk with the system, we'll have a look.
- I can't. Our SAP installation is 32 terabytes. That's a trainload of floppy disks.
- Come on, I'll put our specialists on the line. We have good specialists. Do you have 1C in there?
- Uh... You'd better come to us.
- No, we will not go to you.
The court could not collect the forensic evidence, and the case was closed for lack of proof.
In later years at CROC we had cases where forensics had to be collected after attacks; by then we came with special utilities and took terabyte-sized images, which are very convenient to study afterwards in a virtual environment, knowing that nothing in the source data will change. Back then it sounded like science fiction. Even now, far from everyone running security systems uses proper forensics that continuously writes such an image and additionally captures data from every packet on the network.
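For illustration only, here is a minimal sketch of the idea behind such an image - my own toy version, not CROC's actual tooling: copy a block device into a raw file and record a hash, so the copy can later be examined in a virtual environment while provably staying unchanged. The device path and file name are hypothetical.

```python
# Toy forensic imaging sketch: raw copy of a block device plus an integrity hash.
# /dev/sdb and the output name are illustrative assumptions (reading it needs root).
import hashlib

SOURCE = "/dev/sdb"       # hypothetical device under investigation
TARGET = "evidence.img"   # raw image to be studied later in a VM
CHUNK = 4 * 1024 * 1024   # read in 4 MiB chunks

sha = hashlib.sha256()
with open(SOURCE, "rb") as src, open(TARGET, "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
        sha.update(block)

print("image written, sha256 =", sha.hexdigest())
```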
Fiber reaches the warehouse through a neighboring building; from that building a radio bridge carries it to the customer's site. The warehouse runs around the clock, but at night the radio-relay bridge drops. The backup channel is a satellite terminal right on the roof of the warehouse. The speed is good, 25 Mbps - space-age for those years. But a satellite means high network latency, at least 0.8 seconds for the signal to travel to the satellite and back. Part of that is the speed of light, part is a pile of transformations in demodulators running protocols with a high redundancy coefficient. The upshot is that the warehouse equipment starts working with delays. At 0.1 seconds everything runs like clockwork; at 0.8, very long packet exchanges begin. A warehouse worker points a scanner at a pallet, and then the handshake starts: readiness, protocol negotiation, sending the packet, the acceptance report, and so on. Protocols that do almost all of this with a single outgoing packet appeared only years later; back then a whole pile of data had to be exchanged with the remote system. So the work genuinely slows to a crawl, because every operation takes ages. A queue of trucks builds up, and by morning everyone is run ragged. And so it went every night. The cause was simple: in the evenings a switch at the neighboring warehouse was flipped off, and the fiber operator's equipment there went down with it. Moving the channel was impossible: the cable route had been approved by the owner of the site, who wasn't even in Russia. The warehouse staff worked like that for several months until they got themselves another operator - also a radio bridge, from a building outside the territory - and then laid fiber properly.
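A back-of-the-envelope illustration of why the chatty protocol was so painful over satellite (the number of request/response exchanges per scan is my assumption; the 0.1 s and 0.8 s latencies are from the story):

```python
# Rough model: every request/response exchange pays the full round-trip latency.
def scan_time(exchanges: int, rtt_s: float, processing_s: float = 0.05) -> float:
    """Approximate time for one barcode scan on a chatty request/response protocol."""
    return exchanges * rtt_s + processing_s

CHATTY = 6    # assumed: handshake, readiness, negotiation, data, ack, report
BATCHED = 1   # later-generation protocols: one outgoing packet does almost everything

for name, rtt in [("terrestrial link", 0.1), ("satellite link", 0.8)]:
    print(f"{name}: chatty ~{scan_time(CHATTY, rtt):.1f} s/scan, "
          f"single-packet ~{scan_time(BATCHED, rtt):.1f} s/scan")

# Roughly: 0.7 s vs 0.2 s per scan on the terrestrial link, but 4.9 s vs 0.9 s over
# satellite - which is how a night's worth of pallets becomes a queue of trucks by morning.
```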
A fashion-retail warehouse, that is, a place with a huge pile of clothes in boxes. Motorola Wi-Fi access points were burning out there at a wild rate. Sometimes the firmware simply got wiped, and sometimes they genuinely burned out. They were reflashed from a flash drive, so the warehouse kept a dedicated person with a ladder who climbed up to them. Over a few months there were about 60 firmware-wipe incidents and a dozen burned-out points. They blamed the manufacturer and a bad batch, but a thorough investigation showed the cause was static discharge. Dust from synthetic fabrics accumulated until it formed a conductive path around an access point. A bit of vibration, and one end of it got inside while the other touched the power line somewhere. If a static discharge merely wiped the firmware, the point was taken down for maintenance; if it was less lucky, it got hit harder. They grounded the access points, and that solved the problem for six months. Then the points started burning again: they looked, and the grounding had pulled away from the building. It turned out the temperature swings under the warehouse roof were such that constant expansion and contraction worked everything loose until the rigid coupling gave way.
Frequencies for radio bridges should, strictly speaking, be coordinated, but back then almost nobody did that. And the people really were wild. An example: residents repeatedly knocked down one of the satellite dishes at a store within the city limits. Because "it radiates." Yes, it radiates. But upward; a meter below it you can live for years. You catch most of the radiation in the focus only when you climb onto the roof, stand right at the peak of the antenna pattern, and whack the dish with a stick.
Speaking of wild people, I remember a great story about terminals in a warehouse. At 3,500 dollars apiece, some of the first full-fledged mobile terminals brought into Russia ended up at one of the warehouses. Essentially they were military-grade smartphones; you could do any warehouse operation with them. And warehouse workers kept bringing them in with damaged screens: "Chief, it doesn't work!" Now, the retail IT team understood they were being damaged somehow, most likely deliberately, so as not to work. What nobody understood was how they did it, given the corundum screen, and, more importantly, why, since they were on piecework pay and were cutting their own earnings. Video surveillance was just being rolled out across retail at the time, and the IT people decided to speed that project up a little in the warehouse. There were both dummy and real cameras there; they swapped some of them around and watched what happened. The first scene: two guys sitting on a forklift's forks trying to grind a screen down with emery cloth. "Bet you can't, you don't even know how to use sandpaper" - the motivation was roughly that; it was how they hazed new workers. It turned out the real cause of the damage was sharpened coins, which managed to do at least some harm to the devices. One terminal was lost completely, though: a forklift simply set a load of several hundred kilograms on it. It was the only point of support and got flattened, but it kept transmitting data - only the peripherals suffered. By the way, those Honeywells worked in the warehouse without losses while two generations of Motorolas were replaced next to them.
There were many stories about DECT communications in warehouses and stores. First, there was almost no encryption (more precisely, there were various spectrum inversions and occasional frequency hops), so anyone in the neighborhood could listen in on the store's conversations - as could the store's neighbors, given a little desire. Second, systems interfered with each other beautifully: two different vendors could hammer each other's frequencies, so you needed a homogeneous infrastructure right down to the handsets. The upshot was that the retailer started building communications in all its stores on walkie-talkies, and then a couple of years later cell phones and decent coverage arrived.
Around the same time there was talk of introducing data collection terminals over newfangled Wi-Fi in the stores, and that required wide backup channels. Fortunately, Skylink appeared at just that moment: a broadband network (CDMA-800) that depended on the state of the atmosphere far less than a satellite. We used it when failing over to the backup: the phone terminals were hoisted out of the windows on fishing rods to get better reception.
We now do a lot of retail consulting at CROC, but back then the practice was not very common in Russia. There is one wonderful story about this. About ten years ago my esteemed colleagues developed a business model that could have killed the then-only online grocery operator. It was clear that many processes could be improved by properly organizing logistics and IT, and that this directly and substantially cut the cost of an order. Yandex.Food now uses something similar, but ten years ago it all looked very, very strange. We headed toward a pilot, even secured a plot of land, but after certain changes on the board the project was buried. We decided to sell the model: we wrote it up as a hefty project and went to England. In parallel we offered it to that very only online operator of ours. Their managers set up a meeting and asked to be shown various parts of the processes. It sounded like negotiating with Yandex: "we'll have a look, and if we realize we won't do it ourselves, then maybe we'll buy it." But the story isn't about that; it's about the fact that during the negotiations we ended up visiting their warehouse. Several floors, elevators along the sides of the building, people running back and forth: a picker takes a cart and drives around the floors assembling an order. The elevators are the bottleneck, with huge queues in front of them. And it had worked like that for years. We looked at it for five minutes and then suggested at the meeting:
- Starting today, make all the left elevators up-only and all the right elevators down-only. That alone will greatly simplify loading and unloading.
- Then physically separate carts and people: let only carts with orders ride the elevators, and have people already waiting for them on each floor.
- Once you feel the effect, here, buy this tome that explains how to do the rest of it properly, not just these basics of algorithmics.
In short: they had worked like that for years, and then some jerks showed up and pointed out the obvious. I still feel a pang of secondhand embarrassment over it. They never bought the tome, by the way, and their warehouse these days is new and good.
I had heard many stories about how you can confuse test with prod. It seemed far-fetched up to a certain point. So: on the night of December 31 to January 2 (judging by the state of the country, that is one continuous night), there was the usual retail maintenance and updating. Among other things, they restored a new test installation of SAP from a three-month-old backup and deployed it into a separate test segment, to hand over to the developers for testing. The segment is isolated. On January 2, calls start coming into support:
- Why are the prices from a quarter ago?
- Those promotions don't even exist anymore!
They call the specialists:
- Hi guys! You deployed the installation in the closed segment, right?
- Yes.
- Are you sure?
- Yes, we swear.
- And check again, please.
- ...Yes, in the closed one. Although, wait...
The only bridge between the closed test segment and prod was in the monitoring system, which was shared between them. And SAP, even in that release, turned out to be a clever thing: it first knocked on its own addresses, and when it got no answer, it knocked on everything reachable. As a result it got through to the cash register subsystem via monitoring. And decided it was time to update it. The test database was about four months old, so SAP pushed its old product prices out. It may have struggled, but it managed. Then prod pushed the new prices back. The test instance, apparently offended, pushed its own again, with some difficulty. And so it went, back and forth, several times.
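The lesson here is easy to automate. Below is a minimal sketch - my own illustration, not the retailer's actual procedure - of the check that would have caught this: from inside the supposedly isolated test segment, try to open TCP connections to known prod endpoints and fail loudly if any of them answer. The addresses and ports are hypothetical.

```python
# Hypothetical isolation check to run from inside the test segment before go-live.
import socket

# Prod endpoints that must NOT be reachable from the test segment (illustrative values).
PROD_ENDPOINTS = [("10.0.1.10", 3200), ("10.0.1.20", 8080)]

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

leaks = [(host, port) for host, port in PROD_ENDPOINTS if reachable(host, port)]
if leaks:
    raise SystemExit(f"Test segment is NOT isolated; reachable prod endpoints: {leaks}")
print("No prod endpoints reachable from the test segment.")
```

In this particular story the list would have had to include the shared monitoring hosts - exactly the kind of "nobody thought of that path" that such a checklist exists to catch.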
In those years, of course, SAP was a rarity; having it meant the retail chain was looking very far into the future. Everyone ran 1C, customized it on the knee and then customized it again. Among POS solutions, several dozen (!) large offline products were competing - domestic and Western, proprietary and free. They promised a bit of everything, and every one of them still needed finishing. Among HR systems, SAP, Boss-Kadrovik and 1C competed. Everyone was afraid of 1C back then: it was raw. The second was a more or less well-mapped minefield with known bugs, and the third seemed better but was uncharted and foreign, and could blow up anywhere. Even the architectures themselves competed - centralized versus decentralized.

A colleague of mine worked at a hard discounter, which went straight for a centralized system. To minimize the amount of equipment in a store, besides the cash registers they installed only one module (a computer and a router) - and that was it. A spare computer sat under the table. A replacement router, when needed, came from the warehouse along with the milk, since deliveries arrived every day. No admin was needed on site: someone like the director or a category manager knew how to swap the module and power it on. Minor network glitches people often fixed themselves on the spot; computer literacy was quite good. The IT department had fewer than 100 people for the entire network of nearly 700 stores and 3,500 cash registers, including developers, consultants, admins, clerks and IT finance staff. The SAP practice was 11 people (for comparison, the country's largest retailer at the time had about 2,800 stores and more than 2,000 IT specialists). They were the first in retail to introduce parallel reporting on in-memory and OLAP technologies - practically rocket science in those days, though it produced a rather non-standard configuration. The migration from SAP 4.6 to 4.7 took six months, after which the guys decided to stop at 4.7 and not upgrade further. They were the first in the country to refuse vendor support. Nobody had done that before, and our colleagues on the vendor side didn't know how to react: at first they even kept giving access to all SAP notes (which is what you normally pay for, the core of support), and only later did others in that situation start getting access just to critical and security notes.
Now about disaster recovery. If in 2005 it was a substation dying, then in 2010 peat bogs were burning around the city. They burned because of abnormal heat. For 56 days the air conditioners in the office of a large retailer ran overloaded, that is, they couldn't keep up. It started like this: at first the temperature around the city simply rose. All air conditioners, including the backups, were brought into service - more or less enough. Four days later it became clear this would be a long story, and our retail customers began bracing for a blackout. The electricians ran a preventive test of the diesel generator; it went haywire and died.
Its battery had died. They somehow repaired it, reconnected it and made sure the diesel could start and run. Meanwhile the administration of the building, where the company sat as a tenant, set an electricity quota for them: the incoming feed had been sized with only a small margin over typical consumption, and with the air conditioners there wasn't enough power for everyone. So the guys brought in a diesel generator in a container, put it next to the building and switched it on; every morning fuel trucks came to top it up. The data center temperature had to be measured, and back then there was no monitoring: they hung up a thermometer, and once an hour the on-duty support engineer walked down from the seventh floor to the first to write down its readings. And the stairwell lights were off in the evenings and at night - everything non-critical in the building had been switched off. When the first support engineer took a tumble in the dark and nearly ended up in the hospital, money for automatic data center monitoring was allocated immediately. Then it got even worse: the quota was cut further, to the point where some of the office air conditioners were turned off and fans were brought in. It was a deeply unpopular measure, and nobody understood why people were being tormented like that. But otherwise the data center would have gone down - and the architecture was centralized. In the end the guys pulled through and learned a lot of DRP lessons, but they started building a fully independent site only a year later.
And finally, a tale about justice, which I wish for all readers. A quote:
So, one day the chain became the first retailer to launch "take 2 for the price of 1" and "take 3 for the price of 2" promotions at the federal level. This was in the first week of July, because demand drops in summer and people leave Moscow; a Saturday in July is the biggest dip of the week. Marketing decided they needed to sell something on promotion to get people into the stores. For example, packs of eggs at half price: people come for them and, sure enough, leave with full bags. We set everything up, double-checked it, launched. Friday evening I'm sitting in a restaurant with friends. A call:
- The disks have failed! SAP is down! Aaaaaaaah!
How it all ended left me mildly stunned. First, nobody got fired, although that is the usual way a CIO ends up changing jobs. Second, DRP got the green light. And we were even told we had "saved" everyone. Saved how? It emerged later, after the board meeting, that marketing had miscalculated: had the promotion actually launched, the loss would have been bigger than the loss from the downtime. But the most amazing thing is that nobody blamed the IT department for the problem - not a single word. Because we had warned them. With calculations and charts of what would happen and what needed to be done yesterday. Regularly.
This is probably the only case I know of where, in a situation like that, nobody went looking for a scapegoat in IT.
More tales