When designing IoT systems such as car sharing, it is essential to plan for failures. Otherwise, you end up with an overloaded technical support team and unhappy customers.
The “Skolkovskaya” parking lot
Failures happen everywhere. But in the world of the “Internet of Things” they are a permanent condition. When you work with mobile networks and hardware, failures occur much more often than in web or mobile development.
Reliability suffers for various reasons, including tight deadlines and limited development budgets. But IoT can work if failures are accepted and dealt with rather than denied. If you are interested, there is a good article on methods for increasing fault tolerance, but let's get to the point.
You unlock a car from your phone. What could go wrong?
The exact steps depend on the system architecture, but the scenario generally looks like this:
You tap the “Book Now” button. A command is sent to the server. It may or may not arrive.
You tap the “Open Car” button. A command is sent to the server. It may or may not arrive. The server sends a command to the car. It may or may not arrive. The on-board device tries to execute the command. It may or may not succeed.
You tap the “Start Trip” button. A command is sent to the server. It may or may not arrive. The server sends a command to the car. It may or may not arrive. The on-board device tries to execute the command. It may or may not succeed.
Yes, this is a simplification, and in reality there are far more potential problems, but for now we will consider only these.
Suppose every command arrived and every actuator worked: success! You can demo it to investors.
Something went wrong
But what happens if, for example, the “Open Doors” command does not reach the car?
First, the server has to find out about it. To keep the server's picture of the car in sync with the car's real state, commands are usually acknowledged (ACK) on delivery, and confirmed a second time once they have been executed. After all, “the command was not delivered” and “the command was not executed” are different events that call for different fixes.
Second, if the problem cannot be solved (for example, by resending the command), the error must be reported to the user, and the rental must not be moved into the “trip” state.
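A minimal Python sketch of the two confirmations described above. The `transmit` function, the `Outcome` names and the retry count are assumptions made for illustration, not a description of any real car-sharing backend. The sketch also assumes that commands are safe to repeat, since a lost ACK can mean the command did arrive.

```python
from enum import Enum, auto
from typing import Callable, Optional


class Outcome(Enum):
    EXECUTED = auto()        # the device confirmed that the actuator worked
    NOT_DELIVERED = auto()   # no delivery ACK after all attempts
    NOT_EXECUTED = auto()    # delivered, but the device reported a failure


def send_command(
    transmit: Callable[[str], Optional[bool]],
    command: str,
    max_attempts: int = 3,
) -> Outcome:
    """Send a command, retrying only when delivery was not acknowledged.

    `transmit` is a hypothetical transport function that returns:
      None  -> no ACK arrived before the timeout
      True  -> ACK received and the device confirmed execution
      False -> ACK received, but the device reported that the action failed
    """
    for _ in range(max_attempts):
        result = transmit(command)
        if result is None:
            continue  # not delivered: resending is a reasonable next step
        # Delivered: resending will not help if the actuator itself failed.
        return Outcome.EXECUTED if result else Outcome.NOT_EXECUTED
    return Outcome.NOT_DELIVERED
```

Keeping the two failure outcomes separate lets the server react differently: retry or switch channels for `NOT_DELIVERED`, and alert a mechanic or the user for `NOT_EXECUTED`.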
With Delimobil, the trip starts anyway.
And so does a conversation with a technical support operator.
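The rule that a rental should not enter the “trip” state without confirmation from the car could look roughly like this. The state names, the `engine_unlocked` flag and the error text are hypothetical and only illustrate the idea.

```python
from enum import Enum


class RentalState(Enum):
    BOOKED = "booked"
    TRIP = "trip"


class CommandFailed(Exception):
    """Raised so the app can show an error instead of starting the trip."""


def start_trip(state: RentalState, engine_unlocked: bool) -> RentalState:
    """Move to TRIP only when the car confirmed the engine unlock.

    `engine_unlocked` stands for the execution confirmation from the
    on-board device (a hypothetical flag for this sketch).
    """
    if state is not RentalState.BOOKED:
        raise ValueError(f"cannot start a trip from state {state.value}")
    if not engine_unlocked:
        # The rental stays BOOKED, so billing does not start for a silent car.
        raise CommandFailed("The car did not respond. Try again or pick another car.")
    return RentalState.TRIP
```

With a check like this, a user facing a car that does not respond sees an error on the screen rather than a running meter.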
Story
I work in Skolkovo. Because public transport there is poor, I commuted to work and back every day by car sharing, like many of my colleagues. But 3 days ago, the mobile connection in the parking zone deteriorated. Why the Innovation Center has problems with mobile coverage is a separate question, but the situation exposed an interesting problem: Delimobil users who had booked a car were effectively trapped.
On the cold evening of September 24th, I was heading home. I booked a car and walked up to it.
I tapped “Start inspection”, but the doors did not open.
- Well, it's probably a communication failure again. I'll take another one. There are plenty of them here!
I tapped “Finish rental”: “You are outside the parking zone.”
I call support and describe the situation. The operator tries to open the doors. Failure. Hold music. The doors open. Thanks.
- The servers probably glitched. Okay, let's go. I press “Start Trip”, and the app starts charging me.
The car does not start.
I call support and describe the situation. The operator tries to enable the engine start. Failure. “No connection to the car.”
- Okay, let's lock it manually. Lower the window, get out, press the central locking button, and raise the window.
The window does not go down. Apparently, without a command from the server, the car will not switch on the ignition. And there is no connection.
- Then you will have to wait for a mechanic. 1 to 1.5 hours.
- But it's cold here. There are 3 or 4 more people walking around Delimobil cars with their phones. Maybe mechanics have already been sent to them...
<car doors suddenly closed>
- That's it, then. Thanks. I'll take the minibus.
How others solve this problem
First, if there is no connection to a car, perhaps it should not be shown on the map.
Second, if the server had known that the command to open the doors was not executed, it would not have switched me into rental mode. Instead of 40 minutes in the cold and extra load on technical support, I would simply have seen an error message.
Third, you can add a backup communication channel: a second modem on another operator (my phone, for instance, had Internet access), or Bluetooth, as BelkaCar and YouDrive do. (Perhaps this option is not for Delimobil, since it would increase development and support costs, and Delimobil is the cheapest of the mass-market services.)
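The backup channel idea can be sketched as a simple ordered failover. The channel names and the convention that a channel function returns `True` on a confirmed command and raises `ConnectionError` when it has no link are assumptions for this example only.

```python
from typing import Callable, Sequence, Tuple

# A channel is a (name, send function) pair; both are hypothetical here.
Channel = Tuple[str, Callable[[str], bool]]


def send_with_fallback(channels: Sequence[Channel], command: str) -> str:
    """Try each channel in order and return the name of the one that worked.

    A channel function returns True only if the car confirmed the command.
    """
    for name, send in channels:
        try:
            if send(command):
                return name
        except ConnectionError:
            pass  # this channel is down, so move on to the next one
    raise ConnectionError(f"no channel could deliver {command!r}")
```

For example, passing `[("primary_modem", ...), ("second_modem", ...), ("bluetooth", ...)]` would try the cheapest path first and fall back to the phone's Bluetooth link only when both modems are silent.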
In the meantime, Delimobil rescues cars “manually” and overloads its technical support because its control commands have no delivery confirmation. And cars without a connection are still shown on the map and available for booking.
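Hiding cars that have lost their connection could be as simple as filtering on the time of the last message received from each car. The `Car` fields and the five-minute threshold below are illustrative assumptions, and a real threshold would depend on how often the on-board device reports in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass
class Car:
    plate: str
    last_seen: datetime  # time of the last message received from the car


def bookable_cars(
    cars: Iterable[Car],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> List[Car]:
    """Return only the cars heard from recently enough to be shown on the map."""
    return [car for car in cars if now - car.last_seen <= max_silence]


now = datetime(2019, 9, 24, 19, 0, tzinfo=timezone.utc)
fleet = [
    Car("A001", now - timedelta(minutes=1)),   # reported a minute ago
    Car("B002", now - timedelta(hours=2)),     # silent for two hours
]
visible = bookable_cars(fleet, now)  # only A001 remains bookable
```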
This is a broader problem.
I'm sure the Delimobil engineers are great. They have solved a sea of problems. Really. Besides the equipment and the system itself, you also have to build the processes for commissioning, maintenance, decommissioning and so on. Often, these processes require hardware and software development of their own.
But why then could such a situation arise? In my opinion, there are two likely reasons.
The first likely reason is that the application, the servers and the equipment were built by different contractors without a solid top-level design of the whole system. Each of them may have done their job well, but the overall architecture has problems.
The second likely reason is common to a great many projects. Building a prototype for a demo (for example, to investors) is not hard. A few weeks or even days may be enough. Designing and developing a reliable system, however, can take months or even years. Unfortunately, not all “effective managers” understand this.
Often, such management demands new features that it believes will increase company revenue, while seeing no commercial value in improving reliability.
What to do?
Locally, Delimobil needs to solve the parking problem in Skolkovo. A lot of cars are sitting idle there. It is unlikely that the company can get a mobile operator to improve coverage. So the most likely outcome, it seems to me, is that it will ban parking there and move the cars back to Moscow itself. A sad outcome :( Do you think this problem can be solved some other way?
Globally, technical leads must defend the need to improve reliability. At Delimobil, at least, they now have an argument.
P.S. Special thanks to the long-suffering tech support staff. They are polite and do try to solve the problem.