The birth of a platform





The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.

From the movie “The Lord of the Rings: The Fellowship of the Ring”



There are countless articles and talks on the Internet on the topic of “how we broke up our monolith”, and I have no desire to write yet another one. I tried to go a little further and tell how technology changes led to the appearance of a completely new product (spoiler: we set out to write a boxed product and ended up writing a platform). The article is largely an overview, without technical details. The details will come later.



This article is about Vepp, a panel for managing a website and a server. It is an ISPsystem product where I lead development. You can read about the capabilities of the new panel in another article; here it is only about the technology. But first, as usual, a little history.



Part 1. Something has to change



Our company has been writing software for hosting automation for more than 15 years. In that time, several generations of our products have come and gone. When we moved from the second generation to the fourth, we replaced Perl with C++ and put our software on open sale... When we moved from the fourth to the fifth, chasing speed, we went from a single-threaded application to a multi-threaded one (and from a monolith containing all our products to a framework).



But we were not the only ones changing: so were our customers and competitors. If 10-15 years ago the typical site owner was technically savvy (few others took much interest in the Internet), now it can be a person with no connection to IT at all. And there are many competing solutions. Under such conditions it is no longer enough for a product simply to work; it has to be easy and pleasant to use.



Heroic redesign



This change was in every sense the most noticeable. All that time the interface of our products had remained practically unchanged and unified; we already wrote about that separately, from the UX point of view. In the fifth generation of products, the API defined the appearance of forms and lists. On the one hand, that made it possible to implement many things without involving frontend developers; on the other, it gave rise to very complicated, composite calls, sometimes affecting most of the system, and severely limited our ability to change the interface. After all, any interface change inevitably meant an API change. And that's it: hello, broken integrations!



For example, creating a user in ISPmanager could also create an FTP user, a mail domain, a mailbox, a DNS record, and a website. All of this was done atomically, and therefore blocked changes to all the listed components until the operation completed.



In the new API we switched to small and simple atomic operations, and composite actions became the job of the frontend. This makes it possible to build a complex interface and change it without touching the core API and, therefore, without breaking integrations.



For frontend developers we implemented a batch request execution mechanism with the ability to run compensating actions in case of an error. As practice shows, most composite requests create something, and a creation is easy to undo by performing a deletion.



Shorter response times and instant notifications



We abandoned long-running modifying requests. Previous versions of our products could hang for a long time trying to fulfil a user request. We decided that for every action more complicated than a trivial change to the database we would create a task in the system and respond to the client as quickly as possible, letting the user keep working instead of staring at an endless spinner.
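A minimal sketch of this pattern, assuming a simple in-process queue (TaskQueue and handle_create_site are illustrative names, not Vepp's real API): the modifying handler only records what has to be done and immediately answers with a task id.

```cpp
// Sketch of the "create a task, answer at once" pattern; names are illustrative.
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

class TaskQueue {
public:
    // Register a long-running job and return its id right away.
    long enqueue(std::function<void()> job) {
        std::lock_guard<std::mutex> lock(mutex_);
        long id = ++last_id_;
        jobs_.push({id, std::move(job)});
        return id;
    }
    // Executed later by a background worker or a separate service.
    void run_next() {
        std::pair<long, std::function<void()>> item;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (jobs_.empty()) return;
            item = std::move(jobs_.front());
            jobs_.pop();
        }
        item.second();  // the slow work happens outside the request handler
    }

private:
    std::mutex mutex_;
    std::queue<std::pair<long, std::function<void()>>> jobs_;
    long last_id_ = 0;
};

// A modifying request only enqueues the work and answers with the task id;
// the client learns about completion from a notification later.
std::string handle_create_site(TaskQueue& queue, const std::string& domain) {
    long task_id = queue.enqueue([domain] { /* provision the site, write configs, etc. */ });
    return "{\"task\": " + std::to_string(task_id) + "}";
}
```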



But what if you need the result, as they say, here and now? I think many people know the situation where you reload the page over and over, hoping the operation is about to finish.



In the new generation of products we use WebSocket for instant event delivery.



The first implementation used long polling, but frontend developers found that approach inconvenient.
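To illustrate the delivery path (EventBus and its methods are illustrative, not the actual Vepp interfaces): the worker that finishes a task publishes an event once, and every subscriber, which in the product is a WebSocket session, receives it immediately instead of the client polling.

```cpp
// Sketch of push notifications: task completion is published once and handed
// to every subscriber; in the product a subscriber is a WebSocket connection.
#include <functional>
#include <map>
#include <mutex>
#include <string>

class EventBus {
public:
    using Handler = std::function<void(const std::string& event)>;

    // Called when a client opens a WebSocket; returns a subscription id.
    int subscribe(Handler handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        int id = ++last_id_;
        handlers_[id] = std::move(handler);
        return id;
    }
    // Called when the connection closes.
    void unsubscribe(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.erase(id);
    }
    // Called by the worker that finished a task, e.g. publish("task 42 done").
    void publish(const std::string& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& entry : handlers_) entry.second(event);
    }

private:
    std::mutex mutex_;
    std::map<int, Handler> handlers_;
    int last_id_ = 0;
};
```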



HTTP for internal communication



HTTP won as a transport. These days an HTTP server can be implemented in seconds in any language (or in minutes, if you have to google it). Even for local communication it is easier to form an HTTP request than to cobble together a protocol of your own.



In the previous generation, extensions (plugins) were applications launched as CGI when needed. But to write a long-lived extension you had to work hard: write a plugin in C++ and rebuild it with every product update.



Therefore, in the sixth generation we switched to internal interaction over HTTP, and extensions effectively became small web servers.
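To show how little an extension now requires, here is a sketch of a long-lived responder on plain POSIX sockets (the port and JSON payload are made up for the example; a real extension would parse the request and use whatever HTTP library it prefers).

```cpp
// A minimal long-lived "extension": an HTTP responder on localhost.
// Port and payload are illustrative; error handling is omitted for brevity.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int yes = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                    // the port the panel is told about
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // listen on 127.0.0.1 only
    bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(srv, 16);

    const std::string body = "{\"ok\": true}";
    const std::string reply =
        "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
        "Content-Length: " + std::to_string(body.size()) + "\r\n\r\n" + body;

    for (;;) {
        int client = accept(srv, nullptr, nullptr);
        if (client < 0) continue;
        char buf[4096];
        ssize_t ignored = read(client, buf, sizeof(buf));  // the request itself is ignored here
        (void)ignored;
        write(client, reply.data(), reply.size());
        close(client);
    }
}
```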



A new (not quite REST) API



In previous generations of products we passed all parameters via GET or POST. GET poses no particular problem, since its size is small; but with POST there was no way to check access or redirect the request until it had been read in full.



Imagine how sad that is: you accept a few hundred megabytes, or even gigabytes, only to discover that they were uploaded by an unauthorized user, or that they now have to be forwarded to another server!



Now the function name is passed in the URI, and authorization lives exclusively in the headers. This allows part of the checks to be carried out before reading the request body.
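A sketch of what such an early check might look like (RequestHead, token_is_valid and handler_exists are illustrative stand-ins): both routing and authorization are decided before a single byte of the body has been read.

```cpp
// Checks that need nothing but the request line and headers; names are illustrative.
#include <map>
#include <string>

struct RequestHead {
    std::string method;                          // e.g. "POST"
    std::string uri;                             // the function name lives in the URI
    std::map<std::string, std::string> headers;  // including the auth token
};

// Placeholders: real validation lives in the auth service, real routing in a table.
bool token_is_valid(const std::string& token) { return !token.empty(); }
bool handler_exists(const std::string& uri)   { return !uri.empty(); }

// A multi-gigabyte upload from an unauthorized client is rejected immediately:
// we answer 401 or 404 without ever touching the request body.
bool may_read_body(const RequestHead& head) {
    auto auth = head.headers.find("Authorization");
    if (auth == head.headers.end() || !token_is_valid(auth->second))
        return false;                            // 401 before the body
    return handler_exists(head.uri);             // 404 before the body
}
```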



In addition, the API functions themselves became simpler. Yes, we no longer guarantee the atomicity of creating a user, a mailbox, a website and the like, but we make it possible to combine these operations as needed.



We implemented batch request execution. It is in fact a separate service that accepts a list of requests and executes them sequentially. It can also roll back already completed operations in the event of an error.
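The core of that service can be sketched in a few lines (Step and run_batch are illustrative; the real service deals with HTTP requests rather than std::function): steps run in order, and if one fails the already completed ones are undone in reverse order, for example a deletion compensating a creation.

```cpp
// Sketch of batch execution with rollback; names are illustrative.
#include <functional>
#include <stdexcept>
#include <vector>

struct Step {
    std::function<void()> apply;     // e.g. POST /user      -> create the user
    std::function<void()> rollback;  // e.g. DELETE /user/id -> the compensating action
};

void run_batch(const std::vector<Step>& steps) {
    std::vector<const Step*> done;
    try {
        for (const auto& step : steps) {
            step.apply();
            done.push_back(&step);
        }
    } catch (const std::exception&) {
        // Undo completed steps in reverse order, then report the failure to the caller.
        for (auto it = done.rbegin(); it != done.rend(); ++it)
            (*it)->rollback();
        throw;
    }
}
```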



Long live SSH!



Another decision we made, based on our previous experience, was to work with the server only over SSH (even if it is the local server). Initially VMmanager and ISPmanager worked with the local server, and only later did we add the ability to connect remote ones. This meant having to support two implementations.
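As a sketch of the idea, assuming for brevity that the system ssh client is invoked (an SSH library plays the same role in practice; the host and command here are placeholders): every command the panel needs goes through the same code path, whether the target server is remote or local.

```cpp
// Run a command on a managed host over SSH and return its output.
// Uses the system ssh client for brevity; host and command are placeholders.
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

std::string run_over_ssh(const std::string& host, const std::string& command) {
    const std::string cmd = "ssh -o BatchMode=yes " + host + " " + command;
    std::array<char, 4096> buf{};
    std::string output;

    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) throw std::runtime_error("failed to start ssh");
    while (fgets(buf.data(), buf.size(), pipe)) output += buf.data();
    if (pclose(pipe) != 0) throw std::runtime_error("remote command failed: " + command);
    return output;
}
```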



And once we gave up working with the local server directly, the last reasons to use the native libraries of the user's operating system disappeared. They had tormented us since the company was founded. Native libraries do have their advantages, but the disadvantages outweigh them.



An undeniable plus is lower consumption of both disk and memory, which can matter a lot when running inside a VDS. Among the drawbacks: using a library whose version you do not control can lead to unexpected results, which greatly increases the load on both development and testing. Another drawback is being unable to use the latest library versions and modern C++ (for example, on CentOS 6 even C++11 is not fully supported).



Old habits



While changing approaches and switching to new technologies, we kept acting the old way. That caused difficulties.



The microservices hype did not pass us by either. We too decided to split the application into separate services. This tightened control over interaction and made it possible to test individual parts of the application. And to control the interaction even more tightly, we packed them into containers, which can nevertheless always be put together on one host.



In a monolithic application you can easily reach almost any data. But even if you split the product into several applications and leave them side by side, they will, like living things, grow connections: shared files, direct requests to each other, and so on.



The switch to microservices was not easy. The old pattern of “write a library and link it in wherever needed” haunted us for quite some time. For example, we have a service responsible for performing “long” operations. Initially it was implemented as a library linked into whichever application needed it.



Another habit can be described like this: why write yet another service if you can teach an existing one to do the job? The first thing we sawed off from our monolith was the authorization mechanism. But then came the temptation to cram all the common components into that service, just as in COREmanager (the base framework for the fifth-generation products).



Part 2. Combining the incompatible



At first, read and write requests were handled by a single process. Write requests are, as a rule, blocking, but at the same time very fast: write to the database, create a task, answer. With read requests the story is different. It is hard to turn them into tasks: they can produce a rather lengthy response, and what do you do with that response if the client never comes back for it? How long do you store it? At the same time, the processing of read requests parallelizes perfectly. These differences caused problems when restarting such processes. Their life cycles are simply incompatible!



So we divided the application into two parts: reading and writing. True, it soon became clear that this was not very convenient from a development point of view: reading a list is done in one place, editing it in another, and the main thing is not to forget to fix the second when you change the first. And constantly switching between files is annoying at the very least. Therefore we ended up with a single application that runs in two modes: reading and writing.
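A minimal sketch of that idea (the flag and function names are made up, not the actual Vepp options): the same binary is started once per mode, so reading and writing code lives side by side but runs in processes with different life cycles.

```cpp
// One binary, two modes: the reading instance can be restarted at any moment,
// the writing instance finishes its fast blocking work first. Names are illustrative.
#include <cstring>
#include <iostream>

void register_read_handlers()  { std::cout << "serving lists and objects\n"; }
void register_write_handlers() { std::cout << "serving modifying requests, creating tasks\n"; }

int main(int argc, char** argv) {
    const bool write_mode = argc > 1 && std::strcmp(argv[1], "--write") == 0;
    if (write_mode)
        register_write_handlers();
    else
        register_read_handlers();
    // ... start the event loop with the selected handler set ...
}
```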



The previous generation of our products made extensive use of threads. But, as practice showed, they did not help us much: because of the many locks, CPU usage rarely exceeded 100%. Having a large number of separate and fairly fast services allowed us to abandon multithreading in favor of asynchronous, single-threaded code.



During development we tried to combine threads with asynchrony (boost::asio allows this). But this approach is more likely to bring the shortcomings of both into your project than any visible benefit: you get both the need to guard access to shared objects and the difficulty of writing asynchronous code.
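For illustration, a minimal single-threaded boost::asio program (the timer stands in for any asynchronous operation, such as an HTTP exchange with another service): one io_context, one thread, no locks.

```cpp
// Single-threaded asynchronous work with boost::asio: everything is driven
// by one event loop, so shared state needs no locking.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main() {
    boost::asio::io_context io;

    boost::asio::steady_timer timer(io, std::chrono::seconds(1));
    timer.async_wait([](const boost::system::error_code& ec) {
        if (!ec) std::cout << "handler ran on the single io_context thread\n";
    });

    // Accepting connections, talking to other services over HTTP, reading the
    // output of SSH sessions -- all of it would be scheduled on this same loop.
    io.run();
}
```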



Part 3. How we set out to write a box and wrote a platform



All the services are packed into containers and work with the client's server remotely. So why install the application on the client's server at all? That is the question I asked management when the time came to package the resulting product for installation on a server.



So what is a platform? First we launched a SaaS: a service that runs on our servers and lets you configure your own server. If you have ever used a server control panel and bought it yourself, this solution is for you. But it does not suit providers: they are not ready to give a third-party company access to their clients' servers, and I understand them very well; this raises questions of both security and fault tolerance. So we decided to give them our entire SaaS so that they could deploy it on their own premises. It is like an Amazon that you can run in your own data center and connect to your billing. We called this solution the platform.



The first deployment did not go very smoothly. For each active user we started a separate container. Docker containers start quickly, but service discovery does not react instantly: it is not designed for containers being started and stopped dynamically within a second. Sometimes minutes passed from the moment a container came up to the moment the service could actually be used!



I already wrote that the hosting user has changed a lot over the last decade. Now imagine: they buy hosting from you and get shell access. WTF?! To use it they first have to find an SSH client (on Windows this can be a problem, since there is no client by default; I will not even mention mobile clients).



Once they have managed to reach the console, they need to install the panel, and that operation is not fast either. And what if something goes wrong? For example, Roskomnadzor may block the servers from which packages for your OS are downloaded. The user is left to face such errors alone.







You may argue that in most cases the user gets a panel already installed by the hoster, and the above does not apply to them. Perhaps. But a panel running on the server consumes resources you paid for (it takes up disk space, eats CPU and memory). And its availability depends directly on the server: the server goes down, the panel goes down with it.



There may also be those who use no panel at all and think this does not concern them. But perhaps on your server, if you have one, some panel is still installed and eating resources; you simply do not use it?



“But that's great, you help us sell resources,” say Russian hosters. Others add: “Why should we spend our own resources deploying a platform that consumes more than a standalone panel?”



There are several answers to this question.

  1. The quality of service improves: you control the panel version, so you can update it quickly when new functionality appears or bugs are found, and you can announce promotions right in the panel.
  2. At infrastructure scale you save disk, memory and CPU, since some processes are started only to serve active users, while others serve many clients at once.
  3. Support no longer has to figure out whether some behavior is a feature of a particular panel version or a bug. That saves time.


On top of that, many of our customers used to buy licenses in packs and then juggle them, reselling them to their own clients. There is no longer any need for this essentially pointless work: it is all one product now.



In addition, we gained the ability to use heavy solutions and offer functionality that was previously out of reach. For example, the service that takes screenshots of a site needs to launch a headless Chromium. I doubt the user would be happy if such an operation ran the server out of memory and, say, MySQL got killed as a result.



In conclusion, I want to note that we learn from others' experience and are actively accumulating our own. Logging, Docker, service discovery, all kinds of delays, retries and queues... you cannot even remember everything we had to master or rethink. None of this makes development easier, but it opens up new possibilities and makes our work more exciting.



The story is not over yet, but what has come of it so far can be seen on the Vepp website.


