How Redash noticed and fixed a problem that caused Python code performance degradation

Recently, Redash began replacing one task execution system with another: it started migrating from Celery to RQ. At the first stage, only the tasks that do not execute queries directly were moved to the new platform. These include sending emails, figuring out which queries should be refreshed, recording user events, and other supporting tasks.







After deploying this change, it turned out that the RQ workers required far more computing resources to handle the same volume of tasks that Celery used to handle.



The article we are publishing today in translation tells the story of how Redash tracked down the cause of the problem and dealt with it.



A few words about the differences between Celery and RQ



Both Celery and RQ have the concept of worker processes, and both rely on forking to run tasks in parallel. When a Celery worker starts, it forks several child processes, each of which autonomously processes tasks. An RQ worker, by contrast, spawns only one subprocess (known as the "work horse"), which executes a single task and then terminates. When the worker pulls the next task from the queue, it forks a new "work horse".



When working with RQ, you can achieve the same level of parallelism as with Celery simply by running more worker processes. There is, however, a subtle difference between the two: a Celery worker creates its pool of subprocesses once at startup and then reuses them for many tasks, while an RQ worker forks a new subprocess for every single job. Both approaches have their pros and cons, but that is beyond the scope of this article.
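
To make this concrete, here is a rough sketch (an assumed standalone setup, not code from Redash) of what parallelism looks like on the RQ side: you simply start several independent worker processes against the same queue, and each of them forks a fresh "work horse" per job.

 # Rough sketch, not Redash's code: RQ parallelism is just N independent
 # worker processes on the same queue; each worker forks a "work horse"
 # child for every job it picks up.
 from multiprocessing import Process

 from redis import Redis
 from rq import Worker


 def start_worker(queue_name="default"):
     connection = Redis(host="redis", port=6379, db=0)
     Worker([queue_name], connection=connection).work()


 if __name__ == "__main__":
     # Comparable in spirit to one Celery worker with concurrency=4, except
     # these four processes are fully independent of each other.
     workers = [Process(target=start_worker) for _ in range(4)]
     for p in workers:
         p.start()
     for p in workers:
         p.join()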



Performance measurement



Before I started profiling, I decided to measure the performance of the system by finding out how long a worker container needs to process 1000 jobs. I decided to focus on the record_event task, since it is a common, lightweight operation. To measure performance, I used the time command. This required a couple of changes to the project code:



  1. To measure how long it takes to run 1000 tasks, I used RQ's burst mode, in which the worker exits once it has processed everything in the queue.
  2. I wanted to keep other tasks that might be scheduled during the measurement window from skewing the results, so I moved record_event to a separate queue called benchmark, replacing @job('default') with @job('benchmark') right above the record_event definition in tasks/general.py (see the sketch below).
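
A minimal sketch of the change from step 2 follows. It assumes RQ's stock job decorator; Redash wraps it in a helper of its own, so the exact signature in tasks/general.py may differ, and the task body here is purely hypothetical.

 # Sketch only: moving record_event to a dedicated "benchmark" queue.
 # Assumes rq.decorators.job; Redash's own @job helper may differ slightly.
 from redis import Redis
 from rq.decorators import job

 redis_connection = Redis(host="redis", port=6379, db=0)


 @job('benchmark', connection=redis_connection)   # was: @job('default', ...)
 def record_event(raw_event):
     # Hypothetical body; the real task persists the event to the database.
     print("recorded event:", raw_event)

With this in place, record_event.delay(...) enqueues jobs onto the benchmark queue instead of default.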


Now I could start measuring. First, I wanted to know how long it takes to start and stop a worker with no load at all; this time could then be subtracted from the later results.



 $ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

 real    0m14.728s
 user    0m6.810s
 sys     0m2.750s





On my machine, starting and stopping the worker took 14.7 seconds. I kept this figure in mind.



Then I enqueued 1000 test record_event jobs into the benchmark queue:



 $ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
      
      





After that, I started the system in the same way as I did before, and found out how long it takes to process 1000 jobs.



 $ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

 real    1m57.332s
 user    1m11.320s
 sys     0m27.540s





Subtracting the 14.7 seconds of startup overhead, I found that 4 workers process 1000 tasks in 102 seconds. Now let's figure out why. To do that, we will examine the workers with py-spy while they are busy.



Profiling



We add another 1000 tasks to the queue (necessary because the previous run consumed all of them), start the workers, and profile them.



 $ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
 $ docker-compose exec worker bash -c 'nohup ./manage.py rq workers 4 benchmark & sleep 15 && pip install py-spy && rq info -u "redis://redis:6379/0" | grep busy | awk "{print $3}" | grep -o -P "\s\d+" | head -n 1 | xargs py-spy record -d 10 --subprocesses -o profile.svg -p'
 $ open -a "Google Chrome" profile.svg





I know that the previous command is very long. Ideally, to make it more readable, it would be broken into separate fragments at the && boundaries. But the commands have to run sequentially within the same docker-compose exec worker bash session, so it all stays in one piece. Here is what this command does:



  1. Launches 4 burst-mode workers in the background.
  2. Waits 15 seconds (roughly how long they need to finish starting up).
  3. Installs py-spy.
  4. Runs rq info and extracts the PID of one of the workers.
  5. Records 10 seconds of activity of the worker with that PID and saves the data to the profile.svg file.


As a result, I got the following flame graph.





Visualization of data collected by py-spy



Analyzing this data, I noticed that the record_event task spends a large share of its time in sqlalchemy.orm.configure_mappers, and that this happens on every task execution. From the documentation I learned that this call initializes the relationships of all previously defined mappers.



There is no need for this to happen in every fork. If we initialize the relationships once in the parent worker, the "work horses" can skip this work entirely.
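
Below is a minimal sketch of that idea, assuming a plain RQ worker started from a small standalone script rather than Redash's actual worker entry point, and assuming that importing redash.models is what registers all of the model mappers. Redash's real patch may look different.

 # Sketch, not Redash's actual patch: warm up SQLAlchemy's mapper configuration
 # once in the parent worker process so that every forked "work horse" inherits
 # already-configured mappers via copy-on-write instead of redoing the work.
 import sqlalchemy.orm
 from redis import Redis
 from rq import Worker

 from redash import models  # assumed import that registers all model mappers


 def run_worker(queue_names):
     # Expensive one-time call: resolves the relationships between all mappers
     # defined so far. Done here, it no longer shows up in each task's profile.
     sqlalchemy.orm.configure_mappers()

     connection = Redis(host="redis", port=6379, db=0)
     Worker(queue_names, connection=connection).work(burst=True)


 if __name__ == "__main__":
     run_worker(["benchmark"])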



So I added a call to sqlalchemy.orm.configure_mappers() to the code that runs before the "work horse" is forked, and took measurements again:



 $ docker-compose run --rm server manage shell <<< "from redash.tasks.general import record_event; [record_event.delay({ 'action': 'create', 'timestamp': 0, 'org_id': 1, 'user_id': 1, 'object_id': 0, 'object_type': 'dummy' }) for i in range(1000)]"
 $ docker-compose exec worker bash -c "time ./manage.py rq workers 4 benchmark"

 real    0m39.348s
 user    0m15.190s
 sys     0m10.330s





Subtracting the 14.7 seconds of startup overhead from these results, the time required for 4 workers to process 1000 tasks improved from 102 seconds to 24.6 seconds. A fourfold performance improvement! Thanks to this fix, we were able to cut the RQ compute resources in production fourfold while keeping the same system throughput.



Summary



The conclusion I drew from all this: it is worth remembering that an application behaves differently as a single long-lived process than it does when it forks. If every task has to perform some expensive setup work, it is better to do that work once, before the fork happens. Things like this do not show up in testing and development, so if you feel that something is off with your project, measure its performance and keep digging until you find the root cause.
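
As a toy illustration of this lesson (not taken from the article's code base, and Unix-only because of os.fork), the snippet below pays for an expensive setup step once in the parent, so every forked child gets the result through copy-on-write instead of repeating it per task.

 # Toy example of "do expensive setup before the fork". expensive_setup() here
 # stands in for configure_mappers(), cache warm-up, loading a model, etc.
 import os
 import time


 def expensive_setup():
     time.sleep(1)  # simulate slow one-time initialization
     return {"ready": True}


 STATE = expensive_setup()  # paid exactly once, in the parent process

 for task_id in range(4):
     pid = os.fork()
     if pid == 0:
         # Child ("work horse"): STATE is already initialized here, so the
         # per-task cost no longer includes the one-second setup.
         print(f"task {task_id} handled by pid {os.getpid()}, state={STATE}")
         os._exit(0)
     os.waitpid(pid, 0)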



Dear readers! Have you encountered performance problems in Python projects that you could solve by carefully analyzing a working system?







