How to optimize pandas when working with large datasets (essay)

If the machine has plenty of memory and/or the dataset is small, it can be safely loaded into pandas without any optimization. However, if the data is large, the question arises of how to process it, or at least count through it.



Here we will look at optimization in miniature, so as not to pull giant datasets over the network.



As a dataset, we will use Habr statistics with user comments for 2019, which is publicly available thanks to one hardworking user:

dataset



As a reference, we will use a previously translated article from Habr, which covers a lot of interesting material.



Instead of an introduction



The Habr statistics dataset is considered small, although it occupies 288 MB and consists of 448533 rows.

Of course, you can find more data, but in order not to freeze the machine, let us stick with this one.



For convenience, we will add the column names (just write them as the first line of the file):



a,b,c,d
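If you do not want to edit the file by hand, the header line can be prepended with a few lines of Python. A minimal sketch, assuming the file is named habr_2019_comments.csv as in the code below (note that it reads the whole file into memory, which is fine for 288 MB):

# Prepend the header row 'a,b,c,d' to the CSV file (assumed file name)
with open('habr_2019_comments.csv', 'r', encoding='utf-8') as src:
    body = src.read()
with open('habr_2019_comments.csv', 'w', encoding='utf-8') as dst:
    dst.write('a,b,c,d\n')
    dst.write(body)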
      
      





Now, if you directly load the dataset into pandas and check how much memory it uses



import os
import time
import pandas as pd
import numpy as np

gl = pd.read_csv('habr_2019_comments.csv', encoding='UTF')

def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # if it is not a DataFrame, it is a Series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

print(gl.info(memory_usage='deep'))
      
      





see that he "eats" 436.1 MB:



RangeIndex: 448533 entries, 0 to 448532
Data columns (total 4 columns):
a    448533 non-null object
b    448533 non-null object
c    448533 non-null object
d    448528 non-null object
dtypes: object(4)
memory usage: 436.1 MB
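
To see which columns account for this memory, you can also look at per-column usage; a small sketch using the standard memory_usage method:

# Per-column memory usage in megabytes (deep=True counts the actual string contents)
print(gl.memory_usage(deep=True) / 1024 ** 2)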
      
      





It also shows that we are dealing with columns that contain data of type object.

Therefore, following that article, we should focus on the optimization intended for object columns. For all columns except one.



Column b contains dates, and for the convenience of further calculations and for clarity it is better to move them into the index of the dataset. To do this, change the code used when reading the dataset:



gl = pd.read_csv('habr_2019_comments.csv', parse_dates=['b'], index_col='b', encoding='UTF')
      
      





Now the dates are read as the index of the dataset, and memory consumption has slightly decreased:



 memory usage: 407.0 MB
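
As a side benefit of the datetime index, selecting comments for a period becomes a simple slice. A small sketch, assuming column b really holds the comment timestamps:

# Partial string indexing on a DatetimeIndex: all comments from January 2019
january = gl.loc['2019-01']
print(len(january))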
      
      





Now let us optimize the data in the columns of the dataset itself, apart from the index.



The optimization is called "optimizing the storage of object-type data using categorical variables".



Put simply, we need to convert the data in the columns to categories where it is effective to do so.
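
Why this saves memory: a categorical column stores each distinct value once and replaces the values in the rows with small integer codes. A sketch for column a, assuming it holds user names:

# A categorical column keeps the distinct values once and integer codes per row
users = gl['a'].astype('category')
print(users.cat.categories[:5])  # distinct values, stored a single time
print(users.cat.codes.head())    # small integer codes pointing into the categories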



To determine whether it will be effective, we need to know the number of unique values in a column: if it is less than 50% of the total number of values in the column, converting the values to a category will be effective.
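
A quick way to estimate this share for each object column (a sketch; the 0.5 threshold is the rule of thumb above):

# Share of unique values per object column; values well below 0.5 are good candidates
for col in gl.select_dtypes(include=['object']).columns:
    print(col, round(gl[col].nunique() / len(gl[col]), 4))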



Let's look at the dataset:



gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
      
      



We get:

             a       c       d
count   448533  448533  448528
unique   25100     185  447059
top      VolCh       0       !
freq      3377  260438     184
      
      





* the column with the dates is in the index and is not displayed



As you can see from the unique row, converting columns a and c to categories will be effective. For column a there are 25100 users (clearly fewer than 448533), and for c there are 185 values of the rating scale with "+" and "-" (also far fewer than 448533).



We optimize the columns:



converted_obj = pd.DataFrame()  # target DataFrame for the converted columns (missing in the original snippet)
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
      
      





To conveniently check how much memory is used, we introduce a function:



def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # if it is not a DataFrame, it is a Series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
      
      





And check whether the optimization was effective:



>>> print('Memory before optimization: ' + mem_usage(gl_obj))
Memory before optimization: 407.14 MB
>>> print('Memory after optimization: ' + mem_usage(converted_obj))
Memory after optimization: 356.40 MB
>>>
      
      





As you can see, we gained about another 50 MB.



Now, having made sure that the optimization pays off (sometimes it is the other way around), we will specify these parameters right when reading the dataset, so that the data types are taken into account immediately:



gl = pd.read_csv('habr_2019_comments.csv',
                 parse_dates=['b'], index_col='b',
                 dtype={'a': 'category', 'c': 'category', 'd': 'object'},
                 encoding='UTF')
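
After this, a quick check confirms that the types were applied right at read time; a small sketch (the exact memory figure depends on the data):

# Confirm the dtypes and the reduced memory footprint after the optimized read
print(gl.dtypes)
gl.info(memory_usage='deep')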
      
      





We wish you fast work with datasets!



The code for downloading is here.


