A few notes on working with bigint IDs in R

Every time a conversation starts about using various databases as a data source, the topic of identifiers for records, objects, or something else comes up. Sometimes the participants spend several months agreeing on the exchange protocol:

- int
- bigint
- guid

and so on, around in a circle. For large-scale tasks, given that R has no native bigint support (capacity ~2^64), choosing the right representation for such identifiers can be critical in terms of performance. Is there an obvious and universal workaround? Below are a few practical considerations that can be used in projects as a litmus test.







Typically, identifiers are used for three classes of tasks:

- grouping records;
- filtering (subsetting) records;
- merging (joining) tables.

Based on this, we will evaluate the various approaches.







This is a continuation of previous publications.







Store as string





For small data this option is quite good: it accepts identifiers of any length and supports not only numeric but also alphanumeric identifiers. An additional advantage is the guaranteed ability to receive the data correctly through any non-native database protocol, for example through the REST API of a gateway.







The cons are also obvious: high memory consumption, a larger volume of data coming from the database, and performance degradation both at the network level and at the computation level.
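To get a feel for the memory overhead, here is a rough, self-contained sketch (variable names and sizes are made up for the illustration) comparing the same identifiers stored as doubles and as strings:

library(pryr)

n <- 1e6
# identifiers that fit into 14-15 decimal digits, so a double holds them exactly
ids_num <- trunc(runif(n, 1e13, 1e14))
ids_chr <- format(ids_num, scientific = FALSE)

pryr::object_size(ids_num)  # ~8 MB: one double per identifier
pryr::object_size(ids_chr)  # several times larger: the strings themselves plus a vector of pointers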







We use the bit64 package

Many who have only heard the name of this package may think that here it is, the perfect solution. Alas, this is not entirely true. Not only is it an add-on on top of numeric (quote: 'Again the choice is obvious: R has only one 64 bit data type: doubles. By using doubles, integer64 inherits some functionality such as is.atomic, length, length<-, names, names<-, dim, dim<-, dimnames, dimnames<-.'), it also massively overrides base arithmetic, and there is no guarantee that it will not blow up somewhere or conflict with other packages.
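For illustration, a minimal sketch of where the caution is needed (the value is chosen only for the example): integer64 keeps values beyond 2^53 exact, but as soon as they leak into a plain double the precision is silently lost:

library(bit64)

x <- as.integer64("9007199254740995")  # 2^53 + 3, not exactly representable as a double
x + 1L                                 # integer64 arithmetic stays exact: 9007199254740996
as.numeric(x)                          # coercion to double silently rounds the value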







We use type numeric





This is a perfectly legitimate trick and a good compromise for those who know what exactly will be hidden in the int64 response from the database. After all, not all 64 bits will actually be used there; often the number is much smaller than 2^64.







Such a solution is possible due to the specifics of the double-precision floating-point format. Details can be found in the well-known Wikipedia article Double-precision floating-point format.







 The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^-53 ≈ 1.11 × 10^-16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.
      
      





If you have 15 or fewer decimal digits in the identifier, you can use numeric and not worry.
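The boundary is easy to probe directly in the console; a couple of one-liners, assuming ordinary double arithmetic:

(1e15 + 1) - 1e15   # 1: integers with up to 15 decimal digits round-trip exactly
2^53 == 2^53 + 1    # TRUE: above 2^53 adjacent integers can no longer be distinguished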







The same trick works well when you need to work with time data, especially data containing milliseconds. Transmitting times over the network in text form takes time, and on the receiving side you also have to run the text -> POSIXct conversion, which is extremely resource-intensive (performance drops by orders of magnitude). With binary transfer, it is not guaranteed that every driver will carry the time zone and the milliseconds. But time accurate to milliseconds in the UTC zone, represented as a unix timestamp (13 decimal digits), fits very well and losslessly into the numeric format.
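A small sketch of this approach (the timestamp value is made up for the example): the millisecond unix timestamp travels as a plain number and is restored to POSIXct on the R side:

options(digits.secs = 3)                 # show sub-second digits when printing

ts_ms <- 1546300800123                   # millisecond unix timestamp, 13 decimal digits
ts_ms < 2^53                             # TRUE: the integer value fits into a double exactly
as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "UTC")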







Not so simple and obvious



If we take a closer look at the string option, the obviousness and categoricalness of the initial statement recede somewhat. Working with strings in R is not entirely straightforward, even leaving aside the nuances of memory block alignment and prefetching. Judging by the books and the in-depth documentation, string values are not stored inside the variable itself: all strings are placed in a global string pool, and character vectors use this pool to reduce memory consumption. That is, a character vector is a set of strings in the global pool plus a vector of pointers to entries in that pool.







library(tidyverse)
library(magrittr)
library(stringi)
library(gmp)
library(profvis)
library(pryr)
library(rTRNG)

set.seed(46572)
RcppParallel::setThreadOptions(numThreads = parallel::detectCores() - 1)

# print long numbers as-is, without scientific notation
options(scipen = 10000)
options(digits = 14)
options(pillar.sigfig = 14)

pryr::mem_used()

fname <- here::here("output", "dump.csv")
# 10^7 strings drawn from a small set of unique values
# (global string pool + a vector of pointers into it)
m1 <- sample(stri_rand_strings(100, 64, "[a0-9]"), 10^7, replace = TRUE)
# dump the vector to disk
readr::write_csv(enframe(m1, name = NULL), fname)
# read it back from the file
m2 <- readr::read_csv(fname, col_types = "c") %>% pull(value)
pryr::object_size(m2)
pryr::mem_used()

# compare the file size with the in-memory sizes
print(glue::glue("File size: {fs::file_size(fname)}. ",
                 "Constructed from file object's (m2) size: {fs::fs_bytes(pryr::object_size(m2))}. ",
                 "Pure pointer's size: {fs::fs_bytes(8*length(m2))}"))

.Internal(inspect(m1))
.Internal(inspect(m2))
      
      





We see that even without going down to the C++ level, the hypothesis is not far from the truth. The size of the string vector almost coincides with the size of 64-bit pointers, and the variable itself takes up significantly less space than the file on disk.







 File size: 62M. Constructed from file object's (m2) size: 7.65M. Pure pointer's size: 7.63M
      
      





And the contents of the vectors before writing and after reading are identical; accordingly, the vector elements refer to the same memory blocks.
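The same sharing can also be observed without .Internal(inspect()), for example with the lobstr package (a sketch on toy values; character = TRUE asks ref() to show the underlying string cells as well):

library(lobstr)

# repeated values point to the same entry in the global string pool
lobstr::ref(c("id_001", "id_001", "id_002"), character = TRUE)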







So, on closer inspection, using text strings as identifiers no longer seems such a crazy idea. Benchmarks for grouping, filtering, and merging, using dplyr and data.table, give roughly similar readings for numeric and character identifiers, which is further confirmation of the optimization via the global string pool. After all, the work is actually done on pointers, whose size is either 32 or 64 bits depending on the R build (32/64-bit), which is no wider than the numeric type itself.







# grouping
gc()
pryr::mem_used()
bench::mark(
  string = id_df %>% group_by(id_string) %>% summarise(j = prod(j, na.rm = TRUE)),
  # bit64 = id_df %>% group_by(id_bit64) %>% summarise(j = prod(j, na.rm = TRUE)),
  numeric = id_df %>% group_by(id_numeric) %>% summarise(j = prod(j, na.rm = TRUE)),
  # gmp = id_df %>% group_by(id_gmp) %>% summarise(j = prod(j, na.rm = TRUE)),
  check = FALSE
)

# filtering
gc()
pryr::mem_used()
string_subset <- sample(unique(id_df$id_string), 20)
numeric_subset <- sample(unique(id_df$id_numeric), 20)
bench::mark(
  string = dplyr::filter(id_df, id_string %in% string_subset),
  numeric = dplyr::filter(id_df, id_numeric %in% numeric_subset),
  check = FALSE
)

# joining
gc()
pryr::mem_used()
# explicit copies of the lookup tables
string_copy_df <- rlang::duplicate(dplyr::count(id_df, id_string))
numeric_copy_df <- rlang::duplicate(dplyr::count(id_df, id_numeric))
bench::mark(
  string = id_df %>% dplyr::left_join(string_copy_df, by = "id_string"),
  numeric = id_df %>% dplyr::left_join(numeric_copy_df, by = "id_numeric"),
  iterations = 10,
  check = FALSE
)
      
      





By the way, the maximum amount of memory available to R (on Windows) can be viewed with fs::fs_bytes(memory.limit()).







To be fair, it should be noted that dplyr did not always have fast joins on character columns, see the issue "Joining by a character column is slow, compared to joining by a factor column. #1386 {Closed}". That thread proposes exploiting the global string pool and comparing not the strings themselves but pointers to the strings.
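A hedged sketch of that idea using only base tools (df_a, df_b and the id_string column are hypothetical names): encode the character key once as integer codes over a shared dictionary and join on the codes instead of the strings:

# build the shared dictionary of keys once
keys <- union(df_a$id_string, df_b$id_string)

# replace string comparison with comparison of small integer codes
df_a$id_code <- match(df_a$id_string, keys)
df_b$id_code <- match(df_b$id_string, keys)

dplyr::left_join(df_a, df_b, by = "id_code")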







Memory Management Details



Basic sources









Conclusion



Naturally, this question is constantly asked in one form or another; a number of links are given below.









But in order to consciously understand what to do right, and what the opportunities and limitations are, it is best to go down to the lowest possible level. As it turns out, there are plenty of non-obvious details. The publication is of a research nature, since it touches on fairly basic aspects of R that few people deal with in their daily work. If there are significant additions, comments, or corrections, it will be very interesting to learn about them.







The previous publication is "Using R for Utility Tasks".







