Efficient memory usage on import #85

@ThomasKraft

Apologies for returning to the topic of #66, which is already closed, but I think the batch import could still be improved. The relevant code is here: https://github.com/tscnlab/LightLogR/blob/048786cdd0a25a2435ae26cb35f5da2e70ce50e5/R/import_expressions.R#L403C9-L405C53

I am loading ~70 VEET files in batch (IMU component only) and receiving memory allocation errors. Moving to a computing cluster would work around this, but it shouldn't be necessary for so few files, and I believe a code-level fix is preferable to brute-forcing it.

I believe the heart of the issue is that purrr::map() holds the data from all files in memory before list_rbind() combines them. The code would become more verbose, but an obvious solution would be to process the files incrementally and clean up memory along the way, i.e. something like:

```r
# Assumes filenames, modality, n_max, tz and veet_names are defined
# as in the existing VEET import expression.

# The row filter does not depend on the file, so build it once
pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

# Columns to convert: veet_names[[modality]] indexed by itself
numeric_cols <- names(veet_names[[modality]][veet_names[[modality]]])

# Start with no data; dplyr::bind_rows() ignores NULL inputs
data <- NULL

for (filename in filenames) {
  # Read one file and keep only the rows of the requested modality
  file_data <-
    readr::read_lines(file = filename, n_max = n_max) %>%
    .[stringr::str_detect(., pattern)] %>%
    stringr::str_split(",") %>%
    purrr::list_transpose() %>%
    list2DF()

  names(file_data) <- names(veet_names[[modality]])

  file_data <- file_data %>%
    dplyr::mutate(file.name = filename, .before = 1) %>%
    dplyr::mutate(
      dplyr::across(tidyselect::all_of(numeric_cols), as.numeric),
      Datetime = lubridate::with_tz(
        lubridate::as_datetime(time_stamp, tz = "UTC"), tz),
      .before = 1
    )

  # Bind incrementally
  data <- dplyr::bind_rows(data, file_data)

  # Free the memory held by this file's intermediate objects
  rm(file_data)
  gc()
}
```

Please note that I haven't actually tested this yet, but wanted to run the idea by you first. If you think it would be ok, I can test and make a PR off my fork.

Perhaps more importantly, I understand the desire to keep the package focused on tidyverse syntax, but there may be an argument that data.table will be needed as a dependency in the long term, given that datasets are becoming increasingly large and complex. My understanding is that data.table is considerably faster and more memory-efficient in nearly all cases, and tidy-like syntax can be kept by using dtplyr: https://www.appsilon.com/post/r-dtplyr. For the same reason, packages like GGIR use data.table for analogous problems when importing extremely large accelerometry files that have to be processed in chunks. In my opinion this would help to "future-proof" LightLogR.
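To illustrate the dtplyr point, here is a minimal sketch on synthetic data. The column names are made up for the example and nothing here is taken from LightLogR. dplyr verbs are written as usual, and dtplyr translates them to data.table when the result is requested:

```r
library(dplyr)
library(dtplyr)

# Synthetic stand-in for imported IMU rows (illustrative columns only)
imu <- tibble::tibble(
  file.name = rep(c("a.csv", "b.csv"), each = 3),
  ax        = c(0.1, 0.2, 0.3, 1.0, 2.0, 3.0)
)

per_file <- imu %>%
  lazy_dt() %>%                      # wrap the data as a lazy data.table
  group_by(file.name) %>%
  summarise(mean_ax = mean(ax)) %>%
  as_tibble()                        # runs the translated data.table code
```

The user-facing syntax stays tidyverse-style; only the backend that does the work changes.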

For the code in question, one option would be to use data.table::fread() during the incremental import, which I am virtually certain would be faster and more memory-efficient.
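A rough, untested-against-real-data sketch of what that could look like. The helper name read_modality(), the five-column layout and the synthetic file contents are assumptions for illustration only; real VEET IMU rows have more columns. Each file is filtered to one modality, parsed with fread(text = ...), and the pieces are combined once with rbindlist():

```r
library(data.table)

# Hypothetical helper: read only the rows of one modality from one file
read_modality <- function(filename, modality, col_names) {
  lines <- readLines(filename)
  # Keep rows whose second comma-separated field equals the modality
  keep <- grepl(paste0("^[^,]*,", modality, ","), lines)
  # fread() parses the kept lines and guesses the column types itself
  dt <- fread(text = lines[keep], header = FALSE, sep = ",",
              col.names = col_names)
  dt[, file.name := basename(filename)]
  dt[]
}

# Two small synthetic files with mixed modalities (made-up layout)
files <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
for (f in files) {
  writeLines(c(
    "1719324500,IMU,0.01,0.02,0.98",
    "1719324501,ALS,120,3",
    "1719324502,IMU,0.03,0.01,0.97"
  ), f)
}

cols <- c("time_stamp", "modality", "ax", "ay", "az")  # illustrative names

# Read file by file, then combine all pieces in a single step
imu <- rbindlist(lapply(files, read_modality,
                        modality = "IMU", col_names = cols))

# Convert the epoch-seconds column to a datetime by reference
imu[, Datetime := as.POSIXct(time_stamp, origin = "1970-01-01", tz = "UTC")]

unlink(files)
```

Because fread() infers numeric columns directly, the separate as.numeric() conversion step is not needed here, and := adds columns without copying the table. A file with no matching rows would need extra handling that this sketch leaves out.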
