Efficient memory usage on import

Apologies for revisiting issue #66, which was already closed, but I think the batch import could still be improved. I am looking again at this part of the code: https://github.com/tscnlab/LightLogR/blob/048786cdd0a25a2435ae26cb35f5da2e70ce50e5/R/import_expressions.R#L403C9-L405C53

I am batch-loading ~70 VEET files (just the IMU component) and getting memory allocation errors. Moving to a computing cluster would work around this, but a file count this small shouldn't be a problem, and I believe there is a coding solution rather than brute-forcing it.

I believe the heart of the issue is that `purrr::map()` loads all files into memory before `list_rbind()` combines them. The code would become more verbose, but an obvious solution would be to process the files incrementally and clean up memory along the way, i.e., something like:

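```r
# Assumes filenames, modality, n_max, tz, and veet_names are defined
# as in the existing import expression

# Start from NULL; dplyr::bind_rows() drops NULL inputs
data <- NULL

# The pattern does not depend on the file, so build it once
pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

for (filename in filenames) {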
  
  # Read and process one file at a time
  file_data <- 
    readr::read_lines(file = filename, n_max = n_max) %>%
    .[stringr::str_detect(., pattern)] %>%
    stringr::str_split(",") %>% 
    purrr::list_transpose() %>% 
    list2DF()
  
  names(file_data) <- names(veet_names[[modality]])
  
  file_data <- file_data %>% 
    dplyr::mutate(file.name = filename, .before = 1) %>%
    dplyr::mutate(
      dplyr::across(
        tidyselect::all_of(
          veet_names[[modality]][veet_names[[modality]]] %>% names()), 
        as.numeric),
      Datetime = lubridate::with_tz(
        lubridate::as_datetime(time_stamp, tz = "UTC"), tz), 
      .before = 1
    )
  
  # Bind incrementally
  data <- dplyr::bind_rows(data, file_data)
  
  # Free memory
  rm(file_data)
  gc()
}
```

Please note that I haven't actually tested this yet; I wanted to run the idea by you first. If you think it would be OK, I can test it and open a PR from my fork.

Perhaps more importantly: I understand that there is a desire to keep the package focused on tidyverse syntax, but there is an argument to be made that `data.table` will be necessary as a dependency in the long term, given that datasets are becoming increasingly large and complex. My understanding is that `data.table` is overwhelmingly faster and more memory-efficient in nearly all cases, and if one wants to keep tidy-like syntax, it is possible to use `dtplyr`: https://www.appsilon.com/post/r-dtplyr. For the same reason, packages like `GGIR` use `data.table` for analogous problems when importing extremely large accelerometry files that have to be processed in chunks. My opinion is that this would likely help to "future-proof" LightLogR.
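To illustrate what keeping tidy-like syntax on top of `data.table` could look like, here is a minimal sketch using `dtplyr::lazy_dt()`. The table and its column names (`file.name`, `ax`) are made up for illustration and are not the actual VEET columns.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical IMU-like data; column names are invented for this sketch
imu <- data.table(
  file.name = rep(c("a.csv", "b.csv"), each = 3),
  ax = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

# lazy_dt() records the dplyr verbs and translates them to data.table code;
# nothing is computed until the result is collected with as_tibble()
per_file <- imu %>%
  lazy_dt() %>%
  group_by(file.name) %>%
  summarise(mean_ax = mean(ax)) %>%
  as_tibble()
```

The user-facing code stays in dplyr verbs while the computation runs through `data.table`. Whether this pays off for LightLogR would need benchmarking on real files.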

For the code currently in question, one option would be to use `data.table::fread()` during the incremental import, which I am virtually certain would be faster and more memory-efficient.
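As a rough, untested-on-real-data sketch of that idea: filter each file's lines for the modality, let `fread()` parse the kept lines, and bind once at the end with `data.table::rbindlist()`. The temporary file below is a hypothetical stand-in for a VEET file (I am only assuming that the second field names the modality, as the regex above implies), and `read_one()` is an illustrative helper, not part of LightLogR.

```r
library(data.table)

# Hypothetical stand-in for a VEET file: second field is the modality
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "1700000000,IMU,0.1,0.2,0.3",
  "1700000000,ALS,5,6,7",
  "1700000001,IMU,0.4,0.5,0.6"
), tmp)

filenames <- c(tmp, tmp)
modality <- "IMU"
pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

# Illustrative helper: keep only the lines for one modality, then parse them
read_one <- function(filename) {
  lines <- readLines(filename)
  keep <- lines[grepl(pattern, lines, perl = TRUE)]
  # fread() parses the kept lines and guesses column types itself
  fread(text = keep, sep = ",", header = FALSE)
}

# Bind all per-file tables in one step; idcol records the source file
data <- rbindlist(
  lapply(setNames(filenames, filenames), read_one),
  idcol = "file.name"
)
```

Binding once with `rbindlist()` avoids re-copying the accumulated table on every iteration, but whether peak memory actually drops for ~70 real files is something I would still need to measure.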




Efficient memory usage on import #85
