Apologies for revisiting this issue (#66), which was already closed, but I think the batch import could still be improved. I am looking again at this area: https://github.com/tscnlab/LightLogR/blob/048786cdd0a25a2435ae26cb35f5da2e70ce50e5/R/import_expressions.R#L403C9-L405C53
I am loading ~70 VEET files in batch (just the IMU component) and getting memory allocation errors. Moving to a computing cluster would work around this, but that shouldn't be necessary for so few files, and I believe there is a coding solution rather than brute-forcing it.
I believe the heart of the issue is that purrr::map() loads all files into memory before list_rbind() combines them. The code would become more verbose, but an obvious solution would be to process the files incrementally, freeing memory along the way, i.e., something like:
    # Build the row filter once: skip the first field, then match the modality
    pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

    # Start with no data; dplyr::bind_rows() ignores NULL inputs
    data <- NULL

    for (filename in filenames) {
      # Read and process one file at a time
      file_data <-
        readr::read_lines(file = filename, n_max = n_max) %>%
        .[stringr::str_detect(., pattern)] %>%
        stringr::str_split(",") %>%
        purrr::list_transpose() %>%
        list2DF()
      names(file_data) <- names(veet_names[[modality]])

      # Convert the columns flagged in veet_names[[modality]] to numeric
      # and derive Datetime in the requested time zone
      file_data <- file_data %>%
        dplyr::mutate(file.name = filename, .before = 1) %>%
        dplyr::mutate(
          dplyr::across(
            tidyselect::all_of(
              veet_names[[modality]][veet_names[[modality]]] %>% names()),
            as.numeric),
          Datetime = lubridate::with_tz(
            lubridate::as_datetime(time_stamp, tz = "UTC"), tz),
          .before = 1
        )

      # Bind incrementally
      data <- dplyr::bind_rows(data, file_data)

      # Free memory
      rm(file_data)
      gc()
    }
Please note that I haven't actually tested this yet, but wanted to run the idea by you first. If you think it would be ok, I can test and make a PR off my fork.
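To make the idea concrete without depending on LightLogR internals, here is a minimal, self-contained sketch of the per-file pattern on two toy files, using base R only so it runs without extra packages. The file contents, the column names (`time_stamp`, `modality`, `ax`, `ay`), and the helper `read_one_file()` are made up for illustration and are not the real VEET layout. It is a variation on the loop above: each file is converted to numeric before the next one is read, so the character vectors produced by splitting are short-lived, and the pieces are bound once at the end, because calling `bind_rows()` inside the loop copies the accumulated data on every iteration.

```r
# Toy stand-ins for VEET files; the second field names the modality.
# The column layout is hypothetical, not the real VEET format.
toy_lines <- list(
  c("1700000000,IMU,0.1,0.2", "1700000000,ALS,55", "1700000001,IMU,0.3,0.4"),
  c("1700000002,IMU,0.5,0.6", "1700000002,PHO,7,8,9")
)
filenames <- vapply(toy_lines, function(lines) {
  path <- tempfile(fileext = ".csv")
  writeLines(lines, path)
  path
}, character(1))

modality <- "IMU"
col_names <- c("time_stamp", "modality", "ax", "ay")  # hypothetical names
numeric_cols <- c("time_stamp", "ax", "ay")

# Same filter as in the loop above: skip one field, then match the modality
pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

# Hypothetical helper: read one file, keep matching rows, convert types.
# Assumes every file contains at least one matching row.
read_one_file <- function(path) {
  lines <- readLines(path)
  lines <- lines[grepl(pattern, lines, perl = TRUE)]
  fields <- do.call(rbind, strsplit(lines, ",", fixed = TRUE))
  out <- as.data.frame(fields, stringsAsFactors = FALSE)
  names(out) <- col_names
  # Convert per file so only typed columns are kept around
  out[numeric_cols] <- lapply(out[numeric_cols], as.numeric)
  data.frame(file.name = path, out, stringsAsFactors = FALSE)
}

# Collect typed pieces in a preallocated list, then bind once
pieces <- vector("list", length(filenames))
for (i in seq_along(filenames)) {
  pieces[[i]] <- read_one_file(filenames[i])
}
data <- do.call(rbind, pieces)
```

Whether this actually lowers peak memory for real VEET files would need to be measured; the only difference from the current code that this sketch is meant to show is that type conversion happens per file rather than after everything has been combined.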
Perhaps more importantly, I understand the desire to keep the package focused on tidyverse syntax, but there may be an argument that data.table will be needed as a dependency in the long term, given increasingly large and complex datasets. My understanding is that data.table is substantially faster and more memory-efficient in most cases, and dtplyr offers a way to keep tidy-like syntax: https://www.appsilon.com/post/r-dtplyr. For the same reason, packages like GGIR use data.table for analogous problems when importing extremely large accelerometry files that have to be processed in chunks. In my opinion, this would likely help to "future-proof" LightLogR.
For the code in question, an example would be using data.table::fread() during the incremental import, which I am virtually certain would be faster and more memory-efficient.
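A rough sketch of what the fread() variant could look like, again on toy files rather than real VEET data. The file contents, column names, and the helper `read_one_file()` are assumptions for illustration. I have not benchmarked this or run it against LightLogR. The lines are still filtered by modality first, because rows of different modalities have different numbers of fields, and the filtered lines are then handed to `fread()` through its `text` argument so that column types are detected directly.

```r
library(data.table)

# Toy stand-ins for VEET files (hypothetical layout, not the real format)
toy_lines <- list(
  c("1700000000,IMU,0.1,0.2", "1700000000,ALS,55", "1700000001,IMU,0.3,0.4"),
  c("1700000002,IMU,0.5,0.6", "1700000002,PHO,7,8,9", "1700000003,IMU,0.7,0.8")
)
filenames <- vapply(toy_lines, function(lines) {
  path <- tempfile(fileext = ".csv")
  writeLines(lines, path)
  path
}, character(1))

modality <- "IMU"
col_names <- c("time_stamp", "modality", "ax", "ay")  # hypothetical names
pattern <- paste0("^(?:[^,]*,){1}\\b", modality, "\\b")

# Hypothetical helper: filter one file's lines, then let fread() parse them.
# Assumes at least two matching lines per file; a single matching line
# may need separate handling when passed via `text`.
read_one_file <- function(path) {
  lines <- readLines(path)
  lines <- lines[grepl(pattern, lines, perl = TRUE)]
  dt <- fread(text = lines, header = FALSE, sep = ",", col.names = col_names)
  dt[, file.name := path]        # add the source file by reference
  setcolorder(dt, "file.name")   # move it to the front
  dt
}

# Bind all per-file tables in one step
data <- rbindlist(lapply(filenames, read_one_file))
```

The result is a data.table, so it would need converting (for example with `tibble::as_tibble()`) if the rest of the pipeline expects a tibble.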