From 76ee460cb7a9ca4f91f495e5563e00b8dd3b06f4 Mon Sep 17 00:00:00 2001
From: Stas Kolenikov
Date: Fri, 26 Apr 2024 11:36:29 -0500
Subject: [PATCH] Update README.md

Added resources on `targets`, `arrow::parquet`, `summarytools`
---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index ef5ed1c..42fc361 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,7 @@ For coding style practices, follow the [tidyverse style guide](https://style.tid
 mtcars %>% skim() %>% print()
 sink() # close the sink
 ```
+ - Other tools to provide a single descriptive view of the data set are `DescTools::Desc()` and `summarytools::dfSummary()`.
 ## Folder structure
@@ -75,6 +76,13 @@ Generally, within a project folder, we have a subfolder called `analysis` where
 ## Scripts structure
+
+
 ### Separating scripts
 Because we often work with large data sets and efficiency is important, I advocate (nearly) always separating the following three actions into different scripts:
@@ -197,6 +205,9 @@ Below is a brief example of a 00_run.R script.
+- For interim to large data sets, the `qs` package is recommended. There is also the `fst` package, which provides comparable speed improvements but strips data frames of their attributes (which often include `haven::labelled()` variable and value labels inherited from Stata). See the comparison at http://svmiller.com/blog/2020/02/comparing-qs-fst-rds-for-bigger-datasets/.
+- When you expect to be accessing your data in parts (a subset of variables/columns, such as demographic variables plus a single wage variable from a rich data set, or a subset of rows/observations, such as one or a few states at a time), Parquet storage (via `arrow::write_parquet()` and `arrow::read_parquet()`) provides a columnar format optimized for I/O. For utmost performance, `parquet` files need to be organized in special ways -- namely grouped by the commonly accessed subsetting variables, with other variables coded as numeric so that a quick scan of the file headers can reveal whether there are any data worth accessing in a given stripe of a file (https://arrow.apache.org/docs/r/articles/dataset.html).
+- A comparison of a broad range of fast(er) data-to-and-from-disk packages is given at https://rsangole.netlify.app/posts/2022-09-14_data-read-write-performance/data-read-write-perf. Generally, the speed improvements are a combination of parallel reads/writes (entirely absent from `base::saveRDS()`) and internal compression used by some of these packages.
 - When doing a time-consuming `map*()` or loop, e.g. reading in and manipulating separate data sets for each month, it is a good idea to save intermediate objects as part of the function being called by `map*()` or as part of the loop. That way, if something goes wrong you won't lose all your progress.
 ### Graphs