processing-jira-process-data/README.Rmd at main · AEADataEditor/processing-jira-process-data · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: "Process Data from Reproducibility Service"
author: "Lars Vilhuber"
date: '`r Sys.Date()`'
output:
  html_document:
    keep_md: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
editor_options:
  chunk_output_type: console
contributors:
  - Lars Vilhuber
  - Linda Wang
  - Takshil Sachdev
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
source(here::here("global-libraries.R"), echo = TRUE)
source(here::here("programs", "config.R"), echo = TRUE)

VERSION <- "V6"
YEAR <- "2025"
ICPSR_PROJECT <- "117876"
ICPSR_DOI <- paste0("https://doi.org/10.3886/E", ICPSR_PROJECT, VERSION)
```

> Note: The [PDF version](https://aeadataeditor.github.io/processing-jira-process-data/README.pdf) of this document is transformed by manually printing from a browser.

## Overview

This README describes how to process data for the AEA Pre-publication Verification Service. The code constructs the analysis file from raw process data extracted from Jira using an API. The replicator should expect the code to run for approximately 2 hours.

## Data Availability and Provenance Statements

Data used originates from Jira system used by the AEA data editor and the members of his replication lab.

### Statement about Rights

- [x] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
- [x] I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.

### Summary of Availability

- [x] Some data **cannot be made** publicly available.
- [x] Confidential data used are not provided as part of the public replication package.

## Details on the Data

### Raw process data

![Workflow stages](images/AEADataEditorWorkflow-20251128.png)

Raw process data from each step of the workflow is extracted from Jira using API (see Instructions to Replicators for details), and saved as `issue_history_MM-DD-YYYY.csv` (for detailed transaction-level data)

The data is not made available outside of the organization, as it contains names of replicators, manuscript numbers, and verbatim email correspondence. An anonymized version without identifying information is made available instead.

To obtain, run `programs/01_download_issues.py`. This will use the fields as specified in `data/metadata/jira-fields.xlsx`. If fields need to be updated (they are keyed on names), run `programs/00_jira_fields.py` to obtain a new CSV file, and mark the to-be-included fields with "True".

A full download of JIRA issues as of `r YEAR` would be

```
> python3 01_download_issues.py -s 2018-01-01 -e 2024-11-30
Summary:
- Start Date: 2018-01-01
- End Date: 2024-11-30

 About to extract all issue history between these dates from https://aeadataeditors.atlassian.net.
 The output will be written to /home/rstudio/data/confidential/issue_history_2024-12-03.csv.
```

```{r listfiles,include=FALSE}
listfiles <- as.data.frame(list.files(file.path(jiraconf),pattern = "^issue_history_"))
names(listfiles) <- c("filename")
latest <- listfiles %>%
  separate(filename,c("base","extension"),sep = "([\\.])",remove = FALSE) %>%
  separate(base,c("prefix","prefix2","date_str"),sep="_") %>%
  mutate(date = as.Date(date_str, format = "%Y-%m-%d")) %>% select(-prefix)
```

At this time, the latest extract was made `r max(latest$date)`.

#### Anonymized data
We subset the raw data to variables of interest, and substitute random numbers for sensitive strings. This is done by running `02_jira_anonymize.R`. The programs saves both the confidential version and the anonymized version.

```{r anonymize, echo = TRUE, eval = TRUE, warning=FALSE}
source(file.path(programs,"02_jira_anonymize.R"),echo=TRUE)
```

### Publishing data

Some additional cleaning and matching, and then we write out the file

```{r publish, echo = TRUE, eval = FALSE, warning=FALSE}
source(file.path(programs,"10_jira_anon_publish.R"),echo=TRUE)
```

Finally, we push the confidential data to Box, using the following code, which we specifically run manually:

```{bash, echo = TRUE, eval = FALSE, warning=TRUE}
cd programs
R CMD BATCH 99_push_box.R
```

## Describing the Data
```{r read,include=FALSE}
jira <- readRDS(file.path(jiraanon,"jira.anon.RDS"))
```

The anonymized data has `r ncol(jira)` columns.

### Variables

```{r describe, echo=FALSE,cache=FALSE,warning=FALSE,message=FALSE}
description <- read_csv("data/metadata/description_anon.csv")

jira.tmp <- as.data.frame(names(jira),stringsAsFactors = FALSE)
names(jira.tmp) <- c("name")
jira.tmp %>% left_join(description,by=c("name")) %>%
  kable()
```

### Lab members during this period

We list the lab members active at some point during this period. This still requires confidential data as an input.

```{r lab, echo=FALSE,cache=FALSE,warning=FALSE,message=FALSE,eval=FALSE}
source(file.path(programs,"03_lab_members.R"),echo=TRUE)
```

```{r labmembers,echo=FALSE,include=FALSE}
lab.member <- readRDS(file=file.path(jiraanon,"replicationlab_members.Rds"))

```

There were a total of `r nrow(lab.member)` lab members over the course of the 12 month period.

### R session info

```{r sessionInfo}
sessionInfo()
```

## Software Requirements
- R (last run with R `r paste0(R.Version()$major,".",R.Version()$minor)`)
  - package `here` (>=0.1)
- Python
  - module `venv`

Other packages will be installed automatically by the programs, as long as the requirements above are met, see [Session Info](#r-session-info).
R (last run with R r paste0(R.Version()$major, ".", R.Version()$minor))

### R packages
```{r libs, echo=FALSE,message=FALSE,warning=FALSE}

left_join(as.data.frame(global.libraries) %>% select(Package=global.libraries),
          as.data.frame(installed.packages())) %>%
  select(Package,Version) %>% kable()
```

### Python packages
```{r pylibs,echo=FALSE,,warning=FALSE,message=FALSE}

read_csv("requirements.txt",col_names = c("Modules")) %>% kable()
```

### Docker
These requirements are satisfied in the Docker image created by `Dockerfile`, see [description below](#setup---1.-docker)

## Controlled Randomness
- [x] No Pseudo random generator is used in the analysis described here.

## Memory, Runtime, Storage Requirements
The code was last run successfully on GitHub Codespaces on a 2-core machine with 8GB RAM and 32GB storage. Approximate time needed to reproduce the analysis varies depending on how much data is downloaded from the Jira API. Downloading the variables listed above took approximately 5 seconds for each case.

## Description of Programs/Code
- 00_get_fields.py: Marks the to-be-extracted JIRA fields with "True" and outputs file `data/metadata/jira-fields.xlsx`
- 01_download_issues.py: Extracts raw process data from Jira using API
- 02_jira_anonymize.R: Subsets the raw data to variables of interest, and substitute random numbers for sensitive strings
- 03_lab_members.R: Outputs list of lab members active at some point during extracted period
- 10_jira_anon_publish.R: Does final cleaning and matching and writes out the anonymized file
- 99_push_box.R: Uploads extracted data to secure Box folder
- 99_render_README.R: Renders Rmd README file

## Instructions to Replicators
- Clone this repository onto your device or a GitHub Codespace

### Set up Docker
- The `Dockerfile` is used to build the Docker image.
- The image is built with the `build.sh` script, which requires a `TAG` argument, and will otherwise read parameters from the [`.myconfig.sh`](.myconfig.sh) file.

```bash
bash ./build.sh TAG
```
> [NOTE]: If working on BioHPC, remember to replace docker with docker1 in the relevant code.

- Use `ls-tags.sh` to list available tags.

```bash
bash ./ls-tags.sh
```

- To run the image as a Rstudio interactive development image, use

```bash
bash ./start_rstudio.sh TAG
```

```{r getenvs,include=FALSE}
lines <- readLines(file.path(basepath,".myconfig.sh"))

# Find the line that starts with TAG=
tag_line <- lines[grep("^tag=", lines)]

# Extract the value after TAG=
tag_value <- sub("tag=", "", tag_line)

```
- It defaults to the `r tag_value` image if you don't specify a tag.

### Set up JIRA
- Obtain the per-individual API Key
- The API Key is not stored in this repository.
- Go to [https://id.atlassian.com/manage-profile/security/api-tokens](https://id.atlassian.com/manage-profile/security/api-tokens)

![JIRA API token](images/jira-account-apitokens.png)

- Click on "Create API token"
- Enter a label for the token (e.g. "JIRA Extract")
- Copy the token to the clipboard

![JIRA API token](images/jira-account-api-copy.png)

- Use it with the Python scripts in this repository, in one of the following ways:
    - Set the environment variable `JIRA_API_KEY` to the token value
      - On github codespaces this involves creating a Github secret with the exact name `JIRA_API_KEY` and value of the key you get from JIRA
    - Create a file named `.env` in the root directory of this project, and add the following line to it:
      `JIRA_API_KEY=<token value>`
    - Pass the token value to the Python scripts when prompted

### Set up Box
- Location: [https://cornell.app.box.com/folder/143352802492](https://cornell.app.box.com/folder/143352802492)🔒
- We use the subfolder [`jira_exports`](https://cornell.app.box.com/folder/235801403908)🔒
- In order to up- and download, you need not just an API key, but a JSON file with other credentials. This file is called `client_enterprise_id,"_",client_key_id,"_config.json"`, e.g. `81483_bkgnsg4p_config.json`
  - The `client_enterprise_id` is identified in the JSON file itself as well
  - The `client_key_id` is the name of the key in the [Box developer console](https://cornell.app.box.com/developers/console/app/1590771/configuration)🔒
- The JSON file is key
  - It is not stored in this repository, but is stored in the Box folder `InternalData`
  - To use this, the file must be downloaded and stored in the root of the project directory
- Then the `.env` file needs to be appropriately adjusted with the relevant numbers as per below entered:

```dotenv
BOX_FOLDER_ID=12345678890
BOX_PRIVATE_KEY_ID=abcdef4g
BOX_ENTERPRISE_ID=123456
```

- Here:
  - The BOX_FOLDER_ID is the 12 digit number in the URL of the box folder `.../folder/12345678890?...`
  - The BOX_PRIVATE_KEY_ID refers to the `publicKeyID` in the JSON file
  - The BOX_ENTERPRISE_ID is the number at the beginning of the name of the JSON file
- Alternatively, on Github Codespaces, these need to be encoded as secrets.

### Start Docker and Set up Environment

- Run `./start_rstudio.sh` (bash ./start_rstudio.sh from the command line) it should pull the image from the docker and open a port for you to develop in a familiar RStudio environment
  - On GitHub Codespaces you can access this port by clicking on ports and then the little globe icon to open it in a new tab
  - On a local computer, you may need to open a browser at <http://localhost:8787>
  - To obtain a console in the running Docker container, open a second terminal and connect: `

```bash
container_id=$(docker container ls | head -2 | tail -1 | awk ' { print $1 } ')
docker exec -it -u rstudio $container_id /bin/bash`
```

> [NOTE]: The remainder of the instructions assume you are working within the Docker environment. Adjust as necessary if you are only using the code base in your own environment.

- Change to the correct working directory:
  - Rstudio: click on `processing-jira-process-data/processing-jira-process-data.Rproj`
  - Console: cd to the correct directory
- Install any missing Python packages by running `pip install -r requirements.txt`.
- Set up **environment variables**:
  - Ensure an `.env` file is present in the root project directory (or your GitHub Secrets are set)
  - Ensure that the Box `JSON` file as outlined above is present in the root project directory
  - Provide JIRA and BOX information

> Template file

```
JIRA_USERNAME=
JIRA_API_KEY=
BOX_FOLDER_ID=
BOX_PRIVATE_KEY_ID=
BOX_ENTERPRISE_ID=
```

- Define start and end dates:
  - Update the `extractday`, `firstday`, and `lastday` fields in the [`programs/config.R`](programs/config.R) file.
  - You will need to manually provide them to the Python programs (for now)

### Obtain Extract

- `cd programs`
- To obtain extract **run `python3 01_download_issues.py -s ` `r firstday` ` -e ` `r lastday`** with the relevant dates.
  - This will get the fields as specified in `data/metadata/jira-fields.xlsx`.
  - If fields need to be updated (they are keyed on names), edit `programs/00_jira_fields.py` to obtain the full list of fields, open the resulting Excel file (`data/metadata/jira-fields.xlsx`) and  mark the to-be-included fields with "True"
  - Otherwise running `programs/00_jira_fields.py` is not required.

- **Run R programs in numerical order** to create the confidential and anonymized files used for the report.
  - Running with `R CMD BATCH name_of_file.R` will create the necessary log files.
  - This is encapsulated in the `main.sh` file, for convenience:

```bash
cd programs
bash -x ./main.sh
```

- Push the extracted confidential data to Box, using the following code, which we specifically run manually:

```{bash, echo = TRUE, eval = FALSE, warning=TRUE}
cd programs
R CMD BATCH 99_push_box.R
```

- Finally, run `99_render_README.Rout` to update the .Rmd README file and output a .md file and .html file.
  - Manually print the .html file to obtain a PDF.

## Citation
> Vilhuber, Lars. `r YEAR`. "Process data for the AEA Pre-publication Verification Service." *American Economic Association [publisher]*. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], `r Sys.Date()`. [`r ICPSR_DOI`](`r ICPSR_DOI`)

```{r, results='asis', echo=FALSE}
cat("```\n")
cat(paste0("@techreport{10.3886/e117876",VERSION,",\n"))
cat(paste0("  doi = {10.3886/E117876",VERSION,"},\n"))
cat(paste0("  url = {",ICPSR_DOI,"},\n"))
cat("  author = {Vilhuber,  Lars},\n")
cat("  title = {Process data for the AEA Pre-publication Verification Service},\n")
cat("  institution = {American Economic Association [publisher]},\n")
cat("  series = {ICPSR - Interuniversity Consortium for Political and Social Research},\n")
cat(paste0("  year = {",YEAR,"}"))
cat("}\n")
cat("```\n")
```

## References
Vilhuber, Lars. r YEAR. "Process Data for the AEA Pre-publication Verification Service." American Economic Association [publisher]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], r Sys.Date(). [r ICPSR_DOI](r ICPSR_DOI).

## Acknowledgements
This README was adapted from the social-science-data-editors/template_README template.

---