Commit 837a731

Merge pull request #73 from AEADataEditor/feature/zenodo-unified-download
feat: unified Zenodo download — Python public script, request URL support, orchestrator
2 parents 41564e7 + fc3bce9 commit 837a731

11 files changed

Lines changed: 2021 additions & 99 deletions

bitbucket-pipelines.yml

Lines changed: 21 additions & 9 deletions
@@ -39,6 +39,8 @@ pipelines:
 - name: openICPSRID
 - name: jiraticket
 - name: ZenodoID
+# Accepts: numeric ID, full URL, DOI, or community request URL.
+# Leave blank if jiraticket is set (orchestrator will query Jira).
 #- name: DataverseID
 #- name: OSFID
 - name: ProcessStata
@@ -87,19 +89,19 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
-- projectID="${projectID:-zenodo-$ZenodoID}"
 - if [ -z "$jiraticket" ] && [ -n "${openICPSRID:-}" ]; then jiraticket=$(python3 tools/jira_find_task_by_icpsr.py "$openICPSRID" 2>/dev/null || true); else echo "Jira ticket not set"; fi
 - export jiraticket
 - echo "Using Jira case $jiraticket"
 - ./tools/update_config.sh
 - ./automations/70_publish_comment.sh 1-populate-from-icpsr started
-- if [ -d $projectID ]; then \rm -rf $projectID; fi
+- if [ -d "${projectID:-__none__}" ]; then \rm -rf $projectID; fi
 - if [ ! -z $openICPSRID ]; then python3 tools/download_openicpsr-private.py $openICPSRID; fi
-- if [ ! -z $ZenodoID ]; then python3 tools/download_zenodo_draft.py $ZenodoID; fi
+- if [ ! -z "$ZenodoID" ] || [ ! -z "$jiraticket" ]; then zenodo_dir=$(python3 tools/download_zenodo.py ${ZenodoID:+--zenodo-id "$ZenodoID"} ${jiraticket:+--jira-ticket "$jiraticket"} --print-id 2>&1 | tail -1); fi
+- if [ -z "$projectID" ] && [ ! -z "${zenodo_dir:-}" ]; then projectID="$zenodo_dir"; fi
 - ./automations/00_unpack_zip.sh $projectID
 - mkdir cache
 - if [ ! -z $openICPSRID ]; then mv *.zip cache/; fi
-- if [ ! -z $ZenodoID ]; then zip -rp cache/${projectID}.zip $projectID/* ; fi
+- if [ ! -z "${zenodo_dir:-}" ]; then zip -rp cache/${projectID}.zip $projectID/* ; fi
 - ./automations/00_prepare_aux.sh
 - ./automations/01_check_file_sizes.sh $projectID
 - ./automations/02_list_data_files.sh $projectID
@@ -119,6 +121,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - if [ -f cache/$projectID.zip ]; then echo "✅ Found $projectID.zip in cache"; else echo "⚠️ Did not find $projectID.zip in cache. 🛑 You may need to use the BIG pipeline!!! "; exit 2; fi
 - parallel: # we will run these in parallel
@@ -134,6 +137,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh $ProcessStata $projectID
@@ -153,6 +157,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh $ProcessPii $projectID
@@ -169,6 +174,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh $ProcessR $projectID
@@ -184,6 +190,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh $ProcessPython $projectID
@@ -200,6 +207,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh $ProcessJulia $projectID
@@ -219,6 +227,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - chmod a+rx ./automations/*.sh
 - ./automations/00_preliminaries.sh yes $projectID
@@ -247,6 +256,7 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; else echo "openICPSRID not set"; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; else echo "ZenodoID not set"; fi
 - projectID="${openICPSRID}"
+- if [ ! -z "$ZenodoID" ] && echo "$ZenodoID" | grep -q '/'; then ZenodoID=$(echo "$ZenodoID" | python3 -c "import sys,re; s=sys.stdin.read().strip().rstrip('/'); m=re.search(r'zenodo\.org/(?:records?|deposit)/(\d+)', s) or re.search(r'zenodo\.(\d+)', s); print(m.group(1) if m else s)"); fi
 - projectID="${projectID:-zenodo-$ZenodoID}"
 - if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
 - if [ -z "$jiraticket" ] && [ -n "${openICPSRID:-}" ]; then jiraticket=$(python3 tools/jira_find_task_by_icpsr.py "$openICPSRID" 2>/dev/null || true); else echo "Jira ticket not set"; fi
@@ -450,8 +460,10 @@ pipelines:
 w-big-populate-from-icpsr: #name of this pipeline
 - variables: #list variable names under here
 # These do not need to have a value, if "config.yml" is filled out.
-- name: openICPSRID
-- name: ZenodoID
+- name: openICPSRID
+- name: ZenodoID
+# Accepts: numeric ID, full URL, DOI, or community request URL.
+# Leave blank if jiraticket is set (orchestrator will query Jira).
 - name: jiraticket
 - step:
 name: Download and commit
@@ -467,14 +479,14 @@ pipelines:
 - if [ -z $openICPSRID ]; then openICPSRID=$openicpsr; fi
 - if [ -z $ZenodoID ]; then ZenodoID=$zenodo; fi
 - projectID="${openICPSRID}"
-- projectID="${projectID:-zenodo-$ZenodoID}"
 - if [ -z "$jiraticket" ] && [ -n "${openICPSRID:-}" ]; then jiraticket=$(python3 tools/jira_find_task_by_icpsr.py "$openICPSRID" 2>/dev/null || true); fi
 - export jiraticket
 - ./tools/update_config.sh
 - ./automations/70_publish_comment.sh w-big-populate-from-icpsr started
-- if [ -d $projectID ]; then \rm -rf $projectID; fi
+- if [ -d "${projectID:-__none__}" ]; then \rm -rf $projectID; fi
 - if [ ! -z $openICPSRID ]; then python3 tools/download_openicpsr-private.py $openICPSRID; fi
-- if [ ! -z $ZenodoID ]; then python3 tools/download_zenodo_draft.py $ZenodoID; fi
+- if [ ! -z "$ZenodoID" ] || [ ! -z "$jiraticket" ]; then zenodo_dir=$(python3 tools/download_zenodo.py ${ZenodoID:+--zenodo-id "$ZenodoID"} ${jiraticket:+--jira-ticket "$jiraticket"} --print-id 2>&1 | tail -1); fi
+- if [ -z "$projectID" ] && [ ! -z "${zenodo_dir:-}" ]; then projectID="$zenodo_dir"; fi
 - chmod a+rx ./automations/*.sh
 - ./automations/00_prepare_aux.sh
 - ./automations/00_unpack_zip.sh $projectID
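
Reviewer note: the inline `python3 -c` one-liner added to the later pipeline steps in `bitbucket-pipelines.yml` is hard to read in YAML. The sketch below restates the same two regular expressions as a standalone function. The function name `extract_zenodo_id` is hypothetical and does not appear in the commit; the regexes are copied from the diff.

```python
import re


def extract_zenodo_id(value: str) -> str:
    """Return the numeric record ID found in `value`, or `value` unchanged.

    Hypothetical helper: mirrors the inline one-liner from the pipeline diff.
    """
    s = value.strip().rstrip("/")
    # First try record/deposit URLs, then fall back to DOI-style "zenodo.NNN".
    match = (
        re.search(r"zenodo\.org/(?:records?|deposit)/(\d+)", s)
        or re.search(r"zenodo\.(\d+)", s)
    )
    return match.group(1) if match else s


if __name__ == "__main__":
    for example in (
        "https://zenodo.org/records/10848594",
        "https://doi.org/10.5281/zenodo.10848594",
        "https://zenodo.org/communities/aeajournals/requests/"
        "61cff0cb-b3ca-48aa-bfe6-5b17dc8eb665",
    ):
        print(example, "->", extract_zenodo_id(example))
```

Neither regex matches a community request URL, so that form passes through unchanged. In the steps that only apply this normalisation, the `projectID` fallback `zenodo-$ZenodoID` would then presumably contain the full URL rather than a record ID, which may be worth checking.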

docs/96-90-download_zenodo.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+(help-download_zenodo)=
+# download_zenodo.py — Zenodo download orchestrator
+
+::::{warning}
+
+This documentation was AI-generated by Claude Code and should be reviewed for accuracy. Please report any errors or inconsistencies.
+
+::::
+
+## Description
+
+Single entry point for all Zenodo downloads. Parses any Zenodo URL, DOI,
+or record ID, determines whether the target is a public record or a private
+draft/community request, and delegates to the appropriate script.
+
+When no `--zenodo-id` is given, the orchestrator queries the Jira ticket for
+the "Replication package URL".
+
+## Usage
+
+```bash
+# Explicit ID or URL
+python3.12 tools/download_zenodo.py --zenodo-id 10848594
+python3.12 tools/download_zenodo.py --zenodo-id https://zenodo.org/records/10848594
+python3.12 tools/download_zenodo.py --zenodo-id https://zenodo.org/communities/aeajournals/requests/61cff0cb-b3ca-48aa-bfe6-5b17dc8eb665
+
+# From Jira ticket
+python3.12 tools/download_zenodo.py --jira-ticket AEAREP-8983
+
+# In a pipeline (capture directory name)
+zenodo_dir=$(python3.12 tools/download_zenodo.py --zenodo-id "$ZenodoID" --print-id 2>&1 | tail -1)
+```
+
+## Options
+
+| Option | Description |
+|--------|-------------|
+| `--zenodo-id URL_OR_ID` | Zenodo record ID, URL, DOI, or community request URL. Skips Jira lookup. |
+| `--jira-ticket KEY` | Jira issue key; used when `--zenodo-id` is absent |
+| `--print-id` | Print `zenodo-NNNNN` to stdout (last line) for pipeline capture |
+| `--dry-run` | Pass through to the selected download script |
+| `--sandbox` | Use `sandbox.zenodo.org` |
+
+## URL Routing
+
+| URL pattern | Script called |
+|-------------|--------------|
+| `/records/NNNNN`, `/record/NNNNN`, `10.5281/zenodo.NNNNN`, bare ID | `download_zenodo_public.py` |
+| `/deposit/NNNNN` | `download_zenodo_draft.py` |
+| `/communities/.../requests/{uuid}` | `download_zenodo_draft.py` (resolves UUID → record ID via API) |
+
+## Environment Variables
+
+| Variable | Purpose |
+|----------|---------|
+| `JIRA_USERNAME`, `JIRA_API_KEY` | Required when `--jira-ticket` is used |
+| `ZENODO_ACCESS_TOKEN` | Required for draft/private downloads |
+| `CI` | Auto-commit behaviour in pipelines |
+
+## Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| 0 | Success |
+| 1 | Error |
+| 2 | Replication URL from Jira is not a Zenodo URL |
+
+## See Also
+
+- `tools/download_zenodo_public.py`
+- `tools/download_zenodo_draft.py`
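
Reviewer note: the "URL Routing" table in the new `docs/96-90-download_zenodo.md` can be read as a small classification function. The sketch below is one possible reading of that table, not the code in `tools/download_zenodo.py`, which is not shown in this view. The function name `route` and the exact regular expressions are assumptions, including the guess that a request UUID is written as hex digits and hyphens.

```python
import re

PUBLIC = "download_zenodo_public.py"
DRAFT = "download_zenodo_draft.py"


def route(target: str) -> str:
    """Pick the download script for a Zenodo ID, URL, or DOI (hypothetical)."""
    s = target.strip().rstrip("/")
    # Community request URLs and /deposit/ URLs point at drafts.
    if re.search(r"/communities/[^/]+/requests/[0-9a-fA-F-]+$", s):
        return DRAFT
    if re.search(r"/deposit/\d+", s):
        return DRAFT
    # Bare IDs, /record(s)/ URLs, and Zenodo DOIs are treated as public.
    if (
        re.fullmatch(r"\d+", s)
        or re.search(r"/records?/\d+", s)
        or re.search(r"10\.5281/zenodo\.\d+", s)
    ):
        return PUBLIC
    raise ValueError(f"not a recognised Zenodo target: {target!r}")


if __name__ == "__main__":
    for example in ("10848594", "https://zenodo.org/deposit/10848594"):
        print(example, "->", route(example))
```

Draft patterns are checked first so that a URL can never be classified as public by a looser match. The documented exit code 2 for non-Zenodo Jira URLs is represented here only by the `ValueError`.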

docs/96-90-download_zenodo_draft.md

Lines changed: 18 additions & 1 deletion
@@ -107,4 +107,21 @@ You need a Zenodo access token to access draft deposits:
 - Reports detailed error messages for API issues
 - Validates checksums for downloaded files
 
-This tool is essential for working with unpublished Zenodo deposits in research workflows that require access to draft materials.
+This tool is essential for working with unpublished Zenodo deposits in research workflows that require access to draft materials.
+
+## Community Request URLs
+
+Draft deposits under community review can be addressed using the request URL:
+
+```text
+https://zenodo.org/communities/<community>/requests/<uuid>
+```
+
+The script calls `GET /api/requests/{uuid}` to resolve the deposit record ID,
+then proceeds with the normal draft download. An access token is required.
+
+```bash
+python3.12 tools/download_zenodo_draft.py \
+https://zenodo.org/communities/aeajournals/requests/61cff0cb-b3ca-48aa-bfe6-5b17dc8eb665 \
+--access-token $ZENODO_ACCESS_TOKEN
+```
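
Reviewer note: the "Community Request URLs" section says the script calls `GET /api/requests/{uuid}`. The sketch below shows only the first half of that step, deriving the API endpoint from a request URL. The function name `request_api_url` and the 36-character UUID pattern are assumptions. The shape of the API response and the way the access token is sent are not shown in this diff, so neither is modelled here.

```python
import re
from urllib.parse import urlsplit

# Assumed path shape, taken from the documented URL form:
# /communities/<community>/requests/<uuid>
REQUEST_PATH = re.compile(
    r"^/communities/[^/]+/requests/(?P<uuid>[0-9a-fA-F-]{36})/?$"
)


def request_api_url(request_url: str) -> str:
    """Map a community request URL to its /api/requests/{uuid} endpoint."""
    parts = urlsplit(request_url)
    match = REQUEST_PATH.match(parts.path)
    if match is None:
        raise ValueError(f"not a community request URL: {request_url!r}")
    # Reuse scheme and host so sandbox.zenodo.org URLs stay on the sandbox.
    return f"{parts.scheme}://{parts.netloc}/api/requests/{match.group('uuid')}"


if __name__ == "__main__":
    print(
        request_api_url(
            "https://zenodo.org/communities/aeajournals/requests/"
            "61cff0cb-b3ca-48aa-bfe6-5b17dc8eb665"
        )
    )
```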
Lines changed: 35 additions & 73 deletions
@@ -1,5 +1,6 @@
 (help-download_zenodo_public)=
-# download_zenodo_public.sh - Download files from public Zenodo repositories
+
+# download_zenodo_public.py — Download files from public Zenodo records
 
 ::::{warning}
 
@@ -9,93 +10,54 @@ This documentation was AI-generated by Claude Code and should be reviewed for ac
 
 ## Description
 
-This script downloads all files from a public Zenodo record using the zenodo_get command-line tool. It's designed for replication workflows where researchers need to download published datasets, code, and supplementary materials from Zenodo repositories for analysis and verification.
+Pure-Python script that downloads all files from a published Zenodo record,
+then writes SHA-256, MD5, and metadata manifests to `generated/` using the
+same format as `download_zenodo_draft.py`.
+
+The legacy shell wrapper `download_zenodo_public.sh` is retained for
+backwards compatibility but is deprecated; use the Python script instead.
 
 ## Usage
 
 ```bash
-./download_zenodo_public.sh <RECORD_ID>
-bash tools/download_zenodo_public.sh <RECORD_ID>
+python3.12 tools/download_zenodo_public.py RECORD_ID_OR_URL
+python3.12 tools/download_zenodo_public.py --dry-run RECORD_ID_OR_URL
 ```
 
 ## Arguments
 
-- **RECORD_ID** - Zenodo record identifier, can be:
-- Numeric record ID (e.g., "1234567")
-- Full Zenodo URL (e.g., "https://zenodo.org/record/1234567")
-- Zenodo DOI (e.g., "10.5281/zenodo.1234567")
+- **RECORD_ID_OR_URL** — Zenodo identifier in any of these forms:
+- Numeric ID: `12345678`
+- Record URL: `https://zenodo.org/records/12345678`
+- Legacy URL: `https://zenodo.org/record/12345678`
+- DOI string: `10.5281/zenodo.12345678`
+- DOI URL: `https://doi.org/10.5281/zenodo.12345678`
 
-## Examples
+## Options
 
-```bash
-# Using Zenodo record ID
-./download_zenodo_public.sh 1234567
+| Option | Description |
+| -------------- | --------------------------------- |
+| `--output DIR` | Parent directory (default: `.`) |
+| `--dry-run` | List files without downloading |
+| `--sandbox` | Use `sandbox.zenodo.org` |
 
-# Using full Zenodo URL (script extracts ID automatically)
-./download_zenodo_public.sh https://zenodo.org/record/1234567
+## Output
 
-# Using Zenodo DOI (script extracts ID automatically)
-./download_zenodo_public.sh 10.5281/zenodo.1234567
 ```
-
-## Requirements
-
-- **zenodo_get**: Zenodo command-line client (`pip install zenodo_get`)
-- Internet connection to access Zenodo API
-- Read/write permissions in current directory
-
-## Features
-
-- Flexible input parsing (extracts record ID from URLs and DOIs)
-- Creates organized directory structure: `zenodo-[RECORD_ID]`
-- Downloads all files from the specified Zenodo record
-- Prevents overwriting existing downloads
-- Simple error handling and validation
-
-## Behavior
-
-- Parses input to extract Zenodo record ID
-- Creates target directory named "zenodo-[RECORD_ID]"
-- Checks if directory already exists (prevents accidental overwrites)
-- Downloads all files using zenodo_get tool
-- Maintains original file names and organization
-
-## Output Structure
-
-```
-Input: 1234567 (or https://zenodo.org/record/1234567)
-Output directory: ./zenodo-1234567/
-Contents: All files from the Zenodo record
+zenodo-12345678/ ← downloaded files
+generated/
+manifest.zenodo-12345678.YYYY-MM-DD.sha256
+manifest.zenodo-12345678.YYYY-MM-DD.md5
+metadata.zenodo-12345678.txt
 ```
 
-## Error Handling
-
-- Validates command-line arguments (requires exactly one argument)
-- Checks for existing output directory
-- Reports download failures from zenodo_get
-- Exits with error code 2 on validation failures
-
-## Dependencies
-
-### zenodo_get installation:
-```bash
-pip install zenodo_get
-# or
-pip install -r requirements.txt # if included in project requirements
-```
-
-## Zenodo API
-
-- Uses zenodo_get which interfaces with Zenodo's REST API
-- Works with public records (no authentication required)
-- Supports both published and pre-published public records
+## Environment Variables
 
-## How It Works
+| Variable | Purpose |
+| -------- | ---------------------------------------------- |
+| `CI` | Suppresses progress; auto-commits with `[skip ci]` |
 
-1. **Input Parsing**: Extracts numeric record ID from various input formats
-2. **Directory Creation**: Creates organized output directory
-3. **Validation**: Checks for existing downloads to prevent overwrites
-4. **Download**: Uses zenodo_get to download all files from the record
-5. **Organization**: Maintains original file structure and names
+## See Also
 
-This tool is essential for reproducible research workflows that rely on datasets and code hosted in Zenodo repositories.
+- `tools/download_zenodo_draft.py` — for draft or community-review deposits
+- `tools/download_zenodo.py` — orchestrator (recommended entry point)
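
Reviewer note: the new description says the script writes SHA-256 and MD5 manifests to `generated/`, but the diff does not show the manifest format. The sketch below illustrates one common layout, the `<digest>  <relative path>` lines produced by `sha256sum` and `md5sum`. That layout, and the helper names `file_digest` and `manifest_lines`, are assumptions for illustration only and may differ from what `download_zenodo_public.py` actually writes.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in chunks so large downloads are not read into memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(root: Path, algorithm: str) -> list[str]:
    """Return one '<digest>  <relative path>' line per file under `root`.

    Assumed format (sha256sum/md5sum style); not confirmed by the diff.
    """
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [
        f"{file_digest(p, algorithm)}  {p.relative_to(root).as_posix()}"
        for p in files
    ]


if __name__ == "__main__":
    # Hypothetical directory name following the documented zenodo-NNNNN form.
    for line in manifest_lines(Path("zenodo-12345678"), "sha256"):
        print(line)
```

Sorting the file list keeps the manifest stable between runs, which makes it easier to compare manifests from different dates.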
