Commit 13d629c

feat: Adding settings, utils, write to S3, and writer close (#67)
* feat: Adding settings, utils, and writer close
* Adding S3 writer integration
* fix bad merge
* Adding unit test for s3 reads
* Adding boto3 dependency
* adding more unit tests
* Making test cross-platform compatible
* Integrated review feedback
* Adding .env and .vscode to gitignore
* Making fsspec[s3] an optional dependency
* Adding unit test
1 parent b6f5f02 commit 13d629c

16 files changed

Lines changed: 693 additions & 123 deletions

.github/workflows/ci.yaml

Lines changed: 78 additions & 1 deletion
@@ -8,6 +8,12 @@ on:
     branches:
       - main

+# These permissions are needed to interact with AWS S3 via GitHub's OIDC Token endpoint
+permissions:
+  id-token: write
+  contents: read
+  pull-requests: read
+
 jobs:
   unit-tests:
     runs-on: ${{ matrix.os }}
@@ -56,7 +62,33 @@ jobs:
         pip install setuptools

     - name: Install cdx_toolkit
-      run: pip install .[test]
+      run: pip install .[all]
+
+    - name: Configure AWS credentials from OIDC (disabled for forks)
+      if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name == 'push'
+      uses: aws-actions/configure-aws-credentials@v4
+      with:
+        role-to-assume: arn:aws:iam::837454214164:role/GitHubActions-Role
+        aws-region: us-east-1
+
+    - name: Disable S3 unit tests for Python 3.8 (botocore requires Python 3.9+)
+      if: ${{ startsWith(matrix.python-version, '3.8') }}
+      uses: actions/github-script@v7
+      with:
+        script: |
+          core.exportVariable('CDXT_DISABLE_S3_TESTS', '1')
+    - name: Set environment variables for faster unit tests (requests are mocked)
+      uses: actions/github-script@v7
+      with:
+        script: |
+          core.exportVariable('CDXT_MAX_ERRORS', '2')
+          core.exportVariable('CDXT_WARNING_AFTER_N_ERRORS', '2')
+          core.exportVariable('CDXT_DEFAULT_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_CC_INDEX_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_CC_DATA_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_IA_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('DISABLE_ATHENA_TESTS', '1')
+          core.exportVariable('LOGLEVEL', 'DEBUG')

     - name: Lint code
       run: |
@@ -70,3 +102,48 @@ jobs:
       uses: codecov/codecov-action@v4
       with:
         token: ${{ secrets.CODECOV_TOKEN }}
+
+  unit-tests-minimal:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: true
+      matrix:
+        include:
+          - python-version: '3.9'
+            os: ubuntu-22.04
+          - python-version: '3.14'
+            os: ubuntu-latest
+
+    steps:
+    - name: checkout
+      uses: actions/checkout@v4
+
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v5
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install setuptools on python 3.12+
+      if: ${{ matrix.python-version >= '3.12' }}
+      run: |
+        pip install setuptools
+
+    - name: Install cdx_toolkit (minimal)
+      run: pip install .[test]
+
+    - name: Set environment variables for faster unit tests (requests are mocked)
+      uses: actions/github-script@v7
+      with:
+        script: |
+          core.exportVariable('CDXT_MAX_ERRORS', '2')
+          core.exportVariable('CDXT_WARNING_AFTER_N_ERRORS', '2')
+          core.exportVariable('CDXT_DEFAULT_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_CC_INDEX_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_CC_DATA_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('CDXT_IA_MIN_RETRY_INTERVAL', '0.01')
+          core.exportVariable('DISABLE_ATHENA_TESTS', '1')
+          core.exportVariable('LOGLEVEL', 'DEBUG')
+
+    - name: test minimal
+      run: |
+        make test

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -4,4 +4,6 @@ __pycache__
 cdx_toolkit.egg-info
 .coverage
 .eggs/
-tmp/
+tmp/
+.env
+.vscode

CONTRIBUTING.md

Lines changed: 15 additions & 3 deletions
@@ -10,6 +10,18 @@ Clone the repository, setup a virtual environment, and run the following command
 make install
 ```

+For S3-related features or tests, install optional dependencies:
+
+```bash
+pip install -e ".[s3]"
+```
+
+To install everything (dev/test/S3), use:
+
+```bash
+pip install -e ".[all]"
+```
+
 ## Tests

 To test code changes, please run our test suite before submitting pull requests:
@@ -33,14 +45,14 @@ If the remote APIs change, new mock data can be semi-automatically collected by
 ```bash
 # set environment variable (DISABLE_MOCK_RESPONSES should not be set)
 export SAVE_MOCK_RESPONSES=./tmp/mock_responses
-
+
 # run the test for what mock data should be saved to $SAVE_MOCK_RESPONSES/<test_file>/<test_func>.jsonl
 pytest tests/test_cli.py::test_basics
 ```

 ## Code format & linting

-Please following the definitions from `.editorconfig` and `.flake8`.
+Please following the definitions from `.editorconfig` and `.flake8`.

 To test the linting, run this command:

@@ -54,4 +66,4 @@ You can also run the hooks manually on all files:

 ```bash
 pre-commit run --all-files
-```
+```

README.md

Lines changed: 8 additions & 1 deletion
@@ -24,6 +24,13 @@ $ pip install cdx_toolkit

 or clone this repo and use `pip install .`

+Optional extras:
+
+```
+$ pip install cdx_toolkit[s3] # enable S3 and other remote filesystem support
+$ pip install cdx_toolkit[all] # install all optional dependencies
+```
+
 ## Command-line tools

 ```
@@ -275,7 +282,7 @@ cdx_toolkit has reached the beta-testing stage of development.

 ## Contributing

-See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributing
 and running tests.

 ## License

cdx_toolkit/cli.py

Lines changed: 19 additions & 54 deletions
@@ -1,12 +1,14 @@
-from argparse import ArgumentParser
+from argparse import ArgumentParser, Namespace
 import logging
 import csv
 import sys
 import json
 import os

 import cdx_toolkit
-from cdx_toolkit.commoncrawl import normalize_crawl
+
+from cdx_toolkit.utils import get_version, setup_cdx_fetcher_and_kwargs
+

 LOGGER = logging.getLogger(__name__)

@@ -135,7 +137,7 @@ def main(args=None):
     cmd.func(cmd, cmdline)


-def set_loglevel(cmd):
+def set_loglevel(cmd: Namespace):
     loglevel = os.getenv('LOGLEVEL') or 'WARNING'
     if cmd.verbose:
         if cmd.verbose > 0:
@@ -151,58 +153,15 @@ def set_loglevel(cmd):
     LOGGER.info('set loglevel to %s', str(loglevel))


-def get_version():
-    return cdx_toolkit.__version__
-
-
-def setup(cmd):
-    kwargs = {}
-    kwargs['source'] = 'cc' if cmd.crawl else cmd.cc or cmd.ia or cmd.source or None
-    if kwargs['source'] is None:
-        raise ValueError('must specify --cc, --ia, or a --source')
-    if cmd.wb:
-        kwargs['wb'] = cmd.wb
-    if cmd.cc_mirror:
-        kwargs['cc_mirror'] = cmd.cc_mirror
-    if cmd.crawl:
-        kwargs['crawl'] = normalize_crawl([cmd.crawl])  # currently a string, not a list
-    if getattr(cmd, 'warc_download_prefix', None) is not None:
-        kwargs['warc_download_prefix'] = cmd.warc_download_prefix
-
-    cdx = cdx_toolkit.CDXFetcher(**kwargs)
-
-    kwargs = {}
-    if cmd.limit:
-        kwargs['limit'] = cmd.limit
-    if 'from' in vars(cmd) and vars(cmd)['from']:  # python, uh, from is a reserved word
-        kwargs['from_ts'] = vars(cmd)['from']
-    if cmd.to:
-        kwargs['to'] = cmd.to
-    if cmd.closest:
-        if not cmd.get:  # pragma: no cover
-            LOGGER.info('note: --closest works best with --get')
-        kwargs['closest'] = cmd.closest
-    if cmd.filter:
-        kwargs['filter'] = cmd.filter
-
-    if cmd.cmd == 'warc' and cmd.size:
-        kwargs['size'] = cmd.size
-
-    if cmd.cmd == 'size' and cmd.details:
-        kwargs['details'] = cmd.details
-
-    return cdx, kwargs
-
-
-def winnow_fields(cmd, fields, obj):
+def winnow_fields(cmd: Namespace, fields, obj):
     if cmd.all_fields:
         printme = obj
     else:
         printme = dict([(k, obj[k]) for k in fields if k in obj])
     return printme


-def print_line(cmd, writer, printme):
+def print_line(cmd: Namespace, writer, printme):
     if cmd.jsonl:
         print(json.dumps(printme, sort_keys=True))
     elif writer:
@@ -211,8 +170,8 @@ def print_line(cmd, writer, printme):
         print(', '.join([' '.join((k, printme[k])) for k in sorted(printme.keys())]))


-def iterator(cmd, cmdline):
-    cdx, kwargs = setup(cmd)
+def iterator(cmd: Namespace, cmdline):
+    cdx, kwargs = setup_cdx_fetcher_and_kwargs(cmd)
     fields = set(cmd.fields.split(','))
     if cmd.csv:
         writer = csv.DictWriter(sys.stdout, fieldnames=sorted(list(fields)))
@@ -232,8 +191,8 @@ def iterator(cmd, cmdline):
         print_line(cmd, writer, printme)


-def warcer(cmd, cmdline):
-    cdx, kwargs = setup(cmd)
+def warcer(cmd: Namespace, cmdline: str):
+    cdx, kwargs = setup_cdx_fetcher_and_kwargs(cmd)

     ispartof = cmd.prefix
     if cmd.subprefix:
@@ -275,9 +234,15 @@ def warcer(cmd, cmdline):
             LOGGER.warning('revisit record being resolved for url %s %s', url, timestamp)
         writer.write_record(record)

+    writer.close()

-def sizer(cmd, cmdline):
-    cdx, kwargs = setup(cmd)
+
+def sizer(cmd: Namespace, cmdline):
+    cdx, kwargs = setup_cdx_fetcher_and_kwargs(cmd)

     size = cdx.get_size_estimate(cmd.url, **kwargs)
     print(size)
+
+
+if __name__ == "__main__":
+    main()
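
In the `warcer` hunk, `writer.close()` runs only when the record loop finishes without raising. A minimal sketch of one way to make the close unconditional, using `contextlib.closing` from the standard library. `DummyWriter` and `write_all` are hypothetical stand-ins for illustration and are not part of cdx_toolkit:

```python
from contextlib import closing


class DummyWriter:
    """Hypothetical stand-in for the WARC writer; models only the two methods used here."""

    def __init__(self):
        self.records = []
        self.closed = False

    def write_record(self, record):
        self.records.append(record)

    def close(self):
        self.closed = True


def write_all(writer, records):
    # closing() calls writer.close() when the block exits,
    # including when write_record raises an exception.
    with closing(writer):
        for record in records:
            writer.write_record(record)
```

Whether this is worth it depends on what `close()` does in the real writer; for a writer that uploads to S3 on close, finalizing a partial file after a failure may or may not be desirable.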

cdx_toolkit/commoncrawl.py

Lines changed: 10 additions & 4 deletions
@@ -9,6 +9,8 @@
 import json
 import logging

+from cdx_toolkit.settings import CACHE_DIR, get_mock_time
+
 from .myrequests import myrequests_get
 from .timeutils import (
     time_to_timestamp,
@@ -34,7 +36,7 @@ def normalize_crawl(crawl):


 def get_cache_names(cc_mirror):
-    cache = os.path.expanduser('~/.cache/cdx_toolkit/')
+    cache = os.path.expanduser(CACHE_DIR)
     filename = re.sub(r'[^\w]', '_', cc_mirror.replace('https://', ''))
     return cache, filename

@@ -128,9 +130,13 @@ def apply_cc_defaults(params, crawl_present=False, now=None):
             LOGGER.info('to but no from_ts, setting from_ts=%s', params['from_ts'])
         else:
             if not now:
-                # now is passed in by tests. if not set, use actual now.
-                # XXX could be changed to mock
-                now = time.time()
+                # Check for test/override time first
+                mock_time = get_mock_time()
+                if mock_time:
+                    now = mock_time
+                else:
+                    # now is passed in by tests. if not set, use actual now.
+                    now = time.time()
             params['from_ts'] = time_to_timestamp(now - year)
             LOGGER.info('no from or to, setting default 1 year ago from_ts=%s', params['from_ts'])
     else:
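
`get_mock_time` is imported from `cdx_toolkit/settings.py`, which is not shown in this excerpt. The sketch below shows how such a helper might behave, together with the precedence the hunk applies (explicit `now` argument, then the override, then the wall clock). The environment variable name `CDXT_MOCK_TIME` and the `resolve_now` helper are assumptions made for illustration; the real module may read the value differently:

```python
import os
import time
from typing import Optional


def get_mock_time(env_var: str = "CDXT_MOCK_TIME") -> Optional[float]:
    """Return an override time as a Unix timestamp, or None if unset.

    The variable name is hypothetical; the real settings module may differ.
    """
    raw = os.environ.get(env_var)
    if not raw:
        return None
    return float(raw)


def resolve_now(now: Optional[float] = None) -> float:
    """Apply the same precedence as the hunk: argument, override, wall clock."""
    if now:
        return now
    mock_time = get_mock_time()
    if mock_time:
        return mock_time
    return time.time()
```

As in the diff, the truthiness checks mean a value of `0` is treated as "not set" in both places.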

cdx_toolkit/myrequests.py

Lines changed: 21 additions & 6 deletions
@@ -1,9 +1,18 @@
+from typing import Optional
 import requests
 import logging
 import time
 from urllib.parse import urlparse

 from . import __version__
+from .settings import (
+    DEFAULT_MIN_RETRY_INTERVAL,
+    CC_DATA_MIN_RETRY_INTERVAL,
+    CC_INDEX_MIN_RETRY_INTERVAL,
+    IA_MIN_RETRY_INTERVAL,
+    MAX_ERRORS,
+    WARNING_AFTER_N_ERRORS,
+)

 LOGGER = logging.getLogger(__name__)

@@ -23,19 +32,19 @@ def dns_fatal(hostname):
 retry_info = {
     'default': {
         'next_fetch': 0,
-        'minimum_interval': 3.0,
+        'minimum_interval': DEFAULT_MIN_RETRY_INTERVAL,
     },
     'index.commoncrawl.org': {
         'next_fetch': 0,
-        'minimum_interval': 1.0,
+        'minimum_interval': CC_INDEX_MIN_RETRY_INTERVAL,
     },
     'data.commoncrawl.org': {
         'next_fetch': 0,
-        'minimum_interval': 0.55,
+        'minimum_interval': CC_DATA_MIN_RETRY_INTERVAL,
     },
     'web.archive.org': {
         'next_fetch': 0,
-        'minimum_interval': 6.0,
+        'minimum_interval': IA_MIN_RETRY_INTERVAL,
     },
 }

@@ -60,12 +69,18 @@ def myrequests_get(
     headers=None,
     cdx=False,
     allow404=False,
-    raise_error_after_n_errors: int = 100,
-    raise_warning_after_n_errors: int = 10,
+    raise_error_after_n_errors: Optional[int] = None,
+    raise_warning_after_n_errors: Optional[int] = None,
     retry_max_sec: int = 60,
 ):
     t = time.time()

+    if raise_error_after_n_errors is None:
+        raise_error_after_n_errors = MAX_ERRORS
+
+    if raise_warning_after_n_errors is None:
+        raise_warning_after_n_errors = WARNING_AFTER_N_ERRORS
+
     hostname = urlparse(url).hostname
     next_fetch, minimum_interval = get_retries(hostname)
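
The constants imported from `.settings` are defined in a file that is not part of this excerpt. As a sketch only, here is one way such a module could resolve them. It assumes that the `CDXT_*` variables exported in the CI workflow map onto these constants, and it uses the literals removed in this diff (3.0, 1.0, 0.55, 6.0, 100, 10) as defaults. `load_settings` and `resolve_error_limits` are hypothetical names:

```python
import os
from typing import Dict, Mapping, Optional, Tuple, Union

Number = Union[int, float]


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Number]:
    """Build the settings from an environment mapping (assumed variable names)."""
    return {
        "DEFAULT_MIN_RETRY_INTERVAL": float(env.get("CDXT_DEFAULT_MIN_RETRY_INTERVAL", 3.0)),
        "CC_INDEX_MIN_RETRY_INTERVAL": float(env.get("CDXT_CC_INDEX_MIN_RETRY_INTERVAL", 1.0)),
        "CC_DATA_MIN_RETRY_INTERVAL": float(env.get("CDXT_CC_DATA_MIN_RETRY_INTERVAL", 0.55)),
        "IA_MIN_RETRY_INTERVAL": float(env.get("CDXT_IA_MIN_RETRY_INTERVAL", 6.0)),
        "MAX_ERRORS": int(env.get("CDXT_MAX_ERRORS", 100)),
        "WARNING_AFTER_N_ERRORS": int(env.get("CDXT_WARNING_AFTER_N_ERRORS", 10)),
    }


def resolve_error_limits(
    settings: Mapping[str, Number],
    raise_error_after_n_errors: Optional[int] = None,
    raise_warning_after_n_errors: Optional[int] = None,
) -> Tuple[int, int]:
    """Mirror the None-sentinel pattern in myrequests_get: None means 'use the setting'."""
    if raise_error_after_n_errors is None:
        raise_error_after_n_errors = int(settings["MAX_ERRORS"])
    if raise_warning_after_n_errors is None:
        raise_warning_after_n_errors = int(settings["WARNING_AFTER_N_ERRORS"])
    return raise_error_after_n_errors, raise_warning_after_n_errors
```

Using `None` as the default rather than the constant itself keeps an explicit `0` from a caller distinguishable from "not specified", which a plain truthiness check would not.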
