feat: more robust OA resolution (all OA locations + arXiv DOIs) by enieuwy · Pull Request #8 · Rimagination/instsci

enieuwy · 2026-06-29T14:28:59Z

Two independent OA-resolution gaps surfaced while fetching real papers (both made instsci miss content it could have retrieved).

1. Fall back across all OA locations

unpaywall.check_oa() collapsed every OA location into a single pdf_url (the best/publisher copy). When that link returns 403/Forbidden — common on publisher PDFs behind bot detection — the available repository copy in oa_locations was never tried, and the fetch escalated to institutional access unnecessarily.

OAResult now carries pdf_urls: an ordered, de-duplicated list (best location first, then every other OA location).
_try_open_access iterates the candidates, so a forbidden publisher link falls back to a repository copy.
pdf_url is retained (first candidate) for backward compatibility.
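As a rough illustration of the ordering and de-duplication described above, here is a minimal sketch. It is not the code in this PR: the two-argument signature is an assumption, and the dictionaries follow the shape of Unpaywall's `best_oa_location` / `oa_locations` entries with their `url_for_pdf` field.

```python
from __future__ import annotations

from typing import Any, Optional


def collect_pdf_urls(
    best: Optional[dict[str, Any]],
    locations: Optional[list[dict[str, Any]]],
) -> list[str]:
    """Return PDF URLs with the best location first and no duplicates.

    Hypothetical signature, for illustration only.
    """
    ordered: list[str] = []
    seen: set[str] = set()
    # Put the best location ahead of the remaining OA locations.
    candidates = ([best] if best else []) + list(locations or [])
    for loc in candidates:
        # url_for_pdf may be missing or None; strip stray whitespace.
        url = (loc.get("url_for_pdf") or "").strip()
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```

With this shape, the backward-compatible `pdf_url` would simply be the first element of the list when it is non-empty.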

Real example: 10.1080/10447318.2023.2301250 — the publisher PDF 403s, but Unpaywall also lists a repository copy that was previously ignored.
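The fallback behaviour in a case like this can be sketched as a loop over the candidate URLs. The helper name `first_downloadable` and the injected `download` callable are hypothetical and stand in for whatever `_try_open_access` does internally. The sketch assumes a failed download either raises or returns an empty result.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def first_downloadable(
    pdf_urls: Iterable[str],
    download: Callable[[str], Optional[bytes]],
) -> Optional[tuple[str, bytes]]:
    """Try each candidate in order and return the first that yields content."""
    for url in pdf_urls:
        try:
            data = download(url)
        except Exception:
            # e.g. a 403 from the publisher: move on to the next candidate.
            continue
        if data:
            return url, data
    # Every candidate failed; the caller can now escalate.
    return None
```

Escalating to institutional access only after the loop returns `None` matches the intent described above: a forbidden publisher link no longer hides a usable repository copy.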

2. Route arXiv DataCite DOIs directly to arXiv

arXiv registers DOIs of the form 10.48550/arXiv.<id>. Unpaywall does not index these, so check_oa returned "not OA" and the paper fell through to institutional access and was never downloaded, even though the arXiv source handles it directly.

fetch now detects 10.48550/arXiv.* DOIs and routes them straight to the arXiv source.

Real examples: 10.48550/arXiv.2504.06435, 10.48550/arXiv.2602.11522 — both arXiv-only papers that previously failed.
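The detection step might look like the following sketch. The function name `arxiv_id_from_doi` and the regular expression are assumptions for illustration, not the identifiers used in this PR. The case-insensitive prefix match and the `None` return for other DOIs follow the behaviour listed under Testing.

```python
import re
from typing import Optional

# DataCite DOIs minted by arXiv: 10.48550/arXiv.<id>
_ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(?P<id>.+)$", re.IGNORECASE)


def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Return the arXiv identifier embedded in an arXiv DOI, else None."""
    match = _ARXIV_DOI.match(doi.strip())
    return match.group("id") if match else None


print(arxiv_id_from_doi("10.48550/arXiv.2504.06435"))     # 2504.06435
print(arxiv_id_from_doi("10.1080/10447318.2023.2301250"))  # None
```

A non-`None` result would let fetch skip Unpaywall and hand the identifier straight to the arXiv source.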

Testing

tests/test_unpaywall_oa_locations.py — collect_pdf_urls ordering, dedup, whitespace, no-best, empty.
tests/test_arxiv_doi_routing.py — arXiv-DOI detection (incl. case-insensitive prefix) and non-arXiv DOIs returning None.
Full existing suite still green (37 fetcher-adjacent tests pass; no regressions).

Based on main. Independent of the in-flight codex/instsci-user-install-doctor branch.

Two independent gaps surfaced while fetching real papers:

1. OA fallback was single-shot. check_oa() collapsed every OA location to one pdf_url (the best/publisher copy). When that link 403'd, a perfectly good repository copy in oa_locations was never tried. check_oa() now exposes pdf_urls (ordered, de-duplicated, best first) and _try_open_access iterates them, so a forbidden publisher link falls back to a repository copy instead of escalating prematurely.

2. arXiv DataCite DOIs (10.48550/arXiv.*) are not indexed by Unpaywall, so they were reported as not OA, fell through to institutional access, and were never downloaded, even though the arXiv source handles them directly. fetch now detects these DOIs and routes them straight to the arXiv source.

Adds unit tests for collect_pdf_urls() ordering/dedup and the arXiv-DOI detector.

enieuwy wants to merge 1 commit into Rimagination:main from enieuwy:feat/robust-oa-resolution
