feat: more robust OA resolution (all OA locations + arXiv DOIs) #8

Open
enieuwy wants to merge 1 commit into Rimagination:main from enieuwy:feat/robust-oa-resolution

enieuwy commented Jun 29, 2026

Two independent OA-resolution gaps surfaced while fetching real papers (both made instsci miss content it could have retrieved).

1. Fall back across all OA locations

unpaywall.check_oa() collapsed every OA location into a single pdf_url (the best/publisher copy). When that link returns 403/Forbidden — common on publisher PDFs behind bot detection — the available repository copy in oa_locations was never tried, and the fetch escalated to institutional access unnecessarily.

  • OAResult now carries pdf_urls: an ordered, de-duplicated list (best location first, then every other OA location).
  • _try_open_access iterates the candidates, so a forbidden publisher link falls back to a repository copy.
  • pdf_url is retained (first candidate) for backward compatibility.
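The ordering and de-duplication described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the signature of `collect_pdf_urls` is an assumption, and the record is assumed to follow the Unpaywall response shape (`best_oa_location`, `oa_locations`, `url_for_pdf`).

```python
from typing import Any


def collect_pdf_urls(record: dict[str, Any]) -> list[str]:
    """Return PDF URLs from an Unpaywall-style record, best location first.

    Hypothetical signature: the real helper in this PR may differ.
    """
    best = record.get("best_oa_location")
    # Best location goes first, followed by every listed OA location.
    locations = [best] if best else []
    locations.extend(record.get("oa_locations") or [])

    urls: list[str] = []
    seen: set[str] = set()
    for location in locations:
        if not location:
            continue
        # url_for_pdf may be missing, None, or padded with whitespace.
        url = (location.get("url_for_pdf") or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Under these assumptions, `pdf_url` would simply be the first element of the returned list, or `None` when the list is empty.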

Real example: 10.1080/10447318.2023.2301250 — the publisher PDF 403s, but Unpaywall also lists a repository copy that was previously ignored.
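The fallback behaviour in a case like this can be sketched as a loop over the candidates. The function name `first_successful_pdf` and the injected `download` callable are hypothetical, chosen so the sketch runs without network access. The real `_try_open_access` may be structured differently.

```python
from typing import Callable, Iterable, Optional


def first_successful_pdf(
    pdf_urls: Iterable[str],
    download: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Try each candidate URL in order and return the first payload obtained.

    In this sketch a failed download is signalled either by returning None
    or by raising OSError (for example a 403 surfaced as an exception).
    """
    for url in pdf_urls:
        try:
            payload = download(url)
        except OSError:
            # This candidate is unavailable; move on to the next one.
            continue
        if payload:
            return payload
    # Every candidate failed; the caller can now escalate.
    return None
```

Only when the function returns `None` would the fetch need to escalate to institutional access.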

2. Route arXiv DataCite DOIs directly to arXiv

arXiv registers DOIs of the form 10.48550/arXiv.<id>. Unpaywall does not index these, so check_oa returned "not OA" and the paper fell through to institutional access and was never downloaded, even though the arXiv source handles it directly.

  • fetch now detects 10.48550/arXiv.* DOIs and routes them straight to the arXiv source.

Real examples: 10.48550/arXiv.2504.06435, 10.48550/arXiv.2602.11522 — both arXiv-only papers that previously failed.
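One way to detect such DOIs is a case-insensitive prefix match that extracts the arXiv identifier. The helper name `arxiv_id_from_doi` is hypothetical, and the detector added in this PR may use a different name or return type.

```python
import re
from typing import Optional

# Matches DOIs of the form 10.48550/arXiv.<id>; the prefix is case-insensitive.
_ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(?P<id>.+)$", re.IGNORECASE)


def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Return the arXiv identifier for an arXiv DataCite DOI, else None."""
    match = _ARXIV_DOI.match(doi.strip())
    return match.group("id") if match else None
```

A caller could check this before querying Unpaywall and hand any non-`None` result straight to the arXiv source.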

Testing

  • tests/test_unpaywall_oa_locations.py: collect_pdf_urls ordering, dedup, whitespace, no-best, empty.
  • tests/test_arxiv_doi_routing.py: arXiv-DOI detection (incl. case-insensitive prefix) and non-arXiv DOIs returning None.
  • Full existing suite still green (37 fetcher-adjacent tests pass; no regressions).

Based on main. Independent of the in-flight codex/instsci-user-install-doctor branch.

Two independent gaps surfaced while fetching real papers:

1. OA fallback was single-shot. check_oa() collapsed every OA location to
   one pdf_url (the best/publisher copy). When that link 403'd, a perfectly
   good repository copy in oa_locations was never tried. check_oa() now
   exposes pdf_urls (ordered, de-duplicated, best first) and _try_open_access
   iterates them, so a forbidden publisher link falls back to a repository
   copy instead of escalating prematurely.

2. arXiv DataCite DOIs (10.48550/arXiv.*) are not indexed by Unpaywall, so
   they returned no OA, fell through to institutional access, and were
   never downloaded -- even though the arXiv source handles them directly.
   fetch now detects these DOIs and routes them straight to the arXiv
   source.

Adds unit tests for collect_pdf_urls() ordering/dedup and the arXiv-DOI
detector.