feat: more robust OA resolution (all OA locations + arXiv DOIs) #8
Open
enieuwy wants to merge 1 commit into
Two independent gaps surfaced while fetching real papers:

1. OA fallback was single-shot. check_oa() collapsed every OA location to one pdf_url (the best/publisher copy). When that link 403'd, a perfectly good repository copy in oa_locations was never tried. check_oa() now exposes pdf_urls (ordered, de-duplicated, best first) and _try_open_access iterates them, so a forbidden publisher link falls back to a repository copy instead of escalating prematurely.
2. arXiv DataCite DOIs (10.48550/arXiv.*) are not indexed by Unpaywall, so they returned no OA, fell through to institutional access, and never downloaded, even though the arXiv source handles them directly. fetch now detects these DOIs and routes them straight to the arXiv source.

Adds unit tests for collect_pdf_urls() ordering/dedup and the arXiv-DOI detector.
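For readers who want a concrete picture of the fallback in item 1, here is a minimal sketch. It is not the code in this PR: the signature of `collect_pdf_urls`, the `first_successful` helper, and the `download` callable are all assumptions made for illustration. The `url_for_pdf` key follows the Unpaywall response format; the example URLs are placeholders.

```python
from typing import Callable, Iterable, List, Optional


def collect_pdf_urls(best: Optional[dict], locations: Optional[Iterable[dict]]) -> List[str]:
    """Return candidate PDF URLs, best location first, stripped and de-duplicated.

    Hypothetical signature; the real helper in this PR may differ.
    """
    urls: List[str] = []
    seen = set()
    candidates = ([best] if best else []) + list(locations or [])
    for location in candidates:
        # Missing or None values become "", which is skipped below.
        url = (location.get("url_for_pdf") or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def first_successful(urls: Iterable[str],
                     download: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Try each URL in order and return the first payload that is not None.

    `download` is an assumed callable that returns None on failure (e.g. a 403).
    """
    for url in urls:
        payload = download(url)
        if payload is not None:
            return payload
    return None


# Illustrative data only: a publisher copy (listed twice) plus a repository copy.
best = {"url_for_pdf": "https://publisher.example/paper.pdf"}
locations = [
    {"url_for_pdf": " https://publisher.example/paper.pdf "},
    {"url_for_pdf": "https://repo.example/paper.pdf"},
    {"url_for_pdf": None},
]
urls = collect_pdf_urls(best, locations)


def fake_download(url: str) -> Optional[bytes]:
    # Simulates a publisher link that is forbidden and a repository link that works.
    return None if "publisher" in url else b"%PDF-"


result = first_successful(urls, fake_download)
```

With this data, `urls` holds the publisher link once followed by the repository link, and `first_successful` returns the repository payload after the publisher attempt fails.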
Two independent OA-resolution gaps surfaced while fetching real papers (both made instsci miss content it could have retrieved).
1. Fall back across all OA locations

`unpaywall.check_oa()` collapsed every OA location into a single `pdf_url` (the best/publisher copy). When that link returned 403/Forbidden, which is common on publisher PDFs behind bot detection, the available repository copy in `oa_locations` was never tried, and the fetch escalated to institutional access unnecessarily.

`OAResult` now carries `pdf_urls`: an ordered, de-duplicated list (best location first, then every other OA location). `_try_open_access` iterates the candidates, so a forbidden publisher link falls back to a repository copy. `pdf_url` is retained (first candidate) for backward compatibility.

Real example: `10.1080/10447318.2023.2301250`. The publisher PDF 403s, but Unpaywall also lists a repository copy that was previously ignored.

2. Route arXiv DataCite DOIs directly to arXiv

arXiv registers DOIs of the form `10.48550/arXiv.<id>`. Unpaywall does not index these, so `check_oa` returned "not OA", the paper fell through to institutional access, and it never downloaded, even though the arXiv source handles it directly.

`fetch` now detects `10.48550/arXiv.*` DOIs and routes them straight to the arXiv source.

Real examples: `10.48550/arXiv.2504.06435`, `10.48550/arXiv.2602.11522`. Both are arXiv-only papers that previously failed.

Testing

- `tests/test_unpaywall_oa_locations.py`: `collect_pdf_urls` ordering, dedup, whitespace, no-best, empty.
- `tests/test_arxiv_doi_routing.py`: arXiv-DOI detection (incl. case-insensitive prefix) and non-arXiv DOIs returning `None`.

Based on `main`. Independent of the in-flight `codex/instsci-user-install-doctor` branch.
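
The arXiv-DOI detection described in section 2 could look roughly like the sketch below. The function name `arxiv_id_from_doi` and the regular expression are assumptions for illustration, not the detector added in this PR; the sketch only mirrors the behaviour the tests describe (case-insensitive prefix, `None` for non-arXiv DOIs).

```python
import re
from typing import Optional

# Matches DOIs of the form 10.48550/arXiv.<id>; the prefix is compared case-insensitively.
_ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(?P<id>.+)$", re.IGNORECASE)


def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Return the arXiv identifier embedded in an arXiv DataCite DOI, else None.

    Hypothetical helper name; the real detector may be shaped differently.
    """
    match = _ARXIV_DOI.match(doi.strip())
    return match.group("id") if match else None


arxiv_id = arxiv_id_from_doi("10.48550/arXiv.2504.06435")
publisher_id = arxiv_id_from_doi("10.1080/10447318.2023.2301250")
```

A caller could check the return value before querying Unpaywall and hand any non-`None` identifier to the arXiv source.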