Fix broken link checking CI and broken decoupler pseudobulk link #236
maltekuehl wants to merge 15 commits into scverse:main
Conversation
Nonono. What you’re doing is the coding equivalent of breaking into a house because you forgot your umbrella back when you were invited.
Before we deceptively pretend we’re a browser, how about trying this in polite ways?
- We should set a custom user agent. This is the most normal and basic thing every single HTTP bot should do!
- We could do a HEAD request instead of a GET. IDK if HTTPX downloads the full page when used like this, but doing a HEAD request should still work and signals that we don’t even want a lot of data. (See the sketch after this list.)
- If all that fails, we should ask RTD nicely to be removed from their blocklist.
If all that fails and RTD can’t recommend any other recourse, then we can try this.
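A minimal sketch of options 1 and 2 combined, assuming stock httpx; the user-agent string and URL here are illustrative, not what the CI actually uses:

```python
# Polite link check: identify ourselves with a custom bot user agent and
# use HEAD so we don't ask the server to send the full page body.
import httpx

headers = {
    "User-Agent": "scverse-tutorials-linkcheck/1.0 (+https://github.com/scverse/scverse-tutorials)"
}
with httpx.Client(headers=headers, follow_redirects=True, timeout=10.0) as client:
    response = client.head("https://scanpy.readthedocs.io/en/stable/")
    print(response.status_code)
```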
Thanks for the analogy; initially I just wanted to fix a broken link and might have gotten carried away. In my defense, I did try both 1 and 2 first, and they were insufficient. So with 1 and 2 down, I guess we will have to try 3! Since I do not have any official role within scverse and no email address, perhaps you could reach out, @flying-sheep? The form is available at https://app.readthedocs.org/support/. Alternatively, I could create an issue on https://github.com/readthedocs/readthedocs.org/issues if that is okay with you.
I just filed an issue. I think support is more for when your own hosted service gets taken down: readthedocs/readthedocs.org#12471
Yeah, I think it's just that scverse-tutorials doesn't get the love it deserves. |
@maltekuehl why did you try to use a browser-like user agent? That’s already deceptive. No wonder this would be blocked! Check out my commits: you’re supposed to use a completely custom one: https://user-agents.net/bots

It still got a 403, but probably because we ended up on some blocklist or so.

/edit: I guess I know why. It’s insane: when you search the web for “httpx user agent”, the first two results for me are blogspam articles that blatantly tell you to “fake” user agents and “avoid detection” instead of doing the right thing. Makes me angry: only the third tutorial actually tells you the right thing: https://proxiesapi.com/articles/customizing-httpx-user-agents-for-effective-api-requests
I tried with a completely custom one too, though it did not end up in the commit history. That was also locally from my IP, and GitHub IPs are not specific to this repository either, so we did not end up on a block list; it just never worked in any configuration. Since the 403 error is non-obvious (rather than a 429), the initial objective when trying out different fixes was to find out what was causing this problem. I am still not 100% sure the 403 is even related to the AI bot block, as in other configurations I was getting a 429 instead, which would be the more expected response.
That makes sense! Good point about the block list and GH IPs. Let’s wait for RTD to respond!
Automated Nova fix for scverse#236

**Original PR**: scverse#236
**Base**: `scverse/scverse-tutorials@main`
**Source SHA**: `5f0a3df5ed1e7da0f3cecd966247680e07eb7440`
**Nova Hint**: fix link checker CI; update broken links; target minimal 2-file fix
**Nova Mode**: local

This patch aims to be minimal and CI-verifiable.

🤖 Generated with [Nova CI-Rescue](https://github.com/anthropics/nova-ci-rescue)

Co-Authored-By: Nova <noreply@anthropic.com>
Now this is interesting:

```console
sturm@hochvogel ~ % uv run --python 3.13 --with httpx python -c "import httpx; print(httpx.__version__); print(httpx.head('https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html'))"
0.28.1
<Response [200 OK]>
sturm@hochvogel ~ % uv run --python 3.12 --with httpx python -c "import httpx; print(httpx.__version__); print(httpx.head('https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html'))"
0.28.1
<Response [403 Forbidden]>
```

So exactly the same request fails on Python 3.12 but succeeds on Python 3.13. It also affects only the scanpy link; all other links can be checked successfully with Python 3.12 as well.
Closing in favor of #249
@flying-sheep, do you have any idea why httpx could behave differently with different Python versions? Is it worth reporting this to httpx?
Maybe it just has the Python version in its default user agent or so? And whatever machine learning tool flags it as “bad” only learned to recognize that version?
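A quick way to test that hypothesis, as a minimal sketch assuming nothing beyond stock httpx:

```python
# Print the default User-Agent httpx sends. In recent versions it is
# "python-httpx/<httpx version>", so if only the httpx version appears here,
# the interpreter version would have to leak some other way (e.g. TLS behavior).
import httpx

client = httpx.Client()
print(client.headers["user-agent"])  # e.g. "python-httpx/0.28.1"
```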
Is the httpx version the same? Maybe the latest does not exist for 3.12.
It seems that RTD blocked bot-like requests and introduced rate limits, which caused the original pipeline to fail. To be more respectful of RTD and to fix this issue, I have adapted the CI to instantiate a Chromium browser for requests and to use a HEAD request instead of a GET request, since a HEAD request does not download the page's data. I have also added a timeout.
The problem was another example of AI companies making the web worse for everyone: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
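For reference, a minimal sketch of the browser-based check described above, assuming Playwright is what drives Chromium; the function name, timeout default, and URL are illustrative, not taken verbatim from the CI script:

```python
# Hedged sketch: drive headless Chromium via Playwright to check a link.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

def check_link(url: str, timeout_ms: int = 15_000) -> int:
    """Return the HTTP status for `url`, or 0 if no response arrived."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        response = page.goto(url, timeout=timeout_ms)
        status = response.status if response is not None else 0
        browser.close()
        return status

print(check_link("https://scanpy.readthedocs.io/en/stable/"))
```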
Additionally, I have fixed a broken decoupler link: the current link is broken, see https://scverse.org/learn/
One question I have is why the failing CI was not caught for four months; I see multiple people watching the repository. If it was noticed and deprioritized, no problem, but if it went unnoticed, that might be something to fix.
@grst @Zethson