Fix broken link checking CI and broken decoupler pseudobulk link #236
maltekuehl wants to merge 15 commits into scverse:main
Conversation
Nonono. What you’re doing is the coding equivalent of breaking into a house because you forgot your umbrella back when you were invited.
Before we deceptively pretend we’re a browser, how about trying this in polite ways?
- We should set a custom user agent. This is the most normal and basic thing every single HTTP bot should do!
- We could do a HEAD request instead of a GET. IDK if HTTPX downloads the full page when used like this, but doing a HEAD request should still work and signals that we don’t even want a lot of data. (See the sketch after this list.)
- If all that fails, we should ask RTD nicely to be removed from their blocklist.
If all that fails and RTD can’t recommend any other recourse, then we can try this.
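A minimal sketch of options 1 and 2 combined, assuming stock httpx; the user-agent string and URL here are illustrative, not what the CI actually uses:

```python
# Polite link check: identify ourselves with a custom bot user agent and
# use HEAD so we don't ask the server to send the full page body.
import httpx

headers = {
    "User-Agent": "scverse-tutorials-linkcheck/1.0 (+https://github.com/scverse/scverse-tutorials)"
}
with httpx.Client(headers=headers, follow_redirects=True, timeout=10.0) as client:
    response = client.head("https://scanpy.readthedocs.io/en/stable/")
    print(response.status_code)
```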
Thanks for the analogy; initially I just wanted to fix a broken link and might have gotten carried away. In my defense, I did try both 1 and 2 first, and they were insufficient. So with 1 and 2 down, I guess we will have to try 3! Since I do not have any official role within scverse and no email address, perhaps you could reach out, @flying-sheep? The form is available at https://app.readthedocs.org/support/. Alternatively, I could create an issue on https://github.com/readthedocs/readthedocs.org/issues if that is okay with you.
I just filed an issue. I think support is more for when your own hosted service gets taken down: readthedocs/readthedocs.org#12471
Yeah, I think it's just that scverse-tutorials doesn't get the love it deserves. |
@maltekuehl why did you try to use a browser-like user agent? That’s already deceptive. No wonder this would be blocked! Check out my commits: you’re supposed to use a completely custom one: https://user-agents.net/bots

It still got a 403, but probably because we ended up on some blocklist or so.

/edit: I guess I know why. It’s insane: when you search the web for “httpx user agent”, the first two results for me are blogspam articles that blatantly tell you to “fake” user agents and “avoid detection” instead of doing the right thing. Makes me angry: only the third tutorial actually tells you the right thing: https://proxiesapi.com/articles/customizing-httpx-user-agents-for-effective-api-requests
I tried with a completely custom one too, though it did not end up in the commit history. That was also locally from my IP, and GitHub IPs are not specific to this repository either, so we did not end up on a block list; it just never worked in any configuration. Since the 403 error is non-obvious (rather than a 429), the initial objective when trying out different fixes was to find out what was causing this problem. I am still not 100% sure the 403 is even related to the AI bot block, as in other configurations I was getting a 429 instead, which would be the more expected response.
That makes sense! Good point about the block list and GH IPs. Let’s wait for RTD to respond!
Automated Nova fix for scverse#236

**Original PR**: scverse#236
**Base**: `scverse/scverse-tutorials@main`
**Source SHA**: `5f0a3df5ed1e7da0f3cecd966247680e07eb7440`
**Nova Hint**: fix link checker CI; update broken links; target minimal 2-file fix
**Nova Mode**: local

This patch aims to be minimal and CI-verifiable.

🤖 Generated with [Nova CI-Rescue](https://github.com/anthropics/nova-ci-rescue)

Co-Authored-By: Nova <noreply@anthropic.com>
Now this is interesting:

```console
sturm@hochvogel ~ % uv run --python 3.13 --with httpx python -c "import httpx; print(httpx.__version__); print(httpx.head('https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html'))"
0.28.1
<Response [200 OK]>
sturm@hochvogel ~ % uv run --python 3.12 --with httpx python -c "import httpx; print(httpx.__version__); print(httpx.head('https://scanpy.readthedocs.io/en/stable/tutorials/plotting/core.html'))"
0.28.1
<Response [403 Forbidden]>
```

So exactly the same request fails on Python 3.12 but succeeds on Python 3.13. It also affects only the scanpy link; all other links can be checked successfully with Python 3.12 as well.
Closing in favor of #249
@flying-sheep, do you have any idea why httpx could behave differently with different Python versions? Is it worth reporting this to httpx?
Maybe it just has the Python version in its default user agent or so? And whatever machine learning tool flags it as “bad” only learned to recognize that version?
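A quick way to test that hypothesis, as a minimal sketch assuming nothing beyond stock httpx:

```python
# Print the default User-Agent httpx sends. In recent versions it is
# "python-httpx/<httpx version>", so if only the httpx version appears here,
# the interpreter version would have to leak some other way (e.g. TLS behavior).
import httpx

client = httpx.Client()
print(client.headers["user-agent"])  # e.g. "python-httpx/0.28.1"
```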
Is the httpx version the same? Maybe the latest does not exist for 3.12.
It seems that RTD blocked bot-like requests and introduced rate limits, which caused the original pipeline to fail. To be more respectful of RTD and to fix this issue, I have adapted the CI to instantiate a Chromium browser for requests and to use a HEAD request instead of a GET request, since a HEAD request does not download the page's data. I have also added a timeout.
The problem was another example of AI companies making the web worse for everyone: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
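For reference, a minimal sketch of the browser-based check described above, assuming Playwright is what drives Chromium; the function name, timeout default, and URL are illustrative, not taken verbatim from the CI script:

```python
# Hedged sketch: drive headless Chromium via Playwright to check a link.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

def check_link(url: str, timeout_ms: int = 15_000) -> int:
    """Return the HTTP status for `url`, or 0 if no response arrived."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        response = page.goto(url, timeout=timeout_ms)
        status = response.status if response is not None else 0
        browser.close()
        return status

print(check_link("https://scanpy.readthedocs.io/en/stable/"))
```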
Additionally, I have fixed a broken decoupler link: the current link is broken, see https://scverse.org/learn/
One question I have is why the failing CI was not caught for four months; I see multiple people watching the repository. If it was noticed and deprioritized, no problem, but if it went unnoticed, that might be something to fix.
@grst @Zethson