Process URLs strictly in the given order #300

@bzc6p

Description

If wpull experiences a problem fetching a URL, it skips it and processes it at the end. This is a reasonable approach in most cases.

But there are applications where it is important that no URL is skipped and that all URLs are processed in the given order, even if some must be retried for a long time because of an error. One such case is saving paginated lists that roll downward as new elements arrive – skipping a page and processing it later may lose some elements of the list, which has been updated in the meantime.
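The data-loss mechanism can be illustrated with a small simulation (hypothetical data, not wpull code): in a newest-first list, items inserted between the original fetch attempt and the retry shift everything down, so an item slides past the page boundary unseen.

```python
# Sketch: why fetching a rolling, newest-first paginated list
# out of order can silently lose items.

def page(items, number, per_page=3):
    """Return the given 1-based page of a newest-first list."""
    start = (number - 1) * per_page
    return items[start:start + per_page]

# Initial list at crawl start, newest first.
items = ["e", "d", "c", "b", "a"]

seen = []
seen += page(items, 1)          # fetches 'e', 'd', 'c'

# Page 2 fails and is deferred; meanwhile two new items arrive,
# shifting every existing item down by two positions.
items = ["g", "f"] + items

seen += page(items, 2)          # the retry now fetches 'd', 'c', 'b'

# 'a' existed when the crawl started, but never appeared on any
# page we actually fetched.
print(sorted(set(seen)))        # ['b', 'c', 'd', 'e'] – 'a' is lost
```

Fetching page 2 immediately after page 1 (strict order, retrying in place) would have returned 'b' and 'a' before the list shifted.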

I tried to modify this behaviour, even in the wpull code, but it is more complex than I thought. A handle_error hook function returning Actions.RETRY does not solve this. In the get_next_url_record method in engine.py I tried changing the order in which URLs are looked up in the database (first error, then todo), but this gave only a partial solution: as the log suggests, multiple URL records are being processed at once – new URLs are taken from the database (or from some cache, if multiple URLs are taken from the db at a time) – so the next candidate is chosen before the last one is finished. This seems to be a different kind of concurrent behaviour than the one adjustable with --concurrency.
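For reference, the hook attempt looks roughly like the sketch below. The names (the injected wpull_hook object, callbacks.handle_error, actions.RETRY) follow the wpull 1.x --python-script interface as I understand it; the hook API changed between releases, so check your version's documentation. As noted above, this alone does not preserve order – other queued URLs are still fetched while the failed one waits for its retry.

```python
# Hedged sketch of a wpull --python-script hook that retries every
# failed URL in place instead of letting it be skipped.

try:
    wpull_hook  # the real object is injected by wpull at runtime
except NameError:
    # Minimal stand-in so the sketch can be exercised outside wpull.
    class _StubHook:
        class actions:
            RETRY = 'retry'
        class callbacks:
            handle_error = None
    wpull_hook = _StubHook()

def handle_error(url_info, record_info, error_info):
    # Always ask wpull to retry this URL rather than deferring it
    # to the end of the queue.
    return wpull_hook.actions.RETRY

wpull_hook.callbacks.handle_error = handle_error
```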

An option telling wpull to keep the order of URLs would be a possible, though admittedly far from important, enhancement. So, besides leaving this here as an enhancement suggestion, I would like to ask whether there is a way – even by modifying the code – to turn off the concurrent behaviour described above. The possible workarounds that do not touch wpull itself (e.g. running wpull on URLs one by one) are far less efficient, as far as I can tell.

Thank you in advance.
