Process URLs strictly in the given order #300

@bzc6p

Description

If wpull experiences a problem fetching a URL, it skips it and processes it at the end. This is a reasonable approach in most cases.

But there are applications where it is important that no URL is skipped and that all URLs are processed in the given order, even if some must be retried for a long time because of an error. One such case is saving paginated lists that roll downward as new elements arrive – skipping a page and processing it later may lose some elements of the list, which has been updated in the meantime.
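The data-loss mechanism can be illustrated with a small simulation (hypothetical data, not wpull code): in a newest-first list, items inserted between the original fetch attempt and the retry shift everything down, so an item slides past the page boundary unseen.

```python
# Sketch: why fetching a rolling, newest-first paginated list
# out of order can silently lose items.

def page(items, number, per_page=3):
    """Return the given 1-based page of a newest-first list."""
    start = (number - 1) * per_page
    return items[start:start + per_page]

# Initial list at crawl start, newest first.
items = ["e", "d", "c", "b", "a"]

seen = []
seen += page(items, 1)          # fetches 'e', 'd', 'c'

# Page 2 fails and is deferred; meanwhile two new items arrive,
# shifting every existing item down by two positions.
items = ["g", "f"] + items

seen += page(items, 2)          # the retry now fetches 'd', 'c', 'b'

# 'a' existed when the crawl started, but never appeared on any
# page we actually fetched.
print(sorted(set(seen)))        # ['b', 'c', 'd', 'e'] – 'a' is lost
```

Fetching page 2 immediately after page 1 (strict order, retrying in place) would have returned 'b' and 'a' before the list shifted.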

I tried to modify this behaviour, even in the wpull code, but it is more complex than I thought. A handle_error hook function returning Actions.RETRY does not solve this. In the get_next_url_record method in engine.py I tried changing the order in which URLs are looked up in the database (first error, then todo), but this gave only a partial solution: as the log suggests, multiple URL records are being processed at once – new URLs are taken from the database (or from some cache, if multiple URLs are taken from the db at a time) – so the next candidate is chosen before the last one is finished. This seems to be a different kind of concurrent behaviour than the one adjustable with --concurrency.
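For reference, the hook attempt looks roughly like the sketch below. The names (the injected wpull_hook object, callbacks.handle_error, actions.RETRY) follow the wpull 1.x --python-script interface as I understand it; the hook API changed between releases, so check your version's documentation. As noted above, this alone does not preserve order – other queued URLs are still fetched while the failed one waits for its retry.

```python
# Hedged sketch of a wpull --python-script hook that retries every
# failed URL in place instead of letting it be skipped.

try:
    wpull_hook  # the real object is injected by wpull at runtime
except NameError:
    # Minimal stand-in so the sketch can be exercised outside wpull.
    class _StubHook:
        class actions:
            RETRY = 'retry'
        class callbacks:
            handle_error = None
    wpull_hook = _StubHook()

def handle_error(url_info, record_info, error_info):
    # Always ask wpull to retry this URL rather than deferring it
    # to the end of the queue.
    return wpull_hook.actions.RETRY

wpull_hook.callbacks.handle_error = handle_error
```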

An option telling wpull to keep the order of URLs would be a possible, though admittedly far from important, enhancement. So, besides leaving this here as an enhancement suggestion, I would like to ask whether there is a way – even by modifying the code – to turn off the concurrent behaviour described above. The possible workarounds that do not touch wpull itself (e.g. running wpull on URLs one by one) are far less efficient, as far as I can tell.

Thank you in advance.
