@Gallaecio Gallaecio commented Jan 3, 2025

cc @kmike @wRAR

To be merged into zytedata#78 or, if that is merged first, this PR should be moved to the upstream repo with main as the target branch.

It’s a rather hacky approach, to be honest:

  • The number of total responses and undocumented error responses is tracked in class variables. Tests involve monkey patching to reset or hard-code their values.

    • If we want to keep the implementation at tenacity level, this is the cleanest path I saw, although I don’t rule out there being better ways.

      The counters live on the class rather than on an instance because tenacity creates copies of the instance for each function it wraps. The specifics are not completely clear to me, and I am not 100% sure there is no way around it, but this line seems problematic for instance-based storage of these counters. The statistics on the next line are not global either: they get reset on every wrapped call.

    • We could consider implementing this logic outside tenacity, in the client code. That would allow for a cleaner implementation, e.g. based on stats. The main issue I see, which might be considered minor, is that we could not stop new requests as soon as the condition is met: we would still need to wait for the retries of one of the ongoing requests to finish, so we might sometimes stop new requests only after e.g. 11+ undocumented error responses rather than exactly 10. We would also lose the ability to customize this logic through a custom retry policy class, but we might not want to support that to begin with, and if we did, we could provide some client parameter(s) instead.
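To illustrate the class-versus-instance point above, here is a minimal sketch. `UndocumentedErrorTracker` and its attribute names are hypothetical stand-ins, not the classes in this PR, and `copy.copy` only simulates a library that copies the object for each wrapped function; it is not a claim about how tenacity copies objects internally.

```python
import copy


class UndocumentedErrorTracker:
    """Hypothetical stand-in for a retry policy object (not the PR's class)."""

    # Class-level counters: shared by every instance and every copy.
    total_responses = 0
    undocumented_errors = 0

    def __init__(self):
        # Instance-level counter, kept only for comparison.
        self.instance_total = 0

    def record(self, undocumented: bool) -> None:
        cls = type(self)
        cls.total_responses += 1
        if undocumented:
            cls.undocumented_errors += 1
        self.instance_total += 1


original = UndocumentedErrorTracker()
# Simulate a library that copies the object once per wrapped function.
copy_a = copy.copy(original)
copy_b = copy.copy(original)

copy_a.record(undocumented=True)
copy_b.record(undocumented=False)

# The class-level counters see both responses; each instance-level
# counter only sees what its own copy recorded.
print(UndocumentedErrorTracker.total_responses, original.instance_total)
```

The trade-off is the one described in the PR: class-level state survives copying, but tests have to reset it explicitly (for example by monkey patching) so that one test does not leak counts into the next.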

I made some other decisions I’m not entirely sure about:

  • Did not make the aggressive retry policy behave any differently here, i.e. it also stops new requests on ≥10 and ≥1%, and not on ≥20 and ≥2%.
  • Did not provide any facility to override these settings in custom retry policy classes, assuming we might want to discourage overriding this, while still allowing it.
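As a rough sketch of the stop condition mentioned above, assuming that both thresholds must be met at the same time (at least 10 undocumented error responses, and at least 1% of all responses). The function and parameter names are hypothetical and do not come from the PR.

```python
def too_many_undocumented_errors(
    total: int,
    undocumented: int,
    min_count: int = 10,      # assumed absolute threshold (≥10)
    min_ratio: float = 0.01,  # assumed relative threshold (≥1%)
) -> bool:
    """Return True when both the absolute and the relative threshold are met."""
    if total <= 0:
        return False
    return undocumented >= min_count and undocumented / total >= min_ratio


# 10 undocumented errors out of 500 responses is 2%: both thresholds met.
print(too_many_undocumented_errors(total=500, undocumented=10))
# 10 out of 2000 is 0.5%: the count is met, the ratio is not.
print(too_many_undocumented_errors(total=2000, undocumented=10))
```

Keeping the thresholds as parameters would make a stricter or looser variant (such as ≥20 and ≥2%) a matter of passing different values, should that ever be wanted.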

Post-merge work:

  • Prepare a PR for scrapy-zyte-api that closes a spider with a specific close reason upon getting the TooManyUndocumentedErrors exception.

Gallaecio and others added 30 commits July 29, 2022 10:48
* n_results is renamed to n_success;
* n_extracted_queries is removed, because it's always the same as
  n_results (i.e. n_success);
* n_input_queries is removed: it wasn't really a number of input queries,
  (it was a number of processed queries), and it can be computed
  from other stats: success + fatal errors;
* added a short comment which explains each stat value

Network error retry time: 5 minutes → 15 minutes

It seems aiohttp has trouble with edge cases of Keep-Alive,
and disabling it helps with ServerDisconnectedErrors.

Using aiohttp sessions is still important, because it allows
reducing the number of ClientConnectorErrors.
Comment on lines +75 to +76
if store_errors and isinstance(e, RequestError):
    write_output(e.parsed.data)

This was a bug I found while addressing typing issues and removing the request error mock from tests. write_output calls json.dumps, so passing a string here, as the code did before, wrote '{"… to the file instead of the expected {"….
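A small, self-contained illustration of the double-encoding problem: calling `json.dumps` on a string that already contains JSON produces a quoted, escaped string, while calling it on the parsed object produces the object itself. The sample body is made up for the example.

```python
import json

# Made-up response body, already serialized as JSON text.
body_text = '{"status": 500}'

# Serializing the text again wraps it in quotes and escapes the inner quotes.
as_string = json.dumps(body_text)

# Serializing the parsed data yields a plain JSON object.
as_object = json.dumps(json.loads(body_text))

print(as_string)
print(as_object)
```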

raise

- logger.error(str(e))
+ logger.exception("Exception raised during response handling")

I found it useful to see the exception traceback when the exception is not RequestError. Although it might make things too verbose for RequestError. Maybe we should only log the traceback when the exception is unknown?
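One way the "only log the traceback for unknown exceptions" idea could look, as a sketch rather than a proposal for the final code. The `RequestError` class below is a hypothetical stand-in defined locally so the example runs on its own, and `log_failure` is an illustrative name.

```python
import logging

logger = logging.getLogger("example")


class RequestError(Exception):
    """Hypothetical stand-in for the client's RequestError."""


def log_failure(exc: BaseException) -> None:
    if isinstance(exc, RequestError):
        # Known error type: the message alone is usually enough.
        logger.error(str(exc))
    else:
        # Unknown error type: attach the exception so the traceback is logged.
        logger.error("Exception raised during response handling", exc_info=exc)
```

Passing `exc_info=exc` has the same effect as `logger.exception` inside an `except` block, but it also works when the exception object is handled outside of one.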


This sounds like something that can/should be re-evaluated after using it.

@Gallaecio

Continued at zytedata#82

@Gallaecio Gallaecio closed this Feb 16, 2025