
Fix replay job to work as advertised#6

Open
alex-ball wants to merge 7 commits into eprintsug:master from alex-ball:patch-1

Conversation

@alex-ball

@alex-ball alex-ball commented Nov 13, 2023

This PR fixes several problems with the implementation:

  • The URL being requested ($request_url) is a required part of the ping to IRUS (svc_dat), but is not available to the replay job (it is not recorded in the Access instance). This PR adds it as a parameter to the job so it can be passed on to PIRUS::log.
  • Access instances can be populated with NULL values; such instances should not be turned into pings. This PR allows such instances to be disregarded silently.
  • If the ping fails again during the replay job, steps are taken to reschedule the job for 24 hours' time, and this is reported in the error message. However, the job then returns a 0 value, which causes EPrints::DataObj::EventQueue::execute to log an additional error and then delete the job; so the ping is not in fact retried. This PR restores the HTTP_RESET_CONTENT return value that allows the job to be retried.
  • The _archive_id method is out of sync with EPrints::OpenArchives::archive_id, since it does not fall back to securehost.

NB. I have also simplified the idioms used for loading an Access instance and creating the replay job. These work for EPrints v3.3+ but I don't have access to the source code for v3.2 to be able to check compatibility there.
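The return-value fix in the third bullet can be sketched as follows. This is a minimal illustration, not the actual patch: the `_ping` helper and the overall shape of the job handler are assumptions, and only the significance of the return codes is taken from EPrints::DataObj::EventQueue::execute.

```perl
# Sketch only: _ping is a hypothetical helper standing in for the
# real IRUS ping logic.
use EPrints::Const qw( :http );

sub replay
{
    my( $self, $accessid, $request_url ) = @_;

    my $ok = $self->_ping( $accessid, $request_url );
    return undef if $ok;    # success: the event is removed as normal

    # Reschedule for 24 hours' time, as promised in the error message.
    $self->{event}->set_value( "start_time",
        EPrints::Time::iso_datetime( time + 86400 ) );

    # HTTP_RESET_CONTENT tells execute() to commit the change and keep
    # the event; returning 0 would log an error and delete it.
    return HTTP_RESET_CONTENT;
}
```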

In addition:

Use direct method of retrieving Access from database.

Do not retry ping if there is not enough information to form a ping.

Actually reschedule for 24 hours' time: a return code of 0 is treated
by EPrints::DataObj::EventQueue::execute the same as HTTP_NOT_FOUND,
which means changes to the event are not committed, and if 'cleanup'
is True (default), the event is immediately deleted. HTTP_RESET_CONTENT
is the correct return value to do what is promised in the $fail_message.
Also, EPrints::DataObj::EventQueue::create_unique computes
`eventqueueid` itself, discarding any value passed in, but will use
`params` if given.
@jesusbagpuss

jesusbagpuss commented Nov 13, 2023

Hi @alex-ball ,
Can I check my understanding here?

Currently the PIRUS integration would create one 'replay' job - that's the purpose of the create_unique combined with eventqueueid here:

irus/lib/cfg.d/pirus.pl

Lines 90 to 91 in 0f0561a

my $event = $repo->dataset( "event_queue" )->dataobj_class->create_unique( $repo, {
eventqueueid => Digest::MD5::md5_hex( "Event::PIRUS::replay" ),

The event queue job stores the failed access ID. When the 'replay' was run, it would replay relevant items from the access dataset - from the stored accessid to the most recent access record.

Does your PR change this behaviour?
Will there be many 'replay' events - one for each failed request?

@alex-ball
Author

alex-ball commented Nov 13, 2023

Please do correct me if I'm wrong, but the way I'm reading it, the current workflow is as follows:

  1. Access instance is created.
  2. Ping is sent to IRUS. If successful nothing further happens. If there is an error, a replay job is created with current accessid and run ASAP.
  3. Replay searches for all Access instances with that accessid or greater (up to 1000) – there shouldn't be many, probably just one unless the server is really popular. It attempts to send a ping to IRUS for each one (regardless of whether a ping was previously successful), but with an undefined value populating the svc_dat part of the ping. The first time one of these pings fails, the map loop dies.
  4. If and when the loop dies, an error is logged. The starting accessid and description of the job is updated, and the start time set to 24 hours in the future. The job returns 0, execute logs an error to say that 0 is not a recognised return code so please check it, then deletes the job.

It won't be possible to know which accessid failed unless you have verbose_error_logging on, and then only for the original trigger and not a subsequent replay.

As far as I can see, eventqueueid is written to but not read by create_unique:

sub create_unique
{
    my( $class, $session, $data, $dataset ) = @_;

    $dataset ||= $session->dataset( $class->get_dataset_id );

    my $md5 = Digest::MD5->new;
    $md5->add( $data->{pluginid} );
    $md5->add( $data->{action} );
    $md5->add( EPrints::MetaField::Storable->freeze( $session, $data->{params} ) )
        if EPrints::Utils::is_set( $data->{params} );
    $data->{eventqueueid} = $md5->hexdigest;
    [...]
}

This PR does indeed change the behaviour so there is one replay job per failed request.

  1. When the ping fails at the trigger stage, a replay job is created with the current accessid and request_url and run ASAP.
  2. The replay job retries the ping with the svc_dat field still populated. If it fails again, the job is set to rerun in 24 hours' time, and so on until it succeeds or the job is deleted.
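Under this scheme, the per-failure job creation could look something like the sketch below. The exact `pluginid` and `action` values are assumptions extrapolated from the snippet quoted earlier; the point is that `create_unique` hashes `pluginid`, `action` and `params` together, so each distinct (accessid, request_url) pair yields its own `eventqueueid`, i.e. one replay job per failed ping.

```perl
# Sketch only: field values are assumptions, not the exact patch.
$repo->dataset( "event_queue" )->dataobj_class->create_unique( $repo, {
    pluginid => "Event::PIRUS",
    action   => "replay",
    params   => [ $accessid, $request_url ],
});
```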

Benefits:

  • No access event is lost unless deliberately aborted. If a systematic problem is present, it should be easier to spot and correct, and then the problem pings can finally be sent.
  • No information loss, comparing replayed pings versus original ones.
  • IRUS isn't spammed with duplicate pings.

Downsides:

  • There will be a waiting job in the Task list for each failing ping, as opposed to the Task list remaining empty. I can see this would be annoying if the reason for them failing can't be corrected.

@alex-ball
Author

Another way of handling it could be for the initial retry to be postponed for 24 hours (as opposed to ASAP), and then if it fails again return HTTP_INTERNAL_SERVER_ERROR so it persists in the Task list as a failed job. Then the failed jobs could be retried or cleared en masse with the --retry or --clear options to indexer.

@jesusbagpuss

jesusbagpuss commented Nov 13, 2023

That's a good point about the eventqueueid not being referenced - although passing it an unchanging set of parameters each time also led to the same outcome.

I think your changes seem reasonable and sensible.
I've been monitoring our error logs (and sharing the data with IRUS) for a while. The worst 'failure' event we have seen was 1950 failed transmissions - which could be alarming if anyone pays close attention to the event queue (humans don't. Monitoring might).

Have you spoken to the folk at IRUS at all about this? I think they'd welcome the change, but would be good to get their feedback.

@alex-ball
Author

For many years my indexer logs have been full to bursting with error messages from the PIRUS replay event. I have been in contact with IRUS a few times to try and work out what's going on, but as far as they were concerned they weren't seeing any problems at their end.

It was also a bit of a problem that (a) I didn't understand where verbose_error_logging would be sending the extra information, so didn't find it, and (b) the failure message from the original ping would only otherwise be logged in the description of the replay event, which would be deleted after a few seconds.

I finally tracked down my problems to the User Agent not picking up the proxy settings properly, but since some data at least has been getting through to IRUS, I think that means the proxy issue was only affecting the Indexer. I am still not sure why the original pings were failing, but since fixing the proxy settings and trying out these changes in our repository we have only had successful pings, no failed ones yet.

Event params are not changed, and attempting to set them to the same
value again triggers an error.
