State_change value maxes out at 32bits #300

@nurmie

Description

What happened?

We encountered an issue where the state_change counter in the checkpoint file grew past 2,147,483,647, the maximum value of a signed 32-bit integer.
We first noticed this when ecflow_ui connections to the running server stopped working.
The server itself kept running, creating and submitting jobs as normal, and state_change kept increasing. As noted below, restarting the server would not fix the immediate issue.
Because the UI could no longer work with the server, we lost visibility into the server's state.

Error message:
Defs::restore_from_string: Defs::read_state: invalid state_change specified : defs_state NET state>:active flag:message,sigterm state_change:2147713152 modify_change:44739 cal_count:1469
Could not parse 'defs_state NET state>:active flag:message,sigterm state_change:2147713152 modify_change:44739 cal_count:1469' around line number 2
Ecflow version(5.13.5) boost(1.83.0) compiler(gcc 14.2.1) protocol(JSON cereal 1.3.0) openssl(enabled) Compiled on Nov 22 2024 21:09:04
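
The rejected value in the message above, 2147713152, is larger than 2,147,483,647. The following is a minimal sketch of that check, assuming (not confirmed from the ecflow source) that the reader parses state_change into a signed 32-bit integer. The helper names `state_change_of` and `fits_int32` are hypothetical and only illustrate the arithmetic.

```python
import re

INT32_MAX = 2**31 - 1  # 2,147,483,647
INT32_MIN = -(2**31)


def state_change_of(defs_state_line):
    """Return the state_change value found in a defs_state line, or None."""
    match = re.search(r"\bstate_change:(\d+)", defs_state_line)
    return int(match.group(1)) if match else None


def fits_int32(value):
    """True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Line copied from the error message in this report.
line = (
    "defs_state NET state>:active flag:message,sigterm "
    "state_change:2147713152 modify_change:44739 cal_count:1469"
)

value = state_change_of(line)
print(value, fits_int32(value))  # 2147713152 False
print(value - INT32_MAX)         # 229505
```

The value is 229,505 past the signed 32-bit limit, which is consistent with the counter having kept increasing on the running server after it crossed that limit.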

This happened on a long-running server with the version: Ecflow version(5.9.2) boost(1.69.0) compiler(gcc 8.5.0) protocol(JSON cereal 1.3.0) openssl(enabled) Compiled on Dec 14 2022 11:02:33
UI versions ranged from 5.8.1 to 5.13.5.

The issue was reproduced and tested with ecflow server version: Ecflow version(5.13.4) boost(1.69.0) compiler(gcc 8.5.0) protocol(JSON cereal 1.3.0) openssl(enabled) Compiled on Oct 17 2024 04:56:31
The checkpoint file was copied to a test instance, and starting the server with it failed:
Failed to load *DEFS* check point file /home/users/***/ecflow-server/***.4819.check, because: Defs::defs_restore_from_checkpt: Defs::read_state: invalid state_change specified : defs_state MIGRATE state>:active flag:message,sigterm state_change:2147711764 modify_change:44739 cal_count:1468

The fix was to manually edit the state_change value in the checkpoint file to something that fits in 32 bits and then start the server. After that, the server reset the state_change value itself.
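
The manual edit described above could be scripted roughly as follows. This is a sketch under assumptions: the only thing known about the checkpoint format here is the `defs_state ... state_change:<n> ...` line quoted in the error messages, the function name `reset_state_change` is hypothetical, and the replacement value is left to the caller because this report does not say which value was used. Work on a copy of the checkpoint file, with the server stopped.

```python
import re

INT32_MAX = 2**31 - 1  # 2,147,483,647


def reset_state_change(checkpoint_text, new_value):
    """Return checkpoint_text with state_change on the defs_state line replaced.

    Only the first line starting with 'defs_state ' is touched; everything
    else is returned unchanged.
    """
    if not 0 <= new_value <= INT32_MAX:
        raise ValueError("new_value must fit in a signed 32-bit integer")
    lines = checkpoint_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith("defs_state "):
            lines[i] = re.sub(
                r"\bstate_change:\d+",
                f"state_change:{new_value}",
                line,
                count=1,
            )
            break
    return "".join(lines)


# Example on an in-memory string. The first line is a placeholder, not real
# checkpoint content; the second is the line from the error message.
sample = (
    "placeholder first line\n"
    "defs_state MIGRATE state>:active flag:message,sigterm "
    "state_change:2147711764 modify_change:44739 cal_count:1468\n"
)
fixed = reset_state_change(sample, 0)
print(fixed.splitlines()[1])
```

To apply it to a file, read the copy into a string, pass it through the function, and write the result back. Whether 0 is an acceptable value for the server is an assumption; any value below the 32-bit limit matches what the report describes.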

What are the steps to reproduce the bug?

See the description under "What happened?" above.

Version

5.9.1, 5.13.4

Platform (OS and architecture)

RHEL 8

Relevant log output

Accompanying data

No response

Organisation

FMI

Labels

bug (Something isn't working)
