
pg_shard may fail to mark shard placement as invalid under some circumstances #101

@onderkalaci

Description

The bug occurs when pg_shard fails to INSERT into a shard placement and PostgreSQL is shut down, or the psql connection is closed, before the shard placement's status is updated.

This bug is not easy to reproduce. However, if a sleep() call is added to this line, reproducing it becomes easy.

Assuming that sleep() is added, the bug can be reproduced with the following steps:

  1. Create a cluster with 1 master and 2 workers.
  2. Distribute a table and create worker shards with replication factor 2.
  3. Stop one of the worker nodes.
  4. Connect with psql and get its pid with select pg_backend_pid();
  5. Issue an INSERT in that psql session. While the INSERT is running (because of the added sleep, it takes at least the sleep duration), execute the shell command "kill -9 pid_of_psql".
  6. Restart both the master and the stopped worker node.
  7. Connect to the worker nodes and observe that one of the shards has diverged.
  8. However, the shard placements in the metadata all still have STATE_FINALIZED status.

The main problem here is that we do not execute the remote commands and the placement status changes atomically.
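
To make the window concrete, here is a minimal, self-contained C model of the two-step flow. It is only a sketch and not pg_shard's actual code: the `Placement` struct, the `PlacementState` enum, and all function names are hypothetical stand-ins, and the `interrupted_between_steps` flag simply simulates the backend going away after the remote commands have run but before the metadata is updated.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PLACEMENTS 8

/* Hypothetical stand-ins for the placement states kept in the metadata. */
typedef enum { PLACEMENT_FINALIZED, PLACEMENT_INACTIVE } PlacementState;

typedef struct {
    bool reachable;        /* false when the worker node is stopped */
    int row_count;         /* rows actually stored on the worker */
    PlacementState state;  /* what the master's metadata records */
} Placement;

/* Step 1: send the INSERT to one placement; fails if the worker is down. */
bool remote_insert(Placement *p)
{
    if (!p->reachable)
        return false;
    p->row_count++;
    return true;
}

/*
 * Runs the remote commands first and updates the metadata second.
 * When interrupted_between_steps is true, the function returns in the
 * gap between the two steps, which models the backend dying there.
 */
void distributed_insert(Placement *placements, int count,
                        bool interrupted_between_steps)
{
    bool failed[MAX_PLACEMENTS] = { false };

    assert(count <= MAX_PLACEMENTS);

    for (int i = 0; i < count; i++)
        failed[i] = !remote_insert(&placements[i]);

    if (interrupted_between_steps)
        return; /* metadata update below never happens */

    /* Step 2: mark every placement that missed the INSERT as invalid. */
    for (int i = 0; i < count; i++)
        if (failed[i])
            placements[i].state = PLACEMENT_INACTIVE;
}

/* True when two placements marked finalized hold different data. */
bool metadata_hides_divergence(const Placement *placements, int count)
{
    for (int i = 0; i < count; i++)
        for (int j = i + 1; j < count; j++)
            if (placements[i].state == PLACEMENT_FINALIZED &&
                placements[j].state == PLACEMENT_FINALIZED &&
                placements[i].row_count != placements[j].row_count)
                return true;
    return false;
}
```

With one reachable and one stopped placement, calling `distributed_insert` without the interruption marks the stopped placement inactive, while calling it with the interruption leaves both placements finalized even though their row counts differ. That matches steps 7 and 8 above.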

A possible solution to try is to check whether HOLD_INTERRUPTS()/RESUME_INTERRUPTS() works here. We should also check whether this pair of function calls has any drawbacks.
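
As a rough illustration of the idea, the sketch below models how a holdoff counter defers interrupt processing until a critical section has finished. It is a toy model and not PostgreSQL source: all lowercase names here are hypothetical, and the real macros operate on backend-internal state such as `InterruptHoldoffCount`. One drawback worth checking is that this kind of deferral only covers interrupts that are serviced through the interrupt check; it presumably cannot help against SIGKILL or a crash, since those never reach that check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counterparts of the backend's interrupt state (illustrative names). */
int  holdoff_count = 0;
bool interrupt_pending = false;
bool interrupt_serviced = false;

void hold_interrupts(void)   { holdoff_count++; }
void resume_interrupts(void) { assert(holdoff_count > 0); holdoff_count--; }

/* What a signal handler would do: only set a flag. */
void request_interrupt(void) { interrupt_pending = true; }

/* Services a pending interrupt only when no holdoff is active. */
void check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0) {
        interrupt_pending = false;
        interrupt_serviced = true;
    }
}

/*
 * Runs the remote command and the placement status update inside one
 * holdoff section, with a cancel request arriving in the middle.
 * Returns true if the interrupt was deferred until after the update.
 */
bool guarded_insert(bool remote_ok, bool *placement_valid)
{
    bool deferred;

    interrupt_serviced = false;

    hold_interrupts();
    request_interrupt();        /* cancel arrives mid-operation */
    check_for_interrupts();     /* no effect while the holdoff is active */
    deferred = !interrupt_serviced;

    if (!remote_ok)
        *placement_valid = false;   /* status update still runs */

    resume_interrupts();
    check_for_interrupts();     /* the deferred interrupt is serviced now */

    return deferred && interrupt_serviced;
}
```

In this model a failed remote command still gets its placement marked invalid before the pending interrupt is handled, which is the behavior the proposal hopes to get from the real macro pair.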
