Mesmer flaky instantiation on HPC cluster #733

@colganwi

Describe the bug
It seems that Mesmer opens the TF SavedModel in write mode, which means that multiple processes cannot load Mesmer simultaneously. This results in flaky instantiation when running Mesmer in parallel on an HPC cluster.

To Reproduce
Run the code below with >20 cores. If one core is currently loading Mesmer, the other cores will throw "Read less bytes than requested" or a number of other errors.

Code:

import time

from deepcell.applications import Mesmer

attempts = 10
model = None
for attempt in range(attempts):
    try:
        model = Mesmer()
        break  # If successful, exit the loop
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)  # Wait before retrying
if model is None:
    print(f"Failed to initialize Mesmer after {attempts} attempts.")
else:
    print("Model initialized successfully.")
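
The fixed 10-second sleep in the loop above means that tasks which start together also retry together. A minimal sketch of the same retry idea with randomized, exponentially growing delays is below. `retry_with_jitter`, its parameters, and the delay values are illustrative assumptions and not part of deepcell. The constructor is passed in as `factory` so the helper does not depend on deepcell itself.

```python
import random
import time


def retry_with_jitter(factory, attempts=10, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call factory() until it succeeds, sleeping a random delay between tries.

    Hypothetical helper: the names and default values are assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return factory()
        except Exception as e:  # broad on purpose: the observed errors vary
            last_error = e
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < attempts - 1:
                # Upper bound doubles each attempt and is capped at max_delay;
                # the actual delay is drawn uniformly from [0, upper bound].
                upper = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, upper))
    raise RuntimeError(f"Failed after {attempts} attempts") from last_error
```

With deepcell installed, `model = retry_with_jitter(Mesmer)` would stand in for the loop. This only lowers the chance of two tasks colliding; it does not address the underlying cause.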

Running:

#!/bin/bash
# Configuration values for SLURM job submission.
#SBATCH --job-name=mesmer
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8gb
#SBATCH --array=1-400%50

FOV=$(($SLURM_ARRAY_TASK_ID - 1))
echo "FOV: ${FOV}"

source activate deepcell-env
python run_mesmer.py

Error:

INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:00.092152: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wcolgan/miniconda3/envs/py10-env/lib/python3.10/site-packages/cv2/../../lib64:
2024-09-19 08:27:00.092216: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2024-09-19 08:27:00.092255: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c3b7): /proc/driver/nvidia/version does not exist
2024-09-19 08:27:00.092772: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-19 08:27:10.302641: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:27:31.767108: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
2024-09-19 08:28:14.570058: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at save_restore_v2_ops.cc:222 : OUT_OF_RANGE: Read less bytes than requested
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
INFO:root:Checking for cached data
INFO:root:Checking MultiplexSegmentation-9.tar.gz against provided file_hash...
INFO:root:MultiplexSegmentation-9.tar.gz with hash a1dfbce2594f927b9112f23a0a1739e0 already available.
INFO:root:Extracting /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz
INFO:root:Successfully extracted /home/wcolgan/.deepcell/models/MultiplexSegmentation-9.tar.gz into /home/wcolgan/.deepcell/models
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Attempt 1 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 2 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 3 failed: Read less bytes than requested
Attempt 4 failed: Read less bytes than requested [Op:RestoreV2]
Attempt 5 failed: Read less bytes than requested
Attempt 6 failed: Read less bytes than requested
Attempt 7 failed: Read less bytes than requested
Model initialized successfully.

Expected behavior
Instantiating Mesmer should be reliable and should not involve any file locks or write operations.
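
Until instantiation is safe under concurrency, one possible stopgap on the caller's side is to serialize the load across processes with an advisory file lock, so that only one task touches the model directory at a time. The sketch below assumes Linux and uses `fcntl.flock` from the standard library. `exclusive_lock`, `load_serialized`, and the lock file path are hypothetical names, and whether `flock` is honored across nodes depends on how the shared filesystem is mounted, which I have not verified on this cluster.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the enclosed block.

    Hypothetical helper; the lock file is created if it does not exist.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def load_serialized(factory, lock_path):
    """Run factory() while holding the lock, so loads happen one at a time."""
    with exclusive_lock(lock_path):
        return factory()
```

With deepcell installed, usage would look like `model = load_serialized(Mesmer, os.path.expanduser("~/.deepcell/mesmer.lock"))`, where the lock file name is an arbitrary choice. The trade-off is that 50 concurrent array tasks would then load one after another rather than in parallel.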

Desktop (please complete the following information):

  • OS: Linux c4b2 5.4.0-137-generic 154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Python Version: 3.10.13

