Cannot train search-r1 model based on skyrl #154

@HelloWorldLTY

Hi there, I hit an error when trying to train search-r1 with SkyRL:

  File "/gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpC6Ti8f/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpC6Ti8f/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpC6Ti8f/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ImportError): ray::skyrl_entrypoint() (pid=841440, ip=10.190.168.34)
  File "/gpfs/radev/pi/ying_rex/tl688/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 296, in skyrl_entrypoint
    exp.run()
  File "/gpfs/radev/pi/ying_rex/tl688/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 287, in run
    trainer = self._setup_trainer()
              ^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/pi/ying_rex/tl688/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 251, in _setup_trainer
    from skyrl_train.workers.fsdp.fsdp_worker import PolicyWorker, CriticWorker, RefWorker, RewardWorker
  File "/tmp/ray/session_2025-09-14_18-32-41_325053_837954/runtime_resources/working_dir_files/_ray_pkg_1d35ea9510ec3692/skyrl_train/workers/fsdp/fsdp_worker.py", line 17, in <module>
    from skyrl_train.models import Actor, get_llm_for_sequence_regression
  File "/tmp/ray/session_2025-09-14_18-32-41_325053_837954/runtime_resources/working_dir_files/_ray_pkg_1d35ea9510ec3692/skyrl_train/models.py", line 19, in <module>
    from flash_attn.bert_padding import pad_input, unpad_input
  File "/gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpei0sQS/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpei0sQS/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
    import flash_attn_2_cuda as flash_attn_gpu
ImportError: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /gpfs/radev/home/tl688/.cache/uv/builds-v0/.tmpei0sQS/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

It looks like the flash_attn build that uv installs needs a newer glibc (GLIBC_2.32) than the /lib64/libc.so.6 on my system provides. Since uv handles the installation automatically, how can I fix this? The other training scripts in SkyRL work fine on my end.
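To help confirm the mismatch, here is a minimal sketch that compares the node's glibc version against the GLIBC_2.32 requirement from the ImportError above. The helper names (`parse_version`, `glibc_is_new_enough`) are made up for illustration and are not part of SkyRL or flash_attn; the check would need to run on the same node where the Ray worker fails.

```python
import platform

# Required version taken from the ImportError above (GLIBC_2.32).
REQUIRED_GLIBC = (2, 32)


def parse_version(text):
    """Turn a version string like '2.28' into a tuple of ints, e.g. (2, 28)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_new_enough(found, required=REQUIRED_GLIBC):
    """Return True if the found glibc version tuple meets the requirement."""
    return found >= required


if __name__ == "__main__":
    # platform.libc_ver() returns (lib, version), e.g. ('glibc', '2.28');
    # both fields are empty strings when the C library cannot be detected.
    lib, version = platform.libc_ver()
    found = parse_version(version)
    print(f"C library: {lib or 'unknown'} {version or 'unknown'}")
    if not found:
        print("Could not determine the glibc version on this node.")
    elif glibc_is_new_enough(found):
        print("glibc satisfies the GLIBC_2.32 requirement.")
    else:
        print("glibc is older than 2.32, which matches the ImportError.")
```

If this reports a version below 2.32, the prebuilt `flash_attn_2_cuda` extension cannot load on that node regardless of how uv resolves the package.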
