
core dump error because of parameter "n_flows_max" #5

@horser1

Running the following command:

./bin/SimAI_m4 -w ./example/microAllReduce.txt -n ./Spectrum-X_128g_8gps_100Gbps_A100.txt

causes a core dump:

terminate called after throwing an instance of 'c10::IndexError'
  what():  select(): index 50400 out of range for tensor of size [50000] at dimension 0
Exception raised from select_symint at ../aten/src/ATen/native/TensorShape.cpp:1845 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7effbf460f86 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11606df (0x7effa25bd6df in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x2e2c393 (0x7effa4289393 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x2e2e099 (0x7effa428b099 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::_ops::select_int::redispatch(c10::DispatchKeySet, at::Tensor const&, long, c10::SymInt) + 0xc5 (0x7effa3e77c05 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x50e26bc (0x7effa653f6bc in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x50e2aec (0x7effa653faec in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::_ops::select_int::redispatch(c10::DispatchKeySet, at::Tensor const&, long, c10::SymInt) + 0xc5 (0x7effa3e77c05 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x4a00749 (0x7effa5e5d749 in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x4a00edc (0x7effa5e5dedc in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::_ops::select_int::call(at::Tensor const&, long, c10::SymInt) + 0x1ad (0x7effa3eded7d in /home/ma/m4/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x1ec5c (0x55df7fe91c5c in ./bin/SimAI_m4)
frame #12: <unknown function> + 0x21546 (0x55df7fe94546 in ./bin/SimAI_m4)
frame #13: <unknown function> + 0x37fb9 (0x55df7feaafb9 in ./bin/SimAI_m4)
frame #14: <unknown function> + 0x1e20b (0x55df7fe9120b in ./bin/SimAI_m4)
frame #15: <unknown function> + 0x1e3b0 (0x55df7fe913b0 in ./bin/SimAI_m4)
frame #16: <unknown function> + 0x1ea08 (0x55df7fe91a08 in ./bin/SimAI_m4)
frame #17: <unknown function> + 0x1b544 (0x55df7fe8e544 in ./bin/SimAI_m4)
frame #18: __libc_start_main + 0xf3 (0x7eff6bd21083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: <unknown function> + 0x1d43e (0x55df7fe9043e in ./bin/SimAI_m4)

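I found that this is caused by the parameter "n_flows_max" in M4.cc: the workload appears to need more flow slots than this limit allows (index 50400 against a tensor of size 50000).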
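To make the suspected failure mode concrete, here is a minimal sketch in plain C++. It is not the code in M4.cc: `FlowTable`, `fits`, and the use of `std::vector` in place of a torch tensor are hypothetical stand-ins, and it assumes per-flow state is pre-allocated with `n_flows_max` slots. A check like this would report the limit directly instead of failing deep inside `select()`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for per-flow state pre-allocated with n_flows_max slots.
// (Assumption: M4.cc sizes its tensors this way; this class is illustrative only.)
class FlowTable {
public:
    explicit FlowTable(std::size_t n_flows_max) : slots_(n_flows_max, 0.0f) {}

    std::size_t capacity() const { return slots_.size(); }

    // Checked access: names the flow id and the limit in the error message.
    float& at(std::size_t flow_id) {
        if (flow_id >= slots_.size()) {
            throw std::out_of_range(
                "flow id " + std::to_string(flow_id) +
                " exceeds n_flows_max = " + std::to_string(slots_.size()));
        }
        return slots_[flow_id];
    }

private:
    std::vector<float> slots_;
};

// Returns true when a workload with n_flows flows fits in the table.
bool fits(const FlowTable& table, std::size_t n_flows) {
    return n_flows <= table.capacity();
}
```

With a table built as `FlowTable table(50000)`, calling `table.at(50400)` throws `std::out_of_range` with a message that mentions both the flow id and the limit, which matches the shape of the error above.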

The gdb history is attached:
gdb.txt
If I set a higher value (such as 200000), are there any side effects?
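One side effect I would expect is higher memory use, if the limit controls how many per-flow slots are pre-allocated. The sketch below is only a rough estimate under that assumption: `state_bytes` and `floats_per_flow` are hypothetical names, and the real per-flow state size in M4.cc is not known to me.

```cpp
#include <cstddef>

// Rough memory estimate for pre-allocated per-flow state.
// Assumption: each flow slot stores `floats_per_flow` 32-bit floats;
// the actual layout in M4.cc may differ.
std::size_t state_bytes(std::size_t n_flows_max, std::size_t floats_per_flow) {
    return n_flows_max * floats_per_flow * sizeof(float);
}
```

Under this model, going from 50000 to 200000 multiplies the pre-allocated state by four, whatever the per-flow size turns out to be.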
