diff --git a/alternative-method.md b/alternative-method.md new file mode 100644 index 0000000..5e62428 --- /dev/null +++ b/alternative-method.md @@ -0,0 +1,110 @@ +This is a copy of the instructions I have committed [here](https://github.com/realKarthikNair/16-xf0xxx-linux-troubleshooting/blob/main/fedora39-tensorflow-gpu.md) + +1. Add negativo17/fedora.nvidia and rpmfusion-nonfree-release repos + +```bash +sudo dnf config-manager --add-repo=https://negativo17.org/repos/fedora-nvidia.repo +sudo dnf install \ + https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm +sudo dnf upgrade +``` + +2. Remove any existing nvidia thing you have (knowing what you're doing before running this command is a bonus) + +```bash +sudo dnf remove *nvidia* +``` +3. Install nvidia-driver + +```bash +sudo dnf install nvidia-driver +``` + +4. Install cuda drivers and cuda development packages and remove xorg-x11-drv-nvidia-cuda and xorg-x11-drv-nvidia-cuda-libs from rpmfusion-nonfree + +```bash +sudo dnf remove xorg-x11-drv-nvidia-cuda +sudo dnf remove xorg-x11-drv-nvidia-cuda-libs +sudo dnf install nvidia-driver-cuda +sudo dnf install cuda-devel +sudo dnf remove xorg-x11-drv-nvidia-cuda +sudo dnf remove xorg-x11-drv-nvidia-cuda-libs +``` + +5. Install cudnn and cudnn development packages + +```bash +sudo dnf install cuda-cudnn.x86_64 cuda-cudnn-devel.x86_64 +``` + +6. If you are on the latest version of Fedora (I am on 39 as of writing this), the default version of Python would be the latest version which won't support tensorflow. So install an older version. + +```bash +sudo dnf install python310 +python3.10 -m ensurepip +python3.10 -m pip install tensorrt #tensorrt doesnt exactly work on fedora 39 on latest nvidia drivers even with nvidia's official tensorrt package on their website but its good to have this module (i mean i couldnt make it work atleast) +``` + +7. ```cd``` to your python project path + +```bash +virtualenv -p `which python3.10` .venv +source .venv/bin/activate +pip uninstall tensorflow[and-cuda] +sudo dnf install mlocate +sudo updatedb +cp $(locate libdevice.10.bc) . +``` + +Congratulations! You can now use tensorflow with CUDA on Fedora 39! + +Footnotes: + +1. Sometimes Nvidia UVM bugs out so just remove the module right before using anything that uses CUDA or cuDNN. +The kernel will reload it automatically when you run a program that requires it. Sometimes it still fails for no reason. In that case just spam the command a few times on the terminal (I don't know how but it works) + +```bash +sudo modprobe -r nvidia_uvm #you might need this sometimes to reload +``` + +2. If you are recieving errors such as `2024-01-10 23:40:45.421281: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered`, install `tf-nightly[and-cuda]` + +```bash +# Assuming you are on a virtual environment +pip uninstall tensorflow[and-cuda] +pip install tf-nightly[and-cuda] +``` + +3. If you are recieving warnings like ` successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355`, this is caused by a bug that resets `numa_node` to -1 after reboots : + + Fix 1 (temporary, needs to be run manually after every reboot, recommended if you are new to linux) + + 1. Identify the PCI-ID of your GPU + 2. Set 0 to `/sys/bus/pci/devices//numa_node + + For example + ```bash + karthik@fedora:~$ lspci -D | grep NVIDIA + 0000:01:00.0 VGA compatible controller: NVIDIA Corporation AD107M [GeForce RTX 4060 Max-Q / Mobile] (rev a1) + 0000:01:00.1 Audio device: NVIDIA Corporation Device 22be (rev a1) + karthik@fedora:~$ sudo su + [sudo] password for karthik: + root@fedora:/home/karthik# echo 0 | tee -a "/sys/bus/pci/devices/0000:01:00.0/numa_node" + 0 + ``` + + + Fix 2 (same as Fix 1 but permanent) [credit](https://stackoverflow.com/a/70225257) + + ```bash + # 1) Identify the PCI-ID (with domain) of your GPU + # For example: PCI_ID="0000.81:00.0" + lspci -D | grep NVIDIA + # 2) Add a crontab for root + sudo crontab -e + # Add the following line + @reboot (echo 0 | tee -a "/sys/bus/pci/devices//numa_node") + ``` + + +4. TODO: figure out how to make tensorrt work