How to Use NVIDIA GPUs for AI Training

I’m trying to set up NVIDIA GPUs for AI training, but I ran into problems with drivers, CUDA, and getting my framework to detect the GPU correctly. I expected training to run faster, but right now nothing is working the way it should. I need help figuring out the right setup so I can start training models without wasting more time troubleshooting.

Start from the bottom and verify each layer.

  1. Check the GPU.
    Run:
    nvidia-smi

If this fails, your driver is broken or missing. Fix this first. On Windows, install the latest driver from NVIDIA's download page. On Ubuntu, use:
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall

Reboot. Run nvidia-smi again. You should see your card, driver version, and VRAM.

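  2. Check CUDA compatibility.
    Your framework needs a CUDA build that the driver supports. The "CUDA Version" shown in the nvidia-smi header is the highest CUDA version the driver can run, not an installed toolkit. Example:
    PyTorch 2.3 built for CUDA 12.1 needs a driver new enough for CUDA 12.x. If the driver is too old, GPU detection fails.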

  3. Install the framework with GPU support.
    For PyTorch:
    pip uninstall torch torchvision torchaudio
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

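For TensorFlow on recent Linux, a plain "pip install tensorflow" does not pull in the CUDA libraries. Use the GPU extra instead:
pip install 'tensorflow[and-cuda]'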

  4. Test detection.
    PyTorch:
    python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu')"

TensorFlow:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

  5. Common gotchas.
    You installed CPU-only wheels.
    Your conda env uses old libs.
    CUDA toolkit is not needed for most pip installs, but drivers are.
    WSL needs its own setup.
    Laptop users sometimes run on integrated graphics.
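
Two of these checks, the CPU-only wheel and the driver/CUDA match from step 2, can be scripted. Here is a minimal sketch, assuming PyTorch. It relies on torch.version.cuda being None on CPU-only builds and on the nvidia-smi banner printing the highest CUDA version the driver supports. The helper names (driver_cuda_version, check_build) are made up for illustration, and comparing only the major version is a rough rule of thumb, not NVIDIA's full compatibility matrix.

```python
import re
import shutil
import subprocess


def driver_cuda_version(smi_text):
    """Pull 'CUDA Version: X.Y' out of the nvidia-smi banner, or return None."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def check_build(build_cuda, driver_cuda):
    """build_cuda: value of torch.version.cuda ('12.1', or None on CPU-only wheels).
    driver_cuda: tuple from driver_cuda_version(), or None if it could not be read."""
    if build_cuda is None:
        return "CPU-only wheel: reinstall from a CUDA index URL"
    if driver_cuda is None:
        return "could not read the driver's CUDA version: fix nvidia-smi first"
    build_major = int(build_cuda.split(".")[0])
    # Rough rule (assumption): the driver must support at least the build's major version.
    if driver_cuda[0] < build_major:
        return "driver too old for this build: update the driver"
    return "driver and build look compatible"


if __name__ == "__main__":
    smi = ""
    if shutil.which("nvidia-smi"):
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    try:
        import torch
    except ImportError:
        print("torch is not installed in this env")
    else:
        print(check_build(torch.version.cuda, driver_cuda_version(smi)))
```

If it reports a CPU-only wheel, the uninstall/reinstall from step 3 is the fix. If it reports an old driver, go back to step 1.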

If nvidia-smi works but your framework does not, post:
OS, GPU model, driver version, Python version, framework version, and output of those test commands. That info tells the story fast.
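
If you'd rather not collect all of that by hand, something like the following gathers most of it in one go. It's only a sketch: the function names are invented here, the nvidia-smi query fields (name, driver_version, memory.total) are the ones I'd expect to be available on current drivers, and it only looks for the torch and tensorflow packages.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def pkg_version(name):
    """Installed version of a package in this env, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def gpu_summary():
    """GPU model, driver version and VRAM as reported by nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found on PATH"
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return proc.stdout.strip() or proc.stderr.strip()


def collect_report():
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        # True inside a venv; conda envs are not detected by this check.
        "in_venv": sys.prefix != sys.base_prefix,
        "torch": pkg_version("torch"),
        "tensorflow": pkg_version("tensorflow"),
        "gpu": gpu_summary(),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Run it with the same Python interpreter you train with, otherwise it reports on the wrong environment.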

I’d add one thing to what @sterrenkijker said: don’t go install the full CUDA toolkit right away unless you actually need it for compiling custom ops or building from source. A lot of people make the setup worse by stacking system CUDA, conda CUDA, pip wheels, cuDNN zip files, and then wondering why nothing detects the GPU. Been there, very dumb experience.

What usually works better is:

  • install a proper NVIDIA driver
  • use a fresh venv or conda env
  • install the framework’s GPU build only
  • test with a tiny script
  • only then touch CUDA toolkit stuff
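
For the "tiny script" step, this is roughly what I mean, assuming PyTorch. The function name and the matrix size are arbitrary. It runs one real matrix multiply on the GPU, because is_available() returning True is a weaker signal than a kernel actually executing.

```python
import importlib.util


def tiny_gpu_test(size=1024):
    """Run one small matmul on the first CUDA device and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return {"status": "torch not installed in this env"}
    import torch

    if not torch.cuda.is_available():
        return {
            "status": "torch imported, but no CUDA device visible",
            "torch": torch.__version__,
            "built_for_cuda": torch.version.cuda,  # None means CPU-only wheel
        }
    dev = torch.device("cuda:0")
    a = torch.randn(size, size, device=dev)
    b = torch.randn(size, size, device=dev)
    c = a @ b
    torch.cuda.synchronize()  # wait for the kernel to actually finish
    return {
        "status": "ok",
        "device": torch.cuda.get_device_name(0),
        "result_on": str(c.device),
    }


if __name__ == "__main__":
    print(tiny_gpu_test())
```

If this prints "ok" in a fresh env, the setup is done and anything left is a workload question.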

Also, training not being “faster” can be normal if:

  • your batch size is tiny
  • data loading is the bottleneck
  • model is too small
  • you’re accidentally running on CPU for preprocessing
  • laptop GPU is power-limited
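
A rough way to check the data-loading case: time how long you wait for each batch versus how long the training step takes. The sketch below is framework-agnostic and the names (time_split, step_fn) are made up for illustration. One caveat for real GPU code: kernels launch asynchronously, so step_fn should end with something like torch.cuda.synchronize() or the compute time will look smaller than it is.

```python
import time


def time_split(batches, step_fn, max_steps=50):
    """Split wall time into 'waiting for data' and 'running the step'."""
    load = compute = 0.0
    steps = 0
    it = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent here is the training step
        t2 = time.perf_counter()
        load += t1 - t0
        compute += t2 - t1
        steps += 1
    return {"steps": steps, "load_s": load, "compute_s": compute}


if __name__ == "__main__":
    def slow_loader():
        # Stand-in for a slow data pipeline.
        for i in range(5):
            time.sleep(0.01)
            yield i

    print(time_split(slow_loader(), lambda batch: None))
```

If load_s dominates, a faster GPU won't help. Look at the data pipeline first (more loader workers, caching, cheaper preprocessing).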

Another thing people miss: if nvidia-smi works, that does not prove your Python env is sane. It only proves the driver layer is alive. The Python side can still be totally borked.

If you want the fastest path, nuke the env and recreate it clean. Honestly, that's often faster than debugging dependency soup for 4 hours. If you post exact error logs, people can probably spot the problem pretty quick.

One thing I’d push a bit differently from @sterrenkijker: on Linux, if you installed drivers from the NVIDIA .run file, I’d seriously consider removing that and using your distro’s packaged driver instead. The .run installer is a classic source of weird breakage after kernel updates.

My usual sanity chain is:

  1. lspci | grep -i nvidia
  2. nvidia-smi
  3. python -c 'import torch; print(torch.cuda.is_available())'
  4. python -c 'import torch; print(torch.cuda.get_device_name(0))'

If step 2 fails, it’s system level.
If step 2 works and step 3 fails, it’s environment/framework mismatch.
If step 3 works but training is still slow, it’s probably workload related, not setup.
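
The same decision logic as a small script, in case you want something to run after every driver or env change. This is a sketch assuming PyTorch. The function names are mine, and it skips the lspci step because that is Linux-only.

```python
import shutil
import subprocess


def smi_works():
    """Step 2: does nvidia-smi exist and exit cleanly?"""
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0


def torch_cuda_works():
    """Step 3: can PyTorch in this env see a CUDA device?"""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def diagnose(smi_ok, cuda_ok):
    """Map the two check results onto the three cases above."""
    if not smi_ok:
        return "system level: driver or kernel module"
    if not cuda_ok:
        return "environment/framework mismatch"
    return "setup looks fine: if training is slow, look at the workload"


if __name__ == "__main__":
    print(diagnose(smi_works(), torch_cuda_works()))
```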

Also check this because people overlook it:

  • Secure Boot can block kernel modules
  • WSL needs its own supported path, not random Linux instructions
  • Docker needs NVIDIA Container Toolkit or the container will see no GPU
  • mixed package managers cause pain, especially apt + conda + pip all fighting

Pros for a clean setup: reproducible, easier upgrades, fewer mystery crashes.
Cons: reinstalling envs is annoying, pinned versions can feel restrictive, old GPUs have limited support.

For AI training specifically, monitor actual utilization, not vibes. nvidia-smi -l 1 will show if the GPU is busy or just allocated. High VRAM use does not automatically mean high compute use.
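
If you want numbers you can log instead of watching the terminal, here is a sketch that polls nvidia-smi from Python. The query fields (utilization.gpu, memory.used, memory.total) and the csv,noheader,nounits format are what I'd expect on current drivers. The helper names and the 20% "idle" threshold are arbitrary choices of mine.

```python
import shutil
import subprocess
import time

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_sample(line):
    """Parse one CSV line such as '97, 20480, 24576' (percent, MiB, MiB)."""
    util, used, total = (int(x.strip()) for x in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def looks_idle(sample, util_threshold=20):
    """Memory is allocated but the GPU is doing little compute."""
    return sample["mem_used_mib"] > 0 and sample["util_pct"] < util_threshold


def poll(n=5, interval=1.0):
    """Take n readings, one list of per-GPU samples per reading."""
    if shutil.which("nvidia-smi") is None:
        return []
    readings = []
    for _ in range(n):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout
        samples = []
        for line in out.splitlines():
            try:
                samples.append(parse_sample(line))
            except ValueError:
                pass  # skip lines with unsupported or non-numeric fields
        readings.append(samples)
        time.sleep(interval)
    return readings
```

Call poll() in a second terminal while training runs. If looks_idle() is true for most samples, the GPU is mostly waiting, which usually points back at data loading or a tiny batch size.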