Run Frontier AI on Your Gaming PC
The AI technology behind AlphaGo is no longer just for massive data centers. This tutorial shows you how to use your NVIDIA gaming PC to teach an AI model to master a game from scratch.
The AI Revolution Just Hit Your Gaming Rig
Superhuman game-playing AI used to live in research papers and windowless data centers. AlphaGo, OpenAI Five, DeepMind’s StarCraft II bots—systems like these burned through thousands of GPUs and research budgets that looked like small IPOs. Now, the same reinforcement learning playbook that beat Go grandmasters can run on a single RTX-powered gaming PC under your desk.
For years, training agents to conquer games or drive cars demanded clusters that cost millions of dollars. You needed racks of accelerators, custom networking, and a team of PhDs babysitting brittle pipelines. Today, an RTX AI PC with a consumer NVIDIA GPU can chew through the same category of algorithms locally, trading scale for accessibility and putting frontier-style experimentation in reach of solo developers.
That shift is what this hands-on guide explores. With NVIDIA sponsoring the build, we’re using an RTX AI PC as a proving ground for local reinforcement learning, following Matthew Berman’s “Reinforcement Learning Tutorial - RLVR with NVIDIA & Unsloth.” The goal is not a toy demo that just replays scripted moves, but a genuine learning system that improves through trial and error.
The recipe leans on RLVR—Reinforcement Learning with Verifiable Rewards—running on Unsloth’s highly optimized training stack. Instead of a human clicking “good” or “bad” on model outputs, a reward function automatically scores each move, removing humans from the loop. That same pattern underpins how frontier labs harden models on verifiable tasks like math, coding, and games.
To make this concrete, we’ll train an AI to master the puzzle game 2048 starting from zero knowledge. The agent begins as a base GPT-OSS model that does not know the rules, the goal, or any strategies. Through thousands of self-play interactions, a reward function nudges it toward better tile merges, higher scores, and eventually consistent wins.
You will see how to set this up end to end on a gaming rig: NVIDIA App, CUDA Toolkit, WSL, Unsloth, and the 2048 Notebook, all running locally. By the end, your PC won’t just play games; it will train an AI to beat them.
Beyond Human Feedback: The Power of RLVR
Reinforcement learning sounds fancy, but the core idea feels familiar: an agent pokes at an environment, gets rewarded or punished, and slowly figures out what works. Imagine a dog learning tricks, except the “dog” is a neural network and the “tricks” are moves in a game, lines of code, or steps in a math proof. Every action updates the model’s internal policy so it picks higher‑reward actions more often next time.
Traditional reinforcement learning needed huge clusters to play millions of games of chess, Go, or StarCraft. Now, RTX‑class GPUs shrink that loop onto a gaming PC, and a newer twist called Reinforcement Learning with Verifiable Rewards (RLVR) makes the whole process dramatically more scalable. Instead of humans scoring behavior, a programmatic “verifier” hands out rewards automatically.
RLVR replaces a human in the loop with a strict, machine‑checkable rule. You define a reward function that says, “Given the environment state and the model’s action, compute a numeric score.” No vibes, no opinions—just math. If the outcome matches what the rules consider correct, the model gets points; if not, it loses them.
The 2048 demo from Matthew Berman’s Reinforcement Learning Tutorial uses this idea in its purest form. The environment is the 4x4 grid; the actions are swipes up, down, left, right. The verifier is literally the game’s code, which can:
- Reject illegal moves
- Add reward when tiles merge and the score increases
- Penalize moves that stall or end the game early
Because the game engine already knows the score and whether you lost, it can act as an objective judge for every move. Start with GPT‑OSS, a model that has never “seen” 2048 strategy, and after enough RLVR updates it starts chaining moves that consistently produce higher‑value tiles and avoid filling the board. No human ever labels a “good” or “bad” turn.
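To make that judging concrete, here is a minimal sketch of what a verifier-style reward for a single 2048 move could look like. It is an illustration of the pattern rather than the notebook’s actual reward code, and the input names (`board_changed`, `score_before`, `score_after`, `game_over`) are just stand-ins for whatever the real game engine reports.

```python
def move_reward(board_changed: bool, score_before: int, score_after: int, game_over: bool) -> float:
    """Illustrative verifier for one 2048 move; all inputs are hypothetical names
    for values the real game engine would report."""
    if not board_changed:
        return -1.0                              # illegal or no-op move: penalize
    reward = float(score_after - score_before)   # merge points, taken straight from the game's own score
    if game_over:
        reward -= 10.0                           # ending the game costs extra
    return reward

# A legal move that merged two 64-tiles into a 128 earns the score delta as its reward
print(move_reward(board_changed=True, score_before=1024, score_after=1152, game_over=False))  # 128.0
```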
That stands in sharp contrast to Reinforcement Learning from Human Feedback (RLHF), where people compare model outputs and train a reward model to mimic their preferences. RLHF works for fuzzy goals—politeness, helpfulness, tone—but it scales badly and bakes in bias. RLVR thrives whenever tasks have verifiable outcomes: math benchmarks like GSM8K, code that either compiles and passes tests or doesn’t, games like 2048, chess, and Go. For those, automated verifiers plus tools like Unsloth and RTX GPUs turn your gaming PC into a frontier‑style training lab.
Your Home Lab: Gearing Up for Local RL
Frontier RL on a gaming PC starts with a short hardware and software checklist, not a research lab. You need an NVIDIA RTX GPU, the latest NVIDIA App for drivers, the CUDA Toolkit, and Windows Subsystem for Linux (WSL) running Ubuntu. That stack mirrors what Matthew Berman uses in his Reinforcement Learning Tutorial to train GPT-OSS on the 2048 game.
You do not need an RTX 5090 monster card. Any recent RTX GPU with Tensor Cores works: RTX 3060, 3070, 4070, or a laptop RTX 40-series will all run RLVR; training just scales with cores, VRAM, and power. Expect slower iteration on midrange cards, but the exact same code path and results.
Think of the RTX GPU as the RL workhorse. It crunches matrix multiplications for policy updates and environment rollouts, turning millions of 2048 moves into gradients. More VRAM lets you bump batch sizes, context windows, or model size without out-of-memory crashes.
CUDA sits one layer above the silicon. The CUDA Toolkit provides the parallel computing runtime and libraries (cuBLAS, cuDNN) that frameworks like PyTorch and Unsloth lean on. Without CUDA, your “GPU-accelerated” RL session silently falls back to CPU and crawls.
WSL completes the picture by giving Windows users a real Linux environment without dual-booting. You install Ubuntu via WSL, then run Python, Jupyter, Unsloth, and the GPT-OSS RLVR notebook exactly as the Unsloth Docs describe. Command-line tools like `nvidia-smi` confirm that WSL can see your RTX GPU.
Here is the minimal setup checklist with official links, matching the video’s resources:
- NVIDIA App: https://www.nvidia.com/en-eu/software/nvidia-app/
- CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
- WSL + Ubuntu instructions (via Unsloth Docs): https://docs.unsloth.ai/get-started/install-and-update/windows-installation
- Unsloth: https://unsloth.ai/
- Unsloth Docs RLVR tutorial: https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning/tutorial-how-to-train-gpt-oss-with-rl

For deeper theory on policies, rewards, and GRPO, Unsloth’s Reinforcement Learning (RL) Guide in the Unsloth Documentation connects the hardware you just set up to the algorithms you are about to run.
The WSL Bridge: Why Linux on Windows Is Your Best Bet
WSL acts as a bridge between your Windows gaming rig and the Linux-first AI ecosystem that tools like Unsloth expect. After testing multiple approaches—native Windows Python, full dual-boot, Docker on Windows—WSL came out ahead for stability, GPU support, and not wrecking your existing setup. You keep your everyday Windows workflow while gaining a near-native Linux environment for RLVR experiments.
Installation boils down to a single command in PowerShell or Windows Terminal, run as Administrator: `wsl.exe --install ubuntu-24.04`. Windows downloads the Linux kernel, sets up Ubuntu 24.04, and prompts you to create a Unix username and password the first time it launches.

Once Ubuntu boots inside WSL, you want to confirm two things: Linux is actually running, and it can see your RTX GPU. In the Ubuntu shell, type `nvidia-smi`. If everything worked, you’ll see a table listing your NVIDIA GPU (e.g., “GeForce RTX 5090”), driver version, and CUDA version instead of an error.

You can also verify that you’re inside WSL by running `wsl.exe --status` from a Windows terminal, or by checking that your Linux prompt shows a typical path like `/home/username` instead of `C:\Users\...`. If `nvidia-smi` fails, fix drivers and CUDA on Windows before touching any RL code.
For anyone who has never touched Linux, WSL is not a scary “second operating system.” It behaves more like a secure, sandboxed dev container that lives alongside your Windows apps. You can open VS Code, your browser, and your game launcher in Windows while your RL training jobs crunch away inside Ubuntu.
This containerized model also reduces risk. You can install, break, and wipe Python environments, CUDA-compatible libraries, and experimental RLVR stacks without polluting your main Windows install. When Unsloth Docs, the Reinforcement Learning Tutorial, or future toolchains assume “Linux + CUDA,” WSL quietly satisfies that requirement on your existing RTX PC.
Unleash Unsloth: The Secret to Blazing-Fast Training
Unsloth sits at the center of this whole local RLVR stack. The open-source library has racked up nearly 50,000 GitHub stars, not because of hype, but because it makes training large language models on consumer GPUs actually practical instead of masochistic.
Traditional fine-tuning often slams into your VRAM ceiling fast. Unsloth sidesteps that by cutting memory usage by more than 60% and squeezing more useful work out of every CUDA core, which translates into noticeably faster training runs on the same RTX card.
The trick: Unsloth leans hard on LoRA (Low-Rank Adaptation) and custom CUDA kernels. LoRA keeps most of a model’s weights frozen and only learns a tiny set of low-rank adapters, so you can fine-tune 7B–20B parameter models on a single gaming GPU without watching your system thrash or crash.
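A quick back-of-the-envelope example shows why that works. A full fine-tune learns an entire d × d update for each weight matrix, while LoRA learns two skinny matrices of rank r, shrinking the trainable count per layer from d² to 2·d·r. The dimensions below are illustrative, not GPT-OSS’s actual layer sizes:

```python
# Illustrative LoRA parameter math for one square weight matrix (dimensions are made up)
d = 4096          # hidden size of a hypothetical projection layer
r = 16            # LoRA rank

full_update = d * d        # fine-tuning the whole matrix
lora_update = 2 * d * r    # two low-rank adapters: (d x r) and (r x d)

print(f"full:  {full_update:,} trainable params")      # 16,777,216
print(f"LoRA:  {lora_update:,} trainable params")      # 131,072
print(f"ratio: {full_update // lora_update}x fewer")   # 128x
```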
Optimized kernels handle the heavy tensor math far more efficiently than stock PyTorch ops. That means tighter GPU utilization, fewer memory copies, and less overhead per step — exactly what you want when you’re running thousands of RLVR rollouts inside a Jupyter notebook on your desktop.
Installation inside your WSL environment stays refreshingly boring. Once your Python virtualenv is active and PyTorch is installed with CUDA support, you run a single command: `pip install unsloth` and WSL pulls the latest release from PyPI, no custom wheels or obscure flags required.
Because you’re inside WSL, Unsloth talks directly to the NVIDIA drivers and CUDA Toolkit you set up earlier. You get full access to your RTX GPU from Linux tooling while still living on a Windows desktop, which is exactly the hybrid workflow most home labs want.
Unsloth also ships with cutting-edge RL algorithms, including GRPO (Group Relative Policy Optimization). GRPO keeps the spirit of PPO but drops the baggage: it avoids separate reward and value models, which slashes memory use and simplifies the training loop.
That design makes GRPO dramatically more efficient than traditional PPO-style setups, especially for RLVR recipes where a verifier function scores outputs directly. For a 2048 agent or a math/code tutor, that means more rollouts per second, more gradient steps per hour, and faster improvement curves on the exact same hardware.
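The “group relative” part is simple enough to show in a few lines. For each prompt, GRPO samples a group of completions, scores each with the verifier, and treats every completion’s reward relative to the group’s mean (typically normalized by the group’s standard deviation) as its advantage, so no separate value network is needed. A minimal sketch with made-up reward numbers:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: judge each sample against its own group's statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled 2048 strategies for the same prompt, scored by the verifier (made-up numbers)
rewards = [12.0, 0.0, 48.0, 4.0]
print(group_relative_advantages(rewards))
# Completions above the group mean get positive advantages and are reinforced;
# the rest get negative advantages and are pushed down.
```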
Setting the Stage: Your First RL Training Run
Fresh WSL install ready, your next move is to carve out a clean Python sandbox so RL experiments do not collide with the rest of your system. Update Ubuntu’s packages, then pull in Python and venv support: `sudo apt update` followed by `sudo apt install python3 python3-full python3-pip python3-venv -y`. That stack gives you the tools to isolate dependencies and keep CUDA-friendly builds of PyTorch under control.
Create a dedicated virtual environment for RLVR work. From your home directory, run `python3 -m venv unslothrl` and then activate it with `source unslothrl/bin/activate`. Your prompt should now show `(unslothrl)`, signaling that any `pip install` goes into this self-contained bubble.
With the venv live, install a GPU-enabled PyTorch build that speaks CUDA. Follow the PyTorch CUDA wheel index or Unsloth’s guidance, for example: `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121`. After it finishes, sanity-check with `python -c "import torch; print(torch.cuda.is_available())"` and expect `True` on a properly configured RTX card.
Next, pull in the tools that make this feel like a modern ML lab. Install Jupyter Notebook and Unsloth in one shot: `pip install jupyter unsloth`. This combo gives you the RL training primitives plus a browser-based control panel to poke at every step of the 2048 agent’s brain.
You now need the actual 2048 RL recipe. Head to the OpenAI GPT-OSS notebook link used by Unsloth: the Reinforcement Learning Tutorial points to `reinforcement-fine-tuning.ipynb` hosted on Colab. Open it in your browser, hit File → Download, and save the `.ipynb` file into a folder your WSL instance can see, like your Linux home directory or a mounted Windows Downloads path.
Back in the WSL terminal, navigate to the directory containing the notebook and start Jupyter with `jupyter notebook`. The server prints a `http://localhost:8888/?token=...` URL; copy that into your Windows browser, and Jupyter’s file browser appears. Click the downloaded `.ipynb` to open the full RLVR 2048 pipeline.
Notebooks change how RL experimentation feels. You run the training stack cell-by-cell, tweak hyperparameters, fix a broken import, or restart just a single step without nuking a multi-hour job. This is the same iterative loop NVIDIA showcases for larger LLM work in guides like Train an LLM on NVIDIA Blackwell with Unsloth—and Scale for Production, just shrunk down to your gaming PC and a deceptively simple tile game.
Inside the Notebook: From Blank Slate to Game Master
Blank Jupyter cell, blinking cursor, RTX fans idling. You start by importing Unsloth, wiring it into PyTorch, and pointing it at OpenAI’s open-source GPT-OSS checkpoint. One line pulls in the 20B-parameter model; another attaches Unsloth’s GRPO-powered RLVR trainer that will eventually turn this generic text model into a 2048 specialist.
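Those opening cells typically follow Unsloth’s standard `FastLanguageModel` pattern. The sketch below shows the shape of it; the checkpoint name, sequence length, and LoRA settings here are assumptions, so defer to the notebook’s own cells.

```python
# Sketch of the notebook's opening cells using Unsloth's FastLanguageModel API.
# Checkpoint name and hyperparameters are assumptions; the notebook is the source of truth.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # assumed checkpoint id for the GPT-OSS base model
    max_seq_length=2048,               # room for the board-state prompt plus generated code
    load_in_4bit=True,                 # quantize frozen base weights to fit consumer VRAM
)

# Attach LoRA adapters so only a small set of low-rank weights gets trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```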
Next, the notebook quietly flexes a very 2025 trick: the entire 2048 game engine you’re about to use was written by an AI. The Python implementation of the grid, tile merges, and scoring logic comes from GPT-4, pulled from the official GPT-OSS 2048 example. AI-generated tools become the sandbox where another AI learns to play.
Before any training, you make sure the sandbox behaves. Early cells define a lightweight `Game2048` class, then instantiate a board and print it as a 4×4 matrix of integers. You can step through moves directly in the notebook, calling helper functions to slide tiles up, down, left, or right and watching the board update after each action.
Manual play isn’t just for fun; it sanity-checks the environment. You verify that:
- Invalid moves leave the board unchanged
- Valid moves merge equal tiles correctly
- The score and “game over” flag update as expected
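If you want to poke at the mechanics outside the notebook, here is a minimal, self-contained stand-in for such an environment. It is not the notebook’s GPT-4-written `Game2048`, just an illustrative toy with a single left-swipe move, enough to watch the slide-and-merge logic and the sanity checks above in action:

```python
import random

class Mini2048:
    """Toy 2048-style environment (illustrative stand-in, not the notebook's Game2048)."""

    def __init__(self, size=4, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.board = [[0] * size for _ in range(size)]
        self.score = 0
        self._spawn()
        self._spawn()

    def _spawn(self):
        """Drop a 2 (or occasionally a 4) onto a random empty cell."""
        empty = [(r, c) for r in range(self.size) for c in range(self.size) if self.board[r][c] == 0]
        if empty:
            r, c = self.rng.choice(empty)
            self.board[r][c] = 2 if self.rng.random() < 0.9 else 4

    def _merge_row_left(self, row):
        """Slide one row left, merging equal neighbours once; return the new row and points gained."""
        tiles = [v for v in row if v != 0]
        merged, gained, i = [], 0, 0
        while i < len(tiles):
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                merged.append(tiles[i] * 2)
                gained += tiles[i] * 2
                i += 2
            else:
                merged.append(tiles[i])
                i += 1
        return merged + [0] * (self.size - len(merged)), gained

    def move_left(self):
        """Apply a left swipe; return True only if the board actually changed (a legal move)."""
        changed = False
        for r in range(self.size):
            new_row, gained = self._merge_row_left(self.board[r])
            if new_row != self.board[r]:
                changed = True
            self.board[r] = new_row
            self.score += gained
        if changed:
            self._spawn()
        return changed

# Manual sanity check, mirroring what the notebook's early cells do
game = Mini2048()
for row in game.board:
    print(row)
print("changed:", game.move_left(), "score:", game.score)
```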
Once the rules look solid, the notebook pivots from human to model. A prompt template describes the game state as a 4×4 array plus current score, then asks GPT-OSS to output a Python function that encodes its move policy. Instead of replying “UP” or “LEFT,” the model must generate code that returns one of the valid actions.
Prompt engineering here does the heavy lifting. The template:
- Pins down the function name and signature
- Enumerates allowed moves (`"up"`, `"down"`, `"left"`, `"right"`)
- Demands syntactically valid Python with no external imports
That constraint turns an LLM into a program-synthesizing agent. Every response becomes executable strategy, which the RLVR loop can run inside the 2048 environment, score automatically, and feed back into Unsloth’s training pipeline.
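The exact wording lives in the notebook, but the shape of the exchange looks roughly like the sketch below. Both the template text and the example policy are illustrative assumptions, not the notebook’s actual strings:

```python
# Illustrative only: the real prompt template and generated policy live in the notebook.
PROMPT_TEMPLATE = """You are playing 2048 on this 4x4 board (0 means an empty cell):
{board}
Current score: {score}

Write a Python function named strategy(board) that returns exactly one of
"up", "down", "left", "right". Use no imports and reply with only the code."""

# The kind of executable policy the model is expected to emit back:
model_output = '''
def strategy(board):
    # naive example policy: prefer "down" when the bottom row has space, else "left"
    for col in range(4):
        if board[3][col] == 0:
            return "down"
    return "left"
'''

# The RLVR loop exec()s the generated code, then calls strategy(board) inside the game environment
namespace = {}
exec(model_output, namespace)
print(namespace["strategy"]([[2, 0, 0, 0] for _ in range(4)]))  # -> "down"
```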
The Reward Engine: How the AI Actually Learns
Reward functions act as the secret contract between your RTX-powered agent and the 2048 board. In RLVR, you don’t hand out gold stars manually; you encode them as Python. Those tiny functions buried in the notebook decide what “good” looks like, every single turn.
At the core of this setup sit three verifiers: `function_works`, `no_cheating`, and `strategy_succeeds`. Each one inspects the model’s suggested move sequence and returns a clean, machine-readable score. Together, they form a miniature tribunal that judges every attempt your GPT-OSS agent makes.
`function_works` plays bouncer at the door. It checks whether the model’s response parses as valid code or a valid move description, whether arguments line up, and whether the game engine can actually execute it without throwing an exception. If the code crashes or produces nonsense, the reward drops, and the policy quietly shifts away from that behavior in the next update.
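In spirit, a check like `function_works` boils down to: can the model’s output be compiled, called, and trusted to return a legal move? The snippet below is a simplified stand-in for that gate, not the notebook’s actual implementation:

```python
def function_works_sketch(generated_code: str) -> float:
    """Simplified stand-in for a function_works-style verifier (not the notebook's real code).

    +1.0 if the generated code defines a callable strategy() that returns a legal move,
    -1.0 if it fails to parse, crashes, or returns nonsense.
    """
    namespace = {}
    try:
        exec(generated_code, namespace)                             # does it parse and run?
        move = namespace["strategy"]([[0] * 4 for _ in range(4)])   # can we call it on a board?
    except Exception:
        return -1.0
    return 1.0 if move in {"up", "down", "left", "right"} else -1.0

print(function_works_sketch("def strategy(board): return 'left'"))    #  1.0
print(function_works_sketch("def strategy(board): return board[9]"))  # -1.0 (IndexError)
```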
`no_cheating` handles the dark arts: reward hacking and rules lawyering. Large language models excel at exploiting underspecified instructions, so this verifier scans for moves that break 2048’s mechanics, tamper with the board state, or bypass the allowed API. If the model tries to “win” by editing the grid directly or skipping turns, `no_cheating` slams it with a strong negative reward.
`strategy_succeeds` focuses on actual gameplay progress. It runs the proposed moves inside the 2048 environment and checks concrete signals: score increase, tile merges, and whether the board survives instead of hard-locking. Successful strategies earn positive points; stagnant or losing lines get penalized, nudging the model toward higher-scoring, longer-lived runs.
Together, these verifiers create an automated feedback loop. Every training step follows the same rhythm: the model proposes a strategy, the verifiers execute and score it, and RLVR uses that scalar reward to tweak the model’s parameters. Over hundreds or thousands of iterations, the policy shifts from random swipes to something that starts to look like a human-crafted 2048 guide.
Reward hacking always lurks in the background of RL. Robust verifiers like these—explicit code checks, anti-cheat guards, and outcome-based scoring—box the agent into learning the actual task rather than gaming your metrics. That’s how RLVR keeps your home-brewed frontier model honest while it grinds its way to mastery.
From Failure to Fluency: Kicking Off the Training Loop
Kicking off training comes down to a single line in your notebook: `trainer.train()`. That call hands control to Unsloth’s RL engine, which starts chewing through your prompts, spinning up generations, and pushing them through the verifiable reward pipeline you configured earlier.
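For orientation, the wiring behind that one line usually looks something like the sketch below, following the GRPO pattern from the TRL and Unsloth docs. Every argument name and value here is an assumption (they shift between library versions), and `model`, `tokenizer`, `dataset`, and the three reward functions are expected to come from earlier notebook cells:

```python
# Sketch of the trainer wiring (assumed names/values; earlier cells define model, tokenizer,
# dataset, and the reward functions described in the previous section).
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    num_generations=4,          # group size: completions sampled per prompt for GRPO
    max_prompt_length=512,
    max_completion_length=512,
    max_steps=100,
)

trainer = GRPOTrainer(
    model=model,                                                     # LoRA-wrapped GPT-OSS
    processing_class=tokenizer,
    reward_funcs=[function_works, no_cheating, strategy_succeeds],   # the verifiers
    args=training_args,
    train_dataset=dataset,                                           # prompts describing board states
)

trainer.train()  # hand control to the RL loop
```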
Once the loop starts, the GPT-OSS model repeatedly proposes strategies for the 2048 board. The environment runs those moves, the verifiers score them, and RLVR converts those scores into gradients that nudge the model’s weights. Each step slightly rewires the network, biasing it toward sequences of actions that produced higher rewards.
Under the hood, this looks a lot like a game of millions of tiny bets. For each prompt, the model samples a move sequence, the environment returns a numeric reward, and the optimizer updates parameters so higher-reward trajectories become more likely next time. Over hundreds or thousands of steps, that trial-and-error process turns random flailing into recognizable strategy.
One of the most instructive moments in Matthew Berman’s Reinforcement Learning Tutorial comes when the model generates incomplete code for the game logic. The verifier immediately fails it: no compile, no reward. That hard “0” is not a dead end; it is precisely the negative signal the optimizer needs to steer the model away from half-finished code paths.
Failure becomes training data. When incomplete or logically broken snippets repeatedly score poorly, gradient updates suppress those patterns and amplify complete, verifiable solutions. You literally watch RLVR transform “barely runs” into “passes every check” by weaponizing mistakes.
While all this is happening, your screen may look deceptively quiet. The notebook cell running `trainer.train()` can sit on “In [*]” for long stretches, especially on a midrange RTX card. That usually means your GPU is saturated, not that anything crashed.
To confirm progress, keep an eye on:
- Terminal logs printing training steps, rewards, and loss values
- `nvidia-smi` showing GPU utilization near 90–100%
- VRAM usage climbing to match your model and batch size
For deeper internals, the unslothai/unsloth GitHub repo and the Unsloth Docs detail how the trainer batches prompts, applies GRPO-style updates, and exposes hooks if you want to customize the loop further.
The Future is Local: What You Can Build Next
You just pulled off a stunt that, a few years ago, belonged in a DeepMind paper: you trained a frontier-style RL agent on a consumer GPU, inside Windows, using WSL, NVIDIA’s CUDA stack, and Unsloth. No managed Kubernetes cluster, no mystery cloud bill—just a gaming PC teaching a GPT-OSS model to beat a puzzle game through pure trial and error.
2048 is the demo, not the destination. The exact same RLVR recipe—policy model, verifiable environment, automated reward—is already pushing open models on math benchmarks like GSM8K, where the answer is objectively right or wrong, and on code generation, where a unit test suite becomes your reward function. If a program compiles, passes tests, and runs within time limits, it gets points; if it fails, the gradient flows the other way.
This shift matters because verifiable domains are everywhere. You can turn a math contest, a LeetCode archive, or a company’s private integration tests into a training ground. Instead of labeling preferences, you define rules:
- For math: exact numeric or symbolic equality
- For code: tests passed, runtime, memory use
- For games/sims: score, survival time, win rate
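Rules like these can be only a few lines of Python. Here is a toy, GSM8K-flavored verifier with deliberately naive answer parsing, purely to show the shape of the idea:

```python
import re

def math_reward(model_answer: str, gold_answer: str) -> float:
    """Toy exact-match verifier: reward 1.0 only if the final number in the answer matches the gold label."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

print(math_reward("She sells 16 - 3 - 4 = 9 eggs, so she makes 9 * 2 = 18 dollars.", "18"))  # 1.0
print(math_reward("The answer is 20 dollars.", "18"))                                        # 0.0
```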
Hardware barriers keep dropping too. Unsloth recently added FP8 support for its GRPO-style training, squeezing models into less VRAM and pushing more tokens per second on mid-range RTX cards. You trade a bit of numerical precision for a lot more throughput, which means deeper training runs on GPUs that used to be “inferencing only.”
From here, experimentation becomes the main constraint. You can clone the 2048 notebook, swap in GSM8K, wire up a local judge, and watch a model climb its own private leaderboard. Local, verifiable RL stops being a research buzzword and starts looking like a new platform—one where developers, researchers, and hobbyists can all run frontier-grade experiments without asking anyone’s permission.
Frequently Asked Questions
What is Reinforcement Learning with Verifiable Rewards (RLVR)?
RLVR is a type of AI training where a model learns by trial and error in an environment with automated, rule-based rewards. Unlike RLHF, which uses human feedback, RLVR is ideal for tasks with clear success criteria, like solving math problems or winning a game like 2048.
What hardware do I need to follow this tutorial?
You need a Windows PC with any modern NVIDIA RTX GPU. While the video features a high-end card, the process works on any consumer RTX graphics card, though training times may be longer on lower-end models.
Why is Unsloth recommended for local RL training?
Unsloth is an open-source library optimized for speed and memory efficiency. It enables techniques like GRPO and uses features like LoRA to fine-tune large models on consumer hardware, dramatically reducing memory usage by over 60% compared to traditional methods.
Can I apply this RLVR method to tasks other than games?
Yes. RLVR is highly effective for any task where performance can be automatically and objectively verified. This includes code generation, mathematical reasoning, and other logic-based problems.