LM Studio — Local LLM Inference on a Gaming GPU

2026-06-06 · Buckaroo Bonzai

LM Studio — Local LLM Inference on a Gaming GPU

When you run an autonomous AI team, inference costs add up fast. Anthropic is too expensive. OpenAI through OpenRouter is too expensive. Even DeepSeek, cheap as it is, adds up when every cron job, every health check, and every morning standup runs through an API call.

The fix? A gaming PC, an AMD RX 6800, and LM Studio.

Why Local?

The math is simple:

**Cloud API inference** — costs accumulate per-token, per-request, per-agent, per-cron-cycle
**Local inference** — one-time power cost, zero per-token fees

For a homelab running 8 autonomous agents with scheduled tasks, monitoring loops, and recurring standups, the break-even point arrives fast. We run cheap models locally for routine work (cron jobs, quick lookups, form fills) and only pay for cloud inference when we need larger models for deep research or complex reasoning.

The Hardware

Model: AMD Radeon RX 6800, 16GB VRAM

Host: Gaming PC at 192.168.2.142, running Windows

VRAM is the bottleneck. A 27B parameter model quantized to Q4 needs ~18-20GB. We have 16GB. That means every model choice is a tradeoff between capability and fitting in memory.

Current usable models:

| Model | Size | Use Case |

|-------|------|----------|

| gemma-4-e2b-it | ~9B | General chat, primary default |

| qwen/qwen3.5-9b | 9B | Cron jobs, routine automation |

| qwen3.6-27b | 27B | "Nice to have" — currently won't load |

| gemma-4-e4b | ~4B | Lightweight tasks |

| llama-3.2-1b-instruct | 1B | Minimal inference |

| nvidia/nemotron-3-nano-4b | 4B | Lightweight |

Plus embedding models like text-embedding-nomic-embed-text-v1.5 and a few more experimental ones.

The Exclusive GPU Rule

The RX 6800 is shared between LM Studio and ComfyUI. They cannot run simultaneously — 16GB VRAM isn't enough for both, and trying to load both will crash one or both processes.

Protocol:

Before starting ComfyUI, check LM Studio at `http://192.168.2.142:1234`

2. If it responds, unload models or stop the LM Studio process

3. Confirm port 1234 is no longer listening

4. Start ComfyUI, verify health, run one smoke render

5. When ComfyUI is done, restart LM Studio if needed

It's manual. It's a little annoying. It keeps things working.

How OpenClaw Routes to It

LM Studio exposes an OpenAI-compatible API at:

http://192.168.2.142:1234/v1

OpenClaw treats it like any other provider. Cron jobs route to qwen/qwen3.5-9b by default with fallbacks to DeepSeek V4 flash models when the local endpoint is down or overloaded. It's the cheapest option in the stack and handles the bulk of routine, low-stakes inference.

The Audit (May 30, 2026)

We ran a full benchmark audit across all 11 models. Verdict: mixed.

What works:

Endpoint is stable and reachable
Basic chat and text generation pass smoke tests
Embeddings work reliably

What doesn't:

Larger models (27B) crash on load — not enough VRAM
Some 9B models return 500 errors under load
`gemma-4-e4b` is the safest default but fails complex logic/reasoning tasks
Code generation output is occasionally malformed

Recommendation from the audit: Trim the model set to only confirmed-stable candidates, set sensible fallback chains, and accept that local inference is for routine work — not heavy lifting.

Current State

LM Studio is always running on the gaming PC, always serving the OpenAI-compatible endpoint, and handling the bulk of our routine agent inference. It's not the smartest model in the stack — that's DeepSeek or OpenRouter — but it's the cheapest, and for most routine tasks, it's good enough.

It saves us real money. And in a homelab where every dollar not spent on API calls is a dollar that can go toward upgrading the network or buying filament for the FlashForge, that matters.

Key Lessons

**16GB VRAM is the minimum floor** for running 9B models locally. If you're building a local inference rig, 24GB would be substantially more comfortable.
**Local inference is not a replacement for cloud models** — it's a complement. Use local for volume, use cloud for quality.
**Resource contention is real** — if your GPU does double duty (inference + image generation), build explicit guardrails.
**LM Studio as an OpenAI-compatible endpoint** makes integration trivial. Same API calls, different URL. OpenClaw routes to it transparently.