Your Mac Does Not Have Hidden VRAM, But This Tweak Helps Local LLMs

There is a particular kind of modern disappointment that only happens on an Apple Silicon Mac. You have 16 GB of unified memory, your model file is “only” 11 GB on disk, LM Studio looks optimistic for a moment, and then the load fails like a Victorian gentleman fainting at the sight of a spreadsheet. The internet calls this “hidden VRAM”. That phrase is catchy, but it is also slightly nonsense. Your Mac does not have secret gamer VRAM tucked behind the wallpaper. What it has is a shared memory pool and a tunable guardrail for how much of that pool the GPU side of the system is allowed to wire down. Move the guardrail carefully and some local LLMs that previously refused to load will suddenly run. Move it carelessly and your machine turns into a very expensive beachball generator.

Apple Silicon chip with a glowing memory fence and model blocks pushing against it
This is not free VRAM. It is a movable fence inside a shared pool.

The practical knob is iogpu.wired_limit_mb. On this 16 GB Apple Silicon Mac, the value is still the untouched default:

$ sysctl iogpu.wired_limit_mb
iogpu.wired_limit_mb: 0

That 0 means “use the system default policy”, not “unlimited”. For people running local models, that distinction matters. The interesting bit is that the policy is often conservative enough that a model which should fit on paper does not fit in practice once GPU allocations, KV cache, context length, the window server, and ordinary macOS overhead all take their cut. The result is a very familiar sentence: failed to load model.
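To make that conservatism concrete, here is a small Python sketch of the default budget. The two-thirds figure is a commonly reported approximation of macOS behaviour on smaller machines, not an Apple-documented formula, so treat the numbers as illustrative:

```python
def approx_default_gpu_budget_gib(total_ram_gib: float) -> float:
    """Commonly reported approximation of the default GPU working-set
    budget on Apple Silicon (NOT an Apple-documented formula):
    roughly two thirds of RAM on smaller machines, roughly three
    quarters on larger ones."""
    if total_ram_gib <= 36:
        return total_ram_gib * 2 / 3
    return total_ram_gib * 3 / 4

budget = approx_default_gpu_budget_gib(16)
print(f"Approximate default GPU budget on 16 GiB: {budget:.1f} GiB")
# Roughly 10.7 GiB, which is exactly why an 11 GB model can fail
# even though the paper-fits maths looks fine
```

Whatever the exact policy on your machine and OS version, the shape of the problem is the same: the budget is deliberately smaller than the RAM.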

What This Setting Actually Changes

Apple’s architecture is the key to understanding why this works at all. Unlike a desktop PC with separate system RAM and discrete GPU VRAM, Apple Silicon uses one shared pool. Apple says it plainly:

“Apple GPUs have a unified memory model in which the CPU and the GPU share system memory.” – Apple Developer, Choosing a resource storage mode for Apple GPUs

That one sentence explains both the magic and the pain. The magic is that Apple laptops and minis can run surprisingly capable local models without a discrete GPU at all. The pain is that every byte you hand to GPU-backed inference is a byte you are not handing to the rest of the operating system. This is capacity planning, not sorcery. Kleppmann would recognise it instantly from Designing Data-Intensive Applications: one finite resource, several hungry consumers, and trouble whenever you pretend the budget is not real.

The lower-level Metal API exposes the same idea in more formal language. The property recommendedMaxWorkingSetSize is defined by Apple as:

“An approximation of how much memory, in bytes, this GPU device can allocate without affecting its runtime performance.” – Apple Developer, MTLDevice.recommendedMaxWorkingSetSize

Notice the wording: without affecting runtime performance. Apple is not promising a hard technical ceiling. It is describing a safety line. The iogpu.wired_limit_mb trick is, in effect, you saying: “thank you for the safety line, I would like to move it because I know what else is running on this machine”.

If you want to see the same concept from code rather than from a slider in LM Studio, a tiny Metal program can query the recommended budget directly:

import Metal

if let device = MTLCreateSystemDefaultDevice() {
    let bytes = device.recommendedMaxWorkingSetSize
    let gib = Double(bytes) / 1024.0 / 1024.0 / 1024.0
    print(String(format: "Recommended GPU working set: %.2f GiB", gib))
}

That value is the polite answer. iogpu.wired_limit_mb is how you become impolite, but hopefully still civilised.

Why Models Fail Before RAM Looks Full

Most newcomers look at the model file size and do schoolboy arithmetic: “11 GB file, 16 GB machine, therefore fine.” That works right up until reality arrives with a clipboard. Runtime memory use includes the model weights, the KV cache, backend allocations, context-length overhead, app overhead, and the rest of macOS. LM Studio explicitly gives you a way to inspect this before you pull the pin:

“Preview memory requirements before loading a model using --estimate-only.” – LM Studio Docs, lms load

That is not a decorative feature. Use it. Also note LM Studio’s platform advice for macOS: 16 GB+ RAM recommended, with 8 GB machines reserved for smaller models and modest contexts. The point is simple: local inference is not decided by model download size alone. It is decided by the total live working set.

# Ask LM Studio for the memory estimate before loading
lms load --estimate-only openai/gpt-oss-20b

# Lower context if the estimate is close to the edge
lms load --estimate-only openai/gpt-oss-20b --context-length 4096

# If needed, reduce GPU usage rather than insisting on "max"
lms load openai/gpt-oss-20b --gpu 0.75 --context-length 4096

That last point is underappreciated. Sometimes the right answer is not “raise the wired limit”. Sometimes the right answer is “pick a saner context length” or “run a smaller quant”. Engineers love hidden toggles because they feel like boss fights. In practice, boring budgeting wins.
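The “smaller quant” advice is simple arithmetic. A sketch, with illustrative bits-per-weight values (real GGUF quants carry extra metadata and per-block scales, so real files run slightly larger):

```python
def quant_weights_gib(params_billions: float, bits_per_weight: float) -> float:
    # bytes = parameters * bits / 8, then converted to GiB
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# The same 20B model at three different quantisation levels
for bpw in (8.0, 4.5, 3.5):
    print(f"20B at {bpw} bpw: {quant_weights_gib(20, bpw):.1f} GiB")
```

Dropping from an 8-bit to a 4.5-bit quant roughly halves the weight footprint, which is usually a bigger win than any sysctl.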

The Failure Modes Nobody Mentions in the Thumbnail

The YouTube version of this story is understandably upbeat: type command, load bigger model, cue triumphant tokens per second. The real world deserves a sterner briefing. Three failure cases show up over and over.

Case 1: The Model File Fits, But the Live Working Set Does Not

The trigger is a model whose weights fit comfortably on disk, but whose runtime footprint exceeds the combined budget once context and cache are included.

# Bad mental model:
# "The GGUF is 11 GB, so my 16 GB Mac can obviously run it."

model_weights_gb = 11.2
kv_cache_gb = 1.8
backend_overhead_gb = 0.8
desktop_overhead_gb = 2.0

total_live_working_set = (
    model_weights_gb +
    kv_cache_gb +
    backend_overhead_gb +
    desktop_overhead_gb
)

print(total_live_working_set)  # 15.8 GB, and we still have no safety margin

What happens next is either a clean refusal to load or a dirty scramble into memory pressure. The correct pattern is to estimate first, shrink context if necessary, and accept that a lower-bit quant is often the smarter answer than a higher limit.

# Better approach: estimate, then choose the model tier that fits
lms load --estimate-only qwen/qwen3-8b
lms load --estimate-only openai/gpt-oss-20b --context-length 4096

# If the estimate is borderline, step down a tier
lms load qwen/qwen3-8b --gpu max --context-length 8192
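The same discipline can be written as an explicit pre-flight check. The component numbers below are illustrative, not measured; the point is the comparison against the budget, not the decimals:

```python
def fits_under_budget(weights_gib: float, kv_cache_gib: float, budget_gib: float,
                      backend_overhead_gib: float = 0.8,
                      safety_margin_gib: float = 0.5) -> bool:
    """True if the live working set fits the GPU budget with margin to spare."""
    working_set = weights_gib + kv_cache_gib + backend_overhead_gib
    return working_set + safety_margin_gib <= budget_gib

# 11.2 GiB of weights with a long-context KV cache misses a 14 GiB budget...
print(fits_under_budget(11.2, 1.8, 14.0))  # False
# ...but shrinking the context (smaller KV cache) brings it back inside
print(fits_under_budget(11.2, 1.0, 14.0))  # True
```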

Case 2: You Raise the Limit So High That macOS Starts Fighting Back

This happens when you treat unified memory as if it were dedicated VRAM and leave the operating system too little breathing room. Headless Mac minis tolerate this better. A laptop with browsers, Finder, Spotlight, and a normal human life happening on it does not.

# Aggressive and often reckless on a 16 GB machine
sudo sysctl iogpu.wired_limit_mb=16000

# Then immediately try to load a borderline model
lms load openai/gpt-oss-20b --gpu max

The machine may still succeed, which is what makes this dangerous. Success under orange memory pressure is not proof of wisdom. It is proof that you got away with it once. The better pattern is to leave deliberate headroom for the OS and keep a close eye on Activity Monitor while you test.

# A more conservative example for a 16 GB Mac
sudo sysctl iogpu.wired_limit_mb=14336

# Verify the setting
sysctl iogpu.wired_limit_mb

# Then test with a realistic context length
lms load openai/gpt-oss-20b --context-length 4096
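The arithmetic behind “deliberate headroom” is worth making explicit. A sketch, assuming a 16 GiB machine (16384 MB):

```python
def os_headroom_gib(total_ram_mb: int, wired_limit_mb: int) -> float:
    # What remains for macOS once the GPU side can wire down its full limit
    return (total_ram_mb - wired_limit_mb) / 1024

print(os_headroom_gib(16384, 16000))  # 0.375 GiB: the OS is suffocating
print(os_headroom_gib(16384, 14336))  # 2.0 GiB: tight but liveable
```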

Case 3: You Optimise the Wrong Thing and Ignore Context Length

A surprisingly common mistake is to chase the biggest possible model whilst leaving an unnecessarily large context window enabled. KV cache is not free. A smaller context often buys you more stability than another dramatic sysctl ever will.

# Bad: max everything, then act surprised
lms load some-14b-model --gpu max --context-length 32768

# Better: fit the workload, not your ego
lms load some-14b-model --gpu max --context-length 4096
lms load some-8b-model  --gpu max --context-length 8192
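KV-cache growth is linear in context length, and the formula is short enough to sketch. The model dimensions below are typical of an 8B-class transformer with grouped-query attention; they are assumptions for illustration, not measurements of any particular file:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_element: int = 2) -> float:
    # Keys and values (hence the factor of 2), stored per layer,
    # per position, per KV head, at fp16 (2 bytes) by default
    total = 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_element
    return total / 1024**3

# Illustrative 8B-class dimensions: 32 layers, 8 KV heads, head_dim 128
print(kv_cache_gib(32, 8, 128, 4096))   # 0.5 GiB
print(kv_cache_gib(32, 8, 128, 32768))  # 4.0 GiB: the context you forgot to lower
```

Eight times the context means eight times the cache. That is the grand piano.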

This is the computing equivalent of bringing a grand piano to a pub quiz. Impressive, yes. Appropriate, no.

Memory pressure gauge rising as an oversized model pushes against system limits
The danger sign is not failure to load. It is successful loading with no oxygen left for the rest of the machine.

How to Tune It Without Turning Your Mac Into a Toaster

The safe way to use this setting is incremental, reversible, and boring. Those are good qualities in systems work. Start from default, raise in steps, test one model at a time, and watch memory pressure rather than vibes.

  1. Check the current value. If it is 0, you are on the system default policy.
  2. Pick a target that still leaves real headroom. On a 16 GB machine, 14 GB is already adventurous. On a dedicated headless box, you can be bolder.
  3. Restart the inference app. Tools like LM Studio need to re-detect the budget.
  4. Load with a realistic context length. Do not benchmark recklessness.
  5. Reset to default if the machine becomes unpleasant. A responsive Mac beats a heroic screenshot.

# 1. Inspect current policy
sysctl iogpu.wired_limit_mb

# 2. Raise cautiously
sudo sysctl iogpu.wired_limit_mb=12288

# 3. Test, observe, then step up if needed
sudo sysctl iogpu.wired_limit_mb=14336

# 4. Return to default policy
sudo sysctl iogpu.wired_limit_mb=0
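If you want to pick the target rather than guess it, work backwards from the headroom you are willing to guarantee. A sketch; the two-gigabyte default headroom is a judgement call, not a measured minimum:

```python
def cautious_wired_limit_mb(total_ram_gib: float, os_headroom_gib: float = 2.0) -> int:
    """Wired-limit value that leaves explicit headroom for macOS.
    Returns 0 (the system default policy) if the request makes no sense."""
    limit_mb = int((total_ram_gib - os_headroom_gib) * 1024)
    return limit_mb if limit_mb > 0 else 0

print(cautious_wired_limit_mb(16))  # 14336, the conservative value above
```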

If you truly need this on every boot, automate it like any other operational setting. Treat it as host configuration, not as a ritual you half-remember from a video. But also ask the adult question first: if you need a startup hack to run the model comfortably, should you really be running that model on this machine?
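One common way to automate it is a LaunchDaemon that reapplies the sysctl at boot. A sketch only: the com.example label and the 12288 value are placeholders to adjust, and the file would live in /Library/LaunchDaemons with root ownership:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Placeholder label: use your own reverse-DNS identifier -->
    <key>Label</key>
    <string>com.example.wiredlimit</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/sbin/sysctl</string>
        <!-- Example value: pick one you have actually tested -->
        <string>iogpu.wired_limit_mb=12288</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
```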

The second practical tool is estimation. Run the estimator before the load, not after the error:

# Compare two candidates before wasting time
lms load --estimate-only qwen/qwen3-8b
lms load --estimate-only mistral-small-3.1-24b --context-length 4096

# Use the estimate to choose the smaller model or context
lms load qwen/qwen3-8b --gpu max --context-length 8192

Which Models Actually Make Sense on Different Apple Silicon Macs

This is the section everybody really wants. Exact numbers depend on quant format, context length, backend behaviour, and what else the machine is doing. But the tiers below are realistic enough to save people from magical thinking.

16 GB
  Comfortable: 3B to 8B models, 12B class with modest context
  Possible with tuning: some 14B to 20B quants if you raise the limit and stay disciplined
  Usually a bad idea: large-context 20B+, or 30B-class models during normal desktop use

24 GB
  Comfortable: 8B to 14B models, many 20B-class quants
  Possible with tuning: some 24B to 32B models with sensible context
  Usually a bad idea: treating it like a 64 GB workstation

32 GB to 48 GB
  Comfortable: 14B to 32B models, larger contexts for practical work
  Possible with tuning: some 70B quants on the upper end, especially on dedicated machines
  Usually a bad idea: huge models plus giant context plus desktop multitasking

64 GB and above
  Comfortable: 30B to 70B-class quants become genuinely usable
  Possible with tuning: aggressive large-model experimentation on headless or dedicated Macs
  Usually a bad idea: assuming every app uses memory exactly the same way

If you want a one-line rule of thumb, it is this: on a 16 GB machine, think “excellent 7B to 8B box, adventurous 14B box, occasional 20B parlour trick”. On a 24 GB or 32 GB machine, the world gets much nicer. On a 64 GB+ Mac Studio, the conversation changes from “can I load this?” to “is the speed good enough for the inconvenience?”
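The rule of thumb condenses into a few lines of Python. The tier strings restate the table above and are guidance, not guarantees:

```python
def comfortable_model_tier(ram_gib: int) -> str:
    # Condensed from the sizing table; quant format and context still matter
    if ram_gib >= 64:
        return "30B to 70B-class quants become genuinely usable"
    if ram_gib >= 32:
        return "14B to 32B models comfortably"
    if ram_gib >= 24:
        return "8B to 14B models, many 20B-class quants"
    return "3B to 8B models, 12B class with modest context"

print(comfortable_model_tier(16))
```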

Also remember that smaller, better-tuned models often beat larger awkward ones for day-to-day coding, search, summarisation, and chat. A responsive 8B or 14B model you actually use is more valuable than a 20B model that only runs when the stars align and Chrome is closed.

# Practical workflow: compare candidates before downloading your dignity away
lms load --estimate-only qwen/qwen3-4b
lms load --estimate-only qwen/qwen3-8b
lms load --estimate-only openai/gpt-oss-20b --context-length 4096

Tiered Apple Silicon model sizing chart with memory classes and LLM blocks
The machine class matters more than the myth. Fit the model tier to the memory tier.

When the Default Setting Is Actually Fine

The balanced answer is that the default exists for good reasons. If your Mac is a general-purpose laptop, if you care about battery life and responsiveness, if you run multiple heavy apps at once, or if your local LLM work is mostly 7B to 8B models, leave the setting alone. The system default is often the correct trade-off.

This is also true if your workload is bursty rather than continuous. For occasional summarisation, coding assistance, or local RAG over documents, it is usually better to pick a slightly smaller model and preserve the machine’s overall behaviour. The hidden cost of “bigger model at any price” is that you stop trusting the computer. Once a laptop feels brittle, you use it less. That is bad engineering and worse ergonomics.

There is another subtle point. The wired-limit trick helps most when the machine is effectively dedicated to inference: a headless Mac mini, a quiet box on the shelf, a Mac Studio cluster, or a desktop session where you are willing to treat inference as the primary job. The closer your Mac is to a single-purpose appliance, the more sense this tweak makes.

What To Check Right Now

  • Check the current policy: run sysctl iogpu.wired_limit_mb and confirm whether you are on the default setting.
  • Estimate before loading: use lms load --estimate-only so you know the model’s live working set before you commit.
  • Audit context length: if you are using 16k or 32k context by habit, ask whether 4k or 8k would do the same job.
  • Watch memory pressure, not just free RAM: Activity Monitor tells you more truth than a single headline number.
  • Leave deliberate headroom: a model that barely runs is not a production setup, it is a stunt.
  • Reset when testing is over: sudo sysctl iogpu.wired_limit_mb=0 is a perfectly respectable ending.

Minimal dark checklist for Apple Silicon LLM tuning with commands and safety guardrails
Most wins come from estimation, context discipline, and realistic model choice, not from one dramatic command.

The honest headline, then, is better than the clickbait one. Your Mac does not have hidden VRAM waiting to be unlocked like a cheat code in a 1998 driving game. What it has is unified memory, a conservative GPU working-set policy, and enough flexibility that informed users can rebalance the machine for local inference. That is genuinely useful. It is also exactly the sort of useful that punishes people who confuse “possible” with “free”.

Video Attribution

This article was inspired by Alex Ziskind’s video on adjusting the GPU wired-memory limit for local LLM use on Apple Silicon Macs. The video is worth watching for the quick demonstration, particularly if you want to see the behaviour in LM Studio before you touch your own machine.

Original video: Your Mac Has Hidden VRAM… Here’s How to Unlock It by Alex Ziskind.

nJoy 😉
