Gobbles

140 GB just for the weights: that’s what a 70-billion-parameter model needs at 16-bit precision before a single word of conversation.

VRAM is destiny for local AI

The endless “which AI model should I run on my own machine?” debate collapses into one question: how much memory sits on your graphics card. That number quietly decides which models load, how long the conversation can be, how fast answers come back, how many people the model can serve at once, and what it all costs.

The punchline is brutal: a 70B model at 16-bit precision (2 bytes per parameter) wants about 140 GB for the weights alone, and even squeezed to 4-bit it still wants around 42 GB. That is too much for a 24 GB card, so the overflow spills into main system memory and output drops to roughly one to three tokens per second.

The practical rule flips the usual conversation: start with the memory you have, then ask what you can run at a speed you’ll actually tolerate. On an 8 GB card, a 7B model at 4-bit runs comfortably at about 20–25 tokens per second; push that same card to a 30B model and it collapses to 1–3 tokens per second.
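The arithmetic behind those numbers is simple enough to sketch. The snippet below is a rough back-of-the-envelope estimate, not a measurement: it multiplies parameter count by bits per weight and checks the result against a card's memory. The function names and the 1 GB headroom default are hypothetical, chosen only for illustration. Note that the raw math gives 35 GB for a 70B model at exactly 4 bits; the roughly 42 GB figure quoted above presumably includes extra bits per weight and runtime overhead that this sketch does not model.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bits per weight, divided by 8 bits per byte.
    Counts weights only: no KV cache, activations, or framework overhead.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(required_gb: float, vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the weights plus some headroom fit on the card.

    headroom_gb is an assumed placeholder for context and runtime overhead;
    real overhead depends on context length and the runtime used.
    """
    return required_gb + headroom_gb <= vram_gb


if __name__ == "__main__":
    # (model size in billions, bits per weight, card VRAM in GB)
    cases = [(70, 16, 24), (70, 4, 24), (7, 4, 8), (30, 4, 8)]
    for params, bits, vram in cases:
        need = weights_gb(params, bits)
        verdict = "fits" if fits_in_vram(need, vram) else "does not fit"
        print(f"{params}B at {bits}-bit: ~{need:g} GB of weights, {verdict} in {vram} GB")
```

Running it reproduces the pattern in the text: the 7B model at 4-bit fits an 8 GB card with room to spare, while the 30B model on 8 GB and the 70B model on 24 GB do not, which is where the spill to system memory begins.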

Gobbles Gobble's Take: The model leaderboard matters less than the hardware ceiling. If it doesn’t fit, it doesn’t matter.

Source: Perplexity Search

