How Much GPU and VRAM You Actually Need for AI Work

The most important spec in an AI workstation is not the GPU's name — it's how much VRAM that GPU has. VRAM is the memory the card holds your model in, and it sets the largest model you can run at all. Most buyers reach for the biggest headline number and over- or under-spend; the right move is to start from the largest model you'll actually run, work out the VRAM that holds it, and only then pick the card. Every figure here is a 2025–2026 range to re-verify at quote time — prices and benchmarks move week to week.


VRAM is the first constraint, not clock speed

Here is the rule that decides everything: a model has to fit in VRAM, or it does not run well. The video memory on the GPU holds the model's weights while it works. If the model is bigger than the card's VRAM, you either can't load it or it spills over into slow system memory and crawls. No amount of raw TFLOPS or clock speed fixes a model that doesn't fit.

That's why the right buying order is the reverse of how most people shop. Don't start with "which is the fastest card"; start with "what is the largest model I'll run, and how much memory does it need?" Size the VRAM to that, and the card largely picks itself. Clock speed and core counts decide how fast a model that already fits will run; VRAM decides whether it fits in the first place.

This page sizes the GPU for a single-user, desk-side AI workstation. When the model you need outgrows what one or two cards at a desk can hold, that's a shared GPU AI server conversation instead — and we'll say so honestly.

How to estimate VRAM for a model

The table below is a rough rule of thumb by model size and quantization. Quantization (compressing the model's numbers, e.g. to 4-bit) is the lever that lets a big model fit a smaller card, with a small quality trade-off. Figures are approximate weight sizes for inference; add headroom for the KV cache (the working memory a model keeps for the text it has already processed), which grows with context length. Re-verify against your actual model.

Model size	~VRAM at 4-bit (Q4)	~VRAM at FP16	Smallest card that fits (Q4)
8B (e.g. Llama-class small)	~5–6GB	~16GB	24GB card, lots of room
13B	~8–10GB	~26GB	24GB card, comfortable
32B	~18–22GB	~64GB	24–32GB card
70B	~38–40GB	~140GB	48GB or 96GB card

Approximate inference weight sizes from public 2025–2026 hardware guides; a 70B model at Q4_K_M is widely reported at ~38–40GB and ~140GB at full FP16. Fine-tuning needs more: LoRA and QLoRA add overhead, and a full fine-tune needs several times the inference figure. Add KV-cache headroom for long context. Re-verify for your exact model and framework.
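The arithmetic behind the table can be sketched in a few lines of Python. This is a rough estimate, not a measurement: weight size is parameter count times bits per weight, and the KV cache grows linearly with context length. The function names, the ~4.5 effective bits per weight used for a Q4_K_M-style quantization, and the 8B-class layer counts in the example are illustrative assumptions; read the real values from your model's config. Results land near the low end of the table's ranges because runtimes add their own overhead on top.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for a single sequence."""
    # Keys and values (the factor of 2) are stored per layer, per KV head, per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


if __name__ == "__main__":
    print(f"70B at FP16: ~{weight_gb(70, 16):.0f} GB")
    print(f"70B at ~4.5 bits/weight (assumed): ~{weight_gb(70, 4.5):.0f} GB")
    # Assumed 8B-class shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
    print(f"8B-class KV cache at 8,192 tokens: ~{kv_cache_gb(32, 8, 128, 8192):.1f} GB")
```

Under these assumptions the 70B figures come out at roughly 140GB and 39GB, in line with the table. Doubling the context length doubles the KV-cache figure, which is why long-context work needs headroom above the bare weight size.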

The current card tiers, plainly

Four VRAM tiers cover almost every desk-side AI build. Here's what each one unlocks. Prices are 2025–2026 ranges and street-volatile — re-verify at quote.

VRAM tier	Example card	~Power	~Price range	What it unlocks
24GB	RTX 4090	~450W	consumer-tier	8B–13B models, light fine-tune, creative work
32GB	RTX 5090	~575W	~$1,999 MSRP (street-volatile)	Headroom on mid-size models up to 32B at Q4, Flux/SDXL image generation
48GB	RTX 6000 Ada (prior-gen pro)	~300W	pro-tier, verify	Larger models, ECC, quieter pro cooling
96GB	RTX PRO 6000 Blackwell	~600W	~$8,000–9,200	One-card 70B at Q4, heavy fine-tune, ECC

Specs and prices are 2025–2026 and subject to change. The RTX 5090's $1,999 MSRP is from its January 2025 launch; street prices frequently run above it. The RTX PRO 6000 Blackwell carries 96GB GDDR7 ECC at ~600W; its Max-Q variant runs at ~300W for denser multi-GPU builds. The 48GB RTX 6000 Ada tier is referenced from prior-gen specs — confirm exact memory and price before you order. Re-verify everything at quote.

GeForce vs. pro-grade (RTX 5090 vs. RTX PRO 6000)

The gap between a GeForce RTX 5090 and a pro-grade RTX PRO 6000 Blackwell is bigger than the price tag suggests, and it is not mostly about speed. The differences that matter for AI work are:

VRAM size — 32GB on the 5090 vs. 96GB on the PRO 6000. This alone decides which models fit.
ECC memory — the pro card's VRAM catches and corrects bit errors, which matters for long, unattended fine-tuning runs where a silent error wastes hours.
Cooling style — the 5090 is a flow-through dual-fan card that cools well solo but dumps heat into the case; pro cards offer blower or Max-Q designs built to pack several together.
Driver and feature set — pro drivers target sustained professional workloads and multi-GPU density.

When is the premium worth it? When you genuinely need the 96GB to fit a large model, when ECC protects long training runs, or when you're packing multiple cards. For a single-card desk machine running models that fit in 32GB, the 5090 is the better value — and we'll recommend the cheaper card when it does the job. To configure either, see our NVIDIA AI workstation page.

Memory bandwidth and inference speed

VRAM capacity decides what fits; memory bandwidth decides how fast it runs. Bandwidth is how quickly data moves in and out of VRAM, and for token generation it's often the real bottleneck — the GPU spends much of its time waiting on memory, not on math. That's why a card with more bandwidth can feel faster on the same model even when raw compute looks similar.

For example, the RTX 5090 reports roughly 1.79TB/s of memory bandwidth — about 78% more than the prior-generation 4090 — and third-party testers report meaningfully higher tokens-per-second on LLM runs as a result. Treat any specific tokens-per-second figure with care: those numbers come from third-party benchmarks and swing widely by quantization, framework, and context length. We don't publish them as TIS-measured. The takeaway is the principle, not a leaderboard: once your model fits, bandwidth is what you look at for speed.
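To make the principle concrete, here is a back-of-envelope sketch. It assumes that generating each token requires reading every weight from VRAM once, so bandwidth divided by weight size gives a theoretical ceiling on single-stream tokens per second. That is an upper bound under a simplifying assumption, not a benchmark, and real throughput lands below it. The 4090 figure of about 1,008GB/s comes from public spec sheets rather than from this page, so re-verify it; the function names and the 20GB example model are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical ceiling on single-stream tokens per second.

    Assumes every generated token streams all weights from VRAM once,
    and ignores KV-cache reads, compute time, and framework overhead.
    """
    return bandwidth_gb_s / weights_gb


def percent_more(new: float, old: float) -> float:
    """How much larger `new` is than `old`, in percent."""
    return (new / old - 1) * 100


RTX_5090_GB_S = 1790.0  # ~1.79TB/s, as quoted in the text
RTX_4090_GB_S = 1008.0  # assumption from public spec sheets; re-verify

if __name__ == "__main__":
    gain = percent_more(RTX_5090_GB_S, RTX_4090_GB_S)
    print(f"5090 vs 4090 bandwidth: ~{gain:.0f}% more")
    # Same hypothetical 20GB quantized model on both cards: only the ceiling moves.
    for name, bw in (("RTX 4090", RTX_4090_GB_S), ("RTX 5090", RTX_5090_GB_S)):
        print(f"{name}: ceiling ~{decode_ceiling_tok_s(bw, 20.0):.0f} tok/s")
```

The useful part is the ratio between cards on the same model, not the absolute numbers, which depend on quantization, framework, and context length.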

One card now, room for two later

You don't have to buy your final configuration today. The smart move is one card sized to your current largest model, in a chassis with the power supply, board, and cooling already spec'd to take a second card when you need it. That headroom is cheap to plan up front and expensive to retrofit later.

A second GPU helps in specific cases — more pooled VRAM for bigger models, or running parallel jobs — but it isn't automatically faster, and two big cards at a desk bring real power and heat. Before you assume two is better than one, read our honest take in the multi-GPU guide. Often a single higher-VRAM card is the simpler, better answer.
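A quick tally shows why. The sketch below uses only the approximate VRAM and power figures from the tier table, and it treats pooled VRAM as a simple sum, which is optimistic because splitting a model across cards adds overhead. The `Card` type and `totals` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    power_w: int  # approximate board power from the tier table


def totals(cards: list[Card]) -> tuple[int, int]:
    """Return (pooled VRAM in GB, combined GPU power in watts)."""
    return sum(c.vram_gb for c in cards), sum(c.power_w for c in cards)


RTX_5090 = Card("RTX 5090", 32, 575)
RTX_PRO_6000 = Card("RTX PRO 6000 Blackwell", 96, 600)

if __name__ == "__main__":
    builds = (("2x RTX 5090", [RTX_5090, RTX_5090]),
              ("1x RTX PRO 6000", [RTX_PRO_6000]))
    for label, build in builds:
        vram, watts = totals(build)
        print(f"{label}: {vram}GB pooled, ~{watts}W of GPU power")
```

On these figures, two 5090s pool 64GB at roughly 1,150W of GPU power, while one PRO 6000 gives 96GB at roughly 600W. Price is a separate question to settle at quote time.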

Match the card to your workload

Run through these and the right VRAM tier usually picks itself.

What is the largest model you will run?

Size VRAM to it. 8B–13B fits 24GB easily; models approaching 32B want 32GB; a 70B at Q4 needs ~38–40GB, so 48GB or 96GB.

Inference only, or fine-tuning too?

Inference fits the tiers above. LoRA and QLoRA add overhead; a full fine-tune needs several times the inference VRAM — size up.

How long is your typical context?

Long context and many sessions grow the KV cache, which adds to VRAM use. Leave headroom above the bare model size.

Do you need ECC for long runs?

Unattended overnight training benefits from the error-correcting VRAM on pro cards. For short interactive use it matters less.

Will you ever want a second card?

If yes, we spec the PSU, board, and cooling for two now so you can add one later without a rebuild.

Is this truly a desk machine, or a team server?

One user at a desk = workstation. A team sharing it 24/7 is a server conversation — we route you there honestly.
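The first few questions reduce to a small lookup. The sketch below is an illustration only: it adds a caller-supplied headroom figure to the weight size, optionally filters to ECC cards, and returns the smallest tier from the card table that fits. Returning `None` stands for the server conversation. The function name, the tuple layout, and the headroom value in the example are assumptions.

```python
from typing import Optional

# (VRAM in GB, example card, has ECC), taken from the tier table.
TIERS = [
    (24, "RTX 4090", False),
    (32, "RTX 5090", False),
    (48, "RTX 6000 Ada", True),
    (96, "RTX PRO 6000 Blackwell", True),
]


def pick_tier(weights_gb: float, headroom_gb: float = 0.0,
              need_ecc: bool = False) -> Optional[str]:
    """Smallest single-card tier that holds weights plus headroom.

    Returns None when no single card in the table fits.
    """
    required = weights_gb + headroom_gb
    for vram_gb, card, has_ecc in TIERS:  # TIERS is sorted smallest first
        if vram_gb >= required and (has_ecc or not need_ecc):
            return f"{vram_gb}GB ({card})"
    return None


if __name__ == "__main__":
    print(pick_tier(6))                  # 8B at Q4
    print(pick_tier(40, headroom_gb=4))  # 70B at Q4 plus assumed KV-cache headroom
    print(pick_tier(10, need_ecc=True))  # 13B at Q4, ECC wanted for long runs
    print(pick_tier(140))                # 70B at FP16: no single card fits
```

Fine-tuning is deliberately left out: the overhead depends on the method and framework, so size that case from your own measurements rather than a fixed multiplier.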

We size the VRAM and build it here in Texas

You don't have to settle the 24 vs. 32 vs. 96GB question alone. Tell us the models you run and we size the card by VRAM, then hand-build and bench-test the workstation here in Texas — delivered and set up in person across Houston, Katy, Fulshear and the Fort Bend area. The card that fits your work, picked by the people who build it. See our Texas service areas.

GPU & VRAM questions

How much VRAM do I need for AI work?

Size it to your largest model. An 8B model at 4-bit fits in roughly 5–6GB, so a 24GB card runs smaller and quantized models with room to spare. 32GB adds headroom for bigger models and longer context. A 70B model at Q4_K_M needs about 38–40GB, which points to a 48GB or 96GB card. Always size from the biggest thing you actually plan to run, not the average.

Is the RTX 5090 enough, or do I need the RTX PRO 6000?

The RTX 5090 (32GB) suits most single-user inference and small-model fine-tuning, and it is the cheaper card. Step up to the RTX PRO 6000 Blackwell (96GB) when you need to run large models, do heavier fine-tuning, or want ECC memory for long unattended runs. We recommend the 5090 when it does the job and only reach for the 96GB card when your model sizes demand it.

What is the difference between a GeForce card and a pro-grade card for AI?

It is more than raw speed. Pro-grade cards like the RTX PRO 6000 add ECC VRAM that catches memory errors on long runs, a different driver and feature set, and blower or Max-Q designs built for packing several cards together. GeForce cards like the RTX 5090 are excellent value for a single-card desk machine. The premium is about memory size, error correction, and multi-GPU density, not a faster headline number.

Does more VRAM make AI faster?

Not directly. VRAM decides whether a model fits at all — too little and it will not run, or it spills to system memory and crawls. Once a model fits, speed comes mostly from memory bandwidth and the GPU itself, not from having extra unused VRAM. So size VRAM to hold your model, then look at bandwidth for how fast it runs.

Can I run a 70B model on a workstation?

Yes, quantized. A 70B model at Q4_K_M needs roughly 38–40GB of VRAM, which fits comfortably on a single 96GB pro card or, with care, across two high-VRAM cards. Full-precision FP16 70B is about 140GB and is a server conversation, not a desk-side workstation. We will tell you honestly when your model size has crossed that line.

Next, configure an NVIDIA AI workstation around the card, weigh a second GPU in the multi-GPU guide, or read the full AI workstations overview. When the VRAM you need outgrows a desk, see GPU AI servers.

Not sure how much VRAM your work needs?

Tell us the largest model you want to run and we'll size the card by VRAM, then build a Texas workstation you own outright — no over-spending on a number you'll never use.