Can I run a serious LLM on my Mac with M-chip?
Yes, and surprisingly well. Mac Studios with an M2 Ultra or M3 Max can hold models up to 70B parameters in memory thanks to unified memory. For one user or a small team, that is a workable on-prem setup without a GPU server. For multi-user production, pick something else.
Try this first
1. Set your RAM budget: 16 GB is enough for a 7B model (quantised), 32 GB for 13B, 64 GB for 30B, 128 GB or more for 70B. Remember that the OS and your other work need RAM too.
2. Install Ollama or LM Studio. Both run on Apple Silicon with Metal acceleration and need no extra configuration. LM Studio has a GUI; Ollama is a CLI with an API server underneath.
3. Pick a quantised model: Q4_K_M or Q5_K_M is the right quality-versus-memory compromise for most SMB use. Unquantised 16-bit models need roughly four times the RAM.
4. Test latency: a well-tuned M2 Ultra reaches 30 to 60 tokens per second on a 13B model. That is fine for interactive chat; batch jobs take patience.
5. Know the limits: in practice, one Mac means one user at a time. Two concurrent sessions contend for the same memory. For a team, pick a Linux GPU server or cloud inference.
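The RAM budgets in step 1 follow from simple arithmetic: resident size is roughly parameter count times bits per weight, plus runtime overhead. A minimal sketch, assuming Q4_K_M averages about 4.5 bits per weight and a 25% overhead factor for KV cache and runtime (both figures are rough working assumptions, not exact numbers):

```python
# Rough unified-memory estimate for a quantised LLM.
# Assumption: memory ~ parameters * bits-per-weight / 8, times ~1.25
# for KV cache, activations and runtime overhead (a rough factor).

def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.25) -> float:
    """Estimated resident size in GB for a quantised model."""
    bytes_per_param = bits_per_weight / 8
    return params_billion * bytes_per_param * overhead

# Q4_K_M is assumed here to average ~4.5 bits per weight; 16 bits is fp16.
for params in (7, 13, 30, 70):
    q4 = model_memory_gb(params, 4.5)
    fp16 = model_memory_gb(params, 16)
    print(f"{params:>3}B: ~{q4:.0f} GB at Q4_K_M, ~{fp16:.0f} GB at fp16")
```

The printed figures line up with the budgets above: a 70B model at Q4_K_M lands near 50 GB, which is why it wants a 128 GB machine once the OS and your applications claim their share.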
When to bring us in
If you want us to compare a Mac setup against a light GPU server for your case, we can ground the choice in numbers.
See also
- Can I paste a customer file or email into ChatGPT? Depends on the account and settings. Free ChatGPT and a Team tenant behave very differently from what most people assume.
- I want a one-page AI policy for my team. A real one-pager beats a thick document nobody reads. Four headers and concrete examples.
- How do I tell if an AI answer is made up? Models sound confident even when they are wrong. A few habits catch most mistakes.
None of the above fits?
Describe your situation below. We pass your input plus the steps you already saw to our AI and return tailored next-step advice. If it's too risky to DIY, we'll say so.
Or skip the DIY entirely
Our Managed IT clients do not look these things up. One point of contact, a fixed monthly price, resolved within working hours.