Running Local AI Models on macOS
Install, configure, and run local AI on Apple Silicon, with memory-optimised settings and VS Code Copilot integration
- Published: 4 June 2026
- Read time: 10 min read
I use GitHub Copilot at work and Claude for personal projects. Both switched to usage-based billing this month, dropping the flat subscription model. For anyone using these tools heavily across multiple projects, that shift makes the monthly cost unpredictable. Running models locally removes that variable entirely: no usage bills, no rate limits, and everything stays on your machine.
The quality gap has closed enough that local models are a realistic daily driver now, not just an experiment.
This guide covers the first-time setup on a Mac with Apple Silicon. I run this on an M1 MacBook Pro with 16 GB of unified memory. The default settings are tuned for that hardware, but each relevant section also covers what to change if you have more RAM.
How it fits together
Prerequisites
- macOS with Apple Silicon (M series)
- Homebrew installed
- A few GB of free disk space per model (most 7–8B models need 4–5 GB each)
- VS Code, for the integration sections at the end; a GitHub Copilot subscription is needed to use cloud models, but the local Ollama integration works without one
Install Ollama
Ollama is the runtime that downloads, manages, and serves local models. Install it via Homebrew:
brew install --cask ollama
Alternatively, download the installer directly from ollama.com. Once launched, Ollama places an icon in the menu bar and starts the API server at http://localhost:11434. Confirm it is running:
curl http://localhost:11434
The response should be Ollama is running.
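If you want that check in a script rather than by eye, a minimal sketch could look like the following. The function name ollama_is_up is hypothetical, and the two-second timeout is an arbitrary choice; the banner text is the one quoted above.

```shell
# Returns 0 only when the server answers with the expected banner.
# ollama_is_up is a hypothetical helper name, not part of Ollama.
ollama_is_up() {
  local base_url="${1:-http://localhost:11434}"
  local body
  # -f: treat HTTP errors as failures, -s: no progress output,
  # --max-time: give up quickly if nothing is listening
  body="$(curl -fs --max-time 2 "$base_url" 2>/dev/null)" || return 1
  [ "$body" = "Ollama is running" ]
}

if ollama_is_up; then
  echo "Ollama is up"
else
  echo "Ollama is not reachable; start the app or run: ollama serve"
fi
```

The same function accepts a different base URL as its first argument if the server listens somewhere other than the default port.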
Memory and performance settings
Running a language model is fundamentally a memory operation, not a compute one. A model’s weights are the billions of numerical parameters that encode its behaviour, and they must be loaded entirely into RAM before a single token can be generated. A 7B model in Q4_K_M quantisation takes around 4–5 GB; an 8B model is similar. If those weights do not fit and the system starts paging to disk, inference slows to a near halt regardless of how fast your CPU is.
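The 4–5 GB figure can be sanity-checked with back-of-the-envelope arithmetic: parameters multiplied by bits per weight, divided by eight. The sketch below assumes roughly 4.8 bits per weight for Q4_K_M, which is an approximation rather than a published constant, and weights_gb is a hypothetical helper name. Real files vary a little by model.

```shell
# Rough size of model weights in GB (10^9 bytes):
#   parameters in billions x bits per weight / 8
# weights_gb is a hypothetical helper; 4.8 bits per weight is an assumed
# average for Q4_K_M, not an exact figure.
weights_gb() {
  awk -v params_b="$1" -v bits="$2" \
    'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

weights_gb 7 4.8   # 7B at ~4.8 bits per weight  -> 4.2
weights_gb 8 4.8   # 8B at ~4.8 bits per weight  -> 4.8
weights_gb 8 16    # 8B at full f16 precision    -> 16.0
```

The last line shows why quantisation matters on this hardware: the same 8B model at f16 would need the whole 16 GB for weights alone.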
On Apple Silicon this matters more than on a typical machine: the CPU, Metal GPU, and every running application share a single pool of unified memory. VS Code, a dev server, a browser, and Ollama are all drawing from the same 16 GB.
Ollama’s defaults are generous with memory, which compounds these pressures. Without tuning, the runtime may load multiple models simultaneously, allocate a context window far larger than needed, and leave your other tools fighting for RAM.
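Add these variables to ~/.zshrc or ~/.zprofile. They take effect for a server started from that shell with ollama serve. The menu bar app does not read shell profiles; for it, the Ollama FAQ describes setting each variable with launchctl setenv and then restarting the app: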
# Limit concurrency: one model at a time on 16 GB
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

# Keep the model warm between requests to avoid cold-start latency in VS Code
export OLLAMA_KEEP_ALIVE=30m

# Default context window
export OLLAMA_CONTEXT_LENGTH=4096

# Apple Silicon optimisations: the highest-impact pair for 16 GB
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

Then apply them without restarting your shell:
source ~/.zshrc

| Setting | Why |
|---|---|
| OLLAMA_MAX_LOADED_MODELS=1 | Prevents multiple models competing for the same 16 GB |
| OLLAMA_NUM_PARALLEL=1 | Serves one request at a time per model; extra parallel slots multiply context memory |
| OLLAMA_FLASH_ATTENTION=1 | Reduces peak activation memory on M1 Metal |
| OLLAMA_KV_CACHE_TYPE=q8_0 | Halves KV cache RAM compared to the default f16 |
| OLLAMA_KEEP_ALIVE=30m | Model stays loaded between requests, no cold-start delay |
OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE=q8_0 together free around 1–2 GB of effective headroom. That is enough to run 8B parameter models comfortably on 16 GB when they would otherwise be marginal.
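To see where that headroom comes from, here is a rough KV cache estimate: two tensors (keys and values) per layer, each holding KV heads × head dimension values per token. The shape used below (32 layers, 8 KV heads, head dimension 128) is an illustrative 8B-class layout, not read from any model in this guide, and treating q8_0 as one byte per value ignores its small per-block overhead. kv_cache_gib is a hypothetical helper name.

```shell
# Approximate KV cache size in GiB:
#   2 (keys + values) x layers x kv_heads x head_dim x context x bytes per value
# kv_cache_gib is a hypothetical helper. The model shape passed in below is an
# assumed example, and 1 byte per value only approximates q8_0.
kv_cache_gib() {
  awk -v layers="$1" -v kv_heads="$2" -v head_dim="$3" -v ctx="$4" -v bytes="$5" \
    'BEGIN { printf "%.2f\n", 2 * layers * kv_heads * head_dim * ctx * bytes / 1073741824 }'
}

kv_cache_gib 32 8 128 4096 2    # f16 at a 4K context    -> 0.50
kv_cache_gib 32 8 128 4096 1    # ~q8_0 at a 4K context  -> 0.25
kv_cache_gib 32 8 128 32768 2   # f16 at a 32K context   -> 4.00
```

The cache grows linearly with context length, so the saving from q8_0 is modest at 4K and substantial once the window reaches tens of thousands of tokens.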
Adjusting for more RAM
The settings above are conservative, tuned for 16 GB. On machines with more unified memory you can relax the concurrency limits and drop the KV cache compression:
| Setting | 16 GB (M1/M2) | 32 GB (M2 Pro/M3 Pro) | 64 GB+ (M3 Max/Ultra) |
|---|---|---|---|
| OLLAMA_MAX_LOADED_MODELS | 1 | 2 | 3 or more |
| OLLAMA_NUM_PARALLEL | 1 | 2 | 4 |
| OLLAMA_KV_CACHE_TYPE | q8_0 | q8_0 or omit | Omit: use default f16 |
OLLAMA_FLASH_ATTENTION=1 is still worth keeping on any Apple Silicon machine: it reduces peak activation memory regardless of total RAM.
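The tiers above can be turned into a small helper that prints the matching export lines for a given amount of memory. This is a sketch under assumptions: ollama_env_for_gb is a hypothetical name, the cut-offs at 32 and 64 GB are my reading of the table (in-between sizes such as 24 or 48 GB fall into the lower tier), and the 32 GB tier keeps q8_0 although the table allows omitting it.

```shell
# Print export lines for a memory tier, following the table in this section.
# ollama_env_for_gb is a hypothetical helper; the thresholds are assumptions.
ollama_env_for_gb() {
  local gb="$1"
  # Worth keeping on every tier
  echo "export OLLAMA_FLASH_ATTENTION=1"
  if [ "$gb" -ge 64 ]; then
    echo "export OLLAMA_MAX_LOADED_MODELS=3"
    echo "export OLLAMA_NUM_PARALLEL=4"
    # No OLLAMA_KV_CACHE_TYPE line: fall back to the default f16
  elif [ "$gb" -ge 32 ]; then
    echo "export OLLAMA_MAX_LOADED_MODELS=2"
    echo "export OLLAMA_NUM_PARALLEL=2"
    echo "export OLLAMA_KV_CACHE_TYPE=q8_0"
  else
    echo "export OLLAMA_MAX_LOADED_MODELS=1"
    echo "export OLLAMA_NUM_PARALLEL=1"
    echo "export OLLAMA_KV_CACHE_TYPE=q8_0"
  fi
}

ollama_env_for_gb 16
```

On macOS, sysctl -n hw.memsize reports installed memory in bytes, so dividing that by 1073741824 gives a value to pass in. The printed lines can be pasted into a profile or applied with eval.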
Staying fully local
Ollama does not send your prompts anywhere by default. If you are working with sensitive data and want a hard guarantee, add this flag too:
# Optional: disables remote inference and web search entirely
export OLLAMA_NO_CLOUD=1
Choosing a model
Every model has a name and a size tag. The number in the tag reflects how many billion parameters it contains, which determines both output quality and how much RAM it needs to load. Use the table below to pick the right fit for your hardware and use case.
Model reference
| Model | Size | Vision | Best for |
|---|---|---|---|
| gemma3:4b | ~2.5 GB | Yes | Fast chat, vision, light tasks |
| qwen3:8b | ~4.5 GB | No | Best all-rounder, strong reasoning |
| qwen2.5vl:7b | ~4.5 GB | Yes | Vision and text |
| qwen2.5vl:3b | ~2 GB | Yes | Lightweight vision |
| qwen2.5-coder:7b | ~4.3 GB | No | Code generation |
| mistral-nemo | ~7 GB | No | Long documents, 32K context |
| gemma3:12b | ~8 GB | Yes | Higher quality, viable with flash attention |
| nomic-embed-text | ~0.3 GB | No | Embeddings and RAG pipelines |
On 16 GB, avoid 13B models and larger. They will page to swap and feel sluggish under any real workload. On 32 GB you can run 13B and 14B models comfortably, and gemma3:12b and qwen3:14b become reliable daily drivers. On 64 GB or more, 27B and 32B models are viable. Check ollama.com/library for the full catalogue.
Prefer Q4_K_M quantised variants when available. They offer the best speed-to-quality tradeoff regardless of hardware tier.
When you want to attach an image to a conversation, switch to a vision model like qwen2.5vl:7b or gemma3:4b. Text-only models reject image input.
Pulling and running a model
Use ollama pull to download a model and ollama run to test it interactively:
ollama pull qwen3:8b
ollama run qwen3:8b
The first pull downloads several gigabytes, so run this on a decent connection. After that, the model lives on disk at ~/.ollama/models/ and launches instantly.
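For setup scripts it can be handy to pull a model only when it is not already on disk. The sketch below parses the first column of ollama list for that; ensure_model is a hypothetical helper name, and the parsing assumes the usual layout of a header row followed by one model per line with the name first.

```shell
# Pull a model only if `ollama list` does not already show it.
# ensure_model is a hypothetical helper, not an Ollama command.
ensure_model() {
  local name="$1"
  # `ollama list` shows names with a tag, so an untagged name is matched as :latest
  case "$name" in
    *:*) ;;
    *) name="$name:latest" ;;
  esac
  # Skip the header row (NR > 1) and compare whole names exactly
  if ollama list 2>/dev/null | awk 'NR > 1 { print $1 }' | grep -qxF "$name"; then
    echo "already present: $name"
  else
    ollama pull "$name"
  fi
}

# Example call: ensure_model qwen3:8b
```

The function is only defined here, not called, so sourcing the snippet does not start a multi-gigabyte download by accident.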
A good starting set for most daily-use scenarios:
ollama pull qwen3:8b            # daily driver, best all-rounder
ollama pull qwen2.5-coder:7b    # coding tasks
ollama pull qwen2.5vl:7b        # vision and text
ollama pull gemma3:4b           # lightweight vision alternative
ollama pull nomic-embed-text    # embeddings and RAG
VS Code Copilot integration
VS Code Copilot can use a local Ollama server as a model provider. The setup is straightforward, but there is one catch: Copilot reads each model’s maximum reported context size and may allocate the full window upfront.
| Model | Reported max context | KV cache cost at max |
|---|---|---|
| qwen3:8b | 41K tokens | ~4 GB |
| qwen2.5vl:7b | 128K tokens | ~16 GB |
| qwen2.5-coder:7b | 33K tokens | ~3 GB |
OLLAMA_CONTEXT_LENGTH=4096 sets a global default, but Copilot does not always respect it in API requests. The reliable fix is a Modelfile: a small config file that bakes a capped context window into a named model variant.
Create capped model variants
A Modelfile is a plain text file that tells Ollama how to build a named variant from an existing base. The two fields that matter here are FROM (the base model to derive from) and PARAMETER num_ctx (the context window to enforce):
FROM qwen3:8b
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
temperature controls how much variation the model introduces when generating a response. 0.7 is a reasonable general-purpose default: creative enough to avoid repetitive output, focused enough to stay on topic. The coder variant uses 0.2 because code generation benefits from more predictable output: there is usually one right answer, not several equally valid variations.
Running ollama create <name> -f <Modelfile> registers that file as a new named model. No additional data is downloaded: Ollama references the base model already on disk with the specified parameters baked in.
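Since a Modelfile is only a few lines of text, it is easy to generate one from a base model name and a context size. The helper below is a sketch: make_modelfile is a hypothetical name, and it emits only the FROM, num_ctx, and optional temperature lines discussed above.

```shell
# Print a Modelfile that caps the context window of an existing base model.
# make_modelfile is a hypothetical helper; it writes to stdout only.
make_modelfile() {
  local base="$1" num_ctx="$2" temperature="${3:-}"
  printf 'FROM %s\n' "$base"
  printf 'PARAMETER num_ctx %s\n' "$num_ctx"
  # temperature is optional; leave it out to keep the base model's default
  if [ -n "$temperature" ]; then
    printf 'PARAMETER temperature %s\n' "$temperature"
  fi
}

make_modelfile qwen2.5-coder:7b 4096 0.2
```

Redirecting the output to a file and passing that file to ollama create with -f registers the variant, in the same way as a hand-written Modelfile.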
The following creates all four variants in one pass:
mkdir -p ~/ollama-models
printf 'FROM qwen3:8b\nPARAMETER num_ctx 4096\nPARAMETER temperature 0.7\n' \
  > ~/ollama-models/Modelfile.qwen3-fast
printf 'FROM qwen2.5-coder:7b\nPARAMETER num_ctx 4096\nPARAMETER temperature 0.2\n' \
  > ~/ollama-models/Modelfile.coder-fast
printf 'FROM qwen2.5vl:7b\nPARAMETER num_ctx 4096\n' \
  > ~/ollama-models/Modelfile.vision-fast
printf 'FROM gemma3:4b\nPARAMETER num_ctx 4096\n' \
  > ~/ollama-models/Modelfile.gemma-fast
ollama create qwen3-fast -f ~/ollama-models/Modelfile.qwen3-fast
ollama create coder-fast -f ~/ollama-models/Modelfile.coder-fast
ollama create vision-fast -f ~/ollama-models/Modelfile.vision-fast
ollama create gemma-fast -f ~/ollama-models/Modelfile.gemma-fast
Connect to Ollama and select a model
To wire VS Code Copilot to a local Ollama server, add this to your VS Code settings.json:
"github.copilot.chat.ollama.endpoint": "http://localhost:11434"
You can also do this through the UI: open Copilot Chat (Cmd+Shift+I on macOS), click the model picker dropdown at the top of the chat panel, and choose “Manage Models”. VS Code discovers all models running on localhost:11434 automatically once Ollama is running.
Once connected, the capped variants appear in the picker alongside any cloud models. Switch to qwen3-fast, coder-fast, vision-fast, or gemma-fast depending on the task. After starting a conversation, confirm the model loaded:
ollama ps
Copilot CLI
GitHub Copilot has a standalone CLI for the terminal. Install it via Homebrew:
brew install copilot-cli
Once installed, run copilot from any project directory. On first launch it asks you to trust the folder and log in to GitHub. You type prompts directly in the terminal and Copilot can read, modify, and run files in the current directory. It supports plan mode (Shift+Tab to toggle), custom agents, and MCP servers.
You can point it at Ollama to use local models rather than GitHub’s cloud. The quickest way is:
ollama launch copilot
This opens a model selector populated from Ollama’s library. To specify a model directly:
ollama launch copilot --model qwen3:8b
For manual wiring, set the Ollama endpoint via environment variables before running copilot:
export COPILOT_PROVIDER_BASE_URL=http://localhost:11434/v1
export COPILOT_PROVIDER_API_KEY=
export COPILOT_PROVIDER_WIRE_API=responses
export COPILOT_MODEL=qwen3:8b
One caveat: Copilot CLI works best with a generous context window. The Ollama docs recommend at least 64K tokens, so the 4K capped variants created above are too small for it. Use the base models directly and raise OLLAMA_CONTEXT_LENGTH to 32768 or higher, as close to the recommended 64K (65536) as your memory allows, when running Copilot CLI sessions.
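A small guard can catch the case where a CLI session is started with the 4K default still in place. This is a sketch only: ctx_ok_for_cli is a hypothetical name, and it checks the value visible in the current shell, whereas the server uses whatever value it was started with.

```shell
# Succeeds when the given context length meets a minimum for Copilot CLI sessions.
# ctx_ok_for_cli is a hypothetical helper; the 32768 default mirrors the floor
# suggested in this guide.
ctx_ok_for_cli() {
  local ctx="${1:-0}" min="${2:-32768}"
  if [ "$ctx" -ge "$min" ]; then
    echo "ok: context length $ctx"
  else
    echo "too small: context length $ctx, need at least $min"
    return 1
  fi
}

# Check the value exported in this shell, falling back to the 4096 used earlier
ctx_ok_for_cli "${OLLAMA_CONTEXT_LENGTH:-4096}" || true
```

Passing 65536 as the second argument makes the check match the 64K recommendation instead of the lower floor.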
Wrapping up
This covers the full stack: Ollama installed and tuned, a model set selected for different use cases, VS Code Copilot wired to local variants, and the standalone Copilot CLI pointed at Ollama. On 16 GB the memory settings and capped context variants make local inference genuinely practical for everyday coding and chat work, not just a curiosity.
A local model handles tasks that fit in a 4K context window without touching any external service. For longer context, heavier reasoning, or the times a cloud model simply performs better, the paid providers are still there. The difference is that reaching for them is now a deliberate choice rather than the default.
Keeping up with new model releases is a single ollama pull command. Ollama fetches only changed layers, so updates stay fast even at multi-GB model sizes.
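To refresh everything in one go, one option is to loop over the names that ollama list prints. The sketch below assumes the usual output layout (header row, name in the first column); update_all_models is a hypothetical helper name. Locally created variants such as qwen3-fast have no upstream copy, so I would expect their pull to fail, and the loop reports that and moves on.

```shell
# Re-pull every model that `ollama list` reports.
# update_all_models is a hypothetical helper, not an Ollama command.
update_all_models() {
  ollama list | awk 'NR > 1 { print $1 }' | while read -r model; do
    # </dev/null keeps the pull from consuming the list being read by the loop.
    # A failed pull (for example a local-only variant) is reported and skipped.
    ollama pull "$model" </dev/null || echo "skipped: $model"
  done
}

# Example call: update_all_models
```

Variants built with ollama create reference the base model's layers, so they should pick up a refreshed base only after being created again.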
I’m also working on a dedicated machine for local AI inference: custom hardware that removes the unified memory constraint entirely. I’ll write that up once it’s running. If you’re building something similar or have a setup you’re happy with, drop a comment below or subscribe via the form at the end of this page to catch that post when it lands.