Local LLM Setup Guide

Set up a local AI engine for Privy apps. This guide covers Ollama, LM Studio, and Rapid-MLX, using Google Gemma 3 4B (gemma3:4b) or a comparable local model as an example. Use Ollama or LM Studio for broad platform support, and Rapid-MLX when you want an OpenAI-compatible local server optimized for Apple Silicon Macs.

⚡ One-Click Automated Installer (macOS)

Interactive installer for Ollama, LM Studio, or Rapid-MLX on macOS, with 16GB-friendly model picks (4B–7B). Run in Terminal:

curl -fsSL https://privy.kenelite.com/engine/install.sh | bash

Engines, model scenarios, API endpoints, and integration examples are covered in the Quick Start guide.

1. Ollama

Ollama provides a CLI and local API. Default port is 11434.

1.1 Install

  • macOS: Download from ollama.com/download/mac; the app auto-updates.
  • Windows: Download OllamaSetup.exe from ollama.com/download, or run in PowerShell:
    irm https://ollama.com/install.ps1 | iex
    Requires Windows 10 or later.
  • Linux: Run in a terminal:
    curl -fsSL https://ollama.com/install.sh | sh

1.2 Pull a model (e.g. Gemma 3 4B)

In Ollama this model is named gemma3:4b (the instruction-tuned weights are published on Hugging Face as google/gemma-3-4b-it).

ollama pull gemma3:4b

Then run a chat:

ollama run gemma3:4b
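
Once the model is pulled, apps talk to it over Ollama's local HTTP API rather than the interactive chat. The sketch below sends one non-streaming prompt to the /api/generate endpoint. The OLLAMA_URL variable and the two function names are illustrative helpers introduced for this example, not part of Ollama, and the body builder does not escape quotes inside the prompt.

```shell
# Base URL of the local Ollama API (default port 11434).
# OLLAMA_URL is a name made up for this sketch, not an Ollama setting.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build the JSON body for /api/generate.
# Caveat: quotes or backslashes in the prompt are not escaped here.
ollama_generate_body() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send a single prompt and print the raw JSON response.
ollama_generate() {
  curl -s "$OLLAMA_URL/api/generate" \
    -H "Content-Type: application/json" \
    -d "$(ollama_generate_body "$1" "$2")"
}

# Usage (requires a running Ollama server):
#   ollama_generate gemma3:4b "Say hello"
```

With "stream" set to false the server returns one JSON object instead of a stream of partial responses, which is easier to inspect by hand.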

1.3 Allow LAN access

By default Ollama listens only on 127.0.0.1 (localhost). To accept connections from other devices on your network, set OLLAMA_HOST=0.0.0.0 so it binds to all interfaces.

  • One-off (current terminal):
    OLLAMA_HOST=0.0.0.0 ollama serve
    If the Ollama app is already running, quit it first, then run the command above in a terminal.
  • macOS (persistent): If Ollama runs from a launchd plist (e.g. under ~/Library/LaunchAgents/ or Homebrew’s plist), edit that file and add inside its top-level <dict>:
    <key>EnvironmentVariables</key>
    <dict>
      <key>OLLAMA_HOST</key>
      <string>0.0.0.0</string>
    </dict>
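    Restart Ollama after saving. For the Ollama desktop app, Ollama’s own documentation sets the variable with launchctl setenv OLLAMA_HOST "0.0.0.0" and then restarts the app; that setting lasts only until the next reboot. Alternatively, skip both and run OLLAMA_HOST=0.0.0.0 ollama serve in a terminal when you need LAN access.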
  • Windows: Add a user or system environment variable OLLAMA_HOST = 0.0.0.0, then restart the Ollama app/service.
  • Linux (systemd): Edit /etc/systemd/system/ollama.service (or equivalent), add under [Service]:
    Environment="OLLAMA_HOST=0.0.0.0"
    Then run:
    sudo systemctl daemon-reload
    sudo systemctl restart ollama

For browser or cross-origin clients you may also set OLLAMA_ORIGINS=* (recommended only on a trusted LAN).

Other devices on the LAN can then use http://<your-machine-IP>:11434 (e.g. http://192.168.1.100:11434).
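
To confirm the LAN setup from another device, one option is to request the server's model list over the network. This is a minimal sketch: the function names are hypothetical, 192.168.1.100 is the example address from above, and /api/tags is the Ollama endpoint that lists locally available models.

```shell
# Build the base URL for an Ollama server on the LAN (port defaults to 11434).
ollama_lan_url() {
  printf 'http://%s:%s' "$1" "${2:-11434}"
}

# Ask the remote server for its model list.
# -f makes curl return a non-zero status on HTTP errors;
# --max-time 5 gives up quickly if the host is unreachable or firewalled.
ollama_lan_check() {
  curl -sf --max-time 5 "$(ollama_lan_url "$1" "${2:-}")/api/tags"
}

# Usage (run on a different device on the same network):
#   ollama_lan_check 192.168.1.100 && echo "reachable"
```

If the check fails while http://localhost:11434 works on the host itself, the usual suspects are OLLAMA_HOST not being applied to the running process or a firewall blocking the port.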

2. LM Studio

LM Studio offers a GUI and an OpenAI-compatible local API. It includes the lms CLI for downloading models, loading them, and running the server from the terminal. See LM Studio CLI docs for the full reference.

2.1 Install

  • macOS: Download from lmstudio.ai/download (Apple Silicon only), or:
    curl -fsSL https://lmstudio.ai/install.sh | bash
  • Windows: Download the installer from the same page, or PowerShell:
    irm https://lmstudio.ai/install.ps1 | iex
  • Linux: Download the AppImage or use the install script from the official site.

16GB+ RAM is recommended; on Windows, 4GB+ dedicated VRAM is recommended. You must run LM Studio at least once before the lms CLI is available.

2.2 Download and load a model (e.g. Gemma 3 4B)

GUI: Open LM Studio, search for Gemma 3 4B or google/gemma-3-4b in the discovery view, choose a quantization (e.g. Q4_K_M), and download. Then load the model in the Local Server / Developer tab.

CLI: Use lms get to search and download models, lms ls to list models on disk, and lms load to load a model (e.g. with --gpu=max or --context-length=8192). Example:

lms get google/gemma-3-4b
lms load google/gemma-3-4b --identifier="gemma3-4b"

Start the server with lms server start; stop it with lms server stop. Custom port: lms server start --port 3000. For web or cross-origin clients, add --cors (use only on a trusted network).
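
With the server running, a quick way to test it is an OpenAI-style chat completion request. The sketch below assumes the server is on port 1234 and that the model was loaded with the identifier gemma3-4b as in the example above. LMSTUDIO_URL and the function names are illustrative, and the body builder does not escape quotes inside the message.

```shell
# Base URL of LM Studio's OpenAI-compatible API (assumes port 1234).
# LMSTUDIO_URL is a name made up for this sketch, not an LM Studio setting.
LMSTUDIO_URL="${LMSTUDIO_URL:-http://localhost:1234/v1}"

# Build an OpenAI-style chat completion body with one user message.
# Caveat: quotes or backslashes in the message are not escaped here.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# Send the request and print the raw JSON response.
lmstudio_chat() {
  curl -s "$LMSTUDIO_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_body "$1" "$2")"
}

# Usage (requires a loaded model and a running server):
#   lmstudio_chat gemma3-4b "Say hello"
```

If you started the server on a custom port, set LMSTUDIO_URL accordingly before calling the function, e.g. http://localhost:3000/v1.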

2.3 Allow LAN access

  • GUI: In LM Studio’s server settings, enable “Serve on Local Network”. The server will bind to your machine’s LAN IP so other devices on the same network can reach it. See LM Studio’s “Serve on Local Network” documentation for details.
  • CLI: Bind to all interfaces so the server is reachable on the LAN:
    lms server start --bind 0.0.0.0
    Or set the environment variable LMS_SERVER_HOST=0.0.0.0 before starting the server.

Default port is usually 1234 (or the last used port). Use http://<your-machine-IP>:1234 as the API base URL from other devices on the LAN.

3. Rapid-MLX

Rapid-MLX is a local AI engine for Apple Silicon Macs. It exposes an OpenAI-compatible API, so apps that support a custom OpenAI base URL can point to http://localhost:8000/v1. Use it when you are running Privy apps on or near a Mac with Apple Silicon and want a fast local model server.

3.1 Install

  • macOS Apple Silicon: Homebrew is the recommended install path:
    brew install raullenchai/rapid-mlx/rapid-mlx
  • pip: Requires Python 3.10 or later:
    pip install rapid-mlx
  • One-line installer: Auto-setup script from the Rapid-MLX project:
    curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

3.2 Serve a model

Rapid-MLX uses short model aliases. A practical starting point on a 16 GB Apple Silicon Mac is qwen3.5-4b; run rapid-mlx models to list available aliases.

rapid-mlx serve qwen3.5-4b

The first run downloads the model. When the server is ready, use http://localhost:8000/v1 as the OpenAI-compatible base URL and "default" as the model name.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'

3.3 LAN access

Rapid-MLX serves on port 8000 by default and its server flag reference lists --host with default 0.0.0.0. On a trusted LAN, other devices can use http://<your-machine-IP>:8000/v1 if the macOS firewall allows inbound access.

For a stricter local-only setup, bind to localhost:

rapid-mlx serve qwen3.5-4b --host 127.0.0.1 --port 8000

If you expose Rapid-MLX beyond your own Mac, consider setting an API key and limiting access to trusted devices only.

4. Summary

| Item | Ollama | LM Studio | Rapid-MLX |
|---|---|---|---|
| Example model | gemma3:4b | Gemma 3 4B (Hugging Face) | qwen3.5-4b or another Rapid-MLX alias |
| Pull / download | ollama pull gemma3:4b | GUI or lms get; load with lms load | rapid-mlx serve qwen3.5-4b |
| Default port | 11434 | 1234 | 8000 |
| LAN access | OLLAMA_HOST=0.0.0.0 | Enable “Serve on Local Network” or --bind 0.0.0.0 | Default host is 0.0.0.0; use --host 127.0.0.1 for local-only |
| Best fit | Simple cross-platform local LLM server | Desktop GUI with OpenAI-compatible local API | Fast Apple Silicon OpenAI-compatible server |
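
When scripting against more than one engine, the default ports from the summary above can be captured in a small lookup helper. This is a sketch, not part of any of the three tools: engine_base_url and its engine keywords are hypothetical names, and the /v1 suffix for LM Studio and Rapid-MLX is the path where their OpenAI-compatible endpoints live.

```shell
# Print the API base URL for an engine, using the default ports from this guide.
# Usage: engine_base_url <ollama|lmstudio|rapid-mlx> [host]
# The host defaults to localhost; pass a LAN IP to target another machine.
engine_base_url() {
  case "$1" in
    ollama)    printf 'http://%s:11434\n'   "${2:-localhost}" ;;
    lmstudio)  printf 'http://%s:1234/v1\n' "${2:-localhost}" ;;
    rapid-mlx) printf 'http://%s:8000/v1\n' "${2:-localhost}" ;;
    *)
      # Unknown keyword: report on stderr and signal failure to the caller.
      printf 'unknown engine: %s\n' "$1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   engine_base_url lmstudio 192.168.1.100
```

The helper only reflects defaults; if you changed a port (for example with lms server start --port 3000), adjust the corresponding line.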

Related links

This page is a public LLM setup reference from the Privy product site, for use with PrivyPDF, PrivyFeed, PrivaTranslate, PrivyApiStudio, and other apps that use a local LLM.