One-Click Automated Installer (macOS)
An interactive, color-coded bash script that automates local LLM setup on macOS. It installs an inference engine and configures models in roughly the 4B–12B parameter range, sized for a 16GB Mac.
🚀 Quick Start
Run the following one-liner in your terminal to start the interactive installation:
curl -fsSL https://privy.kenelite.com/engine/install.sh | bash
Key Benefits
- Automatic Detection: Identifies the macOS architecture (Intel vs. Apple Silicon) and checks dependencies.
- Engine Autoinstall: Installs Ollama, LM Studio, or Rapid-MLX automatically.
- Interactive Setup: Configures general chat, reasoning (DeepSeek-R1), embedding, and translation models in one flow.
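The architecture check in the first bullet could look roughly like the following Python sketch. The installer itself is a bash script whose internals are not shown here, so the helper name `engines_for` and its return values are illustrative assumptions. Only the rule that Rapid-MLX requires Apple Silicon comes from this document.

```python
import platform


def engines_for(machine: str) -> list[str]:
    """Hypothetical helper: engines that make sense for a CPU architecture.

    On macOS, platform.machine() reports "arm64" on Apple Silicon
    and "x86_64" on Intel.
    """
    engines = ["Ollama", "LM Studio"]
    if machine == "arm64":
        # Rapid-MLX builds on Apple's MLX framework, so it is Apple Silicon only.
        engines.append("Rapid-MLX")
    return engines


if __name__ == "__main__":
    arch = platform.machine()
    print(arch, "->", engines_for(arch))
```

Note that a terminal running under Rosetta reports `x86_64` even on Apple Silicon hardware, so a real check may need to account for that.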
🛠️ Supported Inference Engines
The installer script supports downloading and configuring three major local inference engines:
- Ollama (Highly Recommended): A lightweight, headless background service with low memory consumption and a vast model ecosystem. API endpoint: http://localhost:11434.
- LM Studio: A feature-rich desktop GUI app supporting model discovery, an interactive playground, and a local OpenAI-compatible server. API endpoint: http://localhost:1234/v1.
- Rapid-MLX (Apple Silicon Macs Only): Highly optimized local serving engine powered by Apple's MLX framework. Delivers 2–4x faster generation speeds than alternatives and includes prompt caching. API endpoint: http://localhost:8000/v1.
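The three endpoints above can be kept in a small lookup table and probed from Python. This is a minimal sketch: the dictionary keys and function names are hypothetical, and the `/models` route is assumed to exist on every engine because each one is described as OpenAI-compatible. Verify it against the engine you actually installed.

```python
import urllib.request

# Base URLs as listed above; the keys are illustrative names, not CLI identifiers.
ENGINE_URLS = {
    "ollama": "http://localhost:11434",
    "lmstudio": "http://localhost:1234/v1",
    "rapid-mlx": "http://localhost:8000/v1",
}


def openai_base_url(engine: str) -> str:
    """Return the OpenAI-compatible base URL for an engine."""
    url = ENGINE_URLS[engine]
    # Ollama's listed endpoint is its native API root; the
    # OpenAI-compatible routes live under /v1.
    return url if url.endswith("/v1") else url + "/v1"


def is_up(engine: str, timeout: float = 2.0) -> bool:
    """True if GET <base>/models answers with HTTP 200 (assumed route)."""
    try:
        url = openai_base_url(engine) + "/models"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

Calling `is_up("ollama")` before pointing a tool at the server gives a quick yes/no answer without parsing any response body.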
📦 Model Scenarios & Recommendations (16GB RAM Optimized)
To keep a 16GB Mac responsive and avoid heavy swapping, the default model recommendations stay mostly within the 4B–12B parameter range, with a few smaller options.
| Scenario | Ollama Model | LM Studio Model | Rapid-MLX Model |
|---|---|---|---|
| Scenario 1: General Chat | qwen3:8b / llama3.1:8b / gemma3:12b | Qwen3-8B / Llama-3.1-8B / gemma-3-12b | qwen3.5-4b / llama-3.1-8b / gemma-4-4b |
| Scenario 2: Reasoning Model | deepseek-r1:7b / deepseek-r1:8b / deepseek-r1:1.5b | DeepSeek-R1-Distill-Qwen-7B / DeepSeek-R1-0528-Qwen3-8B | deepseek-r1-distill-qwen-7b / deepseek-r1-distill-llama-8b |
| Scenario 3: Embedding | qwen3-embedding:4b / nomic-embed-text / bge-m3 | Qwen3-Embedding-4B / nomic-embed-text-v1.5 | nomic-embed-text |
| Scenario 4: Translation | qwen3:8b / nllb / gemma3:1b | Qwen3-8B / gemma-3-1b | qwen3.5-4b / qwen3.5-9b |
⚠️ Note: Meta Llama 3.3 is released only as a 70B parameter model, which is too large to run smoothly on 16GB Macs. The installer automatically replaces this option with Llama 3.1 8B.
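A back-of-the-envelope calculation shows why the 4B–12B range fits a 16GB Mac and a 70B model does not. The sketch below assumes 4-bit quantized weights, which this document does not state, and it counts only the weights: the KV cache, runtime overhead, and memory used by macOS itself come on top. The function name is illustrative.

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (4, 8, 12, 70):
    print(f"{size}B at 4-bit: ~{approx_weight_gb(size):.1f} GB")
# 4B at 4-bit: ~2.0 GB
# 8B at 4-bit: ~4.0 GB
# 12B at 4-bit: ~6.0 GB
# 70B at 4-bit: ~35.0 GB
```

Under these assumptions a 12B model leaves room for the system and other apps, while 70B weights alone exceed the machine's total memory.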
💡 Development & Integration
Once installed, each engine runs locally and exposes an OpenAI-compatible API. You can configure them in your projects or third-party tools (like Cursor, Claude Code, or Aider):
1. Test Connection via Terminal
# Test Ollama API server status
curl http://localhost:11434/api/tags
# Test LM Studio API server status (make sure server is running)
curl http://localhost:1234/v1/models
2. Python Integration Example
from openai import OpenAI
# Point to your local inference server (e.g. Rapid-MLX on port 8000)
client = OpenAI(
    # Use http://localhost:11434/v1 for Ollama or http://localhost:1234/v1 for LM Studio
    base_url="http://localhost:8000/v1",
    api_key="none",  # No key is required for local instances
)
response = client.chat.completions.create(
    model="qwen3.5-4b",  # Specify the model you downloaded
    messages=[
        {"role": "user", "content": "Please introduce LLMs in three sentences."}
    ],
)
print(response.choices[0].message.content)
3. Serve Models using Rapid-MLX
Rapid-MLX downloads models on demand. To download and serve a model, run the following command after the installer finishes:
# Serve the Qwen 3.5 4B model
rapid-mlx serve qwen3.5-4b
The first run will download files from Hugging Face automatically and serve the model at http://localhost:8000/v1.
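Because the first run has to download the model, the server may not answer right away. A client script can poll until it does. The sketch below is one way to do that: the function names are hypothetical, and the `/v1/models` route is an assumption based on the engine being OpenAI-compatible.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_ready(url: str, timeout: float = 2.0) -> bool:
    """True if a GET on the given URL returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Server not listening yet, timed out, or returned an error status.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay: float = 5.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Example usage (assumed route; needs `rapid-mlx serve ...` running in another terminal):
# url = "http://localhost:8000/v1/models"
# print("ready" if wait_until_ready(lambda: models_endpoint_ready(url)) else "gave up")
```

Passing the probe in as a callable keeps the retry loop independent of the engine, so the same loop works for the Ollama and LM Studio endpoints.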
❓ Frequently Asked Questions
- Q: What if the script complains about missing Homebrew?
A: The script automatically checks for Homebrew and prompts you to install it if missing. Enter your macOS administrator password when prompted.
- Q: Why is my system slow during model inference?
A: Avoid loading multiple large models into RAM simultaneously. In LM Studio, click "Unload Model" to free up RAM before loading a new one. Ollama will automatically unload models from memory after a few minutes of inactivity.
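If waiting for Ollama's idle timeout is not an option, its native API documents a `keep_alive` parameter, and sending a value of `0` is meant to unload the model immediately. The sketch below only builds the request. The helper name is hypothetical, and the behavior should be checked against your installed Ollama version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_unload_request(model: str) -> urllib.request.Request:
    """Build a POST to /api/generate asking Ollama to unload `model` right away."""
    payload = {"model": model, "keep_alive": 0}
    return urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it (requires a running Ollama server):
# urllib.request.urlopen(build_unload_request("qwen3:8b"), timeout=10).close()
```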