One-Click Automated Installer (macOS)

An interactive, color-coded bash script that automates local LLM setup on macOS. It installs an inference engine and sets up models in the 4B–12B parameter range, sized for a 16GB Mac.

🚀 Quick Start

Run the following one-liner in your terminal to start the interactive installation:

curl -fsSL https://privy.kenelite.com/engine/install.sh | bash

Key Benefits

Automatic Detection: Identifies the macOS architecture (Intel vs. Apple Silicon) and checks dependencies.
Engine Autoinstall: Installs Ollama, LM Studio, or Rapid-MLX automatically.
Interactive Setup: Configures general chat, reasoning (DeepSeek-R1), embedding, and translation models in one flow.
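
To illustrate the architecture check, here is a minimal Python sketch. The installer itself is a bash script and its actual detection logic is not shown here, so the function names `classify_mac_arch` and `rapid_mlx_supported` are hypothetical and the approach (reading the machine string) is only an assumption about how such a check could work.

```python
import platform


def classify_mac_arch(machine: str) -> str:
    """Map a machine string (as returned by platform.machine()) to a label."""
    machine = machine.lower()
    if machine in ("arm64", "aarch64"):
        return "apple-silicon"
    if machine in ("x86_64", "amd64"):
        return "intel"
    return "unknown"


def rapid_mlx_supported(machine: str) -> bool:
    """Rapid-MLX is listed as Apple Silicon only, so gate on the architecture."""
    return classify_mac_arch(machine) == "apple-silicon"


if __name__ == "__main__":
    machine = platform.machine()
    label = classify_mac_arch(machine)
    print(f"{machine}: {label}, Rapid-MLX supported: {rapid_mlx_supported(machine)}")
```

Note that a Python interpreter running under Rosetta can report `x86_64` on an Apple Silicon Mac, so a check like this reflects the running process rather than the hardware.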

🛠️ Supported Inference Engines

The installer script supports downloading and configuring three major local inference engines:

Ollama (Highly Recommended): A lightweight, headless background service with low memory consumption and a vast model ecosystem. API endpoint: http://localhost:11434.
LM Studio: A feature-rich desktop GUI app supporting model discovery, an interactive playground, and a local OpenAI-compatible server. API endpoint: http://localhost:1234/v1.
Rapid-MLX (Apple Silicon Macs Only): A highly optimized local serving engine powered by Apple's MLX framework. It delivers 2–4x faster generation speeds than the alternatives and includes prompt caching. API endpoint: http://localhost:8000/v1.
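
If you switch between engines in code, a small lookup keeps the endpoints in one place. This is a sketch only: the helper name `openai_base_url` and the engine keys are made up for illustration. Note that Ollama's native address is http://localhost:11434, while its OpenAI-compatible routes sit under `/v1`, which is the form an OpenAI client expects.

```python
# Base URLs for OpenAI-compatible clients, taken from the endpoints listed above.
# The dictionary keys are illustrative names, not identifiers used by the installer.
ENGINE_BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lm-studio": "http://localhost:1234/v1",
    "rapid-mlx": "http://localhost:8000/v1",
}


def openai_base_url(engine: str) -> str:
    """Return the OpenAI-compatible base URL for a given engine name."""
    key = engine.strip().lower()
    try:
        return ENGINE_BASE_URLS[key]
    except KeyError:
        known = ", ".join(sorted(ENGINE_BASE_URLS))
        raise ValueError(f"unknown engine {engine!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for name in ENGINE_BASE_URLS:
        print(f"{name:10s} {openai_base_url(name)}")
```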

📦 Model Scenarios & Recommendations (16GB RAM Optimized)

To keep local inference smooth on a 16GB Mac without pushing the system into heavy swapping, the default model recommendations are limited to roughly the 4B–12B parameter range.

Scenario	Ollama Model	LM Studio Model	Rapid-MLX Model
Scenario 1: General Chat	`qwen3:8b` / `llama3.1:8b` / `gemma3:12b`	`Qwen3-8B` / `Llama-3.1-8B` / `gemma-3-12b`	`qwen3.5-4b` / `llama-3.1-8b` / `gemma-4-4b`
Scenario 2: Reasoning Model	`deepseek-r1:7b` / `deepseek-r1:8b` / `deepseek-r1:1.5b`	`DeepSeek-R1-Distill-Qwen-7B` / `DeepSeek-R1-0528-Qwen3-8B`	`deepseek-r1-distill-qwen-7b` / `deepseek-r1-distill-llama-8b`
Scenario 3: Embedding	`qwen3-embedding:4b` / `nomic-embed-text` / `bge-m3`	`Qwen3-Embedding-4B` / `nomic-embed-text-v1.5`	`nomic-embed-text`
Scenario 4: Translation	`qwen3:8b` / `nllb` / `gemma3:1b`	`Qwen3-8B` / `gemma-3-1b`	`qwen3.5-4b` / `qwen3.5-9b`

⚠️ Note: Meta Llama 3.3 is released only as a 70B parameter model, which is too large to run smoothly on a 16GB Mac. When this option is selected, the installer automatically substitutes Llama 3.1 8B.
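
The sizing argument behind the 4B–12B range and the Llama 3.3 note can be checked with rough arithmetic: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below assumes 4-bit quantization for the comparison. That is an assumption, since this document does not state which quantization the installer downloads, and the estimate ignores the KV cache, runtime overhead, and memory used by macOS itself.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # 4-bit is an assumed quantization level; 16-bit is shown for contrast.
    for name, size in [("4B", 4), ("8B", 8), ("12B", 12), ("70B", 70)]:
        q4 = weights_gb(size, 4)
        fp16 = weights_gb(size, 16)
        print(f"{name:>4}: ~{q4:.0f} GB at 4-bit, ~{fp16:.0f} GB at 16-bit")
```

Under these assumptions a 12B model needs about 6 GB for weights, which leaves headroom on a 16GB machine, while a 70B model needs about 35 GB even at 4-bit.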

💡 Development & Integration

Once installed, each engine runs locally and exposes an OpenAI-compatible API. You can point your own projects or third-party tools (such as Cursor, Claude Code, or Aider) at these endpoints:

1. Test Connection via Terminal

# Test Ollama API server status
curl http://localhost:11434/api/tags

# Test LM Studio API server status (make sure server is running)
curl http://localhost:1234/v1/models
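
The two endpoints above return differently shaped JSON: Ollama's `/api/tags` lists models under a `models` array with a `name` field, while the OpenAI-style `/v1/models` uses a `data` array with an `id` field. The sketch below reads either shape. The helper names are hypothetical, and the sample payloads are trimmed illustrations rather than captured server output.

```python
import json
import urllib.request


def model_names(payload: dict) -> list[str]:
    """Extract model names from either response shape."""
    if "models" in payload:  # Ollama: GET /api/tags
        return [m["name"] for m in payload["models"]]
    if "data" in payload:  # OpenAI-compatible: GET /v1/models
        return [m["id"] for m in payload["data"]]
    return []


def fetch_model_names(url: str, timeout: float = 3.0) -> list[str]:
    """Fetch and parse a model list; raises urllib.error.URLError if the server is down."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    # Illustrative payloads only; real responses carry more fields.
    ollama_sample = {"models": [{"name": "qwen3:8b"}]}
    openai_sample = {"object": "list", "data": [{"id": "qwen3.5-4b", "object": "model"}]}
    print(model_names(ollama_sample))
    print(model_names(openai_sample))
```

With a server running, `fetch_model_names("http://localhost:11434/api/tags")` or `fetch_model_names("http://localhost:1234/v1/models")` returns the installed model names.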

2. Python Integration Example

from openai import OpenAI

# Point to your local inference server (e.g. Rapid-MLX on port 8000)
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Use http://localhost:11434/v1 for Ollama, http://localhost:1234/v1 for LM Studio
    api_key="none"  # Any placeholder works: the client needs a value, but local instances require no key
)

response = client.chat.completions.create(
    model="qwen3.5-4b",  # Specify the model you downloaded
    messages=[
        {"role": "user", "content": "Please introduce LLMs in three sentences."}
    ]
)

print(response.choices[0].message.content)

3. Serve Models using Rapid-MLX

Rapid-MLX downloads models on demand. To download and serve a model, run the following CLI command after the installer finishes:

# Serve the Qwen 3.5 4B model
rapid-mlx serve qwen3.5-4b

The first run will download files from Hugging Face automatically and serve the model at http://localhost:8000/v1.
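
Because the first run has to download the model, the server may take a while to start answering. A client script can poll until it is ready, as in the sketch below. The helper names are hypothetical, and probing `/v1/models` assumes Rapid-MLX serves that OpenAI-style route, which this document does not confirm.

```python
import time
import urllib.request
from typing import Callable


def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to the URL succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 600.0,
                     interval: float = 2.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stub probe so the example runs without a server: succeeds on the third call.
    attempts = {"count": 0}

    def stub_probe() -> bool:
        attempts["count"] += 1
        return attempts["count"] >= 3

    ready = wait_until_ready(stub_probe, timeout=5.0, interval=0.1)
    print(f"ready={ready} after {attempts['count']} attempts")
```

Against a real server, pass `lambda: server_is_up("http://localhost:8000/v1/models")` as the probe.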

❓ Frequently Asked Questions

Q: What if the script complains about missing Homebrew?
A: The script checks for Homebrew automatically and prompts you to install it if it is missing. Enter your macOS administrator password when prompted.

Q: Why is my system slow during model inference?
A: Avoid loading multiple large models into RAM at the same time. In LM Studio, click "Unload Model" to free up RAM before loading a new one. Ollama automatically unloads a model from memory after about five minutes of inactivity by default.
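
If you would rather free memory in Ollama right away than wait for the idle timeout, its API accepts a `keep_alive` field on generate requests, and a value of `0` asks the server to unload the model immediately. The sketch below builds such a request. The helper names are hypothetical, and the behavior is worth confirming against the Ollama version you have installed.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_unload_request(model: str) -> urllib.request.Request:
    """Build a POST request that asks Ollama to unload the given model."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str, timeout: float = 10.0) -> int:
    """Send the unload request and return the HTTP status code."""
    with urllib.request.urlopen(build_unload_request(model), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only builds and prints the request; call unload_model() with Ollama running.
    req = build_unload_request("qwen3:8b")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```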