One-Click Automated Installer (macOS)

An interactive, color-coded bash script that automates local LLM setup on macOS. It installs inference engines and configures models in the 4B–12B parameter range, sized for a 16GB Mac.

🚀 Quick Start

Run the following one-liner in your terminal to start the interactive installation:

curl -fsSL https://privy.kenelite.com/engine/install.sh | bash

Key Benefits

  • Automatic Detection: Identifies the macOS architecture (Intel vs. Apple Silicon) and checks dependencies.
  • Engine Autoinstall: Installs Ollama, LM Studio, or Rapid-MLX automatically.
  • Interactive Setup: Configures general chat, reasoning (DeepSeek-R1), embedding, and translation models in one flow.
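As an illustration of the architecture check, here is a minimal Python sketch. The installer itself is a bash script, so this is not its actual logic; the function names `classify_mac_arch` and `rapid_mlx_supported` are hypothetical. It assumes `platform.machine()` reports `arm64` on Apple Silicon and `x86_64` on Intel, and note that a Python interpreter running under Rosetta may also report `x86_64`.

```python
import platform


def classify_mac_arch(machine: str) -> str:
    """Map a platform.machine() string to a coarse CPU family."""
    machine = machine.lower()
    if machine in ("arm64", "aarch64"):
        return "apple_silicon"
    if machine in ("x86_64", "amd64", "i386"):
        return "intel"
    return "unknown"


def rapid_mlx_supported(machine: str) -> bool:
    # This README lists Rapid-MLX as Apple Silicon only.
    return classify_mac_arch(machine) == "apple_silicon"


if __name__ == "__main__":
    detected = platform.machine()
    print(f"Architecture: {classify_mac_arch(detected)}")
    print(f"Rapid-MLX available: {rapid_mlx_supported(detected)}")
```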

🛠️ Supported Inference Engines

The installer script supports downloading and configuring three major local inference engines:

  • Ollama (Highly Recommended): A lightweight, headless background service with low memory consumption and a vast model ecosystem. API endpoint: http://localhost:11434.
  • LM Studio: A feature-rich desktop GUI app supporting model discovery, an interactive playground, and a local OpenAI-compatible server. API endpoint: http://localhost:1234/v1.
  • Rapid-MLX (Apple Silicon Macs Only): A highly optimized local serving engine powered by Apple's MLX framework. It delivers 2–4x faster generation than the alternatives above and includes prompt caching. API endpoint: http://localhost:8000/v1.

📦 Model Scenarios & Recommendations (16GB RAM Optimized)

To keep inference smooth on a 16GB Mac and avoid heavy swapping, the default model recommendations are scoped to roughly the 4B–12B parameter range.

| Scenario | Ollama Model | LM Studio Model | Rapid-MLX Model |
|---|---|---|---|
| Scenario 1: General Chat | qwen3:8b / llama3.1:8b / gemma3:12b | Qwen3-8B / Llama-3.1-8B / gemma-3-12b | qwen3.5-4b / llama-3.1-8b / gemma-4-4b |
| Scenario 2: Reasoning Model | deepseek-r1:7b / deepseek-r1:8b / deepseek-r1:1.5b | DeepSeek-R1-Distill-Qwen-7B / DeepSeek-R1-0528-Qwen3-8B | deepseek-r1-distill-qwen-7b / deepseek-r1-distill-llama-8b |
| Scenario 3: Embedding | qwen3-embedding:4b / nomic-embed-text / bge-m3 | Qwen3-Embedding-4B / nomic-embed-text-v1.5 | nomic-embed-text |
| Scenario 4: Translation | qwen3:8b / nllb / gemma3:1b | Qwen3-8B / gemma-3-1b | qwen3.5-4b / qwen3.5-9b |

⚠️ Note: Meta Llama 3.3 is only released as a 70B parameter model, which is too large to run smoothly on 16GB Macs. The installer automatically substitutes this option with the highly optimized Llama 3.1 8B.
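A rough back-of-the-envelope calculation shows why the 4B–12B range fits a 16GB Mac while a 70B model does not. The sketch below estimates the memory needed for the weights alone as parameters × bits per weight / 8. The 4-bit quantization is an assumption (this README does not state which quantization the installer selects), and the estimate ignores the KV cache, runtime overhead, and memory used by macOS itself, so real usage will be higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    Assumes every weight is stored at `bits_per_weight` bits (4-bit by default,
    which is an assumption, not something the installer is known to enforce).
    """
    return params_billion * bits_per_weight / 8


for name, size in [("Llama 3.1 8B", 8), ("Gemma 3 12B", 12), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{weight_memory_gb(size):.0f} GB at 4-bit")
# Llama 3.1 8B: ~4 GB at 4-bit
# Gemma 3 12B: ~6 GB at 4-bit
# Llama 3.3 70B: ~35 GB at 4-bit
```

Even at 4-bit, the 70B weights alone would need more than twice the machine's total RAM, which is consistent with the substitution described in the note above.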

💡 Development & Integration

Once installed, each engine runs locally and exposes an OpenAI-compatible API. You can configure them in your projects or third-party tools (like Cursor, Claude Code, or Aider):

1. Test Connection via Terminal

# Test Ollama API server status
curl http://localhost:11434/api/tags

# Test LM Studio API server status (make sure server is running)
curl http://localhost:1234/v1/models
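The same checks can be scripted in Python with only the standard library. This is a minimal sketch: the Ollama and LM Studio URLs are the ones used in the curl commands above, while the Rapid-MLX path is an assumption derived from its OpenAI-compatible base URL. The names `ENGINE_PROBES`, `is_reachable`, and `report` are hypothetical.

```python
import urllib.request

# Status URLs per engine. The Rapid-MLX path is assumed, not confirmed.
ENGINE_PROBES = {
    "Ollama": "http://localhost:11434/api/tags",
    "LM Studio": "http://localhost:1234/v1/models",
    "Rapid-MLX": "http://localhost:8000/v1/models",  # assumed path
}


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


def report() -> None:
    """Print one status line per engine."""
    for name, url in ENGINE_PROBES.items():
        state = "up" if is_reachable(url) else "not reachable"
        print(f"{name:10s} {state}")
```

Calling `report()` prints which of the three servers currently respond; an engine that is installed but not started shows as not reachable.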

2. Python Integration Example

from openai import OpenAI

# Point to your local inference server (e.g. Rapid-MLX on port 8000)
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Use http://localhost:11434/v1 for Ollama, http://localhost:1234/v1 for LM Studio
    api_key="none"  # No key is required for local instances
)

response = client.chat.completions.create(
    model="qwen3.5-4b",  # Specify the model you downloaded
    messages=[
        {"role": "user", "content": "Please introduce LLMs in three sentences."}
    ]
)

print(response.choices[0].message.content)
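The `model` value has to match an identifier the server actually exposes, and the naming differs per engine. One way to avoid guessing is to ask the server first. The sketch below assumes the engine returns the standard OpenAI-style list from `/v1/models` (an object with a `data` array whose entries carry an `id`); the LM Studio base URL is used as the default because the curl test above confirms that path. `parse_model_ids` and `list_models` are hypothetical helper names.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:1234/v1") -> list[str]:
    """Fetch the ids of the models the local server reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return parse_model_ids(json.load(resp))
```

With a server running, `list_models()` returns the strings that can be passed as `model=` in the example above.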

3. Serve Models using Rapid-MLX

Rapid-MLX downloads models on demand. After the installer finishes, run the following command to download and serve a model:

# Serve the Qwen 3.5 4B model
rapid-mlx serve qwen3.5-4b

The first run will download files from Hugging Face automatically and serve the model at http://localhost:8000/v1.
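Because the first run includes a download, the server may take a while to start answering. A client script can poll until it is ready instead of failing immediately. The following is a sketch under assumptions: `wait_until_ready` and `models_endpoint_up` are hypothetical helpers, and the `/models` path under the Rapid-MLX base URL is assumed from its OpenAI-compatible API rather than confirmed.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 600.0, interval: float = 2.0
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def models_endpoint_up(base_url: str = "http://localhost:8000/v1") -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200 (assumed path)."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused or timed out: the server is not up yet.
        return False
```

A caller would run `wait_until_ready(models_endpoint_up)` before sending the first chat request. The 600-second default is an arbitrary allowance for the download and should be adjusted to the model size and connection speed.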

❓ Frequently Asked Questions

  • Q: What if the script complains about missing Homebrew?
    A: The script automatically checks for Homebrew and prompts you to install it if missing. Enter your macOS administrator password when prompted.
  • Q: Why is my system slow during model inference?
    A: Avoid loading multiple large models into RAM at the same time. In LM Studio, click "Unload Model" to free up RAM before loading a new one. Ollama automatically unloads a model after a period of inactivity (about 5 minutes by default).
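To free memory in Ollama without waiting for the idle timeout, a request with `keep_alive` set to 0 asks the server to unload the model right away. The sketch below relies on the `keep_alive` parameter of Ollama's `/api/generate` endpoint; the helper names `build_unload_request` and `unload_model` are hypothetical, and the behavior should be checked against the Ollama version you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_unload_request(model: str) -> urllib.request.Request:
    """Build a POST to /api/generate that asks Ollama to unload `model`."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str) -> None:
    """Send the unload request and discard the response body."""
    with urllib.request.urlopen(build_unload_request(model), timeout=10) as resp:
        resp.read()
```

For example, `unload_model("qwen3:8b")` would release the general chat model before loading a reasoning model.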