Private chat assistants
internal company chat that never sends prompts to OpenAI

// screenshot of ollama.com ↗
Ollama is the de-facto standard for running local large language models on your own hardware. A single binary + REST API that pulls models from a public registry (Llama 3.3, Mistral, Qwen, DeepSeek, Phi-4 and dozens more), handles quantization, GPU offload, and exposes a simple /api/generate and /api/chat interface that's API-compatible with the OpenAI SDK.
Ollama is the de-facto standard for running local large language models on your own hardware. A single binary + REST API that pulls models from a public registry (Llama 3.3, Mistral, Qwen, DeepSeek, Phi-4 and dozens more), handles quantization, GPU offload, and exposes a simple /api/generate and /api/chat interface that's API-compatible with the OpenAI SDK.
It's the boring, reliable engine that every other self-hosted AI tool ends up integrating against — Open WebUI, AnythingLLM, LibreChat, Flowise, LangChain, LiteLLM, n8n.
Concrete scenarios where teams pick Ollama over the SaaS alternative.
internal company chat that never sends prompts to OpenAI
EU customers in healthcare, legal, finance who can't push prompts to US clouds
predictable per-month VPS bill vs metered API spend
on-prem or restricted-network environments
local dev loop for engineers building on top of LLMs
If your team profile matches one of these, Ollama is a strong fit out of the box.
fast local dev loop, no API rate limits, no $ per token while building
legal, healthcare, finance, gov teams forbidden from US-hosted LLM APIs
resellers offering "private AI VPS" to their customers as a higher-margin SKU
evaluating open models without paying OpenAI / Anthropic per experiment
predictable per-month VPS cost beats unpredictable per-token bills as traffic grows
When evaluating self-hosted options for this category, here are the dimensions on which Ollama consistently lands above the alternatives.
ollama pull llama3.3)The stack you'll plug Ollama into — services, protocols, and adjacent apps in the BluixApps catalog.
/api/embeddings works with Chroma, Qdrant, pgvector RAG stacks/var/lib/ollama for persistence across upgradesollama/ollama:0.5.4 image, tracked weekly against upstream127.0.0.1:11434; SSL + auth via Nginx Proxy Manager when paired with Open WebUI/var/lib/ollama before each update (models can be 4-20 GB — opt-in)Operational guidance from running this in production — what to do before you scale, what to lock down, what surprises people.
OLLAMA_KEEP_ALIVE1h for warm latency, -1 to keep foreverollama ps — if model says "100% CPU" you're not using your GPU; check NVIDIA drivers + CUDA toolkitollama list then ollama rm unused; models silently accumulate in /usr/share/ollama/.ollama/models11434:11434 · ollama/ollama:latest