
Building Local-First AI: Why Zero Telemetry Changes Everything

March 18, 2026 · 3 min read · by Jacob C. Smith

When you run an LLM locally with zero telemetry, something changes. Not just privacy—the entire relationship between developer and model shifts. Here's what I learned building JCLAW.

I've been building JCLAW—a local-first LLM runtime—for about eight months. The original motivation was privacy. I wanted to run language models without sending queries to external servers, without logs I don't control, without inference happening on hardware I can't inspect.

Privacy was the reason I started. It's not the most interesting thing I found.

What Changes When You Own the Stack

When you run inference locally with zero telemetry, several things shift simultaneously:

Latency becomes mechanical, not network-dependent. Your query round-trip is now bounded by RAM bandwidth and compute throughput—deterministic, tunable, debuggable. Not bounded by traffic shaping, rate limits, or geographic routing. When something is slow, you know exactly why.
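
As a back-of-envelope illustration (my numbers, not a benchmark): a 7B-parameter model quantized to 4 bits occupies roughly 4 GB, and autoregressive decoding streams essentially all of the weights once per generated token. On a machine with ~100 GB/s of memory bandwidth, that puts a hard ceiling near 100 / 4 ≈ 25 tokens per second, a number you can check directly against observed throughput.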

The model becomes a tool, not a service. This sounds like marketing copy. It's not. When you can kill the process, inspect the weights, swap the context window, or pipe the output directly to another program without going through an API, the relationship changes. You're not sending a request to a black box. You're running a computation on your machine.
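
That composability is easy to demonstrate. Here's a minimal sketch of local inference as a plain Unix filter, assuming a hypothetical completion endpoint on localhost:8080 (the endpoint and response shape are illustrative, not JCLAW's actual API):

  // Local inference as a Unix filter: prompt in on stdin, completion
  // out on stdout. Endpoint and response shape are assumptions.
  import { stdin, stdout } from "node:process";

  const chunks: Buffer[] = [];
  for await (const chunk of stdin) chunks.push(chunk);
  const prompt = Buffer.concat(chunks).toString("utf8");

  const res = await fetch("http://localhost:8080/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, max_tokens: 256 }),
  });
  const { text } = (await res.json()) as { text: string };
  stdout.write(text);

Compiled and run as an ES module, it composes like any other filter: cat notes.md | node summarize.js | less. No API key, no SDK, no network beyond loopback.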

Failure modes become comprehensible. API-based LLM failures are opaque—the error comes back as a status code and a vague message. Local inference failures are traceable. Segfaults have call stacks. Memory pressure is visible in system monitors. You can attach a debugger.

The Architecture

JCLAW has three layers:

Gate Server (Express + WebSocket)
  ├── JSON-RPC protocol router (~30 methods)
  ├── React + Vite dashboard (LCARS aesthetic)
  └── Shared tool execution engine

Storage Layer (SQLite with WAL + FTS5)
  ├── Sessions and message history
  ├── Prompt library with versioning
  ├── Eval results and benchmarks
  └── Sandbox configuration

Runtime Modules
  ├── Chat (streaming, multi-turn, branching)
  ├── Eval (benchmark suites, regression tracking)
  ├── Embeddings (local vector search)
  └── Diffing (word/line-level response comparison)

The interesting design constraint was making the Gate Server serve both a human-facing dashboard and Claude Code's MCP protocol simultaneously. Same tool execution engine, two interfaces. This forced a clean separation between tool logic and transport layer that made the whole system more maintainable.
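
A minimal sketch of what that separation looks like (names are illustrative, not JCLAW's actual API): each tool is a transport-agnostic async function in a shared registry, and each transport is a thin translation layer on top.

  // Shared tool execution engine: transport-agnostic handlers.
  type ToolHandler = (params: Record<string, unknown>) => Promise<unknown>;

  const tools = new Map<string, ToolHandler>();
  tools.set("session.fork", async () => ({ forked: true })); // stub

  // JSON-RPC transport for the dashboard's WebSocket connection.
  interface RpcRequest { id: number; method: string; params: Record<string, unknown> }

  async function handleJsonRpc(msg: RpcRequest) {
    const handler = tools.get(msg.method);
    if (!handler) {
      return { id: msg.id, error: { code: -32601, message: "Method not found" } };
    }
    return { id: msg.id, result: await handler(msg.params) };
  }

  // An MCP transport wraps the same map: tools.keys() becomes the tool
  // catalog, and tool calls dispatch through the identical handlers.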

Session Forking

The feature I use most isn't streaming chat. It's session forking.

At any message in a conversation, you can fork—create a branch that diverges from that point. The original session is preserved. The fork inherits all prior context. You can explore alternative reasoning paths without losing your work.
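
A minimal sketch of the fork operation, assuming a better-sqlite3 storage layer and a simplified two-table schema (not JCLAW's actual tables):

  // Fork = new session row recording its parent, plus a copy of all
  // messages up to the fork point. The original session is untouched.
  import Database from "better-sqlite3";

  const db = new Database("jclaw.db");
  db.exec(`
    CREATE TABLE IF NOT EXISTS sessions (
      id INTEGER PRIMARY KEY,
      parent_id INTEGER,        -- NULL for root sessions
      forked_at INTEGER         -- message id where the fork diverged
    );
    CREATE TABLE IF NOT EXISTS messages (
      id INTEGER PRIMARY KEY,
      session_id INTEGER NOT NULL,
      role TEXT NOT NULL,
      content TEXT NOT NULL
    );
  `);

  function forkSession(sessionId: number, messageId: number): number {
    const fork = db
      .prepare("INSERT INTO sessions (parent_id, forked_at) VALUES (?, ?)")
      .run(sessionId, messageId);
    db.prepare(
      `INSERT INTO messages (session_id, role, content)
       SELECT ?, role, content FROM messages
       WHERE session_id = ? AND id <= ?`
    ).run(fork.lastInsertRowid, sessionId, messageId);
    return Number(fork.lastInsertRowid);
  }

Copying messages rather than pointing at the parent trades storage for self-contained sessions; either design supports the same forking semantics.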

This turns out to be how I actually want to use language models. Not as a linear conversation but as a branching tree of explorations. The chat metaphor was always a poor fit; it was just the default because APIs are stateless. Local-first removes that constraint.

Prompt Injection Sandboxing

Running models locally doesn't automatically make them safe. The prompt injection problem—getting a model to execute instructions embedded in its context window—is substrate-independent.

JCLAW has a sandboxing layer that:

  1. Classifies incoming context for injection patterns before forwarding to the model (sketched after this list)
  2. Maintains a separate "trusted" context window that external content can't overwrite
  3. Logs all tool calls with their originating context for post-hoc audit
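
Here's what step 1 might look like in its crudest form (the patterns are illustrative heuristics, not JCLAW's actual ruleset, and regex matching alone is far from sufficient):

  // Naive injection screening: flag context that pattern-matches known
  // instruction-override phrasing. Illustrative only.
  const INJECTION_PATTERNS: RegExp[] = [
    /ignore (all )?(previous|prior) instructions/i,
    /you are now\b/i,
    /\bsystem prompt\b/i,
  ];

  function classifyContext(text: string): { suspicious: boolean; matches: string[] } {
    const matches = INJECTION_PATTERNS
      .filter((p) => p.test(text))
      .map((p) => p.source);
    return { suspicious: matches.length > 0, matches };
  }

  // Suspicious content is quarantined and logged, never merged into
  // the trusted context window.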

It's not bulletproof. Nothing is. But it makes the attack surface explicit and auditable.

What I Learned About Language Models

Running your own inference changes how you think about what these things actually are.

The biggest shift: they're not oracles. They're compression artifacts. The weights encode statistical patterns extracted from training data; inference is the process of decompressing those patterns against your prompt. Understanding this mechanistically changes how you write prompts, how you interpret outputs, and how you design systems that incorporate model output.

The second shift: context is everything. Local control over the context window—what goes in, what gets truncated, what persists across turns—is the actual lever for getting useful behavior out of these systems. API wrappers give you a narrow interface to context management. Local control gives you the full surface.
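
To make that concrete, here's a sketch of explicit context assembly (the names and the token estimate are illustrative): the trusted system block is pinned, and history is truncated oldest-first to fit a budget.

  // Explicit context assembly: pin the system block, keep the newest
  // turns that fit, drop the rest. Token cost is crudely approximated.
  interface Turn { role: "user" | "assistant"; content: string }

  function buildContext(system: string, history: Turn[], budget: number): string {
    const cost = (s: string) => s.split(/\s+/).length; // crude token estimate
    let remaining = budget - cost(system);
    const kept: Turn[] = [];
    for (const turn of [...history].reverse()) {      // newest first
      if (remaining - cost(turn.content) < 0) break;
      remaining -= cost(turn.content);
      kept.unshift(turn);                             // restore order
    }
    return [system, ...kept.map((t) => `${t.role}: ${t.content}`)].join("\n\n");
  }

Every truncation decision is yours, in code you can read, instead of an opaque server-side policy.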

The Tradeoff

Local inference has real costs. The setup overhead is significant. Open-weight models still trail the frontier API models on quality, parameter for parameter. Some tasks need scale that a local GPU can't provide.

The tradeoff is worth it for a specific use case: development, research, and any workflow where you need to understand and control the full stack. For consumer applications where ease of use matters more than control, API-based solutions are still the right answer.

For me, JCLAW is the right answer. I know exactly what's running, exactly where my data goes, and exactly what I'm trading off. That clarity is worth the setup cost.
