Back to Blog
Engineering Mar 24, 2026 6 min read

Local Models + Hybrid Routing

Airplane mode should not break your developer flow. CodeGrex routes requests between local Ollama models and cloud providers in real time, so you always get the best answer available.

Most AI editors force a trade-off: either cloud speed and quality, or local privacy and control. CodeGrex does both. Its routing engine evaluates request type, latency targets, privacy mode, and model availability before selecting where each completion or chat request should run.

1) Local-first when available

If your selected Ollama model is running, short completions can execute locally for ultra-low latency. This is ideal for inline suggestions, repetitive boilerplate, and disconnected environments.

2) Cloud fallback for heavier reasoning

When a task needs longer context windows or stronger reasoning, CodeGrex can escalate to your cloud provider automatically. You can define exactly when this happens from routing settings.

3) Privacy-aware routing policies

Routing rules respect runtime privacy mode. In Offline mode, cloud routes are disabled. In Sensitive mode, completions can remain local while chat selectively uses approved providers.

4) Resilient failover

If a model is unavailable, CodeGrex retries with exponential backoff and switches to a fallback route. Sessions continue instead of failing hard.

5) Performance budgets built in

You can tune first-token latency targets and prefer local completions to keep your typing loop fast. The router optimizes for responsiveness first, then model cost.

Recommended setup

  • Enable `Prefer Local Completions`.
  • Set runtime mode to `Hybrid`.
  • Keep one fast local model and one high-capability cloud model configured.