neurohab runs your transformer in a single-tenant runtime: no cold starts, no shared queues, no opaque autoscaler latency. You bring weights; we keep them warm on dedicated silicon and give you back the metrics you'd expect to have built yourself.

single-tenant runtime
Each model gets its own warm process pinned to dedicated GPUs. No noisy neighbors, no eviction during inference, no shared KV-cache between tenants.
→ p99 ttft typically < 250ms on 7B–32B
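The latency claim is easy to check from your side of the wire. Below is a minimal client-side sketch that times first-byte arrival over a streaming request; the endpoint URL, payload shape, and auth header are illustrative assumptions, not a documented neurohab API.

```python
# Sketch: measuring client-side TTFT against a streaming endpoint.
# The URL, payload fields, and header below are illustrative
# assumptions, not a documented neurohab API.
import time
import requests

def measure_ttft(url: str, prompt: str, api_key: str) -> float:
    """Return seconds from request send to first streamed byte."""
    start = time.perf_counter()
    with requests.post(
        url,
        json={"prompt": prompt, "stream": True},
        headers={"Authorization": f"Bearer {api_key}"},
        stream=True,
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:  # first non-empty chunk ~ first token on the wire
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any tokens arrived")

# Nearest-rank p99 over a small sample; a real benchmark needs warmup
# runs and a larger request count.
samples = sorted(
    measure_ttft("https://api.example.com/v1/generate", "hello", "KEY")
    for _ in range(100)
)
p99 = samples[int(0.99 * len(samples)) - 1]
print(f"p99 TTFT: {p99 * 1000:.0f} ms")
```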
bring your own weights
Native loaders for HF transformers, GGUF, ONNX, and raw PyTorch state dicts. LoRA and QLoRA adapters hot-swap without restarting the engine.
→ no proprietary format. no lock-in.
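To make the hot-swap flow concrete, here is a hypothetical two-step sketch: register an adapter against a live engine, then route a request through it. The endpoint paths, field names, and adapter source are assumptions for illustration, not neurohab's documented API.

```python
# Sketch: hot-swapping a LoRA adapter on a running engine over HTTP.
# Endpoint paths and JSON fields are hypothetical; they illustrate the
# register -> route flow rather than a real API surface.
import requests

BASE = "https://api.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer KEY"}

# 1. Register an adapter from object storage; the engine stays up.
requests.post(
    f"{BASE}/adapters",
    json={"name": "support-tone", "source": "s3://bucket/lora/support-tone"},
    headers=HEADERS,
    timeout=30,
).raise_for_status()

# 2. Route a request through the adapter: no restart, no cold start.
resp = requests.post(
    f"{BASE}/generate",
    json={"prompt": "Draft a refund reply.", "adapter": "support-tone"},
    headers=HEADERS,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```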
observability
Per-token TTFT and ITL, KV-cache occupancy, GPU memory traces, host-network jitter. Exported as Prometheus metrics and OpenTelemetry spans.
→ standard exporters, no separate agent.
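Because the exporters are standard, any Prometheus-aware tool can consume the data directly. Below is a minimal sketch using the stock prometheus_client parser; the /metrics path follows Prometheus convention, but the specific metric names shown are assumptions, not documented neurohab names.

```python
# Sketch: scraping the runtime's Prometheus endpoint and filtering for
# latency and KV-cache series. The /metrics path is the Prometheus
# convention; the metric names in WATCHED are assumptions.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("https://runtime.example.com/metrics", timeout=10).text

WATCHED = {"ttft_seconds", "itl_seconds", "kv_cache_occupancy_ratio"}

for family in text_string_to_metric_families(text):
    if family.name in WATCHED:
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```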