neurohab
private beta · cohort 04 / runtime v0.4.2

Inference for the models you actually deploy.

neurohab runs your transformer in a single-tenant runtime — no cold starts, no shared queues, no opaque autoscaler latency. You bring weights; we keep them warm on dedicated silicon and give you back the metrics you'd expect to have built yourself.

single-tenant runtime

Each model gets its own warm process pinned to dedicated GPUs. No noisy neighbors, no eviction during inference, no shared KV-cache between tenants.

→ p99 ttft typically < 250ms on 7B–32B

bring your weights

Native loaders for HF transformers, GGUF, ONNX, and raw PyTorch state dicts. LoRA and QLoRA adapters hot-swap without restarting the engine.

→ no proprietary format. no lock-in.
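
A quick local sanity check, using stock transformers and peft rather than anything neurohab-specific: if a checkpoint loads like this, it is already in a layout the native loader consumes. Paths and the adapter name below are placeholders.

    # verify a plain HF checkpoint locally; nothing here is neurohab-specific
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("./my-7b-checkpoint")
    tokenizer = AutoTokenizer.from_pretrained("./my-7b-checkpoint")

    # LoRA adapters attach to the same base model (requires peft installed);
    # this mirrors the hot-swap: load once, switch by name, no engine restart
    model.load_adapter("./my-lora-adapter", adapter_name="tuned-v2")
    model.set_adapter("tuned-v2")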

observability built in

Per-request TTFT, per-token ITL, KV-cache occupancy, GPU memory traces, host-network jitter. Exported as Prometheus metrics and OpenTelemetry spans.

→ standard exporters, no separate agent.
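
Because the exporters are standard, existing tooling works unchanged. A minimal sketch against the stock Prometheus HTTP API; the metric name is an assumed placeholder, not a documented one.

    # p99 TTFT over the last 5 minutes; the metric name is a placeholder,
    # the /api/v1/query endpoint is plain Prometheus
    import requests

    PROM = "http://prometheus.internal:9090"  # your Prometheus, not ours
    q = "histogram_quantile(0.99, rate(neurohab_ttft_seconds_bucket[5m]))"
    r = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=5)
    for series in r.json()["data"]["result"]:
        print(series["metric"], series["value"])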
infrastructure

Bare-metal H100, A100-80G, and L40S nodes in jp-nrt-1 and de-fra-1. PCIe Gen5 where the hardware supports it, NVLink where applicable.

availability

Currently invite-only. We accept ~3 new tenants per cohort while we stabilize the multi-region control plane. Cohort 04 closes when capacity is fully allocated.

pricing

Flat hourly per GPU plus a flat per-million-token rate. No reserved-capacity tax, no annual commit, no minimum spend. Detailed pricing shared after the intake call.
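
The bill reduces to two terms. Rates below are made-up placeholders to show the shape of the math, not neurohab prices; real numbers come from the intake call.

    # illustrative only: both rates are placeholders
    gpu_hours   = 24 * 30   # one dedicated GPU, one month
    hourly_rate = 2.50      # placeholder $/GPU-hour
    tokens_m    = 400       # 400M tokens served
    token_rate  = 0.20      # placeholder $/1M tokens

    monthly = gpu_hours * hourly_rate + tokens_m * token_rate
    print(f"${monthly:,.2f}")  # 720 * 2.50 + 400 * 0.20 = $1,880.00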

licensing

The hosted control plane is closed. A self-hosted runtime under a source-available license is on the roadmap for Q4; see the about page for the longer note.

security

Tenant isolation via separate cgroups and IOMMU groups. Weights are encrypted at rest with per-tenant keys. We do not log inputs or outputs unless you explicitly request it for debugging.
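
IOMMU group boundaries are visible from plain sysfs on any Linux host, so the isolation unit above can be inspected directly. A sketch, nothing neurohab-specific:

    # list IOMMU groups and the PCI devices inside each (standard sysfs);
    # devices in separate groups cannot share DMA mappings, which is the
    # boundary the tenant-isolation claim leans on
    from pathlib import Path

    groups = Path("/sys/kernel/iommu_groups")
    for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
        devices = sorted(d.name for d in (group / "devices").iterdir())
        print(f"group {group.name}: {', '.join(devices)}")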