neurohab
private beta · cohort 04 / runtime v0.4.2

Inference for the models you actually deploy.

neurohab runs your transformer in a single-tenant runtime — no cold starts, no shared queues, no opaque autoscaler latency. You bring weights; we keep them warm on dedicated silicon and give you back the metrics you'd expect to have built yourself.

single-tenant runtime

Each model gets its own warm process pinned to dedicated GPUs. No noisy neighbors, no eviction during inference, no shared KV-cache between tenants.

→ p99 ttft typically < 250ms on 7B–32B

bring your weights

Native loaders for HF transformers, GGUF, ONNX, and raw PyTorch state dicts. LoRA and QLoRA adapters hot-swap without restarting the engine.

→ no proprietary format. no lock-in.
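
A quick local sanity check, using stock transformers and peft rather than anything neurohab-specific: if a checkpoint loads like this, it is already in a layout the native loader consumes. Paths and the adapter name below are placeholders.

    # verify a plain HF checkpoint locally; nothing here is neurohab-specific
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("./my-7b-checkpoint")
    tokenizer = AutoTokenizer.from_pretrained("./my-7b-checkpoint")

    # LoRA adapters attach to the same base model (requires peft installed);
    # this mirrors the hot-swap: load once, switch by name, no engine restart
    model.load_adapter("./my-lora-adapter", adapter_name="tuned-v2")
    model.set_adapter("tuned-v2")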

observability built in

Per-request TTFT, per-token ITL, KV-cache occupancy, GPU memory traces, host-network jitter. Exported as Prometheus metrics and OpenTelemetry spans.

→ standard exporters, no separate agent.
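
Because the exporters are standard, existing tooling works unchanged. A minimal sketch against the stock Prometheus HTTP API; the metric name is an assumed placeholder, not a documented one.

    # p99 TTFT over the last 5 minutes; the metric name is a placeholder,
    # the /api/v1/query endpoint is plain Prometheus
    import requests

    PROM = "http://prometheus.internal:9090"  # your Prometheus, not ours
    q = "histogram_quantile(0.99, rate(neurohab_ttft_seconds_bucket[5m]))"
    r = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=5)
    for series in r.json()["data"]["result"]:
        print(series["metric"], series["value"])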
infrastructure

Bare-metal H100, A100-80G, and L40S nodes in jp-nrt-1 and de-fra-1. PCIe Gen5 where the hardware supports it, NVLink where applicable.

availability

Currently invite-only. We accept ~3 new tenants per cohort while we stabilize the multi-region control plane. Cohort 04 closes when capacity is fully allocated.

pricing

Flat hourly per GPU plus a flat per-million-token rate. No reserved-capacity tax, no annual commit, no minimum spend. Detailed pricing shared after the intake call.
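
The bill reduces to two terms. Rates below are made-up placeholders to show the shape of the math, not neurohab prices; real numbers come from the intake call.

    # illustrative only: both rates are placeholders
    gpu_hours   = 24 * 30   # one dedicated GPU, one month
    hourly_rate = 2.50      # placeholder $/GPU-hour
    tokens_m    = 400       # 400M tokens served
    token_rate  = 0.20      # placeholder $/1M tokens

    monthly = gpu_hours * hourly_rate + tokens_m * token_rate
    print(f"${monthly:,.2f}")  # 720 * 2.50 + 400 * 0.20 = $1,880.00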

licensing

The hosted control plane is closed. A self-hosted runtime under a source-available license is on the roadmap for Q4; see the about page for the longer note.

security

Tenant isolation via separate cgroups and IOMMU groups. Weights are encrypted at rest with per-tenant keys. We do not log inputs or outputs unless you explicitly request it for debugging.
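
IOMMU group boundaries are visible from plain sysfs on any Linux host, so the isolation unit above can be inspected directly. A sketch, nothing neurohab-specific:

    # list IOMMU groups and the PCI devices inside each (standard sysfs);
    # devices in separate groups cannot share DMA mappings, which is the
    # boundary the tenant-isolation claim leans on
    from pathlib import Path

    groups = Path("/sys/kernel/iommu_groups")
    for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
        devices = sorted(d.name for d in (group / "devices").iterdir())
        print(f"group {group.name}: {', '.join(devices)}")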