The First Wave of Local Inference Tooling — gpum v1.1.0 and TokenSpeed

Two releases, shipped on the same day, target different layers of the local inference stack and mark the first generation of observability and management tooling outside the vendor incumbents.

Overview

Operational tooling for inference has long split into two worlds. Cloud LLM stacks have a mature observability layer — Langsmith, OpenLLMetry, Helicone, Langfuse — that handles traces, costs, and request shaping above the model API. Local and on-prem inference — the world of Ollama, llama.cpp, LM Studio, vLLM, and TensorRT-LLM — still leans on nvidia-smi and shell scripts. On 2026-05-09 two tools landed on the same day, each targeting a different layer of that stack: drewdrew0414/AIGPUManager’s gpum v1.1.0 for GPU allocation, quotas, and safety, and lightseekorg/tokenspeed for the token throughput of the inference engine itself. Both come from individuals or small new orgs rather than vendors. That is the same shape the cloud LLM observability category had in 2023: the first generation of tooling arrives, and it doesn’t arrive from incumbents.

1. gpum v1.1.0 — A resource manager for shared GPU boxes

gpum is a Java 21 CLI. It is not aimed at the single-user nvidia-smi workflow. It targets the situation where several people share the same GPU server and need a coordination layer. Earlier versions covered inventory (“which GPU is where”) and basic allocation. v1.1.0 is the release where the operations layer fills in.

1.1 Compute policy and an approval workflow

The most distinctive addition in v1.1.0 is an approval workflow for high-risk hardware operations.

gpum gpu reset --id node1:0 --soft --apply
gpum rbac approval list --status pending
gpum rbac approval approve --id <approval-id> --reason "maintenance window"
gpum gpu reset --id node1:0 --soft --apply --approval-id <approval-id>

Power-limit changes, ECC toggling, GPU reset — none of these execute immediately. They produce an approval record. On top of that, real hardware writes only happen when GPUM_ENABLE_HARDWARE_WRITE=1 is set in the calling shell, and dry-run is the default everywhere else. The positioning is clear: this is for environments where dragging in Slurm or the Kubernetes Device Plugin is too heavy, but raw SSH is too thin — a one-CLI middle ground for one or two shared GPU boxes.
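
The gate is worth spelling out. Below is a minimal Python sketch of the same pattern (dry-run by default, hardware writes behind both an approval record and an explicit environment flag); the function and field names are illustrative, not gpum's code:

import os

def reset_gpu(gpu_id: str, approval_id: str | None = None) -> str:
    """Sketch of a dry-run-by-default hardware-write gate (not gpum's code)."""
    # A destructive operation without an approval only creates an approval record.
    if approval_id is None:
        return f"created approval request for reset of {gpu_id}; nothing executed"
    # Even with an approval, a real hardware write needs an explicit opt-in
    # in the calling shell, mirroring GPUM_ENABLE_HARDWARE_WRITE=1.
    if os.environ.get("GPUM_ENABLE_HARDWARE_WRITE") != "1":
        return f"dry run: would reset {gpu_id} under approval {approval_id}"
    return f"resetting {gpu_id} (approval {approval_id})"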

1.2 Multi-vendor inventory

gpum reads NVIDIA NVML, AMD ROCm-SMI, and Intel Level Zero in the same scan. NVML is accessed through JNA; the Level Zero loader is discovered separately. When a vendor library is not installed, the corresponding row is marked unavailable rather than silently dropped. This is not a mobile or embedded tool: the design assumes heterogeneous GPUs in a workstation or small cluster.
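
The probing pattern is the familiar one. A rough Python equivalent of the NVIDIA half using pynvml (gpum itself does this in Java over JNA, so this is only a sketch of the behavior):

import pynvml

def probe_nvidia():
    # A missing driver or library becomes an "unavailable" row
    # instead of aborting the whole scan.
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as err:
        return [{"vendor": "nvidia", "status": f"unavailable ({err})"}]
    rows = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        rows.append({"vendor": "nvidia", "index": i,
                     "name": pynvml.nvmlDeviceGetName(handle),
                     "status": "available"})
    pynvml.nvmlShutdown()
    return rows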

1.3 Topology-aware scheduling

gpum alloc estimate --model llama3-70b --params-b 70 --precision fp16 --context 8192 --batch 4
gpum schedule reserve create --gpus 4 --start 2026-05-10T22:00:00 --end 2026-05-11T06:00:00
gpum schedule gang --nodes 2 --gpus-per-node 8

It recognizes NVLink, AMD XGMI, and Intel Xe Link and adjusts packed/spread placement hints. Distributed training that must start with all nodes (gang scheduling), short idle windows filled with backfill, and fair-share weighting by historical GPU-hours — these are textbook cluster-manager features packed into a single CLI.
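
For reference, the back-of-envelope arithmetic behind a request like the gpum alloc estimate call above, using the standard weights-plus-KV-cache formula and Llama 3 70B's published shape (80 layers, 8 KV heads, head dim 128); whether gpum's estimator works exactly this way is an assumption:

def estimate_vram_gib(params_b=70, layers=80, kv_heads=8, head_dim=128,
                      context=8192, batch=4, bytes_per_param=2):
    # Weights: parameter count x bytes per parameter (fp16 = 2 bytes).
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x context x batch x bytes.
    kv_cache = 2 * layers * kv_heads * head_dim * context * batch * bytes_per_param
    return weights / 2**30, kv_cache / 2**30

w, kv = estimate_vram_gib()
print(f"weights ~{w:.0f} GiB, KV cache ~{kv:.0f} GiB")  # weights ~130 GiB, KV cache ~10 GiB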

1.4 Safety guardrails

v1.1.0 keeps emphasizing prevention at the operations layer.

Guard             | Behavior
Max GPU / request | Permanently reject requests that exceed policy
Max lease hours   | Expired leases become reclamation candidates
Thermal threshold | Preflight detection of thermal-critical GPUs
Power cap         | Preflight detection of saturated GPUs
Stale heartbeat   | Cleanup of dead workers
Min free VRAM     | Reject jobs that would breach the memory limit

Incidents wrap this with GPU quarantine and node drain actions. It reads like a single-host distillation of the SRE playbook.
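
A compressed sketch of what a preflight pass over those guards looks like; the structure is illustrative, not gpum's implementation:

def preflight(gpu, job, policy):
    """Illustrative preflight checks in the spirit of the guard table above."""
    reasons = []
    if job["gpus"] > policy["max_gpus_per_request"]:
        reasons.append("exceeds max GPUs per request")
    if gpu["temp_c"] >= policy["thermal_critical_c"]:
        reasons.append("GPU is thermally critical")
    if gpu["power_w"] >= policy["power_cap_w"]:
        reasons.append("GPU power draw is at its cap")
    if gpu["free_vram_gib"] < policy["min_free_vram_gib"]:
        reasons.append("insufficient free VRAM")
    return len(reasons) == 0, reasons

ok, reasons = preflight(
    gpu={"temp_c": 91, "power_w": 300, "free_vram_gib": 4.0},
    job={"gpus": 1},
    policy={"max_gpus_per_request": 2, "thermal_critical_c": 90,
            "power_cap_w": 350, "min_free_vram_gib": 8.0},
)  # ok is False here: thermally critical and not enough free VRAM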

1.5 AI tooling integration

The most immediately useful command group is gpum integration ai. It turns an allocation lease into launch commands for torchrun, accelerate, DeepSpeed, or vLLM.

gpum integration ai launch --allocation-id alloc-001 --tool torchrun --arg train.py
gpum integration ai launch --allocation-id alloc-001 --tool vllm --from-file vllm-serve.yaml

It auto-injects CUDA_VISIBLE_DEVICES, MASTER_ADDR, GPUM_RDZV_ENDPOINT, and vendor-specific equivalents (ROCR_VISIBLE_DEVICES for AMD, ZE_AFFINITY_MASK for Intel). The whole flow — allocation lease → injected env → launch command — is one motion.
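
The injection itself is mundane, which is the point. A sketch of what the launcher effectively does for a torchrun lease; the environment variable names come from the release notes, while the lease fields and helper are assumed:

import os
import subprocess

def launch_torchrun(lease, script="train.py"):
    # Compose the environment a gpum-style launcher injects for torchrun.
    env = dict(os.environ,
               CUDA_VISIBLE_DEVICES=",".join(str(i) for i in lease["gpu_indices"]),
               MASTER_ADDR=lease["master_addr"],
               GPUM_RDZV_ENDPOINT=lease["rdzv_endpoint"])
    cmd = ["torchrun", f"--nproc_per_node={len(lease['gpu_indices'])}", script]
    return subprocess.run(cmd, env=env, check=True)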

2. TokenSpeed — Aiming straight at engine throughput

TokenSpeed, released the same day, sits on a different layer. Where gpum manages and observes GPU resources, TokenSpeed is the inference engine. The README states the ambition flatly: TensorRT-LLM-level performance with vLLM-level usability. The announcement blog shows Pareto curves on NVIDIA B200 running Kimi K2.5, claiming to push past the TensorRT-LLM front.

2.1 Four design pieces

From the README’s component breakdown:

Layer      | Role
Modeling   | local-SPMD with a static compiler that auto-generates collectives from module-boundary annotations
Scheduler  | C++ control plane / Python execution plane, request lifecycle modeled as a finite-state machine
Kernels    | Pluggable, layered, with one of the fastest MLA (Multi-head Latent Attention) implementations on Blackwell
Entrypoint | SMG-integrated AsyncLLM, designed to keep CPU-side request handling cheap

MLA was popularized by DeepSeek-V2 and compresses the KV cache into a latent representation, which collapses memory bandwidth pressure during decode. TokenSpeed claims to re-implement it for the Blackwell architecture. The detail that KV cache ownership is enforced at compile time through the type system is the interesting one — it moves a class of bugs that vLLM solves at runtime via PagedAttention into the type checker.
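
As a rough picture of the latent-KV idea, here is a simplified PyTorch sketch (not TokenSpeed's kernels; real MLA also absorbs the up-projections into the attention matmuls and routes RoPE through a separate decoupled path, and the dimensions below are arbitrary):

import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    # Sketch: cache one small latent vector per token instead of full K and V.
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand V

    def forward(self, h_new, latent_cache):
        # h_new: [batch, 1, d_model] hidden state of the newly decoded token.
        c = self.down(h_new)                               # [batch, 1, d_latent]
        latent_cache = torch.cat([latent_cache, c], dim=1)
        # K and V are re-expanded on the fly; only d_latent floats per token
        # ever hit HBM, which is where the decode bandwidth win comes from.
        k = self.up_k(latent_cache)
        v = self.up_v(latent_cache)
        return k, v, latent_cache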

2.2 Targeting agentic workloads

The README keeps returning to one phrase: agentic workloads. Unlike a chat workload (long single responses), agent workloads look like thousands of short responses with tool calls between them. In that pattern, CPU-side request handling cost and KV cache reuse/reallocation dominate throughput. The emphasis on the FSM, the type system, and AsyncLLM is the direct response to that pattern.
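
A toy version of that lifecycle makes the FSM framing concrete; the states and transitions below are a guess at the shape, not TokenSpeed's actual scheduler:

from enum import Enum, auto

class ReqState(Enum):
    WAITING = auto()     # queued, no KV blocks held
    PREFILL = auto()     # prompt being processed
    DECODE = auto()      # generating tokens
    SUSPENDED = auto()   # handed back for a tool call; KV kept or paged out
    FINISHED = auto()

# Agentic traffic cycles through DECODE -> SUSPENDED -> PREFILL many times per
# conversation, so the cost of suspend/resume dominates throughput.
TRANSITIONS = {
    ReqState.WAITING:   {ReqState.PREFILL},
    ReqState.PREFILL:   {ReqState.DECODE},
    ReqState.DECODE:    {ReqState.SUSPENDED, ReqState.FINISHED},
    ReqState.SUSPENDED: {ReqState.PREFILL, ReqState.FINISHED},
    ReqState.FINISHED:  set(),
}

def advance(state: ReqState, new_state: ReqState) -> ReqState:
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {new_state.name}")
    return new_state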

2.3 Status and limits

The repo is explicit that this is a preview.

  • Reproducible today: B200 with Kimi K2.5 and TokenSpeed MLA
  • In progress: Qwen 3.6, DeepSeek V4, MiniMax M2.7 model coverage
  • In progress: prefill-decode separation (PD), EPLB, KV store, Mamba cache, VLM, metrics
  • In progress: Hopper and MI350 optimization

So today it is a runtime design demonstration, not a production deployment target. That said, picking up 900+ stars in the first days is a signal that the “faster than vLLM, easier than TensorRT-LLM” slot in the inference engine category has been waiting to be filled.

3. Where the two tools meet

Mapping them onto the inference stack:

gpum abstracts hardware and drivers and hands them safely to the inference engine. TokenSpeed is the inference engine itself. They do not overlap. In practice you would imagine gpum integration ai launch --tool vllm producing the launch command, and whatever engine sits inside the launcher (vLLM, TokenSpeed, TensorRT-LLM) is a downstream choice.

4. Comparison with cloud LLM observability

In the cloud LLM stack, Langsmith, OpenLLMetry, Helicone, and Langfuse already cover the tracing and cost axes. Mapping both stacks side by side:

Axis                | Cloud LLM             | Local / on-prem inference
Tracing / logs      | Langsmith, Langfuse   | (gap: gpum's audit log fills a slice)
Tokens / cost       | Helicone, OpenLLMetry | (gap: gpum cost reports, TokenSpeed throughput)
Model gateway       | OpenRouter, Portkey   | LiteLLM (hybrid)
Resource allocation | (managed)             | gpum
Engine throughput   | (managed)             | TokenSpeed, vLLM, SGLang

Cloud-side observability is already past its first generation (2023–2024) and into a consolidation phase. Local inference is at the start of generation one — built by individuals or new orgs, exactly the shape of the early Langsmith and Helicone era. gpum being a one-maintainer project and TokenSpeed coming from a brand-new lightseekorg underline that timing.

5. Practical scenarios

The two tools land in clearly different setups.

  • A team sharing 1–2 GPU servers: gpum fits cleanly. Start with gpum scan --refresh for inventory, then gpum submit to run batch jobs in containers, then gpum gpu health --score --quarantine-threshold to take dying GPUs out of rotation before failures cascade. Too small for Slurm or Run:ai, too big for raw SSH.
  • Evaluating the inference engine itself: TokenSpeed is still preview, but the B200 + Kimi K2.5 reproduction is meaningful as a forward-looking comparison. If you expect to be making engine choices for production over the next 12 months, having a hands-on reference point for the “post-vLLM” design space is worth setting up early.

Insights

Two tools shipping the same day at different layers of the same stack is a market signal: local and on-prem inference has hit the point where it needs first-generation operational tooling. Cloud LLM stacks passed this milestone in 2023 when Langsmith externalized LangChain’s operational burden. Local inference is hitting it in 2026 with gpum filling the resource-management gap from the bottom and TokenSpeed reframing engine design from the top. Both carry the typical first-generation limits — gpum is a single-maintainer Java project, TokenSpeed is preview-only and B200-bound — but first-generation tooling in a category is what defines the shape of the category. The cloud observability arc says most of the first wave survives and a small subset becomes standard. The same pattern is starting at the local inference layer this week.

References

Release and repo

Local inference runtimes

Techniques and standards

Cloud LLM observability — for comparison
