<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gpu Manager on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/gpu-manager/</link><description>Recent content in Gpu Manager on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 09 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/gpu-manager/index.xml" rel="self" type="application/rss+xml"/><item><title>The First Wave of Local Inference Tooling — gpum v1.1.0 and TokenSpeed</title><link>https://ice-ice-bear.github.io/posts/2026-05-09-local-inference-tooling/</link><pubDate>Sat, 09 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-09-local-inference-tooling/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post The First Wave of Local Inference Tooling — gpum v1.1.0 and TokenSpeed" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;Operational tooling for inference has long split into two worlds. Cloud LLM stacks have a mature observability layer — &lt;a class="link" href="https://www.langchain.com/langsmith" target="_blank" rel="noopener"
 &gt;LangSmith&lt;/a&gt;, &lt;a class="link" href="https://github.com/traceloop/openllmetry" target="_blank" rel="noopener"
 &gt;OpenLLMetry&lt;/a&gt;, &lt;a class="link" href="https://www.helicone.ai/" target="_blank" rel="noopener"
 &gt;Helicone&lt;/a&gt;, &lt;a class="link" href="https://langfuse.com" target="_blank" rel="noopener"
 &gt;Langfuse&lt;/a&gt; — that handles traces, costs, and request shaping above the model API. Local and on-prem inference — the world of &lt;a class="link" href="https://ollama.com" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt;, &lt;a class="link" href="https://github.com/ggml-org/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt;, &lt;a class="link" href="https://lmstudio.ai" target="_blank" rel="noopener"
 &gt;LM Studio&lt;/a&gt;, &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;, and &lt;a class="link" href="https://github.com/NVIDIA/TensorRT-LLM" target="_blank" rel="noopener"
 &gt;TensorRT-LLM&lt;/a&gt; — still leans on &lt;code&gt;nvidia-smi&lt;/code&gt; and shell scripts. On 2026-05-09 two tools landed on the same day, each targeting a different layer of that stack: &lt;a class="link" href="https://github.com/drewdrew0414/AIGPUManager/releases/tag/v1.1.0" target="_blank" rel="noopener"
 &gt;drewdrew0414/AIGPUManager&amp;rsquo;s &lt;code&gt;gpum&lt;/code&gt; v1.1.0&lt;/a&gt; for GPU &lt;strong&gt;allocation, quotas, and safety&lt;/strong&gt;, and &lt;a class="link" href="https://github.com/lightseekorg/tokenspeed" target="_blank" rel="noopener"
 &gt;lightseekorg/tokenspeed&lt;/a&gt; for the &lt;strong&gt;token throughput&lt;/strong&gt; of the inference engine itself. Both come from individuals or small new orgs rather than vendors. That is the same shape the cloud LLM observability category had in 2023: the first generation of tooling arrives, and it doesn&amp;rsquo;t arrive from incumbents.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 HW["Hardware &amp;lt;br/&amp;gt; (NVIDIA / AMD / Intel / B200)"] --&gt; DRV["Drivers &amp;lt;br/&amp;gt; (CUDA / ROCm / Level Zero)"]
 DRV --&gt; RT["Inference runtime &amp;lt;br/&amp;gt; (llama.cpp / vLLM / TensorRT-LLM / TokenSpeed)"]
 RT --&gt; APP["App layer &amp;lt;br/&amp;gt; (Ollama / LM Studio / agents)"]
 DRV --&gt; MGR["Resource management &amp;lt;br/&amp;gt; (gpum)"]
 MGR -.quotas, scheduling, safety.-&gt; RT
 RT -.token throughput.-&gt; BENCH["Benchmarking and observability &amp;lt;br/&amp;gt; (TokenSpeed measures its own runtime)"]&lt;/pre&gt;&lt;h2 id="1-gpum-v110--a-resource-manager-for-shared-gpu-boxes"&gt;1. gpum v1.1.0 — A resource manager for shared GPU boxes
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/drewdrew0414/AIGPUManager" target="_blank" rel="noopener"
 &gt;gpum&lt;/a&gt; is a Java 21 CLI. It is not aimed at the single-user &lt;code&gt;nvidia-smi&lt;/code&gt; workflow. It targets the situation where &lt;strong&gt;several people share the same GPU server&lt;/strong&gt; and need a coordination layer. Earlier versions covered inventory (&amp;ldquo;which GPU is where&amp;rdquo;) and basic allocation. &lt;a class="link" href="https://github.com/drewdrew0414/AIGPUManager/releases/tag/v1.1.0" target="_blank" rel="noopener"
 &gt;v1.1.0&lt;/a&gt; is the release where the operations layer fills in.&lt;/p&gt;
&lt;h3 id="11-compute-policy-and-an-approval-workflow"&gt;1.1 Compute policy and an approval workflow
&lt;/h3&gt;&lt;p&gt;The most distinctive addition in v1.1.0 is an &lt;strong&gt;approval workflow&lt;/strong&gt; for high-risk hardware operations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum gpu reset --id node1:0 --soft --apply
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum rbac approval list --status pending
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum rbac approval approve --id &amp;lt;approval-id&amp;gt; --reason &lt;span class="s2"&gt;&amp;#34;maintenance window&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum gpu reset --id node1:0 --soft --apply --approval-id &amp;lt;approval-id&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Power-limit changes, ECC toggling, GPU reset — none of these execute immediately. They produce an approval record. On top of that, real hardware writes only happen when &lt;code&gt;GPUM_ENABLE_HARDWARE_WRITE=1&lt;/code&gt; is set in the calling shell, and dry-run is the default everywhere else. The positioning is clear: this is for environments where dragging in &lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;Slurm&lt;/a&gt; or the &lt;a class="link" href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" target="_blank" rel="noopener"
 &gt;Kubernetes Device Plugin&lt;/a&gt; is too heavy, but raw SSH is too thin — &lt;strong&gt;a one-CLI middle ground&lt;/strong&gt; for one or two shared GPU boxes.&lt;/p&gt;
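&lt;p&gt;A minimal sketch of how the two gates compose, using only the commands shown above (the exact dry-run output is not reproduced here):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Without the env var, --apply still stops at the approval record / dry-run stage
gpum gpu reset --id node1:0 --soft --apply

# A real hardware write needs both: an approved record and the explicit opt-in in the calling shell
GPUM_ENABLE_HARDWARE_WRITE=1 gpum gpu reset --id node1:0 --soft --apply --approval-id &amp;lt;approval-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;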
&lt;h3 id="12-multi-vendor-inventory"&gt;1.2 Multi-vendor inventory
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;gpum&lt;/code&gt; reads &lt;a class="link" href="https://developer.nvidia.com/management-library-nvml" target="_blank" rel="noopener"
 &gt;NVIDIA NVML&lt;/a&gt;, AMD ROCm-SMI, and &lt;a class="link" href="https://spec.oneapi.io/level-zero/latest/index.html" target="_blank" rel="noopener"
 &gt;Intel Level Zero&lt;/a&gt; in the same scan. NVML is accessed through JNA; the Level Zero loader is discovered separately. When a library is not installed, the row is marked &lt;code&gt;unavailable&lt;/code&gt; rather than being silently dropped. This is not a mobile or embedded tool — the design assumes &lt;strong&gt;heterogeneous GPUs in a workstation or small cluster&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="13-topology-aware-scheduling"&gt;1.3 Topology-aware scheduling
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum alloc estimate --model llama3-70b --params-b &lt;span class="m"&gt;70&lt;/span&gt; --precision fp16 --context &lt;span class="m"&gt;8192&lt;/span&gt; --batch &lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum schedule reserve create --gpus &lt;span class="m"&gt;4&lt;/span&gt; --start 2026-05-10T22:00:00 --end 2026-05-11T06:00:00
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum schedule gang --nodes &lt;span class="m"&gt;2&lt;/span&gt; --gpus-per-node &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It recognizes &lt;a class="link" href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener"
 &gt;NVLink&lt;/a&gt;, AMD XGMI, and Intel Xe Link and adjusts packed/spread placement hints. Distributed training that must start with all nodes (&lt;strong&gt;gang scheduling&lt;/strong&gt;), short idle windows filled with &lt;strong&gt;backfill&lt;/strong&gt;, and &lt;strong&gt;fair-share&lt;/strong&gt; weighting by historical GPU-hours — these are textbook cluster-manager features packed into a single CLI.&lt;/p&gt;
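&lt;p&gt;As a sanity check on what &lt;code&gt;alloc estimate&lt;/code&gt; has to account for, a rough back-of-envelope calculation (not gpum&amp;rsquo;s internal formula; it assumes the published Llama 3 70B shape of 80 layers, 8 KV heads, and head dim 128):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# fp16 weights: 70e9 params * 2 bytes is ~140 GB, so the weights alone span multiple 80 GB GPUs
# fp16 KV cache for --context 8192 --batch 4, assuming 80 layers, 8 KV heads, head dim 128:
#   bytes = 2 (K and V) * layers * context * batch * kv_heads * head_dim * 2 bytes per value
echo $(( 2 * 80 * 8192 * 4 * 8 * 128 * 2 ))   # 10737418240 bytes, i.e. ~10 GiB of KV cache
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;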
&lt;h3 id="14-safety-guardrails"&gt;1.4 Safety guardrails
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;v1.1.0&lt;/code&gt; leans heavily on prevention at the operations layer.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Guard&lt;/th&gt;
 &lt;th&gt;Behavior&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Max GPU / request&lt;/td&gt;
 &lt;td&gt;Reject outright any request that exceeds the per-request policy limit&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Max lease hours&lt;/td&gt;
 &lt;td&gt;Expired leases become reclamation candidates&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Thermal threshold&lt;/td&gt;
 &lt;td&gt;Preflight detection of thermal-critical GPUs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Power cap&lt;/td&gt;
 &lt;td&gt;Preflight detection of power-saturated GPUs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Stale heartbeat&lt;/td&gt;
 &lt;td&gt;Cleanup of dead workers&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Min free VRAM&lt;/td&gt;
 &lt;td&gt;Reject jobs that would breach the minimum free VRAM&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Incidents wrap this with &lt;strong&gt;GPU quarantine&lt;/strong&gt; and node drain actions. It reads like a single-host distillation of the &lt;a class="link" href="https://sre.google/sre-book/table-of-contents/" target="_blank" rel="noopener"
 &gt;SRE playbook&lt;/a&gt;.&lt;/p&gt;
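&lt;p&gt;The operational entry point to these guards is the health scoring command, which reappears in the scenarios below; a minimal sketch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Periodic health sweep (command from the post; how you schedule it is up to you, not a gpum feature)
gpum gpu health --score --quarantine-threshold   # flags degraded GPUs so the scheduler stops placing work on them
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;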
&lt;h3 id="15-ai-tooling-integration"&gt;1.5 AI tooling integration
&lt;/h3&gt;&lt;p&gt;The most immediately useful command group is &lt;code&gt;gpum integration ai&lt;/code&gt;. It turns an allocation lease into launch commands for &lt;a class="link" href="https://docs.pytorch.org/docs/stable/elastic/run.html" target="_blank" rel="noopener"
 &gt;torchrun&lt;/a&gt;, &lt;a class="link" href="https://huggingface.co/docs/accelerate/index" target="_blank" rel="noopener"
 &gt;accelerate&lt;/a&gt;, &lt;a class="link" href="https://github.com/deepspeedai/DeepSpeed" target="_blank" rel="noopener"
 &gt;DeepSpeed&lt;/a&gt;, or &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum integration ai launch --allocation-id alloc-001 --tool torchrun --arg train.py
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gpum integration ai launch --allocation-id alloc-001 --tool vllm --from-file vllm-serve.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It auto-injects &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt;, &lt;code&gt;MASTER_ADDR&lt;/code&gt;, &lt;code&gt;GPUM_RDZV_ENDPOINT&lt;/code&gt;, and vendor-specific equivalents (&lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; for AMD, &lt;code&gt;ZE_AFFINITY_MASK&lt;/code&gt; for Intel). The whole flow — &lt;strong&gt;allocation lease → injected env → launch command&lt;/strong&gt; — is one motion.&lt;/p&gt;
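&lt;p&gt;Roughly what the injected environment expands to for &lt;code&gt;--tool torchrun&lt;/code&gt; on a two-GPU lease: a sketch of the equivalent manual launch, not gpum&amp;rsquo;s literal output (the format of &lt;code&gt;GPUM_RDZV_ENDPOINT&lt;/code&gt; is not documented here, and &lt;code&gt;MASTER_PORT&lt;/code&gt; is standard torch.distributed rather than something the post lists):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# The lease pins two devices on node1; the launcher narrows visibility and points torchrun at the rendezvous host
CUDA_VISIBLE_DEVICES=0,1 MASTER_ADDR=node1 MASTER_PORT=29500 \
  torchrun --nnodes 1 --nproc_per_node 2 train.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;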
&lt;h2 id="2-tokenspeed--aiming-straight-at-engine-throughput"&gt;2. TokenSpeed — Aiming straight at engine throughput
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/lightseekorg/tokenspeed" target="_blank" rel="noopener"
 &gt;TokenSpeed&lt;/a&gt;, released the same day, sits on a different layer. Where &lt;code&gt;gpum&lt;/code&gt; manages and observes GPU resources, TokenSpeed &lt;em&gt;is&lt;/em&gt; the inference engine. The README states the ambition flatly: &lt;strong&gt;TensorRT-LLM-level performance with vLLM-level usability&lt;/strong&gt;. The &lt;a class="link" href="https://lightseek.org/blog/lightseek-tokenspeed.html" target="_blank" rel="noopener"
 &gt;announcement blog&lt;/a&gt; shows Pareto curves on &lt;a class="link" href="https://www.nvidia.com/en-us/data-center/b200/" target="_blank" rel="noopener"
 &gt;NVIDIA B200&lt;/a&gt; running &lt;a class="link" href="https://moonshotai.com" target="_blank" rel="noopener"
 &gt;Kimi K2.5&lt;/a&gt;, claiming to push past the &lt;a class="link" href="https://github.com/NVIDIA/TensorRT-LLM" target="_blank" rel="noopener"
 &gt;TensorRT-LLM&lt;/a&gt; front.&lt;/p&gt;
&lt;h3 id="21-four-design-pieces"&gt;2.1 Four design pieces
&lt;/h3&gt;&lt;p&gt;From the README&amp;rsquo;s component breakdown:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Layer&lt;/th&gt;
 &lt;th&gt;Role&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Modeling&lt;/td&gt;
 &lt;td&gt;local-SPMD with a static compiler that auto-generates collectives from module-boundary annotations&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Scheduler&lt;/td&gt;
 &lt;td&gt;C++ control plane / Python execution plane, request lifecycle modeled as a finite-state machine&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Kernels&lt;/td&gt;
 &lt;td&gt;Pluggable, layered, with one of the fastest &lt;a class="link" href="https://arxiv.org/abs/2405.04434" target="_blank" rel="noopener"
 &gt;MLA (Multi-head Latent Attention)&lt;/a&gt; implementations on Blackwell&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Entrypoint&lt;/td&gt;
 &lt;td&gt;SMG-integrated AsyncLLM, designed to keep CPU-side request handling cheap&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MLA was popularized by &lt;a class="link" href="https://arxiv.org/abs/2405.04434" target="_blank" rel="noopener"
 &gt;DeepSeek-V2&lt;/a&gt; and compresses the KV cache into a latent representation, which collapses memory bandwidth pressure during decode. TokenSpeed claims to re-implement it for the &lt;a class="link" href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/" target="_blank" rel="noopener"
 &gt;Blackwell architecture&lt;/a&gt;. The most interesting detail is that KV cache ownership is enforced at compile time through the type system — it moves a class of bugs that vLLM handles at runtime via &lt;a class="link" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"
 &gt;PagedAttention&lt;/a&gt; into the type checker.&lt;/p&gt;
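&lt;p&gt;A rough sense of why that matters for decode bandwidth, using the dimensions from the DeepSeek-V2 paper (128 attention heads with head dim 128, a 512-dim compressed latent, a 64-dim decoupled RoPE key); TokenSpeed&amp;rsquo;s exact configuration is not stated in its README:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Full multi-head KV per token per layer (128 heads, head dim 128): 2 * 128 * 128 values
# MLA stores a 512-dim compressed latent plus a 64-dim decoupled RoPE key instead
echo $(( 2 * 128 * 128 )) $(( 512 + 64 ))   # 32768 vs 576 values, roughly 57x less cache to stream per decoded token
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;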
&lt;h3 id="22-targeting-agentic-workloads"&gt;2.2 Targeting agentic workloads
&lt;/h3&gt;&lt;p&gt;The README keeps returning to one phrase: &lt;strong&gt;agentic workloads&lt;/strong&gt;. Unlike a chat workload (long single responses), agent workloads look like &lt;strong&gt;thousands of short responses with tool calls between them&lt;/strong&gt;. In that pattern, CPU-side request handling cost and KV cache reuse/reallocation dominate throughput. The emphasis on the FSM, the type system, and AsyncLLM is the direct response to that pattern.&lt;/p&gt;
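&lt;p&gt;For a concrete feel of that traffic shape, this is what an agent loop looks like against any local OpenAI-compatible server (vLLM&amp;rsquo;s default endpoint is shown; TokenSpeed&amp;rsquo;s serving API is not documented in the README, so this illustrates the workload, not its interface):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Hundreds of short decodes with tool calls in between: per-request CPU overhead and KV reuse dominate
for step in $(seq 1 200); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"local","messages":[{"role":"user","content":"next tool call"}],"max_tokens":64}' \
    -o /dev/null
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;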
&lt;h3 id="23-status-and-limits"&gt;2.3 Status and limits
&lt;/h3&gt;&lt;p&gt;The repo is explicit that this is a preview.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reproducible today: B200 with Kimi K2.5 and TokenSpeed MLA&lt;/li&gt;
&lt;li&gt;In progress: &lt;a class="link" href="https://qwenlm.github.io" target="_blank" rel="noopener"
 &gt;Qwen 3.6&lt;/a&gt;, DeepSeek V4, MiniMax M2.7 model coverage&lt;/li&gt;
&lt;li&gt;In progress: prefill-decode separation (PD), EPLB, KV store, Mamba cache, VLM, metrics&lt;/li&gt;
&lt;li&gt;In progress: Hopper and MI350 optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So today it is &lt;strong&gt;a runtime design demonstration, not a production deployment target&lt;/strong&gt;. That said, picking up 900+ stars in the first days is a signal that the &amp;ldquo;faster than vLLM, easier than TensorRT-LLM&amp;rdquo; slot in the inference engine category has been waiting to be filled.&lt;/p&gt;
&lt;h2 id="3-where-the-two-tools-meet"&gt;3. Where the two tools meet
&lt;/h2&gt;&lt;p&gt;Mapping them onto the inference stack:&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph LR
 A["Hardware"] --&gt; B["Drivers"]
 B --&gt; C["Inference engine"]
 C --&gt; D["API gateway"]
 D --&gt; E["Agents / apps"]
 A -.gpum.-&gt; B
 B -.gpum.-&gt; C
 C -.TokenSpeed.-&gt; D&lt;/pre&gt;&lt;p&gt;&lt;code&gt;gpum&lt;/code&gt; abstracts &lt;strong&gt;hardware and drivers and hands them safely to the inference engine&lt;/strong&gt;. TokenSpeed is &lt;strong&gt;the inference engine itself&lt;/strong&gt;. They do not overlap. In practice you would imagine &lt;code&gt;gpum integration ai launch --tool vllm&lt;/code&gt; producing the launch command, and whatever engine sits inside the launcher (vLLM, TokenSpeed, TensorRT-LLM) is a downstream choice.&lt;/p&gt;
&lt;h2 id="4-comparison-with-cloud-llm-observability"&gt;4. Comparison with cloud LLM observability
&lt;/h2&gt;&lt;p&gt;In the cloud LLM stack, &lt;a class="link" href="https://docs.smith.langchain.com" target="_blank" rel="noopener"
 &gt;LangSmith&lt;/a&gt;, &lt;a class="link" href="https://github.com/traceloop/openllmetry" target="_blank" rel="noopener"
 &gt;OpenLLMetry&lt;/a&gt;, &lt;a class="link" href="https://www.helicone.ai" target="_blank" rel="noopener"
 &gt;Helicone&lt;/a&gt;, and &lt;a class="link" href="https://langfuse.com" target="_blank" rel="noopener"
 &gt;Langfuse&lt;/a&gt; cover two axes:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Axis&lt;/th&gt;
 &lt;th&gt;Cloud LLM&lt;/th&gt;
 &lt;th&gt;Local / on-prem inference&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Tracing / logs&lt;/td&gt;
 &lt;td&gt;LangSmith, Langfuse&lt;/td&gt;
 &lt;td&gt;(gap — gpum&amp;rsquo;s audit log fills a slice)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Tokens / cost&lt;/td&gt;
 &lt;td&gt;Helicone, OpenLLMetry&lt;/td&gt;
 &lt;td&gt;(gap — gpum cost reports, TokenSpeed throughput)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Model gateway&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://openrouter.ai" target="_blank" rel="noopener"
 &gt;OpenRouter&lt;/a&gt;, &lt;a class="link" href="https://portkey.ai" target="_blank" rel="noopener"
 &gt;Portkey&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a class="link" href="https://github.com/BerriAI/litellm" target="_blank" rel="noopener"
 &gt;LiteLLM&lt;/a&gt; (hybrid)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Resource allocation&lt;/td&gt;
 &lt;td&gt;(managed)&lt;/td&gt;
 &lt;td&gt;gpum&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Engine throughput&lt;/td&gt;
 &lt;td&gt;(managed)&lt;/td&gt;
 &lt;td&gt;TokenSpeed, &lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;, &lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Cloud-side observability is already past its first generation (2023–2024) and into a consolidation phase. Local inference is &lt;strong&gt;at the start of generation one&lt;/strong&gt; — built by individuals or new orgs, exactly the shape of the early Langsmith and Helicone era. &lt;code&gt;gpum&lt;/code&gt; being a one-maintainer project and TokenSpeed coming from a brand-new &lt;code&gt;lightseekorg&lt;/code&gt; underline that timing.&lt;/p&gt;
&lt;h2 id="5-practical-scenarios"&gt;5. Practical scenarios
&lt;/h2&gt;&lt;p&gt;The two tools land in clearly different setups.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A team sharing 1–2 GPU servers&lt;/strong&gt;: &lt;code&gt;gpum&lt;/code&gt; fits cleanly. Start with &lt;code&gt;gpum scan --refresh&lt;/code&gt; for inventory, then &lt;code&gt;gpum submit&lt;/code&gt; to run batch jobs in containers, then &lt;code&gt;gpum gpu health --score --quarantine-threshold&lt;/code&gt; to take dying GPUs out of rotation before they cascade. Too small for &lt;a class="link" href="https://slurm.schedmd.com/" target="_blank" rel="noopener"
 &gt;Slurm&lt;/a&gt; or &lt;a class="link" href="https://www.run.ai/" target="_blank" rel="noopener"
 &gt;Run:ai&lt;/a&gt;, too big for raw SSH (the end-to-end flow is sketched after this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluating the inference engine itself&lt;/strong&gt;: TokenSpeed is still preview, but the B200 + Kimi K2.5 reproduction is meaningful as a forward-looking comparison. If you expect to be making engine choices for production over the next 12 months, having a hands-on reference point for the &amp;ldquo;post-vLLM&amp;rdquo; design space is worth setting up early.&lt;/li&gt;
&lt;/ul&gt;
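&lt;p&gt;The shared-box workflow from the first scenario, end to end. Every command here appears elsewhere in the post; the reservation size and allocation id are placeholders:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;gpum scan --refresh                                   # inventory: which GPUs exist, which vendors, which nodes
gpum schedule reserve create --gpus 2 --start 2026-05-10T22:00:00 --end 2026-05-11T06:00:00
gpum integration ai launch --allocation-id alloc-001 --tool torchrun --arg train.py
gpum gpu health --score --quarantine-threshold        # pull degraded GPUs out of rotation before failures cascade
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;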
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;Two tools shipping the same day at different layers of the same stack is a market signal: &lt;strong&gt;local and on-prem inference has hit the point where it needs first-generation operational tooling&lt;/strong&gt;. Cloud LLM stacks passed this milestone in 2023 when &lt;a class="link" href="https://www.langchain.com/langsmith" target="_blank" rel="noopener"
 &gt;LangSmith&lt;/a&gt; externalized LangChain&amp;rsquo;s operational burden. Local inference is hitting it in 2026 with &lt;code&gt;gpum&lt;/code&gt; filling the resource-management gap from the bottom and TokenSpeed reframing engine design from the top. Both carry the typical first-generation limits — &lt;code&gt;gpum&lt;/code&gt; is a single-maintainer Java project, TokenSpeed is preview-only and B200-bound — but first-generation tooling in a category is what defines the shape of the category. The cloud observability arc says most of the first wave survives and a small subset becomes standard. The same pattern is starting at the local inference layer this week.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Release and repo&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/drewdrew0414/AIGPUManager/releases/tag/v1.1.0" target="_blank" rel="noopener"
 &gt;drewdrew0414/AIGPUManager v1.1.0 release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/drewdrew0414/AIGPUManager" target="_blank" rel="noopener"
 &gt;drewdrew0414/AIGPUManager repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/lightseekorg/tokenspeed" target="_blank" rel="noopener"
 &gt;lightseekorg/tokenspeed repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://lightseek.org/blog/lightseek-tokenspeed.html" target="_blank" rel="noopener"
 &gt;TokenSpeed announcement blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Local inference runtimes&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/ggml-org/llama.cpp" target="_blank" rel="noopener"
 &gt;llama.cpp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://ollama.com" target="_blank" rel="noopener"
 &gt;Ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://lmstudio.ai" target="_blank" rel="noopener"
 &gt;LM Studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener"
 &gt;vLLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/sgl-project/sglang" target="_blank" rel="noopener"
 &gt;SGLang&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/NVIDIA/TensorRT-LLM" target="_blank" rel="noopener"
 &gt;NVIDIA TensorRT-LLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Techniques and standards&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2405.04434" target="_blank" rel="noopener"
 &gt;MLA — Multi-head Latent Attention (DeepSeek-V2 paper)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"
 &gt;PagedAttention — vLLM paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://developer.nvidia.com/management-library-nvml" target="_blank" rel="noopener"
 &gt;NVIDIA NVML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://spec.oneapi.io/level-zero/latest/index.html" target="_blank" rel="noopener"
 &gt;Intel Level Zero&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/" target="_blank" rel="noopener"
 &gt;NVIDIA Blackwell architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cloud LLM observability — for comparison&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://www.langchain.com/langsmith" target="_blank" rel="noopener"
 &gt;LangSmith&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://langfuse.com" target="_blank" rel="noopener"
 &gt;Langfuse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/traceloop/openllmetry" target="_blank" rel="noopener"
 &gt;OpenLLMetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.helicone.ai/" target="_blank" rel="noopener"
 &gt;Helicone&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/BerriAI/litellm" target="_blank" rel="noopener"
 &gt;LiteLLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>