Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI published a workstation tuning guide arguing that GPU power limits and undervolting can cut heat, power draw and fan noise during local AI inference. The guide says an RTX 4090 workload kept most tokens-per-second performance at lower wattage, though results vary by card, model and workload.

Thorsten Meyer AI has published a new workstation tuning guide aimed at users running high-power desktop AI rigs. It argues that GPU power limiting and undervolting can lower heat, power draw and fan noise for local AI inference while preserving most of the tokens-per-second throughput.

The guide presents power limiting as the first recommended change for high-power AI workstations because it is free, reversible and simple to test. It says users should try limiting a GPU to about 70% of its normal power target before buying cooling hardware, changing a case or rearranging fans.
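
As a minimal sketch of the arithmetic behind a "70% of the power target" setting, the helper below turns a default limit and a fraction into a whole-watt target and clamps it to the range the card accepts. It is not taken from the guide: the function name and the 450 W / 150 W / 600 W values are hypothetical placeholders for a single example card.

```python
from typing import Optional

def power_limit_target(default_watts: float, fraction: float = 0.70,
                       min_watts: Optional[float] = None,
                       max_watts: Optional[float] = None) -> int:
    """Return a power-limit target in whole watts, clamped to the allowed range."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = default_watts * fraction
    if min_watts is not None:
        target = max(target, min_watts)  # the driver rejects values below the minimum
    if max_watts is not None:
        target = min(target, max_watts)
    return round(target)

# Hypothetical card: 450 W default limit, 150 W minimum, 600 W maximum.
print(power_limit_target(450, 0.70, min_watts=150, max_watts=600))
```

On NVIDIA cards, the default, minimum and maximum limits can be read with nvidia-smi --query-gpu=power.default_limit,power.min_limit,power.max_limit --format=csv,noheader,nounits, so the placeholder values can be swapped for the real ones.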

Thorsten Meyer AI attributes the result to the workload profile of local inference. The guide says many local LLM workloads are memory-bandwidth-bound, meaning the GPU is often waiting on VRAM rather than running its compute cores at full utilization. Under that pattern, the guide says, cutting core power and clocks can reduce heat by a larger proportion than it reduces token throughput.
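
A rough way to see why a bandwidth-bound workload tolerates lower core clocks: in single-stream decoding, each generated token requires reading roughly all of the model weights from VRAM once, so throughput is capped near memory bandwidth divided by model size. The sketch below is not from the guide. The 1008 GB/s figure is the RTX 4090's published peak memory bandwidth, the 20 GB model size is an illustrative assumption, and real throughput falls below this ceiling.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    # One full pass over the weights per generated token.
    return bandwidth_gb_s / weights_gb

# Assumed inputs: 1008 GB/s peak bandwidth, 20 GB of quantized weights.
ceiling = decode_ceiling_tokens_per_sec(1008.0, 20.0)
print(f"Bandwidth ceiling: {ceiling:.1f} tokens/s")
```

Core clock does not appear in this estimate, which is the intuition behind the guide's claim: as long as the cores can keep up with VRAM, lowering their power changes the ceiling very little.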

The source material includes measured RTX 4090 figures from a sustained workload. In one table, stock operation is listed at 390 watts, 72 degrees Celsius and 100% speed. A 70% power limit is listed at 300 watts, 67 degrees Celsius and 93.4% speed, while a 60% setting is listed at 260 watts, 62 degrees Celsius and 91.5% speed. The same guide describes the 70% setting as a recommended point that removes about 90 watts of heat for about a 7% speed loss in that measured case.
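
The trade-off in those figures is easier to compare as speed per watt. The short calculation below uses only the numbers reported in the guide's RTX 4090 table; the helper name and the "relative speed per watt" metric are illustrative choices, not something the guide defines.

```python
# Figures as reported in the guide's RTX 4090 table: (watts, deg C, relative speed %).
measurements = {
    "stock": (390, 72, 100.0),
    "70% limit": (300, 67, 93.4),
    "60% limit": (260, 62, 91.5),
}

def relative_efficiency(name: str, baseline: str = "stock") -> float:
    """Speed per watt for a setting, relative to the baseline setting."""
    watts, _, speed = measurements[name]
    base_watts, _, base_speed = measurements[baseline]
    return (speed / watts) / (base_speed / base_watts)

for name, (watts, temp_c, speed) in measurements.items():
    saved = measurements["stock"][0] - watts
    print(f"{name:>10}: {watts} W, {temp_c} C, {speed:.1f}% speed, "
          f"{saved} W saved, {relative_efficiency(name):.2f}x speed per watt")
```

On these reported numbers, the 70% setting works out to roughly 1.21 times the stock speed per watt and the 60% setting to roughly 1.37 times, for that one measured workload.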

Why It Matters

The report matters for readers running local AI models because GPU heat and fan noise are among the main practical constraints on desktop inference setups. If the reported pattern holds for a user’s workload, a power limit can reduce thermal load without a matching drop in output, making long inference sessions quieter and easier to cool.

The finding also affects buying decisions. The guide frames power limiting as a low-cost step to test before spending money on larger coolers, new cases or additional fans. For users running one or more high-end GPUs, a reduction of tens of watts per card can also reduce the heat released into the room and help the system stay stable under sustained load.

Background

Modern GPUs typically ship with voltage and clock settings chosen to keep a wide range of chips stable at rated performance. The guide says that factory voltage curves include a safety margin, and that the top of the voltage range adds heat in exchange for only a small amount of extra performance.

The guide separates two approaches. Power limiting uses a single setting to restrict the card’s maximum power draw, letting the GPU manage voltage and clocks automatically. Undervolting changes the voltage-frequency curve directly, which may preserve more speed for a given heat target but needs workload-specific testing.

The guide recommends tools such as MSI Afterburner on Windows and nvidia-smi or LACT on Linux. It says a Linux user could set a power limit with a command such as nvidia-smi -pl 300, then test temperature, power draw, held clock and tokens per second under the real workload rather than relying on a short benchmark.
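
The monitoring step can be sketched in a few lines. The example below reads power draw, the active power limit, temperature and SM clock through nvidia-smi's CSV query mode and parses the result. It is an illustration under assumptions rather than the guide's own tooling: the function names and the sample values are made up, and the parser expects numeric fields (a field reported as "[N/A]" would raise an error).

```python
import csv
import io
import subprocess

# Query fields supported by nvidia-smi --query-gpu.
FIELDS = ["power.draw", "power.limit", "temperature.gpu", "clocks.sm"]

def parse_telemetry(csv_text: str) -> list[dict[str, float]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            rows.append({name: float(value) for name, value in zip(FIELDS, row)})
    return rows

def read_telemetry() -> list[dict[str, float]]:
    """Query nvidia-smi once; requires an NVIDIA driver on the machine."""
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_telemetry(out)

# Sample line in the shape nvidia-smi emits; the values are made up for illustration.
sample = "298.41, 300.00, 67, 2520\n"
print(parse_telemetry(sample))
```

Calling read_telemetry() every few seconds during a long inference run gives the temperature, wattage and held-clock trace the guide asks for. Setting the limit itself with nvidia-smi -pl normally requires administrator privileges, and on multi-GPU systems the -i flag selects which card to change.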

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“This is a tuning guide, not a warranty document.”

— Thorsten Meyer AI disclosure

What Remains Unclear

The reported figures are workload- and hardware-specific. The guide says results vary by card, model, quantization and workload, and its RTX 4090 numbers should not be treated as a guarantee for every system. It is also not yet clear from the source material how each cited test was standardized across different GPUs, cooling setups and inference software stacks.

Undervolting carries more uncertainty than basic power limiting because a voltage curve that appears stable briefly can fail later in a long run, according to the guide. Users still need to test their own models, batch sizes, context lengths and runtime settings.

What’s Next

The next step for readers is measurement on their own systems: set a conservative power limit, run the real inference workload for an extended period, and compare tokens per second, temperature, wattage and noise against stock settings. The guide says users who want finer control can then test undervolting, starting around 0.9 to 0.95 volts and validating stability over longer sessions.
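
The comparison described above can be sketched as a small timing harness. Everything here is a stand-in: the function names are hypothetical, and fake_generate replaces a real inference call, which should return the number of tokens it produced. Run it once at stock settings and once under the power limit, then compare.

```python
import time
from statistics import mean
from typing import Callable

def tokens_per_second(generate: Callable[[], int], runs: int = 3) -> float:
    """Average tokens/sec over several runs of a caller-supplied generation function."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()  # must return the number of tokens produced
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    return mean(rates)

def retained_speed_pct(limited_tps: float, stock_tps: float) -> float:
    """Share of stock throughput kept under the power limit, in percent."""
    return 100.0 * limited_tps / stock_tps

def fake_generate() -> int:
    # Stand-in for a real inference call: sleeps briefly and reports 50 tokens.
    time.sleep(0.05)
    return 50

print(f"{tokens_per_second(fake_generate):.0f} tokens/s (stand-in workload)")
```

For a real test, the runs should be long enough for temperatures and clocks to settle, since a short benchmark can hide throttling or instability that only shows up in extended sessions.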

Key Questions

What is the confirmed development?

Thorsten Meyer AI has published a guide and interactive infographic recommending GPU power limiting and undervolting as an early step for reducing heat and noise in local AI inference workstations.

Does power limiting always preserve tokens per second?

No. The guide reports modest speed losses in its cited RTX 4090 workload, but it says results vary by GPU, model, quantization and workload. Users need to test their own setup.

What power limit does the guide suggest as a starting point?

The guide points to a 70% power limit as a practical starting point. In its cited RTX 4090 example, that setting reduced power from 390 watts to 300 watts while keeping 93.4% of speed.

Is power limiting the same as undervolting?

No. Power limiting restricts maximum GPU power and lets the card adjust itself. Undervolting edits the voltage-frequency curve directly and may need more careful stability testing.

What remains unclear?

The source does not establish that the same savings will appear across all local inference workloads. Long-run stability, tokens-per-second loss and heat reduction remain system-specific.

Source: Thorsten Meyer AI

Author

Auto Blogging Team
