Fixing Rust memory allocation slowdown in VS Code on Windows
On Windows, Rust programs can have surprisingly poor performance
when running in VS Code with CodeLLDB, and in some other IDEs.
This particularly affects programs that use allocation-heavy containers like HashMap<String, _>.
Even a few megabytes of small allocations can lead to terrible performance
when the container is deallocated.
There is an easy fix: set the environment variable _NO_DEBUG_HEAP=1
and you can get multiple orders of magnitude improvement.
This isn’t a novel discovery, but it’s a fairly obscure feature and I’ve found no discussion of it in the context of Rust,
and I just got bitten by it yet again, so I’ll try to explain it here.
To reproduce the issue:
Start a new project with cargo new --bin.
Open it in VS Code, with the standard rust-analyzer and CodeLLDB extensions.
Use “LLDB: Generate Launch Configurations from Cargo.toml” to build a default launch.json
(or you can use the “rust-analyzer: Debug” command).
Add this code:
use ;
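The listing here is cut off after its first token. A minimal sketch of a program of this shape follows, assuming it fills a HashMap with String keys and times construction and drop separately. The iteration count, the key format, and the helper names build_map and time_build_and_drop are assumptions, not the original code, so the timings it produces will not match the numbers quoted in this post.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical iteration count; the original value is not known.
const ITERATIONS: usize = 1_000_000;

/// Builds a map with `n` small heap-allocated String keys.
fn build_map(n: usize) -> HashMap<String, usize> {
    let mut map = HashMap::new();
    for i in 0..n {
        map.insert(format!("key{i}"), i);
    }
    map
}

/// Times construction and drop of the map separately.
fn time_build_and_drop(n: usize) -> (Duration, Duration) {
    let start = Instant::now();
    let map = build_map(n);
    let constructed = start.elapsed();

    let start = Instant::now();
    drop(map); // frees every String key, then the table itself
    let dropped = start.elapsed();

    (constructed, dropped)
}

fn main() {
    let (constructed, dropped) = time_build_and_drop(ITERATIONS);
    println!("Constructed in {:.2} secs", constructed.as_secs_f64());
    println!("Dropped in {:.2} secs", dropped.as_secs_f64());
}
```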
Run it with F5 (“Start Debugging”):
Constructed in 3.04 secs
Dropped in 52.75 secs
The drop is remarkably slow. It’s the same if we don’t explicitly drop
and just let the object go out of scope.
If we change the number of iterations, the cost seems to increase as roughly O(n²), which is very bad.
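One way to put a number on “roughly O(n²)” is to time the drop at two sizes and estimate the exponent k in time ∝ n^k. This is only a sketch: time_drop and scaling_exponent are hypothetical helper names, the two sizes are arbitrary, and a single pair of measurements is noisy.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Builds a map with `n` String keys and returns the seconds spent dropping it.
fn time_drop(n: usize) -> f64 {
    let map: HashMap<String, usize> = (0..n).map(|i| (format!("key{i}"), i)).collect();
    let start = Instant::now();
    drop(map);
    start.elapsed().as_secs_f64()
}

/// Estimates k in `time ∝ n^k` from two (size, seconds) measurements.
fn scaling_exponent(n1: usize, t1: f64, n2: usize, t2: f64) -> f64 {
    (t2 / t1).ln() / (n2 as f64 / n1 as f64).ln()
}

fn main() {
    // Arbitrary sizes chosen for illustration.
    let (n1, n2) = (100_000, 200_000);
    let (t1, t2) = (time_drop(n1), time_drop(n2));
    println!("{n1}: {t1:.3} secs, {n2}: {t2:.3} secs");
    println!("estimated exponent: {:.2}", scaling_exponent(n1, t1, n2, t2));
}
```

An exponent near 1 means the drop time grows linearly with the number of entries; an exponent near 2 means doubling the entries quadruples the time.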
Of course we shouldn’t expect great performance from a debug build.
Let’s try adding --release to the build command in launch.json:
Constructed in 2.14 secs
Dropped in 52.30 secs
Hmm, that’s barely any better. Maybe the debugger is interfering? Try Ctrl+F5 (“Run Without Debugging”). Nope, that’s just as slow.
For comparison, let’s try cargo run --release from a command line:
Constructed in 0.44 secs
Dropped in 0.20 secs
Deallocation is 250x faster! Why is it so slow when running from the IDE?
Try pausing the debugger during the drop, and the call stack looks like:
RtlTryEnterCriticalSection (@RtlTryEnterCriticalSection:957)
RtlTryEnterCriticalSection (@RtlTryEnterCriticalSection:602)
RtlTryEnterCriticalSection (@RtlTryEnterCriticalSection:691)
RtlTryEnterCriticalSection (@RtlTryEnterCriticalSection:1234)
RtlDeleteBoundaryDescriptor (@RtlDeleteBoundaryDescriptor:392)
RtlGetCurrentServiceSessionId (@RtlGetCurrentServiceSessionId:1203)
RtlFreeHeap (@RtlFreeHeap:24)
RtlRegisterSecureMemoryCacheCallback (@RtlRegisterSecureMemoryCacheCallback:348)
EtwLogTraceEvent (@EtwLogTraceEvent:201)
RtlGetCurrentServiceSessionId (@RtlGetCurrentServiceSessionId:1203)
RtlFreeHeap (@RtlFreeHeap:24)
<hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop (@<hashbrown::raw::RawTable<T,A> as core::ops::drop::Drop>::drop:57)
...
Rtl is the “run-time library” from the mostly-undocumented Windows NT native API (the level below the properly-documented Win32 API).
If we step into drop with the debugger, we get to GlobalAlloc::dealloc, which calls HeapFree, which calls RtlFreeHeap.
This is simply freeing a single pointer, so it shouldn’t be that expensive.
CodeLLDB isn’t great at stack tracing through the OS, so let’s try with WinDbg instead, which gives more sensible output:
ntdll!RtlpHeapFindListLookupEntry+0x1c0
ntdll!RtlpFindEntry+0x3a
ntdll!RtlpFreeHeap+0x94b
ntdll!RtlpFreeHeapInternal+0x7c4
ntdll!RtlFreeHeap+0x51
ntdll!RtlDebugFreeHeap+0x273
ntdll!RtlpFreeHeap+0x83ae6
ntdll!RtlpFreeHeapInternal+0x7c4
ntdll!RtlFreeHeap+0x51
debug_heap_test!hashbrown::raw::RawTableInner::resize_inner+0x327
...
The RtlDebugFreeHeap frame stands out.
It turns out this debug heap is an old (~1994) Windows feature with almost zero official documentation.
Processes created by a debugger have a few global flags set,
causing RtlCreateHeap to set corresponding heap flags that enable some debug features.
In particular HEAP_FREE_CHECKING_ENABLED overwrites freed memory with a fixed pattern (0xFEEEFEEE),
and validates the heap by checking that every free block still has the same pattern
(indicating it has not been mistakenly overwritten by the application).
It appears to perform this validation over the entire heap on every call to RtlFreeHeap,
resulting in the O(n²) cost when freeing a large number of objects.
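To see why validating the whole heap on every free gives quadratic cost, here is a toy model. It is not how ntdll is implemented: the ToyHeap type, its fields, and the four-word block size are invented for illustration, and only the 0xFEEEFEEE fill pattern comes from the description above. Each free fills the block with the pattern and then re-checks every block freed so far.

```rust
/// Fill pattern written into freed blocks (the one named in the text).
const FREE_PATTERN: u32 = 0xFEEE_FEEE;

/// Hypothetical toy heap: fixed-size blocks of four u32 words each.
struct ToyHeap {
    blocks: Vec<[u32; 4]>,
    free_list: Vec<usize>, // indices of blocks that have been freed
    words_checked: u64,    // total validation work done so far
}

impl ToyHeap {
    fn with_blocks(n: usize) -> Self {
        ToyHeap {
            blocks: vec![[0; 4]; n],
            free_list: Vec::new(),
            words_checked: 0,
        }
    }

    /// Frees one block, then validates every free block in the heap.
    /// Returns false if any free block no longer holds the pattern.
    fn free(&mut self, index: usize) -> bool {
        self.blocks[index] = [FREE_PATTERN; 4];
        self.free_list.push(index);

        let mut ok = true;
        for &i in &self.free_list {
            for &word in &self.blocks[i] {
                self.words_checked += 1;
                ok &= word == FREE_PATTERN;
            }
        }
        ok
    }
}

fn main() {
    for n in [1_000usize, 2_000, 4_000] {
        let mut heap = ToyHeap::with_blocks(n);
        for i in 0..n {
            heap.free(i);
        }
        println!("{n} frees -> {} words checked", heap.words_checked);
    }
}
```

In this model the k-th free checks k blocks, so n frees check 1 + 2 + ... + n blocks in total, and doubling the number of frees roughly quadruples the work.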
The debug heap is meant to help C programmers detect buffer overflows, use after free, etc. We don’t need that help, because we’re using Rust.
(Actually it’s not even helpful for most C programmers, because Microsoft’s C runtime has its own separate
CRT debug heap to catch these errors before they reach the Rtl heap. Recently AddressSanitizer
came to Windows too; that’s significantly more powerful since it can detect reads to bad addresses, not just writes.)
Fortunately the Rtl debug heap can be disabled by setting the environment variable _NO_DEBUG_HEAP=1
before creating the process. (The process can’t set the variable itself, because the flags will have already
been set and the heap already created.)
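Because the variable has to be in place before the process exists, whatever launches the program has to set it. As a rough illustration of that ordering (not of how any particular debugger does it), a parent process written in Rust could prepare the child’s environment like this. The function name and the program path are hypothetical.

```rust
use std::process::Command;

/// Prepares a command whose environment contains _NO_DEBUG_HEAP=1.
/// The variable is part of the environment handed over at process creation,
/// so it is already set when the child starts running.
fn command_without_debug_heap(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env("_NO_DEBUG_HEAP", "1");
    cmd
}

fn main() {
    // Hypothetical path to the program under test.
    let cmd = command_without_debug_heap("target/debug/my_program");
    // Print the explicitly configured variables instead of spawning anything.
    for (key, value) in cmd.get_envs() {
        println!("{:?} = {:?}", key, value);
    }
}
```

Calling spawn() or status() on the returned Command would start the child with the variable set.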
With CodeLLDB you can set this globally in settings.json (“Preferences: Open User Settings (JSON)”):
"lldb.launch.env": { "_NO_DEBUG_HEAP": "1" },
or set it per project if you prefer. Now let’s try the same program in the debugger:
Constructed in 0.41 secs
Dropped in 0.17 secs
Problem solved.
This issue typically doesn’t affect C/C++ in VS Code, because the C/C++ extension already
sets _NO_DEBUG_HEAP by default.
That also applies to Rust programs if you configure VS Code with "rust-analyzer.debug.engine": "ms-vscode.cpptools"
(or leave it at the default auto and don’t install CodeLLDB)
and launch the program with the “rust-analyzer: Debug” command,
since that debugs via the C/C++ extension instead of CodeLLDB.
Visual Studio has set _NO_DEBUG_HEAP by default since 2015.
This issue does affect the RustRover IDE, though only when using the “Debug” command.
The “Run” command is not affected, because Windows doesn’t think there’s a debugger
(IsDebuggerPresent() returns false, unlike VS Code’s “run without debugging”).
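If you want a program to tell you which situation it is in, it can call IsDebuggerPresent() itself at startup. The sketch below declares the Win32 function directly rather than pulling in a crate. The debug_heap_likely heuristic and the function names are assumptions for illustration: a debugger being present does not prove the debug heap is active, and I’m assuming the value "1" as used in this post.

```rust
/// Returns true if Windows reports a debugger for this process.
#[cfg(windows)]
fn debugger_present() -> bool {
    #[link(name = "kernel32")]
    extern "system" {
        fn IsDebuggerPresent() -> i32; // Win32 BOOL
    }
    // SAFETY: IsDebuggerPresent takes no arguments and has no preconditions.
    unsafe { IsDebuggerPresent() != 0 }
}

/// Stub for other platforms, where the Windows debug heap does not exist.
#[cfg(not(windows))]
fn debugger_present() -> bool {
    false
}

/// Heuristic (an assumption, not a documented rule): the debug heap is likely
/// active when a debugger is present and _NO_DEBUG_HEAP is not set to "1".
fn debug_heap_likely(debugger: bool, no_debug_heap: Option<&str>) -> bool {
    debugger && no_debug_heap != Some("1")
}

fn main() {
    let var = std::env::var("_NO_DEBUG_HEAP").ok();
    if debug_heap_likely(debugger_present(), var.as_deref()) {
        eprintln!("warning: debugger detected and _NO_DEBUG_HEAP is not 1; frees may be very slow");
    }
}
```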
WinDbg triggers the debug heap by default, but has the -hd command line option to disable it.
Some profilers might trigger the debug heap, though at least Intel VTune and AMD uProf appear not to
(the process runs with IsDebuggerPresent() == false).