iTLB multihit

EKOPARTY PRESENTATION NOVEMBER 2023

The information in this writeup is partly superseded by new research that we presented at Ekoparty in November 2023. This includes a new, universally one-shot single-instruction(!) trigger.

You can find the slides here

You can get the new trigger code at Github

Video of full presentation

TL;DR

Bug history

Circa 2013 we developed a series of hypervisor-specific fuzzers. They were designed to be generic, so that they could easily be adapted to run against other targets. One of the fuzzers, which we called PageFuzzer, was designed to target shadow memory, which was still in use by some systems at the time. This fuzzer yielded good results against the Xen hypervisor, a target of interest during that period.

Years later (2017) we were running some of them against Hyper-V. Running PageFuzzer against Hyper-V obviously makes very little sense, since Hyper-V uses SLAT for address translation. But… fuzzing is cheap, so why not? To our great surprise it actually caused a system crash.

At first, we were overjoyed thinking we had actually encountered a hypervisor bug, but the joy quickly turned into confusion as we started to analyze it. On the initial test system, the trigger simply caused a total system freeze. Hypervisor debugging was totally hosed after the freeze, which is something that can be expected under certain circumstances (after all, the debugger is still a part of the hypervisor).

We were able to hook the potentially relevant fault handlers such as #DF and #MC, and have them output tiny bits of information over the serial port (for debugging we used kdnet, so the serial port was free to use). Surprisingly, the fault turned out to be a #MC (machine check), and the IRET frame pointed to guest code!

After having the same machine check occur on numerous systems, we now knew there was an actual bug and not some issue with this particular system, but even DCI debugging wasn’t very informative - the CPU turned out to be too screwed up after the trigger. The most readable information came from the management log on an HP server:

Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000032, Bank 0x00000002, Status 0xB2000000'00070150, Address 0x00000000'00000000, Misc 0x00000000'00000000)

The value in the MCi_STATUS MSR decodes to a TLB error on instruction fetch at level 0 - we believe L0 here refers to the iTLB (or the dTLB if it were a data fetch), as distinct from the sTLB, which would be L1.
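As a rough illustration, the status value from the log can be split into its architectural fields. This is a minimal sketch based on the IA32_MCi_STATUS layout in the Intel SDM. The function name and the returned dictionary are made up for this example, and it only extracts the flag bits and the low TT (transaction type) and LL (level) sub-fields rather than attempting a full classification of the error code.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions per the Intel SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds valid data
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Hypothetical helper: split a 64-bit MCi_STATUS value into fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    mcacod = status & 0xFFFF                   # MCA error code, bits 15:0
    fields["MCACOD"] = mcacod
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific code, bits 31:16
    fields["LL"] = mcacod & 0x3                # level: 0 means level 0
    fields["TT"] = (mcacod >> 2) & 0x3         # transaction type: 0 means instruction
    return fields


# Status value taken from the HP management log above.
decoded = decode_mci_status(0xB200000000070150)
for name, value in decoded.items():
    print(f"{name:7} {value}")
```

For this value, VAL, UC, EN and PCC are set, ADDRV and MISCV are clear (which matches the all-zero Address and Misc fields in the log), and both TT and LL are 0, i.e. an instruction access at level 0.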

Bug (re)discovery by Intel

At this point we were fairly certain this was actually a CPU bug, but we had other priorities, and we filed it away. Later, in 2019, we found out that Intel had discovered the issue internally and that it had been assigned CVE-2018-12207.

NOTE: How this bug was actually discovered by Intel is unknown to us. For readers interested in the testing and verification process of a modern CPU design, we can recommend the OpenSPARC Internals book, which probably covers the most modern CPU for which this process is publicly documented.

Like all bugs nowadays, it has a name: iTLB multihit. The reason for that will become obvious in a bit…

TLB implementation and past issues

The TLB and the surrounding infrastructure have, unsurprisingly since it’s complex performance-critical logic, been a rich source of CPU bugs. A famous example is the AMD Phenom TLB bug, which caused quite a fuss back in 2008. The underlying logic error actually has some similarities with iTLB multihit, although it involves interaction with the cache rather than the contents of the TLB itself.

While not bugs per se, unintended behaviors have been used for both defensive and offensive purposes. Before the sTLB was introduced, it was possible to desynchronize the iTLB and dTLB so that instruction and data fetches returned different results. This was used to implement non-executable pages on x86 back in the days before this was “officially” possible (NX bit). Data fetches from non-executable pages filled the dTLB with the proper value and were allowed to continue unimpeded, while instruction fetches caused the process to terminate. Voilà - non-executable pages! By instead having data and instruction fetches stick different addresses into their corresponding TLBs, a technique for hiding rootkits known as Shadow Walker was born. These techniques are no longer possible on modern CPUs due to the introduction of the sTLB and synchronization mechanisms between the dTLB and the iTLB.

The setting for the bug

The CPUs affected by this bug all share the same basic TLB implementation, with a separate iTLB and dTLB for instruction and data lookups, respectively. This is a logical design choice for performance reasons. Instructions and data rarely overlap; in fact, mixing them has a negative performance impact on modern CPUs. Additionally, large and fast TLBs are expensive, hard to design, and produce a significant amount of heat. Using two smaller TLBs instead is more efficient.

The iTLB and dTLB are not directly filled from page tables, rather they are filled from the shared TLB (sTLB). This, together with enforcing consistency between the different TLBs (iTLB, dTLB, and sTLB), is what makes techniques like Shadow Walker impossible.

In addition to this - and what makes this bug actually possible - the iTLB and the dTLB are each composed of several TLBs in turn, one per page size.

NOTE: This is not true in Atom models, where only one iTLB/dTLB is used for all page sizes. This is the reason why the Atom processors are not vulnerable to this bug.

Obviously, the TLBs for different page sizes are queried in parallel, otherwise this would be a performance pessimization rather than an optimization.

This couldn’t have any downsides, right? Logically speaking, a page can only have one size. What could possibly go wrong?

What goes wrong?

As the name iTLB multihit implies, and as the description makes clear, the bug condition happens when an instruction fetch “hits” (i.e. virtual address matches) entries in the iTLBs for two (or more) different page sizes. The actual bug is what makes this “impossible” condition happen.
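The condition can be illustrated with a toy model. This is only a sketch under simplifying assumptions: each per-size TLB is modeled as a plain dictionary, the class and exception names are invented for the example, and raising an exception stands in for whatever the real hardware does when more than one TLB answers.

```python
PAGE_SHIFTS = {"4K": 12, "2M": 21}


class MultiHit(Exception):
    """Stand-in for the machine check seen on real hardware."""


class SplitITLB:
    """Toy iTLB: one table per page size, all probed on every lookup."""

    def __init__(self):
        self.tables = {size: {} for size in PAGE_SHIFTS}

    def fill(self, size, vaddr, paddr):
        shift = PAGE_SHIFTS[size]
        self.tables[size][vaddr >> shift] = paddr >> shift

    def lookup(self, vaddr):
        hits = []
        for size, shift in PAGE_SHIFTS.items():
            frame = self.tables[size].get(vaddr >> shift)
            if frame is not None:
                offset = vaddr & ((1 << shift) - 1)
                hits.append((size, (frame << shift) | offset))
        if len(hits) > 1:
            raise MultiHit([size for size, _ in hits])
        return hits[0][1] if hits else None


itlb = SplitITLB()
itlb.fill("4K", 0x40201000, 0x00001000)
single = itlb.lookup(0x40201234)          # exactly one hit, in the 4K table

# A 2M entry covering the same address is added while the 4K entry is still there.
itlb.fill("2M", 0x40200000, 0x80000000)
try:
    itlb.lookup(0x40201234)
    multihit = False
except MultiHit:
    multihit = True
```

In this model the first lookup resolves normally, while the second finds the same virtual address in both tables, which is the "impossible" state described above.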

Why this causes such a catastrophic failure isn’t possible to tell without knowing exactly how this is implemented on the die level. Our best guess is that the data pins from the different TLBs are simply connected in parallel, and multiple TLBs driving them at the same time causes corruption at the electrical level. Presumably one of the pins would be parity, and the parity mismatch could be what triggers the #MC. An interesting side note is that this sort of issue could cause actual hardware damage (due to issues such as over-current through the transistors driving the pins). However, we have triggered this many times on multiple systems and haven’t seen any sign of this.

Why does it go wrong?

So, when exactly can this condition occur? An obvious candidate would be when the TLB is filled from a page walk after a TLB miss, where the entry is present in neither the L1 (iTLB/dTLB) nor the L2 (sTLB) TLB. This explanation quickly falls apart, though: according to both the documentation and observed behavior, execution is completely stalled while the TLB is being filled from a page walk. There is, however, another scenario apart from TLB misses where the TLB can be updated. As on many architectures, x86 page table entries have Accessed and Dirty bits that are updated the first time a page is read (A) or written (D). To do this, the CPU needs to perform a page walk even if the entry is already in the TLB.

Here’s where it gets interesting - what happens if the page tables in memory no longer match what’s in the TLB? For example, what if the physical address of a page has changed? The answer is that the old TLB entry will be replaced with a new one. That’s the easy one!

Now, what if the physical address and the page size have both changed? That gets more fun! The old TLB entry has to be removed from the iTLB for the old page size and new entries poked into another. What if we go from a smaller page size to a larger one (such as 4K to 2M)? Then we will be removing several entries, and adding one. Certainly sounds like things could go wrong here - like an entry briefly existing in TLBs for both sizes at the same time. Maximum fun!
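To put a number on "several entries" for the 4K to 2M case: one 2M page covers the same range as 512 4K pages, so up to that many old entries may have to go when the single new one comes in. A quick sketch of the arithmetic (the helper name and the example address are made up):

```python
SIZE_4K = 1 << 12
SIZE_2M = 1 << 21


def covered_4k_pages(base_2m):
    """4K-aligned virtual addresses inside the 2M page starting at base_2m."""
    assert base_2m % SIZE_2M == 0, "base must be 2M-aligned"
    return range(base_2m, base_2m + SIZE_2M, SIZE_4K)


pages = covered_4k_pages(0x40200000)
print(len(pages))                       # number of 4K pages per 2M page
print(hex(pages[0]), hex(pages[-1]))    # first and last 4K page in the range
```

How many of those 4K translations are actually cached at any moment depends on the TLB size and on what the code has touched, so 512 is an upper bound rather than a typical value.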

Our hypothesis is that some instruction fetching - at least prefetching - can actually continue during this process. This explains why only the iTLB is affected. The underlying condition actually exists in both the iTLB and dTLB implementations, but only instruction fetches, not data accesses, can take place while the update is in progress. While not 100% confirmed, this explanation neatly accounts for all our observations.

How do we make it go wrong?

From the previous section, we know that we need at least to:

During our research we produced multiple triggers for the bug. We initially released one of them, which should work well on most models (except a few old models with a different branch predictor). Recently, we added an alternative (older) one.

Mitigation

When using EPT (which is Intel’s implementation of what’s generally known as SLAT or nested paging), the guest has full access to modify page tables. However, instead of directly using the physical addresses from the guest, they undergo a second level of translation from what’s known as guest physical address to actual physical addresses, using the EPT page tables provided by the hypervisor. The hypervisor can’t control the contents of the guest page tables (at least not without very intrusive measures with a large performance impact), but it does control the second translation level.

This “second level” has its own set of page sizes (and flags, which will become important in a bit). This is all documented, for example in the Intel Software Developer’s Manual, 28.2. Being able to use large pages for guest mappings, regardless of the actual page size used by the guest, greatly reduces the record keeping needed by the hypervisor. Otherwise, a guest with 4GB of RAM would need a million 4K EPT page table entries. What’s far less obvious and pretty much undocumented - in fact, while obvious in hindsight, it took us somewhat by surprise - is what happens if the guest and EPT page tables use different page sizes. It seems to be the case that the smaller of the two page sizes is what ends up in the TLB.
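Both points can be made concrete with a few lines of arithmetic. This is a sketch only: the function names are invented, and `effective_tlb_page_size` simply encodes the observation above (smaller size wins) rather than any documented rule.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def ept_leaf_entries(guest_ram, ept_page_size):
    """Number of leaf EPT entries needed to map guest_ram with one page size."""
    return guest_ram // ept_page_size


def effective_tlb_page_size(guest_page_size, ept_page_size):
    """Observed behavior described in the text: the smaller size ends up in the TLB."""
    return min(guest_page_size, ept_page_size)


entries_4k = ept_leaf_entries(4 * GIB, 4 * KIB)
entries_2m = ept_leaf_entries(4 * GIB, 2 * MIB)
entries_1g = ept_leaf_entries(4 * GIB, 1 * GIB)
print(entries_4k, entries_2m, entries_1g)

# Guest maps its code with a 2M page, but the hypervisor backs it with 4K EPT pages.
print(effective_tlb_page_size(2 * MIB, 4 * KIB) == 4 * KIB)
```

With 4K EPT pages the 4GB guest needs 1,048,576 leaf entries, against 2,048 with 2M pages or 4 with 1G pages, which is why hypervisors prefer large EPT mappings.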

Together with the NX flag of the EPT page table entries, this provides a far less intrusive way of mitigating the bug than would otherwise be possible:

Now, any pages containing code will be treated as if they were 4K, and thus only ever end up in the iTLB for a single page size. No more iTLB multihit!
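One way such a scheme can work, roughly along the lines of what the Linux kernel documentation describes for KVM, is to keep large EPT mappings non-executable and break them into 4K pages the first time the guest tries to execute from them. The toy model below sketches that idea. The class and method names are hypothetical, and making only the faulting 4K page executable is an assumption of this sketch, not a statement about any particular hypervisor.

```python
SIZE_4K = 1 << 12
SIZE_2M = 1 << 21


class EptRegion:
    """Toy model of one 2M-aligned guest-physical region in the EPT."""

    def __init__(self, base):
        assert base % SIZE_2M == 0, "region must be 2M-aligned"
        self.base = base
        self.split = False       # still mapped by a single 2M entry?
        self.exec_4k = set()     # indices of 4K pages made executable

    def mapping_for(self, gpa):
        """Return (page_size, executable) for the EPT entry covering gpa."""
        if not self.split:
            return SIZE_2M, False          # large pages are never executable
        index = (gpa - self.base) // SIZE_4K
        return SIZE_4K, index in self.exec_4k

    def on_exec_violation(self, gpa):
        """Hypothetical handler for an EPT violation on instruction fetch."""
        self.split = True                  # replace the 2M entry with 4K entries
        self.exec_4k.add((gpa - self.base) // SIZE_4K)


region = EptRegion(0x00200000)
before = region.mapping_for(0x00201000)    # data access path: 2M, not executable
region.on_exec_violation(0x00201000)       # guest tried to run code here
after = region.mapping_for(0x00201000)     # now 4K and executable
neighbor = region.mapping_for(0x00202000)  # 4K, still not executable
```

The invariant the model maintains is that an executable mapping is always 4K, so code can only ever be cached in the iTLB for one page size.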

This still comes with a potentially significant performance impact. Normally, the hypervisor would use large (>4K) pages for most, if not all, EPT mappings.

Most of the hypervisors we tested have this mitigation implemented but disabled by default (except KVM):

PRODUCT        MITIGATION    DEFAULT
KVM            Yes           Yes
Hyper-V        Yes           No
VMware ESXi    Yes           No
VirtualBox     ???           No
Xen            ???           ???