Crisis: Security vs Performance

April 6th, 2018

Crisis: Security vs performance.

It’s a choice we’ve been avoiding.

Prologue:

We thought it prudent to give a description of the recent Meltdown and Spectre processor bugs from a broader perspective that includes the ideas that underpin our work here at Trustworthy Systems.

We also thought it would be responsible, and expected of us to answer the questions our community would most certainly have on their minds.

How do Meltdown and Spectre affect seL4 itself? How do Meltdown and Spectre affect systems designed and built around seL4? What unique value proposition does formal verification offer in this now uncertain world? Here at Trustworthy Systems, our old guard has 20 years of microkernel history and expertise that we regularly draw upon, trying to make software more secure. Meltdown and Spectre have introduced 20 years of processor vulnerability for us to reckon with.

The following blog post is an examination from our point of view, of what this means for the future of secure, trustworthy computing, how it affects our software ecosystem vision, and finally a disclosure of what steps we will be taking to mitigate Meltdown and a set of recommendations for how to harden seL4-based systems against Spectre.

Context: what are Meltdown and Spectre?

Meltdown and Spectre are two recently disclosed processor bugs which affect almost all x86 processors fabricated in the last 2 decades. Furthermore in the case of Spectre, it has also been confirmed to affect several models of ARM processors which use speculative execution. These bugs allow an attacker to ignore the security mechanisms of the operating system and go one level lower, bypassing OS security checks and exploiting flaws in the processor silicon itself. They have been found to be effective against Linux, Windows, Macintosh and all major Oses including a mobile OS running on at least one ARM-based smart phone.

The primary problem is that on these flawed processors, speculative execution of code that should not be executed has side effects which can be observed by software running on the processor. Any results or side-effects of incorrectly speculatively executed code are usually thrown away by the processor, but on these flawed processors, not all side effects are properly discarded. Using complicated techniques, talented software hackers are able to direct the processor to speculatively execute some code that it should not execute and then by observing the side effects which were not properly discarded, read data that they should not be able to read.

The three main properties of a secure system are confidentiality (read-security), integrity (write-security) and availability (service-quality guarantees). Meltdown and Spectre expose vulnerabilites that break confidentiality – that is, they allow information which should be readable only by authorized agents to be read by unauthorized agents. Unfortunately, breaking confidentiality also opens up the opportunity for you to subsequently break integrity and eventually availability.

The wider industry story that is highlighted by Meltdown and Spectre.

It is easy to talk about the present and the past. The whole community is currently engaged in a discussion about current fiasco and the history that led up to it. Few people are talking about the future of trustworthy computing.

To us at Trustworthy Systems, it is more important to talk about the future. Several of our researchers and engineers have been trying to highlight the very fact that on modern processors, hardware caches and other shared resources on the processor chip are not properly partitioned and that this exposes side channels that can be exploited by malicious agents to extract information. Both Meltdown and Spectre rely on side channels to read unauthorized information.

Two of our researchers, Gernot Heiser’s and Qian Ge’s work over the past 2 years has culminated in the conclusion that unfortunately in cases where the processor manufacturers specify instructions to manage their hardware caches they do not meet the requirements for full isolation in practice.

They further explain in their paper published in September 2017, “Your Processor Leaks Information – and There’s Nothing You Can Do About It”, that until the temporal properties of modern processors’ architectures are specified by processor manufacturers, it will remain impossible for OS vendors to write secure, timing-channel impervious code.

And problems like Meltdown and Spectre will continue to pop up.

It is useful to raise the point that if processor branch prediction history was partitioned on Intel processors, then the indirect branch predictor training variant of the Spectre exploit could not have worked. Furthermore, Qian Ge is currently working on hardening the seL4 kernel against timing channels. Her PhD work enables the seL4 microkernel to create partitions between groups of processes which close timing channels – and partitioning of this kind would have hardened systems against cross-process Spectre attacks, though not against Meltdown.

seL4 in the wake of the Meltdown and Spectre storm.

seL4 and Meltdown:

seL4 is affected by the Meltdown bug, and as a result we are internally working on pushing out a patch similar to Linux’s KAISER patch which we’re internally calling the “SKIM window” (SKIM: “Static Kernel Image with Microstate”). We expect to be pushing this out as soon as possible. The patch has surprisingly little effect on the kernel’s formal modeling, in principle.

Our lead kernel engineer, Adrian Danis has disclosed that there is a non-negligible performance penalty imposed by the SKIM window patch. We haven’t got final figures as yet and we’ll be measuring the performance impact over the coming weeks.

The SKIM patch has been done for 64-bit x86 processors (with Intel64 specifically in mind). Other x86_64 vendors such as AMD64 and VIA are not officially supported by seL4, though seL4 certainly runs reliably on AMD64 processors. That being said, it is still unclear whether or not AMD64 processors are vulnerable to Meltdown anyway and it may turn out that they do not need the SKIM patch.

For 32-bit x86, there will eventually be a patch, but given that x86-32 is not as urgent as x86_64, we haven’t focused delivering x86-32 SKIM as promptly, and are moving on to mitigating Spectre right now. When 32-bit x86 SKIM is delivered, it will also be “officially” for Intel, and not for other x86 vendors such as AMD, Centaur, VIA and so on since these vendors are not officially supported by seL4’s x86-32 implementation.

seL4 and Spectre:

The seL4 kernel itself is currently vulnerable to a Spectre attack from a userspace thread; but we believe that it can be easily hardened. The question of whether or not a userspace attacker can attack another userspace thread running on seL4 is an unequivocal yes.

We are currently in the middle of deploying the branch-prediction barriers (x86) and BTB flushing mechanisms (ARMv6, ARMv7) that are necessary to prevent attackers from attacking other userspace threads on seL4.

We are currently in the middle of an investigation of how difficult it would be to harden the seL4 kernel itself against a Spectre based attack from userspace. More details will be released in the upcoming months about our reasoning. We are very eager at the chance that the seL4 kernel itself may require only trivial changes to be hardened against Spectre.

As for protecting userspace processes from each other:

seL4 will allow userspace processes to specify whether or not they want to take the performance hit that is incurred by effectively flushing the branch predictor when switching between processes on x86 (which is currently the only way to mitigate this variant of Spectre). The performance penalty of flushing the branch predictor on x86 processors is, unfortunately, very high.

Formal methods and Spectre:

As an entirely coincidental benefit of us having written the seL4 microkernel in a subset of the C language which our formal methods engineers are able to reason about, we don’t use function pointers in the seL4 kernel source.

This, along with certain microkernel design principles which we have followed in implementing seL4, has the side effect of possibly making the seL4 kernel hardened against the indirect branch prediction variant of Spectre. Once again, we’re currently investigating this possibility and we will announce our results when we’re done investigating.

Impact of the mitigation Meltdown and Spectre fixes on seL4 based systems.

Impact on the seL4 microkernel itself:

As with every major OS, seL4 has also seen a performance hit as a result of the implementation of the SKIM (Static Kernel Image with Microstate) patch.

Initial tentative estimations of the impact are that they will likely be higher than those experienced by a monolithic kernel for the simple reason that microkernels switch address spaces more due to their essential function as an IPC engine, and the SKIM window patch alone increases the number of address space switches that the kernel must do.

For Spectre, we must effectively suppress branch prediction across static branches and disable the dynamic branch predictor from using branch history that it gathered from userspace, in order to shut down the Spectre bug.

The performance effect of needing to manage the branch predictor is still internally being discussed and we cannot give any conclusive statements as yet simply because a lot of our design decisions as a microkernel strongly hint at us being able to make seL4 resilient against Spectre with little or no work.

Impact on seL4-based systems:

Meltdown:

Meltdown is entirely mitigated by the SKIM-window patches to the seL4 kernel, and there is no need for developers of systems that run on top of seL4 to do anything to their software to protect against Meltdown.

To be protected against Meltdown, please just keep an eye on our Mailing list and watch out for our announcement of when we’ve released the SKIM-window patches.

Out-of-bounds (OoB) speculation Spectre:

The OoB speculation variant of the Spectre exploit allows userspace processes to attack one another using gadgets (footholds) in each others’ binary code. The only way to mitigate this is to insert speculation barriers in the source code of your binary to suppress the processor’s branch prediction across locations in the code where such a gadget may exist.

Unfortunately the seL4 kernel cannot directly intervene here to help developers of seL4-based systems.

Indirect branch predictor Spectre:

The indirect branch predictor variant of the Spectre exploit allows userspace processors to attack one another by training the dynamic branch predictor to speculatively execute a particular gadget in the victim process’s binary code.

The seL4 microkernel will protect userspace processes from this if they opt into it using the system call that we will provide. This system call marks a process as being interested in having the branch predictor flushed when it is being switched to, and is likely to be provided only on x86 when it is released (mostly because it may not be necessary to provide it on ARM for technical reasons – the kernel will already need to flush the branch predictor whenever it is invoked, so there might be no need to have processes explicitly ask for it to be flushed).

In combination with the previously mentioned extension, developers can make use of the so called “Retpoline” defense mechanism which has been deployed by Google and has a lower performance penalty than flushing the branch predictor. The “Retpoline” trick works by suppressing the branch predictor only across indirect jumps, so it is less extreme than a full flush of all branch prediction state.

We’ll keep in touch.

As we wade through these uncertain waters now, all of us from processor architects to software developers and end users, we need clear communication to understand the problems involved and we need effective dialogue to come up with solutions.

We’ll keep you posted as things proceed.

Keep in contact with us by polling this blog. Don’t like blogs? Try our Mailing List or contact us some other way. G’day!