2018-01-04

Speculation


Speculative execution has been Intel's strategy for keeping the x86 architecture alive since the P6/Pentium Pro part shipped in '95.

I remember coding explicitly for the P6 in a project in 1997; HP Labs was working with HP's IC Division to build their first CMOS-camera IC, which was an interesting problem. Suddenly your IC design needed to worry about light: aligning the optical colour filter with the sensors, making sure it all worked.

(photo: Eyeris)

I ended up writing the code to capture the raw data at full frame rate, streaming to HDD, with an option to alternatively render it with or without the colour filtering (algorithms from another HPL team). Which means I get to nod knowingly when people complain about "raw" data. Yes, it's different for every device, precisely because it's raw.

The data rates of the VGA-resolution sensor, via the PCI boards used to pull this off, meant that both CPUs of a multiprocessor P6 box were needed. It was the first time I'd ever had a dual-socket system; both sockets were filled with the 150 MHz parts, and with careful work we could get away with the "full link rate" data capture which was a core part of the qualification process. It's not enough to self-test the chips any more, see: you need to look at the pictures.

Without too many crises, everything came together, which is why I have a framed but slightly skewed IC part to hand. And it's why I have memories of writing multithreaded Windows C++ code with some of the core logic in x86 assembler. I also have memories of ripping out that ASM code when it turned out to be broken, redoing it as C pointer code and having it be just as fast. That's because C code compiled to x86 by a good compiler, executed on a great CPU, is at least as performant as hand-written x86 code by someone who isn't any good at assembler, and can be made correct more easily by the selfsame developer.

150 MHz may be a number people laugh at today, but the CPU:RAM clock ratios weren't as bad as they are now: cache misses were less expensive in terms of pipeline stalls, and those parts were fast. Why? Speculative and out-of-order execution, amongst other things:
  1. The P6 could provisionally guess which way a branch was going to go, speculatively executing that path until it became clear whether or not the guess was correct, and then committing or aborting that speculative code path.
  2. It used a branch predictor to make that guess on which way a branch would go, based on the history of previous attempts and a default option. (FWIW, this is why I tend to place the most likely outcome first in my if() statements; tradition and superstition. There's a small sketch of this just after the list.)
  3. It could execute operations out of order. That is, its predecessor, the P5, was the last mainstream Intel desktop/server part to execute x86 code in the order the compiler generated it, or the human wrote it.
  4. Register renaming meant that even though the parts had a limited set of registers, those OOO operations could reuse the same EAX, EBX, ECX registers without problems.
  5. It had caching to deal with the speed mismatch between that 150 MHz CPU & RAM.
  6. It supported dual-CPU desktops, and I believe quad-CPU servers too. They'd be called "dual core" and "quad core" these days and looked down on.
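
To illustrate item 2, here's a minimal Java sketch of that "likely case first" habit; the names and numbers are my own invention, and whether the ordering actually helps depends entirely on the predictor and the JIT. Tradition and superstition, as I said.

    // BranchOrder.java: a hypothetical sketch of the "likely outcome first" habit.
    // Whether it helps at all is CPU- and JIT-dependent.
    public class BranchOrder {
        static long countValid(int[] samples) {
            long valid = 0;
            for (int s : samples) {
                if (s >= 0) {          // common case first: the predictor settles on "taken"
                    valid++;
                } else {               // rare error path second
                    System.err.println("bad sample: " + s);
                }
            }
            return valid;
        }

        public static void main(String[] args) {
            int[] data = {3, 7, -1, 42, 0};
            System.out.println(countValid(data));   // prints 4
        }
    }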

Being the first multicore system I'd ever used, it was a learning experience. First was learning that too much Windows NT4 code was still not stable in such a world. NTFS crashes with all volumes corrupted? Check. GDI rendering triggering kernel crashes? Check. And on a 4-core system I got hold of, everything crashed more often. Lesson: if you want a thread-safe OS, give your kernel developers as many cores as you can.

OOO forced me to learn about the x86 memory model itself: barrier opcodes, when things could get reordered and when they wouldn't. Summary: don't try to be clever about synchronization; your assumptions are invalid.
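
The same lesson holds in Java-land, where I spend more of my time these days. A hedged sketch, with names of my own invention: without the volatile below, the JIT and CPU are free to reorder or delay the writes in publish(), so a reader could see the flag set while the payload is still zero; the volatile field is what supplies the barrier.

    // Reordering.java: a toy illustration of why "don't be clever" matters.
    public class Reordering {
        static int payload = 0;
        static volatile boolean ready = false;   // volatile write/read acts as the barrier

        static void publish() {
            payload = 42;      // ordinary write
            ready = true;      // volatile write: payload must be visible before this
        }

        static void consume() {
            if (ready) {                      // volatile read pairs with the volatile write
                System.out.println(payload);  // if ready was seen, this prints 42, never 0
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread writer = new Thread(Reordering::publish);
            Thread reader = new Thread(Reordering::consume);
            writer.start();
            reader.start();
            writer.join();
            reader.join();
        }
    }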

Speculation is always an unsatisfactory solution, though. Every mis-speculation is lost cycles, and on a phone or laptop that's wasted energy as much as time. And failed reads could fill up the cache with things you didn't want. I've tried to remember if I ever used speculation to preload stuff if present, but I doubt it. The CMOV instruction was a non-branching conditional assignment, which was better, even if you had to hand-code it. The PIII/SSE added the PREFETCH opcode so you could issue a non-faulting hinted prefetch to stick into your non-branching code, but that was a niche opcode for people writing games/media codecs &c. And as Linus points out, what was clever for one CPU model turns out to be a stupid idea a generation later. (Arguably that applies to Itanium/IA-64, though as it didn't speculate, it doesn't suffer from the Spectre & Meltdown attacks.)
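
For what it's worth, here's a small, hypothetical Java sketch of the branchless-assignment idea. HotSpot may lower the ternary form to a CMOV on x86 rather than a conditional jump, but that's an implementation choice, not a promise (see JDK-8039104, which comes up again below).

    // Branchless.java: branchy vs. potentially branch-free forms of the same clamp.
    public class Branchless {
        // Branchy version: a conditional jump the predictor has to guess.
        static int clampBranchy(int x, int limit) {
            if (x > limit) {
                return limit;
            }
            return x;
        }

        // Branch-free in the source; the JIT *may* lower it to CMOV on x86.
        static int clampTernary(int x, int limit) {
            return (x > limit) ? limit : x;
        }

        public static void main(String[] args) {
            System.out.println(clampBranchy(150, 100));  // 100
            System.out.println(clampTernary(50, 100));   // 50
        }
    }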

Speculation, then: a wonderful use of transistors to compensate for how we developers write so many if() statements in our code. Wonderful because it kept the x86 line alive, and so helped Intel deliver shareholder value and keep the RISC CPU out of the desktop, workstation and server businesses. Terrible because "transistors" is another word for "CPU die area", with its yield equations and opportunity cost, and also for "wasted energy on failed speculations". If we wrote code with fewer branches in it, and that got compiled down to CMOV opcodes, life would be better. But we have so many layers of indirection these days; so many indirect references to resolve before those memory accesses. Things are probably getting worse now, not better.

This week's speculation side-channel attacks are fascinating, then. These are very much architectural issues about speculation and branch prediction in general, rather than implementation details. Any CPU manufacturer whose parts do speculative execution has to be worried here, even if there's no evidence that their shipping parts are vulnerable to the current set of attacks. The whole point of speculation is to speed up operation based on the state of data held in registers or memory, so the time-to-execute is always going to be a side channel providing information about the data used to make a branch.
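
A toy sketch of that last point, entirely my own and nothing like the published attacks: the running time of the loop below depends on the bits of the "secret" value, so simply measuring elapsed time leaks information about data the caller never printed.

    // TimingLeak.java: data-dependent execution time as a (very crude) side channel.
    public class TimingLeak {
        static long slowIfSet(long secret) {
            long acc = 0;
            for (int bit = 0; bit < 64; bit++) {
                if (((secret >> bit) & 1) == 1) {   // data-dependent branch
                    for (int i = 0; i < 10_000; i++) {
                        acc += i;                   // extra work only for set bits
                    }
                }
            }
            return acc;
        }

        public static void main(String[] args) {
            for (long secret : new long[]{0L, 0xFFL, -1L}) {
                long t0 = System.nanoTime();
                long sink = slowIfSet(secret);      // keep the result so it isn't optimised away
                long elapsed = System.nanoTime() - t0;
                System.out.println("popcount=" + Long.bitCount(secret)
                        + " elapsed ns=" + elapsed + " (sink=" + sink + ")");
            }
        }
    }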


The fact that you can get at kernel memory, even from code running under a hypervisor, means, well, a lot. It means that VMs running in cloud infrastructure could get at the data of the host OS and/or those of other VMs running on the same host (those S3 SSE-C keys you passed up to your VM? 0wned, along with your current set of IAM role credentials). It potentially means that someone else's code could be playing games with branch prediction to determine what codepaths your code is taking. Which, in public cloud infrastructure, is pretty serious, as the only way to stop people running their code alongside yours is currently to pay for the top-of-the-line VMs and hope they get a dedicated part. I'm not even sure that dedicated cores in a multicore CPU are sufficient isolation, not for anything related to cache-side-channel attacks (they should be good for branch prediction, I think, if the attacker can't manipulate the branch predictor of the other cores).

I can imagine the emails between cloud providers and CPU vendors being fairly strained, with the OEM/ODM teams on the CC: list. Even if the patches being rolled out mitigate things, if the slowdown on switching to kernel space is as expensive as hinted, then that slows down applications, which means that the cost of running the same job in-cloud just got more expensive. Big cloud customers will be talking to their infrastructure suppliers about this, and then negotiating discounts for the extra CPU hours, which is a discount the cloud providers will expect to recover when they next buy servers. I feel as sorry for the cloud CPU account teams as I do for the x86 architecture group.

Meanwhile, there's an interesting set of interview questions you could ask developers on this topic.
  1. What does the generated assembly for ival++ on a Java long look like?
  2. What if the long is marked as volatile?
  3. What does the generated x86 assembler for a Java Optional<AtomicLong> opt.map(AtomicLong::addAndGet(1)) look like?
  4. What guarantees do you get about reordering?
  5. How would you write code which attempted to defend against speculation timing attacks?

I don't have the confidence to answer 1-4 myself, but I could at least go into detail about what I believe to be the case for 1-3; for #4 I should do some revision.
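
For what it's worth, here is the kind of Java source I'd expect a candidate to reason about for 1-3; the class and field names are mine, and the actual assembly depends on the JVM version, platform and JIT tier (inspect with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and an hsdis library).

    // InterviewSketch.java: hypothetical source behind questions 1-3.
    import java.util.Optional;
    import java.util.concurrent.atomic.AtomicLong;

    public class InterviewSketch {
        static long ival = 0;            // Q1: ival++ is a read-modify-write, not atomic;
                                         // on a 32-bit JVM even the plain long load/store may tear.
        static volatile long vval = 0;   // Q2: volatile makes the load/store atomic and ordered,
                                         // but vval++ is still not an atomic increment.

        public static void main(String[] args) {
            ival++;
            vval++;

            // Q3: roughly the Optional/AtomicLong expression from the question.
            Optional<AtomicLong> opt = Optional.of(new AtomicLong(0));
            Optional<Long> result = opt.map(l -> l.addAndGet(1));
            result.ifPresent(System.out::println);   // prints 1
        }
    }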

As for #5, defending: I would love to see what others suggest. Conditional CMOV ops could help against branch-prediction attacks by eliminating the branches. However, searching for references to CMOV and the JDK turns up some issues which imply that branch prediction can sometimes be faster, including JDK-8039104, "Don't use Math.min/max intrinsic on x86". It may be that even CMOV gets speculated on, with the CPU prefetching what is moved and keeping the write uncommitted until the state of the condition is known.

I suspect that the next edition of Hennessy and Patterson's "Computer Architecture: A Quantitative Approach" will be covering this topic. I shall look forward to it with even greater anticipation than I have had for all the previous, beloved, editions.

As for all those people out there panicking about this, worrying if their nearly-new laptop is utterly exposed? You are running with Flash enabled on a laptop you use on cafe wifi without a VPN, and with the same password, "k1tten", you use for gmail and paypal. You have other issues.
