Notes on CPU Cache Side Channel Attacks

Updated Mar 13, 2020 · Published Mar 7, 2020

Background

Cache Properties

CPU caches speed up loading of data from memory. Access time and throughput for addresses held in cache are orders of magnitude better than going to main memory, thanks to characteristics like physical proximity to the core and faster transfer rates. However, cache capacity is many orders of magnitude smaller: cache memory is less dense, so it needs more die area, and it draws more power, so it needs more heat dissipation.

To use the limited capacity efficiently, caches employ a Least Recently Used (LRU) eviction policy, since addresses that have been accessed recently are likely to be accessed again. Modern caches are usually divided into pools of cached addresses, known as cache sets, each of which stores several entries, known as cache lines. A cache line holds a fixed number of contiguous addresses to take advantage of the fact that memory access is often sequential, aka locality of reference. And by storing multiple entries per cache set, an LRU policy becomes more effective, since eviction decisions can look further back in the access history.

Cache Access

Overview

Cache timing attacks work by setting a cache set (which contains cache lines) to a known, measurable state, such as not containing the target address, and then measuring the access time of the target address’s cache line. This exploits three facts: cache access is significantly faster than non-cache access, that difference can be reliably measured, and cache state is not isolated between processes. If the access time is above a given, system-dependent threshold, it is inferred that the read came from memory rather than the cache, and therefore that the program of interest had not accessed the address.

The CPU cache works at the granularity of cache lines, which usually span multiple addresses. A target address can therefore end up in the cache when a non-target address that happens to share its cache line is accessed, which can cause false positives. However, the same property can also be used deliberately, to monitor an address that is not itself shared with another process.

Memory sharing is common for libraries (content-aware sharing) and VMs (content-based page sharing). Some attacks need shared memory in order to measure whether the target address is cached, but not all attacks require it.

Measuring the access time of a target address can be done (1) directly, by explicitly loading the target cache line, or (2) indirectly, for example by requesting that another process load the target cache line in a measurable way, such as via a function call.

Specific Attacks

Spectre

Takes advantage of speculative execution to control the cache state.

Flush+Reload

Steps:

  1. Flush: clear the target address from cache using the clflush processor operation.
  2. Allow the program of interest to run.
  3. Reload: measure the access time of the target address. If the access time is above a given, system-dependent threshold, it is inferred that the read came from memory rather than the cache, and therefore that the program of interest had not accessed the address.

One advantage of clflush is that it works across cores. The operation clears the address’s cache line from the last level cache, which is shared by all cores, and from all higher (closer to core) caches, allowing a process running on one core to infer the memory accesses of a process on any core that shares the cache line of the address of interest.

Prime+Probe

Steps:

  1. Prime: evict the target address from the cache by loading addresses known not to be the target address; for example, a process can access a large set of its own addresses that map to the same cache set.
  2. Allow the program of interest to run.
  3. Probe: measure the access time of the primed addresses; slow accesses indicate that the program of interest evicted some of them by touching the same cache set. The probe phase can also serve as the prime phase for the next iteration.

This attack can work across cores by targeting the last level cache (LLC), provided a subset of the LLC’s cache sets can be targeted. Without that ability, all addresses would need to be primed and probed, and given the size of the LLC and its slower access time, resolving specific address accesses of a target process becomes too difficult: the measurements have low resolution, so multiple cache loads would be lumped into a single probe phase.

Some Thoughts

  • If one can suspend the target process before and after the measurement phase for a known duration, without introducing too much noise, then attacks whose long measurement phase yields low granularity could be made more granular. Perhaps this could be done by evicting the instruction cache of the program of interest?