sean butler

size, speed and order tradeoffs

cpp   cache   repo

tldr

intro to cache friendly programming supported by a cpp code repo with timing graphs

modern cpu

modern computers are driven by CPUs which already perform optimisations for you in hardware. writing faster (low latency) code requires that you know what speedups are already happening and do what you can to facilitate them, or at least don't scupper them.

ram is cheap

dynamic ram is cheap so machines can have lots of it, but it's not very fast. faster types of ram exist, situated closer to (or within) the circuits of the CPU itself. even though it works faster, commodity machines can't have much of it, because this kind of ram is much more expensive to manufacture.

caching

when the cpu reads a memory address it first checks the cache; if the address is present it reads from there. otherwise it tries the next level cache, and so on, until it falls back to main memory, which is much slower. once the data is found it is copied into the faster cache(s) so the next access is quicker.

this process happens for all reads and writes, potentially speeding up execution. the speedup comes about because programmes and data are usually arranged so that related items are near each other in memory.

consider the members of a single instance of a class: they are adjacent in memory. likewise the elements of an array: passing over it in sequence means the neighbouring elements are already cached by the time you reach them.

conversely, consider a dynamic data structure like a linked list on the heap. if the structure is very dynamic and has been edited, then each element will likely sit at an unrelated memory address. in this case loading an item won't load its neighbours, so the cache has little effect and the algorithm runs slower.
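to make this concrete, here is a minimal sketch (not code from the repo) that sums the same values held contiguously in a std::vector and scattered across heap nodes in a std::list. on most machines the vector version is noticeably faster; the effect is strongest when the list nodes really are scattered, for example after lots of insertions and deletions.

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 10'000'000;

    std::vector<int> vec(n, 1);   // contiguous: neighbouring elements share cache lines
    std::list<int> lst(n, 1);     // node based: every element is a separate heap allocation

    auto time_sum = [](const char* label, const auto& c) {
        auto start = std::chrono::steady_clock::now();
        long long total = std::accumulate(c.begin(), c.end(), 0LL);
        auto stop = std::chrono::steady_clock::now();
        std::cout << label << " sum " << total << " took "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    };

    time_sum("vector", vec);   // sequential reads, cache friendly
    time_sum("list  ", lst);   // pointer chasing, likely many cache misses
}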

cache architecture and tools

in the following image the CPUs are grey with the various caches sitting between them and the ram at the top.

cache architecture image

each processor has its own L1 and L2 caches and they all share an L3 cache. you can generate a similar image for your machine by installing hwloc and using lstopo.

sudo apt install hwloc
lstopo -p

using getconf on linux we can also learn the various cache parameters of the CPU.

getconf -a | grep CACHE

the machine i'm working on now outputs the following

LEVEL1_ICACHE_SIZE                 32768
LEVEL1_ICACHE_ASSOC                
LEVEL1_ICACHE_LINESIZE             64
LEVEL1_DCACHE_SIZE                 32768
LEVEL1_DCACHE_ASSOC                8
LEVEL1_DCACHE_LINESIZE             64
LEVEL2_CACHE_SIZE                  262144
LEVEL2_CACHE_ASSOC                 4
LEVEL2_CACHE_LINESIZE              64
LEVEL3_CACHE_SIZE                  16777216
LEVEL3_CACHE_ASSOC                 16
LEVEL3_CACHE_LINESIZE              64
LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC                 
LEVEL4_CACHE_LINESIZE          

decipher

let's decipher some of this

there are three levels of cache, which work as a cascade: reads and writes pass through each level in turn on their way out to ram.

in all cases the caches have a LINESIZE (the size of the read/write unit) of 64 bytes, which means that when any part of memory is loaded, the whole 64 byte line containing it is pulled into the cache.
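if you want the same numbers from inside a program rather than from getconf, glibc exposes them through sysconf. a small sketch (the _SC_LEVEL* names are a glibc extension, so this is linux specific):

#include <unistd.h>
#include <iostream>

int main() {
    // glibc extension: the same values getconf reports, queried at runtime
    std::cout << "L1 data cache size " << sysconf(_SC_LEVEL1_DCACHE_SIZE)     << " bytes\n";
    std::cout << "L1 line size       " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";
    std::cout << "L2 cache size      " << sysconf(_SC_LEVEL2_CACHE_SIZE)      << " bytes\n";
    std::cout << "L3 cache size      " << sysconf(_SC_LEVEL3_CACHE_SIZE)      << " bytes\n";
}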

do your own tests

on github i have a repo with two experiments. the source is available so you can run them on your own machine and see how they behave for you.

same work over larger data area

in this experiment we have a large array on the heap which we access in order, then we test again several times with successively larger arrays. each time, however, we also increase the stride (or skip) of the traversal, so when the array is 64x larger we skip 64 elements at a time and overall the same amount of work is being done.

chart showing two horizontal lines

take a look at the time taken to carry out the traversal. the larger arrays take longer and longer even though the same work is being done. this is because far fewer cache hits occur on the larger arrays, so the data has to be read from somewhere slower.
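the repo has the real benchmark; the sketch below (my own names, not the repo's code) shows the shape of the experiment: every run performs the same number of reads, but as the stride grows the reads spread over more and more cache lines.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const std::size_t touches = 1 << 20;   // same amount of work every run

    for (std::size_t stride = 1; stride <= 64; stride *= 2) {
        std::vector<std::uint8_t> data(touches * stride, 1);   // array grows with the stride

        auto start = std::chrono::steady_clock::now();
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < data.size(); i += stride)
            sum += data[i];                 // always 'touches' reads, spread further apart
        auto stop = std::chrono::steady_clock::now();

        std::cout << "stride " << stride << " sum " << sum << " took "
                  << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
                  << " us\n";
    }
}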

accessing arrays in two different ways

in this experiment we have a moderately sized array and access it in two different ways: first in row first order, then in column first order.

chart showing two horizontal-ish lines

as you can see there is a surprising time difference between the two. again the order we read an array in has an impact: the same array read in a different order lets us do the same work more quickly.
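again the repo contains the actual test; as a rough sketch (my own names), the two traversals below differ only in loop order, yet the row first one walks memory sequentially while the column first one jumps a whole row's width between consecutive reads.

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const std::size_t rows = 4096, cols = 4096;
    std::vector<int> grid(rows * cols, 1);   // row-major layout: grid[r * cols + c]

    auto time_it = [](const char* label, auto&& body) {
        auto start = std::chrono::steady_clock::now();
        long long sum = body();
        auto stop = std::chrono::steady_clock::now();
        std::cout << label << " sum " << sum << " took "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    };

    // row first: consecutive reads are adjacent in memory
    time_it("row first   ", [&] {
        long long s = 0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                s += grid[r * cols + c];
        return s;
    });

    // column first: consecutive reads are cols * sizeof(int) bytes apart
    time_it("column first", [&] {
        long long s = 0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                s += grid[r * cols + c];
        return s;
    });
}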

takeaways

when the cpu finds an address in a cache we call it a cache hit; when it doesn't, a cache miss. if a programmer wishes to make their code run faster on modern processors, one way to achieve this is to write it in a way that produces many more cache hits.

to write code that is more friendly to cached cpus we have a few options.