tldr
an intro to cache-friendly programming, supported by a C++ code repo with timing graphs
modern cpu
modern computers are driven by CPUs which already perform optimisations for you in hardware. writing faster (lower latency) code requires knowing what speedups are already happening, so you can facilitate them, or at least not scupper them.
ram is cheap
dynamic ram is cheap, so machines can have lots of it, but it's not very fast. faster types of ram exist and sit closer to (or within) the circuits of the CPU itself. even though this ram works faster, commodity machines can't have much of it, because it is much more expensive to manufacture.
caching
when the cpu reads a memory address it first checks the cache. if the address is present in the cache it reads from there; otherwise it tries the next level of cache, and so on, until it finally reads from main memory, which is much slower. once the address is found, the data is copied into the higher level cache(s) for faster access next time.
this process continues for all reads and writes, potentially speeding up execution. the speedup happens because programs and data are usually arranged so that related items sit near each other in memory.
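to make the cascade concrete, here is a toy model of the lookup in C++. it is purely illustrative (the real mechanism lives in hardware) and the latencies are made-up numbers.

#include <cstdint>
#include <unordered_set>
#include <vector>

// toy model of the lookup cascade, not the real hardware. each level
// remembers which 64-byte lines it currently holds.
struct CacheLevel {
    std::unordered_set<std::uintptr_t> lines;
    int latency;  // made-up cost for a hit at this level
};

int read_cost(std::uintptr_t addr, std::vector<CacheLevel>& levels, int ram_latency) {
    const std::uintptr_t line = addr / 64;   // which line the address falls in
    for (const auto& level : levels)
        if (level.lines.count(line) != 0)    // cache hit: served from this level
            return level.latency;
    for (auto& level : levels)
        level.lines.insert(line);            // cache miss: fill the line into each level
    return ram_latency;                      // and pay the full trip to ram
}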
consider the members of a single instance of a class: they are adjacent in memory. likewise the elements of an array: passing over it in sequence means the neighboring elements are already cached.
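for example (illustrative only):

#include <cstddef>

// the three members of one Point sit next to each other in memory, and the
// Points in the array sit next to each other, so a sequential pass touches
// memory that is mostly already in the cache.
struct Point { double x, y, z; };   // 24 contiguous bytes

double sum_x(const Point* points, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        total += points[i].x;       // neighboring Points share cache lines
    return total;
}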
conversely, consider a dynamic data structure like a linked list on the heap. if the structure is very dynamic and has been edited a lot, each element will likely sit at an unrelated memory address. in this case loading an item won't load its neighbors, so the cache has little effect and the algorithm runs slower.
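a hand-rolled list node makes the problem easy to see (again, just a sketch):

// each Node is a separate heap allocation, so next can point anywhere in
// memory. loading one node does not pull its logical neighbor into the
// cache the way adjacent array elements are pulled in together.
struct Node {
    long  value;
    Node* next;   // likely far away once the list has been edited a lot
};

long sum(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;   // each hop risks a cache miss
    return total;
}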
cache architecture and tools
in the following image the CPUs are grey, with the various caches sitting between them and the ram at the top.
each processor has its own L1 and L2 caches, and they all share an L3 cache. you can generate a similar image for your machine by installing hwloc and using lstopo.
sudo apt install hwloc
lstopo -p
using getconf on linux we can also learn the various cache parameters of the CPU architecture.
getconf -a | grep CACHE
the machine i'm working on now outputs the following
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 4
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 16777216
LEVEL3_CACHE_ASSOC 16
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC
LEVEL4_CACHE_LINESIZE
decipher
let's decipher some of this
there are three levels of cache, which work as a cascade: reads and writes pass through each level on the way out to ram.
- ICACHE is the instruction cache
- DCACHE is the data cache
- the L1 Cache is 32 KB (each, for instructions and data)
- the L2 Cache is 256 KB
- the L3 Cache is 16 MB
in all cases the caches have a LINESIZE (the size of the read/write unit) of 64 bytes, which means when any part of memory is loaded, the whole 64-byte line containing it is pulled into the cache.
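to put 64 bytes in perspective (a quick sketch; the struct and its fields are just an example):

#include <cstdint>

// one 64-byte line holds 16 four-byte ints or 8 doubles, so loading any one
// element also drags in its nearest neighbors.
static_assert(64 / sizeof(std::int32_t) == 16, "16 ints per line");
static_assert(64 / sizeof(double) == 8, "8 doubles per line");

// alignas(64) starts the struct on a line boundary, so fields that are used
// together arrive in the cache together with a single line fill.
struct alignas(64) HotData {
    std::int64_t counter;
    std::int64_t limit;
    double       scale;
};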
do your own tests
on github i have a repo with two experiments. the source is available so you can run them on your own machine and see how they behave for you.
same work over larger data area
in this experiment we have a large array on the heap and we access it in order. we then test again several times with successively larger arrays, but each time we also increase the stride (skip) of the traversal, so when the array is 64x larger we skip 64 elements at a time; overall the same amount of work is being done.
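here's a minimal sketch of the idea (not the repo's exact code; the sizes are arbitrary):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t touches = 1 << 18;              // constant amount of work per round
    for (std::size_t stride = 1; stride <= 64; stride *= 2) {
        std::vector<int> data(touches * stride, 1);   // 64x stride => 64x larger array
        const auto start = std::chrono::steady_clock::now();
        long sum = 0;
        for (std::size_t i = 0; i < data.size(); i += stride)
            sum += data[i];                           // same number of adds each round
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::printf("stride %2zu: %lld us (sum %ld)\n", stride, (long long)us, sum);
    }
}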
take a look at the time taken to carry out the traversal. the larger arrays take longer and longer even though the same amount of work is being done. this is because far fewer cache hits occur on the larger arrays, so more of the data has to be read from slower levels of the memory hierarchy.
accessing arrays in two different ways
in this experiment we have a moderately sized array and we access it in two different ways: first in row-major order, then in column-major order.
as you can see, there is a surprising time difference between the two. again, the order in which we read from an array has an impact, and you can clearly see that the same array read in a different order lets us do the same work quicker.
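again, a minimal sketch of the comparison rather than the repo's exact code:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<int> matrix(n * n, 1);   // one contiguous block, stored row by row

    auto time_sum = [&](bool row_first) {
        const auto start = std::chrono::steady_clock::now();
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                sum += row_first ? matrix[i * n + j]    // sequential walk: cache friendly
                                 : matrix[j * n + i];   // jumps a whole row each step
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::printf("%s-first: %lld us (sum %ld)\n", row_first ? "row" : "column",
                    (long long)us, sum);
    };

    time_sum(true);
    time_sum(false);
}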
takeaways
when a read is served from the cache we call it a cache hit; when it isn't, a cache miss. if a programmer wishes to make their code run faster on modern processors, one way to achieve this is to write it so that it produces many more cache hits.
to write code that is friendlier to cpu caches, we have a few options.
- wherever possible use the stack rather than the heap.
- when a variable is closely related to another, place them within 64 bytes of each other so they can share a cache line (see the sketch after this list).
- construct your containers so that, when traversed in order, neighboring elements are close in memory (preferably adjacent), which increases the chance of cache hits. arrays = best, vectors = good, linked lists = bad.
- if you have an algorithm which needs to run fast, try to make its code fit within the instruction cache (32 KB here).
- if you have large amounts of data and code, consider an approach that keeps the working set within one of the larger caches for the duration of its use.
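as a small sketch of the "keep related data close" advice (the Particle struct is just an example, not from the repo):

#include <vector>

// the fields touched together on the hot path live in one small struct
// (24 bytes), so neighboring particles share cache lines, and the vector
// keeps them contiguous so an in-order pass is about as cache friendly
// as it gets.
struct Particle {
    float x, y, z;      // position
    float vx, vy, vz;   // velocity
};

void step(std::vector<Particle>& particles, float dt) {
    for (auto& p : particles) {   // contiguous storage, traversed in order
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}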