RememberMe · 29d
You considering a job in high performance computing by any chance? Because that "peak performance" bit, taken literally, is exactly what HPC is about.
And it's a ton of fun
RememberMe · 28d
@OmerFlame fair enough, have fun :p

If you ever feel a compulsion to, as my advisor likes to say, suffer for performance, do come to uni for computer engineering. These bois get pretty extreme when talking about performance.
Just to put it into perspective with other kinds of dev, this is a field in which DRAM access is considered slow as molasses and to be strictly avoided unless necessary, and when needed only done using a DMA to prefetch large chunks in a carefully controlled access pattern so that it maximises row buffer hits and bank parallelism. SSDs give these guys nightmares and network latency is the first sign of madness.
RememberMe · 28d
@OmerFlame yup, caches are where it's at. RAM can take hundreds of CPU cycles to access; caches take one to a few tens. You really, really twist your code around to make sure you're using caches properly, because they're something like 50% or more of your CPU die area and the speedup is enormous.

On fancier devices like FPGAs, that's replaced by on-chip block RAMs that you manage manually (think manual caches): you have full control, but that also means you need to do it properly. On GPUs, you need to keep your computation in local/shared memory that threads can access directly, and only touch VRAM in long, coalesced loads that bring in a ton of data at once. On special accelerators it varies with the architecture of the device.
Then there's exploiting superscalar, out-of-order execution in modern CPUs - they have many functional units with deep-ish pipelines and WAY more physical registers than you think they do, or than the ISA (x86, armv8, risc-v etc. standards) says they do. Keeping that execution engine fed at more than ~2 instructions per clock (the average parallelism in everyday code) is a whole challenge in itself, especially when your arithmetic intensity (roughly, number of arithmetic ops done per memory op) is low, because memory is high latency, high throughput.
If you're interested, check out something simple like tiled matrix multiplication (importantly, *why* it's much faster)