Well, you share ideas with like-minded individuals, you work on what you like and you try to make the most out of it, and maybe get famous along the way (a small bonus). We push current technology to its limits with more efficient software. It’s basically answering a question: what is “peak performance” for a given blend of hardware and software?

At least that’s how I feel about it.


  • 1
    You considering a job in high-performance computing by any chance? Because that peak performance bit, if taken literally, is exactly what HPC is about.
    And it's a ton of fun
  • 1
    @RememberMe it's a *FASCINATING* topic to say the least, though I don't know if that's my general direction.

    I will make the big choice later in my life, I still have a few years and I would like to savor it :D

    I like being a teen.
  • 0
    @OmerFlame fair enough, have fun :p

    If you ever feel a compulsion to suffer for performance, as my advisor likes to say, do come to uni for computer engineering. These bois get pretty extreme when talking about performance.

    Just to put it into perspective with other kinds of dev, this is a field in which DRAM access is considered slow as molasses and to be strictly avoided unless necessary, and when needed only done using a DMA to prefetch large chunks in a carefully controlled access pattern so that it maximises row buffer hits and bank parallelism. SSDs give these guys nightmares and network latency is the first sign of madness.

    Fun stuff.
  • 1
    @RememberMe WHA

    Well, if you’re comparing it to on-die CPU cache, then you have a point.
  • 1
    @OmerFlame yup, caches are where it's at. RAM can take up to hundreds of CPU cycles to access; caches take somewhere between one and a few tens of cycles. You really, really twist your code around to make sure you're using caches properly, because that's like 50% or more of your CPU die area and the speedup is enormous. On fancier devices like FPGAs, that's replaced by on-chip block RAMs that you need to manage manually (think manual caches): you have full control, but that also means you need to do it properly. On GPUs, you need to make sure all your computation stays in local memory that's accessible to threads directly, and only access VRAM in long, synchronised loads that bring in a ton of data at once. On special accelerators it can vary depending on the architecture of the device.

    Then there's exploiting superscalar out-of-order execution in modern CPUs. They have many functional units with deep-ish pipelines and WAY more registers than you think they do, or than the ISA (x86, ARMv8, RISC-V, etc.) says they do. Keeping that execution engine fed at more than 2 instructions per clock (the average parallelism in everyday code) is a whole challenge in itself, especially when your arithmetic intensity (basically the number of arithmetic ops done per memory op) is low, because memory is high latency, high throughput.

    If you're interested, check out something simple like tiled matrix multiplication (and, importantly, *why* it's much faster).
  • 1
    @RememberMe so this is a whole rabbit hole.
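
For anyone who wants to follow up on the tiled matrix multiplication suggestion, here is a minimal sketch of the idea in Python. It is not from the thread: the function names (`matmul_naive`, `matmul_tiled`, `modeled_intensity`) and the `tile` parameter are made up for illustration. The traffic model at the end is idealized and assumes the tile size divides the matrix size and that one tile each of A, B and C fits in cache together.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # b[p][j] walks down a column of B, so for large matrices
                # each access tends to touch a different cache line.
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_tiled(a, b, tile=4):
    """Same arithmetic, but done one tile-by-tile block at a time.

    `tile` is a hypothetical tuning knob; in real code it is chosen so the
    blocks of A, B and C being worked on all fit in cache at once.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of C
        for j0 in range(0, m, tile):          # block column of C
            for p0 in range(0, k, tile):      # block along the shared dimension
                # Everything below reuses the same three small blocks.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


def modeled_intensity(n, tile):
    """Idealized arithmetic ops per word loaded for an n x n multiply.

    Assumed model: every block step loads one tile of A and one tile of B
    (2 * tile**2 words) and there are (n / tile)**3 block steps.
    """
    flops = 2 * n**3                  # one multiply and one add per inner step
    words_loaded = 2 * n**3 / tile    # (n / tile)**3 steps * 2 * tile**2 words
    return flops / words_loaded
```

In pure Python the interpreter overhead hides any cache effect, so don't expect a speedup from this version; the loop structure is the point. Under the model above, the number of arithmetic ops per word loaded grows in proportion to the tile size, which is the "why" behind the speedup you see when the same loops are written in C or similar.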
