For the vast majority of code, optimizing for a specific uarch does not help. For example, “ls” will not show any measurable power or time improvement from being compiled for your host CPU's feature set. So many applications are limited by IO or memory bandwidth anyway.
For a very, very small number of applications it does matter, namely web browsers, soft video decoders, etc. In many of those cases, though, the actual binary you get has all the routines compiled for various architectures and one is chosen at runtime. A small price is paid for the indirect branch, and from that point on you get your best performance.
@FrodoSwaggins I don’t have any measurable practical stats to argue against your comments yet, but uarch is important. The ISA doesn’t tell you how many clock cycles an instruction will take, and the instruction pipeline and out-of-order execution techniques differ between uarchs. Sometimes it helps to compile for a particular uarch even for small programs, especially daemons or anything that runs in the background.
Also the on-chip GPU will be completely different.
I don’t think it’s a small price, because it adds up across multiple things and produces a laggy, unoptimized experience.
PlatinumFire: Expecting generic output from Linux distros is like saying "I thought computers were fast but this Pentium 4 is so slow".
There are many flavours of Linux with different foci, lumping them all together doesn't make sense.
@hardfault If the experience is laggy, it's not the compilation options, but superfluous services and shitty SW architecture. Recompiling wouldn't help anyway because you wouldn't even notice some 10% speedup.
E.g. Windows has dynamic thread prios with a momentary prio boost when a thread gets woken up. That's because it's designed as a desktop OS for better GUI snappiness.
@hardfault believe me, I have designed out-of-order pipelines, and compilers for a lot of them too. That’s what I do for a living. I’m acutely aware of the difference. However, for the vast majority of code it turns out not to really matter. And it matters even LESS on modern chips than it did in the NetBurst and early Core days.
So many applications are IO- and memory-bound that even doubling the time it takes for integer-crunching basic blocks to run has little impact. People have studied this in great detail, and that is why to this day virtually all 32-bit Intel Linux distributions ship with i486 ELF user binaries, and kernels are typically i686-PAE. We actually sat down and benchmarked it and found it was not worth it to adopt many of those new features for boilerplate code, because ESPECIALLY with modern out-of-order pipelines, two different implementations of the same algorithm, as long as they have parity in memory-fabric access and approximate operation count, will boil down to essentially the same microcode.
An example of this is AVX MOV instructions on amd64. When first implemented, they made a billion of these with hints as to what type of information was in the XMM/YMM etc. registers: packed doubles, packed singles, scalar single, and so on. Many of them have functional differences, but many don’t. The processor needed the hint because the register-renaming engine would have the values stored in various islands in each pipeline; the instruction would do the right thing if you used the wrong hint, but you would incur a penalty. With modern implementations this is substantially less so, because it never pays to expose the limitations of your implementation in a software-visible way. See delay slots.
@hardfault tl;dr this has been studied a lot by very smart people and we came to the opposite conclusion as you. There are instances where this matters, and I’m not claiming it makes zero difference, but for the majority of use cases having a binary that runs on more systems pays out more than the 15% performance increase on workloads that are not IO or memory bound (which is not very many workloads, but there are some)
One exception to this is ISA features that were designed to accelerate particular tasks, in which case of course that wins out. You’re giving the processor a substantial hint as to what your goal is, and the pipeline can be optimized for that. Example: AES encryption, though even a lot of that has now been pushed downstream to NICs for the networking use case.
@hardfault I can actually take you one farther than that. There have been many instances, in Intel history in particular, where a “more modern” version can suffer worse performance than a stupider implementation. It’s not frequent, but either more massaging has to go into your logic to load vectors, or there are various pipeline penalties, or just decoding the stupid thing can be slower. I will try to find a paper to support this. Also there are NOPs with side effects, because only Intel could be that special.
The idea with an out-of-order pipeline is that, up to some barrier (either loads/stores or something higher level than that), the CPU cares about “what is the essence of this code? What side effects does it produce?” rather than “what does instruction 420 do?” It works kind of like a compiler IR. At no point is there a register EAX that holds the software-architecturally-visible value; rather there are probably 6 different registers, and it’s 6 different ones every time you look, holding different values of EAX at different times. This is effectively your SSA AST implemented in Verilog. So the game is to figure out what overall sequence of instructions balances pressure appropriately so as to achieve the lowest latency.
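The renaming being described might look roughly like this (the physical-register numbers are made up for illustration):

```
; architectural code           ; after renaming (SSA-like)
mov eax, [a]                   p12 = load [a]
add eax, 1                     p37 = p12 + 1
mov [a], eax                   store [a], p37
mov eax, [b]                   p05 = load [b]   ; a fresh "EAX", independent,
add eax, 2                     p41 = p05 + 2    ; can issue before the store
```

Each write to EAX gets a fresh physical register, so the second load/add chain has no false dependency on the first and can execute in parallel, which is the SSA analogy in hardware.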
This question is so hard to answer that formally proving it, while possible, is not really practical. So what a lot of compilers including LLVM do is actually try a few different tweaks on the backend and produce a few different versions of the code, then run them from an mmap'd page and see which one is the fastest, and pick that. No bullshit. Now sure, this will generate good code for your host CPU, but can you really point to what popped out and say why?
@FrodoSwaggins agreed, but what about on-chip peripherals and different ISA extensions?
Some chips might have HW acceleration for a workload while others don't.
Example: Intel's HEVC codec acceleration via the GPU (in the Skylake uarch).
Making a generic image means not utilising these on-chip features, which results in software doing more work.
I don't believe in benchmark tests because they are too vanilla to sketch an image of CPU performance on real workloads.
@hardfault I did call that out I believe. Fixed function pipelines and special accelerators are definitely a different thing. But the software has to be written specially to take advantage of those things. Depending on the granularity of the instruction it can be very hard or impossible for the compiler to lower many IR nodes to a single instruction and the more complex the operation the more true this becomes (though it is not impossible). Usually software that takes advantage of such things will look at the CPUID bits and take an indirect branch hit and use a different implementation at runtime. Which is fine. That’s what a lot of web browsers and stuff that does make big use of such features do.
@FrodoSwaggins that's what my whole rant was about: web browsers and video are the most generic workloads, not to mention rendering a GUI as complex as GNOME. These are basic elements for any user and they should work properly.
The fact that Linux distros only optimise for the ISA (like amd64) and not for the uarch (like Skylake) means they will never be able to match the snappiness of other OSes.
It will always be:
macOS (sticks to one uarch) > Windows (has to support AMD and Intel chips) > Linux distros (have to support everything, so basically the SW has to be shittier)
@hardfault Nope, it's not the CPU optimisation, which is pretty pointless. The Linux problem is that the GUI is independent from the kernel.
As I mentioned, Windows does clever tricks with dynamic prios because it's designed together with its GUI. It's not about CPU throughput, it's thread latency.
You'll never get this in Linux, though multi-core CPUs reduced the fallout from the SW architecture. On the upside, you can run Linux without GUI, it's just that this comes at a price.
@Fast-Nop in my experience I don't find Windows that optimised either; I'm comparing everything to XNU.
I use all the OSes regularly. I agree Windows is better at a few specialised tasks (3D design), but macOS seems superior at nailing the basics and at core/memory utilisation.
Also, since I am using a VM, I wanted exactly that: instead of doing everything in software, maybe an OS optimised for the uarch could use on-chip peripherals to accelerate the virtualisation experience.
@hardfault I measure only around 10% more performance for the same program under Linux compared to Windows, but Windows is still a little bit snappier. On a 10 year old CPU!
If you have issues with a lagging Linux on modern computers, I'd check your config first, especially how aggressively swapping is done even when you have loads of RAM. Try turning down the swappiness parameter; that can increase snappiness.
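For reference, swappiness can be lowered persistently with a sysctl config fragment like this (the kernel default is usually 60; 10 is a common desktop choice, and the file name here is just an example):

```
# /etc/sysctl.d/99-swappiness.conf
# Lower value = kernel less eager to swap; apply with: sysctl --system
vm.swappiness = 10
```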
@Fast-Nop I have tried the swappiness parameter and all the other things, but sometimes things still freeze.
Right now I'm running in a VM, but I will buy a Linux box; I just wanted to know the baseline performance needed for my use case (Vivado tools).
But things freeze sometimes even at low CPU or memory utilisation!! So I am confused whether Linux is worth it.
@hardfault Apple ships one binary. That means it is NOT uarch optimized. At least the last time I unpacked those images that was the case. They might do it for the kernel but definitely not user binaries
“Snappiness” has everything to do with whether software is shit and nothing at all to do with what compiler options were used, aside from -O3.
The Unix workstation was designed around huge ideas, like that X servers could be connected to by remote terminals, etc. However, I find it ludicrously fast despite how much code is supporting that flexibility. Honestly we're probably talking GPU driver here. In my experience even bloatware like KDE responds faster than Windows and macOS. And I use i3, which actually doesn't suck.
Apple does sunset hardware, which means their lowest common denominator of features is 9 years old rather than 16, but that's about the limit of the problem.
Again, this is not an issue. You are not noticing performance differences because of the micro architecture. You’re noticing them because those are completely different operating systems written by different teams with different goals.
Apple puts more effort into making their GUI /appear/ snappy than having any of their software actually work and personally I don’t care much for that. It shouldn’t surprise you therefore that their GUI is in fact snappy.
@FrodoSwaggins I can’t use any random distro.
Vivado tools are only supported on a few distros.
I have tried Ubuntu and CentOS;
now I will try Lubuntu and SUSE.
I also want to try a paid Linux once to see if there are any performance improvements.
I think I might finally go with Lubuntu.
But again, just testing on VMs. May the best OS win!!
python33707: Sounds like you need to stop pretending Linux is just Ubuntu and maybe give Manjaro or MX Linux a try (both are more stripped-down distros with easy setup).
Just an update: at this point my Lubuntu VM is flying!!
Let’s see if Vivado works properly or not. This version is based on Bionic Beaver (18.04), so hoping everything works properly!!