22
cho-uc
5y

Found out the other team's project results about performance for a uni assignment: MATLAB is the fastest, followed by Python, and C++ is the slowest.

They are gonna get roasted during the presentation (by many people in the audience, including me).
This is gonna be fun.
/*devilish grin*/

Comments
  • 9
    That could indeed be the case if it's a naive C++ implementation and MATLAB and Python are using prebuilt libraries or tricks like vectorization (I mean, why else would you use them).
  • 1
    It seems like they got their Delta T wrong and are perhaps mistaking smaller numbers for slower performance, since that list seems a bit upside down.

    Of course, implementation matters, bla bla bla. Something's wrong for sure.
  • 0
    Python is faster than C++!? That's a good one!
  • 2
    Both C and C++ may have pointer aliasing issues if implemented by people who don't know what they're doing. Not only does CPU instruction-level vectorisation go out of the window, tons of useless reloads also happen.
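    A rough sketch of what I mean (hypothetical code, not theirs):

    #include <stddef.h>

    /* Because out, a and b could point into the same array, the compiler must
       assume every store to out[i] may change a[i] or b[i]. It then either
       reloads them from memory on every iteration or has to emit a runtime
       overlap check before it dares to vectorise. */
    void add(float *out, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }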
  • 2
    @Fast-Nop not just that, well-made libraries are very careful about cache usage; their computational block sizes are carefully tuned to avoid as many cache misses as possible. Plus vectorization and stable numerical algorithms, of course.
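    Roughly what "blocking" means, as a toy sketch (not taken from any actual library, block size made up):

    #include <stddef.h>

    #define BLOCK 64  /* a real library tunes this to the cache sizes */

    /* Blocked matrix multiply: C += A * B, all n x n, row-major.
       Working on BLOCK x BLOCK tiles keeps the operands hot in cache
       instead of streaming whole rows and columns through it. */
    void matmul_blocked(double *C, const double *A, const double *B, size_t n) {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK)
                    for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++)
                            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }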
  • 1
    @RememberMe Assuming they're using numpy (which you normally would if you're doing the same thing in MATLAB and Python), their Python code is using a fairly well-optimized C library (and yes, it uses vector instructions when appropriate). There are equivalent libraries for C++ that outperform numpy though (Armadillo, for example).
  • 0
    @irene Yeah, and it almost works, but only in half of the cases, and more importantly, it refers only to forbidden pointer aliasing. The issue here is the allowed, but unintended one.
  • 0
    @ItsNotMyFault or Eigen, or MKL, or a whole bunch of others, yes (personally I love Eigen's template based approach, very clean).
  • 0
    @Fast-Nop doesn't C (C99?) allow you to tell the compiler that there's no aliasing on a pointer? I think it was the restrict keyword or something. So theoretically I don't think it makes much of a difference compared to Fortran if you do include that keyword where needed (though I guess it would be messier to verify). I've never worked with this though, just guessing here.
  • 0
    @RememberMe you're totally right, BUT it's opt-in, not opt-out. Someone who isn't aware of the issue wouldn't know that and would happily deliver a slow solution.

    However, there's a little gotcha: restrict only works if every access to that memory goes through that pointer or through pointers derived from it. You can't, e.g., reach the same memory through a second, unrelated pointer while the restrict pointer is in use.
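    Rough sketch of both the promise and a way to break it (hypothetical code):

    #include <stddef.h>

    /* restrict promises: inside this function, the memory behind dst and src is
       only reached through these pointers (and pointers derived from them), so
       the compiler can keep values in registers and vectorise freely. */
    void scale_add(float *restrict dst, const float *restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] += 2.0f * src[i];
    }

    /* OK:  scale_add(a, b, n)         with two separate arrays
       UB:  scale_add(a, a + 1, n - 1) the same elements are now reached through
                                       two unrelated restrict pointers, which
                                       breaks the promise - anything may happen */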
  • 0
    @irene naaahhh. Globals are baaad. Instead, you declare a pointer to a god struct in main(), alloc memory to it, and then hand around that pointer like a trailer park hoe on H turkey. No globals, super clean shit!
  • 0
    @irene not if you hand the pointer around as a function argument. Which you have to, because it's declared inside main and therefore not visible outside of main.

    And the underlying struct is only declared as a type, not as a variable. That's why the malloc.
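    Something along these lines (toy sketch, names made up):

    #include <stdlib.h>

    /* the "god struct" exists only as a type - no global variable anywhere */
    typedef struct {
        double *matrix;
        size_t  rows, cols;
        int     verbose;
    } AppState;

    static void compute(AppState *st) {
        st->verbose = 1;   /* works on whatever state it's handed, no globals */
    }

    int main(void) {
        AppState *st = calloc(1, sizeof *st);  /* that's the malloc/calloc part */
        if (!st) return 1;
        compute(st);                           /* hand the pointer around */
        free(st);
        return 0;
    }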
  • 0
    @irene yeah of course you can, but it isn't recommended because it isn't thread safe.
  • 0
    @irene let the caller allocate the memory and hand over a pointer to the function.
  • 0
    @irene the first one returns static memory, which isn't a good idea. The question was, if I understood correctly, whether this is possible.
  • 0
    @irene ah that way. Well then it's just a global, and declaration will allocate the space for the variable itself.

    In the original example, if you declare a pointer to a struct as static, then the memory for the struct itself will not be allocated, only for the pointer.

    You can of course initialise the pointer to any valid value and have it point at some memory area that you got via other means, e.g. by static declaration of a matching struct.

    Or, totally hacky, you can even initialise it to a fixed address and make sure manually that this address is fine for access.
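    In code, that's roughly (sketch, the fixed address is made up):

    struct god { int x; };

    /* allocates storage for the pointer only - the struct itself doesn't exist */
    static struct god *p;

    /* this declaration actually allocates storage for a struct */
    static struct god instance;

    void setup(void) {
        p = &instance;                 /* point it at memory obtained elsewhere */

        /* or, totally hacky (typical on embedded targets): a fixed address,
           where you must guarantee yourself that it's valid to access */
        p = (struct god *)0x20001000;
    }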
  • 0
    @irene if the same memory is accessed by incompatible pointer types, then it's undefined behaviour, that may break, yes.

    In the code example I gave, the point is that the first one always returns a pointer to the same memory, no matter how many threads are using it. In the second, each thread is supposed to supply the underlying memory itself.
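    The two variants look roughly like this (sketch, names made up):

    struct result { double value; };

    /* variant 1: returns a pointer to static memory - every caller and every
       thread gets the very same buffer, so concurrent calls trample each other */
    struct result *compute_static(double x) {
        static struct result r;
        r.value = x * x;
        return &r;
    }

    /* variant 2: the caller supplies the memory, so each thread can hand in
       its own struct and nothing is shared */
    void compute_into(struct result *out, double x) {
        out->value = x * x;
    }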
  • 0
    @irene that can work if you use a union for type punning.

    Otherwise, it's undefined behaviour that may bite you after the next compiler update. I've had to fix exactly such a bug this year.
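    E.g. (C-only sketch):

    #include <stdint.h>

    /* union-based type punning: in C, reading a member other than the one last
       written reinterprets the bytes; C++ doesn't bless this */
    static float u32_bits_to_float(uint32_t bits) {
        union { uint32_t u; float f; } pun;
        pun.u = bits;
        return pun.f;
    }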
  • 2
    @irene There was a uint8_t array that was also accessed with uint32_t pointers. The previous compiler version had always aligned the array to 4 bytes, but the update didn't. Well yeah, it wasn't obliged to align after all.

    That resulted in 32 bit access at odd addresses. Now this CPU just ignored the lowest two address bits for a 32 bit access, so the access didn't happen at the expected address and instead garbled something else.

    The symptom was that we set a variable, and some lines afterwards, it was as if we hadn't set the variable. Really strange behaviour.
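    The pattern was roughly this kind of thing (hypothetical sketch, not the actual code):

    #include <stdint.h>

    static uint8_t buf[64];   /* the compiler isn't obliged to 4-byte align this */

    void garble(void) {
        /* undefined behaviour: buf may not be 4-byte aligned, and the uint32_t
           access through a uint8_t array also violates strict aliasing */
        uint32_t *p = (uint32_t *)&buf[0];
        *p = 0xDEADBEEFu;     /* on a CPU that ignores the low address bits for
                                 32 bit accesses, this lands at a different
                                 address and garbles a neighbouring variable */
    }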
  • 0
    @irene yeah that's one of the issues. But even if you have a uint32_t and a float, then both are four bytes, so the alignment works fine.

    However, if you write as uint32_t and read as float, the compiler is free to ignore the earlier uint32_t write because the pointers are incompatible types. It can give any value to the float read.

    It may even deduce that the float read is, by that logic, an uninitialised variable and can optimise the whole read away. Or the function. Or even the whole program.

    The correct way to do that is either via a union (only C, not C++) or via memcpy. The compiler will optimise the memcpy call away and do exactly the direct read instead.
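    The memcpy version, for completeness (sketch):

    #include <stdint.h>
    #include <string.h>

    /* well-defined in both C and C++: copy the bytes instead of casting
       pointers; compilers optimise this into a plain register move */
    static float u32_to_float(uint32_t bits) {
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }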
  • 2
    For those of you who are curious:
    - MATLAB: use parfor from the Parallel Computing Toolbox
    - Python: use multiprocessing (don't know which lib)
    - C++: use OpenMP

    Result:
    The serial program is even slower than the parallel version. MATLAB is the fastest.
    Reasons according to them:
    - High overhead: the matrix size is not big enough to overcome the communication cost
    - Should have used coarse-grained parallelism instead of fine-grained

    What I think:
    Utter BS, because the matrix size is big enough (1 million elements). They just made up those reasons.

    They didn't show a single line of code in their slides (probably because they knew they'd get roasted if they did), so I can't confirm my suspicion that their code is horrendous.
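    Just to illustrate, a sane OpenMP loop over a million-element array is as simple as this (a minimal sketch, obviously not their code):

    #include <stddef.h>

    /* compile with e.g. gcc -O2 -fopenmp; the pragma splits the iterations
       across the available threads */
    void scale(double *a, ptrdiff_t n, double factor) {
        #pragma omp parallel for schedule(static)
        for (ptrdiff_t i = 0; i < n; i++)
            a[i] *= factor;
    }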