10
K-ASS
162d

So, a little bit of explanation for my last fuck rant

I was trying to make some CUDA code faster, specifically eigenvalue decomposition for 12 by 12 matrices. Over a week I made a fast and accurate version and a faster but less accurate version, both faster than cuSOLVER. Then I was thinking about how to make the faster version more accurate.

Then we had this idea of using power iterations. And honestly I hoped it wouldn't work. But then, fuck me, it worked, which meant I had more work to do.

But hey, at least now I'm way faster than cuSOLVER on this

Comments
  • 0
    @mansur85 are you really interested or being sarcastic?
  • 1
    @mansur85 so assuming you know at least what eigenvalues are, a typical way of doing eigenvalue decomposition is called Jacobi iteration. It iteratively finds the largest off-diagonal value and applies a rotation matrix so that that off-diagonal value is driven to 0. After a number of rotations you get a diagonal matrix, whose entries are the eigenvalues, and if you multiply all the rotation matrices together you get the eigenvectors (there's a sketch at the end of this comment).

    Now this is good for small matrices, and that's basically what cuSOLVER implements for batched eigenvalue decompositions.

    However, the number of iterations can be large, so what the C++ Eigen library does is first apply a Householder transformation to get A = H U H^T, where U is a tridiagonal matrix and H is orthonormal, meaning H H^T is the identity matrix. Once you have U, you can apply a QR decomposition, where Q is orthonormal and R is an upper triangular matrix. So U = QR, and then you form U' = RQ and keep decomposing.
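
    To make the Jacobi part concrete, here is a rough NumPy sketch of the textbook version (this is not the CUDA code, just an illustration of the math; the matrix size, tolerance and rotation cap are made up):

    ```python
    import numpy as np

    def jacobi_eig(A, tol=1e-12, max_rot=2000):
        """Classical Jacobi: repeatedly zero out the largest off-diagonal entry."""
        A = A.astype(float).copy()
        n = A.shape[0]
        V = np.eye(n)                      # accumulated rotations -> eigenvectors
        for _ in range(max_rot):
            off = np.abs(A - np.diag(np.diag(A)))
            p, q = np.unravel_index(np.argmax(off), off.shape)
            if off[p, q] < tol:            # already numerically diagonal
                break
            # rotation angle that zeroes A[p, q]
            theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            A = J.T @ A @ J                # similarity transform keeps the eigenvalues
            V = V @ J
        return np.diag(A), V               # eigenvalues, eigenvectors (columns of V)

    # quick check on a random symmetric 12 by 12
    M = np.random.rand(12, 12); M = (M + M.T) / 2
    w, V = jacobi_eig(M)
    print(np.allclose(V @ np.diag(w) @ V.T, M, atol=1e-8))   # should print True
    ```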
  • 1
    @mansur85 ideally, after enough iterations you get a diagonal matrix, which is now your eigenvalues. However, doing the QR then RQ may converge very slowly when the matrix is almost diagonal, so there's the shifted QR algorithm. That would be expensive on CUDA, which I guess is one reason they didn't do it, but for a CPU version it's fine. So after you do all the QR steps and multiply all the Qs together, along with the H you got before, you have the eigenvectors.

    So now I want to implement this as fast as possible and on CUDA, which means I should avoid shifted QR, or any branches. So what I did is: do the Householder transformation, which gives me the symmetric tridiagonal matrix; do a fixed number of QR iterations so it's almost diagonal; then do Jacobi iterations to eliminate the remaining off-diagonal entries (see the sketch below). This is stable and will not produce any NaN values when getting the eigenvectors.

    But now, remember all that multiplying by Q, multiplying by the rotations, and multiplying by the H matrix? Slow.
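
    Here is the tridiagonalize-then-QR part of that pipeline as a NumPy sketch (again just the textbook math, not the branch-free CUDA kernel; the iteration count is an arbitrary pick):

    ```python
    import numpy as np

    def householder_tridiag(A):
        """Reduce symmetric A to tridiagonal T with A = H @ T @ H.T (H orthonormal)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        H = np.eye(n)
        for k in range(n - 2):
            x = A[k + 1:, k]
            v = x.copy()
            v[0] += np.copysign(np.linalg.norm(x), x[0])
            nv = np.linalg.norm(v)
            if nv == 0.0:
                continue                              # column is already zero below the subdiagonal
            v /= nv
            P = np.eye(n)
            P[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)  # Householder reflector
            A = P @ A @ P                              # P is symmetric and orthogonal
            H = H @ P
        return A, H                                    # A is now tridiagonal (up to roundoff)

    def qr_iterations(T, iters=30):
        """Fixed number of unshifted QR steps: T <- R @ Q, accumulating the Qs."""
        Qacc = np.eye(T.shape[0])
        for _ in range(iters):
            Q, R = np.linalg.qr(T)
            T = R @ Q
            Qacc = Qacc @ Q
        return T, Qacc

    M = np.random.rand(12, 12); M = (M + M.T) / 2
    T, H = householder_tridiag(M)
    D, Qacc = qr_iterations(T)
    V = H @ Qacc                                       # approximate eigenvectors so far
    print(np.linalg.norm(D - np.diag(np.diag(D))))     # leftover off-diagonal mass: Jacobi would finish it off
    print(np.allclose(V @ D @ V.T, M, atol=1e-8))      # the similarity transforms are exact, so True
    ```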
  • 1
    @mansur85 but when we learned about eigenvalues and eigenvectors, we were told we can solve a linear system to get the eigenvectors given the eigenvalues, right? And technically we just need to solve the tridiagonal system and multiply the result by H and we're done. However, the problem is that we are solving for a non-trivial solution, so it's unstable and can have a large error when you reconstruct the matrix from eigenvectors solved this way. So again, we are solving (A - lambda*I)x = 0.
    So here comes power iteration (really inverse iteration on the shifted matrix): it basically just tries to converge to a better solution. Say you got x in the first iteration; now put it on the right-hand side and solve (A - lambda*I)x' = x, and it will converge to a better solution. And because we had a very good initial solve, it converges in 2 iterations.
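
    A minimal NumPy sketch of that refinement, assuming you already have an eigenvalue and a rough eigenvector (np.linalg.eigh and the tiny eps shift are just for the demo; the real thing solves the tridiagonal system and multiplies by H afterwards):

    ```python
    import numpy as np

    def refine_eigvec(A, lam, x0, iters=2, eps=1e-12):
        """Inverse iteration: repeatedly solve (A - lam*I) x' = x and renormalize."""
        n = A.shape[0]
        B = A - (lam - eps) * np.eye(n)    # tiny shift keeps the solve from being exactly singular
        x = x0 / np.linalg.norm(x0)
        for _ in range(iters):
            x = np.linalg.solve(B, x)      # components away from lam get strongly damped
            x /= np.linalg.norm(x)
        return x

    # demo: start from a deliberately noisy eigenvector of a random symmetric 12 by 12
    M = np.random.rand(12, 12); M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    noisy = V[:, 0] + 1e-3 * np.random.randn(12)
    v = refine_eigvec(M, w[0], noisy)
    print(np.linalg.norm(M @ v - w[0] * v))   # residual ends up orders of magnitude below the starting noise
    ```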
  • 2
    @mansur85 so here are some stats.

    For 1M 12 by 12 matrices on a 4090, cuSOLVER takes around 580 ms. If I do the no-solve tridiagonal way, I can finish them in 340 ms; if I do the tridiagonal solve, I can do it in 200 ms. However, the solve is unstable when some rows have no off-diagonal entries, so whenever a NaN is detected in the eigenvectors from the fast solve, I just redo the whole thing the stable way. Now the perk of my code is that cuSOLVER requires you to store all the matrices and stop whatever you are doing to call the routine, but I can call mine whenever I want because it's just single-threaded CUDA code. And if I do it inline, without stopping execution and storing every matrix, the number can go down to 150 ms.
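
    The fast-path/fallback logic is basically this pattern (a hypothetical sketch; evd_fast and evd_stable are stand-ins for the two paths above, and the failure condition is faked just to trigger the fallback):

    ```python
    import numpy as np

    def evd_stable(A):
        """Stand-in for the stable path (Householder + QR + Jacobi)."""
        return np.linalg.eigh(A)

    def evd_fast(A):
        """Stand-in for the fast tridiagonal-solve path; pretend it NaNs when an off-diagonal is zero."""
        w, V = np.linalg.eigh(A)
        if np.abs(np.diag(A, 1)).min() < 1e-14:   # a missing off-diagonal breaks the fast solve
            V = np.full_like(V, np.nan)
        return w, V

    def evd(A):
        w, V = evd_fast(A)
        if np.isnan(V).any():                     # fast solve blew up -> redo the whole thing stably
            w, V = evd_stable(A)
        return w, V

    A = np.diag(np.arange(1.0, 13.0))             # already diagonal: no off-diagonals at all
    w, V = evd(A)
    print(np.isnan(V).any(), np.allclose(V @ np.diag(w) @ V.T, A))   # False True
    ```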
  • 2
    @mansur85 For reference, EVD is often performed on the CPU because people know GPU performance is bad for it. I have an i9-13900F; a single thread takes 8628 ms. The reason I didn't do multithreading is that there are P-cores and E-cores with different performance, and hyper-threading doesn't really give you 2x performance. Furthermore, transferring that 1M from GPU to CPU takes 70 ms, so the round trip takes 140 ms, which is almost the same time as just executing my code.
  • 0
    You use CUDA to generate ducks for the whole world, don't ya?
  • 1
    @retoor lmao I technically can, I will generate 8 billion of them and assign each one a duck
  • 0
    I'm not understanding any of it but it seems fucking awesome. I'm going to watch your repository for sure.
  • 0
    @NeatNerdPrime I'm not going to update it though, it's done. There's nothing that can compete with its speed at the moment
  • 0
    @mansur85 ok sorry, I didn't see that you weren't familiar with eigenvalues.

    Let me explain. Imagine you have a matrix and you multiply it by a vector. You can think of the matrix as a kernel, but how do we quantify what this matrix does? Why does multiplying it by a vector give you back that particular vector? This is where eigenvectors come in.

    Ok, so by definition A = V U V^T, where V is the eigenvectors, U is the (diagonal) matrix of eigenvalues, and V V^T is the identity matrix. Let's start from the definition: V V^T being the identity means that each column of V is a direction, and those directions are orthogonal to each other.

    Ok, now let's do Ax, which is just V U V^T x, and let's break it down. V^T x gives you, for each direction, how much of x actually lies along that direction. What does that mean? Think of the dot product of two vectors: the result is the length of one vector projected onto the other. That is why the dot product of two orthogonal vectors is 0, and the dot product of a unit vector with itself is 1.
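
    A quick NumPy illustration of that projection part (a made-up 12 by 12 example, with np.linalg.eigh standing in for any symmetric eigendecomposition):

    ```python
    import numpy as np

    M = np.random.rand(12, 12); M = (M + M.T) / 2   # symmetric, so A = V U V^T with orthonormal V
    w, V = np.linalg.eigh(M)                        # w = eigenvalues, columns of V = eigenvectors

    print(np.allclose(V @ V.T, np.eye(12)))         # V V^T is the identity: the directions are orthonormal

    x = np.random.randn(12)
    coeffs = V.T @ x                                # V^T x: how much of x lies along each direction
    print(np.allclose(coeffs, [V[:, i] @ x for i in range(12)]))   # each entry is just a dot product
    ```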
  • 0
    @mansur85 ok, now that we did V^T x, let's do U V^T x. What is U? It is the eigenvalues, but what does that mean? It essentially tells you, if you are decomposing a kernel into a combination of directions, how much each direction actually matters. So V^T x told us, if we project x onto each eigenvector, what the magnitude along that vector is. When we multiply by U, it puts weights on those magnitudes. Now that we have those weighted magnitudes, the final thing is to multiply by V again, which says: I have all the magnitudes, I just need to recombine those directions with the weights, and that is the final result.

    If U is hard to understand, just imagine the identity matrix. We know what the identity matrix does, but when you do its eigenvalue decomposition, it is actually just as reasonable to choose any set of orthonormal vectors instead of <1,0,0,...> and so on. But the eigenvalues all have to be 1, because each of those directions contributes equally.
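
    Continuing the same made-up example, here is the whole chain step by step, plus the identity-matrix case:

    ```python
    import numpy as np

    M = np.random.rand(12, 12); M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)
    x = np.random.randn(12)

    coeffs   = V.T @ x                 # project x onto each eigen-direction
    weighted = w * coeffs              # U V^T x: weight each direction by its eigenvalue
    result   = V @ weighted            # recombine the directions with those weights
    print(np.allclose(result, M @ x))  # exactly what Ax does

    # identity matrix: any orthonormal set of directions works, but the eigenvalues must all be 1
    Q, _ = np.linalg.qr(np.random.randn(12, 12))    # some random orthonormal directions
    print(np.allclose(Q @ np.diag(np.ones(12)) @ Q.T, np.eye(12)))
    ```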
  • 0
    @mansur85 and this is why Ax = lambda*x makes sense: because x is an eigenvector, it is orthogonal to every other eigenvector, so the projection picks out only that one direction, and what is left is that vector scaled by its weight in the matrix.
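
    And the same example once more, showing that collapse directly:

    ```python
    import numpy as np

    M = np.random.rand(12, 12); M = (M + M.T) / 2
    w, V = np.linalg.eigh(M)

    v0 = V[:, 0]                                    # pick one eigenvector
    print(np.allclose(V.T @ v0, np.eye(12)[:, 0]))  # it projects onto itself only, 0 everywhere else
    print(np.allclose(M @ v0, w[0] * v0))           # so A v0 is just lambda * v0
    ```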
  • 0
    @kobenz where?
  • 0
    @kobenz oh god if I have loops that reduce the performance of cusolver I’m fucked
  • 0
    @kobenz this is just a verification step, performance is not a priority
  • 0
    @mansur85 the 12 by 12 code I wrote is used for simulation, and it's mostly auto-generated. One could extend this technique to larger matrices for control systems in robotics; the downside is that the code generator only generates code for a specific matrix size.