Spent a few hours wrestling with AMD ROCm to get it working. Had to change my kernel a few times, install different versions of the rocm packages, and in one case selectively upgrade a package. I also need to run my programs with a few shady environment variable exports to work around some bugs. The whole thing looks shaky right now, nowhere near as simple as CUDA. Also, horrid names (seriously AMD, what's with the 3dgy names).

However, once I got it working it runs pretty well, happily training stuff via tensorflow-rocm with decent performance. This is also probably a good project to contribute to. I'm nowhere close to AMD's engineers at this stuff, but basic bug fixes and quality-of-life improvements are probably within reach.

  • 0
    If you ever write down what you did and/or what things to look out for, please share it here.
    I've been trying for a while to get ROCm working with my Vega GPUs, but I'm kind of a noob at driver stuff, so I have no idea whether I'm doing things right or how to check that everything is working correctly.
  • 1
    @endor what errors are you getting?
    Your Vega GPUs should be officially supported and tested. Mine is the "supported, but not tested" RX570, so it kept crashing at weird stuff.

    Basically, the combination that worked for me was ROCm 2.7 (latest is 2.8, I think) and kernel 4.18.19 on Ubuntu 18.04, using the rock-dkms package instead of the upstream kernel driver.

    If you can see your cards in rocminfo and clinfo, then try the rocm/tensorflow docker image. The Dockerhub page has instructions on the commands you need to run it (ignore the stuff they say about installing the ROCK kernel and all that shit, upstream drivers/rock-dkms is just that).
  • 0
    @RememberMe It's been a while since I last tried tbh. I was trying to do some overclocking/undervolting and some powerplay tuning, but I couldn't get things working the way they were shown in the docs and guides.

    Thanks for your suggestions, I'll see if I can get it running this time (and maybe try tensorflow for the first time :D)
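
For the "can you see your cards in rocminfo and clinfo" check mentioned above, here is a minimal sketch of a sanity script. Only the two tool names come from the thread; the function names, the status strings, and the 30-second timeout are hypothetical choices for illustration. It only reports whether each tool is installed and exits cleanly, so you still have to run the tools yourself and read the output to confirm your GPU is actually listed.

```python
import shutil
import subprocess

# Tool names are taken from the thread; everything else is illustrative.
TOOLS = ("rocminfo", "clinfo")


def check_tool(name, timeout=30):
    """Return 'missing', 'failed', or 'ok' for a command-line tool.

    'ok' only means the tool was found on PATH and exited with status 0.
    It does NOT prove that a GPU was detected.
    """
    path = shutil.which(name)
    if path is None:
        return "missing"
    try:
        result = subprocess.run(
            [path],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def check_all(tools=TOOLS):
    """Check every tool and return a {name: status} dict."""
    return {name: check_tool(name) for name in tools}


if __name__ == "__main__":
    for name, status in check_all().items():
        print(f"{name}: {status}")
```

If either tool comes back as "missing" or "failed", that is probably worth sorting out before trying the docker image.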