6
NoMad
4y

Fuck off git!

Why the fuck is a simple "git status" getting stuck? Is it even gonna return something by the end? How long am I supposed to wait?

Ffs 😡😡😣

Comments
  • 4
    git gc?

    (Assuming you haven't got some huge file you've left somewhere in your repo by mistake.)
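
    If you do want to poke it, roughly this (the repo path is just a placeholder):

    cd /path/to/your/repo     # wherever the repo lives
    git count-objects -vH     # see how much loose stuff has piled up
    git gc                    # repack and clean it up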
  • 2
    @AlmondSauce not a mistake 😛 that's where I'm hoarding data rn. It's a lot of little files, currently around 10GB in size.
  • 8
    @Jilano you bet 😜 personal info of my classmates included.
  • 3
    @Jilano ahahahahahahahaha
    Anywho, it ain't my fault. They didn't exactly give me massive-cloud-storage privileges, and I'm already running low on space every time I shoot up the ol' dusty keras again.
    No one said touching data was easy.
  • 1
    Re-cloning the same repo. Wish me luck.
  • 2
    Lots of little files that total 10GB?! Damn, what weird ML thing you doing this time that has that use case?!
  • 5
    @AlmondSauce robots. Mother fucking robots, man.
  • 5
    Checking out files: 1% (xxx/31000)

    ... Yeah I'm screwed...
  • 4
    Bad git management right there, bud.
  • 1
    @010001111 tell me how you manage your research data then. Lol.
  • 2
    @NoMad FWIW, when I've had ML data like that I've just zipped it into a single file (or a few large files, whatever makes most sense), and then used Git LFS to store it. If I can be arsed, I've re-organised the data so it's just in a few massive files I can feed straight in as training data, but that's often a PITA.

    Not perfect of course - Git LFS is a bit of a kludge, it requires a tedious zipping process, and it won't track changes sanely, but it kinda worked ok for me. Git just ain't designed for that use case it seems 🤷‍♂️
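
    Rough sketch of what I mean, names are made up:

    git lfs install                        # one-time setup on the machine
    tar -czf dataset.tar.gz data/          # bundle the little files into one blob
    git lfs track "*.tar.gz"               # hand the big archive over to LFS
    git add .gitattributes dataset.tar.gz
    git commit -m "add dataset archive via LFS"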
  • 3
    @NoMad Can you "shallow clone" your repository?
    On Windows, VFSForGit could also improve performance, but I don't know which hosting providers support it.
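
    Something like this for the shallow clone (URL is a placeholder):

    git clone --depth 1 https://example.com/you/dataset-repo.git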
  • 2
    @AlmondSauce is on the right track. Definitely gzip related sets.

    Also have a look at how people handle git on 3D platforms like Unity that are high-noise:

    https://thoughtbot.com/blog/...
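
    By "gzip related sets" I mean roughly this, with made-up directory names:

    tar -czf runs_batch_01.tar.gz runs/batch_01/
    tar -czf runs_batch_02.tar.gz runs/batch_02/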
  • 0
    Weird thing. It got the existing 10GB of data in under ten minutes, but status after adding the last 2GB is hanging again.

    I get that it's not the normal use case, but until I finish cleaning the dataset, I can't use the main dataset repo, which is on a different server afaik.
    Also, it's very hard to add more data to a zipped-folder dataset. Again, I have tiny files, a few KB each, accumulating to 10GB, so yeah, there are so many to index. I just don't get why this last 2GB is causing problems and not the other 10GB.
  • 2
    Hallelujah! After what feels like eternity, git status came back!

    Now off to get screwed by git add. 😒😒
  • 1
    Try out git-annex for media/large file storage with git. It's been a huge help to me!
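
    The basic flow is roughly this (paths are just examples):

    git annex init                  # turn the repo into an annex
    git annex add data/             # content goes into the annex; git tracks symlinks instead
    git commit -m "add data via git-annex"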
  • 1
    Git add went through, but I've been waiting for git commit for more than half an hour now. :bashing head into wall:
  • 2
    Adding or removing files shouldn't be a problem.

    Use tar.

    It's uncompressed, available on any system, and you have all kinds of knobs.
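
    Something like this, filenames made up - create the archive once, then append new runs without rewriting the rest:

    tar -cf dataset.tar data/              # create the uncompressed archive once
    tar -rf dataset.tar data/run_0042/     # append newly gathered files later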
  • 1
    Git isn't the best general purpose file system with snapshot support.
    Use something else for massive amounts of training data.
  • 0
    @NoMad what filesystem/storage type is this on?
    I've only had to wait on it on Windows with NTFS.
  • 1
    https://git-scm.com/book/en/...

    In a nutshell, @NoMad has a myriad of extremely small files.

    It's less to do with the filesystem and more with the fact that even an Optane or any other drive with low access time would get pummeled, as every file needs to be accessed at least once.

    Hence the workaround: create a single file like a tar archive and add/remove the myriad of files within it.

    Tracking a single file isn't a problem.
  • 1
    @IntrusionCM you're on the right track.

    Except that I'm also adding to the small files with each commit. Plus, this isn't my main storage and I'm still gathering this damn data. The whole shebang is complex, but in short, I'm gathering data from each robot performance run. And I'm not even at the learning phase yet. But the good news is, it did push last night. When I have all the data, I can then tar it.

    I think the issue is indexing tho. Like, 30k items are not that easy for git to index.
  • 2
    @NoMad yeah. I'm still sipping on the first coffee, so was a bit lazy. ;)

    Just wanted to make sure no one gets crazy ideas...

    Since this is really a hw limited problem, not a sw limited problem.
  • 1
    @IntrusionCM pass that coffee over here. I had one but still can't get dressed properly to leave... 😴
  • 3
    @NoMad *takes a blanket and wraps NoMad in it*

    *Gets a shopping cart*

    Time for adventure. XD
  • 1
    @IntrusionCM ahahahahahaha
    That's not how people normally go to work... 🤣