6
NoMad
4y

Fuck off git!

Why the fuck is a simple "git status" getting stuck? Is it even gonna return something by the end? How long am I supposed to wait?

Ffs 😡😡😣

Comments
  • 4
    git gc?

    (Assuming you haven't got some huge file you've left somewhere in your repo by mistake.)
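
    If you do want to poke it, roughly this (the repo path is just a placeholder):

    cd /path/to/your/repo     # wherever the repo lives
    git count-objects -vH     # see how much loose stuff has piled up
    git gc                    # repack and clean it up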
  • 2
    @AlmondSauce not a mistake 😛 that's where I'm hoarding data rn. It's a lot of little files, currently around 10GB in size.
  • 8
    @Jilano you bet 😜 personal info of my classmates included.
  • 3
    @Jilano ahahahahahahahaha
    Anywho, it ain't my fault. They didn't exactly give me massive-cloud-storage privileges, and I'm already running low on space every time I shoot up the ol' dusty keras again.
    No one said touching data was easy.
  • 1
    Re-cloning the same repo. Wish me luck.
  • 2
    Lots of little files that total 10GB?! Damn, what weird ML thing you doing this time that has that use case?!
  • 5
    @AlmondSauce robots. Mother fucking robots, man.
  • 5
    Checking out files: 1% (xxx/31000)

    ... Yeah I'm screwed...
  • 4
    Bad git management right there, bud.
  • 1
    @010001111 tell me how you manage your research data then. Lol.
  • 2
    @NoMad FWIW, when I've had ML data like that I've just zipped it into a single file (or a few large files, whatever makes most sense), and then used Git LFS to store it. If I can be arsed, I've re-organised the data so it's just in a few massive files I can feed straight in as training data, but that's often a PITA.

    Not perfect of course - Git LFS is a bit of a kludge, it requires a tedious zipping process, and it won't track changes sanely, but it kinda worked ok for me. Git just ain't designed for that use case it seems 🤷‍♂️
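
    Rough sketch of what I mean, names are made up:

    git lfs install                        # one-time setup on the machine
    tar -czf dataset.tar.gz data/          # bundle the little files into one blob
    git lfs track "*.tar.gz"               # hand the big archive over to LFS
    git add .gitattributes dataset.tar.gz
    git commit -m "add dataset archive via LFS"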
  • 3
    @NoMad Can you "shallow clone" your repository?
    On Windows, VFSForGit could also improve performance, but I don't know which hosting providers support it.
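
    Something like this for the shallow clone (URL is a placeholder):

    git clone --depth 1 https://example.com/you/dataset-repo.git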
  • 2
    @AlmondSauce is on the right track. Definitely gzip related sets.

    Also have a look at how people handle git on 3D platforms like Unity that are high-noise:

    https://thoughtbot.com/blog/...
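
    By "gzip related sets" I mean roughly this, with made-up directory names:

    tar -czf runs_batch_01.tar.gz runs/batch_01/
    tar -czf runs_batch_02.tar.gz runs/batch_02/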
  • 0
    Weird thing. It got the existing 10GB of data in under ten minutes, but status after adding the last 2GB is hanging again.

    I get that it's not the normal use case, but until I finish cleaning the dataset, I can't use the main dataset repo, which is on a different server afaik.
    Also, it's very hard to add more data to a zipped-folder dataset. Again, I have tiny files, a few KB each, accumulating to 10GB, so yeah, there are so many to index. I just don't get why this last 2GB is causing problems and not the other 10GB.
  • 2
    Hallelujah! After what feels like eternity, git status came back!

    Now off to get screwed by git add. 😒😒
  • 1
    Try out git-annex for media/large file storage with git. It's been a huge help to me!
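
    The basic flow is roughly this (paths are just examples):

    git annex init                  # turn the repo into an annex
    git annex add data/             # content goes into the annex; git tracks symlinks instead
    git commit -m "add data via git-annex"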
  • 1
    Git add went through, but I've been waiting for git commit for more than half an hour now. :bashing head into wall:
  • 2
    Adding or removing files shouldn't be a problem.

    Use tar.

    It's uncompressed, available on any system, and you have all kinds of knobs.
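
    Something like this, filenames made up - create the archive once, then append new runs without rewriting the rest:

    tar -cf dataset.tar data/              # create the uncompressed archive once
    tar -rf dataset.tar data/run_0042/     # append newly gathered files later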
  • 1
    Git isn't the best general purpose file system with snapshot support.
    Use something else for massive amounts of training data.
  • 0
    @NoMad what filesystem/storage type is this on?
    I've only had to wait on it on Windows with NTFS.
  • 1
    https://git-scm.com/book/en/...

    In a nutshell, @NoMad has a myriad of extremely small files.

    It's less to do with the filesystem and more with the fact that even an Optane or any other drive with low access time would get pummeled, as every file needs to be accessed at least once.

    Hence the workaround: create a single file like a tar archive and add/remove the myriad of files within it.

    Tracking a single file isn't a problem.
  • 1
    @IntrusionCM you're on the right track.

    Except that I'm also adding to the small files with each commit. Plus, this isn't my main storage and I'm still gathering this damn data. The whole shebang is complex, but in short, I'm gathering data from each robot performance run. And I'm not even at the learning phase yet. But the good news is, it did push last night. When I have all the data, I can then tar it.

    I think the issue is indexing tho. Like, 30k items are not that easy for git to index.
  • 2
    @NoMad yeah. I'm still sipping on the first coffee, so was a bit lazy. ;)

    Just wanted to make sure no one gets crazy ideas...

    Since this is really a hw limited problem, not a sw limited problem.
  • 1
    @IntrusionCM pass that coffee over here. I had one but still can't get dressed properly to leave... 😴
  • 3
    @NoMad *takes a blanket and wraps NoMad in it*

    *Gets a shopping cart*

    Time for adventure. XD
  • 1
    @IntrusionCM ahahahahahaha
    That's not how people normally go to work... 🤣