0
mr-user
5y

I need to add 44,000 images to a git repo. I tried to use git-lfs, but it's too slow when I run the "git add ." command. Is there a faster solution?

Extra information: the images are the data set for my AI model. The reason I use git is that I want to manage my data set more easily, since I am going to add/remove images from that data set.

Comments
  • 6
    Yeah don't use git for data storage beyond source code.
  • 0
    @kescherRant Any recommendations? I can only think of FTP, but I don't think it will be easy to do version control over FTP.
  • 0
    @mr-user You should set up a Nextcloud.
  • 2
    Put them in a directory and add a .xml file that contains (sub-)sets of relative paths to the images. Check in that xml.
  • 0
    @No3x Can you explain it in more detail? I don't understand what you are saying. Sorry.

    From what I understand, I would only push the xml file, which contains the relative paths, and not the actual images.
  • 2
    Zip them and mount the archive when needed? Then store that zip wherever you like
  • 0
    Store them in Amazon S3 or similar?
  • 0
    Divide and conquer.

    What you're trying is sheer brute force.

    I'm fairly sure you could either do bulk adds followed by a git gc.

    Or even better (and described before)… store the paths in a DB and create a hierarchical file system, e.g. the current date (if relevant) as the root folder, then for each batch of 500 files one directory named with a hash, with all 500 files inside.

    (Date) -> (Hash) -> _File 1...File N_

    It's pretty easy and fast and you won't have performance problems...

    Plus you don't have the massive storage overhead of git.
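
    A minimal sketch of that layout in Python (the source folder, the batch size of 500, and the sqlite manifest are just assumptions for illustration):

        import shutil
        import sqlite3
        import uuid
        from datetime import date
        from pathlib import Path

        SRC = Path("incoming_images")           # wherever the raw images sit
        ROOT = Path(date.today().isoformat())   # date as the root folder
        BATCH = 500

        db = sqlite3.connect("manifest.db")     # stand-in DB for the relative paths
        db.execute("CREATE TABLE IF NOT EXISTS files (relpath TEXT PRIMARY KEY)")

        images = sorted(SRC.glob("*.jpg"))
        for i in range(0, len(images), BATCH):
            batch_dir = ROOT / uuid.uuid4().hex  # one hashed directory per 500 files
            batch_dir.mkdir(parents=True, exist_ok=True)
            for img in images[i:i + BATCH]:
                target = batch_dir / img.name
                shutil.move(str(img), str(target))
                db.execute("INSERT OR IGNORE INTO files VALUES (?)", (str(target),))
        db.commit()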
  • 0
    @IntrusionCM

    I am confused.

    Do you mean I should

    1) Create the hierarchical path (root -> hash -> File1, File2, ..., FileN)

    2) Store the relative paths in a database

    3) Push (git add) the folders individually to git instead of pushing all the images together

    4) Push (git add) the database to git

    And another question: what do you think I should hash for the batch folder names?
  • 0
    @mr-user sorry, overworked

    2 different solutions.

    1) batch your git adds (if possible)

    Instead of adding 44,000 files at once, create batches, e.g. 500 files per batch, so multiple git adds, then one commit (see the sketch at the end of this comment).

    After committing, you could run git gc.

    2) Create a directory tree....

    And store it inside a database.

    :) Hashes could be UUID v4.

    The database would simplify iterating by knowing the full path to all files.

    The date root folder could be a replacement for GIT.

    You can (and maybe should) combine the directory tree with GIT in solution 1.

    GIT doesn't know directories, but your FS will thank you. The number of files per folder is limited… but you won't reach that limit, as listing a folder with >10k files is practically impossible imho.
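
    A rough sketch of solution 1 in Python (the dataset folder, batch size, and commit message are arbitrary; it assumes it is run from the repo root):

        import subprocess
        from pathlib import Path

        BATCH = 500
        files = sorted(str(p) for p in Path("dataset").rglob("*") if p.is_file())

        # Stage the files in batches instead of one huge "git add ."
        for i in range(0, len(files), BATCH):
            subprocess.run(["git", "add", "--"] + files[i:i + BATCH], check=True)

        subprocess.run(["git", "commit", "-m", "Add image dataset"], check=True)
        subprocess.run(["git", "gc"], check=True)  # repack after the big commit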
  • 0
    @IntrusionCM Thank you, and sorry for adding more work while you are overworked.

    So I should create a database which has the relative paths and batch-add the database along with the images to git? I think the HDF5 file format is better for my case.

    Could you please explain what you mean by git doesn't understand directories?

    As far as I know, git does know directories, since when you clone a repo, the git tools also create all the directories (along with the files) in the repo for you.
  • 0
    @mr-user Nah no trouble.

    You don't need GIT per se.

    It's additional overhead when you don't need a VCS at all.

    To understand my directory comment, you can read up on:

    https://git-scm.com/book/en/...

    In GIT's internal storage, directories do not exist - they're represented by a tree. That's the reason you cannot add an empty folder in GIT - its internal representation (the tree) has no entries. Mostly you add a .gitignore file to the empty directory, so the tree has an entry :) ;)

    Instead of a VCS, I would utilise a folder structure, supported by a database.

    So no GIT at all.
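
    If you want to see the empty-directory behaviour for yourself, a throwaway repo is enough (paths and file names here are arbitrary):

        import subprocess
        import tempfile
        from pathlib import Path

        repo = Path(tempfile.mkdtemp())
        subprocess.run(["git", "init", "-q"], cwd=repo, check=True)

        (repo / "empty_dir").mkdir()
        # The empty directory has no tree entry, so git reports nothing.
        print(subprocess.run(["git", "status", "--porcelain"], cwd=repo,
                             capture_output=True, text=True).stdout)  # prints nothing

        (repo / "empty_dir" / ".gitignore").touch()
        # Now the tree has an entry, so the directory shows up as untracked.
        print(subprocess.run(["git", "status", "--porcelain"], cwd=repo,
                             capture_output=True, text=True).stdout)  # "?? empty_dir/"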
  • 0
    @IntrusionCM I am hung up on version control (git) because I want a single source of truth that I can recover from.

    I could store a complete backup, but it would quickly become out of sync, and it is easy to end up creating duplicate versions (Version1-Complete, Version2-Complete, ..., VersionN-Complete), which would be hard to manage.

    I could use Dropbox, which auto-syncs a folder, but there is no way to recover deleted images (I just know I will accidentally delete some images that I shouldn't). This is my first time handling big data, so I feel like a fish out of water.

    Forgive me if my comment seems rude, but while I can see the benefits of the folder structure on a local drive that you describe, I am looking for a way to manage the data (images) as a single source of truth.
  • 0
    I just know the hammer (git), so everything seems like a nail to me. Could you recommend which direction I should take? If possible, I also want to upload it to the cloud (such as GitLab) so I have an additional backup point.
  • 0
    @mr-user Not rude at all, I think I understand what you want to achieve.

    The better questions, imho, are how you want to handle this performance-wise and what your long-term goal is.

    GIT would mean that the necessary storage keeps growing (every version of every file is kept), and it will take extremely long when you juggle that amount of data.

    When you are worried about deleting data, it looks like you are utilising GIT as a backup solution... which I dislike.

    So let's take a straight look at the facts:
    - Long time storage of files
    - Backing up files at remote location
    - Utilising files for AI

    The reason I dislike GIT in this case is the storage overhead and the performance, which will only degrade.

    And GIT is not a backup.

    Postgres allows storing large blobs... You're thinking of a file system approach and you're worried about backups. I'm thinking in a different direction - how you could store the data long term without duplicating it and without performance loss.

    - storing the files in a database would be one way (streaming the file blob)
    - an additional table could represent your dataset

    E.g. a table 'dataset' with dataset_id, name,
    and 'dataset_file' with dataset_id, file_id.

    You would store each file directly in the database, and the associative table dataset_file holds the necessary references. :)

    You could achieve the same on a filesystem - by utilizing symlinks, although this is ugly imho.

    Backing up a database is easy, can be done incrementally and - encrypted - stored in cloud.

    Backing up filesystem, too.
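
    A sketch of that schema in Python/psycopg2 (table and column names follow the comment above, the MIME-type column anticipates the metadata advice below, and the connection string is a placeholder):

        import psycopg2

        conn = psycopg2.connect("dbname=dataset_db user=me")  # placeholder DSN

        with conn, conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS file (
                    file_id   BIGSERIAL PRIMARY KEY,
                    file_name TEXT NOT NULL,
                    mime_type TEXT NOT NULL,
                    file_data BYTEA NOT NULL           -- the image bytes themselves
                );
                CREATE TABLE IF NOT EXISTS dataset (
                    dataset_id BIGSERIAL PRIMARY KEY,
                    name       TEXT NOT NULL UNIQUE
                );
                CREATE TABLE IF NOT EXISTS dataset_file (
                    dataset_id BIGINT REFERENCES dataset (dataset_id),
                    file_id    BIGINT REFERENCES file (file_id),
                    PRIMARY KEY (dataset_id, file_id)  -- associative table
                );
            """)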
  • 0
    @IntrusionCM Thanks for giving me a new direction to think about. I should forget about version-controlling the images (but it's too tempting to just version control them).

    I know that it is not really recommended to store images as a blob data type; the usual advice is to just store relative paths.

    Does Postgres work well with blob data?

    Are you talking about the table schema below?

    Table = DataSetInfo

    DataSetId (PK) | DataSetName

    Table = FileInfo

    FileId (PK) | FileName | FileData (Blob)

    Table = Data

    Id | DataSetId (FK) | FileId (FK)

    and just store the database in the cloud?
  • 0
    @mr-user The Data table doesn't need the Id field - it's an associative table; a composite PK of DataSetId and FileId is the only sane choice.

    Storing file data in a database has been controversial, since it can easily be done wrong - and even more easily for the wrong reasons.

    I guess that your images are less than 10 MB each?

    Then don't worry.

    The primary reasons why storing file data in a database is controversial:
    1) memory concerns
    2) design concerns
    3) overhead

    1) The memory concern stems from the large result set you get when you include the file data in it... So yes, storing a single 1 GB high-res image file in a field would be a problem - but that's not what you're doing... It sounds more like a lot of very small files (1 MB plus, I guess). Memory is cheap...

    2) Design-wise - store metadata about the file, at least the MIME type, so you know what's what. Regarding selects - don't include the file data column if you don't need it.

    3) Overhead

    That's a funny topic. Yes, a database has overhead, as in TCP / DB protocol and latency. But does it matter nowadays? I don't think so....

    Yes, you'll have to create a temporary file if you cannot stream the byte data directly to the program, resulting in additional overhead. But again... it doesn't matter nowadays.

    Especially not on a local machine when you just utilize a socket.

    Most of the overhead only becomes painful if you have no local access, e.g. only an internet connection between your setup and the database...

    https://wiki.postgresql.org/wiki/...

    Might give you additional ideas.

    I would go for bytea.
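
    Inserting an image as bytea and reading its metadata back might look roughly like this (psycopg2 again; the file name and DSN are placeholders, and the table matches the sketch further up):

        import psycopg2

        conn = psycopg2.connect("dbname=dataset_db user=me")  # placeholder DSN

        with conn, conn.cursor() as cur:
            # Insert: the raw bytes go straight into the bytea column.
            with open("cat_0001.jpg", "rb") as f:
                cur.execute(
                    "INSERT INTO file (file_name, mime_type, file_data) "
                    "VALUES (%s, %s, %s) RETURNING file_id",
                    ("cat_0001.jpg", "image/jpeg", psycopg2.Binary(f.read())),
                )
            file_id = cur.fetchone()[0]

            # Select metadata only - leaving file_data out keeps the result set small.
            cur.execute("SELECT file_name, mime_type FROM file WHERE file_id = %s",
                        (file_id,))
            print(cur.fetchone())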
  • 0
    @mr-user hdf5 is more space-efficient. I used it a few years ago, and it compressed a big amount of data into a few gigs.
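
    For reference, packing images into a compressed HDF5 file with h5py might look something like this (the folder, file names, and gzip choice are just examples; gzip mostly pays off for data that isn't already compressed):

        import h5py
        import numpy as np
        from pathlib import Path

        with h5py.File("dataset.h5", "w") as h5:
            for img_path in sorted(Path("images").glob("*.jpg")):
                raw = np.frombuffer(img_path.read_bytes(), dtype=np.uint8)
                # One dataset per image, holding the encoded bytes.
                h5.create_dataset(img_path.stem, data=raw, compression="gzip")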