0
mr-user
4y

I need to add 44,000 images to a git repo. I tried to use git-lfs but it's too slow when I run the "git add ." command. Is there any faster solution?

Extra information: the images are the data set for my AI model. The reason I use git is that I want to manage my data set more easily, since I am going to add/remove images to/from that data set.

Comments
  • 7
    Yeah don't use git for data storage beyond source code.
  • 0
    @kescherRant Any recommendations? I can only think of FTP, but I don't think it will be easy to do version control over FTP.
  • 0
    @mr-user You should set up a Nextcloud.
  • 2
    Put them in a directory and add a .xml file that contains (sub-)sets of relative paths to the images. Check in that xml.
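
    If it helps, a minimal sketch of that manifest idea in Python: walk the image directory and write the relative paths into an xml file, which is the only thing that gets checked in. The file name, tag names and directory layout here are assumptions, not anything No3x specified.

        # Build manifest.xml listing relative paths to all images under images/.
        # Only the manifest would be committed; the images stay outside git.
        import os
        import xml.etree.ElementTree as ET

        IMAGE_ROOT = "images"       # directory holding the images (not tracked by git)
        MANIFEST = "manifest.xml"   # the file you actually commit

        root = ET.Element("dataset", name="training-set-v1")
        for dirpath, _, filenames in os.walk(IMAGE_ROOT):
            for name in sorted(filenames):
                rel = os.path.relpath(os.path.join(dirpath, name), IMAGE_ROOT)
                ET.SubElement(root, "image", path=rel)

        ET.ElementTree(root).write(MANIFEST, encoding="utf-8", xml_declaration=True)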
  • 0
    @No3x Can you explain it in more detail? I don't understand what you are saying. Sorry.

    From what I understand, I would only push the xml file, which contains the relative paths, and not the actual images.
  • 2
    Zip them and mount the archive when needed? Then store that zip wherever you like.
  • 0
    Store them in Amazon S3 or similar?
  • 0
    Divide and conquer.

    What you're trying is sheer brute force.

    I'm fairly sure you could either do bulk adds followed by a git gc.

    Or even better (and described before)… store the paths in a DB and create a hierarchical file system, e.g. the current date (if relevant) as root folder, then for each batch of 500 files one directory named by a hash, with all 500 files inside.

    (Date) -> (Hash) -> _File 1...File N_

    It's pretty easy and fast and you won't have performance problems...

    Plus you don't have the massive storage overhead of git.
  • 0
    @IntrusionCM

    I am confused.

    Do you mean I should

    1) Create the hierarchical path (root -> hash -> File1, File2, ..., FileN)

    2) Store the relative paths in a database

    3) Push (git add) each folder individually instead of adding all the images together

    4) Push (git add) the database to git

    And another question: what do you think I should hash to name the batch folders?
  • 0
    @mr-user sorry, overworked

    2 different solutions.

    1) batch your git adds (if possible)

    Instead of adding 44,000 files at once, create batches, e.g. 500 files per batch, so multiple git adds, then one commit (there's a rough sketch at the end of this comment).

    After committing, you could run git gc.

    2) Create a directory tree....

    And store it inside a database.

    :) Hashes could be UUID v4.

    The database would simplify iterating, since it knows the full paths to all files.

    The date root folder could be a replacement for GIT.

    You can (and maybe should) combine solution 1 with the directory tree from solution 2.

    GIT doesn't know directories, but your FS will thank you. The number of files per folder is limited… but you won't reach that limit, as listing a folder with >10k files is practically impossible imho.
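
    A rough sketch of what combining the two might look like, assuming Python, git on the PATH and an existing repo; the batch size of 500, the (Date) -> (Hash) layout and the sqlite index file are assumptions taken from the comment above, not a fixed recipe.

        # Move images into (date)/(uuid4) batch folders inside the repo,
        # record each relative path in a small sqlite index, git add one
        # batch at a time, then commit once and run git gc.
        import os
        import shutil
        import sqlite3
        import subprocess
        import uuid
        from datetime import date

        SOURCE_DIR = "incoming"     # where the raw images currently live
        REPO_DIR = "dataset-repo"   # an existing git repository
        BATCH_SIZE = 500

        conn = sqlite3.connect(os.path.join(REPO_DIR, "index.sqlite"))
        conn.execute("CREATE TABLE IF NOT EXISTS files (rel_path TEXT PRIMARY KEY)")

        images = sorted(os.listdir(SOURCE_DIR))
        day = date.today().isoformat()

        for start in range(0, len(images), BATCH_SIZE):
            batch_dir = os.path.join(day, uuid.uuid4().hex)   # (Date) -> (Hash)
            os.makedirs(os.path.join(REPO_DIR, batch_dir), exist_ok=True)

            for name in images[start:start + BATCH_SIZE]:
                rel_path = os.path.join(batch_dir, name)
                shutil.move(os.path.join(SOURCE_DIR, name),
                            os.path.join(REPO_DIR, rel_path))
                conn.execute("INSERT OR IGNORE INTO files (rel_path) VALUES (?)",
                             (rel_path,))
            conn.commit()

            # one small git add per batch instead of one huge "git add ."
            subprocess.run(["git", "add", batch_dir], cwd=REPO_DIR, check=True)

        subprocess.run(["git", "commit", "-m", "add image batches"],
                       cwd=REPO_DIR, check=True)
        subprocess.run(["git", "gc"], cwd=REPO_DIR, check=True)
        conn.close()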
  • 0
    @IntrusionCM Thank you, and sorry for adding more work while you are overworked.

    I should create a database which holds the relative paths and batch-add the database along with the images to git? I think the HDF5 file format is better for my case.

    Could you please explain what you mean by "git doesn't understand directories"?

    From what I know, git does know directories, since when you clone a repo, the git tools also create all the directories (along with the files) in the repo for you.
  • 0
    @mr-user Nah no trouble.

    You don't need GIT per se.

    It's additional overhead, when you don't need a VCS at all.

    To understand my directory comment, you can read up on:

    https://git-scm.com/book/en/...

    In GIT's internal storage, directories do not exist on their own - they're represented by tree objects. That's the reason you cannot add an empty folder in GIT - its internal representation (the tree) has no entries. Mostly you add a gitignore file to the empty directory, so the tree has an entry :) ;)

    Instead of a VCS, I would utilise a folder structure, supported by a database.

    So no GIT at all.
  • 0
    @IntrusionCM I am hung up on version control (git) because I want a single source of truth that I can recover from.

    I could store complete backups, but they would quickly become out of sync, and it is easy to end up with duplicate versions (Version1-Complete, Version2-Complete, ..., VersionN-Complete), which will be hard to manage.

    I could use Dropbox, which auto-syncs a folder, but there is no way to recover deleted images (I just know I will accidentally delete some images that I shouldn't). This is my first time handling big data, so I feel like a fish out of water.

    Forgive me if my comment seems too rude, but while I can see the benefits of the local folder structure you describe, I am looking for a way to manage the data (images) as a single source of truth.
  • 0
    I just know the hammer (git), so everything seems like a nail to me. Could you recommend which direction I should take? If possible I also want to upload it to the cloud (such as GitLab) so I have an additional backup point.
  • 0
    @mr-user Not rude at all, I think I understand what you want to achieve.

    The better question would be imho how you want to handle this performance-wise and what your long-term goal is.

    GIT would mean that the necessary storage keeps growing with every change, and it will take extremely long when you juggle such an amount of data.

    When you are worried about deleting data, it looks like you are utilising GIT as a backup solution... which I dislike.

    So let's take a straight look at the facts:
    - Long time storage of files
    - Backing up files at remote location
    - Utilising files for AI

    The reason I dislike GIT in this case is the storage overhead and the performance, which will only decrease.

    And GIT is not a backup.

    Postgres allows storing large blobs... You're thinking of a file system approach and you're worried about backups. I'm thinking in a different direction - how you could store the data long term without duplicating it and without performance loss.

    - storing in a database would be one way (streaming the file blob)
    - an additional table could represent your dataset

    Eg. table 'dataset' with dataset_id, name
    'dataset_file' with dataset_id, file_id

    You would store each file directly in the database, and the assoc table dataset_file holds the necessary references. :)

    You could achieve the same on a filesystem by utilizing symlinks, although this is ugly imho.

    Backing up a database is easy, can be done incrementally and - encrypted - stored in cloud.

    Backing up filesystem, too.
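
    As a sketch of what "streaming the file blob" into Postgres could look like with psycopg2 (connection settings, table and column names are placeholders, not the thread's final schema):

        # Store one image as a bytea row and read it back.
        # The simple "file" table here is just for illustration.
        import psycopg2

        conn = psycopg2.connect("dbname=dataset user=postgres host=localhost")
        cur = conn.cursor()

        cur.execute("""
            CREATE TABLE IF NOT EXISTS file (
                file_id   serial PRIMARY KEY,
                file_name text NOT NULL,
                file_data bytea NOT NULL
            )
        """)

        with open("images/cat_0001.jpg", "rb") as fh:
            cur.execute(
                "INSERT INTO file (file_name, file_data) VALUES (%s, %s) RETURNING file_id",
                ("cat_0001.jpg", psycopg2.Binary(fh.read())),
            )
        file_id = cur.fetchone()[0]
        conn.commit()

        # fetch the bytes back without touching any other rows
        cur.execute("SELECT file_data FROM file WHERE file_id = %s", (file_id,))
        raw = bytes(cur.fetchone()[0])

        cur.close()
        conn.close()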
  • 0
    @IntrusionCM Thanks for giving me a new direction to think about. I should forget about version controlling the images (but it's too tempting to just version control them).

    I know that it is not really recommended to store images as a blob data type, and to just store relative paths instead.

    Does Postgres really work well with blob data?

    Are you talking about the table schema below?

    Table = DataSetInfo

    DataSetId (PK) | DataSetName

    Table = FileInfo

    FileId (PK) | FileName | FileData(Blob)

    Table = Data

    Id | DataSetId (FK) | FileId (FK)

    and just store the database in the cloud?
  • 0
    @mr-user The Data table doesn't need the Id field - it's an associative table, so a composite PK of DataSetId and FileId is the only sane choice.

    Storing file data in a database has been controversial, since it can easily be done wrong, and even more often done for the wrong reasons.

    I guess that your images are each less than 10MB in size?

    Then don't worry.

    The primary reasons why storing file data in a database is controversial:
    1) memory concerns
    2) design concerns
    3) overhead

    1) The memory concern stems from the large result sets you get when you include the file data column... So yes, storing a single 1GB high-res image file in a field will be a problem - but that's not what you're doing... It sounds more like a lot of very small files (around 1MB each, I guess). Memory is cheap....

    2) Design-wise - store metadata about the file, at least the mime type, so you know what's what. Regarding selects - don't include the file data column if you don't need it.

    3) Overhead

    That's a funny topic. Yes, a database has overhead, as in TCP / DB protocol and latency. But does it matter nowadays? I don't think so....

    Yes, you'll have to create a temporary file if you cannot stream the byte data directly to the program, resulting in additional overhead. But again... it doesn't matter nowadays.

    Especially not on a local machine when you just utilize a socket.

    Most of the overhead only becomes painful if you have no local access, e.g. only an internet connection between your setup and the database...

    https://wiki.postgresql.org/wiki/...

    Might give you additional ideas.

    I would go for bytea.
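
    Pulling the corrections together, one way the schema could be spelled out (composite PK on the associative table, mime type as metadata, bytea for the image bytes) - loosely following the names from the thread, just a sketch rather than the definitive layout:

        # Create the three tables discussed above via psycopg2.
        import psycopg2

        DDL = """
        CREATE TABLE IF NOT EXISTS dataset (
            dataset_id serial PRIMARY KEY,
            name       text NOT NULL UNIQUE
        );

        CREATE TABLE IF NOT EXISTS file (
            file_id   serial PRIMARY KEY,
            file_name text  NOT NULL,
            mime_type text  NOT NULL,
            file_data bytea NOT NULL
        );

        -- associative table: no surrogate Id, the pair is the primary key
        CREATE TABLE IF NOT EXISTS dataset_file (
            dataset_id integer REFERENCES dataset (dataset_id),
            file_id    integer REFERENCES file (file_id),
            PRIMARY KEY (dataset_id, file_id)
        );
        """

        with psycopg2.connect("dbname=dataset user=postgres host=localhost") as conn:
            with conn.cursor() as cur:
                cur.execute(DDL)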
  • 0
    @mr-user hdf5 is more space efficient. I used it a few years ago and it compressed a big amount of data down to a few gigs.
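
    For comparison, a minimal h5py sketch of packing the raw image bytes into a single .hdf5 file, so 44,000 small files become one file plus a name index. Paths and dataset names are assumptions; the compression wins mentioned above come mostly from storing decoded arrays or other uncompressed data, not already-compressed JPEG/PNG bytes.

        # Pack raw image bytes into one HDF5 file as variable-length byte arrays.
        import os
        import h5py
        import numpy as np

        IMAGE_DIR = "images"
        names = sorted(os.listdir(IMAGE_DIR))

        with h5py.File("dataset.hdf5", "w") as f:
            data = f.create_dataset("images", shape=(len(names),),
                                    dtype=h5py.vlen_dtype(np.uint8))
            f.create_dataset("names", data=names, dtype=h5py.string_dtype())
            for i, name in enumerate(names):
                with open(os.path.join(IMAGE_DIR, name), "rb") as fh:
                    data[i] = np.frombuffer(fh.read(), dtype=np.uint8)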