
My implementation of Facebook's Haystack storage solution. It's certainly not a faithful recreation, but I think this served my needs better.

The idea is you store all of your files in one large file, and just write down where each of your files starts and ends. This particular implementation I called an indexed haystack because it gives you back an index, sort of like an array.
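To make the idea concrete, here is a minimal sketch of an indexed haystack in Python. It is not the actual implementation from this post: the class name `IndexedHaystack`, its methods, and the in-memory `(offset, length)` list are all hypothetical, and a real version would presumably persist the index to disk as well.

```python
import os


class IndexedHaystack:
    """Hypothetical sketch: one big data file plus a list of (offset, length)."""

    def __init__(self, path):
        self.path = path
        # entries[i] = (offset, length) of the i-th stored file.
        # Kept in memory only here; persisting it is left out of the sketch.
        self.entries = []
        # Create the data file if it does not exist yet.
        open(self.path, "ab").close()

    def add(self, data: bytes) -> int:
        """Append data to the end of the haystack and return its index."""
        offset = os.path.getsize(self.path)  # new data starts at the current end
        with open(self.path, "ab") as f:
            f.write(data)
        self.entries.append((offset, len(data)))
        return len(self.entries) - 1

    def read(self, index: int) -> bytes:
        """Return the bytes stored under the given index."""
        offset, length = self.entries[index]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Calling `add` hands back 0, 1, 2, ... in order, which is the "sort of like an array" part: the caller only has to remember a number, not a file name.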

I was attracted to the idea because it makes the file structure of the server so much simpler, and backups so much easier when you only have a few files rather than a few thousand. Facebook came up with it because it was more efficient to store a million photos all in the same file rather than in a million separate ones.

There is a 100 GB limit on each haystack, but that isn't a technical limit; it's just a sensible thing to do.
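One way such a cap could be enforced is to roll over to a new haystack file once the current one is full. The sketch below is only an illustration under assumptions: the `haystack_N.dat` naming scheme, the `pick_haystack` function, and reading "100 GB" as decimal gigabytes are all made up for this example.

```python
import os

# 100 GB cap from the post, taken here as decimal gigabytes (an assumption).
# It is a policy choice, not something the file layout requires.
MAX_HAYSTACK_BYTES = 100 * 1000**3


def pick_haystack(directory, incoming_size, max_bytes=MAX_HAYSTACK_BYTES):
    """Return the path of the first haystack with room for incoming_size bytes.

    Haystacks are assumed to be named haystack_0.dat, haystack_1.dat, ...
    If none has room, the path of the next (not yet existing) one is returned.
    """
    n = 0
    while True:
        path = os.path.join(directory, f"haystack_{n}.dat")
        if not os.path.exists(path):
            return path  # start a new haystack
        if os.path.getsize(path) + incoming_size <= max_bytes:
            return path  # this one still has room
        n += 1
```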

Comments
  • 9
    Ew. Why.

    Haystack is incredibly restrictive. To edit a file, you have to open and extract the entire stack. Or, you have to remove the old file and add the new one to the end of the stack. Or version the file and index the versions, leaving all the old files in permanent storage.

    You now have a working emulation of the first iterations of cassette storage. 1959 says hello.
  • 2
    @monr0e It has its place. For example, the Facebook engineer said that deletion was incredibly rare, so they were not concerned by it.

    I'm using it for a media server. I don't even really have plans to allow for media to be removed from it, much less edited.

    Remember that a server can have more than one data solution too. Use a haystack for some stuff, database for other stuff, flat files for the other stuff, etc.
  • 4
    @AlgoRythm grumble grumble wheel something-or-other.

    A scenario in which edits rarely happen is astoundingly rare. Facebook most likely has a shit-tonne of issues with it that don't get addressed because the management structure and attitude there is horrific. Aren't you going to add play counts to your media files? Or perhaps metadata that is automatically collected when you place media on it? What about if you add new storage to your media centre and have to edit your location strings to reflect it? My point is, reel storage was abandoned because it couldn't adapt to an environment that nobody expected to change so quickly. Haystack is the same, but in an era where implementing it is absurd.
  • 0
    @monr0e View count and metadata are all in the database. The raw media is just stored in haystacks because it's convenient! Especially considering this is designed to be a household server, not a distributed one.
  • 0
    @AlgoRythm why? Do you really have that much media? And are you going to build container, codec and compression attributes into a separate db in a fashion that a modern media player can read?
  • 0
    @monr0e each episode is uploaded as a separate mp4 file, and when it gets to hundreds of episodes, the file structure gets ugly. Just a directory full of (datetime).mp4

    It works well for what I need, and it solves an issue I had. And I had fun implementing it.
  • 4
    @AlgoRythm I can taste vomit.

    OK, in all seriousness, there's no reason this doesn't work. However, you best be damn sure you have some fault tolerance built in, since locking a file of that size open for that long feels like a recipe for drive failure. Unless, of course, you're opening the media entirely in memory, in which case I hope you are made of money, given the size of even h265 nowadays.
  • 1
    @AlgoRythm "deletion is incredibly rare"
    😱😱😱😱
    Deletion never happens for Facebook, so it makes sense to store read-only, small files in this horrendous way.
    It also makes duplication across the network much simpler, and reduces the inode load on the HD.

    never, ever, use this for files you want to modify.
  • 0
    Custom FS seems more reasonable
  • 0
    @AlgoRythm I'm intrigued, are all the files (roughly) the same size?
  • 0
    @not-user-telken Facebook's or mine? Facebook's yes, mine no.
  • 0
    @AlgoRythm was asking for yours, but extra info is appreciated. How do you handle moving indexes on insertion or deletion? Just like an array?
  • 0
    @not-user-telken indexes won't move, on deletion they would just get marked as deleted. If you actually wanted to remove the data to reclaim the space it would work the same as an array.splice
  • 0
    @AlgoRythm And no insertion, just append? Also, for the purpose intended, I'd guess there is no "editing" of files in a haystack.
  • 0
    @not-user-telken No, it's just a "throw it all in a big pile" sort of solution.
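
The deletion behaviour described in the thread (indexes stay put, deleted entries are only marked, and reclaiming space works like an array splice) might be sketched roughly as follows. This is an illustration, not the code from the post: the `(offset, length, deleted)` tuples and the `mark_deleted` and `compact` helpers are hypothetical, and the haystack is held as an in-memory `bytes` object to keep the example short.

```python
def mark_deleted(entries, index):
    """Flag an entry as deleted without touching the data or other indexes."""
    offset, length, _ = entries[index]
    entries[index] = (offset, length, True)


def compact(data: bytes, entries):
    """Rebuild the haystack without the deleted payloads.

    Returns (new_data, new_entries). Like a splice, surviving entries after
    a removed one shift down, so their indexes and offsets both change.
    """
    out = bytearray()
    new_entries = []
    for offset, length, deleted in entries:
        if deleted:
            continue  # drop both the index slot and the bytes
        new_entries.append((len(out), length, False))
        out += data[offset:offset + length]
    return bytes(out), new_entries
```

Marking is cheap and keeps every existing index valid; compaction reclaims the space but rewrites the whole haystack and invalidates later indexes, which is why it would only be run when the space is actually needed.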