153

meanwhile @gitlab

Comments
  • 17
    Jesus, is this for real?
  • 31
    It looks like it is. I had thought about using GitLab but decided to stay away from it when I read about a bad security breach they had a while back.

    This is a whole different level. Very bad.
  • 8
    @dfox have you seen the doc that was linked in that tweet?

    Scary how bad the ops practices of an infrastructure company like that are.
  • 38
    @slowinversesqrt yeah, I was just looking at that. It is very scary.

    Have they figured out how much data they lost yet? I see it says their backups happen every 6 hours, but then at the bottom it says out of all the backup/replication techniques none were working reliably... what does that mean?

    And while the doc is nice for transparency, who has time to write all that stuff up while the app is still down? Seems crazy to me.

    And they obviously made the very common mistake of having no procedure for actually restoring from a backup.

    Oh, and for a company that's raised $25 million, HOW THE FUCK DO YOU DELETE AN ENTIRE DATABASE?
  • 4
    Oh come on... No one is perfect.
  • 9
    That's why I self-host. I am responsible for everything that happens to my data.
  • 4
    @dfox It's stupid; this company should have offsite backups + hourly backups and the ability to just swap in a replacement database in a case like this...
  • 6
    @drRoss There is a difference between not being perfect and somehow deleting an entire database.
  • 25
    This just reeks of utter incompetence. It seems the database was deleted when an employee accidentally ran an rm -rf command. Ok... so even if it was the master, how could there not be slaves that still have the data? That makes absolutely no sense.

    Lol, sorry for the ranting, it's a bit infuriating.
  • 5
    @dfox Understandable, i feel sorry for the people who are relying on this...
  • 18
    @vortexman100 you're exactly right. This wasn't a case of not being perfect - this was literally doing everything possible incorrectly and being incompetent.

    We're a company running out of our own pockets and we pay a good amount each month to make sure we can take hourly backups of our database and we have a restore procedure. And that's not even taking into account the slaves we have that also have all our data.
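
    For scale, the core of what we pay for is basically an hourly cron job shaped like this (the db name, bucket and alert address below are placeholders, not our actual setup):

        # Hourly: dump the db, push it offsite, and page someone if either step fails.
        stamp=$(date +%Y-%m-%d_%H)
        pg_dump -Fc -f "/backups/mydb-${stamp}.dump" mydb \
          && aws s3 cp "/backups/mydb-${stamp}.dump" "s3://example-db-backups/" \
          || echo "hourly backup ${stamp} failed" | mail -s "DB backup FAILED" ops@example.com

    The other half is actually restoring one of those dumps somewhere on a schedule, which is the part GitLab apparently never did.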
  • 20
    @dfox You have slaves, that's messed up. I bet they call you master too right? 😂
  • 4
    Transparency is one thing; saying an employee wiped the wrong db because he was tired as fuck is something else. Especially since he took no backup of any kind before deleting.
  • 3
    Was that employee hanged and shot then dried up to make failure jerky?
  • 5
  • 4
    Wow, I was planning to migrate all my repositories to GitLab; bet that's going to be a big no-no now - at least not without another backup on Bitbucket or GitHub (see the sketch below).

    Now no one can complain about wasting time doing manual backups!
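
    If anyone else wants that extra copy, git makes a second mirror nearly free - something like this, with the remote URL obviously being your own (sketch):

        # One-time: add a second remote on another host (URL is a placeholder).
        git remote add mirror git@bitbucket.org:you/yourrepo.git
        # Then, whenever you push to the primary, also mirror all branches and tags there:
        git push --mirror mirror

    And your local clone keeps the full history anyway, which is the real safety net.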
  • 11
    Your SQL query doesn't work? Just drop the database
  • 1
    Did it on my website database. I was used to the MySQL CLI, never used something like phpMyAdmin, and dropped the whole 'posts' table instead of only one record.
    Unfortunately no backups; the hosting company to which I give 10 bucks/year does not do personal data backups, so I lost something like 20 articles :/

    Well, from that day I manually do a DB dump every day (could just as well be a cron job, see the sketch below).

    BTW I'm a noob. At least I don't have a 25-million-dollar company...
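
    The cron version is a single line (db name and path are placeholders; assumes credentials live in ~/.my.cnf so no password sits in the crontab):

        # Nightly dump at 03:00, gzipped and date-stamped (% must be escaped in crontab).
        0 3 * * * mysqldump --single-transaction mydb | gzip > "$HOME/backups/mydb-$(date +\%F).sql.gz"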
  • 15
    @dfox judging by the number of "rm -rf" pranks in this week's topic, the person to blame might very well be here among us...
  • 5
    It happened to me once: my dumbass partner deleted the repo on Bitbucket and man, I abused the fuck out of him. It was hardly 7 days' worth of my shitty HTML/CSS code, but it was so precious to me. GitLab is done; I mean, developers' livelihoods depend on this. This is an epic-level blunder.
  • 1
    This really reads like nobody read the document about what really happened/is gone, and everyone just assumes the worst. From my understanding, only issues and pull requests from the last 6 hours before the accident (not the branches behind them) are lost.
  • 5
    @Razze Not the point; some TRUST this company. And some have their entire code there. So if this shit goes down and every piece of code is lost, WHICH COULD HAPPEN because the backups don't really exist, some of us could face really dark times. That's the problem. Also the downtime, but that's another story. At least they are transparent, and that's great. They will learn.
  • 3
    @Razze I know nothing mission-critical was lost, but it's an epic-level blunder anyway. And when you look at the sheer number of users, even small individual losses add up to a big blunder. Many people like me use these cloud repositories as backup, and it makes your heart skip a beat even with very little loss.
  • 1
    @vortexman100 Good thing that git is distributed and doesn't really need a server; it's very unlikely that you lose any work. Maybe stuff that never got checked out anywhere.
  • 1
    I read through the whole document; I believe repo data is fine, but it's stuff like pull requests and other data like that which is affected?

    And it reads more like they did have several backup procedures in place, but none of them worked.
    This is why resiliency testing and monitoring is a thing... and properly testing procedures (sketch below)... and not testing in prod -_-

    To be honest, if someone doesn't know this stuff, the only way they learn is in a trial by fire (e.g. a prod issue) - whether they're someone working on a personal project or a large company.
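
    What testing the procedure can look like in practice: a nightly job that restores the newest dump into a throwaway database and checks it isn't empty (everything here is a placeholder - paths, the "users" table, the alert address):

        # Nightly restore drill: load the latest dump into a scratch db,
        # fail loudly if the restore errors out or the data looks empty.
        set -e
        latest=$(ls -t /backups/mydb-*.dump | head -n 1)
        dropdb --if-exists restore_test
        createdb restore_test
        pg_restore -d restore_test "$latest"
        rows=$(psql -At -d restore_test -c 'SELECT count(*) FROM users;')
        [ "$rows" -gt 0 ] || { echo "restore drill: users table empty" | mail -s "Restore drill FAILED" ops@example.com; exit 1; }

    Boring, but it turns "we think we have backups" into "we restored one last night".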
  • 2
    Holy fucking shitballs!
  • 2
    As far as I've read about it till now, it seems bad. It's like they were doing some emergency backup thing and the guy accidentally deleted on db1.something.gitlab.com instead of db2.something.gitlab.com. A small mistake, and before he realised it, the data was already gone, leaving ~4gb out of 360gb. Unfortunately the activity happened after the last 6-hourly backup, so some data related to merge requests and such is permanently lost. I feel sorry for the guys.

    But at the same time, many folks from open source have offered their advice, which seems to have helped. Which is very good to hear (cheers to open source).
    And the Hacker News thread related to this post also has some really good advice on how to avoid such scenarios.
  • 0
    @LicensedCrime Real... Unfortunately.
  • 0
    @vortexman100 Relied on this for years and was actually unaffected!
  • 0
    @Data-Bound Ahem. Although it's a big fuckup, he literally took a full backup just a few hours before that (got that from the live stream).
  • 0
    @thmnmlst I use it, but I could still work during the downtime, so I was unaffected. Apart from that, I'm happy with their service and haven't had a single problem (I've had both GitHub and Bitbucket repos disappear from my accounts), so for me this is the first reliable provider!
  • 3
    @dfox The command was not accidental. The server was.

    They were going to back up manually from db1 (source) to db2 (dest). The backup command complained that there was already an incomplete backup at db2 (dest), for pretty obvious reasons.

    And when you do a non-incremental backup, the destination must be empty, or else you get conflicts and mixed-up backup data.

    Thus they had to erase the destination, so the backup from db1 to db2 would start clean.

    The mistake they made was running that command on the source server instead of the destination server, thus erasing all the data they were about to back up.
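
    In shell terms the plan vs. the mistake was roughly this (hostnames, paths and even the exact tool are my guess - the doc doesn't spell out the commands):

        # Plan: clear the stale half-copy on the *destination*, then re-seed it from the primary.
        ssh db2.example.com 'rm -rf /var/opt/postgresql/data/*'
        ssh db2.example.com 'pg_basebackup -h db1.example.com -D /var/opt/postgresql/data -P'

        # Reality: the same rm -rf, typed into a session that was sitting on db1,
        # the source - wiping the data that was about to be backed up.
        ssh db1.example.com 'rm -rf /var/opt/postgresql/data/*'   # wrong host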
  • 2
    @sebastian I think the biggest problem was there were many mistakes made, and general things that just seem odd to me.

    I think the deleting of the data was the most innocent mistake out of them all. I can see that happening/we've all done similar.

    But once that happened, they clearly had never practiced any restore procedures or any kind of disaster recovery. I'm not that familiar with Postgres or their exact setup, but 10-15 hours to copy and start up a 300gb database? That sounds absurd to me.

    And then the fact that they never bothered to look in the S3 bucket to see if their backups were even being saved... another ugly oversight. Tons of ways this could've been prevented.
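
    Checking that doesn't take much either - a scheduled one-liner along these lines would have flagged the empty bucket (bucket name is made up, and it assumes the backup keys are date-stamped):

        # Fail loudly if there's no object from today in the backup bucket.
        today=$(date +%F)
        aws s3 ls "s3://example-db-backups/mydb-${today}" | grep -q . \
          || echo "no backup for ${today} in S3" | mail -s "Backup missing from S3" ops@example.com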
  • 1
    @dfox The reason it took so long was that CDN spam user with 40,000 IPs (almost a class B, like a medium-sized DDoS) which was loading down the db. That's why they aborted the backup in the first place (to go ban the spam user), which is what left the incomplete backup behind.
  • 2
    @sebastian that doesn't explain why the restore took so long. It took 10-15 hours for them to transfer the datastore from their stage server to a db server and start it up.
  • 1
    @dfox It's mentioned in the document. The disk on the staging server only had a read rate of 60 Mb/s = 7.5 MB/s.

    300 GB at 7.5 MB/s = 40,960 sec = 682 min = 11.3 hours.

    (Don't know if they mean bytes or bits in the doc; they wrote Mb, but disk rates are usually expressed in bits.)

    Sounds reasonable.
  • 0
    @sebastian It's reasonable if you use it as a staging server, but not when you have no hot replica to cut over to and there's even the slimmest chance you'll ever have to pull a db off of it (in this case, to save their company, it seems).

    Either way, I can't understand why a company that's raised $25 million would be using spinning disks on any normal server like that. Seems like maybe they just cut corners and got badly burned in this case.
  • 0
    @sebastian Erm, disk rates are usually in bytes... It took them that long because it was loaded over a WAN link...
  • 0
    Funnily enough, I would trust them now. Because of this incident they'll be able to recover properly in case something similar happens again. And next time it might be caused by a disk failure.
  • 0
    @Data-Bound Yep, that's a plus. This will never happen again. I bet their backups are now rock solid.
  • 0
    @vortexman100 Rock solid, same as the memorial stone of their DB admin? :D