153

meanwhile @gitlab

Comments
  • 17
    Jesus, is this for real?
  • 31
    It looks like it is. I had thought about using GitLab but decided to stay away from it when I read about a bad security breach they had a while back.

    This is a whole different level. Very bad.
  • 8
    @dfox have you seen the doc that was linked in that tweet?

    Scary how bad the ops practices of an infrastructure company like that are.
  • 38
    @slowinversesqrt yeah, I was just looking at that. It is very scary.

    Have they figured out how much data they lost yet? I see it says their backups happen every 6 hours, but then at the bottom it says out of all the backup/replication techniques none were working reliably... what does that mean?

    And while the doc is nice for transparency, who has time to write all that stuff up while the app is still down? Seems crazy to me.

    And they obviously made the very common mistake of having no procedure for actually restoring from a backup.

    Oh, and for a company that's raised $25 million, HOW THE FUCK DO YOU DELETE AN ENTIRE DATABASE?
  • 4
    Oh come on... No one is perfect.
  • 9
    That's why I self-host. I am responsible for everything that happens to my data.
  • 4
    @dfox It's stupid; this company should have offsite backups + hourly backups and the ability to just swap in a replacement database in a case like this...
  • 6
    @drRoss There is a difference between not being perfect and somehow deleting an entire database.
  • 25
    This just reeks of utter incompetence. It seems the database was deleted when an employee accidentally ran an rm -rf command. Ok... so even if it was the master, how could there not be slaves that still have the data? That makes absolutely no sense.

    Lol, sorry for the ranting, it's a bit infuriating.
  • 5
    @dfox Understandable, i feel sorry for the people who are relying on this...
  • 18
    @vortexman100 you're exactly right. This wasn't a case of not being perfect - this was literally doing everything possible incorrectly and being incompetent.

    We're a company running out of our own pockets and we pay a good amount each month to make sure we can take hourly backups of our database and we have a restore procedure. And that's not even taking into account the slaves we have that also have all our data.
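
    For scale, the core of what we pay for is basically an hourly cron job shaped like this (the db name, bucket and alert address below are placeholders, not our actual setup):

        # Hourly: dump the db, push it offsite, and page someone if either step fails.
        stamp=$(date +%Y-%m-%d_%H)
        pg_dump -Fc -f "/backups/mydb-${stamp}.dump" mydb \
          && aws s3 cp "/backups/mydb-${stamp}.dump" "s3://example-db-backups/" \
          || echo "hourly backup ${stamp} failed" | mail -s "DB backup FAILED" ops@example.com

    The other half is actually restoring one of those dumps somewhere on a schedule, which is the part GitLab apparently never did.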
  • 20
    @dfox You have slaves, that's messed up. I bet they call you master too right? 😂
  • 4
    Transparency is one thing; saying an employee wiped the wrong db because he was tired as fuck is something else. Especially since he took no backup of any kind before deleting.
  • 3
    Was that employee hanged and shot then dried up to make failure jerky?
  • 5
  • 4
    Wow, I was planning to migrate all my repositories to GitLab; bet that's going to be a big no-no now - at least not without another backup on Bitbucket or GitHub (see the sketch below).

    Now no one can complain about wasting time doing manual backups!
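
    If anyone else wants that extra copy, git makes a second mirror nearly free - something like this, with the remote URL obviously being your own (sketch):

        # One-time: add a second remote on another host (URL is a placeholder).
        git remote add mirror git@bitbucket.org:you/yourrepo.git
        # Then, whenever you push to the primary, also mirror all branches and tags there:
        git push --mirror mirror

    And your local clone keeps the full history anyway, which is the real safety net.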
  • 11
    Your SQL query doesn't work? Just drop the database
  • 1
    Did it on my website database. I was used to the MySQL CLI, never used something like phpMyAdmin, and dropped the whole 'posts' table instead of only one record.
    Unfortunately no backups; the hosting company to which I give 10 bucks/year does not do personal data backups, so I lost something like 20 articles :/

    Well, from that day I manually do a DB dump every day (could just as well be a cron job, see the sketch below).

    BTW I'm a noob. At least I don't have a 25-million-dollar company...
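
    The cron version is a single line (db name and path are placeholders; assumes credentials live in ~/.my.cnf so no password sits in the crontab):

        # Nightly dump at 03:00, gzipped and date-stamped (% must be escaped in crontab).
        0 3 * * * mysqldump --single-transaction mydb | gzip > "$HOME/backups/mydb-$(date +\%F).sql.gz"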
  • 15
    @dfox judging by the number of "rm -rf" pranks in this week's topic, the person to blame might very well be here among us...
  • 5
    It happened to me once: my dumbass partner deleted the repo on Bitbucket and man, I abused the fuck out of him. It was hardly 7 days' worth of my shitty HTML/CSS code, but it was so precious to me. GitLab is done; I mean, developers' livelihoods depend on this. This is an epic-level blunder.
  • 1
    This really reads like nobody read the document about what really happened/is gone, and everyone just assumes the worst. From my understanding, only issues and pull requests from the last 6 hours before the accident (not the branches behind them) are lost.
  • 5
    @Razze Not the point; some TRUST this company. And some have their entire code there. So if this shit goes down and every piece of code is lost, WHICH COULD HAPPEN because the backups don't really exist, some of us could face really dark times. That's the problem. Also the downtime, but that's another story. At least they are transparent, and that's great. They will learn.
  • 3
    @Razze I know nothing mission-critical was lost, but it's an epic-level blunder anyway. And when you look at the sheer number of users, even small individual losses add up to a big blunder. Many people like me use these cloud repositories as backup, and it makes your heart skip a beat even with very little loss.
  • 1
    @vortexman100 Good thing that git is distributed and doesn't really need a server; it's very unlikely that you lose any work. Maybe stuff that never got checked out anywhere.
  • 1
    I read through the whole document; I believe repo data is fine, but it's stuff like pull requests and other data like that which is affected?

    And it reads more like they did have several backup procedures in place, but none of them worked.
    This is why resiliency testing and monitoring is a thing... and properly testing procedures (sketch below)... and not testing in prod -_-

    To be honest, if someone doesn't know this stuff, the only way they learn is in a trial by fire (e.g. a prod issue) - whether they're someone working on a personal project or a large company.
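
    What testing the procedure can look like in practice: a nightly job that restores the newest dump into a throwaway database and checks it isn't empty (everything here is a placeholder - paths, the "users" table, the alert address):

        # Nightly restore drill: load the latest dump into a scratch db,
        # fail loudly if the restore errors out or the data looks empty.
        set -e
        latest=$(ls -t /backups/mydb-*.dump | head -n 1)
        dropdb --if-exists restore_test
        createdb restore_test
        pg_restore -d restore_test "$latest"
        rows=$(psql -At -d restore_test -c 'SELECT count(*) FROM users;')
        [ "$rows" -gt 0 ] || { echo "restore drill: users table empty" | mail -s "Restore drill FAILED" ops@example.com; exit 1; }

    Boring, but it turns "we think we have backups" into "we restored one last night".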
  • 2
    Holy fucking shitballs!
  • 2
    As far as I've read about it till now, it seems bad. It's like they were doing some emergency backup thing and the guy accidentally deleted on db1.something.gitlab.com instead of db2.something.gitlab.com. A small mistake, and before he realised it, the data was already gone, leaving ~4gb out of 360gb. Unfortunately the activity happened after the last 6-hourly backup, so some data related to merge requests and such is permanently lost. I feel sorry for the guys.

    But at the same time, many folks from open source have offered their advice, which seems to have helped. Which is very good to hear (cheers to open source).
    And the Hacker News thread related to this post also has some really good advice on how to avoid such scenarios.
  • 0
    @LicensedCrime Real... Unfortunately.
  • 0
    @vortexman100 Relied on this for years and was actually unaffected!
  • 0
    @Data-Bound Ahem. Although it's a big fuckup, he literally took a full backup just a few hours before that (got that from the live stream).
  • 0
    @thmnmlst I use it, but I could still work during the downtime, so I was unaffected. Apart from that, I'm happy with their service and haven't had a single problem (I've had both GitHub and Bitbucket repos disappear from my accounts), so for me this is the first reliable provider!
  • 3
    @dfox The command was not accidental. The server was.

    They were going to back up manually from db1 (source) to db2 (dest). The backup command complained that there was already an incomplete backup at db2 (dest), for pretty obvious reasons.

    And when you do a non-incremental backup, the destination must be empty, or else you get conflicts and mixed-up backup data.

    Thus they had to erase the destination, so the backup from db1 to db2 would start clean.

    The mistake they made was running that command on the source server instead of the destination server, thus erasing all the data they were about to back up.
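
    In shell terms the plan vs. the mistake was roughly this (hostnames, paths and even the exact tool are my guess - the doc doesn't spell out the commands):

        # Plan: clear the stale half-copy on the *destination*, then re-seed it from the primary.
        ssh db2.example.com 'rm -rf /var/opt/postgresql/data/*'
        ssh db2.example.com 'pg_basebackup -h db1.example.com -D /var/opt/postgresql/data -P'

        # Reality: the same rm -rf, typed into a session that was sitting on db1,
        # the source - wiping the data that was about to be backed up.
        ssh db1.example.com 'rm -rf /var/opt/postgresql/data/*'   # wrong host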
  • 2
    @sebastian I think the biggest problem was there were many mistakes made, and general things that just seem odd to me.

    I think the deleting of the data was the most innocent mistake out of them all. I can see that happening/we've all done similar.

    But once that happened, they clearly had never practiced any restore procedures or any kind of disaster recovery. I'm not that familiar with Postgres or their exact setup, but 10-15 hours to copy and start up a 300gb database? That sounds absurd to me.

    And then the fact that they never bothered to look in the S3 bucket to see if their backups were even being saved... another ugly oversight. Tons of ways this could've been prevented.
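
    Checking that doesn't take much either - a scheduled one-liner along these lines would have flagged the empty bucket (bucket name is made up, and it assumes the backup keys are date-stamped):

        # Fail loudly if there's no object from today in the backup bucket.
        today=$(date +%F)
        aws s3 ls "s3://example-db-backups/mydb-${today}" | grep -q . \
          || echo "no backup for ${today} in S3" | mail -s "Backup missing from S3" ops@example.com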
  • 1
    @dfox The reason it took so long was that CDN spam user with 40,000 IPs (almost a class B, like a medium-sized DDoS) which was loading down the db. That's why they aborted the backup in the first place (to go ban the spam user), which is what left the incomplete backup behind.
  • 2
    @sebastian that doesn't explain why the restore took so long. It took 10-15 hours for them to transfer the datastore from their stage server to a db server and start it up.
  • 1
    @dfox It's mentioned in the document. The disk on the staging server only had a read rate of 60 Mb/s = 7.5 MB/s.

    300 GB at 7.5 MB/s = 40,960 sec = 682 min = 11.3 hours.

    (Don't know if they mean bytes or bits in the doc; they wrote Mb, but disk rates are usually expressed in bits.)

    Sounds reasonable.
  • 0
    @sebastian It's reasonable if you use it as a staging server, but not when you have no hot replica to cut over to and there's even the slimmest chance you'll ever have to pull a db off of it (in this case, to save their company, it seems).

    Either way, I can't understand why a company that's raised $25 million would be using spinning disks on any normal server like that. Seems like maybe they just cut corners and got badly burned in this case.
  • 0
    @sebastian Erm, disk rates are usually in bytes... It took them that long because it was loaded over a WAN link...
  • 0
    Funnily enough, I would trust them now. Because of this incident they'll be able to recover properly in case something similar happens again. And next time it might be caused by a disk failure.
  • 0
    @Data-Bound Yep, that's a plus. This will never happen again. I bet their backups are now rock solid.
  • 0
    @vortexman100 Rock solid, same as the memorial stone of their DB admin? :D