
One thing when working with a ton of data:
If there is a slight, infinitesimal probability that something will be wrong, then it will 100% be wrong.

Never assume data is consistent when dealing with tens of gigabytes of it, unless you got it sanitized from somewhere.

I've already seen it all:
* Duplicates in fields where I've been assured "these are unique"
* Non-numeric values in text fields that supposedly contain exclusively numbers
* Negative numbers in "number sequences starting with 1"
* Dates in the future, and in the far, far future, like the year 20115
* Even with only 200k customers, a customer ID that causes an integer overflow
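A minimal sketch of what "check everything" can look like in practice, using only the Python standard library. The column names (`customer_id`, `order_no`, `signup_date`) and the sample rows are made up for illustration; real checks depend on your schema.

```python
import csv
import io
from datetime import date

# Hypothetical customer export; deliberately contains one of each anomaly
# from the list above: a duplicate ID, a non-numeric "number", a negative
# sequence value, and a far-future date.
SAMPLE = """customer_id,order_no,signup_date
1001,1,1999-04-02
1002,2,20115-01-01
1001,-3,2003-07-19
10O3,4,2001-11-30
"""

def audit(text):
    """Collect (line, problem) pairs instead of trusting the data."""
    problems = []
    seen_ids = set()
    # start=2 because line 1 of the file is the header row
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        cid = row["customer_id"]
        if cid in seen_ids:
            problems.append((line_no, f"duplicate customer_id {cid}"))
        seen_ids.add(cid)
        if not cid.isdigit():
            problems.append((line_no, f"non-numeric customer_id {cid!r}"))
        try:
            if int(row["order_no"]) < 1:  # "sequence starting with 1"
                problems.append((line_no, "order_no below 1"))
        except ValueError:
            problems.append((line_no, f"non-numeric order_no {row['order_no']!r}"))
        try:
            year = int(row["signup_date"].split("-")[0])
            if year > date.today().year:
                problems.append((line_no, f"signup_date in the future ({year})"))
        except ValueError:
            problems.append((line_no, f"unparseable signup_date {row['signup_date']!r}"))
    return problems

for line_no, problem in audit(SAMPLE):
    print(f"line {line_no}: {problem}")
```

The point isn't this exact script; it's that every "impossible" value gets its own explicit check and gets reported, rather than silently crashing an import three hours in.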

Don't trust anything. Always check and question everything.

Comments
  • 0
    @kaupeyk or Y2K-style dates:
    02-25-96

    I swear, when I saw that once around 2010, in a database that was ... STILL WORKING, somehow, despite this crap, I just wanted to do a mass update and rewrite the whole code. Then the senior dev gave me a "you think we haven't tried that yet?" look, so I calmed myself down.
  • 2
    Everything that can go wrong, will go wrong :p
  • 1
    Great post. It confirms the number 1 rule of ANY work with data: always look at your raw data first.
  • 1
    Murphy's law?
  • 1
    @LinuxUser0001 yeah, but with the law of large numbers applied.