Alright, since I have to deal with this shit in my part-time job, I really have to ask.

What is the WORST form of abusing CSV you have ever witnessed?

I for one have to deal with something like this:


foo can be either the literal foo or a numeric value.
If it is foo, the first number after the foo dictates how many times the content between this foo and the next bar is going to be repeated. Mind you, this can be nested:


foobar means the file ends.
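
For what it's worth, here is a minimal Python sketch of how a file like this could be expanded into flat rows. It rests on assumptions the description above does not confirm: the marker sits in the first column, the repeat count is in the column right after `foo`, `bar` closes the innermost open `foo`, and every other row is plain data. The names `expand` and `_parse_block` are made up for illustration.

```python
import csv
import io


def _parse_block(rows, nested):
    """Consume rows until this block ends; return the expanded data rows."""
    out = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        marker = row[0].strip()
        if marker == "foobar":  # assumed end-of-file marker
            if nested:
                raise ValueError("file ended inside an open repeat block")
            return out
        if marker == "bar":  # assumed to close the innermost 'foo'
            if not nested:
                raise ValueError("'bar' without a matching 'foo'")
            return out
        if marker == "foo":
            count = int(row[1])  # assumed: the count follows 'foo' directly
            body = _parse_block(rows, nested=True)
            for _ in range(count):
                out.extend(list(r) for r in body)  # copy rows per repetition
        else:
            out.append(row)  # ordinary data row
    if nested:
        raise ValueError("missing 'bar' for an open repeat block")
    return out


def expand(text):
    """Parse CSV text and flatten all (possibly nested) repeat blocks."""
    # One shared iterator, so nested calls continue where the caller stopped.
    rows = iter(csv.reader(io.StringIO(text), skipinitialspace=True))
    return _parse_block(rows, nested=False)


sample = "1,a\nfoo,2\n2,b\nfoo,3\n3,c\nbar\nbar\n4,d\nfoobar\n"
for row in expand(sample):
    print(row)
```

The recursion mirrors the nesting: each `foo` starts a nested call that reads up to its own `bar`, and the caller then repeats whatever came back. Each of the six flavours would still need its own column handling on top of this.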

Now, since this isn't quite enough, there are also SIX DIFFERENT FLAVOURS OF THIS FILE, each of them with different columns.

I really need to know - is it me, or is this format simply utterly stupid? I was always taught (and fuck, we always did it this way) that CSV was simply a means to store flat and simple data. Meanwhile, when I explain my struggle, I get a shrug and "Just parse it, it's just CSV!!"

To top it off, I cannot use the flavours of these files interchangeably. Each and every one of them contains different data, so I essentially have to parse the same crap in different ways.

OK this really needed to get outta the system

  • 4
    "Foo", "bar", "c", "1", 34, 56.63
    foo, bar, "c, that was a delimiter back there" , 1, "56.633"

    Got to love mixed content 😞

    Sometimes I really wonder if a system generated it or someone was crazy enough to write it themselves and save it.
  • 0
    Honestly, this sounds entertaining. It can easily be parsed with a recursive function.
  • 0
    Pretty gross, not impossible, but still gross
  • 0
    That's what I ended up doing actually
  • 0
    CSV is indeed flat; it's just that in this 'format', the flat data is itself a compressed representation of the data you really care about.

    Not that that helps at all.
  • 0
    There's also this:
    Consider yourself lucky
  • 1
    I have worked with a CSV that had a free-text column with characters so bad that they broke UTF-8, ASCII and other encodings. It contained newline characters and all sorts of delimiters: tabs, commas, semicolons, combinations of delimiters, non-breaking whitespace, etc. It was not a double-quoted column, so you could not read a line and expect the entire row, and you could not split by comma and get the columns. Splitting by comma could give you the number of columns you needed, fewer, or more, depending on the row. The free-text column interchangeably held HTML, code in all sorts of programming languages, and comments. It was a stream of GitHub data.