Fuck my life! I've been tasked with extracting text (with proper formatting) from .docx files.
They look good on the outside, but parsing them is absolute hell. Add human error on top of the shitty XML and you get a dev's worst nightmare.
I wrote a simple function to extract text written in the 'heading(0-9)' paragraph styles and got all sorts of shit back.
One guy laid out his text in a table with the borders colored white so that he didn't have to use tabs. It is absolute bullshit.
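
For anyone fighting the same thing, here is roughly what such a heading extractor could look like. This is a minimal sketch, not the function from the rant: it reads word/document.xml straight out of the .docx zip with the standard library, and it assumes the heading style IDs look like "Heading1" through "Heading9" (the default for English Word templates; localized or custom templates may use other IDs). The name extract_headings is made up for illustration. It walks the whole tree, so headings hidden inside layout tables like the white-border one are picked up too.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Assumption: heading style IDs are "Heading0".."Heading9" (default English IDs).
HEADING_ID = re.compile(r"^Heading(\d)$", re.IGNORECASE)


def extract_headings(docx_file):
    """Return (level, text) pairs for heading-styled paragraphs, in document order.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    headings = []
    # iter() walks the whole tree, so paragraphs nested in table cells are included.
    for para in root.iter(W + "p"):
        style = para.find(W + "pPr/" + W + "pStyle")
        if style is None:
            continue  # paragraph has no explicit style
        match = HEADING_ID.match(style.get(W + "val", ""))
        if match is None:
            continue  # styled, but not as a heading
        # A paragraph's text is split across runs; join every w:t piece.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if text.strip():  # skip empty heading paragraphs
            headings.append((int(match.group(1)), text))
    return headings
```

Calling extract_headings("report.docx") (a hypothetical file name) gives a list of (level, text) tuples. It only sees paragraphs whose style is set directly to a heading; text that merely looks like a heading through manual bold and font size will not match, which is exactly the human-error part.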

  • 1
    That's why Apache POI calls its Word module HWPF, the Horrible Word Processor Format (strictly that's the one for legacy .doc; .docx goes through XWPF).
  • 0
    Maybe a little late for you, but https://nativedocuments.com/ai.html should help others who identify with this rant. You'll see that in addition to extracting text, it can fully resolve styles, which makes identifying headings much easier.