Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "pdfparsing"
-
Why Pdf is a new religion:
Pdf is complicated.
Pdf is ubiquitous.
Everyone follows their own conventions and calls it a standardised pdf.
Conversion from pdf to any other format is problematic.
Keep adding to the list...8 -
After three weeks looking for decent pdf parser that will handle all documents I gathered for my project I decided to write my own.
All those I tried end up with more then 10% not correctly parsed pdfs or require to much coding.
I was sceptic so I waited another week debating if it’s good idea to do it and I said yes.
Spent 16 hours straight coding pdf document extraction library and command line tool based on pdf.js
Fuck, now when I open pdf I see opcodes instead of text.
Got two more hours until client planning meeting and then I go to sleep for a while.
Time to start testing this more deeply as I have about 60k ~ 20GB pdf documents to parse and then I need to build some dependency graph out of its text.
At least it’s more funny then making boring REST API for money.4