After three weeks of looking for a decent PDF parser that could handle all the documents I gathered for my project, I decided to write my own.
Every one I tried either failed to parse more than 10% of the PDFs correctly or required too much coding.
I was sceptical, so I spent another week debating whether it was a good idea, and in the end I said yes.

Spent 16 hours straight coding a PDF document extraction library and command line tool based on pdf.js.

Fuck, now when I open a PDF I see opcodes instead of text.

Got two more hours until the client planning meeting, and then I’m going to sleep for a while.

Time to start testing this more thoroughly, as I have about 60k PDF documents (~20 GB) to parse, and then I need to build a dependency graph out of their text.
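For a batch that size, a rough sketch of how the run could be spread across cores is shown below. It assumes the extractor is invoked as a command that takes a PDF path as its last argument and writes its result to stdout; the function names `run_extractor` and `extract_all` are made up for illustration and are not part of the actual tool.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_extractor(command, pdf_path):
    """Run one extractor process and return (path, exit code, stdout)."""
    # `command` is the extractor invocation as a list of arguments;
    # the PDF path is appended as the final argument (an assumption).
    result = subprocess.run(
        [*command, str(pdf_path)],
        capture_output=True,
        text=True,
    )
    return pdf_path, result.returncode, result.stdout


def extract_all(command, pdf_paths, workers=None):
    """Run the extractor over many PDFs, `workers` processes at a time."""
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: each thread only waits on a child process,
    # and the child processes are what keep the cores busy.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(lambda p: run_extractor(command, p), pdf_paths))
```

Usage would be something like `extract_all(["pdf-extract"], Path("docs").glob("*.pdf"))`, where `pdf-extract` is a placeholder command name. Files with a non-zero exit code can be collected afterwards and retried or inspected by hand.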

At least it’s more fun than building a boring REST API for money.

  • 2
    That's awesome. Loved the last few lines. It's always more exciting to solve challenging problems than to do the same old CRUD operations.
  • 3
    I made a tool that parses PDF order forms (each customer had different production software that creates the orders in different formats) and converts them to a standardized Excel form. I used a cloud service to first convert the PDF to XLS and then fetched the data from the spreadsheet afterward. Don’t know if that’s an alternative for you?
  • 3
    @ZoRaC Nope, I need to extract and mine the text to get features and link references across multiple documents.

    It’s basically law data, so I need to know the row (start of line), page numbers, etc.

    At this point the worst part is done, as I can now extract everything I want and I get JSON as output.

    I also need to add a parallel Python script to put my bored 16 cores to work.

    One bug is still pending with getting spaces placed correctly inside the text, but I think I can get rid of it quickly by calculating from the glyphs instead of hardcoding a space for -275 😂

    As it’s for personal use and not crucial for my application, I released the project as open source, so maybe someone will use it in the future.
  • 0
    Nice to read a post from someone excited about what they're doing and happy.
    I plus plus thee
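The space-detection fix mentioned above (working from the glyphs rather than a fixed -275) could look roughly like the sketch below. It assumes the -275 is a kerning adjustment from a PDF `TJ` array, where numbers are in thousandths of a text space unit and negative values widen the gap between glyph runs. The function name `join_tj_items`, the `ratio` parameter and its 0.5 default are made up for illustration and are not taken from the project.

```python
def join_tj_items(items, space_width, ratio=0.5):
    """Rebuild text from a TJ-style array of strings and kerning numbers.

    items:       strings (glyph runs) mixed with numbers (adjustments in
                 thousandths of a text space unit; negative widens the gap).
    space_width: width of the current font's space glyph, same units.
    ratio:       fraction of the space width a gap must reach to count
                 as a word break (a guess, tune per corpus).
    """
    threshold = space_width * ratio
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item >= threshold and parts and not parts[-1].endswith(" "):
            # The gap is wide enough, relative to this font, to be a space.
            parts.append(" ")
    return "".join(parts)
```

With a font whose space glyph is 278 units wide, `join_tj_items(["Hello", -300, "world"], 278)` gives `"Hello world"`, while a small kerning tweak such as `["He", -20, "llo"]` stays `"Hello"`. Because the threshold scales with the font, the same gap can be a space in a narrow font and plain kerning in a wide one, which a single hardcoded value cannot capture.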