Writing a custom PDF parser from scratch. Hehe fuuuuuuck me

  • 3
    I've just been dealing with a lot of parsing shit. My suggestion is a fairly high-level functional language, since pattern matching makes the parsing easier to write and to manage. Which language will you be writing it in, btw?
  • 6
    Did once! Or... started to, realised it was a SHIT idea and put yet another "promising" side project on the shelf to collect dust.

    That was a long time ago...
  • 7
    Screw that.
  • 6
    Ah hell no.

    It's one thing to make a PDF; it's another to take the thing apart again.

    Would OCR be a better option?
  • 3
    Let's be open-minded and forget that dealing with PDFs is crap.

    What are you trying to achieve? If you simply want to read the contents of PDFs and then index them or find something specific, there are heaps of libraries (free and commercial) that can return the text as a string (and images as binary objects). From there on it's no longer PDF-related; it's simply content processing.

    If you are really building it "from scratch", as you said, it makes me wonder why...?
  • 3
    @matt-jd Yeah, the goal is to have it written in Go, but I'm a lot quicker with JS, so that's what I'll be using for starters. It'll be slow, but at least I'll have all the open questions figured out.

    @C0D4 OCR might be interesting. I haven't done any work with it though, and don't even know where to start 😅

    @devdiddydog Without spoiling too much, I have to figure out the positions of content elements relative to each other, extract images and their positions, extract fonts, join characters into words, lines and paragraphs, figure out which parts are headings, distinguish between non-image graphics, and probably a little more once I've done all that. I haven't been able to find a solution that can do even half of that for me.
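
    To give an idea of the "join characters into words and lines" part, here is a rough sketch in Go. It assumes the characters have already been pulled out of the PDF with their positions; the `Glyph` type, the `GroupLines` function and the two tolerance parameters are made up for illustration and are not from any existing library. It only handles simple horizontal text: no rotation, no columns, no paragraphs.

```go
package main

import (
	"math"
	"sort"
	"strings"
)

// Glyph is a hypothetical record for one positioned character.
// Coordinates are assumed to follow the usual PDF convention of a
// bottom-left origin, so a larger Y is higher up on the page.
type Glyph struct {
	Char  rune
	X, Y  float64 // left edge and baseline of the character
	Width float64 // horizontal advance of the character
}

// GroupLines clusters glyphs into lines by baseline and inserts a space
// wherever the horizontal gap between neighbours exceeds gapTol.
// yTol is how far a baseline may differ from the first glyph of a line
// and still count as the same line. Both tolerances are assumptions
// that would need tuning per document and font size.
func GroupLines(glyphs []Glyph, yTol, gapTol float64) []string {
	if len(glyphs) == 0 {
		return nil
	}

	// Work on a copy so the caller's slice is not reordered.
	sorted := make([]Glyph, len(glyphs))
	copy(sorted, glyphs)

	// Top of the page first.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Y > sorted[j].Y
	})

	// Start a new line whenever the baseline moves more than yTol away
	// from the baseline of the current line's first glyph.
	var lines [][]Glyph
	for _, g := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-g.Y) <= yTol {
			lines[n-1] = append(lines[n-1], g)
		} else {
			lines = append(lines, []Glyph{g})
		}
	}

	out := make([]string, 0, len(lines))
	for _, line := range lines {
		// Left to right within the line.
		sort.SliceStable(line, func(i, j int) bool {
			return line[i].X < line[j].X
		})

		var sb strings.Builder
		for i, g := range line {
			if i > 0 {
				prev := line[i-1]
				// A gap wider than gapTol is treated as a word break.
				if g.X-(prev.X+prev.Width) > gapTol {
					sb.WriteByte(' ')
				}
			}
			sb.WriteRune(g.Char)
		}
		out = append(out, sb.String())
	}
	return out
}
```

    Calling something like `GroupLines(glyphs, 0.5, 1.0)` returns one string per detected line, top to bottom. Paragraphs and headings would need more on top of this, for example the vertical distance between lines and the font size.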
  • 2
    @ScriptCoded I'm pretty sure a decent (commercial) PDF library can give you most, if not all, of that information. Maybe not a JS-based one, but at least one in .NET or Java. I've worked with a couple of them doing the opposite (creating PDF documents), and they allowed for very fine-grained definitions of objects: their type, position, margins, header/footer content, images, sizes etc. The same objects and properties were available to me when opening an existing PDF for manipulation.

    Might be worth looking into it.
  • 0
    @devdiddydog If it wasn't for the fact that this is a personal project I'd definitely do that :)