5
vane
5y

TLDR;
Side project update.

Made a simple NLP library in Python and published its first version as open source.
Now I can feed it parsed PDF text.
See rant https://devrant.com/rants/2192388/...

Why?
Because while reading a book about NLTK I couldn’t find a simple, extensible way to add support for the Polish language. I also wanted to abstract stemming, word normalization, tokenization etc. so I can apply, e.g., different conditions to separate text files without writing much code, which is an asset when you work solo.
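
A minimal sketch of what such an abstraction could look like, assuming each stage is just a swappable function. The names `Pipeline`, `simple_tokenizer` and `toy_suffix_stemmer` are hypothetical and are not taken from the published library, and the stemmer is a toy placeholder, not a real Polish stemmer.

```python
import re
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]
Normalizer = Callable[[str], str]
Stemmer = Callable[[str], str]


def simple_tokenizer(text: str) -> List[str]:
    # \w is Unicode-aware in Python 3, so Polish letters stay inside tokens.
    return re.findall(r"\w+", text)


def toy_suffix_stemmer(word: str) -> str:
    # Hypothetical placeholder: strips a few suffixes, nothing more.
    for suffix in ("ami", "ach", "ów", "om"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class Pipeline:
    """Chains tokenizer -> normalizers -> stemmer; every stage is replaceable."""

    def __init__(
        self,
        tokenizer: Tokenizer = simple_tokenizer,
        normalizers: Optional[List[Normalizer]] = None,
        stemmer: Stemmer = lambda word: word,
    ) -> None:
        self.tokenizer = tokenizer
        self.normalizers = [str.lower] if normalizers is None else normalizers
        self.stemmer = stemmer

    def process(self, text: str) -> List[str]:
        tokens = self.tokenizer(text)
        for normalize in self.normalizers:
            tokens = [normalize(token) for token in tokens]
        return [self.stemmer(token) for token in tokens]


# Different conditions for different files: build one pipeline per condition.
polish_pipeline = Pipeline(stemmer=toy_suffix_stemmer)
plain_pipeline = Pipeline()
```

With this shape, handling a new kind of file means constructing another `Pipeline` with different functions instead of writing new processing code.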

It’s about 12 GB of publicly accessible law data in PDF that I’m trying to handle (at first): about 35,000 files from the last 90 years.

So far I have automated downloading web pages and the PDF documents linked from them, extracting data from the web pages and saving it to a database, and extracting text from the PDF files. I have about 5-6 projects that do all of the above. Maybe at the end I will put them into a workflow manager like Luigi, or just run them from a cron job.

The first thing for website version 1.0 is to find the correlations between all documents within the law text, using the NLP library with custom conditions. Then just generate the directory structure and HTML files with links between documents.
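
One way the linking step could be sketched, assuming documents cite each other with a recognizable textual pattern. The `REFERENCE` regex, the `year-number-position` document id and all function names here are hypothetical simplifications; real citations in law texts vary much more than this pattern allows.

```python
import re
from html import escape
from typing import Dict, List

# Hypothetical citation pattern, e.g. "Dz.U. 1997 nr 88 poz. 553".
REFERENCE = re.compile(
    r"Dz\.\s?U\.\s+(\d{4})\s+nr\s+(\d+)\s+poz\.\s+(\d+)", re.IGNORECASE
)


def find_references(text: str) -> List[str]:
    """Return the ids of cited documents, in order, without duplicates."""
    found: List[str] = []
    for match in REFERENCE.finditer(text):
        ref = "-".join(match.groups())
        if ref not in found:
            found.append(ref)
    return found


def build_link_graph(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each document id to the known documents it cites."""
    return {
        current_id: [
            ref
            for ref in find_references(text)
            if ref in documents and ref != current_id
        ]
        for current_id, text in documents.items()
    }


def render_page(current_id: str, links: List[str]) -> str:
    """Render a bare HTML fragment with one link per cited document."""
    items = "\n".join(
        f'<li><a href="{escape(target)}.html">{escape(target)}</a></li>'
        for target in links
    )
    return f"<h1>{escape(current_id)}</h1>\n<ul>\n{items}\n</ul>"


# Made-up sample data, only to show the shape of the result.
documents = {
    "1997-88-553": "Kodeks karny.",
    "2004-54-535": "Zmiana: Dz.U. 1997 nr 88 poz. 553 oraz Dz. U. 1990 nr 1 poz. 2.",
}
graph = build_link_graph(documents)
```

Citations that point at documents outside the collection are dropped here, so the generated pages never contain dead links.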

Website version 2.0 is already in my mind, but it will be creepy to build and will take at least 1-2 months, and I want to publish fast.

I have some PDFs that contain only images instead of text. Tesseract worked quite well on them, so maybe I will try to process them once everything goes live.

Learned a lot about PDF. Now I know that a font in a PDF doesn’t always provide Unicode characters (a stupid form of obfuscation), so when you extract text you need to build a glyph-to-text map for every font.
PDF is a full vector representation, just like SVG, which is logical if you think about it a bit and know that some printers run on PostScript.
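
The per-font map idea can be sketched like this, assuming the glyph ids and their text equivalents have already been recovered somehow (for example from a font's ToUnicode table when it exists). `FONT_MAPS` is made-up data and `decode_glyphs` is a hypothetical helper, not a real PDF library call.

```python
from typing import Dict, Iterable, List, Tuple

GlyphMap = Dict[int, str]

# Made-up maps: the same glyph id means different text in different fonts,
# which is why one global map is not enough.
FONT_MAPS: Dict[str, GlyphMap] = {
    "F1": {1: "P", 2: "r", 3: "a", 4: "w", 5: "o"},
    "F2": {1: "ż", 2: "fi"},  # one glyph can stand for several characters
}


def decode_glyphs(
    runs: Iterable[Tuple[str, List[int]]],
    font_maps: Dict[str, GlyphMap],
    missing: str = "\ufffd",
) -> str:
    """Turn (font name, glyph ids) runs into text using per-font maps.

    Unknown fonts or glyph ids become the replacement character instead of
    being silently dropped, so gaps in a map stay visible in the output.
    """
    pieces: List[str] = []
    for font_name, glyph_ids in runs:
        glyph_map = font_maps.get(font_name, {})
        for glyph_id in glyph_ids:
            pieces.append(glyph_map.get(glyph_id, missing))
    return "".join(pieces)


word = decode_glyphs([("F1", [1, 2, 3, 4, 5])], FONT_MAPS)
```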

Let’s hope the next update will be about the Flutter mobile app that started all of the shit above. It’s almost ready (except for getting data from the API, which I’m working on, and a logo for the release version). It’s the last piece of the puzzle.

Comments
  • 1
    nice, what is the business case behind your work? or science/help for your job?
  • 1
    @easyFish Business use case is to make law data more accessible.

    Hope to disrupt and anger some bigger names on the local market, and I hope they won’t kill me for that 😂
  • 0
    @vane will definitely look into it. I have a law university dean in the family and plans to use NLTK in my side project for an (unfortunately non-English: Czech) language. Perfect match.