Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "pdf parsing"
-
At what point can I claim to not be a script kiddie anymore?
Like, I've built compilers, and interpreters for an excel-like syntax, I refactored a pdf-parsing library from the ground up. I managed databases and wrote protocols for communicating with hardware.
But most of my experience is with python / nodejs / golang. It is only recently that I started playing with C and rust for actual efficient system code.8 -
There are a couple:
A system that updates user accounts to connect them into our wifi system by parsing thousands of processing files written in Clojure. The project was short lived and mainly experimental, It has complete test cases and the jar generated from it is still purring silently on the main application. It was used to replace an $85k vendor application that made no fucking sense. The code has not been touched in 2 years and the jar is still there. The dba mentioned the solution to the vendor, the vendor tried buying it from me, but being that it belongs to the institution nothing was touched, still, it got the VP's attention that I can make programs that would be bought for that level, it caught his attention even more when I showed him the codebase and he recognized a Lisp variant (he is old, and was back in the day a Fortran and Cobol developer)
A small Python categorical ML program that determines certain attributes of user generated data and effectively places them on the proper categories on the main DB. The program generates estimates of the users and the predictions have a 95% correctness rate. The DBA still needs to double check the generated results before doing the db updates. I don't remember how I coded it because I was mostly drunk when I experiment on the scenario. It also got the attention of the VP and director since the web tech manager was apparently doing crazy ML shit that they were not expecting me to do, it made them paranoid that I would eventually leave for a ML role somewhere, still here, but I want more moneys!!
A program that generates PDF documentation from user data, written in Go, Python and Perl (yes Perl) I even got shit from the lead developer since I used languages outside of their current scope of work. Dude had no option but to follow along with it :P since I am his boss
Many more. I am normally proud of my work code. But my biggest moment is my current ntural language processing unit that I am trying to code for my home, but I don't have enough power to build it with my computers, currently, my AI is too stupid, but sometimes it does reply back to my commands and does the things I ask it to do (simple things, opening a browser, search for a song etc) but 7 times out of ten it wont work :P -
TLDR;
Side project update.
Made simple nlp library in python and published it’s first version to open source.
Now I can feed it with parsed pdf text.
See rant https://devrant.com/rants/2192388/...
Why ?
Cause during reading book about nltk I couldn’t find simple extendible way to provide support for polish language and I wanted to abstract stemming, word normalization, tokenizer etc. so I can provide ex. different conditions for separate text files and don’t write much code what is an asset when you work solo.
It’s about 12GB of pdf public accessible law data I am trying to handle ( at first ) which is about 35000 files from last 90 years.
So far I automated downloading web pages and pdf documents from them. Extracting data from web pages and saving it to database. Extracting text from pdf files. I have about 5-6 projects to do all of it above maybe at the end I will put it to some workflow manager like Luigi or just run it by cronjob.
First thing for website version 1.0 part is find correlation between all documents inside law text using nlp library by building custom conditions. Then just generate directory structure and html files with links between documents.
Website version 2.0 is already in my mind but it will be creepy to make it and will take at least 1-2 months and I want to publish fast.
I have some pdfs with only images instead of text and tesseract worked quite good with them so maybe I will try to process them when everything go live.
Learned a lot about pdf as now I know that font in pdf is not always providing unicode characters ( stupid form of obfuscation) so when you extract text you need to build glyph vector to text map for every font.
Pdf is full vector representation - just like svg - what is logic if you think a bit and know that some printers are running using postscript.
Let’s hope next update will be about flutter mobile app which started all of shit above. It’s almost ready ( except getting data from api I am trying to do and logo for release version ). It’s last piece of puzzle.3 -
Question.. architecting a large system. I’ve broken it down to microservices for the DB and rest API / gateway
I want there to be some some processes that run continuously not event driven via rest. Say analytics for example what is the best way todo that? Just another service running on on a server? And said service has its own API? That when the other rest APIs are called could then hop and call the new service?
Or say we had a PDF upload via rest should that service then do the parsing before uploading to DB .. or should the rest api that does the uploading then call another rest api to another service dedicated todo the parsing and uploading to the db?
I think the bigger way to explain the question is the encapsulation between DAL.. data access layer which I have existing.. but then there’s the BLL .. buisness logic layer which I don’t know if it should have its own APIs via own microservices running in the background.10