Do all the things like ++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatarSign Up
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple APILearn More
Search - "nltk"
Not sure if many people heard about nltk in python but I'm currently using a lot now for research.
So one day I was doing multiprocessing while using lemmatizer in nltk, for those who don't know, lemmatizer is a thing that change the word to its base form. So it is like, ran to run, bitches to bitch.
Anyway, the nltk package, to ensure it does not take too much memory, here's what it does: it loads a data file, and once it is loaded and accessed for the first time, it breaks the data file into CSV file. And since I was doing multiprocessing, the data file is accessed for multiple time while it can only be loaded once, hence error happened.
Instead of changing my code, which I think is good already, I went to the package directory of nltk and directly changed the source code from there and now the code works perfectly.
I'm very proud of my self at the moment, this is a very good lesson that I've learned: always look for alternatives. And suck it, nltk.1
When you are trying to reverse engineering context free grammar rules from given sentences......
Not possible. Worst assignment yet.2
Side project update.
Made simple nlp library in python and published it’s first version to open source.
Now I can feed it with parsed pdf text.
See rant https://devrant.com/rants/2192388/...
Cause during reading book about nltk I couldn’t find simple extendible way to provide support for polish language and I wanted to abstract stemming, word normalization, tokenizer etc. so I can provide ex. different conditions for separate text files and don’t write much code what is an asset when you work solo.
It’s about 12GB of pdf public accessible law data I am trying to handle ( at first ) which is about 35000 files from last 90 years.
So far I automated downloading web pages and pdf documents from them. Extracting data from web pages and saving it to database. Extracting text from pdf files. I have about 5-6 projects to do all of it above maybe at the end I will put it to some workflow manager like Luigi or just run it by cronjob.
First thing for website version 1.0 part is find correlation between all documents inside law text using nlp library by building custom conditions. Then just generate directory structure and html files with links between documents.
Website version 2.0 is already in my mind but it will be creepy to make it and will take at least 1-2 months and I want to publish fast.
I have some pdfs with only images instead of text and tesseract worked quite good with them so maybe I will try to process them when everything go live.
Learned a lot about pdf as now I know that font in pdf is not always providing unicode characters ( stupid form of obfuscation) so when you extract text you need to build glyph vector to text map for every font.
Pdf is full vector representation - just like svg - what is logic if you think a bit and know that some printers are running using postscript.
Let’s hope next update will be about flutter mobile app which started all of shit above. It’s almost ready ( except getting data from api I am trying to do and logo for release version ). It’s last piece of puzzle.3
I am facing some problem on my new project. This is how I expect my project to work.
1. I am working on an Android application that uses Twitter sentiment analysis and gives out predictive result.
2. I want to keep the backend that is the python file/NLTK script on server/cloud/anywhere.
3. The Android App take the input and pass it to server.
4. In backend the tweets regarding the input are processed and after that a result is generated. The result is passed to Android App(client).
5. The app displays the result.
The problem is I don't know how to create the server(All I have is .py file). Secondly I don't know how to create request/response between the app and the server( client with .py) like how to create communication between them.
I have googled the query but am unable to come up with any reasonable solution. Any suggestions are welcome.1