Search - "tokenizer"
-
The guy whose code makes me shake my head whenever I see it, and he is really proud of his implementations, while he
- doesn't care about warnings
- breaks builds and doesn't care
- doesn't care about code style and indents in a very column-based way
- adds tons of comments to his code, mostly hard to understand, and sometimes so many you can hardly find the code between them
- implements a tokenizer where you have to inherit from its interface (Why would I want to implement whole functions for a tokenizer and not just use it in place where needed? How do I use two of those in one class? See the sketch after this list.)
- implements a "generic" state machine base class with fixed-length arrays of 3 events and 3 strings (Why would I need events and strings hardcoded in a "generic" state machine? Why a maximum of 3?)
- once delivered software without the needed runtime components, so the whole system (an embedded device) wasn't working properly, and it was only by chance that the update mechanism didn't get disabled too
- makes your ears bleed about his big inventions whenever he sees you, no matter how often he has already told you about that blazing new feature
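To make the tokenizer complaint concrete, here is a hypothetical Python sketch of the pattern (invented names, not his actual code):

```python
# The inheritance-based design: you must subclass to use the tokenizer
# at all, and one class can't cleanly host two of them.
class TokenizerBase:
    def on_token(self, token):   # must be overridden, even for trivial uses
        raise NotImplementedError

# The composition alternative: hold as many tokenizers as you need,
# configured independently, no subclassing required.
class Parser:
    def __init__(self, word_tokenizer, number_tokenizer):
        self.words = word_tokenizer      # two tokenizers, one class
        self.numbers = number_tokenizer
```
-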
A dev team has been spending the past couple of weeks working on a 'generic rule engine' to validate a marketing process. The “Buy 5, get 10% off” kind of promotions.
The UI has all the great bits: drop-downs, various data lookups, etc.
What the dev is storing in the database is the actual string representation that is "built" from the UI, e.g. FieldA = "Buy 5, get 10% off".
Might be OK, but now they want to apply that string to an actual order: extract the '5' and the word 'Buy' to apply the purchase-quantity rule, and the '10%' and the word 'off' to subtract from the total.
Dev asked me:
Dev: “How can I use reflection to parse the string and determine what are integers, decimals, and percents?”
Me: “That sounds complicated. Why would you do that?”
Dev: “It’s only a string. Parsing it was easy. First we need to know how to extract numbers and be able to compare them.”
Me: “I’ve seen the data structures, wouldn’t it be easier to serialize the objects to JSON and store the string in the database? When you deserialize, you won’t have to parse or do any kind of reflection. You should try to keep the rule behavior as simple as possible. Developing your own tokenizer that relies on reflection and hoping the UI doesn’t change isn’t going to be reliable.”
Dev: “Tokens!...yea…tokens…that’s what we want. I’ll come up with a tokenizing algorithm that can utilize recursion and reflection to extract all the comparable data structures.”
Me: “Wow…uh…no, don’t do that. The UI already has to map the data, just make it easy on yourself and serialize that object. It’s like one line of code to serialize and deserialize.”
Dev: “I don’t know…sounds like magic. Using tokens seems like the more straightforward O-O approach. Thanks anyway.”
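For the record, the "one line of code" I was talking about looks roughly like this (a minimal Python sketch with a hypothetical Promotion class; the real data structures were richer than this):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Promotion:
    action: str          # e.g. "buy"
    quantity: int        # e.g. 5
    discount_pct: float  # e.g. 10.0

rule = Promotion(action="buy", quantity=5, discount_pct=10.0)

stored = json.dumps(asdict(rule))         # the string that goes in the database
loaded = Promotion(**json.loads(stored))  # typed fields back, nothing to parse
```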
I'm probably getting too old to keep up with these kids; I have no idea what the frack he was talking about. Not sure if they are too smart or I'm too stupid/lazy. Either way, I'm keeping my name as far away from that project as possible.
-
Definitely Rust, and a bit Haskell.
Rust has made me much more conscious of data ownership through a program, to the point that any C/C++ function I write that takes a pointer nowadays gets a comment on ownership.
I wish it was a bit less pedantic about generics sometimes, which is why I've started working on a "less pedantic rust", where generics are done through multiple dispatch à la Julia, but still monomorphising everything I can. I've only started this week, but I already have a tokenizer and most of the type inference system (an SLD tree) ready. Next up is the borrow checker and parsing the tokenized input to whatever the type inference and borrow checker need to work with, and of course actual code generation...
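For anyone unfamiliar, dispatch-style generics pick the implementation from the runtime type of the arguments. Python's functools.singledispatch gives a single-argument flavor of the same idea (a rough illustration of the concept, not code from my project):

```python
from functools import singledispatch

@singledispatch
def describe(x):               # fallback for unregistered types
    return f"something: {x!r}"

@describe.register
def _(x: int):                 # chosen when x is an int at runtime
    return f"integer {x}"

@describe.register
def _(x: list):                # chosen when x is a list at runtime
    return f"list of {len(x)} items"

print(describe(3))       # integer 3
print(describe([1, 2]))  # list of 2 items
```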
Haskell is my first FP language, and it introduced me to some FP patterns which, it turns out, are super useful even in less FP-oriented languages.
-
Read "How to implement a programming language" (http://lisperator.net/pltut) and it was very cool and enlightening. I have decided to use what I learned to make a math evaluator with a proper lexer and extensible library and stuff like that. Not a full programming language but a nice and advanced math evaluator.10
-
"The word ‘tokenizer’ makes a lot more sense, but ‘lexer’ is so much fun to say that I use it anyway."
No wonder people think a lot of programming subjects are intimidating.
-
Last night, while under the full belief that I could write a very simple lisp interpreter, I was awake until stupid o'clock, and to my credit I got the tokenizer working and producing parsed output. It's really basic, but I was pleased to have gotten anything correctly parsed at that time.
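"Really basic" means roughly this level (the classic minimal lisp tokenizer, à la Peter Norvig's lispy; not my exact code):

```python
def tokenize(src):
    # Pad parens with spaces so split() separates them from atoms.
    return src.replace("(", " ( ").replace(")", " ) ").split()

print(tokenize("(+ 1 (* 2 3))"))
# ['(', '+', '1', '(', '*', '2', '3', ')', ')']
```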
But I'm also sitting outside my apartment waiting for a locksmith because lack of sleep left me unprepared to function correctly today and I'm now locked out...
Well done me! -
I made an NLP tokenizer last year which is 30 times faster than the ClearNLP and Stanford tokenizers. It is rich and requires very little space (doesn't load any model). Should I make it open source?
-
My tokenizer can now tokenize many statements rather than just a single statement. ^_^ I'll make it open source once I've done a bit more work on it, but so far I'm happy.
-
Started writing a parser for moonscript. Because I want to do my own syntax highlighting and error support.
I'm sorry, but was this supposed to be difficult? Every article I read claimed this was gonna be some impossible feat of herculean effort. I half dreaded it; the other half of me was kinda elated.
Only it didn't live up to the hype. The tokenizer is a glorified character stream. The lexer is little more than a tokenizer, and the "most complicated" bit is nothing but a fancy transformation of the token output into a tree.
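That "fancy transformation" in a nutshell (a generic sketch, nothing moonscript-specific): turning a flat token stream into a tree is just recursion on open/close markers:

```python
def to_tree(tokens):
    node = []
    while tokens:
        tok = tokens.pop(0)
        if tok == "(":
            node.append(to_tree(tokens))  # recurse into a child node
        elif tok == ")":
            return node                   # close the current node
        else:
            node.append(tok)
    return node

print(to_tree(["a", "(", "b", "c", ")", "d"]))  # ['a', ['b', 'c'], 'd']
```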
I'm completely new to parsers proper and semantic checking, and maybe that's why it seemed easy, but I don't see what all the forewarning in the tutorials was ever about.
-
! wk95
My project back in university, where I used bash and NLP and Python to create a utility that would execute sentences written in English, much like typing "change my wallpaper to abc.jpg".
Even though the tokenizer took almost five minutes to tokenize a sentence (longer than five words), and the parser took even longer, I still love it, for it was my first dive into ML!
-
TLDR;
Side project update.
Made a simple NLP library in Python and published its first version as open source.
Now I can feed it parsed PDF text.
See rant https://devrant.com/rants/2192388/...
Why?
Because while reading a book about NLTK I couldn't find a simple, extensible way to provide support for the Polish language, and I wanted to abstract stemming, word normalization, the tokenizer, etc., so I can provide e.g. different conditions for separate text files and not write much code, which is an asset when you work solo.
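The abstraction I'm after, in miniature (invented names, not the library's real API): each processing step is a swappable function, so a Polish stemmer or a different tokenizer is just another plug-in:

```python
def simple_tokenizer(text):
    return text.lower().split()

def identity_stemmer(token):
    return token  # stand-in; a Polish stemmer would plug in here

class Pipeline:
    def __init__(self, tokenizer=simple_tokenizer, stemmer=identity_stemmer):
        self.tokenizer = tokenizer
        self.stemmer = stemmer

    def run(self, text):
        return [self.stemmer(tok) for tok in self.tokenizer(text)]

print(Pipeline().run("Ustawa z dnia 23 kwietnia"))
```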
It's about 12GB of publicly accessible law PDFs I am trying to handle (at first), which is about 35000 files from the last 90 years.
So far I have automated downloading the web pages and the PDF documents they link to, extracting data from the web pages and saving it to a database, and extracting text from the PDF files. I have about 5-6 projects that do all of the above; maybe at the end I will put them into a workflow manager like Luigi, or just run them by cronjob.
The first thing for website version 1.0 is to find correlations between all the documents in the law texts using the NLP library and custom conditions, then just generate a directory structure and HTML files with links between the documents.
Website version 2.0 is already in my mind, but it will be creepy to make, will take at least 1-2 months, and I want to publish fast.
I have some PDFs with only images instead of text; Tesseract worked quite well on them, so maybe I will try to process them when everything goes live.
Learned a lot about PDF: now I know that a font in a PDF does not always provide Unicode characters (a stupid form of obfuscation), so when you extract text you need to build a glyph-to-text map for every font.
PDF is a full vector representation, just like SVG, which is logical if you think a bit and know that some printers run on PostScript.
Let's hope the next update will be about the Flutter mobile app which started all of the above. It's almost ready (except getting data from the API, which I am trying to do, and a logo for the release version). It's the last piece of the puzzle.
-
Anyone ever struggle with tokenizers? I'm writing one by hand and it's frustrating me so much, because at times I think my logic gets flawed somewhere inside a loop and things get skipped or missed; I don't know how to explain it.
-
Question for someone who uses Mongo Atlas Search:
If I'm only interested in autocomplete from the start of the text, which is more performant?
1) standard analyzer + edgeGram tokenizer
2) keyword analyzer + edgeGram tokenizer
I don't see why I should index separate words if I don't care about random positions :/
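In case it helps to make the comparison concrete, option 2 sketched as an Atlas Search custom analyzer definition (written here as a Python dict; the minGram/maxGram values are placeholders, so check the Atlas docs for the exact syntax):

```python
# With edgeGram as the tokenizer, the whole input is consumed and only
# prefixes from the start of the text are emitted, which is effectively
# the keyword + edgeGram combination.
kw_edgegram_analyzer = {
    "name": "kw_edgegram",
    "tokenizer": {
        "type": "edgeGram",  # prefixes only, from the start of the text
        "minGram": 2,        # placeholder
        "maxGram": 15,       # placeholder
    },
    "tokenFilters": [{"type": "lowercase"}],
}
```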
Thank you