Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "one-based index"
-
A few weeks ago a client called me. His application contains a lot of data, including email addresses (local part and domain stored separately in SQL database). The application can filter data based on the domain part of the addresses. He ask me why sub.example.com is not included when he asked the application for example.com. I said: No problem, I can add this feature to the application, but the process will take a longer.
Client: No problem, please add this ASAP.
So, the next day I changed some of the SQL queries to lookup using the LIKE operator.
After a week the client called again: The process is really slow, how can this be?
Me: Well, you asked me to filter the subdomains as well. Before, the application could easily find all the domains (SQL index), but now it has to compare all the domains to check if it ends with the domain you are looking for.
Client: Okay, but why is it a lot slower than before?
Me: Do you have a dictionary in your office?
<Client search for a dictionary, came back with one>
Me: give me the definition of the word "time"
<Client gives definition of time>
Me: Give me the definition of all words ending with "time"
Client: But, ...
Never heard from him again on this issues :-P5 -
I wanted to print the second and third page of some document, so in the relevant field of the printer dialog I enter "1, 2" and I walk off to the printer.
My first thought when I saw the printer had printed the wrong pages was
"F*ing buggy software"
Second thought:
"Oh... right"
Third thought:
"Right, in the real world, one-based indices are the rule rather than the exception. "
Fourth thought:
"Dumb real world"3 -
So this web company i joined had a page load time in minutes. The free text search (inverted index search, based on elasticsearch) queries would return results in 10-45 seconds (should be milliseconds always). The indexes had no schema. And they would crawl data and feed into mssql db, which had a 2 gb/db limit on the free version. So everytime the db hit the limit, a new db was created and the name was incremented by one.
Had a very tough time cleaning up that mess. Plus the architect who had made this architecture was on his way out and unhelpful to the core.
What was worse was that most of the changes i did were very simple changes that should have been done long back. Basic sanity changes.4 -
Heres some research into a new LLM architecture I recently built and have had actual success with.
The idea is simple, you do the standard thing of generating random vectors for your dictionary of tokens, we'll call these numbers your 'weights'. Then, for whatever sentence you want to use as input, you generate a context embedding by looking up those tokens, and putting them into a list.
Next, you do the same for the output you want to map to, lets call it the decoder embedding.
You then loop, and generate a 'noise embedding', for each vector or individual token in the context embedding, you then subtract that token's noise value from that token's embedding value or specific weight.
You find the weight index in the weight dictionary (one entry per word or token in your token dictionary) thats closest to this embedding. You use a version of cuckoo hashing where similar values are stored near each other, and the canonical weight values are actually the key of each key:value pair in your token dictionary. When doing this you align all random numbered keys in the dictionary (a uniform sample from 0 to 1), and look at hamming distance between the context embedding+noise embedding (called the encoder embedding) versus the canonical keys, with each digit from left to right being penalized by some factor f (because numbers further left are larger magnitudes), and then penalize or reward based on the numeric closeness of any given individual digit of the encoder embedding at the same index of any given weight i.
You then substitute the canonical weight in place of this encoder embedding, look up that weights index in my earliest version, and then use that index to lookup the word|token in the token dictionary and compare it to the word at the current index of the training output to match against.
Of course by switching to the hash version the lookup is significantly faster, but I digress.
That introduces a problem.
If each input token matches one output token how do we get variable length outputs, how do we do n-to-m mappings of input and output?
One of the things I explored was using pseudo-markovian processes, where theres one node, A, with two links to itself, B, and C.
B is a transition matrix, and A holds its own state. At any given timestep, A may use either the default transition matrix (training data encoder embeddings) with B, or it may generate new ones, using C and a context window of A's prior states.
C can be used to modify A, or it can be used to as a noise embedding to modify B.
A can take on the state of both A and C or A and B. In fact we do both, and measure which is closest to the correct output during training.
What this *doesn't* do is give us variable length encodings or decodings.
So I thought a while and said, if we're using noise embeddings, why can't we use multiple?
And if we're doing multiple, what if we used a middle layer, lets call it the 'key', and took its mean
over *many* training examples, and used it to map from the variance of an input (query) to the variance and mean of
a training or inference output (value).
But how does that tell us when to stop or continue generating tokens for the output?
Posted on pastebin if you want to read the whole thing (DR wouldn't post for some reason).
In any case I wasn't sure if I was dreaming or if I was off in left field, so I went and built the damn thing, the autoencoder part, wasn't even sure I could, but I did, and it just works. I'm still scratching my head.
https://pastebin.com/xAHRhmfH33 -
Finally finished an algo to check an image for grouping of pixels that will form a rectangular area. I got the grouping to work on one image, but found it was utterly failing on another. I went through every step of the algo and still could not find the solution. The 128x128 image was working, but the 128x16 image was not. I knew it had something to do with the dimensions. Started thinking it was overflowing a buffer somewhere. So I started putting asserts in the functions that abstracted the buffer access. None of the numbers exceeded the proper bounds. It was close to bedtime so I finally gave up. I was tired. Then I realized it wouldn't be until the next evening when I could look at this again. So I got up again and started looking at the code again. I had a loop to check the output of my algo that I did the memory access of the buffer. It too was not fully filling my temp image to show how the algo was working. WTF!
Then I finally realized the flaw:
buffer[x+y*height]
And my test loop to test the algo:
buffer[x+y*ymax]
I kept overlooking the error because I was sure it was right. Also my asserts for the functions to access the buffers? They only checked the inputs x and y. So it didn't help that the math was wrong for reading and writing the buffers. It also worked fine on 128x128 images because the width and height were the same.
It is funny that I struggled with this part. The algo was actually surprisingly easy to formulate. I just looked through every point and checked a buffer to see if that point was used. If not then I would attempt to grow in the x and y direction the shaped of that point based upon pixel color. This was saved in a structure while growing that point. Then when that rectangle could not be grown further the inner loop would continue checking used points again.
I still have work to do to use the data this algo produces. I need to now figure out how to parent the rectangular areas to each other. I will probably use my check buffer to keep track of these rects by an index. Then do adjacent checks to determine parenting. Eventually I will have to extend this algo to 3 dimensions, but that should not be difficult.2 -
There is a element in mappView that lets you display a specific image from a list based on a index. It has the parameters selectedImageIndex and selectedIndex. One is used to set the other to read. Its not like that you need RW separation. Its done with variable binding. So why the FUCK is it there????2