Search - "encodings"
-
Wow, what a fucking mess this Sunday was.
My boss wrote me an email that one route of a RESTful API we wrote for a customer was not working anymore and was puking back a status 500 with some error mentioning invalid UTF-8 characters.
Not a single person had touched or changed the code on production in some 6 months, so what the fuck could it be?
PHPUnit did not give any errors (running only locally), the code had no syntax errors, and the DB dump did not contain any invalid bytes (checked with a hex editor).
WHAT THE FUCK?!
OK, so I started commenting out lines (all tested directly on production, of course) until the error vanished.
Guess what was the culprit?
.
.
.
.
.
.
In the code (PHP) we used strftime(...) to get nice time strings. Of course we had set the correct locale on the server, so months and days were formatted in German.
So, in German there is this one mysterious month called "März" which contains an umlaut character.
Calling strftime generated the date with März in it, but the server locale was de_CH.iso-8859-1 and not fucking de_CH.utf8, so the "ä" came back as the single byte 0xE4 instead of 0xC3 0xA4 (valid UTF-8), which json_encode(...) did not want to swallow but instead threw an exception.
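In case anyone wants to reproduce it, here's a minimal PHP sketch of the failure (assuming the de_CH locales exist under exactly those names; the format string and date are made up for illustration, and strftime() is deprecated as of PHP 8.1, which this code predates):

<?php
// The non-UTF-8 locale makes strftime() emit Latin-1 bytes.
setlocale(LC_TIME, 'de_CH.iso-8859-1');                   // what the server actually had
$date = strftime('%d. %B %Y', mktime(0, 0, 0, 3, 1, 2017));
// "März" now contains the single byte 0xE4 for "ä", not the UTF-8 pair 0xC3 0xA4.

var_dump(json_encode(['date' => $date]));                 // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8);          // bool(true)
// Plain json_encode() returns false on invalid UTF-8; with JSON_THROW_ON_ERROR
// or a framework wrapper it throws, which is presumably where the 500 came from.

// The fix: the UTF-8 locale (or converting the string before encoding).
setlocale(LC_TIME, 'de_CH.utf8');
$date = strftime('%d. %B %Y', mktime(0, 0, 0, 3, 1, 2017));
echo json_encode(['date' => $date]);                      // {"date":"01. M\u00e4rz 2017"}
-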
*wrestling commentator voice*
"In this weeks episode of encoding hell:
The iiiinnnfamous UTF-8 Byte Order Mark veeeersus PHP!"
For an online shop we developed, there is currently a CSV upload feature in review by our client. Before we developed this feature, we created, together with the client, a very precise specification, including the file format and encoding (UTF-8).
After the first test day, the client informed us that there were invalid characters after processing the uploaded file.
We checked the code and compared the customer's file with our template.
The file was encoded in ISO-8859-1 and NOT, as specified, in UTF-8.
But whatever, we had to add an encoding check, allowing both encodings from now on.
Well well well welly welly fucking well...
Test day 2: we receive an email from said client that the CSV is not working, again.
This time: UTF-8 encoding, but some fields had more columns with different values than specified.
Fucking hell.
We tell the customer that.
(I was about to write a nice death threat novel to them, but my boss held me back)
Testing day 3, today:
"The uploading feature is not working with our file, please fix it."
I tried to debug it, but only got misleading errors. After about 30 minutes, at 20 stacks of hatred, I finally had the idea to check the file in a hex editor:
God fucking what!?!!?!11?!1!!!?2!!
The encoding was valid UTF-8, all columns and fields were correct, but this time the file contained something different.
Something the world does not need.
Something nearly as wasteful as driving a monster truck in first gear from NYC to LA.
It was the UTF-8 Byte Order Mark.
3 bytes of pure hell.
Fucking 0xEFBBBF.
The archenemy of PHP and sane people.
If the devil had sex with the ethernet port of a rusty Mac OS X Server, then 9 microseconds later a UTF-8 BOM would have been born.
OK, maybe if PHP could actually cope with these bytes of death without crashing, that would be great.
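If anyone else gets bitten: a minimal sketch of the kind of guard this ends up needing (path and field names are placeholders, not our real shop code):

<?php
// Strip a UTF-8 BOM (0xEF 0xBB 0xBF) from an uploaded CSV before parsing it.
function stripUtf8Bom(string $content): string
{
    return strncmp($content, "\xEF\xBB\xBF", 3) === 0
        ? substr($content, 3)
        : $content;
}

$raw  = stripUtf8Bom(file_get_contents('/tmp/upload.csv')); // placeholder path
$rows = array_map('str_getcsv', explode("\n", trim($raw)));

// Without the strip, the BOM glues itself to the first header field
// ("\xEF\xBB\xBFarticle_id" instead of "article_id") and column matching fails.
-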
For fuck's sake! I just spent almost an entire workday trying to fix a bug which didn't even exist in the first place.
Well, at least now I got this to cheer me up:
-
Yesterday a strange bug appeared in Chrome: in a small web app we have some umlaut characters like äöü. Strangely, said characters were displayed as Cyrillic, but only on my PC... On every other device it worked. I spent about 5 hours checking encodings (everything was in UTF-8) and reading posts on the almighty Stack Overflow.
Finally I figured out the font was broken. After reinstalling it, everything was peaceful again in my head.
-
REDIS: Great for cloud, will fuck up your local disk if too many write operations per second.
DynamoDB: WTF 10Mb should not be "too large for a single record"!!
SPARK: NEVER CONNECT IT TO A DATABASE! Wasted A LOT of cluster time. Also, can you be LESS specific about exactly what the bugs in my code are? 'Cause I don't think it's possible.
NPM: can't install a package for shit. Tried it waaaay too many times.
Makefiles: Just fuck you.
WSL1: breaks more often than a glass hammer.
Python >= 3.6: FUCK ENCODINGS!!
Jupyter: STOP MESSING UP WHILE SAVING!
Living is to collect bugs, it seems.
-
Hmm... recently I've seen an increase in the idea of raising security awareness at the user level... but really now, it gets me thinking: why not raise security awareness at the coding level? Just having one guy do encryption and encoding most certainly isn't enough for an app to be considered secure. In this day and age, where most apps are web based and some of them even open source, I think that first of all it should be our duty to protect the customer/consumer rather than make him protect himself. Most of everyone knows how to get user input from the UI, but how many out here actually think that the normal dummy user might unintentionally type malicious input which would break the app or give him access to something he shouldn't be allowed into? I've seen very few developers/software architects/engineers actually take the blame for insecure code. I've seen people build apps starting from an idea that is unacceptable security-wise and then, in the end, think of patching in filters, encryptions, encodings, tokens, and days before release realise that their app is half broken because they didn't start the whole project in a more secure way for the user.
Just my two cents... we as devs should be more aware of coding in a way that makes apps more secure from and for the user, rather than saying that we had some epic mythical hackers pull all the user tables (that also contained unhashed, unencrypted passwords) by using magix. It certainly isn't magic, it's just our bad coding that lets outside code interact with our own code.
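To make it concrete: even the boring basics cover most of that. A sketch (PDO; the table, columns and DSN are made up):

<?php
// Bake security in from the start instead of patching it in before release:
// store only password hashes and keep user input out of the SQL string itself.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'app', 'secret'); // placeholder DSN

// Registration: never store the plain password.
$hash = password_hash($_POST['password'], PASSWORD_DEFAULT);
$stmt = $pdo->prepare('INSERT INTO users (email, password_hash) VALUES (?, ?)');
$stmt->execute([$_POST['email'], $hash]);

// Login: the user-supplied email is bound as a parameter, never concatenated.
$stmt = $pdo->prepare('SELECT password_hash FROM users WHERE email = ?');
$stmt->execute([$_POST['email']]);
$row = $stmt->fetch();
$ok  = $row && password_verify($_POST['password'], $row['password_hash']);
-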
Co-worker: "I'm getting files from customers which are in multiple types and encodings. Is there a way to force them to send only us-ascii so my tool gets only one kind which is easier to parse?"
Me: "Welcome to programming with arbitrary sources; you'll just have to build your tools to figure it out and not break (just rely on dozens of existing tools which you refuse to use)"1 -
Given the Base64 flying around devRant, I figured I'd introduce the unknowing to a lovely little tool for dealing with various formats and encodings: CyberChef (online).
https://gchq.github.io/CyberChef/
I honestly don't remember how I lived without it. Enjoy!
-
It's always fucking file encodings breaking these stupid fucking legacy projects. FUCK
Fuck this, I'm going to bed. I'll fix it in the morning; it's a little past midnight.
-
Here's some research into a new LLM architecture I recently built and have had actual success with.
The idea is simple, you do the standard thing of generating random vectors for your dictionary of tokens, we'll call these numbers your 'weights'. Then, for whatever sentence you want to use as input, you generate a context embedding by looking up those tokens, and putting them into a list.
Next, you do the same for the output you want to map to, lets call it the decoder embedding.
You then loop, and generate a 'noise embedding', for each vector or individual token in the context embedding, you then subtract that token's noise value from that token's embedding value or specific weight.
You find the weight index in the weight dictionary (one entry per word or token in your token dictionary) that's closest to this embedding. You use a version of cuckoo hashing where similar values are stored near each other, and the canonical weight values are actually the key of each key:value pair in your token dictionary. When doing this you align all the randomly numbered keys in the dictionary (a uniform sample from 0 to 1) and look at the hamming distance between the context embedding + noise embedding (called the encoder embedding) and the canonical keys, with each digit from left to right being penalized by some factor f (because digits further left carry larger magnitudes), and then penalize or reward based on the numeric closeness of any given individual digit of the encoder embedding at the same index of any given weight i.
You then substitute the canonical weight in place of this encoder embedding, look up that weights index in my earliest version, and then use that index to lookup the word|token in the token dictionary and compare it to the word at the current index of the training output to match against.
Of course by switching to the hash version the lookup is significantly faster, but I digress.
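To make the lookup concrete, here's a toy version of just that step (scalar weights and a plain nearest-value search instead of the cuckoo-hash / penalized-hamming scheme, so it's the idea only, nowhere near the real thing):

<?php
// Token dictionary: one random canonical weight in [0, 1) per token.
$tokens  = ['the', 'cat', 'sat', 'on', 'mat'];
$weights = [];
foreach ($tokens as $t) {
    $weights[$t] = mt_rand() / mt_getrandmax();
}

// Context embedding: look up each input token's weight.
$input   = ['the', 'cat', 'sat'];
$context = array_map(fn ($t) => $weights[$t], $input);

$decoded = [];
foreach ($context as $i => $w) {
    // Noise embedding: one value per context position, subtracted from the weight.
    $noise   = mt_rand() / mt_getrandmax() * 0.1;
    $encoder = $w - $noise;

    // Snap the encoder value back to the nearest canonical weight...
    $best     = null;
    $bestDist = PHP_FLOAT_MAX;
    foreach ($weights as $token => $canonical) {
        $d = abs($canonical - $encoder);   // stand-in for the penalized hamming metric
        if ($d < $bestDist) {
            $bestDist = $d;
            $best     = $token;
        }
    }
    // ...and use that weight's token as the prediction, to compare against
    // the training output at this position.
    $decoded[] = $best;
}
print_r($decoded);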
That introduces a problem.
If each input token matches one output token how do we get variable length outputs, how do we do n-to-m mappings of input and output?
One of the things I explored was using pseudo-Markovian processes, where there's one node, A, with two links to itself, B and C.
B is a transition matrix, and A holds its own state. At any given timestep, A may use either the default transition matrix (training data encoder embeddings) with B, or it may generate new ones, using C and a context window of A's prior states.
C can be used to modify A, or it can be used to as a noise embedding to modify B.
A can take on the state of both A and C or A and B. In fact we do both, and measure which is closest to the correct output during training.
What this *doesn't* do is give us variable length encodings or decodings.
So I thought a while and said, if we're using noise embeddings, why can't we use multiple?
And if we're doing multiple, what if we used a middle layer, let's call it the 'key', took its mean over *many* training examples, and used it to map from the variance of an input (query) to the variance and mean of a training or inference output (value)?
But how does that tell us when to stop or continue generating tokens for the output?
Posted on pastebin if you want to read the whole thing (DR wouldn't post for some reason).
In any case I wasn't sure if I was dreaming or off in left field, so I went and built the damn thing, the autoencoder part anyway; I wasn't even sure I could, but I did, and it just works. I'm still scratching my head.
https://pastebin.com/xAHRhmfH
-
FUCK YOU MICROSOFT
GO FIX YOUR FUCKIN C# METHODS
Language felt good but jesus fuckin christ.
HOW can your File.Exists() be so retarded, jesus fuckin christ.
I mean god, how retarded can it be when I obtain the current directory with your built-in method (System.Environment.CurrentDirectory), attach to it the directory name with the images I need, and I ALWAYS GET FALSE FOR ITEMS THAT ARE FUCKIN THERE.
Fix your fuckin encodings too, suckers.
-
Microsoft certsrv is returning UTF-8 on the authorization error page but UTF-16 when logging in via basic auth...
Debugged this for 2 hours today to parse the response correctly. Thanks, Microsoft.
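For anyone who hits the same thing: it comes down to sniffing the body before treating it as text. A PHP-flavored sketch of that kind of heuristic (the actual tool is something else, and the BOM / NUL-byte sniffing here is just an assumption of what works):

<?php
// certsrv: UTF-8 on the error page, UTF-16 after basic auth, so sniff first.
function certsrvBodyToUtf8(string $body): string
{
    if (strncmp($body, "\xFF\xFE", 2) === 0) {            // UTF-16LE BOM
        return mb_convert_encoding(substr($body, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($body, "\xFE\xFF", 2) === 0) {            // UTF-16BE BOM
        return mb_convert_encoding(substr($body, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strpos($body, "\0") !== false) {                  // no BOM but NUL bytes: assume UTF-16LE
        return mb_convert_encoding($body, 'UTF-8', 'UTF-16LE');
    }
    return $body;                                         // already UTF-8
}
-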
What is the problem with colleges nowadays? So many of the programmers I know know absolutely nothing about encodings and such.
It's pretty nice to see people resorting to hexadecimal entities to show their Latin characters in HTML instead of simply learning about bloody UTF-8.
-
I was in my IDE writing up a new program, but my compiler didn't accept the encoding format of the files. Because I used special characters, it flipped a lot of encodings around. Then I changed my project's master encoding, and all of the source files got messed up. All the files were composed of "????????". I went to a backup and they were messed up too. I exploded... Ctrl+Z couldn't even be my hero this time.
-
Fuck PowerShell.
How in the world can one language have so many problems with date conversion and chars like üöä? It's so funny how they get through just fine sometimes when you specify UTF-8 encoding, but other times it has no idea what these chars are.
Honestly, fuck PowerShell.