
I'm back from the dead to rant again. This time it's punycode.

My job involves processing the Common Crawl web archives, and for some reason one in 20,000,000 archived web pages crashed my program. After some debugging I found this issue, which seems to be the reason my code crashes: https://github.com/servo/rust-url/...

To summarize the issue: thanks to Punycode, Unicode characters can be encoded into domain names, but not every character is allowed. Not only do these invalid domains get registered anyway, I also need in-depth knowledge of Unicode to understand what is wrong with them.
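
For anyone who hasn't run into this: here is a rough sketch of what the encoding looks like and how a crawler could skip bad hostnames instead of crashing. It uses Python's built-in `idna` codec rather than rust-url, and that codec follows the older IDNA 2003 rules, so it will not accept or reject exactly the same names as rust-url does. The helper name `to_ascii_hostname` and the sample hostnames are made up for illustration.

```python
from typing import Optional


def to_ascii_hostname(host: str) -> Optional[str]:
    """Return the ASCII ("xn--") form of a hostname, or None if it is rejected.

    Hypothetical helper: it relies on Python's standard "idna" codec
    (IDNA 2003), which is an assumption and not what rust-url implements.
    """
    try:
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels and for characters the codec
        # prohibits; the caller can skip the record instead of crashing.
        return None


# A label with a non-ASCII character gets the "xn--" prefix plus Punycode.
print(to_ascii_hostname("bücher.example"))       # xn--bcher-kva.example

# The raw Punycode step on its own, without the "xn--" prefix.
print("bücher".encode("punycode"))               # b'bcher-kva'

# A label longer than 63 characters is rejected, so we get None back.
print(to_ascii_hostname("a" * 64 + ".example"))  # None
```

Returning `None` pushes the decision (skip, log, count) to the caller, which is probably what you want when one bad record in millions shouldn't take the whole job down.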

How did we turn domain names into something so complicated?

Comments
  • 4
    Language is complicated, and there are plenty of people who, for whatever reason, want non-English domains.
  • 1
    Oh. Complicated?

    Easy.

    This is a very bad idea, sir...

    But I want my Poop Emoji!!!! in Unicode!!!! everywhere!!!!!!!!!
  • 1
    @SortOfTested I agree, but I wonder whether we allowed too many Unicode characters at once. Now we have homoglyph attacks and a far more complex definition of a domain name.