stl

Ranter

lorentz

15165

Comments

3

IntrusionCM

13947

2y

https://devrant.com/rants/7682021/...

I'm still a bit confused, but from my understanding a string would be UTF-8 (which implies the existence of code points)...

So substring in string is a regular loop checking char by char I'd guess?

It's the easiest way.

Plus maybe the typical optimizations, for example validating length.

If length(substring) > length(string) - false

if remainingChars < length(substring) - false

Et cetera
0

lorentz

15165

2y

@IntrusionCM It's two nested loops, other than that the only complication is that they're both clonable reversible iterators and not indices because I iterate over graphemes produced by unicode_segmentation and not code points
1

lorentz

15165

2y

PS: it's not even that because chars() is also an iterator. I'm literally just struggling to write a subsequence testing loop.
0

lorentz

15165

2y

another issue is that I always try to optimize prematurely so my first attempt was to find the rarest element in the needle and then search the haystack for that element.
2

Fast-Nop

36817

2y

@lorentz The nice thing about UTF-8 is that you don't have to care about code points for many string operations. Just treat them as bytes.

It's only when you want to e.g. count letters that this won't fly anymore.
0

IntrusionCM

13947

2y

@lorentz hm...

HMMMMMMMMM....

I'm really not sure if I like that idea regarding strings.

It just sounds like a rather complicated attempt to fix a minor inconvenience.

Let's take an example: The turkish letter i problem. It's not only in the Turkish language present, but known mostly under that name.

http://en.wikipedia.org/wiki/...

Leading to e.g. this:
https://lucene.apache.org/core/...

If I get it right, you would just convert it down to latin I despite having a special meaning?

It reminds me a lot of the - nowadays - very hated decision of MySQL to strip UTF8 to a max of 3 bytes. They took out the 4th dataplane.

Which leads to the wonderful world of pain:
utf8 is actually utf8mb3
utd8mb4 is the "real" utf8

I know a lot of people (especially as I come from a PHP background) who are freaked out by UTF-8 and the difference between string length vs byte length.

But imho that's wrong. It doesn't hurt nor is it bad. Just use it appropriately and don't try to be overly clever...

That's imho why most people struggle with UTF-8... They worry about problems that don't really exist.

Unless you have extremely large strings (talking about dozens of megabytes plus), there is neither a reason to worry about processing a string nor about finding out the (byte-) length of a string.

Most compilers / environments have picked up heavily SIMD optimizations... Making performance problems really a thing of the past, especially
https://en.m.wikipedia.org/wiki/...
AVX.
0

Wisecrack

9212

2y

@lorentz means it's time to take a break, go do something else, or even go to sleep/nap.

I struggled the night before to write an eight line function doing some simple filtering.

Came back in the morning, and it literally took me two minutes to write it, but free first try.

It's almost always your brain saying its had enough.
0

lorentz

15165

2y

@IntrusionCM I think you misunderstand my stance. Unicode encodes text in codepoints. UTF-8 encodes codepoints in bytes. Historically, programs handled strings by the byte. Today, programs handle strings by the codepoint. My plan is to handle strings by the grapheme, which is a sequence of codepoints describing a unit of human writing. A grapheme usually consists of a base codepoint followed by various modifiers or ZWNJ+codepoint pairs. Graphemes usually coincide with font glyphs.

This is different from handling code points in that eg. Mandarin graphemes that consist of mostly two codepoints would not be cut in half when transforming strings. For the most part they would behave exactly like codepoint-sequences other than addressing this specific class of bugs.

1/2
1

lorentz

15165

2y

2/2

The drawback is that because Unicode places no limitations on the number of modifiers (Zalgo characters can be several kilobytes) the underlying datatype of char has to be an unbounded string as opposed to 4 byte codepoints.
1

lorentz

15165

2y

@IntrusionCM The statement of the rant you linked is basically just that since string processing is already variable width to handle code points, handling graphemes is marginally more complex and eliminates a new class of visual glitches such as mandarin letters getting cut in half by trimming or line breaks.
1

IntrusionCM

13947

2y

@lorentz yeah I misunderstood the grapheme part.

You're essentially representing a composed unicode character as a single character though it is actually a character consisting of multiple characters?
1

lorentz

15165

2y

@IntrusionCM That is exactly what I'm doing. Pretty much the same features as provided by the unicode_segmentation crate, but it's the default behaviour in the standard string library.
0

Fast-Nop

36817

2y

@lorentz Why don't you just break down the grapheme that you look for into its UTF-8 representation and proceed on raw byte level from there? Do identical graphemes have multiple possible representations?
2

lorentz

15165

2y

@Fast-Nop Since both are clonable, reversible iterators of equality-comparable elements, I don't have to do anything extra. My brain just stopped working temporarily.
1

lorentz

15165

2y

@Fast-Nop Most graphemes do have multiple representations, but I"m not handling that because the closest well-defined concept is "similar character" which is not an equivalence relation, and I don't want to complicate my API quite that much
1

lorentz

15165

2y

I'm not opposed to the idea of defining some equivalent sets of graphemes for the purpose of text processing, but it's a massive complication for relatively little gain so it's definitely not high priority.
2

lorentz

15165

2y

Also, messing with the concept of identical text is a massive security risk, eg. it may allow a bad actor to register under a username which is treated as a distinct account by surrounding code but has the same rights as another account if the validation is done within Orchid.

It's not even specific to strange characters either, here's a list of look-alikes to Latin characters. The first entry in several rows is so similar that any sensible notion of identical characters would need to include it:

https://gist.github.com/StevenACoff...
1

CoreFusionX

3450

2y

@lorentz

Don't forget about Unicode normalization.

Some codepoint + modifier(s) combos exist as standalone codepoints.

Without normalization, naive codepoint comparison would sometimes fail.
0

lorentz

15165

2y

@CoreFusionX Yeah but if it fails everywhere else uniformly and only succeeds in Orchid that could easily create security vulnerabilities even.
3

lorentz

15165

2y

Now I feel like I have to announce that I did finally manage to write the god damn motherfucking find function.

It's probably not the most efficient find function ever, but like, it's in Rust, it will be called from an interpreted language, and I have tens to hundreds of functions to write before the language is even remotely usable, so optimization can wait.
0

SidTheITGuy

9639

2y

Hah! I can relate. I often find myself googling "How to iterate over an array javascript"
1

AleCx04

27017

2y

I think I know what you mean, I had a similar issue with my own language when I was writing it, the first thing to keep in mind is to <dead>
0

lorentz

15165

2y

Update on this, there is apparently a well-defined concept of equivalent graphemes for Unicode which is an equivalence relation and a lot of languages use it, so that's coming too.

I now feel validated in my quest for text processing exclusively via the public API of Unicode and without acknowledging the encoding.
0

lorentz

15165

2y

That being said, I refuse to acknowledge locale as a property of the environment so I'm on the fence as to whether adding a locale property to strings or supporting only locale-independent algorithms is the less bad option
0

IntrusionCM

13947

2y

@lorentz

Why do you want to not support Locales?
0

lorentz

15165

2y

@IntrusionCM I do want to support some sort of localization, I just think that locale is a property of the text.
0

lorentz

15165

2y

Elixir beat me to it