What a time to be alive (lemmy.sdf.org)

Sure. OK. How about we put the Greek alphabet at the lower code points and the Latin alphabet higher up, and now you might argue that Latin takes up more space than necessary.

Potential counterpoint: “This is stupid. Latin goes in the lower code points, it always has, it always will. Who’s putting Greek down there??”

Well, if Greece had invented computing as well as, let’s say, democracy, that’s very likely how things would be.

In that timeline, someone is using exactly the same line on you “[The representation of Latin text in memory i]s as long as it needs to be unique.” and you’re annoyed because your short letter to Grandma is using far too much space on your hard drive.

TheHarpyEagle@lemmy.world

4 points

1 year ago

Genuine question: how many applications are bottlenecked by the size of text files? I understand your analogy, but even a doubling in size of all your UTF-8-encoded files would likely be dwarfed by all the other binary data on your machine, right?

lowleveldata@programming.dev

0 points

1 year ago

Oh true. I’d be so annoyed because I somehow wrote a whole letter to Grandma in English which she couldn’t read.

whileloop@lemmy.world

85 points

1 year ago

This is a joke, right? This feels like a very dumb solution. I don’t know much about UTF-8 encoding, but it sounds like Roman characters can be encoded shorter than most or all others because of a shorthand that assumes Roman characters. In that case, why not take that functionality and let a UTF-8 block specify which language makes up most of the text so that you can have that savings almost every time? I don’t see why one would want it to be random.

datavoid@lemmy.ml

1 point

1 year ago

Deleted by creator

alvvayson@lemmy.world

127 points

1 year ago

It’s a joke.

UTF-16 already exists, which doesn’t favor Roman characters as much, but UTF-8 is more popular because it is backward compatible with legacy ASCII.

UTF-32 also exists, which uses the same four bytes for every character.
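
For anyone who wants to check those sizes, here is a minimal Python sketch. The two sample words and the helper name `encoded_sizes` are just illustrations picked for this example, not anything from the thread. It uses the `-le` codec variants so that Python does not prepend a byte-order mark to the UTF-16 and UTF-32 output.

```python
# Compare how many bytes the same short word takes in each encoding.
# The sample words are arbitrary; any Latin or Cyrillic text works.
SAMPLES = {
    "latin": "hello",
    "cyrillic": "привет",
}


def encoded_sizes(text):
    """Return the encoded length in bytes for UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants skip the byte-order mark that plain
    # "utf-16" and "utf-32" would add in front of the data.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


for name, word in SAMPLES.items():
    print(name, len(word), "characters:", encoded_sizes(word))

# ASCII text is byte-for-byte identical in UTF-8, which is the
# backward compatibility mentioned above.
print("hello".encode("ascii") == "hello".encode("utf-8"))
```

The Latin word costs one byte per letter in UTF-8 and the Cyrillic word costs two, while UTF-32 spends four bytes on every letter of both.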

But the thing that equalizes languages is compression.

Yes, a text written in Cyrillic takes more space in UTF-8 than a text in a Roman-alphabet language, easily double, since each Cyrillic letter needs two bytes. However, this extra space is much more easily compressed by an algorithm like gzip.

So after compression, the two texts will be similarly sized, and much smaller than UTF-16 or UTF-32.
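
A rough way to try this out yourself, as a sketch rather than a benchmark: the snippet below encodes an English and a Russian sample in each encoding and prints the raw and gzip-compressed sizes. The sample sentences, the repeat count and the helper name `sizes` are assumptions made up for the example. Repeating one sentence compresses far better than real prose would, so swap in real documents before drawing any conclusions from the numbers.

```python
import gzip

# Toy inputs: two pangram sentences, repeated to get a longer text.
# Real prose is much less repetitive and will compress less well.
ENGLISH = "The quick brown fox jumps over the lazy dog. "
RUSSIAN = "Съешь же ещё этих мягких французских булок, да выпей чаю. "
REPEAT = 200


def sizes(text, encoding):
    """Return (raw_bytes, gzip_bytes) for text in the given encoding."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw))


for label, sentence in (("english", ENGLISH), ("russian", RUSSIAN)):
    text = sentence * REPEAT
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        raw, packed = sizes(text, encoding)
        print(f"{label:8} {encoding:10} raw={raw:7} gzip={packed:6}")
```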

jmcs@discuss.tchncs.de

19 points

1 year ago

Besides, most text on the average computer is either within some configuration file (which tends to use Latin script), or within some SGML-derived format which has a bunch of Latin characters in it. For network transmission, most things will use HTML, XML or JSON and use English-language property names even in countries that don’t speak English (see Yandex’s and Baidu’s APIs for example).

No one is moving large amounts of .txt files around.

Buckshot@programming.dev

27 points

1 year ago

You’ve never worked in finance then. All our systems at work do nothing but move large amounts of txt files around.

That said, many of our clients still don’t support UTF-8, so it’s all ASCII and non-Latin alphabets are screwed. They can’t even handle characters 128-255, so even stuff like £ is unsupported.
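
To make the £ problem concrete, here is a small Python sketch. It assumes "characters 128-255" refers to an 8-bit extension of ASCII such as Latin-1 (the comment does not say which one), and `try_encode` is a hypothetical helper written for this example.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# 7-bit ASCII stops at code 127, so the pound sign has no encoding.
print(try_encode("£", "ascii"))

# Latin-1 (an assumed example of an 8-bit code page) stores it as the
# single byte 0xA3, which is 163 and so falls in the 128-255 range.
print(try_encode("£", "latin-1"))

# UTF-8 needs two bytes for the same character.
print(try_encode("£", "utf-8"))
```

A system that only accepts bytes 0-127 would reject both the Latin-1 and the UTF-8 form, since every byte involved is above 127.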

Programmer Humor

!programmerhumor@lemmy.ml
