UTF-8 Rocks

2004-10-24

Something I found out recently: the guys who invented UTF-8 (a Unicode encoding) and Shift-JIS (a Japanese encoding) were pretty smart. They made sure it would be nearly impossible to have problems using either of them in a system that mostly assumes ASCII but otherwise doesn't care what's in a string.

By that I mean, for example, that Japanese and Unicode have all kinds of characters that take 2 or more bytes to represent. This is just a made-up example, but let's say the code for 恋 was 9122 (hex). If you put that in an ASCII program that was looking for quote marks ("), it would see the 22 as an ASCII quote and mess up. As another example, let's say the code for 愛 was 943C. That 3C would be a <, and < is used in HTML for webpages. It would really mess up your pages.

Well, in both cases, UTF-8 and Shift-JIS, the designers made sure that could never happen. UTF-8 uses only bytes 80-FF for any character with a code greater than 7F. That means all ASCII is unaffected, and it also means there is no possible way for misinterpreted punctuation to be hidden inside the codes for other characters. Shift-JIS is a little looser: the first byte of a 2-byte character is always 81 or higher, but the second byte falls in the range 40-FC (skipping 7F). That still avoids most punctuation, including " (22) and < (3C), although there are still a few in there, such as \ (5C) and | (7C).
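Here's a quick way to check those byte ranges yourself. This is only a small sketch in Python using its built-in "utf-8" and "shift_jis" codecs; the helper name and the sample string are made up for illustration.

```python
def non_ascii_bytes_are_high(text):
    """Return True if every UTF-8 byte of every non-ASCII character is 0x80 or above."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )

# Illustrative sample mixing Japanese with ASCII punctuation.
sample = '恋 and 愛 say "hi" <b>'
utf8 = sample.encode("utf-8")

# The only 0x22 and 0x3C bytes in the UTF-8 data are the real " and < characters.
quotes_match = utf8.count(b'"') == sample.count('"')
angles_match = utf8.count(b"<") == sample.count("<")

# Shift-JIS is not as strict: the second byte of the katakana ソ is 0x5C,
# which is the same byte as an ASCII backslash.
sjis = "ソ".encode("shift_jis")
has_backslash_byte = b"\\" in sjis

print(non_ascii_bytes_are_high(sample), quotes_match, angles_match, has_backslash_byte)
```

So a byte-level search for a quote or a < in UTF-8 data only ever finds the real thing, while the same search for a backslash in Shift-JIS data can land in the middle of a character.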

As for UTF-8, every character with a code above 7F is written as a lead byte that says how many bytes follow, plus one to three continuation bytes. That means basically all Japanese, Korean and Chinese gets turned into at least 3 bytes per character in UTF-8 instead of 2. I know some people have a slight issue with that, since it means, for example, that Japanese text stored in UTF-8 will take about 50% more space than in an older Japanese-only format. But UTF-8 solves a huge problem: most non-Unicode-aware programs should be able to pass UTF-8 text through their systems untouched. Any punctuation, control characters or keywords they are looking for will never be mistaken for part of another character. What a relief!
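To make both points concrete, here is a rough sketch in Python. The split_fields function is a hypothetical stand-in for an ASCII-only program that works on raw bytes and knows nothing about Unicode, and the sample strings are just illustrations.

```python
def split_fields(raw):
    """Hypothetical ASCII-only parser: splits raw bytes on commas, nothing more."""
    return raw.split(b",")

# The byte-level parser never sees a false comma inside a multi-byte character,
# so every field decodes cleanly afterwards.
record = "恋,愛,love".encode("utf-8")
fields = [f.decode("utf-8") for f in split_fields(record)]

# Size trade-off: these 8 characters take 3 bytes each in UTF-8
# and 2 bytes each in Shift-JIS.
text = "日本語のテキスト"
utf8_size = len(text.encode("utf-8"))
sjis_size = len(text.encode("shift_jis"))

print(fields, utf8_size, sjis_size)
```

The 24 bytes versus 16 bytes is where the 50% figure comes from. Text that mixes in a lot of ASCII (HTML markup, for example) would see a smaller difference.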

For more info here's a good place to start.
