
why utf8 is important

May 28th, 2008 | 1 Comment | Posted in anything under the moonlight, developer's tools by dreamluverz

Advantages

Here are several advantages of UTF-8:

  • UTF-8 can be read and written quickly using only bit-mask and bit-shift operations.
  • Comparing two UTF-8 char strings in C/C++ with strcmp() gives the same ordering as wcscmp() on wide strings holding the same code points, so lexicographic sorting and tree-search order are preserved.
  • Bytes FF and FE never appear in UTF-8 output, so they can be used to indicate UTF-16 or UTF-32 text (see BOM).
  • UTF-8 is byte-order independent. The byte order is the same on all systems, so it doesn't actually require a BOM.
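To make the first two points concrete, here is a minimal sketch in Python of encoding and decoding a single code point with nothing but masks and shifts. The function names encode_utf8 and decode_first are made up for this illustration, and the decoder assumes well-formed input; a real decoder would also validate continuation bytes and reject overlong forms.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point using only masks and shifts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_first(data: bytes) -> tuple:
    """Decode the first code point of well-formed UTF-8.

    Returns (code point, number of bytes consumed).
    """
    b0 = data[0]
    if b0 < 0x80:
        return b0, 1
    if b0 >> 5 == 0b110:
        length, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:
        length, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:
        length, cp = 4, b0 & 0x07
    else:
        raise ValueError("not a UTF-8 lead byte")
    for b in data[1:length]:
        cp = (cp << 6) | (b & 0x3F)   # append the 6 payload bits
    return cp, length


# Sorting by raw UTF-8 bytes gives the same order as sorting by code point.
words = ["z", "\u00e9", "\u20ac", "a", "\U0001F600"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_code_point == by_utf8_bytes)
```

The last few lines check the sorting claim in Python terms: str comparison goes by code point, bytes comparison goes byte by byte, and both produce the same order.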

Disadvantages



UTF-8 has several disadvantages:

  • You cannot determine the number of bytes of a UTF-8 text from its number of Unicode characters, because UTF-8 is a variable-length encoding.
  • It needs at least 2 bytes for non-ASCII characters that extended ASCII char sets encode in just 1 byte.
  • ISO Latin-1 is a subset of Unicode as a character set, but its byte encoding is not a subset of UTF-8: Latin-1 bytes above 7F are not valid UTF-8 on their own.
  • The 8-bit chars of UTF-8 are stripped by many mail gateways because Internet messages were originally designed as 7-bit ASCII. The problem led to the creation of UTF-7.
  • UTF-8 uses byte values of the form 100xxxxx (80 to 9F) in more than 50% of its representation, but existing implementations of ISO 2022, 4873, 6429, and 8859 systems mistake these for C1 control codes. The problem led to the creation of UTF-7,5.

source: http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/
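The length and Latin-1 points are easy to see in a few lines of Python. This is only an illustration, not something from the article cited above; byte_lengths and is_valid_utf8 are hypothetical helper names, and the sample text is assumed to contain only characters that Latin-1 can represent.

```python
def byte_lengths(text: str) -> dict:
    """Count characters and encoded bytes for a Latin-1-representable string."""
    return {
        "characters": len(text),                 # Unicode code points
        "utf-8": len(text.encode("utf-8")),      # 1 to 4 bytes per character
        "latin-1": len(text.encode("latin-1")),  # always 1 byte per character
    }


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "na\u00efve caf\u00e9"          # two accented letters, ten characters
print(byte_lengths(sample))

latin1_bytes = sample.encode("latin-1")  # the accented letters become EF and E9
print(is_valid_utf8(latin1_bytes))       # Latin-1 bytes are not valid UTF-8
print(is_valid_utf8(sample.encode("utf-8")))
```

Each accented letter costs one byte in Latin-1 and two in UTF-8, so the same ten characters take ten bytes in one encoding and twelve in the other.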

Why is it Important?

UTF-8 is an important encoding for the following reasons:

  • ASCII compatible
  • easily supported
  • compact and efficient for most scripts
  • easily processed, unlike other multibyte encodings

source: http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html
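A small Python sketch of what "ASCII compatible" and "easily processed" can mean in practice. This is my own reading of those two bullets, not code from the article above, and the helper names are invented: pure ASCII text is byte-for-byte identical in UTF-8, and character boundaries can be found by looking at the top two bits of each byte.

```python
def is_ascii_compatible(text: str) -> bool:
    """For pure ASCII text, the ASCII and UTF-8 encodings are the same bytes."""
    return text.encode("ascii") == text.encode("utf-8")


def char_starts(data: bytes) -> list:
    """Indices of bytes that begin a character in well-formed UTF-8.

    Continuation bytes always look like 10xxxxxx, so any byte that does
    not match that pattern starts a new character.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


print(is_ascii_compatible("Hello, world"))

data = "a\u20acb".encode("utf-8")   # 'a', the euro sign (3 bytes), 'b'
print(char_starts(data))
```

Because of that bit pattern, a program can land anywhere in a UTF-8 buffer and find the start of the nearest character by skipping at most three continuation bytes.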
