UTF-8 and Unicode Standards :: What is UTF-8?
UTF-8 stands for Unicode Transformation Format-8. It is an octet-based (8-bit), lossless encoding of Unicode characters.
UTF-8 encodes each Unicode character as a sequence of one to four
octets, where the number of octets depends on the integer value (the
code point) assigned to the character. It is an efficient encoding for
Unicode documents that use mostly US-ASCII characters, because it
represents each character in the range U+0000 through U+007F as a
single octet. UTF-8 is the default encoding for XML.
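As a rough illustration of the variable-length scheme, the Python sketch below encodes a few sample characters and reports how many octets each one takes. The helper name `utf8_length` and the sample characters are chosen for this example only; the ranges in the comments follow RFC 3629.

```python
def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))

# One sample from each length class defined in RFC 3629:
#   U+0000..U+007F     -> 1 octet
#   U+0080..U+07FF     -> 2 octets
#   U+0800..U+FFFF     -> 3 octets
#   U+10000..U+10FFFF  -> 4 octets
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} octet(s), hex {encoded.hex()}")
```

Decoding the octets again with `bytes.decode("utf-8")` gives back the original characters, which is what "lossless" means here.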
Standards
- RFC 3629: UTF-8, a transformation format of ISO 10646. November 2003.
- The Unicode Standard 5.0, November 2006. [purchase from Amazon.com]
  - In particular, see the informal description of UTF-8 in sections 2.5 and 2.6, pages 30-32, and a much more formal definition in sections 3.9 and 3.10, pages 77-81.
Articles and background reading
- UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn
- Forms of Unicode, an excellent overview by Mark Davis
- Wikipedia UTF-8 contains a good discussion of why five- and six-octet sequences are now illegal UTF-8
- Unicode Transformation Formats [czyborra.com]
- Unicode UTF-8 FAQ
- Unicode in XML and other Markup Languages: Unicode Technical Report #20
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an amusing and informative article by Joel Spolsky
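To see the restriction on five- and six-octet sequences in practice, here is a small sketch that relies on Python's built-in strict UTF-8 decoder, which rejects such sequences. The function name `is_valid_utf8` and the sample byte strings are assumptions made for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly with Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A well-formed three-octet sequence (U+20AC).
print(is_valid_utf8(b"\xe2\x82\xac"))

# Lead octets 0xF8 and 0xFC once introduced five- and six-octet
# sequences; RFC 3629 no longer allows them.
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))
print(is_valid_utf8(b"\xfc\x84\x80\x80\x80\x80"))

# A four-octet sequence that would encode a value above U+10FFFF.
print(is_valid_utf8(b"\xf4\x90\x80\x80"))
```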
Character Sets
The MIME character set name (the charset parameter) for UTF-8 is UTF-8.
Character set names are case-insensitive, so utf-8 is equally
valid. [IANA Character Sets].
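Many libraries treat encoding names the same way. As an analogy only (this is Python's codec registry, not the IANA registry itself), the sketch below looks up a few differently cased spellings and shows that they resolve to the same codec.

```python
import codecs

# Differently cased spellings of the same character set name.
labels = ("UTF-8", "utf-8", "Utf-8")

# codecs.lookup() normalizes the label and returns a CodecInfo object
# whose .name attribute is the canonical codec name.
names = {codecs.lookup(label).name for label in labels}

for label in labels:
    print(label, "->", codecs.lookup(label).name)
```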
In an HTML file, place this tag inside <head> … </head>:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
In an XML prolog, the encoding is typically specified as an attribute:
<?xml version="1.0" encoding="UTF-8" ?>
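A quick way to check how a parser treats the declaration is to feed it UTF-8 bytes with and without the prolog. The sketch below uses Python's standard `xml.etree.ElementTree` module; the element name `p` and the sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"  # contains one non-ASCII character (U+00E9)

# The same document as UTF-8 bytes, once with an explicit encoding
# declaration and once with no prolog at all.
with_decl = f'<?xml version="1.0" encoding="UTF-8" ?><p>{text}</p>'.encode("utf-8")
without_decl = f"<p>{text}</p>".encode("utf-8")

parsed_with = ET.fromstring(with_decl).text
parsed_without = ET.fromstring(without_decl).text

print(parsed_with == text)
print(parsed_without == text)
```

Both documents parse to the same text, because a parser that sees no declaration falls back to UTF-8.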
In the Apache server config or an .htaccess file, this directive adds
the charset parameter to the Content-Type HTTP header for text/html
and text/plain content:
AddDefaultCharset UTF-8
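On the receiving side, a client reads the character set back out of the Content-Type value. The following is a minimal sketch using Python's standard `email.message` module; the helper name `charset_from_content_type` and the sample header values are assumptions for illustration, not output captured from Apache.

```python
from email.message import Message
from typing import Optional

def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html;charset=UTF-8"))
print(charset_from_content_type("text/plain"))
```

The second sample value has the same shape as the content attribute of the HTML meta tag shown earlier.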
This entry was posted on Saturday, March 3rd, 2007 at 1:36 pm and is filed under I Love Tech, I Love Programming!, Linux/Unix/GNU.
