IAintaBlonde.com

UTF-8 and Unicode Standards :: What is UTF-8?

UTF-8 stands for Unicode Transformation Format-8. It is an octet-based (8-bit), lossless encoding of Unicode characters.

UTF-8 encodes each Unicode character as a variable number of 1 to 4
octets, where the number of octets depends on the integer value assigned
to the Unicode character. It is an efficient encoding of Unicode
documents that use mostly US-ASCII characters because it represents each
character in the range U+0000 through U+007F as a single octet. UTF-8
is also the encoding an XML processor assumes when a document carries no
encoding declaration and no byte order mark.
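
The variable-length behaviour described above can be checked with a short Python sketch. The sample characters and the helper name utf8_length are illustrative choices, not part of any standard; the sketch encodes one character from each length class and counts the resulting octets.

```python
# Sample characters, one per UTF-8 length class (chosen for illustration):
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1D11E (musical G clef)
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001D11E"]


def utf8_length(ch: str) -> int:
    """Return the number of octets in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


# Map each code point (as U+XXXX) to its encoded length in octets.
lengths = {f"U+{ord(ch):04X}": utf8_length(ch) for ch in samples}
for code_point, octets in lengths.items():
    print(code_point, octets)

# Lossless: decoding the octets gives back exactly the original text.
text = "".join(samples)
round_trip_ok = text.encode("utf-8").decode("utf-8") == text
```

Characters up to U+007F take one octet, which is why mostly US-ASCII documents stay compact in UTF-8.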

Standards

RFC 3629: UTF-8, a transformation format of ISO 10646. November 2003.
The Unicode Standard 5.0, November 2006. In particular, see the informal description of UTF-8 in sections 2.5 and 2.6, pages 30-32, and a much more formal definition in sections 3.9 and 3.10, pages 77-81.
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard

Articles and background reading

UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn
Forms of Unicode, an excellent overview by Mark Davis
Wikipedia UTF-8 contains a good discussion of why five- and six-octet sequences are now illegal UTF-8
Unicode Transformation Formats [czyborra.com]
Unicode UTF-8 FAQ
Unicode in XML and other Markup Languages: Unicode Technical Report #20
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an amusing and informative article by Joel Spolsky

Character Sets

The MIME charset parameter value for UTF-8 is UTF-8.
Character set names are case-insensitive, so utf-8 is equally
valid. [IANA Character Sets].
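
Many libraries follow the same case-insensitive convention. As a rough illustration, the sketch below asks Python's codec registry (which is separate from the IANA registry, so this only shows one implementation's behaviour) to resolve several spellings of the name; the list of spellings is an arbitrary example.

```python
import codecs

# Different spellings of the same charset name (arbitrary examples).
names = ["UTF-8", "utf-8", "Utf-8"]

# codecs.lookup() normalizes the name, so every spelling should resolve
# to the same codec.
canonical = {codecs.lookup(name).name for name in names}
print(canonical)

# Decoding the same octets gives the same text under either spelling.
octets = b"caf\xc3\xa9"
same_result = octets.decode("UTF-8") == octets.decode("utf-8")
```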

In an HTML file, place this tag inside <head></head>:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In an XML prolog, the encoding is typically specified as an attribute:

<?xml version="1.0" encoding="UTF-8" ?>
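
To see a parser honour that declaration, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The greeting element and its content are made up for illustration; the document is handed to the parser as UTF-8 octets, not as an already decoded string.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document with the declaration shown above. The text
# contains U+00E9, which UTF-8 encodes as two octets.
source = '<?xml version="1.0" encoding="UTF-8" ?><greeting>caf\u00e9</greeting>'
document = source.encode("utf-8")

# The parser reads the encoding from the declaration and decodes the octets.
root = ET.fromstring(document)
text = root.text
print(root.tag, len(text))
```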

In the Apache server config or .htaccess, this directive adds charset=UTF-8 to the
Content-Type HTTP header generated for text/html and text/plain
content:

AddDefaultCharset UTF-8
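
A response served with this setting carries a header along the lines of Content-Type: text/html; charset=UTF-8. As a sketch of how a client might read the charset back out, the Python snippet below parses such a header value with the standard library's email.message.Message class. The helper name charset_from_content_type is hypothetical, and the header strings are examples, not captured server output.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/plain")
print(with_charset, without_charset)
```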


This entry was posted on Saturday, March 3rd, 2007 at 1:36 pm and is filed under I Love Tech, I Love Programming!, Linux/Unix/GNU.
