This is the second in our ongoing series of articles explaining some of the new features in MySQL 4.1, which as of this writing is in the gamma phase of its development cycle, on the way to a production-ready release in the near future.
By Jim Winstead
One of the major new features in MySQL 4.1 is strong Unicode support, along with support for specifying character sets at many different levels. This makes it much simpler to handle content in a wide range of languages in your applications, as well as making it possible to handle content in multi-byte character encodings that were not supported in earlier versions of MySQL.
Character Encodings and Unicode
A character encoding is a way of mapping a character (the letter 'A') to an integer in a character set (the number 65 in the US-ASCII character set). With something as limited as the US-ASCII character set (the twenty-six letters of the English alphabet in both lowercase and uppercase, the digits 0 through 9, and some punctuation), fitting everything into a single byte is not a problem. But once you start to create character sets for languages like German, Swedish, Hungarian, and Japanese, you quickly hit the limits of the 8-bit byte, either when you try to build one character set covering two of those languages, or even with a single language like Japanese.
So throughout the history of computing, a number of different character encodings have been specified for mapping different characters to integers. For character sets that wouldn't fit in a single byte, double-byte character sets were created, and so were multi-byte character sets that use a special character to signal a shift between single-byte and double-byte encoding.
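For example, you can ask the server for the US-ASCII mapping of a character directly. The ASCII() and CHAR() functions shown here have been part of MySQL for a long time; this is just a quick illustration from the mysql client:

-- map a character to its integer value in the ASCII range
SELECT ASCII('A');   -- 65
-- and map the integer back to the character
SELECT CHAR(65);     -- 'A'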
The Unicode Consortium came together to create a specification for a character encoding that would be able to encompass the characters in all written languages (although contrary to what you may have heard, that does not yet include Klingon). The result was the Unicode character set, along with several ways of encoding it. The two most common (and the two that MySQL 4.1 supports) are UCS-2, which encodes every character as two bytes, and UTF-8, a multi-byte encoding scheme that extends US-ASCII.
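You can see the difference between the two encodings by converting a string and looking at the raw bytes. This is only a sketch, but the CONVERT(... USING ...) and HEX() functions used here are available in MySQL 4.1:

-- 'A' is a single byte in UTF-8 (the same byte as in US-ASCII)...
SELECT HEX(CONVERT('A' USING utf8));   -- should return 41
-- ...but always two bytes in UCS-2
SELECT HEX(CONVERT('A' USING ucs2));   -- should return 0041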
ISO-8859-1 is the most common character set used for Western languages, and it is extended by the Windows-1252 character set to include some other characters, such as the euro (€) and trademark (™) symbols. Because Windows-1252 is a superset of ISO-8859-1, MySQL uses Windows-1252 for the character set it calls latin1, and there is no distinct ISO-8859-1 character set. This matches the common behavior of web applications, which often treat the two interchangeably.
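You can see this for yourself with SHOW CHARACTER SET, which should describe latin1 in terms of cp1252 (another name for Windows-1252):

SHOW CHARACTER SET LIKE 'latin1';
-- the Description column should read something like 'cp1252 West European',
-- with latin1_swedish_ci as the default collation and a maximum of one byte per character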
So why not just use UCS-2 or UTF-8 for everything? Well, if you're already working with a lot of data in a particular encoding, like Big-5 (often used for traditional Chinese), you can avoid the processing overhead of converting into and out of UTF-8 by simply storing the data in Big-5. UTF-8 also tends to be larger (byte-wise) than more specific encodings, because characters outside of the normal ASCII range take at least two bytes. The string "déjà vu" is only seven bytes in ISO-8859-1, but nine in UTF-8. The characters in scripts such as Chinese, Japanese, and Korean are each three bytes in UTF-8, but can be represented as two bytes in more specific encodings such as Big-5.
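The byte counts are easy to verify, because LENGTH() measures strings in bytes while CHAR_LENGTH() measures them in characters. A small sketch, assuming the connection character set is latin1 so the literal is sent as ISO-8859-1/Windows-1252:

-- seven characters and seven bytes in latin1
SELECT CHAR_LENGTH('déjà vu'), LENGTH('déjà vu');            -- should return 7 and 7
-- still seven characters, but nine bytes once converted to UTF-8
SELECT CHAR_LENGTH(CONVERT('déjà vu' USING utf8)),
       LENGTH(CONVERT('déjà vu' USING utf8));                -- should return 7 and 9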
Collations
Example of Sorting Differences

  Language         Swedish: z < ö               German: ö < z
  Usage (German)   Dictionary: of < öf          Telephone book: öf < of
Collation. The process of ordering units of textual information. Collation is usually specific to a particular language.
— Unicode Glossary
Sorting strings is a common operation, but not only do different languages use different characters, they do not always sort the same characters the same way. A collation is a defined way of sorting strings, and it is often language-dependent. While both Swedish and German generally use the ISO-8859-1 encoding (or latin1), some characters are sorted differently in the two languages (and German actually has two different orderings), as the "Example of Sorting Differences" table shows.
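MySQL 4.1 ships with latin1 collations that follow these rules, so the differences in the table can be reproduced directly. A quick sketch, assuming the connection character set is latin1 so the literals arrive intact; latin1_german1_ci is the dictionary ordering and latin1_german2_ci the telephone-book ordering:

-- Swedish: ö sorts after z
SELECT 'z' < 'ö' COLLATE latin1_swedish_ci;    -- should return 1
-- German: ö sorts with o, before z
SELECT 'ö' < 'z' COLLATE latin1_german1_ci;    -- should return 1
-- German dictionary order: of before öf
SELECT 'of' < 'öf' COLLATE latin1_german1_ci;  -- should return 1
-- German telephone-book order: öf (treated as oef) before of
SELECT 'öf' < 'of' COLLATE latin1_german2_ci;  -- should return 1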
The Unicode Collation Algorithm (UCA) is a general-purpose algorithm for sorting Unicode strings. There is also the Default Unicode Collation Element Table (DUCET), which supplies a default ordering for all Unicode characters.
MySQL 4.1 implements this general-purpose algorithm, which makes adding a language-specific collation far simpler than it would be without UCA support: each new collation only has to spell out where it differs from the default DUCET ordering. For example, here is the specification of the ucs2_lithuanian_ci collation (from strings/ctype-uca.c in the MySQL 4.1 source code):
static const char lithuanian[]=
    "& C << ch <<< Ch <<< CH< \\u010D <<< \\u010C"
    "& E << \\u0119 <<< \\u0118 << \\u0117 <<< \\u0116"
    "& I << y <<< Y"
    "& S < \\u0161 <<< \\u0160"
    "& Z < \\u017E <<< \\u017D";