Consider two vertical bars with a horizontal bar between them: H. This H is a
glyph. To us, this glyph is a symbol -- it's the letter "Aitch" and it appears
in English words. Thus, a "letter" is a combination of a glyph and a meaning.
(The importance of this will become apparent as you go through this chapter.)
We know that H is the eighth letter of an alphabet. Or -- in SQL terms -- H is
the eighth symbol in a character repertoire. While an alphabet is a familiar
example of a character repertoire, any computer will also support a repertoire
containing digits, punctuators and special characters as well as glyphless
(non-printing) characters such as the '\0' which terminates strings in C.
However, no computer can handle the letter H by itself. It can only handle numbers, so we have a convention that the letter H will be represented by some number, such as 72. If we all agree that "72 is the code for H ... 74 is the code for J ..." and so on for every character in the repertoire, we can make a computer character set. Every character set has two parts: a character repertoire and an agreement on how the repertoire's characters will be encoded as numbers. (The SQL Standard calls the encoding scheme itself a Form-of-use.)
Internationally, there are many character sets, and many names for them. When using SQL, you'll be concerned mainly with the ones that the International Standards Organization (ISO) has published standards for:
- ISO 646 (1965), also known as US-ASCII, DIN 66003, etc.
- ISO 8859-1 (1985), also known as LATIN 1 and Windows Code Page 1252.
- ISO 8859-2 (1987), also known as LATIN 2 and Windows Code Page 1250.
- ... and several others which we'll describe as we go along.
The numeric codes between 0 and 127 have the same meaning in every ISO standard; differences exist only for codes greater than 127. In each standard, the codes between 0 and 127 include punctuation, digits, special characters and (what's most important for this discussion) the upper case letters A to Z (for example, the code for H is indeed 72) and the lower case letters a to z. Thus, all character sets start with Latin.
Latin
Cicero never saw a W or a lower-case j, but medieval and modern Latin have 26 letters, each of which has two forms ("Upper Case" and "Lower Case"). These are the simple Latin letters:
ABCDEFGHIJKLMNOPQRSTUVWXYZ -- simple upper-case Latin letters abcdefghijklmnopqrstuvwxyz -- simple lower-case Latin letters
These letters appear in all standard computer character sets including 7-bit ASCII and EBCDIC. They even appear in character sets for languages which don't use Latin. For instance, character sets used in Russia (such as KOI8R or ISO 8859-5) contain all the Latin letters with the same codes as 7-bit ASCII has; the Russian Cyrillic letters follow the Latin letters.
Because they are universal, the Latin letters are vital for computer languages. In SQL, all <keyword>s and <regular identifier>s (or names) are made with Latin letters (though they may also contain digits and the underline character).
The letters of the Latin alphabet are in alphabetic order. That might sound ... well, "Duh" ... but it helps to point out the most fundamental point about a collation, or collating sequence -- namely, that a collation represents a fixed order, and that order is known to most members of the general public. They expect to see words and names sorted that way (though they don't expect A to sort before a -- that's a computer thing).
English
English uses the Latin alphabet but adds rules when sorting. To set the scene for the complexities that follow, here are the rules for English sorting. (We'll naìvely pretend that when we write à l'anglaise we don't need even a soupçon of those pesky "accented letters" for names like El Niño -- we'll look at English only. All examples come from The Concise Oxford Dictionary Of Current English.)
Preprocess (Map to Simple Latin Upper Case Letters)
Preprocessing involves (a) stripping accents off all accented characters (e.g.: Ò becomes O), (b) eliminating all special characters such as hyphen or space or apostrophe or dot, and (c) changing ligatures such as Æ to double-characters such as AE.
SORT
Use the alphabetical order: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Tiebreakers
- Tie breakers apply only if two words are otherwise equal.
- If a special character (such as a hyphen or apostrophe) was stripped in the preprocess stage, the word containing it follows. For example: "cant" before "can't" before "Cant", "recollect" before "re-collect", "nones" before "non est", "francophone" before "franc tireur".
- If a word contained an accented character, it follows. For example: "lame" before "lamé".
- If a word contains a capital letter, it doesn't matter. For example: "Negus" before "negus", "AR" before "Ar" -- but: "dan" before "Dan", "ma" before "MA". (We acknowledge that some people believe it does matter; we report only what actually happens.)
About 50% of English-world phone books have additional rules:
- Mc and Mac appear before M (sometimes M' does too, but only in Ireland). For example: "McDonald" before "Macdonald" before "Maastricht".
- Ste and St appear as if they were spelled "Saint". For example: "St John" before "Saan Stores Limited".
- In these cases, we effectively combine a group of letters into one, and place the result in the hidden interstices between O and M, or between R and S. It looks odd, but many languages have two-letter combinations (called "digraphs") which fit between primary letters of the alphabet.
From all this, we see that English sorting involves, in addition to the basic A-to-Z stuff, several special collation rules, including mapping, ligatures ("one character maps to two"), tie breaking and digraphs ("two characters map to one"). Nobody makes a fuss about English because the special cases are rare. But special cases in non-English languages are not rare at all -- they are the norm, so it's the non-English collation rules that get the attention.
French
French words have:
| a or e with Grave accent | ès à |
| a or e with Acute accent | Chrétien résumé |
| a or i or o with Circumflex | hôte côte |
| c with Cedilla | soupçon ça |
| i or e or o with Diaeresis | naïf coöpërant |
In simple cases, the French sorting rules are like the English rules.
- Accented characters map to their Latin upper case equivalents ("résumé" comes after "rester" but before "retard").
- Words with accented characters come after words with none -- "à" comes after "a" -- but that's only a tie-breaker, "à" comes before "abbé".
Then things get complicated.
- For words with multiple accented characters, the word with the last accented character follows. For example: "pêche" before "péché".
- France and French Canada don't capitalize accented letters the same way. In France the upper case of the word résumé is RESUME -- there are no capital letters with accents, which is why the old PC-DOS Character set (IBM extended ASCII) could dispense with the letters É, Ó, etc. But French Canadians capitalize the word résumé and keep the accents: RÉSUMÉ.
This shows us that we have more characters, another collation rule and conflicting ideas for changing lower case to upper case.
German
German words have:
| a or o or u with umlaut | Mädchen östlich Führer |
| sharp S or Eszet | bißchen |
Sharp S's similarity to a Greek Beta is just coincidence: it is really a ligature which maps to "ss". This is the only character in any Latin alphabet which is always lower case.
There are three ways to sort characters with umlauts, depending on the German-speaking country and the application.
- In the German DIN-1 official standard, Ä = A and Ö = O and Ü = U. That is, accented characters map to unaccented equivalents. This is the standard used for dictionaries, book indexes or any lists of words. Thus, Führer = Fuhrer.
- In German DIN-2 official standard, Ä = AE and Ö = OE and Ü = UE. That is, accented characters map to unaccented character plus E. This is the standard used for phone books, voter lists, or any lists of names in Germany. Thus, Göbbels = Goebbels and Düsseldorf = Duesseldorf.
- In the Austrian unofficial standard, Ä > AZ and Ö > OZ and Ü > UZ. That is, accented character are treated as a separate letter between A and B or O and P or U and V. This is the standard used for lists of names in Austria. Thus, Müller > Mzilikaze.
The Swiss use DIN 1 for all sorting.
The German tie-breaker rules are: unaccented before accented (this rule is universal), Eszet before umlaut, umlaut before any other accent. Here are three lists sorted according to DIN 1 (Zürich phone book), DIN 2 (Berlin phone book) and Austrian collation (Vienna phone book):
| DIN-1 | DIN-2 | Austrian |
|---|---|---|
| Moeller | Moeller | Moeller |
| Moffat | Möller | Moffat |
| Möller | Moffat | Morse |
| Morse | Morse | Möller |
| Möse | Möse | Möse |
| Motler | Motler | Motler |
At this point, you may be wondering what happens if you click "use German collating sequence" on some Windows dialog box. Well, Microsoft seems to prefer DIN 1, but the point is: if you just click, you'll get a Microsoft choice which is wrong at least half the time. As we said at the beginning of this chapter: you have to know what the problem is before you know what the solutions are worth.
Spanish
Spanish words have:
| tilde over n | cañon |
| acute accent over a u o i | corazón |
Some circles (such as the Spanish-speaking community in the United States) use the English style rules to sort Spanish words: all accented characters map to their simple Latin equivalents, period. However, the following discussion on Spanish collation refers to the traditional rules which apply in most situations.
- CH comes between C and D
- LL comes between L and M
- N comes before Ñ
(From now on, we'll use the formula "<x> follows <y>" to mean "<x> is treated as a separate letter which follows <y> in the alphabet", as in: CH follows C.
The Spanish CH and LL digraphs are truly separate letters. They get separate
headings in the dictionaries and the Spanish alphabet really has 28 letters:
"a b c ch d e f g h i j k l ll m n ñ o p q r s t u v x y z".
Here is an example of a sorted Spanish list:
| calza |
| calle |
| cantor |
| cañon |
| culto |
| che |
Dutch, Italian, Portuguese, Basque, Catalan
These languages present no special problems. Dutch does have one digraph, IJ, which is sometimes used for Y, but it is acceptable to use the letters IJ and sort with the digraph IJ. Acute accents are common in Portuguese, grave accents in Italian; however, they all map to simple Latin letters and there are no special collation rules.
Welsh
CH follows C. FF follows F. LL follows L. PH follows P. RH follows R. TH follows T.
Nordic
The Scandinavian alphabets have a common history and a few similar features, the most notable of which is their tendency to add letters at the end of the alphabet (after Z).
Central Europe:
So far, we've discussed eleven alphabets and collations without having to change the standard computer character set. Every one of the symbols in the alphabets we've discussed appears in ISO 8859-1, usually called the LATIN I character set. They also appear in ISO 8859-1's most famous derivative, Windows code page 1252, which is the default code page on computers in North America and in all European countries west of a line drawn through Finland, Germany, Austria and Italy. We are now going to cross that line, and briefly discuss ISO 8859-2, usually called the LATIN 2 character set (or, in Windows, code page 1250).
The Central European character set supports languages that use Latin-based characters with macrons, slashes, haceks, ogoneks, breves, bars and carons. Many of these do not map to a simple Latin letter when sorting.
The many extra letters created by adding these marks to the simple Latin
letters made it impossible to fit them all into the 8-bit LATIN 1 character
set, since there can only be a maximum of 256 different symbols in an 8-bit
set. (In fact, LATIN 1 actually provides fewer than 256 characters because
0x00 to 0x1f and 0x7f
to 0x9f are reserved for control "characters".) Therefore,
the most convenient thing to do was group the Latin-based East European
character sets together into LATIN 2.
There has been lots of confusion in the past and several national character set encodings are still in common use, so it's a bit idealist to write as if the ISO standards were already universal; however, there's no other useful assumption that a small shop can make.
Baltic
The Baltic countries have alphabets containing yet more characters, culminating in the world's longest Latin-based alphabet -- Latvian. Again, the addition of the extra letters made it necessary to set up another 8-bit character set. This one is ISO 8859-4, usually called the BALTIC character set (or, in Windows, code page 1257). The Baltic collations are the only ones in which the simple Latin letters do not follow what we usually think of as alphabetic order.
Turkish
The Turkish character set is defined by ISO 8859-9, usually called the LATIN 5 character set (or, in Windows, code page 1254). There is an irritating capitalization problem in Turkish (irritating to us that is, we assume it doesn't bother the Turks much): the capital of i is a dotted capital "I", and the capital of ı is I. Because of this single character, it is impossible to capitalize simple-Latin-letter strings without being sure that the strings are not written in Turkish.
Cyrillic
The main Cyrillic character sets can be represented by a single Windows code page: 1251, a derivative of ISO 8859-5. In Ukrainian, the soft sign appears at the end of the alphabet, instead of the position (between YERU and E) used by other Cyrillic languages. This affects the order of a few words, so two collations are necessary for completeness: a Ukrainian collation and a generalized-Cyrillic (Bulgarian Russian Serbian) collation, which differs only with respect to the position of this one letter.
NOTE: The Cyrillic letters А В Е І Ј К М Н О Р С Ѕ Т Ү Х are only similar to Latin letters in one respect: their appearance. Remember that we defined a character as "a combination of a glyph and a meaning" -- this is why. Similarity don't mean equivalence. These Cyrillic letters have codes of their own, they never map to Latin letters.
Greek
Greek is the oldest European character set and contains a few archaic features (for example, breath marks) which don't really indicate anything nowadays. On the other hand, we'd just as often want to use the Greek letters for classical texts as for modern Greek. So it's good that they're all found in ISO 8859-7, usually called the GREEK character set (or, in Windows, code page 1253).
The Rest of the World
We have now described all the alphabets of Europe and given you some idea of their collation quirks. We think that's the minimum requirement for a book that mentions character sets. We also think that it's the maximum! With some relief, we note that other alphabets and other writing systems (syllabaries, hieroglyphs etc.) are unavailable on a typically-configured computer in "the western world", so we won't talk about them in any detail. Three other systems deserve a mention, though.
Arabic Each character is displayable in several different forms, depending on whether it is alone, or is the first, middle or last character in a word. Thus, search and sort features are needed that can recognize multiple encodings of the same symbol.
Chinese There are at least 13,000 different signs, which are implemented with double byte character sets (usually "BIG5" in Taiwan, or "GB" in Red China); ideally the arrangement is that characters are sorted phonetically (Level 1) or according to radical and number of strokes (Level 2).
Japanese Several systems are in simultaneous use. For example, the word for "water" can be a single character (Kanji), two characters (Hiragana), two characters (Katakana) or four Latin letters (Romaji). The double character sets are usually encoded using escape characters. That saves space, but makes it impossible to determine a text's length or find its nth letter without starting at the front and working through the text.