Contents
In SQL, a Character set may be a Character set defined by a national or international standard, by your DBMS or by a user of SQL-data.
Standard-defined Character sets consist of a set of characters predefined by
some standards body and have a default Collation that is the order of the
characters in the relevant standard. The default Collation has the
PAD SPACE characteristic. The SQL Standard requires a DBMS
to support, at a minimum, these standard-defined Character sets:
SQL_CHARACTER, GRAPHIC_IRV (also called
ASCII_GRAPHIC), LATIN1,
ISO8BIT (also called ASCII_FULL) and
UNICODE (also called ISO10646).
Implementation-defined Character sets consist of a set of characters
predefined by your DBMS and have a default Collation that is also defined by
your DBMS. The default Collation may have either the
PAD SPACE characteristic or the NO PAD
characteristic. The SQL Standard requires a DBMS to support, at a minimum, this
implementation-defined Character set: SQL_TEXT.
SQL_CHARACTER
The SQL_CHARACTER Character set is an 8-bit Character set
that consists of the 83 <SQL language character>s. These are found in codepage
1252.
SQL_CHARACTER characters, and their codepage 1252 Form-of-use codes, are:
decimal code 032 | <space> |
decimal code 034 | " <double quote> |
decimal code 037 | % <percent> |
decimal code 038 | & <ampersand> |
decimal code 039 | ' <single quote> |
decimal code 040 | ( <left parenthesis> |
decimal code 041 | ) <right parenthesis> |
decimal code 042 | * <asterisk> |
decimal code 043 | + <plus sign> |
decimal code 044 | , <comma> |
decimal code 045 | - <minus sign> |
decimal code 046 | . <period> |
decimal code 047 | / <solidus> |
decimal code 048 | 0 |
decimal code 049 | 1 |
decimal code 050 | 2 |
decimal code 051 | 3 |
decimal code 052 | 4 |
decimal code 053 | 5 |
decimal code 054 | 6 |
decimal code 055 | 7 |
decimal code 056 | 8 |
decimal code 057 | 9 |
decimal code 058 | : <colon> |
decimal code 059 | ; <semicolon> |
decimal code 060 | < <less than operator> |
decimal code 061 | = <equals operator> |
decimal code 062 | > <greater than operator> |
decimal code 063 | ? <question mark> |
decimal code 065 | A |
decimal code 066 | B |
decimal code 067 | C |
decimal code 068 | D |
decimal code 069 | E |
decimal code 070 | F |
decimal code 071 | G |
decimal code 072 | H |
decimal code 073 | I |
decimal code 074 | J |
decimal code 075 | K |
decimal code 076 | L |
decimal code 077 | M |
decimal code 078 | N |
decimal code 079 | O |
decimal code 080 | P |
decimal code 081 | Q |
decimal code 082 | R |
decimal code 083 | S |
decimal code 084 | T |
decimal code 085 | U |
decimal code 086 | V |
decimal code 087 | W |
decimal code 088 | X |
decimal code 089 | Y |
decimal code 090 | Z |
decimal code 095 | _ <underscore> |
decimal code 097 | a |
decimal code 098 | b |
decimal code 099 | c |
decimal code 100 | d |
decimal code 101 | e |
decimal code 102 | f |
decimal code 103 | g |
decimal code 104 | h |
decimal code 105 | i |
decimal code 106 | j |
decimal code 107 | k |
decimal code 108 | l |
decimal code 109 | m |
decimal code 110 | n |
decimal code 111 | o |
decimal code 112 | p |
decimal code 113 | q |
decimal code 114 | r |
decimal code 115 | s |
decimal code 116 | t |
decimal code 117 | u |
decimal code 118 | v |
decimal code 119 | w |
decimal code 120 | x |
decimal code 121 | y |
decimal code 122 | z |
decimal code 124 | | <vertical bar> |
[NON-PORTABLE] The default Collation for the SQL_CHARACTER Character set is
non-standard because the SQL Standard requires implementors to define
SQL_CHARACTER's default Collation. Most DBMSs will sort the
characters of SQL_CHARACTER in the decimal order shown
above.
[OCELOT Implementation] The OCELOT DBMS that comes with this book has
a default Collation (called SQL_CHARACTER) for the
SQL_CHARACTER Character set that sorts the characters in
their codepage 1252 decimal order.
GRAPHIC_IRV and ASCII_GRAPHIC
The GRAPHIC_IRV Character set
(ASCII_GRAPHIC is a synonym) is an 8-bit Character set that
consists of 95 characters: all of the SQL_CHARACTER
characters, plus an additional 12 characters from codepage 1252. The default
Collation for the GRAPHIC_IRV Character set sorts the
characters in their codepage 1252 decimal order.
GRAPHIC_IRV characters, and their codepage
1252 Form-of-use codes (in default Collation order)
decimal code 032 | <space> |
decimal code 033 | ! <exclamation mark> |
decimal code 034 | " <double quote> |
decimal code 035 | # <number sign> |
decimal code 036 | $ <dollar sign> |
decimal code 037 | % <percent> |
decimal code 038 | & <ampersand> |
decimal code 039 | ' <single quote> |
decimal code 040 | ( <left parenthesis> |
decimal code 041 | ) <right parenthesis> |
decimal code 042 | * <asterisk> |
decimal code 043 | + <plus sign> |
decimal code 044 | , <comma> |
decimal code 045 | - <minus sign> |
decimal code 046 | . <period> |
decimal code 047 | / <solidus> |
decimal code 048 | 0 |
decimal code 049 | 1 |
decimal code 050 | 2 |
decimal code 051 | 3 |
decimal code 052 | 4 |
decimal code 053 | 5 |
decimal code 054 | 6 |
decimal code 055 | 7 |
decimal code 056 | 8 |
decimal code 057 | 9 |
decimal code 058 | : <colon> |
decimal code 059 | ; <semicolon> |
decimal code 060 | < <less than operator> |
decimal code 061 | = <equals operator> |
decimal code 062 | > <greater than operator> |
decimal code 063 | ? <question mark> |
decimal code 064 | @ <commercial at sign> |
decimal code 065 | A |
decimal code 066 | B |
decimal code 067 | C |
decimal code 068 | D |
decimal code 069 | E |
decimal code 070 | F |
decimal code 071 | G |
decimal code 072 | H |
decimal code 073 | I |
decimal code 074 | J |
decimal code 075 | K |
decimal code 076 | L |
decimal code 077 | M |
decimal code 078 | N |
decimal code 079 | O |
decimal code 080 | P |
decimal code 081 | Q |
decimal code 082 | R |
decimal code 083 | S |
decimal code 084 | T |
decimal code 085 | U |
decimal code 086 | V |
decimal code 087 | W |
decimal code 088 | X |
decimal code 089 | Y |
decimal code 090 | Z |
decimal code 091 | [ <left square bracket> |
decimal code 092 | \ <backslash> |
decimal code 093 | ] <right square bracket> |
decimal code 094 | ^ <circumflex accent mark> |
decimal code 095 | _ <underline character> |
decimal code 096 | ` <grave accent mark> |
decimal code 097 | a |
decimal code 098 | b |
decimal code 099 | c |
decimal code 100 | d |
decimal code 101 | e |
decimal code 102 | f |
decimal code 103 | g |
decimal code 104 | h |
decimal code 105 | i |
decimal code 106 | j |
decimal code 107 | k |
decimal code 108 | l |
decimal code 109 | m |
decimal code 110 | n |
decimal code 111 | o |
decimal code 112 | p |
decimal code 113 | q |
decimal code 114 | r |
decimal code 115 | s |
decimal code 116 | t |
decimal code 117 | u |
decimal code 118 | v |
decimal code 119 | w |
decimal code 120 | x |
decimal code 121 | y |
decimal code 122 | z |
decimal code 123 | { <left brace> |
decimal code 124 | | <vertical bar> |
decimal code 125 | } <right brace> |
decimal code 126 | ~ <tilde> |
[NON-PORTABLE] The default Collation for the GRAPHIC_IRV
Character set has a non-standard name because the SQL Standard requires
implementors to define what a legal <Collation name> is.
[OCELOT Implementation] The OCELOT DBMS that comes with this book has a
default Collation called ASCII_GRAPHIC for the
GRAPHIC_IRV Character set.
LATIN1
The LATINI Character set is an 8-bit Character set that
consists of all the characters commonly used in Danish, Dutch, English,
Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish and Swedish -- a total of 191 characters: all of the
GRAPHIC_IRV characters, plus an additional 96 characters
from codepage 1252. The default Collation for the LATIN1
Character set sorts the characters in their codepage 1252 decimal order.
Additional LATIN1 characters, and their codepage 1252
Form-of-use codes (in default Collation order):
decimal code 160 | <no-break space> |
decimal code 161 | ¡ <inverted exclamation mark> |
decimal code 162 | ¢ <cent sign> |
decimal code 163 | £ <pound sign> |
decimal code 164 | ¤ <currency sign> |
decimal code 165 | ¥ <yen sign> |
decimal code 166 | ¦ <broken bar> |
decimal code 167 | § <section sign> |
decimal code 168 | ¨ <diaeresis> |
decimal code 169 | © <copyright sign> |
decimal code 170 | ª <feminine ordinal indicator> |
decimal code 171 | « <left angle quotation mark> |
decimal code 172 | ¬ <not sign> |
decimal code 173 | <soft hyphen> |
decimal code 174 | ® <registered trade mark sign> |
decimal code 175 | ¯ <macron> |
decimal code 176 | ° <degree sign> |
decimal code 177 | ± <plus-minus sign> |
decimal code 178 | ² <superscript two> |
decimal code 179 | ³ <superscript three> |
decimal code 180 | ´ <acute accent> |
decimal code 181 | µ <micro sign> |
decimal code 182 | ¶ <pilcrow sign> |
decimal code 183 | · <middle dot> |
decimal code 184 | ¸ <cedilla> |
decimal code 185 | ¹ <superscript one> |
decimal code 186 | º <masculine ordinal indicator> |
decimal code 187 | » <right angle quotation mark> |
decimal code 188 | ¼ <fraction one quarter> |
decimal code 189 | ½ <fraction one half> |
decimal code 190 | ¾ <fraction three quarters> |
decimal code 191 | ¿ <inverted question mark> |
decimal code 192 | À <A accent grave> |
decimal code 193 | Á <A acute accent> |
decimal code 194 | Â <A circumflex> |
decimal code 195 | Ã <A tilde> |
decimal code 196 | Ä <A diaeresis> |
decimal code 197 | Å <A circle> |
decimal code 198 | Æ <AE ligature> |
decimal code 199 | Ç <C cedilla> |
decimal code 200 | È <E accent grave> |
decimal code 201 | É <E acute accent> |
decimal code 202 | Ê <E circumflex> |
decimal code 203 | Ë <E diaeresis> |
decimal code 204 | Ì <I accent grave> |
decimal code 205 | Í <I acute accent> |
decimal code 206 | Î <I circumflex> |
decimal code 207 | Ï <I diaeresis> |
decimal code 208 | Ð <Icelandic ETH> |
decimal code 209 | Ñ <N tilde> |
decimal code 210 | Ò <O accent grave> |
decimal code 211 | Ó <O acute accent> |
decimal code 212 | Ô <O circumflex> |
decimal code 213 | Õ <O tilde> |
decimal code 214 | Ö <O diaeresis> |
decimal code 215 | × <multiplication sign> |
decimal code 216 | Ø <O stroke> |
decimal code 217 | Ù <U accent grave> |
decimal code 218 | Ú <U acute accent> |
decimal code 219 | Û <U circumflex> |
decimal code 220 | Ü <U diaeresis> |
decimal code 221 | Ý <Y acute accent> |
decimal code 222 | Þ <THORN> |
decimal code 223 | ß <sharp s> |
decimal code 224 | à <a accent grave> |
decimal code 225 | á <a acute accent> |
decimal code 226 | â <a circumflex> |
decimal code 227 | ã <a tilde> |
decimal code 228 | ä <a diaeresis> |
decimal code 229 | å <a circle> |
decimal code 230 | æ <ae ligature> |
decimal code 231 | ç <c cedilla> |
decimal code 232 | è <e accent grave> |
decimal code 233 | é <e acute accent> |
decimal code 234 | ê <e circumflex> |
decimal code 235 | ë <e diaeresis> |
decimal code 236 | ì <i accent grave> |
decimal code 237 | í <i acute accent> |
decimal code 238 | î <i circumflex> |
decimal code 239 | ï <i diaeresis> |
decimal code 240 | ð <Icelandic eth> |
decimal code 241 | ñ <n tilde> |
decimal code 242 | ò <o accent grave> |
decimal code 243 | ó <o acute accent> |
decimal code 244 | ô <o circumflex> |
decimal code 245 | õ <o tilde> |
decimal code 246 | ö <o diaeresis> |
decimal code 247 | ÷ <division sign> |
decimal code 248 | ø <o slash> |
decimal code 249 | ù <u accent grave> |
decimal code 250 | ú <u acute accent> |
decimal code 251 | û <u circumflex> |
decimal code 252 | ü <u diaeresis> |
decimal code 253 | ý <y acute accent> |
decimal code 254 | þ <thorn> |
decimal code 255 | ÿ <y diaeresis> |
[NON-PORTABLE] The default Collation for the LATIN1
Character set has a non- standard name because the SQL Standard requires
implementors to define what a legal <Collation name> is.
[OCELOT Implementation] The OCELOT DBMS that comes with this book has a
default Collation called LATIN1 for the
LATIN1 Character set.
ISO8BIT and ASCII_FULL
The ISO8BIT Character set (ASCII_FULL is
a synonym) is an 8-bit Character set that consists of 256 characters: all of
the LATIN1 characters, plus the rest of the characters
(including control characters and graphic characters) from codepage 1252. The
default Collation for the ISO8BIT Character set sorts the
characters in their codepage 1252 decimal order.
ISO8BIT characters and their codepage 1252 Form-of-use
codes (in default Collation order):
decimal code 000 = control character; not displayed |
decimal code 001 = control character; not displayed |
decimal code 002 = control character; not displayed |
decimal code 003 = control character; not displayed |
decimal code 004 = control character; not displayed |
decimal code 005 = control character; not displayed |
decimal code 006 = control character; not displayed |
decimal code 007 = control character; not displayed |
decimal code 008 = control character; not displayed |
decimal code 009 = control character; not displayed |
decimal code 010 = control character; not displayed |
decimal code 011 = control character; not displayed |
decimal code 012 = control character; not displayed |
decimal code 013 = control character; not displayed |
decimal code 014 = control character; not displayed |
decimal code 015 = control character; not displayed |
decimal code 016 = control character; not displayed |
decimal code 017 = control character; not displayed |
decimal code 018 = control character; not displayed |
decimal code 019 = control character; not displayed |
decimal code 020 = control character; not displayed |
decimal code 021 = control character; not displayed |
decimal code 022 = control character; not displayed |
decimal code 023 = control character; not displayed |
decimal code 024 = control character; not displayed |
decimal code 025 = control character; not displayed |
decimal code 026 = control character; not displayed |
decimal code 027 = control character; not displayed |
decimal code 028 = control character; not displayed |
decimal code 029 = control character; not displayed |
decimal code 030 = control character; not displayed |
decimal code 031 = control character; not displayed |
decimal codes 032 through 126, from LATIN1 |
decimal code 127 = control character; not displayed |
decimal code 128 = control character; not displayed |
decimal code 129 = control character; not displayed |
decimal code 130 = , <comma> |
decimal code 131 = <f italics> |
decimal code 132 = <close curly double quote subscript> |
decimal code 133 = ... <ellipsis> |
decimal code 134 = <cross> |
decimal code 135 = <double cross> |
decimal code 136 = <carat> |
decimal code 137 = <modified percent sign> |
decimal code 138 = <S hacek> |
decimal code 139 = <modified less than> |
decimal code 140 = <OE> |
decimal code 141 = control character; not displayed |
decimal code 142 = control character; not displayed |
decimal code 143 = control character; not displayed |
decimal code 144 = control character; not displayed |
decimal code 145 = <open quote sign> |
decimal code 146 = <close quote sign> |
decimal code 147 = <open curly double quote> |
decimal code 148 = <close curly double quote> |
decimal code 149 = <dark circle> |
decimal code 150 = <long hyphen> |
decimal code 151 = <hyphen> |
decimal code 152 = control character; not displayed |
decimal code 153 = <TM sign> |
decimal code 154 = <s hacek> |
decimal code 155 = <modified greater than> |
decimal code 156 = <oe> |
decimal code 157 = control character; not displayed |
decimal code 158 = control character; not displayed |
decimal code 159 = <Y umlaut> |
decimal codes 160 through 256, from LATIN1 |
[NON-PORTABLE] The default Collation for the ISO8BIT
Character set has a non- standard name because the SQL Standard requires
implementors to define what a legal <Collation name> is.
[OCELOT Implementation] The OCELOT DBMS that comes with this book has a
default Collation called ISO8BIT for the
ISO8BIT Character set.
UNICODE and ISO10646
The UNICODE Character set (ISO10646 is a
synonym) is a 16-bit Character set that consists of every character represented
by the Unicode specification (specifically, by The Unicode Standard Version
2.1). Since there are 38,887 characters in this set, we won't include a list in
this book. It suffices to say that the UNICODE Character set
includes all of the ISO8BIT characters, plus many others.
The default Collation for the UNICODE Character set sorts
the characters in their Unicode Form-of-use code order.
[NON-PORTABLE] The default Collation for the UNICODE
Character set has a non- standard name because the SQL Standard requires
implementors to define what a legal <Collation name> is.
[OCELOT Implementation] The OCELOT DBMS that comes with this book has a
default Collation called SQL_TEXT for the
UNICODE Character set.
SQL_TEXT
The SQL Standard requires every DBMS to provide an Ur-Character set named
SQL_TEXT. The complete set of SQL_TEXT
characters is implementation defined, but SQL_TEXT must
contain every <SQL language character> plus every other character that the DBMS
supports -- that is, SQL_TEXT (at a minimum) includes all of
the UNICODE characters and is therefore at least a 16-bit
Character set.
[NON-PORTABLE] The default Collation for the SQL_TEXT
Character set must be called SQL_TEXT too, but is otherwise
non-standard because the SQL Standard requires implementors to define the
default Collation and Form-of-use of SQL_TEXT.
[OCELOT Implementation] The OCELOT DBMS that comes with this book sorts
the characters of SQL_TEXT in their Unicode
Form-of-use code order.
There are some curious consequences to making SQL_TEXT (and
UNICODE) a 16-bit, rather than an 8-bit Character set.
The consequences are the result of the fact that all <identifier>s and other
character strings in INFORMATION_SCHEMA, all <SQL-server
name>s, all <Connection name>s, all <AuthorizationID>s, and the results of many
functions (e.g.: USER, CURRENT_USER,
SESSION_USER, SYSTEM_USER) are
SQL_TEXT character strings.
SQL_TEXT, however, is normally not the default Character set
(the default Character set is almost always an 8-bit Character set) -- and SQL
specifically prohibits comparisons between strings that belong to different
Character sets. (The rules require that, for such comparisons, one string be
translated to the other string's Character set first -- see the
TRANSLATE function" in Chapter 7, "Character Strings".)
So, will you have special trouble when you want to display
INFORMATION_SCHEMA strings, <AuthorizationID>s, <SQL-server
name>s, <Connection name>s or the like? There are two possibilities, and the
answer is usually "NO" to both.
Possibility 1
Consider the following SQL statements:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES; SELECT SESSION_USER FROM INFORMATION_SCHEMA.TABLES;
The SQL Standard allows a DBMS to decide whether there should be an automatic
translation from SQL_TEXT to the default Character set when
an SQL statement assigns an SQL_TEXT string to a host
language variable or parameter. In such cases, most DBMSs do the automatic
translation for you (as, for example, the OCELOT DBMS that comes with this book
does) -- so, to display the result of a SELECT from
INFORMATION_SCHEMA or the result from a function that
returns an SQL_TEXT string, you won't need to do anything
special. It won't normally be necessary to use the TRANSLATE function to
translate the retrieved values from 16-bit SQL_TEXT
characters into 8-bit, readable format (or vice versa) -- just write your SQL
statement as if the Character set difference doesn't exist. (In the example
above, by the way, the ASCII <identifier> TABLE_NAME would
also be automatically converted to a SQL_TEXT string by your
DBMS before the search was done.)
Possibility 2
Consider this SQL statement:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'ocelot';
The SQL Standard does not allow a DBMS to decide whether there should be an
automatic translation from SQL_TEXT to the default Character
set when an SQL statement involves a comparison between a
SQL_TEXT string and a default Character set string -- but
the rules do allow a DBMS to provide a predefined Translation for this purpose.
In such cases, most DBMSs will use this option to do an automatic translation
for you (as, for example, the OCELOT DBMS that comes with this book does) --
so, to compare SQL_TEXT strings with default Character set
strings, you won't need to do anything special. It won't normally be necessary
to use the TRANSLATE function to translate the retrieved
values from 16-bit SQL_TEXT characters into 8-bit, readable
format (or vice versa) -- just write your SQL statement as if the Character set
difference doesn't exist.
A second consequence of the fact that SQL_TEXT is 16-bit and
the default Character set is usually 8-bit involves the use of
<AuthorizationID>s, <SQL-server name>s and <Connection name>s. Although
<AuthorizationID>s, <SQL-server name>s and <Connection name>s are stored as
16-bit SQL_TEXT strings, the SQL-Connection arguments in
CONNECT TO are inevitably in 8-bit sets, since ASCII is used
to type them. If, however, these same arguments are assigned to host language
variable parameters, you will see them as 16-bit strings. Therefore, when you
have an expression like this:
... WHERE= 'x' ...
you know that the <character string literal> 'x' is a 16-bit
string because <Authorization ID>s are SQL_TEXT strings. But
the same <literal> in a CONNECT statement, e.g.:
CONNECT TO 'x';
has to be an 8-bit string because it is typed in ASCII.
One last comment, about parsing SQL_TEXT strings. Consider
the following expression, which concatenates a character string with a value
from a Column:
... _INFORMATION_SCHEMA.SQL_TEXT 'A ' || column_1
The default Character set for an SQL statement is usually an 8-bit Character
set, but the string 'A ' in this example is explicitly
introduced as 16-bit SQL_TEXT. Thus, what appears to be two
8-bit characters -- A and <space> -- is, in fact, one 16-bit
character, the Unicode value 4120h.
Parsing of SQL statements is always based on 8-bit values regardless of the
Character set of a string, so your DBMS will consider the
SQL_TEXT string 'A ' to begin after the
first single quote mark and end before the second single quote mark; i.e.:
although the characters inside the single quotes are 16-bit, the delimiting
single quote marks themselves are 8-bit. Thus, when the parser reaches the
second single quote mark it knows that the string is over -- despite the fact
that in 16-bit, a single quote mark character has not yet been found.