In SQL, a Character set may be a Character set defined by a national or international standard, by your DBMS or by a user of SQL-data.

Standard-defined Character sets consist of a set of characters predefined by some standards body and have a default Collation that is the order of the characters in the relevant standard. The default Collation has the PAD SPACE characteristic. The SQL Standard requires a DBMS to support, at a minimum, these standard-defined Character sets: SQL_CHARACTER, GRAPHIC_IRV (also called ASCII_GRAPHIC), LATIN1, ISO8BIT (also called ASCII_FULL) and UNICODE (also called ISO10646).

Implementation-defined Character sets consist of a set of characters predefined by your DBMS and have a default Collation that is also defined by your DBMS. The default Collation may have either the PAD SPACE characteristic or the NO PAD characteristic. The SQL Standard requires a DBMS to support, at a minimum, this implementation-defined Character set: SQL_TEXT.

SQL_CHARACTER

The SQL_CHARACTER Character set is an 8-bit Character set that consists of the 83 <SQL language character>s. These are found in codepage 1252.

SQL_CHARACTER characters, and their codepage 1252 Form-of-use codes, are:

decimal code 032 <space>
decimal code 034" <double quote>
decimal code 037% <percent>
decimal code 038& <ampersand>
decimal code 039' <single quote>
decimal code 040( <left parenthesis>
decimal code 041) <right parenthesis>
decimal code 042* <asterisk>
decimal code 043+ <plus sign>
decimal code 044, <comma>
decimal code 045- <minus sign>
decimal code 046. <period>
decimal code 047/ <solidus>
decimal code 0480
decimal code 0491
decimal code 0502
decimal code 0513
decimal code 0524
decimal code 0535
decimal code 0546
decimal code 0557
decimal code 0568
decimal code 0579
decimal code 058: <colon>
decimal code 059; <semicolon>
decimal code 060< <less than operator>
decimal code 061= <equals operator>
decimal code 062> <greater than operator>
decimal code 063? <question mark>
decimal code 065A
decimal code 066B
decimal code 067C
decimal code 068D
decimal code 069E
decimal code 070F
decimal code 071G
decimal code 072H
decimal code 073I
decimal code 074J
decimal code 075K
decimal code 076L
decimal code 077M
decimal code 078N
decimal code 079O
decimal code 080P
decimal code 081Q
decimal code 082R
decimal code 083S
decimal code 084T
decimal code 085U
decimal code 086V
decimal code 087W
decimal code 088X
decimal code 089Y
decimal code 090Z
decimal code 095_ <underscore>
decimal code 097a
decimal code 098b
decimal code 099c
decimal code 100d
decimal code 101e
decimal code 102f
decimal code 103g
decimal code 104h
decimal code 105i
decimal code 106j
decimal code 107k
decimal code 108l
decimal code 109m
decimal code 110n
decimal code 111o
decimal code 112p
decimal code 113q
decimal code 114r
decimal code 115s
decimal code 116t
decimal code 117u
decimal code 118v
decimal code 119w
decimal code 120x
decimal code 121y
decimal code 122z
decimal code 124| <vertical bar>

[NON-PORTABLE] The default Collation for the SQL_CHARACTER Character set is non-standard because the SQL Standard requires implementors to define SQL_CHARACTER's default Collation. Most DBMSs will sort the characters of SQL_CHARACTER in the decimal order shown above. [OCELOT Implementation] The OCELOT DBMS that comes with this book has a default Collation (called SQL_CHARACTER) for the SQL_CHARACTER Character set that sorts the characters in their codepage 1252 decimal order.

GRAPHIC_IRV and ASCII_GRAPHIC

The GRAPHIC_IRV Character set (ASCII_GRAPHIC is a synonym) is an 8-bit Character set that consists of 95 characters: all of the SQL_CHARACTER characters, plus an additional 12 characters from codepage 1252. The default Collation for the GRAPHIC_IRV Character set sorts the characters in their codepage 1252 decimal order.

GRAPHIC_IRV characters, and their codepage 1252 Form-of-use codes (in default Collation order)

decimal code 032 <space>
decimal code 033! <exclamation mark>
decimal code 034" <double quote>
decimal code 035# <number sign>
decimal code 036$ <dollar sign>
decimal code 037% <percent>
decimal code 038& <ampersand>
decimal code 039' <single quote>
decimal code 040( <left parenthesis>
decimal code 041) <right parenthesis>
decimal code 042* <asterisk>
decimal code 043+ <plus sign>
decimal code 044, <comma>
decimal code 045- <minus sign>
decimal code 046. <period>
decimal code 047/ <solidus>
decimal code 0480
decimal code 0491
decimal code 0502
decimal code 0513
decimal code 0524
decimal code 0535
decimal code 0546
decimal code 0557
decimal code 0568
decimal code 0579
decimal code 058: <colon>
decimal code 059; <semicolon>
decimal code 060< <less than operator>
decimal code 061= <equals operator>
decimal code 062> <greater than operator>
decimal code 063? <question mark>
decimal code 064@ <commercial at sign>
decimal code 065A
decimal code 066B
decimal code 067C
decimal code 068D
decimal code 069E
decimal code 070F
decimal code 071G
decimal code 072H
decimal code 073I
decimal code 074J
decimal code 075K
decimal code 076L
decimal code 077M
decimal code 078N
decimal code 079O
decimal code 080P
decimal code 081Q
decimal code 082R
decimal code 083S
decimal code 084T
decimal code 085U
decimal code 086V
decimal code 087W
decimal code 088X
decimal code 089Y
decimal code 090Z
decimal code 091[ <left square bracket>
decimal code 092\ <backslash>
decimal code 093] <right square bracket>
decimal code 094^ <circumflex accent mark>
decimal code 095_ <underline character>
decimal code 096` <grave accent mark>
decimal code 097a
decimal code 098b
decimal code 099c
decimal code 100d
decimal code 101e
decimal code 102f
decimal code 103g
decimal code 104h
decimal code 105i
decimal code 106j
decimal code 107k
decimal code 108l
decimal code 109m
decimal code 110n
decimal code 111o
decimal code 112p
decimal code 113q
decimal code 114r
decimal code 115s
decimal code 116t
decimal code 117u
decimal code 118v
decimal code 119w
decimal code 120x
decimal code 121y
decimal code 122z
decimal code 123{ <left brace>
decimal code 124| <vertical bar>
decimal code 125} <right brace>
decimal code 126~ <tilde>

[NON-PORTABLE] The default Collation for the GRAPHIC_IRV Character set has a non-standard name because the SQL Standard requires implementors to define what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that comes with this book has a default Collation called ASCII_GRAPHIC for the GRAPHIC_IRV Character set.

LATIN1

The LATINI Character set is an 8-bit Character set that consists of all the characters commonly used in Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish -- a total of 191 characters: all of the GRAPHIC_IRV characters, plus an additional 96 characters from codepage 1252. The default Collation for the LATIN1 Character set sorts the characters in their codepage 1252 decimal order.

Additional LATIN1 characters, and their codepage 1252 Form-of-use codes (in default Collation order):

decimal code 160  <no-break space>
decimal code 161¡ <inverted exclamation mark>
decimal code 162¢ <cent sign>
decimal code 163£ <pound sign>
decimal code 164¤ <currency sign>
decimal code 165¥ <yen sign>
decimal code 166¦ <broken bar>
decimal code 167§ <section sign>
decimal code 168¨ <diaeresis>
decimal code 169© <copyright sign>
decimal code 170ª <feminine ordinal indicator>
decimal code 171« <left angle quotation mark>
decimal code 172¬ <not sign>
decimal code 173­ <soft hyphen>
decimal code 174® <registered trade mark sign>
decimal code 175¯ <macron>
decimal code 176° <degree sign>
decimal code 177± <plus-minus sign>
decimal code 178² <superscript two>
decimal code 179³ <superscript three>
decimal code 180´ <acute accent>
decimal code 181µ <micro sign>
decimal code 182¶ <pilcrow sign>
decimal code 183· <middle dot>
decimal code 184¸ <cedilla>
decimal code 185¹ <superscript one>
decimal code 186º <masculine ordinal indicator>
decimal code 187» <right angle quotation mark>
decimal code 188¼ <fraction one quarter>
decimal code 189½ <fraction one half>
decimal code 190¾ <fraction three quarters>
decimal code 191¿ <inverted question mark>
decimal code 192À <A accent grave>
decimal code 193Á <A acute accent>
decimal code 194Â <A circumflex>
decimal code 195Ã <A tilde>
decimal code 196Ä <A diaeresis>
decimal code 197Å <A circle>
decimal code 198Æ <AE ligature>
decimal code 199Ç <C cedilla>
decimal code 200È <E accent grave>
decimal code 201É <E acute accent>
decimal code 202Ê <E circumflex>
decimal code 203Ë <E diaeresis>
decimal code 204Ì <I accent grave>
decimal code 205Í <I acute accent>
decimal code 206Î <I circumflex>
decimal code 207Ï <I diaeresis>
decimal code 208Ð <Icelandic ETH>
decimal code 209Ñ <N tilde>
decimal code 210Ò <O accent grave>
decimal code 211Ó <O acute accent>
decimal code 212Ô <O circumflex>
decimal code 213Õ <O tilde>
decimal code 214Ö <O diaeresis>
decimal code 215× <multiplication sign>
decimal code 216Ø <O stroke>
decimal code 217Ù <U accent grave>
decimal code 218Ú <U acute accent>
decimal code 219Û <U circumflex>
decimal code 220Ü <U diaeresis>
decimal code 221Ý <Y acute accent>
decimal code 222Þ <THORN>
decimal code 223ß <sharp s>
decimal code 224à <a accent grave>
decimal code 225á <a acute accent>
decimal code 226â <a circumflex>
decimal code 227ã <a tilde>
decimal code 228ä <a diaeresis>
decimal code 229å <a circle>
decimal code 230æ <ae ligature>
decimal code 231ç <c cedilla>
decimal code 232è <e accent grave>
decimal code 233é <e acute accent>
decimal code 234ê <e circumflex>
decimal code 235ë <e diaeresis>
decimal code 236ì <i accent grave>
decimal code 237í <i acute accent>
decimal code 238î <i circumflex>
decimal code 239ï <i diaeresis>
decimal code 240ð <Icelandic eth>
decimal code 241ñ <n tilde>
decimal code 242ò <o accent grave>
decimal code 243ó <o acute accent>
decimal code 244ô <o circumflex>
decimal code 245õ <o tilde>
decimal code 246ö <o diaeresis>
decimal code 247÷ <division sign>
decimal code 248ø <o slash>
decimal code 249ù <u accent grave>
decimal code 250ú <u acute accent>
decimal code 251û <u circumflex>
decimal code 252ü <u diaeresis>
decimal code 253ý <y acute accent>
decimal code 254þ <thorn>
decimal code 255ÿ <y diaeresis>

[NON-PORTABLE] The default Collation for the LATIN1 Character set has a non- standard name because the SQL Standard requires implementors to define what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that comes with this book has a default Collation called LATIN1 for the LATIN1 Character set.

ISO8BIT and ASCII_FULL

The ISO8BIT Character set (ASCII_FULL is a synonym) is an 8-bit Character set that consists of 256 characters: all of the LATIN1 characters, plus the rest of the characters (including control characters and graphic characters) from codepage 1252. The default Collation for the ISO8BIT Character set sorts the characters in their codepage 1252 decimal order.

ISO8BIT characters and their codepage 1252 Form-of-use codes (in default Collation order):

decimal code 000 = control character; not displayed
decimal code 001 = control character; not displayed
decimal code 002 = control character; not displayed
decimal code 003 = control character; not displayed
decimal code 004 = control character; not displayed
decimal code 005 = control character; not displayed
decimal code 006 = control character; not displayed
decimal code 007 = control character; not displayed
decimal code 008 = control character; not displayed
decimal code 009 = control character; not displayed
decimal code 010 = control character; not displayed
decimal code 011 = control character; not displayed
decimal code 012 = control character; not displayed
decimal code 013 = control character; not displayed
decimal code 014 = control character; not displayed
decimal code 015 = control character; not displayed
decimal code 016 = control character; not displayed
decimal code 017 = control character; not displayed
decimal code 018 = control character; not displayed
decimal code 019 = control character; not displayed
decimal code 020 = control character; not displayed
decimal code 021 = control character; not displayed
decimal code 022 = control character; not displayed
decimal code 023 = control character; not displayed
decimal code 024 = control character; not displayed
decimal code 025 = control character; not displayed
decimal code 026 = control character; not displayed
decimal code 027 = control character; not displayed
decimal code 028 = control character; not displayed
decimal code 029 = control character; not displayed
decimal code 030 = control character; not displayed
decimal code 031 = control character; not displayed
decimal codes 032 through 126, from LATIN1
decimal code 127 = control character; not displayed
decimal code 128 = control character; not displayed
decimal code 129 = control character; not displayed
decimal code 130 = , <comma>
decimal code 131 = <f italics>
decimal code 132 = <close curly double quote subscript>
decimal code 133 = ... <ellipsis>
decimal code 134 = <cross>
decimal code 135 = <double cross>
decimal code 136 = <carat>
decimal code 137 = <modified percent sign>
decimal code 138 = <S hacek>
decimal code 139 = <modified less than>
decimal code 140 = <OE>
decimal code 141 = control character; not displayed
decimal code 142 = control character; not displayed
decimal code 143 = control character; not displayed
decimal code 144 = control character; not displayed
decimal code 145 = <open quote sign>
decimal code 146 = <close quote sign>
decimal code 147 = <open curly double quote>
decimal code 148 = <close curly double quote>
decimal code 149 = <dark circle>
decimal code 150 = <long hyphen>
decimal code 151 = <hyphen>
decimal code 152 = control character; not displayed
decimal code 153 = <TM sign>
decimal code 154 = <s hacek>
decimal code 155 = <modified greater than>
decimal code 156 = <oe>
decimal code 157 = control character; not displayed
decimal code 158 = control character; not displayed
decimal code 159 = <Y umlaut>
decimal codes 160 through 256, from LATIN1

[NON-PORTABLE] The default Collation for the ISO8BIT Character set has a non- standard name because the SQL Standard requires implementors to define what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that comes with this book has a default Collation called ISO8BIT for the ISO8BIT Character set.

UNICODE and ISO10646

The UNICODE Character set (ISO10646 is a synonym) is a 16-bit Character set that consists of every character represented by the Unicode specification (specifically, by The Unicode Standard Version 2.1). Since there are 38,887 characters in this set, we won't include a list in this book. It suffices to say that the UNICODE Character set includes all of the ISO8BIT characters, plus many others. The default Collation for the UNICODE Character set sorts the characters in their Unicode Form-of-use code order.

[NON-PORTABLE] The default Collation for the UNICODE Character set has a non- standard name because the SQL Standard requires implementors to define what a legal <Collation name> is. [OCELOT Implementation] The OCELOT DBMS that comes with this book has a default Collation called SQL_TEXT for the UNICODE Character set.

SQL_TEXT

The SQL Standard requires every DBMS to provide an Ur-Character set named SQL_TEXT. The complete set of SQL_TEXT characters is implementation defined, but SQL_TEXT must contain every <SQL language character> plus every other character that the DBMS supports -- that is, SQL_TEXT (at a minimum) includes all of the UNICODE characters and is therefore at least a 16-bit Character set.

[NON-PORTABLE] The default Collation for the SQL_TEXT Character set must be called SQL_TEXT too, but is otherwise non-standard because the SQL Standard requires implementors to define the default Collation and Form-of-use of SQL_TEXT. [OCELOT Implementation] The OCELOT DBMS that comes with this book sorts the characters of SQL_TEXT in their Unicode Form-of-use code order.

There are some curious consequences to making SQL_TEXT (and UNICODE) a 16-bit, rather than an 8-bit Character set.

The consequences are the result of the fact that all <identifier>s and other character strings in INFORMATION_SCHEMA, all <SQL-server name>s, all <Connection name>s, all <AuthorizationID>s, and the results of many functions (e.g.: USER, CURRENT_USER, SESSION_USER, SYSTEM_USER) are SQL_TEXT character strings.

SQL_TEXT, however, is normally not the default Character set (the default Character set is almost always an 8-bit Character set) -- and SQL specifically prohibits comparisons between strings that belong to different Character sets. (The rules require that, for such comparisons, one string be translated to the other string's Character set first -- see the TRANSLATE function" in Chapter 7, "Character Strings".)

So, will you have special trouble when you want to display INFORMATION_SCHEMA strings, <AuthorizationID>s, <SQL-server name>s, <Connection name>s or the like? There are two possibilities, and the answer is usually "NO" to both.

Possibility 1

Consider the following SQL statements:

SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES;
SELECT SESSION_USER FROM INFORMATION_SCHEMA.TABLES;

The SQL Standard allows a DBMS to decide whether there should be an automatic translation from SQL_TEXT to the default Character set when an SQL statement assigns an SQL_TEXT string to a host language variable or parameter. In such cases, most DBMSs do the automatic translation for you (as, for example, the OCELOT DBMS that comes with this book does) -- so, to display the result of a SELECT from INFORMATION_SCHEMA or the result from a function that returns an SQL_TEXT string, you won't need to do anything special. It won't normally be necessary to use the TRANSLATE function to translate the retrieved values from 16-bit SQL_TEXT characters into 8-bit, readable format (or vice versa) -- just write your SQL statement as if the Character set difference doesn't exist. (In the example above, by the way, the ASCII <identifier> TABLE_NAME would also be automatically converted to a SQL_TEXT string by your DBMS before the search was done.)

Possibility 2

Consider this SQL statement:

   SELECT TABLE_NAME
   FROM   INFORMATION_SCHEMA.TABLES
   WHERE  TABLE_NAME = 'ocelot';

The SQL Standard does not allow a DBMS to decide whether there should be an automatic translation from SQL_TEXT to the default Character set when an SQL statement involves a comparison between a SQL_TEXT string and a default Character set string -- but the rules do allow a DBMS to provide a predefined Translation for this purpose. In such cases, most DBMSs will use this option to do an automatic translation for you (as, for example, the OCELOT DBMS that comes with this book does) -- so, to compare SQL_TEXT strings with default Character set strings, you won't need to do anything special. It won't normally be necessary to use the TRANSLATE function to translate the retrieved values from 16-bit SQL_TEXT characters into 8-bit, readable format (or vice versa) -- just write your SQL statement as if the Character set difference doesn't exist.

A second consequence of the fact that SQL_TEXT is 16-bit and the default Character set is usually 8-bit involves the use of <AuthorizationID>s, <SQL-server name>s and <Connection name>s. Although <AuthorizationID>s, <SQL-server name>s and <Connection name>s are stored as 16-bit SQL_TEXT strings, the SQL-Connection arguments in CONNECT TO are inevitably in 8-bit sets, since ASCII is used to type them. If, however, these same arguments are assigned to host language variable parameters, you will see them as 16-bit strings. Therefore, when you have an expression like this:

... WHERE  = 'x' ...

you know that the <character string literal> 'x' is a 16-bit string because <Authorization ID>s are SQL_TEXT strings. But the same <literal> in a CONNECT statement, e.g.:

CONNECT TO 'x';

has to be an 8-bit string because it is typed in ASCII.

One last comment, about parsing SQL_TEXT strings. Consider the following expression, which concatenates a character string with a value from a Column:

... _INFORMATION_SCHEMA.SQL_TEXT 'A ' || column_1

The default Character set for an SQL statement is usually an 8-bit Character set, but the string 'A ' in this example is explicitly introduced as 16-bit SQL_TEXT. Thus, what appears to be two 8-bit characters -- A and <space> -- is, in fact, one 16-bit character, the Unicode value 4120h.

Parsing of SQL statements is always based on 8-bit values regardless of the Character set of a string, so your DBMS will consider the SQL_TEXT string 'A ' to begin after the first single quote mark and end before the second single quote mark; i.e.: although the characters inside the single quotes are 16-bit, the delimiting single quote marks themselves are 8-bit. Thus, when the parser reaches the second single quote mark it knows that the string is over -- despite the fact that in 16-bit, a single quote mark character has not yet been found.

Comments

Comments loading...