- Georg Cantor was a German.
- He invented Set Theory.
- He was committed to a mental institution.
- He died in 1918.
We can explain the preceding statements using Georg Cantor's own theory: a set is any collection of definite distinguishable things. We can conceive of the set as a whole, and in fact we often do: for example, we speak of "the Germans" (a set) and can rephrase our first statement as "Georg Cantor was a member (or element) of the set of Germans". By rephrasing, we emphasize the collection of individual things over the individual things themselves. That much is intuitive. But Cantor was careful in his choice of words. By "distinguishable" (or distinct), he meant that in looking at any two things which fit in the set, we must be able to decide whether they are different. By "definite", he meant that if we know what the set is and we know what the thing is, we can decide whether the thing is a member of the set. Therefore, to know what a set "is", it is sufficient to know what the members are.
Here are a few examples. The "Germans" set also included Kaiser Wilhelm. However, it can be proved from historical records that Cantor was not a pseudonym or alias that the Kaiser used while off duty -- therefore, the two members are distinguishable. At the same time, we know that there were several million other Germans, also distinguishable, and we could define the set by taking a census of all the Germans. There might be some difficulty deciding the individual question "What is a German?", but once that was done there would be no difficulty deciding the collective question "What are the Germans?". Therefore, the members define the set, i.e., the members are definite.
The census we spoke of is possible because the Germans were a finite set (a fact which would have bored Cantor because he developed his theory to explain various gradations of infinity).
We could enumerate the set thus:
{Georg Cantor, Kaiser Wilhelm, ...}In this enumeration, we used braces to indicate "we are enumerating a set" and an ellipsis to indicate "and so on" -- that is, the list could go on but we felt it unnecessary to continue for the sake of our exposition. These are standard conventions and we will use braces and ellipses again.
Enumeration is unwieldy for large sets, so let us revisit the question "What is a German?" by taking the bounds stated in the song "Deutschland Ueber Alles" -- a German is a person living between the Maas, the Memel, the Esch and the Belt (four bodies of water which now lie respectively in Holland, Russia, Austria, and Denmark). In Cantor's terms, that formula expresses a defining property. It is either true or it is false. If it is true, the person is in the set. If it is false, the person is outside the set.
Without stating the defining property in advance, the German census-takers would be unable to put together their forms and plan their information collection. The objective, though, is to produce an enumeration of all Germans. In computer terminology, we call the definition the database design and the enumeration the database itself. The "Germans" set can be broken up into several subsets such as:
{Berliners, Frankfurters, Hamburgers, ...}These subsets are also sets, with defining properties (city of residence), and presumably, with members -- but that is not necessary. For example, the set of Germans who live in the Rhine River is an empty set, i.e., it has no members, but it is still a set. The implicit breaking-up that happens when we ask "Among the Germans which ones are Frankfurters?" is an example of a set operation. A set operation is something we do with sets that results in the production of more sets.
Relations
First, let's consider first a binary relation -- that is, a relation between two things. The things don't have to be of the same type; all we are concerned with is that they have some type of bond, and that they be in order. Getting back to our hero ... there is a relationship between Georg Cantor and the concept Set Theory (he invented it). There is also a relationship between Kaiser Wilhelm and World War I (he started it). We could use our braces notation to enumerate this:
{ (Georg Cantor, Set Theory), {Kaiser Wilhelm, World War I) }but it looks clearer when diagrammed as a two-dimensional Table:
| NAME | ACTIVITY |
|---|---|
| Georg Cantor | Set Theory |
| Kaiser Wilhelm | World War I |
There are some points to note about this diagram.
- The Table shows ordered pairs -- we couldn't reverse them because Set Theory
didn't invent Georg Cantor and World War I didn't cause Kaiser Wilhelm -- the
relationship between the
NAMEandACTIVITYvalues is directional. Note, however, that the word "ordered" refers only to the horizontal order in the illustration -- across the relation. We don't care which member of the set is listed first. - The Table shows binary pairs -- there is no space on the line for marking Georg Cantor's other achievements. Under "What did he do?" we can only express one thing.
So: ordered means ordered, and pair means pair.
So what, precisely, is the "relation" here? Well, it's the whole thing. The relationship is the set of all the ordered pairs, and the ordering itself (i.e., how part A relates to part B). What we have in the preceding diagram is a picture of a relation and nothing but a relation.
This relation is a set. It has members. The members define the set. However, the members are no longer "elements", but ordered pairs. There's no reason to limit ourselves to ordered pairs, though. That's just the simplest relation, the binary relation, which is sometimes called "a relation of degree 2" (because there are two columns). We could have relations of degree 3, degree 4, degree 5, degree 6, degree 7, and so on, i.e., relations which are not binary but triple, quadruple, pentuple, sextuple, septuple ... Did you notice how after a while all the words ended in "-tuple"? That's why the general term for a relation with n elements is n-tuple. Here is a relation of degree 4 showing all the information we have so far:
| NAME | ACTIVITY | RESIDENCE | DATE_OF_DEATH |
|---|---|---|---|
| Georg Cantor | Set Theory | Mental Institution | 1918 |
| Kaiser Wilhelm | World War I | Imperial Palace | ???? |
| ... |
Some differences now appear between the words Cantor used ("set theory terminology") and the words we use ("database terminology"). The line:
{Georg Cantor,Set Theory,Mental Institution,1918}would be one tuple to Cantor, but in SQL, we prefer the word row. Or, to be precise: the row value is the four-element:
{Georg Cantor,Set Theory,Mental Institution,1918)and the row is the box that holds that information. (We don't usually need to
be so precise in ordinary discussion.) Meanwhile, going down rather than
across, Cantor would say that we have four attributes, labelled
NAME, ACTIVITY,
RESIDENCE and DATE_OF_DEATH. But in SQL,
we prefer the word column instead and we call each element going down (such as
'Georg Cantor' or 'Kaiser Wilhelm') a column value.
Here's a quick summary. Moving across the relation are tuples (but don't use that word) or rows. Moving up-and-down the relation are attributes (but don't use that word) or columns. The contents of a row is a row value. The intersection of a row and a column is a column value. The column value is "atomic" -- it has only one element. There is always exactly one column value in each atomic box that's formed by the intersection of a row with a column.
Incidentally, in the diagram we used an ellipsis once more, to mean "and so on" -- there are more Germans in the set. We also used the symbol "????" for the column value of "When did he die?" for Kaiser Wilhelm. This is not a standard symbol -- there is no standard way of expressing the fact that not only do we not know when Kaiser Wilhelm died (i.e., "value is Unknown"), but we're not even sure that he's dead (i.e., "category is Not Applicable"). This is not a problem for Set Theory, but it is a problem in practice and we'll return to it later. When we do, we'll call the "????" a NULL, or null value.
And there we have it. A relation is an ordered n-tuple set, with all members having the same degree, which is representable as a table of rows and columns. It really does seem appropriate that we should know what a relation is, because (we trust the following is not an unpleasant surprise) SQL is a relational database management system, and that means the database is made of relations. To quote the Kellogg's Rice Krispies(tm) commercial, "What the heck did you think they were made of?"
Admittedly, when anything looks that obvious, it must be a lie. And it is. Relational databases are actually made of tables, rather than relations. What's the difference? Well, with a table we can have two Georg Cantors. This clearly breaks Cantor's "distinguishable" rule, and it's quite a big break; we all know that the famous "Set Of Integers" doesn't go
{1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,...}By definition, a set has no duplicates. A relation is a set, so a relation has no duplicates. But a table can have duplicates. This is a concession to practicalities, since we know that duplicate data happens in the "real world". However, if our database contains two Georg Cantors who really are the same person, then we have an error -- while if it contains two Georg Cantors whom we can't distinguish from one another, then we have a design flaw. The long and the short of it all is: (a) a relational database consists of tables, (b) a table is not a relation, but the only difference between them is that a table may have rows with duplicate row values, (c) you should get rid of duplicates regardless, therefore (d) therefore all relational databases should consist of relations. (By the way, a table is sometimes called a multiset to distinguish it from a regular set. You'll see the word multiset in some Microsoft publications.)
Set Operations
Since SQL operates on sets, there must be such things as "set operations" that
SQL can perform. In fact, SQL borrows all the standard operations from Set
Theory textbooks, along with the terminology. Mercifully, it does not also
borrow the notation. So to describe the set operations in this book, we've been
able to use some English-looking terms which, in fact, have fairly precise
meanings in Set Theory, and some English-looking terms which, in fact, are SQL
words for our notation. In the following quick introduction, we will describe a
sequence of set operations. A sequence, also known as a series, is an
ordered set. The order that we'll follow is the order of execution of the
clauses of the most famous of SQL statement types: the query, or
SELECT statement.
Identity
The easiest operation is take a copy of the table we started with:
| NAME | ACTIVITY |
|---|---|
| Georg Cantor | Set Theory |
| Kaiser Wilhelm | World War I |
The SQL for this is:
... FROM Germans ...
Because this is the first time we've said something "in SQL", we'll regard it
as our equivalent of the famous "hello world" -- so, some introductions are in
order. We've used the ellipsis before. As usual, it means "and so on", so you
can see that there's some stuff both before and after
FROM Germans that we're not going to
spell out just here. We need the ellipsis to indicate that we're not
illustrating a whole statement, merely one clause, called the
FROM clause. The word
FROM is an SQL keyword and the word
Germans is an SQL Table name. So: the
FROM clause takes the Germans table
and ends up with a result table that is the same as the Germans table.
Product
In the history of the world, there have been four great Mathematicians:
{Al-Khwarezm, Georg Cantor, Leonhard Euler, Rene Descartes}If we put that side-by-side with our original Germans set, we get:
| MATHEMATICIANS | GERMANS |
|---|---|
| Al-Khwarezm | Georg Cantor |
| Georg Cantor | Kaiser Wilhelm |
| Leonhard Euler | |
| Rene Descartes |
The Cartesian product operation yields the set of all pairs (x,y) such that x is a member of some set X and y is a member of some set Y. The result of GERMANS [Cartesian Product] MATHEMATICIANS is the following binary relation.
CARTESIAN_PRODUCT
| GERMANS.NAME | MATHEMATICIANS.NAME |
|---|---|
| Al-Khwarezm | Georg Cantor |
| Georg Cantor | Georg Cantor |
| Leonhard Euler | Georg Cantor |
| Rene Descartes | Georg Cantor |
| Al-Khwarezm | Kaiser Wilhelm |
| Georg Cantor | Kaiser Wilhelm |
| Leonhard Euler | Kaiser Wilhelm |
| Rene Descartes | Kaiser Wilhelm |
You must not object that Al-Khwarezm is unrelated to Kaiser Wilhelm, because the spirit of a Cartesian-product relation is that everything relates to everything. The table above is a relation -- it's mathematically valid (even if it doesn't seem to make any sense!). There are several ways to express this operation in SQL. The classic style is:
... FROM Germans, Mathematicians ...
That is, within a FROM clause the expression "table-name-1 ,
table-name-2" means "yield a table which is the Cartesian product of
table-name-1 and table-name-2".
Search Condition
In our Cartesian-product relation, there is something special about the row:
(Georg Cantor,Georg Cantor)
That row, and that row only, has two column values that are the same. This is significant because the columns are defined in a meaningful way: one 'Georg Cantor' is the name of a German, while the other 'Georg Cantor' is the name of a mathematician. Therefore, Georg Cantor must be the only person who is both a German and a mathematician. (This assumes that names are unique.) And again, there is an SQL way to say this:
... FROM Germans, Mathematicians WHERE Germans.name = Mathematicians.name ...
We now have two clauses in our example. The FROM clause exploded the tables into a Cartesian product relation. The WHERE clause reduces the result to a subset of that relation. The result subset is also a relation -- it contains only those rows where this condition is true: "the name equals the name". (This is known as a search condition in official-SQL vocabulary.) Now we have this result:
CARTESIAN_PRODUCTS
| GERMANS.NAME | MATHEMATICIANS.NAME |
|---|---|
| Georg Cantor | Georg Cantor |
So: the WHERE clause contains a search condition, takes as input the (table)
result of the FROM clause, and produces as output a table which contains only
those rows where the search condition is TRUE.
Join
In this example so far, the FROM clause contained two table
names and the WHERE clause contained a comparison between
columns from each table. This two-step process is usually called a join. In
modern SQL, there are several ways to ask for a join. We have started with the
oldest and best-known way, because it best illustrates that a typical join is
two separate operations.
Projection
Our result table now has one row with two columns, and the value in each column is 'Georg Cantor' -- but we only need one column to answer the question "Who is German and a mathematician?". If you think of the earlier search condition as being an operation which picks out certain rows, it's easy to get to the next step. A projection is a complementary operation which picks out certain columns. In SQL, we just list the columns we want:
SELECT Germans.name FROM Germans, Mathematicians WHERE Germans.name = Mathematicians.name ...
The column reference after the keyword SELECT is called a
select list. We now have a result table that looks like this:
CARTESIAN_PRODUCT
| GERMANS.NAME |
|---|
| Georg Cantor |
So, the select list does a projection on the table produced by the WHERE
clause. The result is (as always) a table. The words
SELECT Germans.name FROM Germans, Mathematicians WHERE Germans.name = Mathematicians.name
constitute a valid and complete SQL statement. Specifically, this sort of
statement is called a query, presumably because it translates a question (in
this case, "What Germans are mathematicians?"). It's also commonly called the
SELECT statement.
Other Set Operations
SQL also handles the standard set operations intersect, union, and except. The
following is an example of each: two example tables (the "input"), one result
table (the "output"), and the SQL statement that makes it happen. To make the
examples short, we've used unrealistically small examples with one column per
table and no WHERE clauses, but don't worry about syntactical
details or the exact workings of the operations -- that comes later. You
understand enough so far if you grasp that some well-known set operations can
be expressed in SQL, and work on tables.
| Inputs | SQL query | Output | |
|---|---|---|---|
INTEGERS_1COLUMN_15413 | INTEGERS_2COLUMN_133144 | SELECT column_1FROM Integers_1INTERSECTSELECT column_1FROM Integers_2 | INTEGERS_3COLUMN_14 |
STRINGS_1COLUMN_1SpainGreeceYugoslavia | STRINGS_2COLUMN_1ItalyDenmarkBelgium | SELECT column_1FROM Strings_1UNIONSELECT column_1FROM Strings_2 | STRINGS_3COLUMN_1SpainGreeceItalyDenmarkYugoslaviaBelgium |
DECIMALS_1COLUMN_1 5.32 4.1713.99 | DECIMALS_2COLUMN_133.0814.00 4.17 | SELECT column_1FROM Decimals_1EXCEPTSELECT column_1FROM Decimals_2 | DECIMALS_3COLUMN_1 5.3213.99 |
Therefore God Exists
Leonhard Euler is famed for his remark that
"(a+b)**n/n=X, therefore God exists", which shows that any argument looks
imposing if you use lots of math symbols. SQL does the opposite. It has a solid
base on a mathematical theory, but the operations of Set Theory are hidden
behind a somewhat English-like sentence structure. The result is, on balance,
for the good. Most SQL beginners have an easy time grasping the concepts behind
WHERE clauses or select lists, while they might have a less easy time with
the dense polysymbolic notation of standard Set Theory. Unfortunately, the SQL
language hides the operations so well that frequent delusions arise:
- SQL is a "nonprocedural" language. Perhaps people get this idea from the fact that any operation on a set should affect all set members simultaneously. But the set operations themselves are ordered. One step follows another.
- A whole SQL query operates on the tables in the
FROMclause. This mode of thinking causes people to make certain common errors which could be avoided if they kept in mind the truth; that each set operation produces a new, nameless, "virtual" table and passes it on. (Well, perhaps we should add "conceptually" -- there will be hundreds of times in this book that we could add "conceptually" because your DBMS may do internal things in some out-of-order sequence, provided the method doesn't affect the results. That is not our present concern. What we must know from the outset is how a human being is supposed to view the operations.)
Coming away from this chapter, you only need to know that SQL has a specialized vocabulary which to a large extent arises from Set Theory; and that SQL operations happen on tables. You should regard this as background information. This book takes a bottom-up approach, starting with the smallest units (the column and row values), so it will be several chapters before we reach the top, and begin to emphasize the sets once more.
Note:
Portions of the text in this entry are Copyright © 1999 by Ocelot Computer Services Incorporated. Used by permission.