• Georg Cantor was a German.
  • He invented Set Theory.
  • He was committed to a mental institution.
  • He died in 1918.

We can explain the preceding statements using Georg Cantor's own theory: a set is any collection of definite distinguishable things. We can conceive of the set as a whole, and in fact we often do: for example, we speak of "the Germans" (a set) and can rephrase our first statement as "Georg Cantor was a member (or element) of the set of Germans". By rephrasing, we emphasize the collection of individual things over the individual things themselves. That much is intuitive. But Cantor was careful in his choice of words. By "distinguishable" (or distinct), he meant that in looking at any two things which fit in the set, we must be able to decide whether they are different. By "definite", he meant that if we know what the set is and we know what the thing is, we can decide whether the thing is a member of the set. Therefore, to know what a set "is", it is sufficient to know what the members are.

Here are a few examples. The "Germans" set also included Kaiser Wilhelm. However, it can be proved from historical records that Cantor was not a pseudonym or alias that the Kaiser used while off duty -- therefore, the two members are distinguishable. At the same time, we know that there were several million other Germans, also distinguishable, and we could define the set by taking a census of all the Germans. There might be some difficulty deciding the individual question "What is a German?", but once that was done there would be no difficulty deciding the collective question "What are the Germans?". Therefore, the members define the set, i.e., the members are definite.

The census we spoke of is possible because the Germans were a finite set (a fact which would have bored Cantor because he developed his theory to explain various gradations of infinity).

We could enumerate the set thus:

{Georg Cantor, Kaiser Wilhelm, ...}

In this enumeration, we used braces to indicate "we are enumerating a set" and an ellipsis to indicate "and so on" -- that is, the list could go on but we felt it unnecessary to continue for the sake of our exposition. These are standard conventions and we will use braces and ellipses again.

Enumeration is unwieldy for large sets, so let us revisit the question "What is a German?" by taking the bounds stated in the song "Deutschland Ueber Alles" -- a German is a person living between the Maas, the Memel, the Esch and the Belt (four bodies of water which now lie respectively in Holland, Russia, Austria, and Denmark). In Cantor's terms, that formula expresses a defining property. It is either true or it is false. If it is true, the person is in the set. If it is false, the person is outside the set.

Without stating the defining property in advance, the German census-takers would be unable to put together their forms and plan their information collection. The objective, though, is to produce an enumeration of all Germans. In computer terminology, we call the definition the database design and the enumeration the database itself. The "Germans" set can be broken up into several subsets such as:

{Berliners, Frankfurters, Hamburgers, ...}

These subsets are also sets, with defining properties (city of residence), and presumably, with members -- but that is not necessary. For example, the set of Germans who live in the Rhine River is an empty set, i.e., it has no members, but it is still a set. The implicit breaking-up that happens when we ask "Among the Germans which ones are Frankfurters?" is an example of a set operation. A set operation is something we do with sets that results in the production of more sets.

Relations

First, let's consider first a binary relation -- that is, a relation between two things. The things don't have to be of the same type; all we are concerned with is that they have some type of bond, and that they be in order. Getting back to our hero ... there is a relationship between Georg Cantor and the concept Set Theory (he invented it). There is also a relationship between Kaiser Wilhelm and World War I (he started it). We could use our braces notation to enumerate this:

{ (Georg Cantor, Set Theory), {Kaiser Wilhelm, World War I) }

but it looks clearer when diagrammed as a two-dimensional Table:

NAMEACTIVITY
Georg CantorSet Theory
Kaiser WilhelmWorld War I

There are some points to note about this diagram.

  1. The Table shows ordered pairs -- we couldn't reverse them because Set Theory didn't invent Georg Cantor and World War I didn't cause Kaiser Wilhelm -- the relationship between the NAME and ACTIVITY values is directional. Note, however, that the word "ordered" refers only to the horizontal order in the illustration -- across the relation. We don't care which member of the set is listed first.
  2. The Table shows binary pairs -- there is no space on the line for marking Georg Cantor's other achievements. Under "What did he do?" we can only express one thing.

So: ordered means ordered, and pair means pair.

So what, precisely, is the "relation" here? Well, it's the whole thing. The relationship is the set of all the ordered pairs, and the ordering itself (i.e., how part A relates to part B). What we have in the preceding diagram is a picture of a relation and nothing but a relation.

This relation is a set. It has members. The members define the set. However, the members are no longer "elements", but ordered pairs. There's no reason to limit ourselves to ordered pairs, though. That's just the simplest relation, the binary relation, which is sometimes called "a relation of degree 2" (because there are two columns). We could have relations of degree 3, degree 4, degree 5, degree 6, degree 7, and so on, i.e., relations which are not binary but triple, quadruple, pentuple, sextuple, septuple ... Did you notice how after a while all the words ended in "-tuple"? That's why the general term for a relation with n elements is n-tuple. Here is a relation of degree 4 showing all the information we have so far:

NAMEACTIVITYRESIDENCEDATE_OF_DEATH
Georg CantorSet TheoryMental Institution1918
Kaiser WilhelmWorld War IImperial Palace????
...

Some differences now appear between the words Cantor used ("set theory terminology") and the words we use ("database terminology"). The line:

{Georg Cantor,Set Theory,Mental Institution,1918}

would be one tuple to Cantor, but in SQL, we prefer the word row. Or, to be precise: the row value is the four-element:

{Georg Cantor,Set Theory,Mental Institution,1918)

and the row is the box that holds that information. (We don't usually need to be so precise in ordinary discussion.) Meanwhile, going down rather than across, Cantor would say that we have four attributes, labelled NAME, ACTIVITY, RESIDENCE and DATE_OF_DEATH. But in SQL, we prefer the word column instead and we call each element going down (such as 'Georg Cantor' or 'Kaiser Wilhelm') a column value.

Here's a quick summary. Moving across the relation are tuples (but don't use that word) or rows. Moving up-and-down the relation are attributes (but don't use that word) or columns. The contents of a row is a row value. The intersection of a row and a column is a column value. The column value is "atomic" -- it has only one element. There is always exactly one column value in each atomic box that's formed by the intersection of a row with a column.

Incidentally, in the diagram we used an ellipsis once more, to mean "and so on" -- there are more Germans in the set. We also used the symbol "????" for the column value of "When did he die?" for Kaiser Wilhelm. This is not a standard symbol -- there is no standard way of expressing the fact that not only do we not know when Kaiser Wilhelm died (i.e., "value is Unknown"), but we're not even sure that he's dead (i.e., "category is Not Applicable"). This is not a problem for Set Theory, but it is a problem in practice and we'll return to it later. When we do, we'll call the "????" a NULL, or null value.

And there we have it. A relation is an ordered n-tuple set, with all members having the same degree, which is representable as a table of rows and columns. It really does seem appropriate that we should know what a relation is, because (we trust the following is not an unpleasant surprise) SQL is a relational database management system, and that means the database is made of relations. To quote the Kellogg's Rice Krispies(tm) commercial, "What the heck did you think they were made of?"

Admittedly, when anything looks that obvious, it must be a lie. And it is. Relational databases are actually made of tables, rather than relations. What's the difference? Well, with a table we can have two Georg Cantors. This clearly breaks Cantor's "distinguishable" rule, and it's quite a big break; we all know that the famous "Set Of Integers" doesn't go

{1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,...}

By definition, a set has no duplicates. A relation is a set, so a relation has no duplicates. But a table can have duplicates. This is a concession to practicalities, since we know that duplicate data happens in the "real world". However, if our database contains two Georg Cantors who really are the same person, then we have an error -- while if it contains two Georg Cantors whom we can't distinguish from one another, then we have a design flaw. The long and the short of it all is: (a) a relational database consists of tables, (b) a table is not a relation, but the only difference between them is that a table may have rows with duplicate row values, (c) you should get rid of duplicates regardless, therefore (d) therefore all relational databases should consist of relations. (By the way, a table is sometimes called a multiset to distinguish it from a regular set. You'll see the word multiset in some Microsoft publications.)

Set Operations

Since SQL operates on sets, there must be such things as "set operations" that SQL can perform. In fact, SQL borrows all the standard operations from Set Theory textbooks, along with the terminology. Mercifully, it does not also borrow the notation. So to describe the set operations in this book, we've been able to use some English-looking terms which, in fact, have fairly precise meanings in Set Theory, and some English-looking terms which, in fact, are SQL words for our notation. In the following quick introduction, we will describe a sequence of set operations. A sequence, also known as a series, is an ordered set. The order that we'll follow is the order of execution of the clauses of the most famous of SQL statement types: the query, or SELECT statement.

Identity

The easiest operation is take a copy of the table we started with:

NAMEACTIVITY
Georg CantorSet Theory
Kaiser WilhelmWorld War I

The SQL for this is:

... FROM Germans ...

Because this is the first time we've said something "in SQL", we'll regard it as our equivalent of the famous "hello world" -- so, some introductions are in order. We've used the ellipsis before. As usual, it means "and so on", so you can see that there's some stuff both before and after FROM Germans that we're not going to spell out just here. We need the ellipsis to indicate that we're not illustrating a whole statement, merely one clause, called the FROM clause. The word FROM is an SQL keyword and the word Germans is an SQL Table name. So: the FROM clause takes the Germans table and ends up with a result table that is the same as the Germans table.

Product

In the history of the world, there have been four great Mathematicians:

{Al-Khwarezm, Georg Cantor, Leonhard Euler, Rene Descartes}

If we put that side-by-side with our original Germans set, we get:

MATHEMATICIANSGERMANS
Al-KhwarezmGeorg Cantor
Georg CantorKaiser Wilhelm
Leonhard Euler
Rene Descartes

The Cartesian product operation yields the set of all pairs (x,y) such that x is a member of some set X and y is a member of some set Y. The result of GERMANS [Cartesian Product] MATHEMATICIANS is the following binary relation.

CARTESIAN_PRODUCT

GERMANS.NAMEMATHEMATICIANS.NAME
Al-KhwarezmGeorg Cantor
Georg CantorGeorg Cantor
Leonhard EulerGeorg Cantor
Rene DescartesGeorg Cantor
Al-KhwarezmKaiser Wilhelm
Georg CantorKaiser Wilhelm
Leonhard EulerKaiser Wilhelm
Rene DescartesKaiser Wilhelm

You must not object that Al-Khwarezm is unrelated to Kaiser Wilhelm, because the spirit of a Cartesian-product relation is that everything relates to everything. The table above is a relation -- it's mathematically valid (even if it doesn't seem to make any sense!). There are several ways to express this operation in SQL. The classic style is:

   ... FROM Germans, Mathematicians ...

That is, within a FROM clause the expression "table-name-1 , table-name-2" means "yield a table which is the Cartesian product of table-name-1 and table-name-2".

Search Condition

In our Cartesian-product relation, there is something special about the row:

   (Georg Cantor,Georg Cantor)

That row, and that row only, has two column values that are the same. This is significant because the columns are defined in a meaningful way: one 'Georg Cantor' is the name of a German, while the other 'Georg Cantor' is the name of a mathematician. Therefore, Georg Cantor must be the only person who is both a German and a mathematician. (This assumes that names are unique.) And again, there is an SQL way to say this:

... FROM  Germans, Mathematicians
    WHERE Germans.name = Mathematicians.name ...

We now have two clauses in our example. The FROM clause exploded the tables into a Cartesian product relation. The WHERE clause reduces the result to a subset of that relation. The result subset is also a relation -- it contains only those rows where this condition is true: "the name equals the name". (This is known as a search condition in official-SQL vocabulary.) Now we have this result:

CARTESIAN_PRODUCTS

GERMANS.NAMEMATHEMATICIANS.NAME
Georg CantorGeorg Cantor

So: the WHERE clause contains a search condition, takes as input the (table) result of the FROM clause, and produces as output a table which contains only those rows where the search condition is TRUE.

Join

In this example so far, the FROM clause contained two table names and the WHERE clause contained a comparison between columns from each table. This two-step process is usually called a join. In modern SQL, there are several ways to ask for a join. We have started with the oldest and best-known way, because it best illustrates that a typical join is two separate operations.

Projection

Our result table now has one row with two columns, and the value in each column is 'Georg Cantor' -- but we only need one column to answer the question "Who is German and a mathematician?". If you think of the earlier search condition as being an operation which picks out certain rows, it's easy to get to the next step. A projection is a complementary operation which picks out certain columns. In SQL, we just list the columns we want:

SELECT Germans.name
FROM   Germans, Mathematicians
WHERE  Germans.name = Mathematicians.name ...

The column reference after the keyword SELECT is called a select list. We now have a result table that looks like this:

CARTESIAN_PRODUCT

GERMANS.NAME
Georg Cantor

So, the select list does a projection on the table produced by the WHERE clause. The result is (as always) a table. The words SELECT Germans.name FROM Germans, Mathematicians WHERE Germans.name = Mathematicians.name constitute a valid and complete SQL statement. Specifically, this sort of statement is called a query, presumably because it translates a question (in this case, "What Germans are mathematicians?"). It's also commonly called the SELECT statement.

Other Set Operations

SQL also handles the standard set operations intersect, union, and except. The following is an example of each: two example tables (the "input"), one result table (the "output"), and the SQL statement that makes it happen. To make the examples short, we've used unrealistically small examples with one column per table and no WHERE clauses, but don't worry about syntactical details or the exact workings of the operations -- that comes later. You understand enough so far if you grasp that some well-known set operations can be expressed in SQL, and work on tables.

InputsSQL queryOutput
INTEGERS_1
COLUMN_1
5
4
13
INTEGERS_2
COLUMN_1
33
14
4
SELECT column_1
FROM Integers_1
INTERSECT
SELECT column_1
FROM Integers_2
INTEGERS_3
COLUMN_1
4
STRINGS_1
COLUMN_1
Spain
Greece
Yugoslavia
STRINGS_2
COLUMN_1
Italy
Denmark
Belgium
SELECT column_1
FROM Strings_1
UNION
SELECT column_1
FROM Strings_2
STRINGS_3
COLUMN_1
Spain
Greece
Italy
Denmark
Yugoslavia
Belgium
DECIMALS_1
COLUMN_1
5.32
4.17
13.99
DECIMALS_2
COLUMN_1
33.08
14.00
4.17
SELECT column_1
FROM Decimals_1
EXCEPT
SELECT column_1
FROM Decimals_2
DECIMALS_3
COLUMN_1
5.32
13.99

Therefore God Exists

Leonhard Euler is famed for his remark that "(a+b)**n/n=X, therefore God exists", which shows that any argument looks imposing if you use lots of math symbols. SQL does the opposite. It has a solid base on a mathematical theory, but the operations of Set Theory are hidden behind a somewhat English-like sentence structure. The result is, on balance, for the good. Most SQL beginners have an easy time grasping the concepts behind WHERE clauses or select lists, while they might have a less easy time with the dense polysymbolic notation of standard Set Theory. Unfortunately, the SQL language hides the operations so well that frequent delusions arise:

  1. SQL is a "nonprocedural" language. Perhaps people get this idea from the fact that any operation on a set should affect all set members simultaneously. But the set operations themselves are ordered. One step follows another.
  2. A whole SQL query operates on the tables in the FROM clause. This mode of thinking causes people to make certain common errors which could be avoided if they kept in mind the truth; that each set operation produces a new, nameless, "virtual" table and passes it on. (Well, perhaps we should add "conceptually" -- there will be hundreds of times in this book that we could add "conceptually" because your DBMS may do internal things in some out-of-order sequence, provided the method doesn't affect the results. That is not our present concern. What we must know from the outset is how a human being is supposed to view the operations.)

Coming away from this chapter, you only need to know that SQL has a specialized vocabulary which to a large extent arises from Set Theory; and that SQL operations happen on tables. You should regard this as background information. This book takes a bottom-up approach, starting with the smallest units (the column and row values), so it will be several chapters before we reach the top, and begin to emphasize the sets once more.

Note:

Portions of the text in this entry are Copyright © 1999 by Ocelot Computer Services Incorporated. Used by permission.

Comments

Comments loading...