A relational database is one which appears to be nothing more than a collection of tables. The model it is based on has three major aspects: the structure, the manipulation, and the integrity of data.
In the relational model, data are logically presented in two- dimensional tables made up of columns and rows. The rows of a table consist of a collection of values that describe an entity; for example, an employee. The columns of a table consist of a collection of similar data among rows; e.g., employee surnames. Operations on the data are simplified by the fact that a table's rows are unordered. The intersection of a row and a column contains individual data items called values. Values are always atomic; that is, each position in a table may contain only one datum.
Data manipulation is provided by a set of algebraic or calculus operators.
Data integrity is provided by two rules. Entity integrity requires that every value in the primary key of a table must be a unique, non-null data value. Referential integrity requires that every value in a foreign key must either equal some value of its related primary key, or it must be NULL.
Design of a Relational Database
One of the main advantages of the relational model of database design is that it is based on a foundation of formal mathematical theory that allows its concepts to be defined and examined with great precision.
Remember that a relation is (essentially) a two-dimensional table consisting of horizontal rows and vertical columns. The advantage of this form is that almost everyone is familiar with data presented as a simple table. The relational model stipulates that no two rows in the table (relation) may be identical; there must be some combination of columns, called a key, whose values will uniquely identify each row.
As an example of the model's capabilities, we will design a structure for a personnel database which contains the following information: the name of each employee, the programming languages the employee is familiar with, the number of years the employee has used each language, the employee's title, the employee's length of service with the company, the employee's hourly rate of pay, the current projects to which the employee is assigned, and the manager of each of these projects.
The following PERSONNEL_1 table shown below shows a sample of the data that
might be stored in such a database, along with one possible structure.
PERSONNEL_1
| NAME | LANG | YRS_USE | TITLE | YRS_EXP | PAY | PROJECT | MGR |
|---|---|---|---|---|---|---|---|
| Marvin | Cobol | 3 | Sr. Prog. | 4 | 25.00 | Payroll | Smith |
| Fortran | 2 | A/R | Jones | ||||
| Brown | Cobol | 2 | Sr. Prog. | 3 | 24.00 | Inventory | Norman |
| Basic | 1 | ||||||
| Ada | 3 | ||||||
| Norman | Cobol | 4 | Prj. Mgr. | 2 | 35.00 | Inventory | Norman |
| SQL | 2 | ||||||
| James | SQL | 1 | Sys. Ana. | 2 | 29.00 | A/R | Jones |
| Pascal | 3 | Datcomm | Harvey | ||||
| Jones | Cobol | 1 | Prj. Mgr. | 8 | 42.00 | A/R | Jones |
| Pascal | 5 | ||||||
| SQL | 2 | ||||||
| Basic | 9 |
Each employee in this table has a unique name, so NAME may
be used as the table's key. In practice, of course, there may be more than one
way to construct a key - social security numbers or employee numbers are other
values that might be used to uniquely identify an employee. Using this
structure, any request of the form, "Tell me something about employee E", is
easily answered. However, it isn't as simple to respond to such requests as
"Which employees can use language L?"; "Who is the manager of project P?";
"Display all employees assigned to project P"; or "Change the manager of
project P to employee E". But through a process known as normalization, the
organization of the data in PERSONNEL_1 can be changed so
that it can be used more flexibly.
The first step in normalizing the design is based on the relational rule that
each column of a table may take on only a single, non-repeating (atomic) value
for each row of the table. Looking at PERSONNEL_1, it's
easy to see that the columns LANG,
YRS_USE, PROJECT and
MGR violate this rule, because an employee may know more
than one programming language and may be assigned to more than one project at a
time. By duplicating the non-repeating values of NAME,
TITLE, YRS_EXP, and
PAY for each combination of values for the repeating groups,
the entire table can be represented in first normal form. The
PERSONNEL_2 table shown below is in first normal form.
PERSONNEL_2
| NAME | LANG | YRS_USE | TITLE | YRS_EXP | PAY | PROJECT | MGR |
|---|---|---|---|---|---|---|---|
| Marvin | Cobol | 3 | Sr. Prog. | 4 | 25.00 | Payroll | Smith |
| Marvin | Fortran | 2 | Sr. Prog. | 4 | 25.00 | A/R | Jones |
| Brown | Cobol | 2 | Sr. Prog. | 3 | 24.00 | Inventory | Norman |
| Brown | Basic | 1 | Sr. Prog. | 3 | 24.00 | Inventory | Norman |
| Brown | Ada | 3 | Sr. Prog. | 3 | 24.00 | Inventory | Norman |
| Norman | Cobol | 4 | Prj. Mgr. | 2 | 35.00 | Inventory | Norman |
| Norman | SQL | 2 | Prj. Mgr. | 2 | 35.00 | Inventory | Norman |
| James | SQL | 1 | Sys. Ani. | 2 | 29.00 | A/R | Jones |
| James | Pascal | 3 | Sys. Ani. | 2 | 29.00 | Datcomm | Harvey |
| Jones | Cobol | 1 | Prj. Mgr. | 8 | 42.00 | A/R | Jones |
| Jones | Pascal | 5 | Prj. Mgr. | 8 | 42.00 | A/R | Jones |
| Jones | SQL | 2 | Prj. Mgr. | 8 | 42.00 | A/R | Jones |
| Jones | Basic | 9 | Prj. Mgr. | 8 | 42.00 | A/R | Jones |
We can now see that NAME is no longer sufficient to uniquely
identify a row of PERSONNEL_2, because multiple rows may be
present for an employee who knows more than one language or is assigned to more
than one project. One solution is to create a new key from a combination of
columns. NAME, LANG, and
PROJECT combined would be such a key, as those three values
together uniquely identify a single row.
PERSONNEL_2 appears to be a step backward in our design. Not
only does it require more space to present the data, but responding to requests
such as, "Change employee E's title to T"; "Add the assignment of employee E to
project P"; or "Make employee E the manager of project P" is now more
difficult. This problem is addressed by the remaining normalization steps,
which are based on the concept of dependence and the relational rule that in
every row of a table, each column must be dependent on every part of the key.
If, for each row, the value of a column C1 uniquely
determines the value of a column C2, then
C2 is functionally dependent on C1. If
the value in C1 limits the possible values in
C2 to a specific set, then C2 is set
dependent on C1. For example, because each employee has only
one title, we may say that NAME determines
TITLE and that TITLE is functionally
dependent on NAME. PROJECT is set
dependent on NAME, since each employee is assigned to a
specific set of projects.
The columns TITLE, YRS_EXP and
PAY are not dependent on the entire key
(NAME, LANG, PROJECT)
of PERSONNEL_2 -- they are dependent on
NAME alone. To solve this, we must create a new table
containing only NAME, TITLE,
YRS_EXP, and PAY. The key for this table,
called EMPLOYEES, will be NAME. Of the
remaining columns, YRS_USE is determined by both
NAME and LANG and therefore cannot be
part of EMPLOYEES. Another table, called
LANGUAGES, must be formed by these three columns.
LANGUAGES will have a key formed by a combination of the
columns NAME and LANG. Because the table
contains the NAME column as well, it is still possible to
associate an employee's language experience with his employee data.
Splitting a table in this way prevents the experience part of the database from
having columns that are dependent on only part of the table's key. A first
normal form relation (atomic values in each portion of the table) that also has
no partial key dependence is said to be in second normal form. The following
tables, EMPLOYEES and LANGUAGES, are in
second normal form.
EMPLOYEES
| NAME | TITLE | YRS_EXP | PAY |
|---|---|---|---|
| Marvin | Sr. Prog. | 4 | 25.00 |
| Brown | Sr. Prog. | 3 | 24.00 |
| Norman | Prj. Mgr. | 2 | 35.00 |
| James | Sys. Ani. | 2 | 29.00 |
| Jones | Prj. Mgr. | 8 | 42.00 |
LANGUAGES
| NAME | LANG | YRS_USE |
|---|---|---|
| Marvin | Cobol | 3 |
| Marvin | Fortran | 2 |
| Brown | Cobol | 2 |
| Brown | Basic | 1 |
| Brown | Ada | 3 |
| Norman | Cobol | 4 |
| Norman | SQL | 2 |
| James | SQL | 1 |
| James | Pascal | 3 |
| Jones | Cobol | 1 |
| Jones | Pascal | 5 |
| Jones | SQL | 2 |
| Jones | Basic | 9 |
The situation with project assignments is slightly different. We have already
noted that an employee name determines the set of projects on which that
employee works. This is independent of the languages used by the employee. This
means that a table containing PROJECT should not have LANG in its
key. However, the project name uniquely determines the project manager.
MGR}}} is transitively dependent on {{{NAME}}}, because {{{NAME
determines a set of values for PROJECT, and PROJECT functionally
determines MGR. To complete our design, we should remove any transitive
dependencies, according to the relational rule that in every row of a table,
all columns must depend directly on the key, without any transitive
dependencies through other columns. A second normal form relation that has no
transitive dependence is said to be in third normal form.
Because each project has only one manager, we can form a PROJECTS table
with the columns PROJECT and MGR. PROJECTS' key will be
PROJECT. Note that MGR could also be a key, if each employee
managed only one project. Finally, since each employee works on one or more
projects, we will create a fourth table, called ASSIGNMENTS, using the
columns NAME and PROJECT. This table forms the association between
the EMPLOYEES and PROJECTS tables and is "all key" - i.e., it has
no additional dependent columns, because the only thing dependent on both
NAME and PROJECT is the fact that they are associated. Here are the
third normal form tables.
PROJECTS
| PROJECT | MGR |
|---|---|
| Payroll | Smith |
| A/R | Jones |
| Inventory | Norman |
| Datcomm | Harvey |
ASSIGNMENTS
| NAME | PROJECT |
|---|---|
| Marvin | Payroll |
| Marvin | A/R |
| Brown | Inventory |
| Norman | Inventory |
| James | A/R |
| James | Datcomm |
| Jones | A/R |
At this point, our design is complete. All tables are in third normal form, and requests such as those listed earlier can easily be dealt with.
Here are some tips for good database design:
- Don't use an existing database as the basis for a new database structure -- you don't want to inadvertently duplicate awkward or inconsistent table definitions.
- Make sure that each table represents just one subject -- that is, either one object or one event. This avoids unnecessary duplication of data.
- Define a primary key for every table -- not only will it uniquely identify a row value, you'll use it to join tables. A primary key should have these characteristics: its value must be unique and "known not nullable", and its value should consist of the minimum number of columns to guarantee uniqueness.
- Don't define any multi-value columns -- that is, don't
use the
ROWorARRAY<data type>s. - Implement data integrity; define unique keys and foreign keys for your tables.
Note:
Portions of the text in this entry are Copyright © 1999 by Ocelot Computer Services Incorporated. Used by permission.