Databases are everywhere today. Most people, whether they are aware of it or not, are using databases in their lives daily—some are using it every hour, every minute of the day. Modern computing relies heavily on databases and we live in an era of “Big Data” and “Cloud” computing more than ever. It’s clear that understanding databases is quite important to those of us in the digital industry.
With at all that in mind, I’m reviewing a Stanford/Coursera course, “Introduction to Databases,” by Jennifer Widom. I thought I’d summarize some notes as I go along.
1. Introduction
Database Management Systems (DBMS)
All Database Management Systems (DBMS) require certain features in designing them. The following are some of the key feature requirements of database design:
- Massive and scalable — should be able to handle huge volumes of data — terrabytes or more, sometimes per day.
- Persistent - the data must persist independent of the program and its relationship to memory
- Safety — to ensure that the data remains safe from any point of failure, such as hardware, software, power, users or hackers
- Multi-user — must be able to control for concurrency of users manipulating data
- Convenient — a database requires (a) physical data independence (the structure of the database should not affect the program usability) and often (b) the queryig of a database utilizes high-level query languages that are declarative.
- Efficient — it must be able to handle thousands/queries per second. In other words, performance, performance, performance
- Reliable — it must be able to work 99.9999% of the time.
That’s a fairly convincing feature list for strong database systems design.
However, not all data-intensive applications and programs will use…databases. Instead, you can file-based data storage, including things like Excel files. Hadoop, for example, is a processing framework for running operations to run on files.
Key concepts of DBMS designs
Every core DBMS utilizes certain core concepts and models to drive the design. The following are these essential concepts:
- Data models — a description of how the data is structured. The general form of the data stored in the DB. The most common one is the relational data model. Another model is the XML document, not a set of records, but as a hierarchical structure of data values. Or a graph data model, in terms of nodes and edges.
- Schema vs. data — Kind of like types and variables in programming languages. A type-token relationship structure.
- Data definition language (DDL) — Schemas are set up with a DDL for a database
- Data manipulation or query language (DML) — A data manipulation language queries and modifies the database
Key people involved
Of course, no database can be built without someone building it. These are some of the major roles involved in the DBMS industry, from designing to maintaining databases:
- DBMS implementer — person that builds the database system
- Database designer — person who establishes the schema for a database for any given application
- Database application developer — an interface between the user and the database. Many programs can interface with a database.
- Database administrator — maintains and keeps the DB running smoothly. Require tuning parameters to keep optimum performance. Highly valued personnel.
The focus of the course is primarily with database designers and application designers.
2. The Relational Model (RM)
The relational model is more than 35 years old and is the foundation of most DBMS, including commercial DBMS. The RM model is quite possibly one of the most ubiquitous today. It’s an incredibly simple model — which is one of its strengths.
A relational model also relies on high-level query languages that are clear, expressive and declarative. Their implementations tend to be efficient as well.
Basic constructs of the RM
The following are certain key costructs of a relational model of database design:
- A database is a set of named relations (or typically called tables)
- Each relation (or table set) has its own set of named atributes (or columns)
- The actual data is stored in what is called tuples (or the rows)
- Each attribute has has a type (or domain) — e.g., an attribute could be a type of string, an integer, float, binary, enumerated domain, etc. Types can be atomic or structured (more complex types of values in data attributes).
- Schema — the structural description of relations (for example a relational table of Students with set attributes, usually set ahead of time)
- Instance — actual contents at a given point that fill into a schema, numbering thousands as it grows
- Null values — unknown or undefined values. Can be problematic if null values are queried.
- Key — an attribute whose value is unique in every tuple/row. Non-unique attribute values are not keys. An attribute of a college may not be unique (there may be colleges, for example, with the same name). A numeric index may, however, be very unique.
Example of creating RM relations and tables in SQL:
Create Table "Student"(ID, name, GPA, photo)
Create Table "College"(name string, state char(2),enrollment integer)
The RM thus allows for a high-level declarative query language.
Querying a relational database
In order to query or create entries in a relational database certain basic architectural steps are taken:
- Design a schema (attributes/columns) and create it using DDL (see above)
- “Bulk load” the initial data to fill in the tuples/rows
- Finally, execute queries on the DB and retrieve results. Or update the data itself.
Relational databases allow for high-level ad-hoc queris. In all relational query languages, when you ask a query over a set of relations, you get a whole new query object result. This feature of query languages is called a closure of the object. Another query can be called on this returned object — this is what is known as compositionality.
Two Basic Relational Query languages
- Relational algebra — a formal language
- SQL — an actual, implementation of a relational algebra
These are two basic forms of how the sematics of a query language is defined.