SQL Odds and Ends

In this section we briefly discuss other important SQL-related concepts that we don't have time to explore more fully.

Reading

Reading questions

Notes

Indexes

An index is essentially an extra structure that is stored along with your database table, and facilitates certain searches. Each index is essentially tailor-made for a particular kind of search that uses a certain set of columns.

An index effectively maintains an efficient arrangement of the rows according to their values in a certain column. Think of it similar to the index at the back of a book: It offers you quick access to occurences of important words in the text. Similarly, an index on the variable "age" would for instance offer you quick access to the rows corresponding to students of a particular age, rather than having to go through all rows to find them.

This power comes at a cost: Every time you want to add a new entry or change an existing entry, the corresponding index needs to change to accomodate this new value. As a result, indexes make the INSERT, DELETE and UPDATE queries slower, in order to make certain SELECT queries faster.

As a concrete example, consider the following table:

CREATE TABLE `dummy_vals` (
  `id`         bigint(20) NOT NULL      AUTO_INCREMENT,
  `created`    timestamp  NULL          DEFAULT CURRENT_TIMESTAMP,
  `anint`      INT(11),
  PRIMARY KEY (id)
);

This table has an auto-generated key called id, and it is also a primary key. Any primary key is automatically part of an index. This means that any queries that address the table via the id will be very efficient. In order to work with the table, we will add a fairly large number of data to it. This uses a procedure, and you don't need to understand all the details:

DELIMITER $$
CREATE PROCEDURE generate_data()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 100000 DO
    INSERT INTO `dummy_vals` (`datetime`,`anint`) VALUES (
      FROM_UNIXTIME(UNIX_TIMESTAMP('2014-01-01 01:00:00')+FLOOR(RAND()*31536000)),
      ROUND(RAND()*50,2)
    );
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;

CALL generate_data();

This generates 50000 "dummy" entries. As we say, these are by default indexed based on the id. So we could do quite efficiently a query like the following:

SELECT SQL_NO_CACHE *
FROM dummy_vals
WHERE id > 25000 AND id < 29000
ORDER BY id;

This takes a mere 0.004-0.007 seconds to execute. But a query on the integer field, matching about the same number of cases, would be slower:

SELECT SQL_NO_CACHE *
FROM dummy_vals
WHERE anint > 250 and anint < 270
ORDER BY anint;

This takes 0.03-0.5 seconds to execute, about 7-8 times slower. If we were to add an index to it, like so:

ALTER TABLE `dummy_vals`
ADD INDEX `anint_index` (anint);

Then the same anint-based query as above takes only 0.005-0.008 seconds, very close to the primary key-based query.

Indexing on multiple columns only helps if you use all the columns, or at least the first few columns. For example, if we take out that last index and add a new one:

ALTER TABLE `dummy_vals`
DROP INDEX `anint_index`,
ADD INDEX `combo_index` (created, anint);

Then the anint-based query is again slow, because this index requires a created clause before it can be used. Reversing the order helps:

ALTER TABLE `dummy_vals`
DROP INDEX `combo_index`,
ADD INDEX `combo_index2` (anint, created);

Now the anint-based query is again fast.

Some takeaways:

  1. Indexes may speed up SELECT queries.
  2. Indexes take up additional storage.
  3. Indexes slow down CREATE, UPDATE and DELETE queries, as they need to be updated along with the database.
  4. Having many indexes may slightly slow down a SELECT query as the engine has to choose which index to use for the specific query.
  5. Indexes typically benefit queries that would return a relatively small number of hits compared to the table size. If almost all values are to be returned, an index won't help much.

Views

Views are a bit like "virtual tables". A view is associated with a SELECT query, and it effectively stores the query for future use. In the future you can directly try to access the view as if it was a table, without needing to explicitly run the query again. There are two main uses of this:

  1. For frequently used queries that bring many tables together.
  2. For computations relying on existing data. For instance computing the number of credits and gpa that a student has based on their course records. We could have this dynamically computed with a view rather than updating the student table every time a new grade is inserted.
  3. When we want to restrict access. For instance we might need to provide someone access to student names and addresses, but not other more sensitive information stored in their account, like their GPA, number of credits, account holds etc. We can create a view that only contains specific columns from a table, and then give select users view access to this view, but not the table it came from.

Read more about views in the links above.

ORM

We now turn to an important topic that is more higher-level, namely Object-Relational Mapping. The idea is simple. Assume we are working in an object-oriented language like Python.