brianhigh edited this page Apr 29, 2015 · 22 revisions

Databases

Research produces data and other information that need to be stored for future review or processing. Databases are one solution, offering an integrated system for storing and managing information. There are a variety of database systems on the market, each with its own pros and cons.

Databases come in all shapes and sizes. In general, there are three basic types: the flat file, which is much like a spreadsheet; the relational database, which is like a series of spreadsheets linked by common identifiers; and the family of NoSQL databases, which use various storage models for different use cases.

Much like the variety of database types, there are a variety of query languages. A query language is how you interact with the database. With it, you can insert new data, update records, etc. We’ll go over some specific languages like Structured Query Language later in the section.

  • Types

    • Flat

    • Relational

    • NoSQL (various storage models)

  • Query Languages

    • SQL

    • CQL

    • Etc.

Key Terms

Before we can discuss the details of each database type and its related query languages, we have to get some key terms out of the way. These terms will come up again later, and you will also find them in numerous database books and articles.

CRUD represents the four most basic database operations. Create refers to creating new entries, or adding new data to the database. Read, as the name implies, is retrieving data from the database. Update is changing the contents of an existing database record. Delete is the removal of a record.
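As a minimal sketch of the four CRUD operations, the following uses Python's built-in sqlite3 module with an in-memory SQLite database. The table name `samples` and its columns are made up for illustration.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, site TEXT, ph REAL)")

# Create: add a new record.
conn.execute("INSERT INTO samples (site, ph) VALUES (?, ?)", ("river", 7.2))

# Read: retrieve the record.
created = conn.execute(
    "SELECT site, ph FROM samples WHERE site = ?", ("river",)
).fetchone()

# Update: change the contents of the existing record.
conn.execute("UPDATE samples SET ph = ? WHERE site = ?", (6.9, "river"))
updated = conn.execute(
    "SELECT ph FROM samples WHERE site = ?", ("river",)
).fetchone()[0]

# Delete: remove the record.
conn.execute("DELETE FROM samples WHERE site = ?", ("river",))
remaining = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

print(created, updated, remaining)
conn.close()
```

The `?` placeholders let the database driver handle quoting of values, which is generally safer than building query strings by hand.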

ACID is a set of properties that ensure data is stored reliably. Atomicity means that a transaction either completes fully or not at all. In other words, if any part of the transaction fails, the database must be left unchanged. Consistency means that the database must remain in a valid state. If the database has any data validation rules, all data must comply with them. Isolation is important in multi-user databases. With isolation, concurrent transactions from different users must not interfere with each other, so that each user has a consistent view of the database. Durability means that once a transaction is committed, its changes have been written to disk, so that your data is safe in the event of a power failure or crash.
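Atomicity and consistency can be illustrated with a small sketch, again assuming Python's sqlite3 module and an in-memory database. The `accounts` table and the rule that a balance may not go negative are hypothetical. The transfer below fails halfway through, so the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK clause is a validation rule: a balance may never be negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'b'")
        # This step would leave 'a' at -50, which violates the CHECK rule.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass  # The failed transfer was rolled back as a unit.

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

Even though the first UPDATE succeeded on its own, neither change survives, which is the all-or-nothing behavior described above.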

Normalization is all about removing redundant data. For example, consider storing a customer’s ID with their order instead of their contact information. The order would refer to a customer’s ID number, which refers to their contact information, stored separately from the orders. Thus, multiple orders can refer to the same customer contact data record.
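The customer and order example can be sketched as follows, assuming Python's sqlite3 module. The table names, column names, and sample values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        email TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada', 'ada@example.org');
    INSERT INTO orders VALUES (10, 1, 'pipettes'), (11, 1, 'gloves');
""")

# Contact details live in one place; a JOIN attaches them to each order.
join_sql = """
    SELECT o.order_id, o.item, c.email
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
"""
rows = conn.execute(join_sql).fetchall()

# Changing the email once changes what every order refers to.
conn.execute("UPDATE customers SET email = 'ada@example.com' WHERE customer_id = 1")
emails = [row[2] for row in conn.execute(join_sql)]
print(rows, emails)
```

Because the email address is stored once, there is no risk of two orders disagreeing about a customer's contact details.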

CAP Theorem is mostly relevant to distributed NoSQL databases — databases which are spread across many servers. It states that you can’t deliver a consistent database view, ensure that every query is responded to, and tolerate a failure in the system — all at the same time. In other words, you must compromise in one of those areas, to guarantee the other two.

Database Schema is a formal description of the database structure. It defines the tables, data types, and other rules governing the database.
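To make this concrete, here is a small hypothetical schema defined through Python's sqlite3 module. The `participants` table and its rules are invented for the example. The database itself rejects rows that break the rules the schema declares.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema: one table, its columns, their types, and their rules.
conn.execute("""
    CREATE TABLE participants (
        participant_id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE,
        age INTEGER CHECK (age >= 18)
    )
""")
conn.execute("INSERT INTO participants (email, age) VALUES ('p1@example.org', 34)")

# Each candidate row breaks one rule: missing email, duplicate email, age too low.
candidates = [(None, 40), ("p1@example.org", 51), ("p2@example.org", 12)]
rejected = []
for email, age in candidates:
    try:
        conn.execute(
            "INSERT INTO participants (email, age) VALUES (?, ?)", (email, age)
        )
    except sqlite3.IntegrityError:
        rejected.append((email, age))

count = conn.execute("SELECT COUNT(*) FROM participants").fetchone()[0]
print(rejected, count)
```

How strictly column types are enforced differs between engines, so treat this as an illustration of the idea rather than of any one engine's exact behavior.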

Flat File

Flat File Model: US Department of Transportation, Public Domain

Flat file databases are extremely simple. They are a single table of values, just like a spreadsheet. The actual data format varies. There are proprietary formats like Excel, but open formats like CSV (comma-separated values) are widely used.

Due to their simple nature, they lack a number of features found in more capable database systems. For example, they lack indexing, which means that searching a large flat file could require considerably more time than searching a relational database. They also lack concurrent write access, since they aren’t server based. Finally, updates can be extremely slow, as they require rewriting the entire file to disk after making a change.

All these limitations aside, there are some benefits. Flat files can be very lightweight in resource usage, at least when they are small. For databases using CSV or similar plain-text file formats, you can even edit the database using a basic text editor.

Despite the limitations, there are situations where a flat file is a reasonable choice — for example, when recording sensor data from a temperature and pressure sensor. This type of data format tends to consist of timestamped rows with sensor readings, with each file named after the sensor.
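A sensor log of this kind might look like the following sketch, which uses Python's csv module. The file name, column names, and readings are hypothetical, and the file is written to a temporary directory.

```python
import csv
import os
import tempfile

# Hypothetical layout: one CSV file per sensor, named after the sensor.
path = os.path.join(tempfile.mkdtemp(), "sensor_01.csv")

with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["timestamp", "temperature_c", "pressure_kpa"])

# Recording a new reading only appends a line to the end of the file.
readings = [
    ("2015-04-29T12:00:00", 21.5, 101.3),
    ("2015-04-29T12:00:30", 21.6, 101.2),
]
for reading in readings:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(reading)

# Searching has no index to rely on, so every row is read and tested.
with open(path, newline="") as f:
    warm = [
        row["timestamp"]
        for row in csv.DictReader(f)
        if float(row["temperature_c"]) > 21.55
    ]
print(warm)
```

Appending suits a flat file well. Changing or deleting an earlier row is the costly case, because the file has to be rewritten.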

That said, it’s generally best not to go with a flat file database for storage. If at some point you find that your needs change, converting to a more capable format could become a rather daunting task.

  • Lightweight to a point

  • Limited scalability

  • No built in concurrent access

  • No indexing

  • Lacks built in consistency checking

Relational

fabFORCE.net DBDesigner4: MichaelGZinner, CC BY SA 3.0

Relational databases, as their name implies, relate data from one table to another.[1] These relationships reduce the duplication of data that is common to multiple records. For example, in an ordering system, you could store your customers within one table, and link their orders to them via a customer identification number.

Relational databases are basically the jack of all trades in the database world. They are highly flexible, enabling them to meet the widest range of use cases, such as recording survey results. But that flexibility comes at a price. You have to tell the database about the data you plan to store. In other words, a relational database requires a schema to define the structure of your data and how it relates.

Virtually all of them support ODBC or Open Database Connectivity, which is a standardized interface between an operating system or programming language and the database driver. Through ODBC, you can plug your database into a huge number of applications, such as R or even Excel.

There are a variety of relational database engines on the market, some open source, others proprietary. The most common open source database is MySQL, which is now owned by Oracle. For those with modest needs, there is SQLite, which is commonly embedded into other applications like Firefox and Android. Microsoft also has an SQL offering, in the form of Microsoft SQL Server.


Example 1. Does climate impact human health?

A database can be used to store any kind of data. One group had the challenge of analyzing a large volume of climate data to compare with health records. The health care data was location coded, in a similar fashion to the climate data that was being used. Through the use of MySQL, they were able to relate health care records by location with the climate data, which enabled them to query the data and look for patterns.


So, which should you choose? My advice is to pick an open source option, like MySQL or PostgreSQL. They have low minimum resource requirements, making them easy to set up on a laptop for testing purposes. Documentation is widely available. More importantly, their free nature means that a lot of folks have taken the time to integrate them into a huge array of tools.

That said, different engines make sacrifices to make gains in other areas. This commonly occurs in the area of ACID compliance. A database may forgo strict data validation, or consistent disk writing to speed up query response time. If you value your data over speed, be sure to pick an ACID compliant database.

In short, if you aren’t sure what kind of database to use, an ACID compliant relational database is a good start. It’s typically easier to migrate from a relational database to other types, as opposed to converting to one.

  • General purpose

  • Wide application support through ODBC

  • Less duplicate data

NoSQL

NoSQL, or "Not Only SQL", databases serve a variety of niches. Each implementation is geared towards a limited set of use cases. In other words, they don’t try to be general purpose databases. NoSQL databases are most commonly associated with Big Data and High Performance Computing. By Big Data, we’re talking about databases in the multi-terabyte and larger range. Not only are they big, they generally support high throughput (data access speeds), either serving thousands of concurrent requests or running a query across a few billion rows of data in a short period of time.

Within the NoSQL arena, there are a lot of options. Two commonly used ones are Cassandra and MongoDB. Apache’s Cassandra functions similarly to a traditional database. It has tables, which consist of rows and columns. However, the way you query across multiple tables is a bit different from a relational database. MongoDB, on the other hand, is known as a document store. A document is an object containing a series of values, where a value could be anything from a simple name to an entire PDF. In addition, documents can be related to each other through their unique IDs.
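The document idea can be sketched with plain Python dictionaries and the json module. This is only an illustration of the data model, not MongoDB's actual API, and the field names and IDs are invented.

```python
import json

# Two "documents": each is a self-contained object with its own set of fields.
documents = {
    "study-1": {"_id": "study-1", "title": "Air quality survey", "tags": ["pm2.5", "urban"]},
    "report-7": {"_id": "report-7", "study_id": "study-1", "pages": 12},
}

# Unlike rows in a table, documents need not share the same fields.
fields = {doc_id: sorted(doc) for doc_id, doc in documents.items()}

# One document refers to another through its unique ID.
report = documents["report-7"]
parent_title = documents[report["study_id"]]["title"]

# Documents are commonly exchanged in a JSON-like form.
serialized = json.dumps(report, sort_keys=True)
print(fields, parent_title, serialized)
```

Following a reference here takes a second lookup by ID, which hints at why querying across related data differs from a relational join.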

Much like there are a number of NoSQL databases, each excelling in their own ways, there are a variety of storage models. A storage model is how the information is actually stored within the database. Each storage method is best suited to select "use cases", or applications. You’ll want to consult the documentation for each database to see what it’s most suited for. You may also want to look to sites like BioStar to see what others in your field are using to solve their problems.

With all this variation in data storage methods, there has to be a catch. While many relational databases offer ACID compliance, NoSQL databases rarely do. They make many sacrifices in order to scale massively. The guiding rule with NoSQL is the CAP Theorem, which basically says you can’t ensure that every server in a cluster shows the exact same data, can handle every query, and also provide access to all information in the event of a server crash.


Example 2. Collecting data from remote sensors.

While NoSQL implies Big Data, it doesn’t require it. NoSQL databases lend themselves to solving a variety of problems. One group was tasked with performing remote data collection using Internet connected sensors. Each of these sensors sends data back in 30 second intervals to KairosDB. KairosDB is a time-series database, designed for tracking sensor data over large periods of time, and from a large number of sensors. Being a purpose specific database, it’s easily integrated with tools used for trend analysis.


So, which NoSQL option should you go with? That isn’t an easy question to answer. Since each database system varies from the next, you have to closely evaluate your needs. You need to consider what questions you are asking of your data and also factor in the nature of your data. For example, is it a huge table of values, or millions of documents? Finally, look at the other tools you are using, as they may include support for only one or two NoSQL options. In that case, you must either accept those choices or reconsider the other tools in your data system.

Query Languages

A database isn’t very useful if you don’t have a way to enter and retrieve data from it. A query language enables you to enter data, select data based on criteria, change existing data, and even remove it.

The most prevalent query language is SQL, or Structured Query Language, which is a standardized language implemented by most relational databases. However, it’s not alone in that arena. Some databases such as Oracle and PostgreSQL have their own add-ons called Procedural Languages which enable you to leverage scripting languages like Perl within the database engine itself.

Outside of the relational database world, the query languages differ from one database engine to the next. For example, Apache’s Cassandra offers Cassandra Query Language which has an SQL inspired syntax.

Structured Query Language

SQL or Structured Query Language is the most widely used database query language. With SQL, you can query for data, create new databases, etc. At the core, SQL is an ANSI standardized language, enabling multiple database vendors to implement it. That said, SQL itself doesn’t meet everyone’s needs, resulting in proprietary extensions to it.

When possible, it’s best to avoid the use of proprietary extensions. Otherwise, should you need to change your database back-end in the future, for example switching from MySQL to PostgreSQL, you may find yourself rewriting a lot of your queries and adjusting your database schema.

SQL Statement Anatomy: Ferdna, CC BY SA 3.0

So, what does SQL look like? It has a lot of commands, though there are four basic commands which you’ll want to get most comfortable with: SELECT, which is used to select or retrieve data from the database; INSERT, which, as the name states, is used to insert new data; UPDATE, which is used to change values within the database; and DELETE, which removes records from the database.
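Since SELECT is the command you will likely use most, here is a short sketch of its common clauses, run through Python's sqlite3 module. The `readings` table and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.0), ("a", 22.0), ("b", 18.5), ("b", 19.5), ("c", 30.0)],
)

# SELECT <columns> FROM <table> WHERE <filter> GROUP BY <key> ORDER BY <sort>
query = """
    SELECT sensor, AVG(temperature) AS mean_temp
    FROM readings
    WHERE temperature < 25
    GROUP BY sensor
    ORDER BY mean_temp DESC
"""
result = conn.execute(query).fetchall()
print(result)
```

The WHERE clause drops the 30.0 reading before grouping, so sensor "c" does not appear in the result at all.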

  • ANSI Standard

  • Proprietary Extensions

  • Widely Supported

Other Query Languages

Among the numerous non-standard query languages, here are a few examples. Apache’s Cassandra implements their Cassandra Query Language, which is SQL-inspired. Oracle, Microsoft, and others offer Procedural Languages which embed some kind of scripting language within the database. Procedural Languages enable you to perform complex operations, using things like loops and arrays within the database engine itself. By embedding the language into the engine, you gain performance from not having to transmit data across the network, as well as gaining capabilities like having some code executed when triggered by a database insert or update.
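Code that runs when an update occurs can be sketched with a SQLite trigger, driven here from Python's sqlite3 module. This stands in for the idea only and is not PL/SQL or Transact-SQL. The table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (id INTEGER PRIMARY KEY, ph REAL);
    CREATE TABLE audit_log (sample_id INTEGER, old_ph REAL, new_ph REAL);

    -- Runs inside the database engine each time a ph value is updated.
    CREATE TRIGGER log_ph_change AFTER UPDATE OF ph ON samples
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, OLD.ph, NEW.ph);
    END;

    INSERT INTO samples VALUES (1, 7.0);
""")

# The application only issues the UPDATE; the engine writes the audit row.
conn.execute("UPDATE samples SET ph = 6.5 WHERE id = 1")
log = conn.execute("SELECT sample_id, old_ph, new_ph FROM audit_log").fetchall()
print(log)
```

The audit row is written by the engine itself, so no extra round trip between the application and the database is needed.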

In short, it’s best to use a standardized language like SQL whenever possible. However, you shouldn’t hinder yourself when the database engine offers an option that will improve performance or reduce data processing time.

  • CQL: Cassandra

  • PL/SQL: Oracle

  • Transact-SQL: Microsoft SQL Server

Summary

We’ve gone over quite a bit of basics on databases. So, let’s review the key points to remember.

CRUD (or Create, Read, Update, Delete) are the basic operations performed on a database. In SQL these operations are represented with INSERT, SELECT, UPDATE, and DELETE.

ACID (or Atomicity, Consistency, Isolation, Durability) compliance controls how data is stored. It ensures that all transactions are atomic, i.e. all or nothing; that stored data is in full compliance with any validation rules; that concurrent users do not interfere with each other; and that database changes are really stored on disk.

Normalization is the process of reducing duplication in a database by relating records between tables, instead of repeating the same information within several tables.

A schema is the design and specifications for the database. It defines tables, content types, etc.

The CAP Theorem applies mostly to NoSQL. It states that all database servers in a group can’t show the exact same data, respond to every query, and tolerate a node failure without compromising in one of those areas.

We covered the three basic types of databases: flat, relational, and NoSQL. Flat files are generally frowned upon for database use unless you have specific reasons for using them. NoSQL is typically only required when working with big data. Relational is the most general purpose database type you can use.

We’ve also covered ODBC, or Open Database Connectivity, which is a standardized scheme for connecting applications and languages with databases. In the Java world, the equivalent is JDBC, or Java Database Connectivity. Regardless of language, these interfaces are your friend.

Finally, we covered some query languages. SQL is the most common query language, which makes it a valuable tool to know. If you plan to stick with a single database engine for a lot of projects, it may benefit you to become familiar with its specific language. For instance, if you work with Oracle a lot, you’ll probably want to know PL/SQL.


1. The term "relation" in database terminology actually refers to what is commonly called the "table", not the link ("relationship") between tables. See: http://en.wikipedia.org/wiki/Relation_%28database%29