-
Notifications
You must be signed in to change notification settings - Fork 339
Simplifying JOIN syntax (JOIN Simplification and Acceleration Series 3)
As all joins involve the primary key, we can devise ways of simplifying JOIN code according to this characteristic. There are three join simplification methods.
Below are two tables:
employee table
id
name
nationality
department
department table
id
name
manager
Both tables use id field as the primary keys. The department field of employee table is a foreign key pointing to the department table. The manager field of the department table is a foreign key pointing to the employee table (because managers are also employees). This is the conventional table structures.
Now we want to find the US employees whose managers are Chinese.
SQL does this by JOINing three tables:
SELECT A.*
FROM employee A
JOIN department B ON A.department=B.id
JOIN employee C ON B.manager=C.id
WHERE A.nationality='USA' AND C.nationality='CHN'
First, FROM employee is used to obtain employee information; JOIN employee table and department table to get employees’ department information; JOIN department table and employee table to get information of managers. The employee table is involved into two JOINs and aliases are needed to distinguish the table in two JOINs, making the whole statement bloated and difficult to understand.
We can write the statement in the following way if we regard the foreign key field as the corresponding records in the dimension table:
SELECT *
FROM employee
WHERE nationality='USA' AND department.manager.nationality='CHN'
But this isn’t a standard SQL statement.
The "department.manager.nationality" in the second statement means “nationality of manager of department of the employee in the current record”. By understanding the foreign key field as corresponding records in the dimension table, the dimension table fields as amount to attributes of the foreign key field. Thus department.manager is “the manager of the department of the current employee”. Since the manager field is the foreign key in the department table, fields of records in the dimension table to which it points can be still treated as its attributes, which generates the code department.manager.nationality – “the nationality of the manager of the department of the employee”.
Obviously, the object-oriented thinking, which is called foreign key attributization, is more natural and intuitive than the way of understanding based on filtered Cartesian product. Foreign-key-based joins do not involve multiplication between two tables. The foreign key field is only used to find corresponding records in the foreign key table, without involving the Cartesian product operation having multiplication property.
As previously stipulated, the associative field in the dimension table for a foreign-key join must be the primary key. So, one foreign key value matches one record in the dimension table. This means that each department value in the employee table relates to only one record in the department table, and that each manager field value in the department table associates with only one record in the employee table. That ensures that, for each record in the employee table, department.manager.nationality is unique and can be uniquely defined.
As the SQL JOIN definition does not involve the primary key, we cannot make sure that each foreign key value of the fact table corresponds to only one record in the dimension table. It may relate to multiple records, and for each record of employee table, department.manager.nationality cannot be uniquely defined and thus becomes invalid.
The object-oriented syntax is common in high-level languages (like C language and Java), which stores data in objects. Though department field values in the employee table are displayed as numbers, they essentially represent objects. In many data tables, the primary key values do not have substantial meanings, but are only used to identify records. Similarly, the foreign key field is just for locating the matching records in its foreign key table. If the foreign key values are directly stored as objects, numbers will become unnecessary. But as SQL does not support the object storage, numbers are still needed.
There is another point you need to know. In a foreign-key-based join relationship, the fact table and the dimension table are not in an equal position. We can look up a field in a dimension table according to the fact table, but not vice versa.
The join between homo-dimension tables is simpler. Considering the following two tables:
employee table
id
name
salary
…
manager table
id
allowance
…
Both tables use id field as their primary keys. Managers are also employees, so the two tables share the ids. Since managers have more attributes, their information is stored in a separate table.
Now we want to find the total income (including the allowance) of each employee (including every manager).
A JOIN operation is necessary for SQL to do it:
SELECT employee.id, employee.name, employy.salary+manager.allowance
FROM employee
LEFT JOIN manager ON employee.id=manager.id
But for two tables having a one-to-one relationship, we can treat them like one table:
SELECT id,name,salary+allowance
FROM employee
According to the stipulation, homo-dimension tables are JOINed according to the primary keys. Records with same primary key values correspond exclusively to each other. The expression salary+allowance is uniquely computed over each record of the employee table, having no possibility of causing ambiguity. We call this simplification method Interconnection of homo-dimension tables.
Homo-dimension tables are equal. Each one can refer a field of the other.
A typical example of the primary and sub tables is the orders table and the order detail table, like the following two tables:
Orders table
id
customer
date
…
OrderDetail table
id
no
product
price
…
The Orders table’s primary key is id field. The OrderDetail table’s primary key is made up of id and no fields. The former is a part of the latter.
We want to know the total amount of every order.
Below is the SQL:
SELECT Orders.id, Orders.customer, SUM(OrderDetail.price)
FROM Orders
JOIN OrderDetail ON Orders.id=OrderDetail.id
GROUP BY Orders.id, Orders.customer
SQL needs a GROUP BY to reduce the number of records with same ids produced by JOIN operation.
If we treat records of the sub table OrderDetail that match the primary table’ primary key as a field of the latter, the JOIN and GROUP BY won’t be necessary:
SELECT id, customer, OrderDetail.SUM(price)
FROM Orders
Unlike a familiar field, values of OrderDetail field are sets when certain records of the table are considered a field of the Orders table because the relationship between a primary table and its sub table is one-to-many. Here an aggregation is performed over each set type value to get a single value. This simplification method is called sub table set-lization.
This perspective on primary and sub table association makes the query easy to write and understand, as well as less error prone.
Suppose the Orders table has another sub table that records payment information:
OrderPayment table
id
date
amount
…
We want to find the orders that have not yet been fully paid, or whose accumulated payments are less than the total amount.
We shouldn’t simply JOIN the three tables. A many-to-many relationship will occur between the OrderDetail table and the OrderPayment table and the result will be wrong (just think about the high probability of error occurrence by performing a many-to-many join mentioned in the previous essay). The right way is to GROUP the two tables separately and JOIN the grouped results with Orders Table. The query will include a subquery:
SELECT Orders.id, Orders.customer,A.x,B.y
FROM Orders
LEFT JOIN (SELECT id,SUM(price) x FROM OrderDetail GROUP BY id) A
ON Orders.id=A.id
LEFT JOIN (SELECT id,SUM(amount) y FROM OrderPayment GROUP BY id ) B
ON Orders.id=B.id
WHERE A.x>B.y
If we treat each sub table as a field of the primary table, the query will be simple and easy:
SELECT id,customer,OrderDetail.SUM(price) x,OrderPayment.SUM(amount) y
FROM Orders
WHERE x>y
This way, a many-to-many relationship error can be avoided.
The primary table and its sub table are not equal. But a two-way reference is useful. In the above we talked about the case of referencing records of the sub table in the primary table. The reference in an opposite direction is similar to that from a foreign key table.
By treating a multi-table association operation as a relatively complex single table operation, we abandon the Cartesian product to look at JOIN operations from a different perspective. The new approach eliminates associative actions from the most common equi-join operations and even the JOIN key word from the syntax, creating simple and easy to understand queries.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code