-
Notifications
You must be signed in to change notification settings - Fork 339
Comparison of SQL and SPL:Non equi grouping
The equi-grouping operations divides a data table according to same field values or expression results, while there are certain common grouping operations use special grouping conditions. Grouping customers according to the order of Beijing, Shanghai, Guangzhou and Shenzhen and calculate total sales amount in each group, and divide families into wealthy, well-off, comfortable, and poor according to the yearly income. This essay will explain the solutions and basic principles of SQL and SPL, the two commonly used programming languages, in dealing with this type of grouping scenarios, and find the simpler and more efficient way through sample programs. Looking ${article} for details.
A grouping operation is called equi-grouping if the grouping condition is same values or expression results. Equi-grouping operations are the most seen and they divide a data set completely. Their features are as follows:
(1) No empty subset in the grouping result set;
(2) Each member in the original set belongs to one and only one subset.
Does there exist the non-equi grouping and incomplete division? The answer is yes. Their features are as follows:
(1) The grouping result set contains at least one empty subset;
(2) The grouping result set does not contain all members of the original set;
(3) One member may belong to more than one subset.
An alignment grouping compares a specific field or expression of each member in the to-be-grouped set with every member in a predefined base set and puts members matching same member in the base set in the same group. The number of groups in the result set is the same as that of members in the base set. The alignment grouping may generate empty group or leaves one or more members in outside of any group.
【Example 1】According to the Course table and Select_Course table below, find the unselected courses according to the order in Course table.
COURSE:
ID | NAME | TEACHERID |
---|---|---|
1 | Environmental protection and sustainable development | 5 |
2 | Mental health of College Students | 1 |
3 | Matlab | 8 |
4 | Electromechanical basic practice | 7 |
5 | Introduction to modern life science | 3 |
… | … | … |
SELECT_COURSE:
ID | COURSEID | STUDENTID |
---|---|---|
1 | 6 | 59 |
2 | 6 | 43 |
3 | 5 | 52 |
4 | 5 | 44 |
5 | 5 | 37 |
… | … | … |
SQL solution:
SQL does not support the alignment grouping. We get the distinct selected courses from the Select_Course table and then select records that do not contain these selected courses. Below are the SQL queries:
select *
from COURSE
where
ID not in
(select DISTINCT COURSEID
from SELECT_COURSE)
order by rownum
SPL solution:
SPL offers A.align() function to perform alignment grouping.
A | |
---|---|
1 | =T("Course.csv") |
2 | =T("SelectCourse.csv") |
3 | =A2.align(A1:ID, COURSEID) |
4 | =A1(A3.pos@a(null)) |
A1: Import Course table.
A2: Import SelectCourse table.
A3: Use A.align() function to group SelectCourse table according to Course table’s ID field.
A4: Get the courses that no students select from Course table.
【Example 2】According to DEPARTMENT table and EMPLOYEE table below, calculate the number of employees in each department according to the order of DEPARTMENT table. Below is part of the source tables:
DEPARTMENT:
DEPT | MANAGER |
---|---|
Administration | 1 |
Finance | 4 |
HR | 5 |
Marketing | 6 |
Production | 7 |
… | … |
EMPLOYEE:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
For the alignment grouping operation, each group will contain all matching members in the to-be-grouped set. Since SQL does not directly support this type of grouping, we group EMPLOYEE table by department and count employees in each department, perform a left join between DEPARTMENT table and the result set of grouping and summarizing EMPLOYEE table, and query each department to get the number of employees. Note that the SQL JOIN operation between two result sets does not ensure that records are arranged in a specific order, so we need to sort them by row numbers in DEPARTMENT table; otherwise, the query result may not be ordered in the same way as DEPARTMENT table. Below are the SQL queries:
select
t1.DEPT, t2.COUNT
from
(select DEPT,rownum ordernum from DEPARTMENT) t1
left join
(select
DEPT, count(*) COUNT
from EMPLOYEE
group by DEPT) t2
on t1.DEPT=t2.DEPT
order by ordernum
SPL solution:
SPL offers A.align() function to handle the alignment grouping. It works with @a option to return all matching members for each group.
A | |
---|---|
1 | =T("Department.csv") |
2 | =T("Employee.csv") |
3 | =A2.align@a(A1:DEPT, DEPT) |
4 | =A3.new(DEPT, ~.len():COUNT) |
A1: Import Department table.
A2: Import Employee table.
A3: Group Employee table according to the order in Department table and return all matching members for each group.
A4: Count employees in each department.
【Example 3】According to Employee table (Below is a part of it), calculate the average salary for each state of [California, Texas, New York, Florida] and for all the other states as a whole, which are classified as “Other”.
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
In this case, the non-matching members are required to put into a new group. SQL does not have a method to do this. First, we define a set of the four specified state groups and the “Other” group, and then check which group each employee record belongs to using CASE WHEN statement. Below are SQL queries:
with cte1(ID,STATE) as
(select 1,'California' from DUAL
UNION ALL select 2,'Texas' from DUAL
UNION ALL select 3,'New York' from DUAL
UNION ALL select 4,'Florida' from DUAL
UNION ALL select 5,'OTHER' from DUAL)
select
t1.STATE, t2.AVG_SALARY
from cte1 t1
left join
(select
STATE,avg(SALARY) AVG_SALARY
from
( select
CASE WHEN
STATE IN ('California','Texas','New York','Florida')
THEN STATE
ELSE 'OTHER' END STATE,
SALARY
from EMPLOYEE)
group by STATE) t2
on t1.STATE=t2.STATE
order by t1.ID
SPL solution:
SPL has A.align() function to do this. It uses @n option to put the non-matching members into a new group.
A | |
---|---|
1 | =T("Employee.csv") |
2 | [California,Texas,New York,Florida] |
3 | =A1.align@an(A2,STATE) |
4 | =A3.new(if (#>A2.len(),"OTHER",STATE):STATE,~.avg(SALARY):AVG_SALARY) |
A1: Import Employee table.
A2: Define a constant set of regions.
A3: A.align() groups Employee table according to A2’s set of states. @a option is used to return all matching members for each group; @n option enables putting the non-matching members into a new group.
A4: Name the new group “OTHER” and calculate the average salary of each group.
SQL uses JOION to achieve an alignment grouping because it does not have a method to do it directly. As a SQL result set is unordered, we record the order of the records in the original table through their original row numbers. This makes SQL solution to an alignment grouping problem complicated. SPL offers A.align() function for specifically handling alignment grouping operations, which features concise syntax and efficient execution.
An enumeration grouping prespecifies a set of enumerated conditions, computes each conditional expression using every member of the to-be-grouped set as the parameter, and puts members that make the same condition true into the corresponding subset. The subsets in the result set correspond to the prespecified conditions one by one.
【Example 4】According to the following China urban population table, divide cities into groups and count the number in each group. Megacity: above 2 million population; Large city: between 1 to 2 million population; Major city: between 0.5 to 1 million population; Other cities. Below is part of the source table:
ID | CITY | POPULATION | PROVINCE |
---|---|---|---|
1 | Shanghai | 12286274 | Shanghai |
2 | Beijing | 9931140 | Beijing |
3 | Chongqing | 7421420 | Chongqing |
4 | Guangzhou | 7240465 | Guangdong |
5 | HongKong | 7010000 | HongKong Special Administrative Region |
… | … | … | … |
SQL solution:
SQL lacks a direct method of handling enumeration grouping operations. We segment the source table using CASE WHEN statement, then group data by comparing it with the segmentation values and perform aggregation. CASE WHEN statement put each record into the first matching grouping only. Below are SQL queries:
with cte1(ID,CITY_TYPE) as
(select 1,'Megacity ' from DUAL
UNION ALL select 2,'Large city ' from DUAL
UNION ALL select 3,'Major city ' from DUAL
UNION ALL select 4,'Other cities ' from DUAL)
select
t1.CITY_TYPE, nvl(t2.COUNT,0) COUNT
from cte1 t1
left join
(select CITY_TYPE, count(*) COUNT from
(select
case when POPULATION>2000000 then ' Megacity '
when POPULATION>1000000 then ' Large city '
when POPULATION>500000 then ' Major city '
when POPULATION<=500000 then ' Other cities '
else 'OTHER' end CITY_TYPE
from URBAN_POPULATION)
group by CITY_TYPE) t2
on t1.CITY_TYPE=t2.CITY_TYPE
order by t1.ID
SPL solution:
SPL provides A.enum() function to perform grouping according to the enumerated conditions.
A | |
---|---|
1 | =T("UrbanPopulation.txt") |
2 | [?>2000000,?>1000000,?>500000,?<=500000] |
3 | =A1.enum(A2,POPULATION) |
4 | [Megacity, Large city, Major city, Other cities] |
5 | =A3.new(A4(#):CITY_TYPE,~.count():COUNT) |
A1: Import UrbanPopulation table.
A2: Define grouping conditions.
A3: A.enum() function performs enumeration grouping on A1’s table according to A1’s conditions and, by default, puts each record in the first matching grouping only.
A4: Define a name for each group.
A5: Count cities in each group according to the enumeration grouping result.
【Example 5】Divide the following employee table into three groups according to age groups <35, <45 and other ages. Below is part of the source table:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
It is required, in this case, that the non-matching records be put into a new group. We use CASE WHEN … ELSE statement to define an “Others” group to achieve this. Below are SQL queries:
with cte1(ID,AGE_AREA) as
(select 1,'Below 35 ' from DUAL
UNION ALL select 2,'Below 45 ' from DUAL
UNION ALL select 3,'Others ' from DUAL)
select
t1.AGE_AREA, nvl(t2.AVG_SALARY,0) AVG_SALARY
from cte1 t1
left join
(select
AGE_AREA, avg(SALARY) AVG_SALARY
from
(select
case when TRUNC(months_between(sysdate, BIRTHDAY)/12)<35
then ' Below 35 '
when TRUNC(months_between(sysdate, BIRTHDAY)/12)<45
then ' Below 45 '
else ' Others ' end AGE_AREA,
SALARY
from EMPLOYEE)
group by AGE_AREA) t2
on t1.AGE_AREA=t2.AGE_AREA
order by t1.ID
SPL solution:
A.enum() function uses @n option to put the non-matching records into a new group.
A | |
---|---|
1 | =T("Employee.csv") |
2 | [?<35,?<45] |
3 | =A1.enum@n(A2, age(BIRTHDAY)) |
4 | [Below 35,Below 45, Others] |
5 | =A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY) |
A1: Import Employee table.
A2: Define grouping conditions.
A3: A.enum() function performs enumeration grouping by age groups and works with @n option to put non-matching records into a new group.
A4: Define a name for each group.
A5: Calculate the average salary in each group according to A3’s result.
【Example 6】According to the following cities’ GDB table, calculate per capita GDP for the direct-controlled municipalities, first-tier cities and second-tier cities. Groups may contain common members as Beijing is both a direct-controlled city and a first-tier city. Below is part of the source table:
ID | CITY | GDP | POPULATION |
---|---|---|---|
1 | Shanghai | 32679 | 2418 |
2 | Beijing | 30320 | 2171 |
3 | Shenzhen | 24691 | 1253 |
4 | Guangzhou | 23000 | 1450 |
5 | Chongqing | 20363 | 3372 |
… | … | … | … |
SQL solution:
In this case, one record could be put into more than one group. CASE WHEN statement can only put a record into the first matching group and does not support re-putting it to another group, so we need to calculate per capita GDP for each group and then concatenate the results. Below are SQL queries:
select
'Direct-controlled municipality ' AREA,
SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP
from CITY_GDP
where
CITY in ('Beijing','Shanghai','Tianjin','Chongqing')
UNION ALL
select
'First-tier city ' AREA,
SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP
from CITY_GDP
where
CITY in ('Beijing','Shanghai','Guangzhou','Shenzhen')
UNION ALL
select
'Second-tier city ' AREA,
SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP
from CITY_GDP
where
CITY in ('Chengdu','Hangzhou','Chongqing','Wuhan','Xian','Suzhou','Tianjin','Nanjing','Changsha','Zhengzhou','Dongguan','Qingdao','Shenyang','Ningbo','Kunming')
In the SQL query, groups are respectively queried and summarized, and their results are concatenated one by one. Each group requires a piece of SQL code. When there are many groups, it’s complicated to write and maintain the SQL query.
SPL solution:
A.enum() function uses @r function to check whether all members in a group match a condition.
A | |
---|---|
1 | =T("CityGDP.txt") |
2 | [["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0,["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0,["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"].pos(?)>0] |
3 | =A1.enum@r(A2,CITY) |
4 | [direct-controlled municipality, first-tier city, second-tier city] |
5 | =A3.new(A4(#):AREA, |
A1: Import CityGDP table.
A2: Enumerate conditions for direct-controlled city, first-tier city and second-tier ciy.
A3: A.enum() groups A1’s table according to the enumerated conditions and works with @r option to check whether all members in a group match a condition.
A4: Define a name for each group.
A5: Calculate per capita GDP in each group according to A3’s result.
SPL needs just one option to handle the re-grouping of a member in an enumeration grouping operation. No matter how many groups there are, users only need to maintain the enumerated conditions, instead of modifying the code for grouping & aggregation.
SQL is not fit to handle both alignment grouping and enumeration grouping. There are two causes for this. One is SQL does not provide a special statement or function to handle them, the other is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets, which makes it hard to handle computations that involve the fixed order. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the two types of grouping operations.
When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.
The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code