-
Notifications
You must be signed in to change notification settings - Fork 339
Comparison of SQL and SPL:Select Operation
【Abstract】
The select operation gets one or members from a set according to a specific condition. This essay explains the handling of select operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you.
The select operation is so familiar for users who have had any experience with SQL or SPL. It seems similar to the locate operation. The latter cares about the position of one or more members in a specific set whole the select operation focuses on specific information of one or more members. Cases include getting names of customers whose sales amounts are above 10,000 and finding the name of student who has the highest total score.
A common select operation is to get one or more members that meet a specified condition. Both SQL and SQL can handle such a select operation well. SQL uses WHERE sub statement to achieve the conditional filtering while SQL offers A.select() function to do that. Let’s look at how the two languages implement selection operations through some examples.
【Example 1】Get full names and salaries of employees in R&D department in the New York state. Below is part of the source data:
ID | NAME | SURNAME | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | Moore | California | R&D | 7000 |
2 | Ashley | Wilson | New York | Finance | 11000 |
3 | Rachel | Johnson | New Mexico | Sales | 9000 |
4 | Emily | Smith | Texas | HR | 7000 |
5 | Ashley | Smith | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
select
Concat(Concat(NAME,' '),SURNAME) FULLNAME, SALARY
from
EMPLOYEE
where
STATE='New York' and DEPT='R&D'
SQL WHERE sub statement performs a conditional filtering according to the condition that state is New York and department is R&D. The keyword "and" is used when both or all conditions need to be met. With the Oracle database, we use Concat function to join up strings together when returning the full name.
SPL solution:
A | |
---|---|
1 | =T("Employee.csv") |
2 | =A1.select(STATE=="New York"&&DEPT=="R&D") |
3 | =A2.new(NAME+" "+SURNAME:FULLNAME, SALARY) |
A1: Import Employee table from the source file.
A2: A.select() selects records of employees of R&D department in the New York state.
A3: Use the plus sign “+” to join strings together to get the full name.
SPL uses similar relational operators and logical operators to those high-level languages, such as Java and C, use. It uses two equal signs "==" to compare whether two operands are equal, and two and sings "&&" if two or more conditions need to be satisfied.
SPL supports retrieving a data table from the database. Suppose the above source data is stored in table "tbl" in database "db", A1 in the above script can be modified as follows:
A | |
---|---|
1 | =connect("db").query("select * from tbl") |
【Example 2】Count employees who are below 30 years old and above 50 years old in each department. Below is part of the source data:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
select
DEPT, COUNT(*) COUNT
from
EMPLOYEE
where
TRUNC(months_between(sysdate, BIRTHDAY)/12)<30 or TRUNC(months_between(sysdate, BIRTHDAY)/12)>50
group by DEPT
SQL WHERE sub statement performs the conditional filtering according to the condition that the age is above 50 or below 30. The key word "or" is used when any of multiple conditions needs to be met. The months_between function is then used to calculate the difference of months between an employee’s birthday and the system time, divide it by 12 to get the number of years, and round it up using TRUNC function to get the age. As SQL cannot define a temporary variable in WHERE sub statement, it writes the same expression two times. The language needs a subquery in order to reuse a computing result.
SPL solution:
A | |
---|---|
1 | =T("Employee.csv") |
2 | =A1.select((age=age(BIRTHDAY), age<30 |
3 | =A2.groups(DEPT; count(~):COUNT) |
A1: Import Employee table.
A2: A.select() gets records of employees who are below 30 or above 50.
A3: Count the number of records meeting the specified condition in each department.
SPL offers age() function to calculate the age according to the birthday. It allows using the double vertical lines "||" if any of the multiple conditions needs to be met. SPL supports using a temporary variable in the search condition to do the filtering in one step.
【Example 3】Find the differences of salaries and ages between Emma Taylor and Alexis Jones. Below is part of the employee table:
ID | NAME | SURNAME | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | Moore | California | R&D | 7000 |
2 | Ashley | Wilson | New York | Finance | 11000 |
3 | Rachel | Johnson | New Mexico | Sales | 9000 |
4 | Emily | Smith | Texas | HR | 7000 |
5 | Ashley | Smith | Texas | R&D | 16000 |
… | … | … | … | … | … |
SQL solution:
select
(select SALARY
from EMPLOYEE
where NAME='Emma' and SURNAME='Taylor')
- (select SALARY
from EMPLOYEE
where NAME='Alexis' and SURNAME='Jones') as SALARY_GAP,
(select TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE
from EMPLOYEE
where NAME='Emma' and SURNAME='Taylor')
-(select TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE
from EMPLOYEE
where NAME='Alexis' and SURNAME='Jones') as AGE_GAP
from EMPLOYEE
where rownum=1
The SQL code is rather complicated and requires multiple queries. We use a common table to simplify it:
with cte1 as
(select SALARY,TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE
from EMPLOYEE
where NAME='Emma' and SURNAME='Taylor'),
cte2 as
select SALARY,TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE
from EMPLOYEE
where NAME='Alexis' and SURNAME='Jones')
select
cte1.SALARY-cte2.SALARY as SALARY_GAP,
cte1.AGE-cte2.AGE as AGE_GAP
from cte1,cte2
The use common table increases efficiency by reducing the frequency of queries. Yet there is another problem. The expected two records of employees are treated as two tables thanks to SQL’s lack of explicit record type data. A single record in SQL is treated as a temporary table having one record, which is a set consisting of a single member.
SPL solution:
A | |
---|---|
1 | =T("Employee.csv") |
2 | =A1.select@1(NAME=="Emma"&&SURNAME=="Taylor") |
3 | =A1.select@1(NAME=="Alexis"&&SURNAME=="Jones") |
4 | =A2.SALARY-A3.SALARY |
5 | =age(A2.BIRTHDAY)-age(A3.BIRTHDAY) |
A1: Import Employee table.
A2: A.select() works with @1 option to get the record of Emma Taylor.
A3: A.select() works with @1 option to get the record of Alexis Jones.
A4: Calculate difference of their salaries.
A5: Calculate difference of their ages.
We can also use interval() function in A5 to calculate the interval between two dates:
A | |
---|---|
5 | =interval@y(A2.BIRTHDAY,A3.BIRTHDAY) |
The interval() function uses @y option to get the number of years between two dates.
SPL offers true record data type. A2 and A3 respectively return records of the two employees. In the subsequent calculations, we can get a value directly from a record without treating it as a table.
SQL and SPL almost end in a draw in terms of their abilities to select desired members according to a simple condition. When the select condition becomes not that simple, SPL adds one or two lines according to the logic while SQL script gets extremely roundabout. Moreover, SPL can define a temporary variable in the condition and provides more functions to generate concise statements.
【Example 4】Get ID of the student who has the lowest math score in class one according to the following score table. Below is part of the source data:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 84 |
1 | 1 | Math | 77 |
1 | 1 | PE | 69 |
1 | 2 | English | 81 |
1 | 2 | Math | 80 |
… | … | … | … |
SQL solution:
It is hard to get record containing the max/min value in SQL. Without the window function, we need to first calculate the max/min value and the perform a nested query using the max/min value as the condition.
Below is SQL script:
select
STUDENTID
from
SCORES
where
SUBJECT='Math' and and
SCORE=(
select MIN(SCORE)
from SCORES
where SUBJECT='Math' and
)
The record containing the min value could have been gotten through one loop, yet we use an inefficient method to get it due to SQL’s inability though this is not the right choice.
Some databases provides their compensations. Oracle, for instance, has analytic function KEEP for the job. The function makes the SQL statements simple and fast:
select
MIN(STUDENTID) keep (dense_rank first order by SCORE asc) STUDENTID
from
SCORES
where
SUBJECT='Math' and
But KEEP function can only work with an aggregate function. It becomes awkward when there are multiple records containing the max/min value.
SPL solution:
SPL offers A.minp() function to get the record containing the minimum value.
A | |
---|---|
1 | =T("Scores.csv").select(SUBJECT=="Math"&&CLASS==1) |
2 | =A1.minp(SCORE).STUDENTID |
A1: Import Scores table and get records Math scores in class one.
A2: A.minp() function gets record containing the lowest score and then gets the student ID from it.
Sometimes there are multiple records containing the maximum value. To return them all, we can use @a option with A.minp() function:
A | |
---|---|
2 | =A1.minp@a(SCORE).(STUDENTID) |
A2: A.minp() function works with @a function to get all records containing the lowest score and then gets the student IDs from it.
Below is the result returned by A1.minp@a(SCORE):
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 5 | Math | 60 |
1 | 14 | Math | 60 |
The SPL script is simple and clear because it selects the record(s) containing the max/min value directly. SQL, however, is not able to do that. Oracle KEEP function can only calculate the aggregate according to the sorting result.
【Example 5】Find the country that holds the champion in terms total number of medals for the most games and its detailed medal information according to the following Olympic medal table. Below is part of the source data:
Game | Nation | Gold | Silver | Copper |
---|---|---|---|---|
30 | USA | 46 | 29 | 29 |
30 | China | 38 | 27 | 23 |
30 | UK | 29 | 17 | 19 |
30 | Russia | 24 | 26 | 32 |
30 | Korea | 13 | 8 | 7 |
… | … | … | … | … |
SQL solution:
with cte1 as
(select GAME, max(NATION) keep (dense_rank first order by 1000000*GOLD+1000*SILVER+COPPER desc) NATION
from OLYMPIC
group by GAME),
cte2 as
(select min(NATION) NATION, min(GAME) GAMESTART, count(*) GAMECOUNT
from
(select GAME, NATION, row_number() over(ORDER by GAME) rowno, ROW_NUMBER()OVER(PARTITION BY NATION ORDER BY GAME) groupno
from cte1)
group by NATION, groupno-rowno),
cte3 as
(select min(NATION) keep (dense_rank first order by GAMECOUNT desc) NATION, min(GAMESTART) keep (dense_rank first order by GAMECOUNT desc) GAMESTART, max(GAMECOUNT) GAMECOUNT
from cte2)
select t1.GAME,t1.NATION,t1.GOLD,t1.SILVER,t1.COPPER
from OLYMPIC t1
right join cte3
on t1.nation=cte3.nation and t1.game>=cte3.GAMESTART and t1.game<(cte3.GAMESTART+cte3.GAMECOUNT)
It is a great hassle to perform the task with SQL. The language selects the record of champion in each game by the total number of medals, groups all selected records by continuous country names, gets the group containing the most members and other information including the beginning game and the number of games it holds the champion continuously, and finally, selects records from the original table according to the related information.
SPL solution:
SPL offers A.maxp() function to get record containing the maximum value.
A | |
---|---|
1 | =T("Olympic.txt") |
2 | =A1.sort@z(GAME, 1000000*GOLD+1000*SILVER+COPPER) |
3 | =A2.group@o1(GAME) |
4 | =A3.group@o(NATION) |
5 | =A4.maxp(~.len()) |
A1: Import Olympic medal table.
A2: Sort A1’s table by game and the total number of medals.
A3: Get the first record from each group, which is the champion because records are already ordered by result in descending order.
A4: Create a new group when a different country appears.
A5: A.maxp() function gets the group containing the most members, which is the country that holds the champion for the longest time.
The SPL script remains its conciseness while SQL code becomes more and more complicated. Besides the reason that it is hard to get record containing the max/min value in SQL, there is another cause. SQL cannot retain the post-grouping subsets for further grouping and other operations.
【Example 7】Get scores of students who rank in top 2 for each subject in all classes based on the following score table. Below is part of the source data:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 84 |
1 | 1 | Math | 77 |
1 | 1 | PE | 69 |
1 | 2 | English | 81 |
1 | 2 | Math | 80 |
… | … | … | … |
SQL solution:
SQL does not have a standard method of getting top/bottom N. Some databases support SELECT TOP statement while some, including Oracle, do not. As we use Oracle here, we handle the task using ORDER BY together with row numbers:
select CLASS, STUDENTID, SUBJECT, SCORE
from (
select CLASS, STUDENTID, SUBJECT, SCORE,
ROW_NUMBER() OVER(PARTITION BY CLASS,SUBJECT
ORDER BY SCORE DESC) grouprank
from SCORES
)
where grouprank <= 2
SPL solution:
SPL has A.top() function to get top/bottom N.
A | |
---|---|
1 | =T("Scores.csv") |
2 | =A1.group(CLASS,SUBJECT;~.top(-2;SCORE):TOP2) |
3 | =A2.conj(TOP2) |
A1: Import Scores table.
A2: A.top() function gets students whose scores rank in top 2 in each class for each subject. -2 means getting two records in descending order.
A3: Concatenate all selected records.
In certain occasions we need to calculate the sequence number of the interval where a value falls and get corresponding members from a set. Such cases include returning grades (excellent, very good, good, pass) according to exam scores and returning the income level (poverty, comfortable, well-off, rich) according to the family’s annual income.
【Example 6】Based on the score table, find the number of students whose English scores fall in the following intervals respectively: Excellent (score >=90), Good (80=< score <90), OK (70<= score <80), Pass (60<= score <70) , and Failed (score <60). Below is part of the source data:
CLASS | STUDENTID | SUBJECT | SCORE |
---|---|---|---|
1 | 1 | English | 84 |
1 | 1 | Math | 77 |
1 | 1 | PE | 69 |
1 | 2 | English | 81 |
1 | 2 | Math | 80 |
… | … | … | … |
Below is expected result:
ASSESSMENT | COUNT |
---|---|
EXCELLENT | 7 |
GOOD | 3 |
OK | 8 |
PASS | 6 |
FAIL | 4 |
SQL solution:
select
ASSESSMENT, count(*) COUNT
from
(select
case when score<60 then 'FAILED'
when score>=60 and score<70 then 'PASS'
when score>=70 and score<80 then 'OK'
when score>=80 and score<90 then 'GOOD'
when score>=90 then 'EXCELLENT'
else 'OTHER' end ASSESSMENT
from SCORES
where subject='English')
group by ASSESSMENT
SQL uses CASE WHEN to divide the source table into segments, gets grades according to students’ scores, group records by grades, and count students in each group.
SPL solution:
SPL supplies segmentation function A.segp() to return corresponding members in a sequence for each interval according to their sequence number.
A | |
---|---|
1 | =T("Scores.csv").select(SUBJECT:"English") |
2 | =create(ASSESSMENT,SCORE).record(["FAILed",0,"PASS",60,"OK",70,"GOOD",80,"EXCELLENT",90]) |
3 | =A1.groups(A2.segp(SCORE,SCORE).ASSESSMENT;count(1):COUNT) |
A1: Import Scores table and select English scores.
A2: Create the list of assessment & score.
A3: A.segp() function gets assessment according to sequence number of the interval in A2’s list where a score falls, and count students by assessment.
Both SQL and SPL are able to search for data according to the specific intervals. For each interval, SQL adds a piece of CASE WHEN … THEN … statement. When there are a lot of intervals, the code will be lengthy because it needs to define a segmentation condition for each segment. SPL is different. It creates a comparison list to define the name and segmentation value for each segment, which is convenient to maintain.
It is inconvenient to get the record having the maximum/minimum value in SQL. Oracle is only able to perform an aggregate operation on the sorting result with the KEEP function but cannot trykt get the record containing the maximum/minimum. By contrast, SPL offers A.maxp() function and A.minp() function to get all things done with a one-stop calculation.
When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.
The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which combines advantages of both Java and SQL. With SPL, the select operation will be convenient and easy.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code