-
Notifications
You must be signed in to change notification settings - Fork 339
Membership Test for Structured Data
Sometimes we need to judge whether a member belongs to a certain set or not when selecting data from a table, such as getting housing prices of major cities from house price table, getting VIP customers’ sales records from sales table, etc. This article illustrates how to deal with these membership issues fast and efficiently, and offers esProc sample scripts.
Get records from a table according to whether a value is included in a specified set (where the number of members is within 10).
【Example 1】 Calculate the average salary in each department in first-tier cities based on the employee table. Below is part of the source data:
ID | NAME | CITY | SALARY |
---|---|---|---|
1 | Rebecca | Tianjin | 7000 |
2 | Ashley | Tianjin | 11000 |
3 | Rachel | Shijiazhuang | 9000 |
4 | Emily | Shenzhen | 7000 |
5 | Ashley | Nanjing | 16000 |
… | … | … | … |
【Analysis】 When getting data from the employee table, we need to check whether the city where an employee settles is included in the constant set of [Beijing, Shanghai, Guangzhou, Shenzhen]. When the number of members in the set is within 10, we can use A.contain() function to get the eligible records. 【SPL script】
A | B | |
---|---|---|
1 | =connect("db").query("select * from Employee") | / Connect to database and query Employee table |
2 | [Beijing, Shanghai, Guangzhou, Shenzhen] | / Define a constant set of first-tier cities |
3 | =A1.select(A2.contain(CITY)) | / Use A.contain() function to get records where the CITY value is contained in A2’s constant set |
4 | =A3.groups(DEPT; avg(SALARY):SALARY) | / Group selected records by DEPT and calculate average salary in each group |
A4’s result:
DEPT | SALARY |
---|---|
Finance | 7833.33 |
HR | 7187.5 |
Marketing | 7977.27 |
… | … |
Get records from a table according to whether a value is included in a specified set (that has members over 10).
【Example 2】 Calculate the sales amount of every big customer in each month of the year 2014 based on the sales table. Below is part of the source data:
ID | Customer | SellerId | Date | Amount |
---|---|---|---|---|
10400 | EASTC | 1 | 2014/01/01 | 3063.0 |
10401 | HANAR | 1 | 2014/01/01 | 3868.6 |
10402 | ERNSH | 8 | 2014/01/02 | 2713.5 |
10403 | ERNSH | 4 | 2014/01/03 | 1005.9 |
10404 | MAGAA | 2 | 2014/01/03 | 1675.0 |
… | … | … | … | … |
【Analysis】
Similarly, we need to check whether the Customer value in each sales record is included in the constant set of big customers. In this case the number of members in the set is relatively large (above 10). We first sort the constant set, and then use @b option with A.contain() function to perform a binary search.
【SPL script】
A | B | |
---|---|---|
1 | =connect("db").query("select * from Sales") | / Connect to database and query Sales table |
2 | =["SAVEA","QUICK","ERNSH","HUN","RATTC","HANAR","FOLKO","QUEEN,MEREP","WHITC","FRANK","KOENE"].sort() | / Define a constant set of big customers and sort it |
3 | =A1.select(year(Date)==2014 && A2.contain@b(Customer)) | / Get records of big customers in 2014. As set A is already ordered, use @b option with A.contain() function to perform a binary search |
4 | =A3.groups(month(Date):Month; sum(Amount):Amount) | / Group the selected records by month and sum the sales amounts in each month |
A4’s result:
STATE | SALARY |
---|---|
California | 7700.0 |
Texas | 7592.59 |
New York | 7677.77 |
Florida | 7145.16 |
Other | 7308.1 |
Get records from a table according to whether a record is included in the foreign key references in its associated table.
【Example 3】 Find how many students are there in each class who select the Matlab course. Below are SelectCourse table and Course table:
【Analysis】
We need to check whether the CourseID in a record is “Matlab” when getting records from the SelectCourse table. Before that we get a set of course records from the Course table where the course name is “Matlab”, and then get records from the SelectCourse table where the CourseID is contained in the set.
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | /Connect to database |
2 | =A1.query("select * from Course") | / Query Course table |
3 | =A1.query("select * from SelectCourse") | / Query SelectCourse table |
4 | =A2.select(Name=="Matlab") | / Get Course table records where the course name is “Matlab” |
5 | =A3.join@i(CourseID, A4:ID) | / Use @i option with A.join() function to perform a join filtering |
6 | =A5.groups(Class; count(1):SelectCount) | / Group the selected records and count the students who select the specified course in each class |
A6’s result:
Class | SelectCount |
---|---|
Class 1 | 3 |
Class 2 | 5 |
… | … |
Get records from a table according to whether a record is included in the non-foreign key references in its associated table.
【Example 4】 Find the number of students in each class who have a score of above 80 for any subject. Below are Score table and Student table:
【Analysis】
When getting records from the Student table, we need to check whether the current student has a subject whose score is above 80. First we select all Score table records where the Score value is greater than 80 and perform distinct by StudentID to generate a set of records of unique student IDs whose corresponding scores are above 80. Then we get Student table records whose IDs are included in that set of records.
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from Student") | / Query Student table |
3 | =A1.query("select * from Score") | / Query Score table |
4 | =A3.select(Score>80) | / Get Score table records where score value is greater than 80 |
5 | =A4.id(StudentID) | / id function performs distinct operation by student ID |
6 | =A2.join@i(ID, A5) | / Use A.join@i() function to perform join filtering |
7 | =A6.groups(Class; count(1):StudentCount) | / Group the joining records and count the eligible students in each class |
A7’s result:
Class | StudentCount |
---|---|
Class 1 | 9 |
Class 2 | 11 |
… | … |
In an optimized and efficient way, test and get records from a table according to whether a record is included in the non-foreign key references in its associated table.
【Example 5】Get the number of customers in each city who placed an order in 2014. Below are Sales table and Customer table:
【Analysis】
When getting records from the Sales table, we need to check whether the current customer placed an order in 2014. First we select sales table records of 2014 and perform distinct by customer ID to generate a set of records of unique customer IDs who have sales records in 2014. Then we get Costomer table records whose IDs are included in that set of records.
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from Customer") | / Query Customer table |
3 | =A1.query("select * from Sales where year(Date)=2014 order by CustomerID") | / Query sales records of 2014 and sort them by customer ID |
4 | =A3.groups@o(ID) | / As the selected records are already ordered, groups() function works with @o option to perform distinct by customer ID |
5 | =A2.join@i(ID, A4:CustomerID) | /Use A.join@i() function to perform a join filtering |
6 | =A5.groups(City; count(1):CustomerCount) | / Group the joining records and count the eligible customers in each city |
A6’s result:
City | CustomerCount |
---|---|
Dongying | 6 |
Tangshan | 7 |
… | … |
Get records from a table according to whether a record is not included in the non-foreign key references in its associated table.
【Example 6】 Find the total sales amount of each new customer in the year of 2014. Below are Sales table and Customer table:
【Analysis】
When getting records from the Sales table, we need to make sure the current customer didn’t place an order in 2014. First we select sales table records of 2014 and then we get those whose customer IDs are not included in the Customer table.
【SPL script】
A | B | |
---|---|---|
1 | =connect("db") | / Connect to database |
2 | =A1.query("select * from Sales where year(OrderDate)=2014") | / Query sales records of 2014 |
3 | =A1.query("select * from Customer") | / Query Customer table |
4 | =A2.join@d(CustomerID ,A3:ID) | /Use A.join@d() function to get sales records whose customer IDs are not included in Customer table |
5 | =A4.groups(CustomerID; sum(Amount):Amount) | / Group the selected records and sum sales amounts of each customer |
A5’s result:
CustomerID | Amount |
---|---|
DOS | 11830.1 |
HUN | 57317.39 |
… | … |
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code