-
Notifications
You must be signed in to change notification settings - Fork 339
Python vs. SPL 6 Equivalence Grouping
When there are too many things, we usually classify those things into various groups and then perform aggregation operation on them. For example, calculations such as querying the highest score of each class or the average age of employees in each department of the company are called grouping operation which is often followed by the subsequent aggregation operation. The most common grouping is to split members with the same attribute value into one group, which is also known as equivalence grouping. In this article, we’ll compare the calculating abilities of Python and SPL in equivalence grouping and aggregation.
The most common grouping operation is to group the data by a certain column and then aggregate them, for example:
The employee information table of the company is as follows:
Python
import pandas as pd import datetime import numpy as np import math def max_age(s): earliest_birth=s.min() today = datetime. datetime.today() age=math.floor((today- earliest_birth)/np.timedelta64(1,'Y')) return age file="D:\data\EMPLOYEE.csv" emp=pd.read_csv(file) emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY']) dept_agg = emp.groupby('DEPT',as_index=False).agg({'SALARY':['count','mean'],'BIRTHDAY':max_age}) print(dept_agg) |
Import various library functions The function for calculating the maximum age Load the employee information Convert the data type of date Group and aggregate |
The groupby() function in Pandas is to group Dataframe to generate a grouping object. The “object” is difficult to understand and we cannot figure out what is in it. The agg() function is an aggregation function, in which we can define which aggregation operation(s) is(are) performed on which column in the form of dictionary. We can also use the user-defined functions when there is no ready-made function, for instance, the count and mean functions in the example are ready-made functions and the max_age is a user-defined function. This merit is quite programmer-friendly, but it is also worth noting that the input of user-define function is a Series i.e., the BIRTHDAY of each group, which requires us to think deeply about the principle of grouping and aggregation, otherwise we may be very confused. While the most common grouping and aggregation operation is probably just one aggregation, which is very easy to write in Python. For example, to calculate the average salary of each department:
dept_salary= emp.groupby('DEPT',as_index=False).SALARY.mean().
SPL
A | B | |
1 | D:\data\EMPLOYEE.csv | /the path of the employee information table |
2 | =file(A1).import@tc() | /load the data |
3 | =A2.groups(DEPT;count(SALARY):NUM,avg(SALARY):AVG_SALARY, max(age(BIRTHDAY)):MAX_AVG) | /group and aggregate |
The group() function in SPL performs the grouping and aggregation operations, in which the fields before and after “;” are used to group and aggregate respectively. In this function, not only the field for aggregation operation is very distinct, but also we can rename the field for aggregated result, for example, count(SALARY):NUM is to name the result of SALARY count as NUM to generate a new table sequence. The form of the function stays the same for only one aggregation, for instance, to calculate the average salary of each department: A3=A2.groups(DEPT;avg(SALARY): AVG_SALARY), which does not vary as much as Python does.
Sometimes we want to group the data by multiple columns, for example:
Python
#continue to use the employee information emp avg_salary= emp.groupby(['DEPT','GENDER'],as_index=False).SALARY.mean() print(avg_salary) |
Group and aggregate by multiple columns |
The grouping key in groupby() function supports the form of sequence, like ['DEPT','GENDER'], and the subsequent aggregation operation is the same.
SPL
A | B | |
… | /A2 is the employee information | |
5 | =A2.groups(DEPT,GENDER;avg(SALARY):AVG_SALARY) | /group and aggregate by multiple columns |
The groups() function in SPL does not require the multi-column grouping keys to be input in the form of sequence. The form can be the same as long as we place “,” before “;” to separate both.
The grouping result is not always forced to be aggregated because sometimes we are interested in the grouped subsets. For example:
Python
#continue to use emp emp_sort=emp.groupby('DEPT',as_index=False).apply(lambda x:x.sort_values('HIREDATE')).reset_index(drop=True) print(emp_sort) |
Group and sort |
After grouping, Python can use the apply+lambda expression to perform sorting on each group, and the reset_index() function at the end is to reset the index. The apply+lambda expression is previously introduced in Python vs. SPL 3 -- Loop Function, and here it is used in the same way.
SPL
A | B | |
… | /A2 is the employee information | |
7 | =A2.group(DEPT).conj(~.sort(HIREDATE)) | /group and sort |
The group() function in SPL only groups but does not aggregate, and the return result is the set of a set, which is in accordance with the essence of grouping operation. Unlike Python, SPL will never return an intricate object. Also, since the return results are sets, we can process them using the loop function: sort the subsets separately and then union them to get the result of grouping and sorting. What’s more, the group() function returns subsets, which is able to achieve all the functionality of groups() function theoretically, then why do we design a groups() function? The reason is that groups() calculates the aggregation value as an iterative function, during which it does not keep each grouped subset, in this way, the operation is more efficient and takes up less storage space.
Sometimes we want to get a certain aggregation result of the grouped subsets, but the operation is difficult to perform and can not be achieved with a simple aggregation function. Therefore, we need to keep the grouped subsets for further calculations. For example:
Python
#continue to use emp def salary_diff(g): max_age = g['BIRTHDAY'].idxmin() min_age = g['BIRTHDAY'].idxmax() diff = g.loc[max_age]['SALARY']-g.loc[min_age]['SALARY'] return diff emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY']) salary_diff=emp.groupby('DEPT').apply(salary_diff) print(salary_diff) |
The function for calculating the salary difference The index of the oldest employee The index of the youngest employee The salary difference Convert the data type of time Group and complex aggregate |
Python uses a user-defined function to perform the complex aggregation operation, and loops through each group using apply expression to get the final result. Here the input of the user-defined function is the members of grouped subsets, which is not easy to think of.
SPL
A | B | |
… | /A2 is the employee information | |
9 | =A2.group(DEPT;(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY), ma.SALARY-mi.SALARY):SALARY_DIF) | /group and complex aggregate |
SPL uses the group() function to get the grouped subsets, and minp() and maxp() functions return the members with minimum and maximum values by computing the to-be-calculated fields directly. The operation is done without any letup and fits perfectly with the logic of thinking. The code is similar to the groups() function, which only looks simple in this form, but essentially it is:
A9=A2.group(DEPT).new(DEPT,(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY),ma.SALARY-mi.SALARY):SALARY_DIF)
And the principle behind it is the same as group()+conj() in the previous example, which processes the grouped subsets.
During data analysis, we may also encounter the situation that counts the unique members after grouping. For example:
Python
#continue to use emp dept_state=emp.groupby('DEPT',as_index=False).agg({"STATE":pd.Series.nunique}) print(dept_state) |
Group and count the number |
There is a function for counting unique members in Python, that is, pd.Series.nunique, which executes in the same way as the grouping and aggregation operation introduced at the beginning of the article.
SPL
A | B | |
… | /A2 is the employee information | |
11 | =A2.groups(DEPT;icount(STATE)) | /group and count the number |
The icount() function in SPL returns the number of unique values in each group, which is equivalent to performing another grouping on the groups. But the grouping operation in the groups only counts the number without any other aggregation.
The grouping and aggregation operations in Python are relatively abundant and have its own system. The groupby+agg expression can achieve various aggregation operations; groupby+apply expression can process the grouped subsets, but the generated grouping object after executing the groupby function is quite difficult to understand. It is neither Series nor Dataframe and we have no idea about what kind of data structure it is, which has a great impact on the following agg or apple operation because we don’t know what is the input. Then we have to practice more to get to understand the “object” generated by the groupby function.
While SPL performs the operations following the normal logic: the grouped result is a subset and the whole result is a set of the set. The subsequent processing can be done according to the normal syntax without the need to figure out anything else, which makes it very easy to adapt to other situations. In addition, the groups() function can further improve the efficiency of grouping and aggregation operations.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code