-
Notifications
You must be signed in to change notification settings - Fork 339
Looking for the Best Post SQL Scripting Tool
A post-SQL script is used for performing complex computations that SQL is awkward to handle. Here I select three post scripting tools to compare them in terms of efficiency, syntax expressiveness and function library capability, and to examine their performances in handling various scenarios such as set-based operations and order-based computations. The findings show that esProc SPL is the best. Looking Looking for the Best Post-SQL Scripting Tool for details.
SQL (or the stored procedure) can handle most of the database computations. If the computations are complex or hard to deal in SQL, we use another programming language to read data out of the database to manipulate it. Such a programming language handles the data read and manipulation with a simple script. So we call the process the post-SQL scripting.
The scenarios that SQL is not good at handling include complex set-based operations, order-based operations, associative operations and multi-step computations, etc. Due to SQL’s incomplete set orientation and lack of explicit set data type, it’s almost impossible to reuse the intermediate sets generated during the computation. The forced aggregate after each grouping operation makes it impossible to use the post-grouping subsets. The unordered-set-based SQL adopts a very roundabout way to handle order-based computations, like inter-row (group) computations and ranking operations. The language generates temporary sequence numbers using JOIN(s) or subqueries, making the program hard to write and slow to compute. Record reference is another SQL’s incapability. The language uses a subquery or a JOIN statement to express the relationship. Code becomes ridiculously complicated when there are multiple data levels or when self-joins are needed. SQL doesn’t foster multi-step coding. Programmers have to write very long query containing layers of subqueries. Though stored procedures can alleviate the problem, they are not always available. DBA has strict rules about the privileges of using stored procedures, and old and small databases don’t support stored procedures. Besides, it’s inconvenient to debug a stored procedure. This makes it unsuitable to do a procedural computation.
There are other scenarios that require post-SQL scripting. To migrate the algorithm between different database products or between database and non-relational database, data source or output target isn’t the database but a file, and mixed computation performed between multiple databases, for example. All these external database computations need post-SQL scripting.
The most important role of a post-SQL script is to achieve the complex computations that SQL is not good at. It would be better if they had some advanced features, such as the ability of processing data of various sources (files and non-relational databases, for example) and handling a relatively large amount of data, and satisfactory performance. But, the basic thing is that the scripting tool should be able to perform database read/write conveniently.
The popular post-SQL scripting tools are Java, Python pandas and esProc SPL. Now let’s look at and examine their scripting abilities.
High-level languages, such C++ and Java, are theoretically almighty and thus are able to manage computations SQL is hard to deal with. Java supports generic type and has relatively comprehensive set orientation to handle complex set-based computations. A Java array has intrinsic sequence numbers to implement order-based computations conveniently. Java can express a relationship using object reference and handle join operations well. It also supports procedural syntax, including branch and loop, to achieve complex multi-step computations.
Unfortunately, Java lacks class libraries for structured computations. It hardcodes even the simplest structured computations and creates the most basic data type manually. That makes Java code lengthy and complicated.
Here’s an example of order-based computation: get the number of longest consecutively rising trading days for a stock. Database AAPL stores a stock’s price information in fields including transaction dates and closing price.
In an intuitive way, we loop through the records ordered by dates, add 1 to the number of consecutive rising days (the initial value is 0) if the closing price of the current record is greater than that of the previous one; and compare the number of consecutively rising days with the current largest number of consecutively rising days (the initial value is 0) to get the new largest number and reset the number as 0. Repeat the process until the loop is over and the current largest number is the final desired one.
SQL can’t implement the intuitive algorithm because it doesn’t support order-based computations. But it has its own tricks. It groups stock records ordered by dates. The way is like this: put records whose closing prices rise consecutively in one group, that is, group the current record with the previous one if its closing price rises and create a new group if its price lowers; then count the members in every group and find the largest number, which is the largest number of consecutively rising days.
Below is the SQL query:
select max(continue_inc_days)
from (select count(*) continue_inc_days
from (select sum(inc_de_flag) over(order by transDate) continue_de_days
from (select transDate,
case when
price>LAG(price) over(order by transDate)
then 0 else 1 end inc_de_flag
from AAPL) )
group by continue_de_days)
The SQL program is not long but contains 4 layers of nested queries. The trick is peculiar and intelligible and hard to imagine.
Java, however, can do this in an intuitive way:
package stock;
import java.sql.*;
public class APP {
public static void main(String[] args) throws SQLException,ClassNotFoundException {
Connection con = null;
Class.forName("com.mysql.cj.jdbc.Driver");
con = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3306/mysql?&useSSL=false&serverTimezone=UTC","root", "");
String dql = "select * from AAPL order by transDate";
PreparedStatement stmt = con.prepareStatement(dql,ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet aapl = stmt.executeQuery();
int continue_inc_days = 0;
int max_continue_inc_days = 0;
float last_Price = 0;
while (aapl.next()) {
float price = aapl.getFloat("price");
if (price >= last_Price) {
continue_inc_days++;
} else {
if (continue_inc_days >= max_continue_inc_days) {
max_continue_inc_days = continue_inc_days;
}
continue_inc_days = 0;
}
last_Price = price;
}
System.out.println("max_continue_inc_days=" + max_continue_inc_days);
if (con != null){
con.close();
}
}
}
The later part of the Java program implements the above intuitive algorithm with only one layer of loop.
On the other hand, the Java program is lengthy though it is simple and easy to understand.
The program will be extremely long if it involves structured computation like grouping and join.
Java is also able but too roundabout in implementing advanced features like handling mixed data sources, performance optimization and big data processing.
In summary, Java is an excellent universal enterprise programming language but not specific enough due to the absence of class libraries for structured computations. The lack of focus results in roundabout and lengthy code and removes the language from the position of the most ideal post-SQL scripting tool.
Python boasts concise syntax and numerous third-party function libraries, among which Pandas is for structured data processing. That’s why Pandas is a popular post-SQL scripting tool.
Pandas has the capability to code a complex computation in a relatively simple way.
For example, to find the number of largest consecutively rising days for a stock, Pandas has its own way:
import pymysql
import pandas as pd
conn = pymysql.connect(
host = "127.0.0.1",
port = 3306,
user = "root",
password = "",
database = "mysql",
)
aapl = pd.read_sql_query("select price from AAPL order by transDate", conn)
continue_inc_days=0 ; max_continue_inc_days=0
for i in aapl['price'].shift(0)>aapl['price'].shift(1):
continue_inc_days =0 if i==False else continue_inc_days +1
max_continue_inc_days = continue_inc_days if max_continue_inc_days < continue_inc_days else max_continue_inc_days
print(max_continue_inc_days)
conn.close()
In the snippet, Pandas uses dataFrame structure to handle structured data. dataFrame has intrinsic sequence numbers that enable producing simpler code for order-based computations and easy access of a value in another row. Pandas offers a compact encapsulation of SQL data retrieval feature to generate shorter and more concise code than Java.
Here’s an example of set-based operation: to split one row into multiple rows. Database table tb has two fields, where ANOMALIES field stores tab-separated strings. The task is to split ANOMALIES by spaces to generate multiple rows with same ID.
Source data:
ID | ANOMALIES |
---|---|
1 | A1 B1 C1 D1 |
2 | A2 |
3 | A3 B3 C3 |
4 | A3 B4 D4 |
… | … |
The expected result:
ID | ANOMALIES |
---|---|
1 | A1 |
1 | B1 |
1 | C1 |
1 | D1 |
2 | A2 |
… | … |
The core Pandas code is as follows (database input and output is omitted):
…
split_dict = pd.read_sql_query("select * from tb", conn)
split_list = []
for key,value in split_dict.items():
anomalies = value[0].split(' ')
key_array = np.tile(key,len(anomalies))
split_df = pd.DataFrame(np.array([key_array,anomalies]).T,columns=['ID','ANOMALIES'])
split_list.append(split_df)
df = pd.concat(split_list,ignore_index=True)
Pandas splits each string into a dataFrame and concatenate multiple dataFrames using its set-related functions. The code is concise. To achieve similar effect in Java, manual work is needed and the code is complicated.
As a structured computation function library, Pandas produces much more concise code than Java, only for relatively simple scenarios. It generates lengthy and hard to understand code when the computation becomes really complex.
Here’s an example involving both set-based operation and order-based operation. Database duty records daily arrangement of duties. A worker may work shift for several workdays continuously before another worker replace him. The task is to get records of continuous shift working for each worker. Below is the data structure:
Source data:
Date | Name |
---|---|
2018-03-01 | Emily |
2018-03-02 | Emily |
2018-03-04 | Emily |
2018-03-04 | Johnson |
2018-04-05 | Ashley |
2018-03-06 | Emily |
2018-03-07 | Emily |
… | … |
The expected result:
Name | Begin | End |
---|---|---|
Emily | 2018-03-01 | 2018-03-03 |
Johnson | 2018-03-04 | 2018-03-04 |
Ashley | 2018-03-05 | 2018-03-05 |
Emily | 2018-03-06 | 2018-03-07 |
… | … | … |
Below is core Pandas code:
……
duty = pd.read_sql_query("select date,name from duty order by date", conn)
name_rec = ''
start = 0
duty_list = []
for i in range(len(duty)):
if name_rec == '':
name_rec = duty['name'][i]
if name_rec != duty['name'][i]:
begin = duty['date'].loc[start:i-1].values[0]
end = duty['date'].loc[start:i-1].values[-1]
duty_list.append([name_rec,begin,end])
start = i
name_rec = duty['name'][i]
begin = duty['date'].loc[start:i].values[0]
end = duty['date'].loc[start:i].values[-1]
duty_list.append([name_rec,begin,end])
duty_b_e = pd.DataFrame(duty_list,columns=['name','begin','end'])
The code is still roundabout even I have omitted the databased input and output.
Another example involves set-based operation and multi-step computation. Database table loan records loan information. It has four fields – loan ID, amount, term and rate, as shown below. The task is to calculate the details.
LoanID | LoanAmt | Term | Rate |
---|---|---|---|
L01 | 100000 | 5 | 4.8 |
L02 | 20000 | 2 | 5.0 |
L03 | 500000 | 12 | 4.5 |
The details include payment, interest, principle and principle balance for each term. The expected result is as follows:
LoanID | LoanAmt | Payment | Term | Rate | interest | principal | principlebalance |
---|---|---|---|---|---|---|---|
L01 | 100000 | 20238.13 | 5 | 4.75 | 395.83 | 19842.29 | 80159.71 |
L01 | 100000 | 20238.13 | 5 | 4.75 | 317.29 | 19920.83 | 60236.87 |
L01 | 100000 | 20238.13 | 5 | 4.75 | 238.44 | 19999.69 | 40237.18 |
L01 | 100000 | 20238.13 | 5 | 4.75 | 159.27 | 20078.85 | 20158.33 |
… |
Below is the core Pandas code:
loan_data = pd.read_sql_query("select loanID,LoanAmt,Term,Rate from loan", conn)
loan_data['mrate'] = loan_data['Rate']/(100*12)
loan_data['mpayment'] = loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term']) \
/(np.power(1+loan_data['mrate'],loan_data['Term'])-1)
loan_term_list = []
for i in range(len(loan_data)):
loanid = np.tile(loan_data.loc[i]['LoanID'],loan_data.loc[i]['Term'])
loanamt = np.tile(loan_data.loc[i]['LoanAmt'],loan_data.loc[i]['Term'])
term = np.tile(loan_data.loc[i]['Term'],loan_data.loc[i]['Term'])
rate = np.tile(loan_data.loc[i]['Rate'],loan_data.loc[i]['Term'])
payment = np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term'])
interest = np.zeros(len(loanamt))
principal = np.zeros(len(loanamt))
principalbalance = np.zeros(len(loanamt))
loan_amt = loanamt[0]
for j in range(len(loanamt)):
interest[j] = loan_amt*loan_data.loc[i]['mrate']
principal[j] = payment[j] - interest[j]
principalbalance[j] = loan_amt - principal[j]
loan_amt = principalbalance[j]
loan_data_df = pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])),columns = ['loanid','loanamt','term','rate','payment','interest','principal','principalbalance'])
loan_term_list.append(loan_data_df)
loan_term_pay = pd.concat(loan_term_list,ignore_index=True)
print(loan_term_pay)
So you see that Pandas is not always concise. This is probably because the third-party function library hasn’t enough syntax support from Python. The lack of focus on structured computation is another reason.
The lack of focus is reflected by handling mixed data sources. Pandas doesn’t have the uniform interface for various types of data sources. It doesn’t support complex data sources, such as Hadoop and MongoDb. It supports the local files of popular formats only. Users have to turn to a third-party, the fourth-party actually, function library to write complicated access code. Pandas even doesn’t have a universal database interface. There are several function libraries, PyMySQL, sqlalchemy and MySQLdb, for accessing MySQL, for example. But this isn’t the key problem. Pandas can support almost all popular databases more or less.
The key to a join operation involving heterogeneous sources is to be able to read in the data. Pandas gives a satisfactory solution, still for relatively simple scenarios. Any complex joins will make the implementation hard and difficult to understand.
Pandas is not so good at handling big data because it doesn’t have cursor data type. It hardcodes the data retrieval by loop instead of launching an automatic data exchange between the memory and external storage. The code is thus unusually complicated. You can learn more in How Python Handles Big Files.
Not excellent though, Pandas provides good performance. An obvious weakness is that it’s difficult to implement multithreaded processing in Python. Since database IO is slow, we use parallel processing to increase the retrieval performance when there is abundant database space. Pythons problem is that it needs a third-party function library to do this, leading to complicated code, awkward algorithm description, inefficient execution and non-universal queries.
In a word, Pandas is intended for structured computations but fails the expectation.
esProc SPL is equipped with a wealth of structured computation functions. Different from Pandas, esProc SPL is an enterprise level open-source product and truly professional rather than a third-party function library under a larger organization. esProc SPL has the uniform and all-round syntax for structured data processing and provides consistent structured data types in a bottom-up way to support as flexible and dynamic cooperation of functions as possible. These enable esProc SPL to handle post-SQL computations conveniently and efficiently.
As a professional structured data processing tool, SPL is an expert in handling complex computations that SQL is awkward to deal with. To find the largest number of consecutive rising trading days for a stock, for example, esPoc only needs to lines of code to implement the intuitive algorithm mentioned above:
A | |
---|---|
1 | =mysqlDB.query@x("select price from AAPL order by transDate") |
2 | =a=0,A1.max(a=if(price>price[-1],a+1,0)) |
The script uses table sequence structure and the loop function. A table sequence is specially designed for structured computations. It enables easier access of values in another row and more convenient context for handling order-based computations. A loop function gets rid of most of the for statements, though they are still needed in certain complex situations, to much simplify the code. SPL offers a more compact encapsulation of SQL data retrieval method and thus produces shorter code than Pandas.
To split one row into multiple rows, SPL has the following concise script:
A | |
---|---|
1 | =orcl.query@x("select * from tb") |
2 | =A1.news(ANOMALIES.split("");ID,~: ANOMALIES) |
SPL can handle more complex scenarios in an effortless way.
To get the duty details, for instance, SPL generates a much shorter script:
A | |
---|---|
1 | =orcl.query("select date,name from duty order by date") |
2 | =A1.group@o(name) |
3 | =A2.new(name,~.m(1).date:begin,~.m(-1).date:end) |
To calculate the loan installment details, SPL gives simpler solution than Pandas:
A | |
---|---|
1 | =orcl.query@x("select loanID,LoanAmt,Term,Rate from loan") |
2 | =A1.derive(Rate/100/12:mRate,LoanAmtmRatepower((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment) |
3 | =A2.news((t=LoanAmt,Term);LoanID, LoanAmt, mPayment:payment, Term, Rate, t* mRate:interest, payment-interest:principal, t=t-principal:principlebalance) |
SPL can handle complex scenarios that Pandas find it hard to manage in a convenient and simple way. To handle one involving multi-step computation, set-based operation and dynamic data structure, for instance, SPL inserts the sub table to the main table horizontally in a graceful and smooth way:
The relationship between source tables:
Order (Main table) | OrderDetail (sub table) | |
---|---|---|
ID(pk) | ß- | OrderID(PK) |
Customer | Number(pk) | |
Date | Product | |
Amount |
The expected result:
ID | Customer | Date | Product1 | Amount1 | Product2 | Amount2 | Product3 | Amount3 |
---|---|---|---|---|---|---|---|---|
1 | 3 | 2019-01-01 | Apple | 5 | Milk | 3 | Salt | 1 |
2 | 5 | 2019-01-02 | Beef | 2 | Pork | 4 | ||
3 | 2 | 2019-01-02 | Pizza | 3 |
SPL generates a rather simple script:
A | B | |
---|---|---|
1 | =orcl.query@x("select * from OrderDetail left join Order on Order.ID=OrderDetail.OrderID") | |
2 | =A1.group(ID) | =A2.max(~.count()).("Product"+string(~)+","+"Amount"+string(~)).concat@c() |
3 | =create(ID,Customer,Date,${B2}) | >A2.run(A3.record([ID,Customer,Date] |
SPL also has advanced features to tackle some particularly thorny scenarios.
It supports access of heterogeneous data sources, including various file formats and complex sources such as Hadoop and MongoDB. This allows analysts to use same script to process data coming from various sources, including both databases and non-databases.
SPL supplies cursor mechanism from the low level to hide the details of data exchange between memory and external storage from the top level and to permit analysts to handle a large amount of data using syntax they employ to manipulate a small amount of data.
Database table orders records a large number of orders that can’t fit into the memory. So we need to do an out-of-database computation. The task is to get the largest three orders for each salesperson. Below is the SPL script:
A | |
---|---|
1 | =my.cursor@x("select sellerid,amount from orders") |
2 | =A1.groups(sellerid;top(3; -amount):top3) |
3 | =A2.conj(top3) |
SPL also has a series of simple and easy to use performance optimization methods. Suppose the monthly data in orders table is almost equal and we want to perform a multithreaded query by month in order to increase the performance. SPL has the following script:
A | B | |
---|---|---|
1 | fork to(1,12) | / A parallel processing using 12 threads |
2 | =connect(“my”) | |
3 | =B2.query@x(“select * from orders where month(orderdate)=?”,A1) | |
4 | =A1.conj() | / Concatenate the query results returned by the threads |
Intended for processing structured data and with a rich library of structured computation functions, esProc SPL can handle the complex computing logics that SQL is difficult to implement in a simple and convenient way. In a nutshell, esProc SPL is the best post-SQL scripting tool.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code