-
Notifications
You must be signed in to change notification settings - Fork 339
Sample Programs of Structuralizing Excel Files
This article illustrates how to parse Excel files of different formats and sizes, including the common row-wise format, complicated table headers, free-style format, crosstab, main & sub tables, and big files, into structured data, and provides sample programs in esProc SPL.Looking ${article} for details.
An Excel file needs to be parsed and structuralized before it is computed or loaded into a relational database. Often we have an Excel file with non-standardized format and there are various file structures. The program for structuralizing an Excel file thus becomes complicated and bloated. It’s also difficult to migrate. You need to do the parsing from scratch for each type of format or structure before programming.
This article illustrates how to parse Excel files of different formats and sizes, including the common row-wise format, complicated table headers, free-style format, crosstab, main & sub tables, and big files, into structured data, and provides sample programs in esProc SPL. esProc is intended as a professional data computing engine that is based on Structured Process Language (SPL). SPL boasts a complete set of Excel file handling functions to structuralize and process Excel files and to write the result set to the database.
This is the simplest Excel file format. Each row is a record, and in most cases the first row contains column headers.
Example: Based on the students score file scores.xlsx, calculate the average Chinese score for each class. Below is part of the file:
esProc SPL script:
A | Comment | |
1 | =file(“e:/excel/scores.xlsx").xlsimport@t() | Read the file; @t option read the 1^st^ row as column headers |
2 | =A1.groups(Class;avg(Chinese):avg_Chinese) | Group rows by Class and calculate the average Chinese score of each class |
3 | =file("e:/excel/class_avg_c.xlsx").xlsexport@t(A2) | Export result set to a new Excel file |
In many real-world cases, Excel files have complicated table headers that occupy several rows, including table title, project name, person who completes the form, date, page information, etc. To parse such an Excel file, you need to skip the table headers, specify a row where the reading begins and give names to the structuralized columns.
Example: Based on the project price list itemPrices.xlsx, calculate the total project price. Below is part of the file:
esProc SPL script:
A | Comment | |
1 | =file(“e:/excel/itemPrices.xlsx").xlsimport(;1,5) | Parameters “1,5” means reading data in sheet 1 through to the end starting from the 5^th^ row |
2 | =A1.rename(#1:No,#2:ItemCode,#3:ItemName,#4:Unit,#5:Quantity,#6:Price,#7:Sum) | Rename columns |
3 | =A2.sum(Sum) | Get the total price |
In a free style Excel file, each record corresponds to multiple rows; cells containing column values are directly under or on the right of the column name cell; and there may be cross-row or cross-column merged cells. What we know for sure is that the number and structure of rows that form each record are the same. When reading the file by loop, get each record by the number of rows they occupy.
Example: Load the free style Excel file employee.xlsx that stores employee information into database table employee. Below is part of the file:
esProc SPL script:
A | B | C | |
1 | =create(ID,Name,Sex,Position,Birthday,Phone,Address,PostCode) | ||
2 | =file(“e:/excel/employe.xlsx").xlsopen() | ||
3 | [C,C,F,C,C,D,C,C] | [1,2,2,3,4,5,7,8] | |
4 | for | =A3.(~/B3(#)).(eval($[A2.xlscell(]/~/")")) | |
5 | if len(B4(1))==0 | break | |
6 | >A1.record(B4) | ||
7 | >B3=B3.(~+9) | ||
8 | =connect(“db”) | ||
9 | =A8.update(A1,employee) | ||
10 | =A8.close() |
A1 Create an empty table sequence consisting of columns “ID, Name, Sex, Position, Birthday, Phone, Address and PostCode”.
A2 Open the Excel file
A3 Define a sequence of column numbers commanding the cells containing employee information
B3 Define a sequence of row numbers commanding the cells containing employee information
A4 Retrieve information of each employee using for statement
B4 A3.(~/B3(#)) gets the sequence of cells holding information of the current employee and then reads the cell values to form a sequence of information. In the first round of loop, the cells are [C1,C2,F2,C3,C4,D5,C7,C8]. In the second round of the loop, the cells are [C10,C11,F11,C12,C13,D14,C16,C17]; and so on. Add 9 to the current row number for each round of loop.
B5 Check whether the current employee ID is null. Exit the loop and stop data retrieval if it is null.
B6 Append an employee record to A1’s table sequence
B7 Add 9 to each of the sequence numbers of the rows holding the current employee information to retrieve information of the next employee
A8-A10 Connect to the database to load the employee information to database table employee and close the connection
Below is the data retrieved by A1:
Commonly seen in statistics, a crosstab is a matrix-type table that displays the quantitative relationship between two variables. A column to row/row to column transposition between a column/row variable and the crossed value will be performed according to another column/row variable after the crosstab data is retrieved row by row.
Example: Parse crosstab cross.xlsx,* *where the orders area and the shipping type are crossed, into structured data. Below is part of the file:
esProc SPL script:
A | Comment | |
1 | =file(“e:/excel/cross.xlsx").xlsimport@t(;1,2) | Read in sheet 1 of the Excel file starting from the 2^nd^ row; use 2 as the column name |
2 | =A1.rename(#1:Type) | Rename column 1 Type |
3 | =A2.pivot@r(Type;Area,Amount) | Perform transposition over A2’s table sequence; @r option enables a column to row transposition; the new column names after transposition are Area and Amount |
Below is part of A3’s data:
In an Excel file, if one sheet is a record that contains a number of (N) sub-records, that forms the main and sub tables. The number of the main table records is that of the sheets in the Excel file. To retrieve data from such type of Excel files, we need to create two tables to respectively store records of the main table and those of the sub table.
Example: In staff information file staff.xlsx, each sheet stores information of staff and their families. We need to parse the file into two structured data tables that respectively store the staff information and family information. Below is one of the Excel sheets:
esProc SPL script:
A | B | C | |
1 | =create(IDCard,Name,Sex,Birthday,Nation,Phone,Depart,Home,Marital,Entry) | ||
2 | =create(IDCard,Name,Relation,Workplace,Phone) | ||
3 | [B4,B3,D3,F3,H3,F4,H4,B5,F5,H5] | ||
4 | =file(“e:/excel/staff.xlsx").xlsopen() | ||
5 | for A4 | =A3.(eval($[A4.xlscell(]/~/",\""/A5.stname/"\")")) | >A1.record(B5) |
6 | =A4.xlsimport@t(Family,Name,Relation,Workplace,Phone;A5.stname,6) | ||
7 | =B6.rename(Family:IDCard) | >B7.run(IDCard=B5(1)) | |
8 | >A2.insert@r(0:B7) |
A1 Create an empty table consisting of fields IDCard, Name, Sex, Birthday, Nation, Phone, Depart, Home, Marital and Entry to store the staff information of the main table
A2 Create an empty table made up of fields IDCard, Name, Relation, Workplace and Phone to store the family information in the sub table
A3 Define a sequence of cells holding the staff information in the main table
A4 Open the Excel file
A5 Read each sheet of the Excel file by loop
B5 Read staff information in each sheet as a sequence
C5 Store B5’s staff information in A1’s table sequence
B6 Read family information starting from the 6^th^ row; only 5 columns (Family, Name, Relation, Workplace and Phone) are read
B7 Change the name of column Family in B6’s table sequence into IDCard
C7 Assign IDCard values in the main table to column IDCard in B7’s table sequence
B8 Append family information in B7 to A2’s table sequence
Below is part of A1’s data:
Below is part of A2’s data:
There are detailed explanations and examples of the parsing, structuralization and computation of big text files in Big Structured Text File Processing & Sample Programs. Here we illustrate it again with an Excel file:
Example: Based on orders information table orders.xlsx – which is big, calculate the total orders amount for each area. Below is part of the file:
esProc SPL script:
A | Comment | |
1 | =file(“e:/excel/orders.xlsx").xlsimport@tc() | Read the Excel file; @t option enables reading row 1 as the column headers; @c option enables returning a cursor |
2 | =A1.groups(Area;sum(Amount)) | Group rows by Area and sum orders amounts in each area |
Find more examples in SPL CookBook.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code