A small application that can be used to clean, filter and repair raw spreadsheet data before it's used in other tools.
Often, when dealing with raw data coming out of a database you will be dealing with artifacts and inconsistencies because data in the database can contain anything necessary in any format. However, spreadsheets and other applications need well-formatted regular data files to start from. The tool presented here tries to provide some repair and filtering capabilities for those instances where direct access to the database is not available. The tool will try to do its best but is not guaranteed to always come up with perfect results.
You can either call the Java application directly as can be seen in the run.sh shell script, or you can use the convenience wrapper repair.sh, which takes the same arguments but has pre-configures the Java settings.
usage: ./repair.sh [-h <arg>] [-i <arg>] [-if <arg>] [-o <arg>] [-of <arg>] [-p <arg>] [-t <arg>] [-v] [-w]
-h,--help <arg> Command line help
-i,--input <arg> Load data from input file
-if,--iformat <arg> Input delimiter character, use t for tab and c for
comma. Default is c (comma). Any other character
or string will be used as-is
-o,--output <arg> Write data to output file, or if not provided
write to stdout
-of,--oformat <arg> Output delimiter character, use t for tab and c
for comma. Default is c (comma). Any other
character or string will be used as-is
-p,--operation <arg> The operation to perform, one of: json2xml,
xml2json, trim, tolower, toupper, hashcode,
removewhitespace. Separate with | to run multiple
filters. Filters are executed left to right as
they are specified in this argument
-s,--salt <arg> Allows you to specify a common salt string to be used by
the hash function in the hashcode filter.
-t,--target <arg> Target column to modify, numeric index You can
specify a single index, a comma separated list of
indices, a range such as 1-4 or a combination
-v,--verbose Show verbose log output
-w,--overwrite Overwrite existing file if it exists
This application can be used to both repair and run transformations on table data. However, if you want to use it for it's repair functionality then you can only specify one target column. Currently the application can't automatically determine where repaired cells should solely based on their contents.
- Repair a comma delimited file that might have spurious commas in the cells of column 8:
./repair.sh -v -w --operation repair --target 8 --iformat c --oformat c --input ./data/person-commaseparator.txt --output ./output/person-commaseparator-repaired.filtered.csv
- Repair a tab delimited file where some cells have accidental newlines, the newlines are suspected to occur in column 8:
./repair.sh -v -w --operation repair --target 8 --iformat t --oformat t --input ./data/person-tabseparator.txt --output ./output/person-tabseparator-repaired.filtered.tsv
In order to use this tool you will need to have Java installed. Currently we're compiling against Java 1.8, but later versions should work.
If you want to build the tool yourself then you will need to have Java and Maven installed.
- JDK: https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html
- Maven: https://www.baeldung.com/install-maven-on-windows-linux-mac
-
-if,--iformat
- Description: Input format, use t for tab and c for comma. Default is c (comma). Any other character or string will be used as-is
- Example to parse with tabs: --iformat t
- Example to use pipes as the separator: --iformat |
- Example to use a comma, either: --iformat c or: --iformat ,
- Required: no
-
-of,--oformat
- Description: Output format, use t for tab and c for comma. Default is c (comma). Any other character or string will be used as-is. If not specified the output delimiter will be set to the input delimiter
- Example to parse with tabs: --oformat t
- Example to use pipes as the separator: --oformat |
- Example to use a comma, either: --format c or: --oformat ,
- Required: no
-
-h,--help
- Description: Get help either on the entire application or on a specific command
- Required: no
-
-i,--input
- Description: Use this argument to point to a file on disk. Currently we do not yet support stdin (console) or url loading
- Example: --input ./data/person-tabseparator.txt
- Example: --input /tmp/data/person-pipeseparator.txt
- Required: yes
-
-o,--output
- Description: When present this argument has a file path. If not provided output will go to stdout (console)
- Example: --output ./output/person-tabseparator.filtered.tsv
- Required: no
-
-p,--operation
- Description: List one or more operations to be performed on each cell of the chosen column(s). See the sections in the README on each available filter. Separate with | to run multiple filters. Filters are executed left to right as they are specified in this argument
- Example: --operation tolower|trim|json2xml
- Example: --operation xml2json
- Example: --operation trim|hashcode
- Required: yes
-
-t,--target
- Description: A specification of which columns to apply the requested operation(s) on. You can specify a single column, a range or a combination separated by commas. WARNING: If you're including a repair filter then you can only provide one target column.
- Example (valid ranges):
- --target 1
- --target 1,2,3
- --target 1-9
- --target 1-9,10-19,20-199
- --target 1-8,9,10-18,19,20-199
- Example (invalid ranges):
- --target A
- --target 1,2,
- --target 1 - 9
- Required: yes
-
-v,--verbose
- Description: turn on debug output to the console. Use this if you want to see detailed information as to what the application does behind the scenes
- Required: no
-
-w,--overwrite
- Description: when provided, the application will overwrite any existing file with the same name and path as the output file.
- Required: no
-
drop, Effectively drops a column by replacing the content by a single space character
-
json2xml, Does a syntactic transformation from JSON format to XML
-
xml2json, Does a syntactic transformation from XML format to JSON
-
trim, Remove any redundant whitespace before and after the string value of a cell
-
tolower, Convert all characters in a string to lower case
-
toupper, Convert all characters in a string to upper case
-
hashcode, Run a hash function over the content of a cell using an additional salt if provided. This operation will produce an additional file containing a lookup table that maps hashcodes to the original values
-
removewhitespace, Take out any whitespace from the string value of a cell
-
repair, This filter will try to transform characters and strings that might harm the parsability by applications such as Excel. This filter will go through a number of iterations over the input cell data to ensure all characters are spreadsheet safe. If you include this filter then make sure it's the first in the list.
-
htmlencode, URL encoding converts characters into a format that can be transmitted over the Internet, Reference: https://www.w3schools.com/tags/ref_urlencode.asp
-
fixxml,
-
unescape,
-
removeboundingquotes, Take out any double quotes surrounding a string value, so "Hello World" becomes Hello World
-
removenewline, Remove any newlines from a string value, this includes \n as well as \r\n
- Allow arguments for the filters
- Add specific help for each filter/operation
- Generate a report in a file that provides details of what was done to each row with any exceptions and failures
- Allow operations to add a column instead of working in place
- Support the different formats of newlines: \r\n or \n