Draft: docs:Performance Scripts / Docs #431

Draft · wants to merge 8 commits into base: main
29 changes: 22 additions & 7 deletions meltano.yml
@@ -11,9 +11,11 @@ plugins:
config:
streams:
  - stream_name: animals
-     input_filename: https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/animals-data.jsonl
+     input_filename:
+       https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/animals-data.jsonl
  - stream_name: page_views
-     input_filename: https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/pageviews-data.jsonl
+     input_filename:
+       https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/pageviews-data.jsonl
stream_maps:
animals:
__key_properties__: [id]
@@ -30,13 +32,22 @@ plugins:
- commits.url
- commits.sha
- commits.commit_timestamp
+ - name: tap-csv
+   variant: meltanolabs
+   pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
+   config:
+     files:
+     - entity: data_target_postgres
+       path: $MELTANO_PROJECT_ROOT/performance/data.csv
+       keys: [column_1]
+     add_metadata_columns: false
loaders:
- name: target-postgres
namespace: target_postgres
pip_url: -e .
settings:
- name: sqlalchemy_url
-     kind: password
+     kind: string
sensitive: true
- name: ssl_enable
kind: boolean
@@ -46,16 +57,16 @@
sensitive: true
- name: ssl_mode
- name: ssl_certificate_authority
-     kind: password
+     kind: string
sensitive: true
- name: ssl_client_certificate
-     kind: password
+     kind: string
sensitive: true
- name: ssl_client_private_key
-     kind: password
+     kind: string
sensitive: true
- name: password
-     kind: password
+     kind: string
sensitive: true
- name: host
- name: port
@@ -72,6 +83,10 @@
password: postgres
database: postgres
target_schema: test
+     validate_records: false
add_record_metadata: true
+ - name: target-postgres-copy-branch
+   inherit_from: target-postgres
+   pip_url: git+https://github.com/kinghuang/target-postgres@bulk-insert-copy
environments:
- name: dev
3 changes: 3 additions & 0 deletions podman.sh
@@ -0,0 +1,3 @@
#!/bin/bash
# Username: postgres, password: postgres
podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -d postgres
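
Postgres can take a few seconds to start accepting connections after the container comes up; a small optional check before running the tests (assumes the `pg_isready` client tool is installed on the host):

```bash
# Wait until the containerized Postgres accepts connections.
until pg_isready -h localhost -p 5432 -U postgres; do
  sleep 1
done
```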
2 changes: 2 additions & 0 deletions scripts/performance/.gitignore
@@ -0,0 +1,2 @@
data.csv
data.singer
31 changes: 31 additions & 0 deletions scripts/performance/1m_rows_generate.py
@@ -0,0 +1,31 @@
import csv
import random
import string

num_rows = 1_000_000
num_columns = 10


# Generate random data for CSV
def random_string(length=10):
return "".join(random.choices(string.ascii_letters + string.digits, k=length))


# Generate the CSV file
csv_filename = "data.csv"

with open(csv_filename, mode="w", newline="") as csv_file:
writer = csv.writer(csv_file)

# Write header
header = [f"column_{i+1}" for i in range(num_columns)]
writer.writerow(header)

# Write data rows
for _ in range(num_rows):
row = [random_string() for _ in range(num_columns)]
writer.writerow(row)

print(
f"CSV file '{csv_filename}' with {num_rows} rows and {num_columns} columns has been generated."
)
36 changes: 36 additions & 0 deletions scripts/performance/README.md
@@ -0,0 +1,36 @@
# target-postgres Performance Analysis

The main goal is to lay out an objective way to do performance analysis with target-postgres, and hopefully the groundwork for others who want to do the same analysis with their own targets.

Main points:
1. We need something to compare against. For Postgres we have native import commands that are well optimized; we will use these as a baseline.
1. Relative speed is the metric to focus on. If we focus on absolute speed there are a bunch of hardware considerations that we are not trying to solve here. (We would also need to consider how parallelization fits into the mix if we go there.)

# Why do this work?
1. Without it we are guessing at what might improve performance; this gives us a more objective way to pick what we should focus on.

# How to run
1. `./prep.sh` gets the data together for you in the right place
2. `python speed_compare.py` runs all the tests and gives you the times for each test
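
`speed_compare.py` itself isn't part of this diff; below is a minimal sketch of the timing harness it implies (the script list is taken from the results table below, everything else is an assumption):

```python
import subprocess
import time

# Scripts under test; the native COPY script is the baseline.
TESTS = [
    "./perf_tests/pg_copy_upsert.sh",
    "./perf_tests/target_postgres_copy_branch_no_validate.sh",
    "./perf_tests/target_postgres_current_branch_no_validate.sh",
    "./perf_tests/target_postgres_copy_branch.sh",
    "./perf_tests/target_postgres_current_branch.sh",
]

results = {}
for test in TESTS:
    start = time.perf_counter()
    subprocess.run(test, shell=True, check=True)
    results[test] = time.perf_counter() - start

baseline = results[TESTS[0]]
for test, elapsed in results.items():
    print(f"{test}: {elapsed:.2f}s, {elapsed / baseline:.4f}x the native copy time")
```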

# Results for 1 million records
| **Test Name** | **Total Run Time (s)** | **x Slower Than Native Copy** |
|-------------------------------------------------------------|------------------------|-------------------------------|
| `./perf_tests/pg_copy_upsert.sh` | 13.64 | 1.0000 |
| `./perf_tests/target_postgres_copy_branch_no_validate.sh` | 100.50 | 7.3697 |
| `./perf_tests/target_postgres_current_branch_no_validate.sh`| 141.48 | 10.3749 |
| `./perf_tests/target_postgres_copy_branch.sh` | 265.53 | 19.4719 |
| `./perf_tests/target_postgres_current_branch.sh` | 298.37 | 21.8799 |
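
The last column is each test's total run time divided by the native `COPY` baseline, e.g. 100.50 / 13.64 ≈ 7.37.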

# Other questions / concerns
1. `COPY` is single threaded, and there's no reason we need to stick to a single thread. https://github.com/dimitri/pgloader is much faster; we should try it out as well (see the sketch after this list).
1. `prep.sh`'s tap-csv step runs to give us a `data.singer` file (the tap's JSONL output); this takes an extremely long time to run for one million records.
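
We haven't tried pgloader yet; a hedged sketch of what a run against the generated CSV could look like (table name and credentials match `pg_copy_upsert.sh`, which must have created `large_data` already; treat the exact load-file options as assumptions):

```bash
#!/bin/bash
# Write a pgloader command file and run it (assumes pgloader is installed).
cat > data.load <<'EOF'
LOAD CSV
     FROM 'data.csv'
          HAVING FIELDS (column_1, column_2, column_3, column_4, column_5,
                         column_6, column_7, column_8, column_9, column_10)
     INTO postgresql://postgres:postgres@localhost:5432/postgres
     TARGET TABLE large_data
     WITH skip header = 1,
          fields terminated by ',',
          workers = 4, concurrency = 2;
EOF
pgloader data.load
```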

# Next steps to improve performance
- [ ] Split the current [Bulk Insert Speed PR](https://github.com/MeltanoLabs/target-postgres/pull/370) into a separate sink that can be turned on with a configuration setting (see the sketch after this list)
- [ ] Test the new sink with the same tests as the main sink and add expected failures for the ones we know do not pass
- [ ] Add a note to the main README about performance: the way to get the best performance right now is to turn on COPY mode and turn off record validation
- [ ] Evaluate why we're not closer to native copy speeds. Within 50% of native speeds seems reasonable, but that's just a guess
- [ ] Add [pg_loader](https://github.com/dimitri/pgloader) with multiple threads; no reason we couldn't do something similar in targets
- [ ] Add a CI job that calculates the performance implications of a PR for every run
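
For the first item, the setting doesn't exist yet; a hypothetical example of what opting into the COPY-based sink could look like in `meltano.yml` (`use_copy` is an invented name, not an existing setting):

```yaml
loaders:
- name: target-postgres
  config:
    use_copy: true          # hypothetical flag that would select the bulk-insert COPY sink
    validate_records: false # existing setting; skipping validation is the other big win
```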
67 changes: 67 additions & 0 deletions scripts/performance/meltano.yml
@@ -0,0 +1,67 @@
version: 1
send_anonymous_usage_stats: true
default_environment: dev
project_id: target-postgres
plugins:
extractors:
- name: tap-csv
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
config:
files:
- entity: data_target_postgres
path: $MELTANO_PROJECT_ROOT/data.csv
keys: [column_1]
add_metadata_columns: false
loaders:
- name: target-postgres
namespace: target_postgres
pip_url: -e ../../.
settings:
- name: sqlalchemy_url
kind: string
sensitive: true
- name: ssl_enable
kind: boolean
sensitive: true
- name: ssl_client_certificate_enable
kind: boolean
sensitive: true
- name: ssl_mode
- name: ssl_certificate_authority
kind: string
sensitive: true
- name: ssl_client_certificate
kind: string
sensitive: true
- name: ssl_client_private_key
kind: string
sensitive: true
- name: password
kind: string
sensitive: true
- name: host
- name: port
kind: integer
- name: user
- name: database
- name: target_schema
- name: add_record_metadata
kind: boolean
- name: validate_records
kind: boolean
- name: batch_size_rows
kind: integer
config:
host: localhost
port: 5432
user: postgres
password: postgres
database: postgres
target_schema: test
add_record_metadata: true
- name: target-postgres-copy-branch
inherit_from: target-postgres
pip_url: git+https://github.com/kinghuang/target-postgres@bulk-insert-copy
environments:
- name: dev
53 changes: 53 additions & 0 deletions scripts/performance/perf_tests/pg_copy_upsert.sh
@@ -0,0 +1,53 @@
#!/bin/bash

# Variables
CSV_FILE="data.csv"
DB_NAME="postgres"
DB_USER="postgres"
DB_PASSWORD="postgres"
DB_HOST="localhost"
DB_PORT="5432"

# Export the password to avoid being prompted
export PGPASSWORD=$DB_PASSWORD

# Execute COPY command to import the CSV into PostgreSQL
#psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "\COPY large_data FROM '$CSV_FILE' CSV HEADER;"
# Load into an unlogged staging table, then upsert into the main table
psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME <<EOF

-- Create the staging table
DROP TABLE IF EXISTS large_data_staging;
CREATE UNLOGGED TABLE large_data_staging (
column_1 VARCHAR(255),
column_2 VARCHAR(255),
column_3 VARCHAR(255),
column_4 VARCHAR(255),
column_5 VARCHAR(255),
column_6 VARCHAR(255),
column_7 VARCHAR(255),
column_8 VARCHAR(255),
column_9 VARCHAR(255),
column_10 VARCHAR(255)
);
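
-- The upsert below assumes large_data already exists with a unique
-- constraint on column_1; this CREATE is an added assumption so the
-- script also works against a clean database.
CREATE TABLE IF NOT EXISTS large_data (
    column_1 VARCHAR(255) PRIMARY KEY,
    column_2 VARCHAR(255),
    column_3 VARCHAR(255),
    column_4 VARCHAR(255),
    column_5 VARCHAR(255),
    column_6 VARCHAR(255),
    column_7 VARCHAR(255),
    column_8 VARCHAR(255),
    column_9 VARCHAR(255),
    column_10 VARCHAR(255)
);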

-- Import data into the staging table
\COPY large_data_staging FROM '$CSV_FILE' CSV HEADER;

-- Upsert data into the main table
INSERT INTO large_data AS target
SELECT * FROM large_data_staging
ON CONFLICT (column_1) DO UPDATE SET
column_2 = EXCLUDED.column_2,
column_3 = EXCLUDED.column_3,
column_4 = EXCLUDED.column_4,
column_5 = EXCLUDED.column_5,
column_6 = EXCLUDED.column_6,
column_7 = EXCLUDED.column_7,
column_8 = EXCLUDED.column_8,
column_9 = EXCLUDED.column_9,
column_10 = EXCLUDED.column_10;

EOF

echo "CSV file has been imported into the database with merge handling."
2 changes: 2 additions & 0 deletions scripts/performance/perf_tests/target_postgres_copy_branch.sh
@@ -0,0 +1,2 @@
#!/bin/bash
meltano invoke target-postgres-copy-branch < data.singer
3 changes: 3 additions & 0 deletions scripts/performance/perf_tests/target_postgres_copy_branch_no_validate.sh
@@ -0,0 +1,3 @@
#!/bin/bash
export TARGET_POSTGRES_VALIDATE_RECORDS="false"
meltano invoke target-postgres-copy-branch < data.singer
2 changes: 2 additions & 0 deletions scripts/performance/perf_tests/target_postgres_current_branch.sh
@@ -0,0 +1,2 @@
#!/bin/bash
meltano invoke target-postgres < data.singer
3 changes: 3 additions & 0 deletions scripts/performance/perf_tests/target_postgres_current_branch_no_validate.sh
@@ -0,0 +1,3 @@
#!/bin/bash
export TARGET_POSTGRES_VALIDATE_RECORDS="false"
meltano invoke target-postgres < data.singer
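
(Setting `TARGET_POSTGRES_VALIDATE_RECORDS` works because Meltano maps every plugin setting to an environment variable of the form `<PLUGIN_NAME>_<SETTING_NAME>`, so these scripts override `validate_records` for the invoked target.)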
83 changes: 83 additions & 0 deletions scripts/performance/plugins/extractors/tap-csv--meltanolabs.lock
@@ -0,0 +1,83 @@
{
"plugin_type": "extractors",
"name": "tap-csv",
"namespace": "tap_csv",
"variant": "meltanolabs",
"label": "Comma Separated Values (CSV)",
"docs": "https://hub.meltano.com/extractors/tap-csv--meltanolabs",
"repo": "https://github.com/MeltanoLabs/tap-csv",
"pip_url": "git+https://github.com/MeltanoLabs/tap-csv.git",
"description": "Generic data extractor of CSV (comma separated value) files",
"logo_url": "https://hub.meltano.com/assets/logos/extractors/csv.png",
"capabilities": [
"catalog",
"discover"
],
"settings_group_validation": [
[
"files"
],
[
"csv_files_definition"
]
],
"settings": [
{
"name": "add_metadata_columns",
"kind": "boolean",
"value": false,
"label": "Add Metadata Columns",
"description": "When True, add the metadata columns (`_sdc_source_file`, `_sdc_source_file_mtime`, `_sdc_source_lineno`) to output."
},
{
"name": "csv_files_definition",
"kind": "string",
"label": "Csv Files Definition",
"documentation": "https://github.com/MeltanoLabs/tap-csv#settings",
"description": "Project-relative path to JSON file holding array of objects as described under [Files](#files) - with `entity`, `path`, `keys`, and other optional keys:\n\n```json\n[\n {\n \"entity\": \"<entity>\",\n \"path\": \"<path>\",\n \"keys\": [\"<key>\"],\n },\n // ...\n]\n```\n",
"placeholder": "Ex. files-def.json"
},
{
"name": "faker_config.locale",
"kind": "array",
"label": "Faker Config Locale",
"description": "One or more LCID locale strings to produce localized output for: https://faker.readthedocs.io/en/master/#localization"
},
{
"name": "faker_config.seed",
"kind": "string",
"label": "Faker Config Seed",
"description": "Value to seed the Faker generator for deterministic output: https://faker.readthedocs.io/en/master/#seeding-the-generator"
},
{
"name": "files",
"kind": "array",
"label": "Files",
"description": "Array of objects with `entity`, `path`, `keys`, and `encoding` [Optional] keys:\n\n* `entity`: The entity name, used as the table name for the data loaded from that CSV.\n* `path`: Local path (relative to the project's root) to the file to be ingested. Note that this may be a directory, in which case all files in that directory and any of its subdirectories will be recursively processed\n* `keys`: The names of the columns that constitute the unique keys for that entity.\n* `encoding`: [Optional] The file encoding to use when reading the file (i.e. \"latin1\", \"UTF-8\"). Use this setting when you get a UnicodeDecodeError error.\n Each input CSV file must be a traditionally-delimited CSV (comma separated columns, newlines indicate new rows, double quoted values).\n\nThe following entries are passed through in an internal CSV dialect that then is used to configure the CSV reader:\n\n* `delimiter`: A one-character string used to separate fields. It defaults to ','.\n* `doublequote`: Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.\n* `escapechar`: A one-character string used by the reader, where the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.\n* `quotechar`: A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '\"'.\n* `skipinitialspace`: When True, spaces immediately following the delimiter are ignored. The default is False.\n* `strict`: When True, raise exception Error on bad CSV input. The default is False.\n\nThe first row is the header defining the attribute name for that column and will result to a column of the same name in the database. It must have a valid format with no spaces or special characters (like for example `!` or `@`, etc).\n"
},
{
"name": "flattening_enabled",
"kind": "boolean",
"label": "Flattening Enabled",
"description": "'True' to enable schema flattening and automatically expand nested properties."
},
{
"name": "flattening_max_depth",
"kind": "integer",
"label": "Flattening Max Depth",
"description": "The max depth to flatten schemas."
},
{
"name": "stream_map_config",
"kind": "object",
"label": "Stream Map Config",
"description": "User-defined config values to be used within map expressions."
},
{
"name": "stream_maps",
"kind": "object",
"label": "Stream Maps",
"description": "Config object for stream maps capability. For more information check out [Stream Maps](https://sdk.meltano.com/en/latest/stream_maps.html)."
}
]
}