Draft: docs:Performance Scripts / Docs #431

Draft · wants to merge 8 commits into base: main
29 changes: 22 additions & 7 deletions meltano.yml
@@ -11,9 +11,11 @@ plugins:
config:
streams:
  - stream_name: animals
-     input_filename: https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/animals-data.jsonl
+     input_filename:
+       https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/animals-data.jsonl
  - stream_name: page_views
-     input_filename: https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/pageviews-data.jsonl
+     input_filename:
+       https://gitlab.com/meltano/tap-smoke-test/-/raw/main/demo-data/pageviews-data.jsonl
stream_maps:
animals:
__key_properties__: [id]
@@ -30,13 +32,22 @@ plugins:
- commits.url
- commits.sha
- commits.commit_timestamp
+ - name: tap-csv
+   variant: meltanolabs
+   pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
+   config:
+     files:
+     - entity: data_target_postgres
+       path: $MELTANO_PROJECT_ROOT/performance/data.csv
+       keys: [column_1]
+     add_metadata_columns: false
loaders:
- name: target-postgres
namespace: target_postgres
pip_url: -e .
settings:
- name: sqlalchemy_url
-     kind: password
+     kind: string
sensitive: true
- name: ssl_enable
kind: boolean
@@ -46,16 +57,16 @@
sensitive: true
- name: ssl_mode
- name: ssl_certificate_authority
-     kind: password
+     kind: string
sensitive: true
- name: ssl_client_certificate
-     kind: password
+     kind: string
sensitive: true
- name: ssl_client_private_key
-     kind: password
+     kind: string
sensitive: true
- name: password
-     kind: password
+     kind: string
sensitive: true
- name: host
- name: port
@@ -72,6 +83,10 @@
password: postgres
database: postgres
target_schema: test
+     validate_records: false
add_record_metadata: true
+ - name: target-postgres-copy-branch
+   inherit_from: target-postgres
+   pip_url: git+https://github.com/kinghuang/target-postgres@bulk-insert-copy
environments:
- name: dev
3 changes: 3 additions & 0 deletions podman.sh
@@ -0,0 +1,3 @@
#!/bin/bash
# Username: postgres, password: postgres
podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -d postgres
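
Postgres can take a few seconds to start accepting connections after the container comes up; a small optional check before running the tests (assumes the `pg_isready` client tool is installed on the host):

```bash
# Wait until the containerized Postgres accepts connections.
until pg_isready -h localhost -p 5432 -U postgres; do
  sleep 1
done
```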
2 changes: 2 additions & 0 deletions scripts/performance/.gitignore
@@ -0,0 +1,2 @@
data.csv
data.singer
31 changes: 31 additions & 0 deletions scripts/performance/1m_rows_generate.py
@@ -0,0 +1,31 @@
import csv
import random
import string

num_rows = 1_000_000
num_columns = 10


# Generate random data for CSV
def random_string(length=10):
return "".join(random.choices(string.ascii_letters + string.digits, k=length))


# Generate the CSV file
csv_filename = "data.csv"

with open(csv_filename, mode="w", newline="") as csv_file:
writer = csv.writer(csv_file)

# Write header
header = [f"column_{i+1}" for i in range(num_columns)]
writer.writerow(header)

# Write data rows
for _ in range(num_rows):
row = [random_string() for _ in range(num_columns)]
writer.writerow(row)

print(
f"CSV file '{csv_filename}' with {num_rows} rows and {num_columns} columns has been generated."
)
36 changes: 36 additions & 0 deletions scripts/performance/README.md
@@ -0,0 +1,36 @@
# target-postgres Performance Analysis

The main goal is to lay out an objective way to do performance analysis with target-postgres, and hopefully the groundwork for others who want to do the same analysis with their own targets.

Main points:
1. We need something to compare against. For Postgres we have native import commands that are well optimized; we will use these as a baseline.
1. Relative speed is the metric to focus on. If we focus on absolute speed there are a bunch of hardware considerations that we are not trying to solve here. (We would also need to consider how parallelization fits into the mix if we go there.)

# Why do this work?
1. Without it we are guessing at what might improve performance; this gives us a more objective way to pick what we should focus on.

# How to run
1. `./prep.sh` gets the data together for you in the right place
2. `python speed_compare.py` runs all the tests and gives you the times for each test
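
`speed_compare.py` itself isn't part of this diff; below is a minimal sketch of the timing harness it implies (the script list is taken from the results table below, everything else is an assumption):

```python
import subprocess
import time

# Scripts under test; the native COPY script is the baseline.
TESTS = [
    "./perf_tests/pg_copy_upsert.sh",
    "./perf_tests/target_postgres_copy_branch_no_validate.sh",
    "./perf_tests/target_postgres_current_branch_no_validate.sh",
    "./perf_tests/target_postgres_copy_branch.sh",
    "./perf_tests/target_postgres_current_branch.sh",
]

results = {}
for test in TESTS:
    start = time.perf_counter()
    subprocess.run(test, shell=True, check=True)
    results[test] = time.perf_counter() - start

baseline = results[TESTS[0]]
for test, elapsed in results.items():
    print(f"{test}: {elapsed:.2f}s, {elapsed / baseline:.4f}x the native copy time")
```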

# Results for 1 million records
| **Test Name** | **Total Run Time (s)** | **x Slower Than Native Copy** |
|-------------------------------------------------------------|------------------------|-------------------------------|
| `./perf_tests/pg_copy_upsert.sh` | 13.64 | 1.0000 |
| `./perf_tests/target_postgres_copy_branch_no_validate.sh` | 100.50 | 7.3697 |
| `./perf_tests/target_postgres_current_branch_no_validate.sh`| 141.48 | 10.3749 |
| `./perf_tests/target_postgres_copy_branch.sh` | 265.53 | 19.4719 |
| `./perf_tests/target_postgres_current_branch.sh` | 298.37 | 21.8799 |
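
The last column is each test's total run time divided by the native `COPY` baseline, e.g. 100.50 / 13.64 ≈ 7.37.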

# Other questions / concerns
1. `COPY` is single threaded, and there's no reason we need to stick to a single thread. https://github.com/dimitri/pgloader is much faster; we should try it out as well (see the sketch after this list).
1. `prep.sh`'s tap-csv step runs to give us a `data.singer` file (the tap's JSONL output); this takes an extremely long time to run for one million records.
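
We haven't tried pgloader yet; a hedged sketch of what a run against the generated CSV could look like (table name and credentials match `pg_copy_upsert.sh`, which must have created `large_data` already; treat the exact load-file options as assumptions):

```bash
#!/bin/bash
# Write a pgloader command file and run it (assumes pgloader is installed).
cat > data.load <<'EOF'
LOAD CSV
     FROM 'data.csv'
          HAVING FIELDS (column_1, column_2, column_3, column_4, column_5,
                         column_6, column_7, column_8, column_9, column_10)
     INTO postgresql://postgres:postgres@localhost:5432/postgres
     TARGET TABLE large_data
     WITH skip header = 1,
          fields terminated by ',',
          workers = 4, concurrency = 2;
EOF
pgloader data.load
```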

# Next steps to improve performance
- [ ] Split the current [Bulk Insert Speed PR](https://github.com/MeltanoLabs/target-postgres/pull/370) into a separate sink that can be turned on with a configuration setting (see the sketch after this list)
- [ ] Test the new sink with the same tests as the main sink and add expected failures for the ones we know do not pass
- [ ] Add a note to the main README about performance: the way to get the best performance right now is to turn on COPY mode and turn off record validation
- [ ] Evaluate why we're not closer to native copy speeds. Within 50% of native speeds seems reasonable, but that's just a guess
- [ ] Add [pg_loader](https://github.com/dimitri/pgloader) with multiple threads; no reason we couldn't do something similar in targets
- [ ] Add a CI job that calculates the performance implications of a PR for every run
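
For the first item, the setting doesn't exist yet; a hypothetical example of what opting into the COPY-based sink could look like in `meltano.yml` (`use_copy` is an invented name, not an existing setting):

```yaml
loaders:
- name: target-postgres
  config:
    use_copy: true          # hypothetical flag that would select the bulk-insert COPY sink
    validate_records: false # existing setting; skipping validation is the other big win
```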
67 changes: 67 additions & 0 deletions scripts/performance/meltano.yml
@@ -0,0 +1,67 @@
version: 1
send_anonymous_usage_stats: true
default_environment: dev
project_id: target-postgres
plugins:
extractors:
- name: tap-csv
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
config:
files:
- entity: data_target_postgres
path: $MELTANO_PROJECT_ROOT/data.csv
keys: [column_1]
add_metadata_columns: false
loaders:
- name: target-postgres
namespace: target_postgres
pip_url: -e ../../.
settings:
- name: sqlalchemy_url
kind: string
sensitive: true
- name: ssl_enable
kind: boolean
sensitive: true
- name: ssl_client_certificate_enable
kind: boolean
sensitive: true
- name: ssl_mode
- name: ssl_certificate_authority
kind: string
sensitive: true
- name: ssl_client_certificate
kind: string
sensitive: true
- name: ssl_client_private_key
kind: string
sensitive: true
- name: password
kind: string
sensitive: true
- name: host
- name: port
kind: integer
- name: user
- name: database
- name: target_schema
- name: add_record_metadata
kind: boolean
- name: validate_records
kind: boolean
- name: batch_size_rows
kind: integer
config:
host: localhost
port: 5432
user: postgres
password: postgres
database: postgres
target_schema: test
add_record_metadata: true
- name: target-postgres-copy-branch
inherit_from: target-postgres
pip_url: git+https://github.com/kinghuang/target-postgres@bulk-insert-copy
environments:
- name: dev
53 changes: 53 additions & 0 deletions scripts/performance/perf_tests/pg_copy_upsert.sh
@@ -0,0 +1,53 @@
#!/bin/bash

# Variables
CSV_FILE="data.csv"
DB_NAME="postgres"
DB_USER="postgres"
DB_PASSWORD="postgres"
DB_HOST="localhost"
DB_PORT="5432"

# Export the password to avoid being prompted
export PGPASSWORD=$DB_PASSWORD

# Execute COPY command to import the CSV into PostgreSQL
#psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "\COPY large_data FROM '$CSV_FILE' CSV HEADER;"
# Load into an unlogged staging table, then upsert into the main table
psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME <<EOF

-- Create the staging table
DROP TABLE IF EXISTS large_data_staging;
CREATE UNLOGGED TABLE large_data_staging (
column_1 VARCHAR(255),
column_2 VARCHAR(255),
column_3 VARCHAR(255),
column_4 VARCHAR(255),
column_5 VARCHAR(255),
column_6 VARCHAR(255),
column_7 VARCHAR(255),
column_8 VARCHAR(255),
column_9 VARCHAR(255),
column_10 VARCHAR(255)
);
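
-- The upsert below assumes large_data already exists with a unique
-- constraint on column_1; this CREATE is an added assumption so the
-- script also works against a clean database.
CREATE TABLE IF NOT EXISTS large_data (
    column_1 VARCHAR(255) PRIMARY KEY,
    column_2 VARCHAR(255),
    column_3 VARCHAR(255),
    column_4 VARCHAR(255),
    column_5 VARCHAR(255),
    column_6 VARCHAR(255),
    column_7 VARCHAR(255),
    column_8 VARCHAR(255),
    column_9 VARCHAR(255),
    column_10 VARCHAR(255)
);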

-- Import data into the staging table
\COPY large_data_staging FROM '$CSV_FILE' CSV HEADER;

-- Upsert data into the main table
INSERT INTO large_data AS target
SELECT * FROM large_data_staging
ON CONFLICT (column_1) DO UPDATE SET
column_2 = EXCLUDED.column_2,
column_3 = EXCLUDED.column_3,
column_4 = EXCLUDED.column_4,
column_5 = EXCLUDED.column_5,
column_6 = EXCLUDED.column_6,
column_7 = EXCLUDED.column_7,
column_8 = EXCLUDED.column_8,
column_9 = EXCLUDED.column_9,
column_10 = EXCLUDED.column_10;

EOF

echo "CSV file has been imported into the database with merge handling."
2 changes: 2 additions & 0 deletions scripts/performance/perf_tests/target_postgres_copy_branch.sh
@@ -0,0 +1,2 @@
#!/bin/bash
meltano invoke target-postgres-copy-branch < data.singer
3 changes: 3 additions & 0 deletions scripts/performance/perf_tests/target_postgres_copy_branch_no_validate.sh
@@ -0,0 +1,3 @@
#!/bin/bash
export TARGET_POSTGRES_VALIDATE_RECORDS="false"
meltano invoke target-postgres-copy-branch < data.singer
2 changes: 2 additions & 0 deletions scripts/performance/perf_tests/target_postgres_current_branch.sh
@@ -0,0 +1,2 @@
#!/bin/bash
meltano invoke target-postgres < data.singer
3 changes: 3 additions & 0 deletions scripts/performance/perf_tests/target_postgres_current_branch_no_validate.sh
@@ -0,0 +1,3 @@
#!/bin/bash
export TARGET_POSTGRES_VALIDATE_RECORDS="false"
meltano invoke target-postgres < data.singer
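
(Setting `TARGET_POSTGRES_VALIDATE_RECORDS` works because Meltano maps every plugin setting to an environment variable of the form `<PLUGIN_NAME>_<SETTING_NAME>`, so these scripts override `validate_records` for the invoked target.)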
83 changes: 83 additions & 0 deletions scripts/performance/plugins/extractors/tap-csv--meltanolabs.lock
@@ -0,0 +1,83 @@
{
"plugin_type": "extractors",
"name": "tap-csv",
"namespace": "tap_csv",
"variant": "meltanolabs",
"label": "Comma Separated Values (CSV)",
"docs": "https://hub.meltano.com/extractors/tap-csv--meltanolabs",
"repo": "https://github.com/MeltanoLabs/tap-csv",
"pip_url": "git+https://github.com/MeltanoLabs/tap-csv.git",
"description": "Generic data extractor of CSV (comma separated value) files",
"logo_url": "https://hub.meltano.com/assets/logos/extractors/csv.png",
"capabilities": [
"catalog",
"discover"
],
"settings_group_validation": [
[
"files"
],
[
"csv_files_definition"
]
],
"settings": [
{
"name": "add_metadata_columns",
"kind": "boolean",
"value": false,
"label": "Add Metadata Columns",
"description": "When True, add the metadata columns (`_sdc_source_file`, `_sdc_source_file_mtime`, `_sdc_source_lineno`) to output."
},
{
"name": "csv_files_definition",
"kind": "string",
"label": "Csv Files Definition",
"documentation": "https://github.com/MeltanoLabs/tap-csv#settings",
"description": "Project-relative path to JSON file holding array of objects as described under [Files](#files) - with `entity`, `path`, `keys`, and other optional keys:\n\n```json\n[\n {\n \"entity\": \"<entity>\",\n \"path\": \"<path>\",\n \"keys\": [\"<key>\"],\n },\n // ...\n]\n```\n",
"placeholder": "Ex. files-def.json"
},
{
"name": "faker_config.locale",
"kind": "array",
"label": "Faker Config Locale",
"description": "One or more LCID locale strings to produce localized output for: https://faker.readthedocs.io/en/master/#localization"
},
{
"name": "faker_config.seed",
"kind": "string",
"label": "Faker Config Seed",
"description": "Value to seed the Faker generator for deterministic output: https://faker.readthedocs.io/en/master/#seeding-the-generator"
},
{
"name": "files",
"kind": "array",
"label": "Files",
"description": "Array of objects with `entity`, `path`, `keys`, and `encoding` [Optional] keys:\n\n* `entity`: The entity name, used as the table name for the data loaded from that CSV.\n* `path`: Local path (relative to the project's root) to the file to be ingested. Note that this may be a directory, in which case all files in that directory and any of its subdirectories will be recursively processed\n* `keys`: The names of the columns that constitute the unique keys for that entity.\n* `encoding`: [Optional] The file encoding to use when reading the file (i.e. \"latin1\", \"UTF-8\"). Use this setting when you get a UnicodeDecodeError error.\n Each input CSV file must be a traditionally-delimited CSV (comma separated columns, newlines indicate new rows, double quoted values).\n\nThe following entries are passed through in an internal CSV dialect that then is used to configure the CSV reader:\n\n* `delimiter`: A one-character string used to separate fields. It defaults to ','.\n* `doublequote`: Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.\n* `escapechar`: A one-character string used by the reader, where the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.\n* `quotechar`: A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '\"'.\n* `skipinitialspace`: When True, spaces immediately following the delimiter are ignored. The default is False.\n* `strict`: When True, raise exception Error on bad CSV input. The default is False.\n\nThe first row is the header defining the attribute name for that column and will result to a column of the same name in the database. It must have a valid format with no spaces or special characters (like for example `!` or `@`, etc).\n"
},
{
"name": "flattening_enabled",
"kind": "boolean",
"label": "Flattening Enabled",
"description": "'True' to enable schema flattening and automatically expand nested properties."
},
{
"name": "flattening_max_depth",
"kind": "integer",
"label": "Flattening Max Depth",
"description": "The max depth to flatten schemas."
},
{
"name": "stream_map_config",
"kind": "object",
"label": "Stream Map Config",
"description": "User-defined config values to be used within map expressions."
},
{
"name": "stream_maps",
"kind": "object",
"label": "Stream Maps",
"description": "Config object for stream maps capability. For more information check out [Stream Maps](https://sdk.meltano.com/en/latest/stream_maps.html)."
}
]
}