feat: Add support for pgvector's vector data type
amotl committed Dec 13, 2023
1 parent af77af4 commit bb99a40
Showing 10 changed files with 141 additions and 15 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci_workflow.yml
@@ -36,7 +36,7 @@ jobs:
         pipx install poetry
     - name: Install dependencies
       run: |
-        poetry install
+        poetry install --all-extras
     - name: Run pytest
       run: |
         poetry run pytest --capture=no
25 changes: 23 additions & 2 deletions README.md
@@ -102,7 +102,7 @@ tap-carbon-intensity | target-postgres --config /path/to/target-postgres-config.

 ```bash
 pipx install poetry
-poetry install
+poetry install --all-extras
 pipx install pre-commit
 pre-commit install
 ```
@@ -152,6 +152,8 @@ develop your own Singer taps and targets.

 ## Data Types

+### Mapping
+
 The below table shows how this tap will map between jsonschema datatypes and Postgres datatypes.

 | jsonschema | Postgres |
@@ -202,7 +204,20 @@ The below table shows how this tap will map between jsonschema datatypes and Pos

 Note that while object types are mapped directly to jsonb, array types are mapped to a jsonb array.

-If a column has multiple jsonschema types, the following order is using to order Postgres types, from highest priority to lowest priority.
+When using [pgvector], the following type mapping applies in addition to the table above.
+
+| jsonschema                                     | Postgres |
+|------------------------------------------------|----------|
+| array (with additional SCHEMA annotations [1]) | vector   |
+
+[1] `"storage": {"type": "vector", "dim": 4}`
+
+### Resolution Order
+
+If a column has multiple jsonschema types, there is a priority list for
+resolving the best type candidate, from the highest priority to the
+lowest priority.
+
 - ARRAY(JSONB)
 - JSONB
 - TEXT
@@ -215,3 +230,9 @@ If a column has multiple jsonschema types, the following order is using to order
 - INTEGER
 - BOOLEAN
 - NOTYPE
+
+When using [pgvector], the `pgvector.sqlalchemy.Vector` type is added to the bottom
+of the list.
+
+
+[pgvector]: https://github.com/pgvector/pgvector
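
Note on the README change: the resolution order described above can be read as a first-match scan over a priority list. The following is a minimal, dependency-free sketch of that idea, not the target's implementation. Type names are plain strings instead of SQLAlchemy types, only the list entries visible in this diff are included (the README list has further entries in between), and `pick_best`, `PRECEDENCE`, and `vector_available` are hypothetical names.

```python
# Illustrative only: plain strings stand in for SQLAlchemy types.
# Only the entries visible in the diff are listed; the README has more.
PRECEDENCE = ["ARRAY(JSONB)", "JSONB", "TEXT", "INTEGER", "BOOLEAN", "NOTYPE"]


def pick_best(candidates, precedence=PRECEDENCE, vector_available=False):
    """Return the first entry of the priority list found among the candidates."""
    order = list(precedence)
    if vector_available:
        # With pgvector installed, the vector type is appended at the bottom.
        order.append("VECTOR")
    for name in order:
        if name in candidates:
            return name
    return None


print(pick_best(["INTEGER", "TEXT"]))
print(pick_best(["VECTOR", "NOTYPE"], vector_available=True))
```

Because the vector type sits at the bottom of the list, it only wins when no other candidate type is present for the column.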
12 changes: 9 additions & 3 deletions docker-compose.yml
@@ -3,7 +3,7 @@
 version: "2.1"
 services:
   postgres:
-    image: docker.io/postgres:latest
+    image: ankane/pgvector:latest
     command: postgres -c ssl=on -c ssl_cert_file=/var/lib/postgresql/server.crt -c ssl_key_file=/var/lib/postgresql/server.key -c ssl_ca_file=/var/lib/postgresql/ca.crt -c hba_file=/var/lib/postgresql/pg_hba.conf
     environment:
       POSTGRES_USER: postgres
@@ -13,16 +13,19 @@ services:
       POSTGRES_INITDB_ARGS: --auth-host=cert
     # Not placed in the data directory (/var/lib/postgresql/data) because of https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b?permalink_comment_id=2678568#gistcomment-2678568
     volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
       - ./ssl/server.crt:/var/lib/postgresql/server.crt # Certificate verifying the server's identity to the client.
       - ./ssl/server.key:/var/lib/postgresql/server.key # Private key to verify the server's certificate is legitimate.
       - ./ssl/ca.crt:/var/lib/postgresql/ca.crt # Certificate authority to use when verifying the client's identity to the server.
       - ./ssl/pg_hba.conf:/var/lib/postgresql/pg_hba.conf # Configuration file to allow connection over SSL.
     ports:
       - "5432:5432"
   postgres_no_ssl: # Borrowed from https://github.com/MeltanoLabs/tap-postgres/blob/main/.github/workflows/test.yml#L13-L23
-    image: docker.io/postgres:latest
+    image: ankane/pgvector:latest
     environment:
       POSTGRES_PASSWORD: postgres
+    volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
     ports:
       - 5433:5432
   ssh:
@@ -37,17 +40,20 @@ services:
       - PASSWORD_ACCESS=false
       - USER_NAME=melty
     volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
       - ./ssh_tunnel/ssh-server-config:/config/ssh_host_keys:ro
     ports:
       - "127.0.0.1:2223:2222"
     networks:
       - inner
   postgresdb:
-    image: postgres:13.0
+    image: ankane/pgvector:latest
     environment:
       POSTGRES_USER: postgres
       POSTGRES_PASSWORD: postgres
       POSTGRES_DB: main
+    volumes:
+      - ./target_postgres/tests/init.sql:/docker-entrypoint-initdb.d/init.sql
     networks:
       inner:
         ipv4_address: 10.5.0.5
60 changes: 55 additions & 5 deletions poetry.lock

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions pyproject.toml
@@ -33,6 +33,7 @@ packages = [
 python = "<3.12,>=3.8.1"
 requests = "^2.25.1"
 singer-sdk = ">=0.28,<0.34"
+pgvector = { version="^0.2.4", optional = true }
 psycopg2-binary = "2.9.9"
 sqlalchemy = ">=2.0,<3.0"
 sshtunnel = "0.4.0"
@@ -50,6 +51,9 @@ types-simplejson = "^3.19.0.2"
 types-sqlalchemy = "^1.4.53.38"
 types-jsonschema = "^4.19.0.3"

+[tool.poetry.extras]
+pgvector = ["pgvector"]
+
 [tool.mypy]
 exclude = "tests"

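
Note on the optional dependency: since `pgvector` is declared as an optional extra, code that relies on it may want to check for its presence first. The snippet below is a small sketch of one way to do that with the standard library; `is_installed` is a hypothetical helper and is not part of this commit.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pgvector"):
    print("pgvector is available")
else:
    print("pgvector is not installed; vector support is unavailable")
```

The commit itself takes a different route in `pick_best_sql_type` and wraps the import in `try`/`except ImportError`, which has the same effect at the point of use.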
20 changes: 20 additions & 0 deletions target_postgres/connector.py
@@ -277,6 +277,19 @@ def pick_individual_type(jsonschema_type: dict):
         if "object" in jsonschema_type["type"]:
             return JSONB()
         if "array" in jsonschema_type["type"]:
+            # FIXME: This currently uses a non-conformant
+            # definition for the Singer SCHEMA. Example:
+            # {"type": "array",
+            # "items": {"type": "number"},
+            # "storage": {"type": "vector", "dim": 4}}
+            if (
+                "storage" in jsonschema_type
+                and "type" in jsonschema_type["storage"]
+                and jsonschema_type["storage"]["type"] == "vector"
+            ):
+                from pgvector.sqlalchemy import Vector
+
+                return Vector(jsonschema_type["storage"]["dim"])
             return ARRAY(JSONB())
         if jsonschema_type.get("format") == "date-time":
             return TIMESTAMP()
@@ -310,6 +323,13 @@ def pick_best_sql_type(sql_type_array: list):
             NOTYPE,
         ]

+        try:
+            from pgvector.sqlalchemy import Vector
+
+            precedence_order.append(Vector)
+        except ImportError:
+            pass
+
         for sql_type in precedence_order:
             for obj in sql_type_array:
                 if isinstance(obj, sql_type):
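
Note on the annotation check: the condition added to `pick_individual_type` can be exercised on its own, without SQLAlchemy or pgvector. The sketch below isolates it in a hypothetical helper, `vector_dimension`, which is not part of this commit. It deliberately differs from the committed code in one respect: it uses `dict.get`, so a missing `dim` yields `None` rather than a `KeyError`.

```python
from typing import Optional


def vector_dimension(jsonschema_type: dict) -> Optional[int]:
    """Return the declared vector dimension, or None for a plain array."""
    if "array" not in jsonschema_type.get("type", []):
        return None
    storage = jsonschema_type.get("storage")
    # The "storage" key is the non-standard annotation used by this commit.
    if isinstance(storage, dict) and storage.get("type") == "vector":
        return storage.get("dim")
    return None


annotated = {
    "type": "array",
    "items": {"type": "number"},
    "storage": {"type": "vector", "dim": 4},
}
plain = {"type": "array", "items": {"type": "number"}}

print(vector_dimension(annotated))
print(vector_dimension(plain))
```

An array without the annotation falls through, which corresponds to the `ARRAY(JSONB())` branch in the diff.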
5 changes: 5 additions & 0 deletions target_postgres/tests/data_files/array_float_vector.singer
@@ -0,0 +1,5 @@
+{"type": "SCHEMA", "stream": "array_float_vector", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}, "value": {"type": "array", "items": {"type": "number"}, "storage": {"type": "vector", "dim": 4}}}}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 1, "value": [ 1.1, 2.1, 1.1, 1.3 ]}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 2, "value": [ 1.0, 1.0, 1.0, 2.3 ]}}
+{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 3, "value": [ 2.0, 1.2, 1.0, 0.9 ]}}
+{"type": "STATE", "value": {"array_float_vector": 3}}
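
Note on the test data: each RECORD carries exactly as many values as the `dim` declared in the SCHEMA message. The sketch below checks that property for the messages in this file using only the standard library. `check_vector_lengths` is a hypothetical helper, and it assumes a single stream, which is a simplification.

```python
import json

MESSAGES = """\
{"type": "SCHEMA", "stream": "array_float_vector", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}, "value": {"type": "array", "items": {"type": "number"}, "storage": {"type": "vector", "dim": 4}}}}}
{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 1, "value": [ 1.1, 2.1, 1.1, 1.3 ]}}
{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 2, "value": [ 1.0, 1.0, 1.0, 2.3 ]}}
{"type": "RECORD", "stream": "array_float_vector", "record": {"id": 3, "value": [ 2.0, 1.2, 1.0, 0.9 ]}}
{"type": "STATE", "value": {"array_float_vector": 3}}
"""


def check_vector_lengths(text: str) -> int:
    """Count RECORD messages whose vector fields match the declared dimension."""
    dims = {}  # field name -> declared dimension (single stream assumed)
    checked = 0
    for line in text.splitlines():
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            for name, prop in message["schema"]["properties"].items():
                storage = prop.get("storage", {})
                if storage.get("type") == "vector":
                    dims[name] = storage["dim"]
        elif message["type"] == "RECORD":
            for name, dim in dims.items():
                value = message["record"][name]
                if len(value) != dim:
                    raise ValueError(f"{name}: expected {dim} values, got {len(value)}")
            checked += 1
    return checked


print(check_vector_lengths(MESSAGES))
```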
1 change: 1 addition & 0 deletions target_postgres/tests/init.sql
@@ -0,0 +1 @@
+CREATE EXTENSION IF NOT EXISTS vector;
19 changes: 19 additions & 0 deletions target_postgres/tests/test_target_postgres.py
@@ -458,6 +458,25 @@ def test_array_boolean(postgres_target):
     )


+def test_array_float_vector(postgres_target):
+    pgvector_sa = pytest.importorskip("pgvector.sqlalchemy")
+    file_name = "array_float_vector.singer"
+    singer_file_to_target(file_name, postgres_target)
+    row = {
+        "id": 1,
+        "value": "[1.1,2.1,1.1,1.3]",
+    }
+    verify_data(postgres_target, "array_float_vector", 3, "id", row)
+    verify_schema(
+        postgres_target,
+        "array_float_vector",
+        check_columns={
+            "id": {"type": BIGINT},
+            "value": {"type": pgvector_sa.Vector},
+        },
+    )
+
+
 def test_array_number(postgres_target):
     file_name = "array_number.singer"
     singer_file_to_target(file_name, postgres_target)
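
Note on the expected row: the test compares `value` against the string `"[1.1,2.1,1.1,1.3]"`, which suggests the column is read back in a bracketed, comma-separated text form. The sketch below converts between a Python list and that form. Both helpers are hypothetical and are not how pgvector itself formats values; in particular, Python renders `1.0` as `1.0`, while the database may render it differently, so only the first record is used here.

```python
def to_vector_literal(values):
    """Format numbers as '[v1,v2,...]' with no spaces (illustrative helper)."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def from_vector_literal(text):
    """Parse '[v1,v2,...]' back into a list of floats (illustrative helper)."""
    return [float(part) for part in text.strip("[]").split(",")]


literal = to_vector_literal([1.1, 2.1, 1.1, 1.3])
print(literal)
print(from_vector_literal(literal))
```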
8 changes: 4 additions & 4 deletions tox.ini
@@ -9,7 +9,7 @@ isolated_build = true
 allowlist_externals = poetry

 commands =
-    poetry install -v
+    poetry install --all-extras -v
     poetry run pytest
     poetry run black --check target_postgres/
     poetry run flake8 target_postgres
@@ -21,22 +21,22 @@ commands =
 # To execute, run `tox -e pytest`
 envlist = py37, py38, py39
 commands =
-    poetry install -v
+    poetry install --all-extras -v
     poetry run pytest

 [testenv:format]
 # Attempt to auto-resolve lint errors before they are raised.
 # To execute, run `tox -e format`
 commands =
-    poetry install -v
+    poetry install --all-extras -v
     poetry run black target_postgres/
     poetry run isort target_postgres

 [testenv:lint]
 # Raise an error if lint and style standards are not met.
 # To execute, run `tox -e lint`
 commands =
-    poetry install -v
+    poetry install --all-extras -v
     poetry run black --check --diff target_postgres/
     poetry run isort --check target_postgres
     poetry run flake8 target_postgres
