Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support integer types other than BIGINT #485

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 63 additions & 44 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -237,51 +237,51 @@ develop your own Singer taps and targets.

The below table shows how this tap will map between jsonschema datatypes and Postgres datatypes.

| jsonschema | Postgres |
|--------------------------------|-----------------------------------------|
| integer | bigint |
| UNSUPPORTED | bigserial |
| UNSUPPORTED | bit [ (n) ] |
| UNSUPPORTED | bit varying [ (n) ] |
| boolean | boolean |
| UNSUPPORTED | box |
| jsonschema | Postgres |
| ---------------------------------------------------------------------------------- | --------------------------------------- |
| integer | bigint |
| integer with minimum >= 32768 or maximum < 32768 | smallint |
| integer with minimum >= 2147483648 or maximum < 2147483648 | integer |
| UNSUPPORTED | bigserial |
| UNSUPPORTED | bit [ (n) ] |
| UNSUPPORTED | bit varying [ (n) ] |
| boolean | boolean |
| UNSUPPORTED | box |
| string with contentEncoding="base16" ([opt-in feature](#content-encoding-support)) | bytea |
| UNSUPPORTED | character [ (n) ] |
| UNSUPPORTED | character varying [ (n) ] |
| UNSUPPORTED | cidr |
| UNSUPPORTED | circle |
| string with format="date" | date |
| UNSUPPORTED | double precision |
| UNSUPPORTED | inet |
| UNSUPPORTED | integer |
| UNSUPPORTED | interval [ fields ] [ (p) ] |
| UNSUPPORTED | json |
| array; object | jsonb |
| UNSUPPORTED | line |
| UNSUPPORTED | lseg |
| UNSUPPORTED | macaddr |
| UNSUPPORTED | macaddr8 |
| UNSUPPORTED | money |
| number | numeric [ (p, s) ] |
| UNSUPPORTED | path |
| UNSUPPORTED | pg_lsn |
| UNSUPPORTED | pg_snapshot |
| UNSUPPORTED | point |
| UNSUPPORTED | polygon |
| UNSUPPORTED | real |
| UNSUPPORTED | smallint |
| UNSUPPORTED | smallserial |
| UNSUPPORTED | serial |
| string without format; untyped | text |
| string with format="time" | time [ (p) ] [ without time zone ] |
| UNSUPPORTED | time [ (p) ] with time zone |
| string with format="date-time" | timestamp [ (p) ] [ without time zone ] |
| UNSUPPORTED | timestamp [ (p) ] with time zone |
| UNSUPPORTED | tsquery |
| UNSUPPORTED | tsvector |
| UNSUPPORTED | txid_snapshot |
| string with format="uuid" | uuid |
| UNSUPPORTED | xml |
| UNSUPPORTED | character [ (n) ] |
| UNSUPPORTED | character varying [ (n) ] |
| UNSUPPORTED | cidr |
| UNSUPPORTED | circle |
| string with format="date" | date |
| UNSUPPORTED | double precision |
| UNSUPPORTED | inet |
| UNSUPPORTED | interval [ fields ] [ (p) ] |
| UNSUPPORTED | json |
| array; object | jsonb |
| UNSUPPORTED | line |
| UNSUPPORTED | lseg |
| UNSUPPORTED | macaddr |
| UNSUPPORTED | macaddr8 |
| UNSUPPORTED | money |
| number | numeric [ (p, s) ] |
| UNSUPPORTED | path |
| UNSUPPORTED | pg_lsn |
| UNSUPPORTED | pg_snapshot |
| UNSUPPORTED | point |
| UNSUPPORTED | polygon |
| UNSUPPORTED | real |
| UNSUPPORTED | smallserial |
| UNSUPPORTED | serial |
| string without format; untyped | text |
| string with format="time" | time [ (p) ] [ without time zone ] |
| UNSUPPORTED | time [ (p) ] with time zone |
| string with format="date-time" | timestamp [ (p) ] [ without time zone ] |
| UNSUPPORTED | timestamp [ (p) ] with time zone |
| UNSUPPORTED | tsquery |
| UNSUPPORTED | tsvector |
| UNSUPPORTED | txid_snapshot |
| string with format="uuid" | uuid |
| UNSUPPORTED | xml |

Note that while object types are mapped directly to jsonb, array types are mapped to a jsonb array.

Expand All @@ -297,9 +297,28 @@ If a column has multiple jsonschema types, the following order is using to order
- DECIMAL
- BIGINT
- INTEGER
- SMALLINT
- BOOLEAN
- NOTYPE

### Using the Singer catalog to narrow down the Postgres data types

You can use [Singer catalog's schema](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#schemas) to override the data types coming from the tap. The easiest way to do this is to use Meltano and its [`schema` setting](https://docs.meltano.com/concepts/plugins/#schema-extra) for the tap:

```yaml
# meltano.yml
plugins:
extractors:
- name: tap-my-tap
schema:
some_stream_id:
my_column:
type: integer
# This will be mapped to 'smallint'
minimum: 0
maximum: 1000
```

## Content Encoding Support

Json Schema supports the [`contentEncoding` keyword](https://datatracker.ietf.org/doc/html/rfc4648#section-8), which can be used to specify the encoding of input string types.
Expand Down
46 changes: 24 additions & 22 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

24 changes: 22 additions & 2 deletions target_postgres/connector.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
import atexit
import io
import itertools
import math
import signal
import sys
import typing as t
Expand All @@ -18,7 +19,14 @@
import sqlalchemy as sa
from singer_sdk import SQLConnector
from singer_sdk.connectors.sql import JSONSchemaToSQL
from sqlalchemy.dialects.postgresql import ARRAY, BIGINT, BYTEA, JSONB, UUID
from sqlalchemy.dialects.postgresql import (
ARRAY,
BIGINT,
BYTEA,
JSONB,
SMALLINT,
UUID,
)
from sqlalchemy.engine import URL
from sqlalchemy.engine.url import make_url
from sqlalchemy.types import (
Expand Down Expand Up @@ -255,12 +263,23 @@ def _handle_array_type(self, jsonschema: dict) -> ARRAY | JSONB:
# Case 3: tuples
return ARRAY(JSONB()) if isinstance(items, list) else JSONB()

def _handle_integer_type(self, jsonschema: dict) -> SMALLINT | INTEGER | BIGINT:
"""Handle integer type."""
minimum = jsonschema.get("minimum", -math.inf)
maximum = jsonschema.get("maximum", math.inf)
if minimum >= -(2**15) and maximum < 2**15:
return SMALLINT()
if minimum >= -(2**31) and maximum < 2**31:
return INTEGER()

return BIGINT()

@cached_property
def jsonschema_to_sql(self) -> JSONSchemaToSQL:
"""Return a JSONSchemaToSQL instance with custom type handling."""
to_sql = JSONSchemaToPostgres(content_encoding=self.interpret_content_encoding)
to_sql.fallback_type = TEXT
to_sql.register_type_handler("integer", BIGINT)
to_sql.register_type_handler("integer", self._handle_integer_type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure the answer is yes here, but would this also be able to parse the schema that is generated by postgres-tap:

{"type":["integer","null"]}

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. It's handled upstream if you're curious: https://github.com/meltano/sdk/blob/e7783fd4ea6823daf9068124985118268da3a354/singer_sdk/connectors/sql.py#L354-L356.

I've added a test just to be safe: 79212a8.

to_sql.register_type_handler("object", JSONB)
to_sql.register_type_handler("array", self._handle_array_type)
to_sql.register_format_handler("date-time", TIMESTAMP)
Expand Down Expand Up @@ -365,6 +384,7 @@ def pick_best_sql_type(sql_type_array: list):
DECIMAL,
BIGINT,
INTEGER,
SMALLINT,
BOOLEAN,
NOTYPE,
]
Expand Down
Loading