add SimpleOffsetPaginator (#48)
* add SimpleOffsetPaginator

* add SimpleOffsetPaginator to README
SebRollen authored Jun 17, 2024
1 parent 761f4bb commit 0bb645c
Showing 3 changed files with 47 additions and 13 deletions.
28 changes: 15 additions & 13 deletions README.md
Expand Up @@ -15,7 +15,7 @@ There are many forms of Authentication supported by this tap. By default for leg
- OAuth
- AWS

Please note that OAuthJWTAuthentication has not been developed. If you are interested in contributing this, please fork and make a pull request.

Built with the Meltano [SDK](https://gitlab.com/meltano/sdk) for Singer Taps.

Expand Down Expand Up @@ -140,8 +140,8 @@ tap is available by running:
tap-rest-api-msdk --about
```

#### Top-level config options.
Parameters that appear at the stream-level will overwrite their top-level
counterparts except where noted in the stream-level params. Otherwise, the values
provided at the top-level will be the default values for each stream:
- `api_url`: required: the base url/endpoint for the desired api.
Expand Down Expand Up @@ -187,16 +187,16 @@ provided at the top-level will be the default values for each stream.:
- `oauth_expiration_secs`: optional: see authentication params below.
- `aws_credentials`: optional: see authentication params below.

#### Stream level config options.
Parameters that appear at the stream-level
will overwrite their top-level counterparts except where noted below:
- `name`: required: name of the stream.
- `path`: optional: the path appended to the `api_url`.
- `params`: optional: an object of objects that provide the `params` in a `requests.get` method.
Stream level params will be merged with top-level params with stream level params overwriting
top-level params with the same key.
- `headers`: optional: an object of headers to pass into the api calls. Stream level
headers will be merged with top-level headers with stream level headers overwriting
top-level headers with the same key.
- `records_path`: optional: a jsonpath string representing the path in the requests response that contains the records to process. Defaults to `$[*]`.
- `primary_keys`: required: a list of the json keys of the primary key for the stream.
Expand All @@ -207,20 +207,20 @@ will overwrite their top-level counterparts except where noted below:
records are not duplicated for each item in lists.
- `num_inference_keys`: optional: number of records used to infer the stream's schema. Defaults to 50.
- `schema`: optional: A valid Singer schema or a path-like string that provides
the path to a `.json` file that contains a valid Singer schema. If provided,
the schema will not be inferred from the results of an api call.
- `start_date`: optional: used by the **offset**, **page**, and **hateoas_body** response styles. This is the initial starting date for an incremental replication when no
existing state is provided. Example format 2022-06-10:23:10:10+1200.
- `source_search_field`: optional: used by the **offset**, **page**, and **hateoas_body** response style. This is a search/query parameter used by the API for an incremental replication.

The difference between the `replication_key` and the `source_search_field` is that the search field is used in the request parameters, whereas the replication_key is the name of the field in the API response. For example, if the source_search_field = **last-updated**, the generated schema from the api discovery
might be **meta_lastUpdated**. The replication_key is then set to meta_lastUpdated, and the source_search_field to last-updated. Note: Please set the `replication_key`, `start_date`, `source_search_field`, and `source_search_query` parameters all together.
- `source_search_query`: optional: used by the **offset**, **page**, and **hateoas_body** response style. This is a query template to be issued against the API. A simple query template example for FHIR API's is **gt$last_run_date**.

A more complex example against an Opensearch API, **{\\"bool\\": {\\"filter\\": [{\\"range\\": { \\"meta.lastUpdated\\": { \\"gt\\": \\"$last_run_date\\" }}}] }}**. Note: Any required double quotes in the query template must be escaped.

At run-time, the tap will dynamically change the value **$last_run_date** with either the defined `start_date` parameter or the last bookmark / state value.
Example: source_search_field=**last-updated**, the
source_search_query = **gt$last_run_date**, and the current replication state = 2022-08-10:23:10:10+1200. At run time this creates a request parameter **last-updated=gt2022-08-10:23:10:10+1200**.
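
The substitution described above can be sketched in a few lines of Python. This is only an illustration, not the tap's own code: the helper `build_search_param` and its arguments are hypothetical, and it assumes the template is filled with the standard library's `string.Template`, preferring the bookmark / state value and falling back to `start_date`.

```python
from string import Template
from typing import Optional


def build_search_param(
    source_search_field: str,
    source_search_query: str,
    start_date: str,
    bookmark: Optional[str] = None,
) -> dict:
    """Hypothetical helper: fill $last_run_date and return one request parameter."""
    # Prefer the saved bookmark / state value; fall back to start_date.
    last_run_date = bookmark if bookmark is not None else start_date
    value = Template(source_search_query).substitute(last_run_date=last_run_date)
    return {source_search_field: value}


# With existing state, the bookmark is used.
params = build_search_param(
    source_search_field="last-updated",
    source_search_query="gt$last_run_date",
    start_date="2022-06-10:23:10:10+1200",
    bookmark="2022-08-10:23:10:10+1200",
)
print(params)

# Without state, the configured start_date is used instead.
first_run_params = build_search_param(
    source_search_field="last-updated",
    source_search_query="gt$last_run_date",
    start_date="2022-06-10:23:10:10+1200",
)
print(first_run_params)
```

On a first run without state the parameter carries the `start_date`; on later runs it carries the bookmark.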

#### Top-Level Authentication config options.
Expand Down Expand Up @@ -293,7 +293,7 @@ Example:
- headers = '{"x-api-key": "my_secret_api_key", "Request-Context": "my_example_Base64_encoded_json_object"}'

## Pagination
API Pagination is a complex topic as there is no real single standard, and many different implementations. Unless options are provided, both the request and results style type default to the `default`, which is the pagination style originally implemented. Where possible, this tap utilises the Meltano SDK paginators https://sdk.meltano.com/en/latest/reference.html#pagination .

### Default Request Style
The default request style for pagination is using a `JSONPath Paginator` to locate the next page token.
Expand Down Expand Up @@ -325,6 +325,8 @@ There are additional request styles supported as follows for pagination.
- `single_page_paginator` - A paginator that works with single-page endpoints.
- `page_number_paginator` - Paginator class for APIs that use page number. Looks at the response link to determine more pages.
- `next_page_token_path` - Use to locate an appropriate link in the response. Default `"hasMore"`.
- `simple_offset_paginator` - A paginator that uses `offset` and `limit` parameters to page through a collection of resources. Unlike `offset_paginator`, this paginator does not rely on any headers to determine whether it should keep paginating. Instead, it keeps sending requests with an increasing `offset` for as long as each response contains a full page of `pagination_page_size` records, and stops at the first response that returns fewer (including 0). Use this paginator if the API returns a JSON array of records rather than a top-level object.
- `pagination_page_size` - Sets a limit to number of records per page / response. Default `25` records.
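
As a rough illustration of the `simple_offset_paginator` loop, the sketch below mirrors the `has_more` check added in `pagination.py`, which continues only while a response holds a full page. The names `iter_offset_pages`, `fake_fetch` and `RECORDS` are hypothetical stand-ins, and no real HTTP request is made.

```python
from typing import Callable, Iterator, List


def iter_offset_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 25,
    start_offset: int = 0,
) -> Iterator[List[dict]]:
    """Yield pages until one comes back shorter than page_size."""
    offset = start_offset
    while True:
        page = fetch_page(offset, page_size)
        if page:
            yield page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in for an API that returns a bare JSON array of records.
RECORDS = [{"id": i} for i in range(60)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return RECORDS[offset : offset + limit]


pages = list(iter_offset_pages(fake_fetch, page_size=25))
page_lengths = [len(p) for p in pages]
print(page_lengths)  # [25, 25, 10]
```

When the total record count is an exact multiple of the page size, one extra request is made and returns an empty array, which ends the loop.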

### Additional Response Styles
There are additional response styles supported as follows.
Expand All @@ -345,9 +347,9 @@ There are additional response styles supported as follows.
- `pagination_page_size` - Sets a limit to number of records per page / response. Default `25` records.
- `pagination_limit_per_page_param` - the name of the API parameter to limit number of records per page. Default parameter name `per_page`.
- `pagination_results_limit` - Restricts the total number of records returned from the API. Default None i.e. no limit.
- `hateoas_body` - This style requires a well crafted `next_page_token_path` configuration
parameter to retrieve the request parameters from the GET request response for a subsequent request.

### JSON Path for extracting tokens
The `next_page_token_path` and `records_path` use JSONPath to locate sections within the request response.

Expand All @@ -359,7 +361,7 @@ There are additional response styles supported as follows.
The following example demonstrates the power of JSONPath extensions by further splitting the URL and extracting just the parameters. Note: This is not required for FHIR API's but is provided for illustration of added functionality for complex use cases.
```json
"next_page_token_path": "$.link[?(@.relation=='next')].url.`split(?, 1, 1)`"
```
The [JSONPath Evaluator](https://jsonpath.com/) website is useful to test the correct json path expression to use.
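
To make the expression above easier to follow, here is a plain-Python approximation of what it is meant to select: the `url` of the link whose `relation` is `next`, reduced to the part after the first `?`. This is a sketch, not how the tap evaluates JSONPath; the sample body, the `example.org` URLs and the helper `next_page_params` are all hypothetical.

```python
from typing import Optional

# Hypothetical response body with FHIR-style paging links.
response_body = {
    "resourceType": "Bundle",
    "link": [
        {"relation": "self", "url": "https://example.org/fhir/Patient?_count=2"},
        {
            "relation": "next",
            "url": "https://example.org/fhir/Patient?_count=2&_offset=2",
        },
    ],
}


def next_page_params(body: dict) -> Optional[str]:
    """Return the query string of the 'next' link, or None if there is none."""
    for link in body.get("link", []):
        if link.get("relation") == "next":
            # Keep only what follows the first "?" in the URL.
            _, _, query = link["url"].partition("?")
            return query
    return None


print(next_page_params(response_body))  # _count=2&_offset=2
```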

Example json response from a FHIR API.
26 changes: 26 additions & 0 deletions tap_rest_api_msdk/pagination.py
Expand Up @@ -77,6 +77,32 @@ def has_more(self, response: requests.Response):
return False


class SimpleOffsetPaginator(BaseOffsetPaginator):
    """Simple Offset Paginator."""

    def __init__(
        self,
        *args,
        pagination_page_size: int = 25,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self._pagination_page_size = pagination_page_size

    def has_more(self, response: requests.Response):
        """Return True if there are more pages to fetch.
        Args:
            response: The most recent response object.
        Returns:
            Whether there are more pages to fetch.
        """
        return len(response.json()) == self._pagination_page_size



class RestAPIHeaderLinkPaginator(HeaderLinkPaginator):
    """REST API Header Link Paginator."""

6 changes: 6 additions & 0 deletions tap_rest_api_msdk/streams.py
Expand Up @@ -21,6 +21,7 @@
    RestAPIBasePageNumberPaginator,
    RestAPIHeaderLinkPaginator,
    RestAPIOffsetPaginator,
    SimpleOffsetPaginator
)
from tap_rest_api_msdk.utils import flatten_json, get_start_date

Expand Down Expand Up @@ -323,6 +324,11 @@ def get_new_paginator(self):
            return RestAPIBasePageNumberPaginator(
                jsonpath=self.next_page_token_jsonpath
            )
        elif self.pagination_request_style == "simple_offset_paginator":
            return SimpleOffsetPaginator(
                start_value=self.pagination_initial_offset,
                pagination_page_size=self.pagination_page_size
            )
        else:
            self.logger.error(
                f"Unknown paginator {self.pagination_request_style}. Please declare "