API: Add tag mapping and batching (rtdip#788)
* API: Add tag mapping endpoint via databricks serving endpoint that makes business_unit, region, data_security_level and data_type optional if the endpoint is provided

* API: Replace dotenv in pydantic with pydantic settings. Includes required package update

* API: Add batch route, which allows users to send a list of requests

* API: Add additional tests for batch route

* API: Update json_response_batch to replace NaN rows with None

* SDK: Update batch sdk function to use executor map in order to preserve order

* API: Update max threadpool workers to use environment variables and default to cpu count-1

* API: Update pydantic validator to validate conditionally based on environment variable. Update tests to capture/test change.

* API: Update readme with new environment variable descriptions

* API: Reintroduce examples in models for data_type

* API: Update lookup_before_get to ensure the result remains a dataframe, and update test imports

* Format code with black

* Additional formatting with black

---------

Signed-off-by: ummer-shell <[email protected]>
Co-authored-by: Ummer Taahir <Ummer Taahir>
ummer-shell authored Jul 25, 2024
1 parent 3fdd12a commit b14cc22
Showing 26 changed files with 2,043 additions and 53 deletions.
28 changes: 28 additions & 0 deletions src/api/README.md
@@ -34,6 +34,34 @@ Ensure that you setup the **local.settings.json** file with the relevant paramet
|---------|-------|
|DATABRICKS_SQL_SERVER_HOSTNAME|adb-xxxxx.x.azuredatabricks.net|
|DATABRICKS_SQL_HTTP_PATH|/sql/1.0/warehouses/xxx|
|DATABRICKS_SERVING_ENDPOINT|https://adb-xxxxx.x.azuredatabricks.net/serving-endpoints/xxxxxxx/invocations|
|BATCH_THREADPOOL_WORKERS|3|
|LOOKUP_THREADPOOL_WORKERS|10|

### Information:

DATABRICKS_SERVING_ENDPOINT
- **This is an optional parameter**
- This represents a Databricks feature serving endpoint, which is used to create lower-latency look-ups of Databricks tables.
- In this API, this is used to map tagnames to their respective "CatalogName", "SchemaName" and "DataTable".
- This enables the parameters of business_unit, asset and data_security_level to be optional, thereby reducing user friction in querying data.
- Because these parameters are optional, custom validation logic in models.py (via pydantic) checks them based on whether the mapping endpoint is configured.
- For more information on feature serving endpoints please see: https://docs.databricks.com/en/machine-learning/feature-store/feature-function-serving.html
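
The conditional rule described in the bullets above can be sketched in plain Python. This is only an illustration, not the pydantic validator in models.py: the helper name `missing_table_params` and the `env` argument are hypothetical, and the parameter names are the three listed above.

```python
import os
from typing import Any, List, Mapping, Optional

# Parameters that identify a table, as listed in the README text above.
TABLE_PARAMS = ("business_unit", "asset", "data_security_level")


def missing_table_params(
    params: Mapping[str, Any], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the table parameters that are required but absent.

    Hypothetical helper: it only illustrates the conditional rule,
    it is not the validator used by the API.
    """
    env = os.environ if env is None else env
    if env.get("DATABRICKS_SERVING_ENDPOINT"):
        # A mapping endpoint is configured, so tables can be resolved
        # from tag names and none of these parameters are mandatory.
        return []
    return [key for key in TABLE_PARAMS if params.get(key) is None]


# Without the endpoint every table parameter must be supplied.
print(missing_table_params({"tag_names": ["TAG_A"]}, env={}))
```

With the endpoint variable set in `env`, the same call returns an empty list, which mirrors the "optional if the endpoint is provided" behaviour.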

LOOKUP_THREADPOOL_WORKERS
- **This is an optional parameter**
- If a query contains multiple tags that reside in multiple tables, the API queries these tables separately and concatenates the results.
- This parameter sets the number of workers used to parallelise these requests.
- This defaults to 3 if it is not defined in the .env.

BATCH_THREADPOOL_WORKERS
- **This is an optional parameter**
- This represents the number of workers for parallelisation of requests in a batch sent to the /batch route.
- This defaults to the cpu count minus one if not defined in the .env.

Please note that the batch API route calls the lookup under the hood by default. Therefore, if a batch contains many requests and each requires multiple tables, the total number of threads can be up to BATCH_THREADPOOL_WORKERS * LOOKUP_THREADPOOL_WORKERS.
For example, 10 requests in the batch with each querying 3 tables means there will be up to 30 simultaneous queries.
Therefore, it is recommended to set these parameters explicitly to tune performance.
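
As a rough sketch of how the two settings combine, the snippet below reads both variables with the defaults described above and computes the upper bound on concurrent threads. The helper `resolve_workers` and the `env` argument are hypothetical names used for illustration; the API's own code may read the variables differently.

```python
import os
from typing import Mapping, Optional


def resolve_workers(
    name: str, default: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """Read a worker count from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    # Environment values are strings, so convert before use.
    workers = int(raw) if raw is not None else default
    # A thread pool needs at least one worker.
    return max(1, workers)


# Defaults described above: cpu count minus one for batch, 3 for lookup.
batch_default = (os.cpu_count() or 2) - 1
settings = {"BATCH_THREADPOOL_WORKERS": "10", "LOOKUP_THREADPOOL_WORKERS": "3"}

batch_workers = resolve_workers("BATCH_THREADPOOL_WORKERS", batch_default, settings)
lookup_workers = resolve_workers("LOOKUP_THREADPOOL_WORKERS", 3, settings)

# Upper bound on simultaneous queries when every batch request fans out.
print(batch_workers * lookup_workers)  # prints 30
```

Lowering either value reduces the bound proportionally, so the pair can be tuned together against the capacity of the SQL warehouse.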

Please also ensure that you install all the turbodbc requirements for your machine by reviewing the [installation instructions](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) of turbodbc. On a MacBook, this includes executing the following commands:

Expand Down
2 changes: 1 addition & 1 deletion src/api/requirements.txt
@@ -21,4 +21,4 @@ googleapis-common-protos>=1.56.4
 langchain>=0.2.0,<0.3.0
 langchain-community>=0.2.0,<0.3.0
 openai==1.13.3
-pyjwt==2.8.0
+pyjwt==2.8.0
1 change: 1 addition & 0 deletions src/api/v1/__init__.py
@@ -30,6 +30,7 @@
    circular_average,
    circular_standard_deviation,
    summary,
    batch,
)
from src.api.auth.azuread import oauth2_scheme

Expand Down
144 changes: 144 additions & 0 deletions src/api/v1/batch.py
@@ -0,0 +1,144 @@
# Copyright 2022 RTDIP
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import numpy as np
import os
from concurrent.futures import ThreadPoolExecutor
from fastapi import HTTPException, Depends, Body

from src.api.v1.models import (
    BaseQueryParams,
    BaseHeaders,
    BatchBodyParams,
    BatchResponse,
    LimitOffsetQueryParams,
    HTTPError,
)
from src.api.auth.azuread import oauth2_scheme
from src.api.v1.common import (
    common_api_setup_tasks,
    json_response_batch,
    lookup_before_get,
)
from src.api.FastAPIApp import api_v1_router


ROUTE_FUNCTION_MAPPING = {
    "/api/v1/events/raw": "raw",
    "/api/v1/events/latest": "latest",
    "/api/v1/events/resample": "resample",
    "/api/v1/events/plot": "plot",
    "/api/v1/events/interpolate": "interpolate",
    "/api/v1/events/interpolationattime": "interpolationattime",
    "/api/v1/events/circularaverage": "circularaverage",
    "/api/v1/events/circularstandarddeviation": "circularstandarddeviation",
    "/api/v1/events/timeweightedaverage": "timeweightedaverage",
    "/api/v1/events/summary": "summary",
    "/api/v1/events/metadata": "metadata",
    "/api/v1/sql/execute": "execute",
}


async def batch_events_get(
    base_query_parameters, base_headers, batch_query_parameters, limit_offset_parameters
):
    try:
        (connection, parameters) = common_api_setup_tasks(
            base_query_parameters=base_query_parameters,
            base_headers=base_headers,
        )

        # Validate the parameters
        parsed_requests = []
        for request in batch_query_parameters.requests:
            # If required, combine request body and parameters:
            parameters = request["params"]
            if request["method"] == "POST":
                if request["body"] is None:
                    raise Exception(
                        "Incorrectly formatted request provided: All POST requests require a body"
                    )
                parameters = {**parameters, **request["body"]}

            # Map the url to a specific function
            try:
                func = ROUTE_FUNCTION_MAPPING[request["url"]]
            except KeyError:
                raise Exception(
                    "Unsupported url: Only relative base urls are supported. Please provide any parameters in the params key"
                )

            # Rename tag_name to tag_names, if required
            if "tag_name" in parameters:
                parameters["tag_names"] = parameters.pop("tag_name")

            # Append to array
            parsed_requests.append({"func": func, "parameters": parameters})

        # Obtain max workers from environment var, otherwise default to one less than cpu count.
        # Environment values are strings, so convert to int and keep at least one worker.
        max_workers = max(
            1,
            int(os.environ.get("BATCH_THREADPOOL_WORKERS", (os.cpu_count() or 2) - 1)),
        )

        # Request the data for each concurrently with threadpool
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Use executor.map to preserve order
            results = executor.map(
                lambda arguments: lookup_before_get(*arguments),
                [
                    (parsed_request["func"], connection, parsed_request["parameters"])
                    for parsed_request in parsed_requests
                ],
            )

        return json_response_batch(results)

    except Exception as e:
        print(e)
        logging.error(str(e))
        raise HTTPException(status_code=400, detail=str(e))


post_description = """
## Batch
Retrieval of timeseries data via a POST method to enable providing a list of requests including the route and parameters
"""


@api_v1_router.post(
    path="/events/batch",
    name="Batch POST",
    description=post_description,
    tags=["Events"],
    dependencies=[Depends(oauth2_scheme)],
    responses={200: {"model": BatchResponse}, 400: {"model": HTTPError}},
    openapi_extra={
        "externalDocs": {
            "description": "RTDIP Batch Query Documentation",
            "url": "https://www.rtdip.io/sdk/code-reference/query/functions/time_series/batch/",
        }
    },
)
async def batch_post(
    base_query_parameters: BaseQueryParams = Depends(),
    batch_query_parameters: BatchBodyParams = Body(default=...),
    base_headers: BaseHeaders = Depends(),
    limit_offset_query_parameters: LimitOffsetQueryParams = Depends(),
):
    return await batch_events_get(
        base_query_parameters,
        base_headers,
        batch_query_parameters,
        limit_offset_query_parameters,
    )
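
The batch route relies on `executor.map` returning results in the order the requests were submitted, even when later requests finish first. The standalone sketch below demonstrates that property with simulated latencies; `fake_query` and the request tuples are stand-ins invented for the example, not part of the API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(request):
    """Stand-in for a real query; sleeps to simulate latency."""
    name, delay = request
    time.sleep(delay)
    return name


# The first request is the slowest, the last one the fastest.
requests = [("raw", 0.03), ("latest", 0.02), ("summary", 0.01)]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, not completion order.
    results = list(executor.map(fake_query, requests))

print(results)  # ['raw', 'latest', 'summary']
```

Collecting futures with `as_completed` instead would yield the fastest result first, which is why `map` is the better fit when each response has to line up with its request.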
12 changes: 10 additions & 2 deletions src/api/v1/circular_average.py
@@ -32,7 +32,7 @@
     PivotQueryParams,
     LimitOffsetQueryParams,
 )
-from src.api.v1.common import common_api_setup_tasks, json_response
+from src.api.v1.common import common_api_setup_tasks, json_response, lookup_before_get


 def circular_average_events_get(
@@ -55,7 +55,15 @@ def circular_average_events_get(
             base_headers=base_headers,
         )

-        data = circular_average.get(connection, parameters)
+        if all(
+            (key in parameters and parameters[key] is not None)
+            for key in ["business_unit", "asset", "data_security_level", "data_type"]
+        ):
+            # if all required params are present, run normally
+            data = circular_average.get(connection, parameters)
+        else:
+            # else wrap in lookup function that finds table names and runs function (if multiple tables, handles concurrent requests)
+            data = lookup_before_get("circular_average", connection, parameters)

         return json_response(data, limit_offset_parameters)
     except Exception as e:
Expand Down
14 changes: 12 additions & 2 deletions src/api/v1/circular_standard_deviation.py
@@ -33,7 +33,7 @@
     LimitOffsetQueryParams,
     CircularAverageQueryParams,
 )
-from src.api.v1.common import common_api_setup_tasks, json_response
+from src.api.v1.common import common_api_setup_tasks, json_response, lookup_before_get


 def circular_standard_deviation_events_get(
@@ -56,7 +56,17 @@ def circular_standard_deviation_events_get(
             base_headers=base_headers,
         )

-        data = circular_standard_deviation.get(connection, parameters)
+        if all(
+            (key in parameters and parameters[key] is not None)
+            for key in ["business_unit", "asset", "data_security_level", "data_type"]
+        ):
+            # if all required params are present, run normally
+            data = circular_standard_deviation.get(connection, parameters)
+        else:
+            # else wrap in lookup function that finds table names and runs function (if multiple tables, handles concurrent requests)
+            data = lookup_before_get(
+                "circular_standard_deviation", connection, parameters
+            )

         return json_response(data, limit_offset_parameters)
     except Exception as e:
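
Both route files above repeat the same "direct query or lookup" decision. One way to express it once is sketched below; this is an illustration only, and `has_table_params`, `fetch`, `direct_get` and `lookup_get` are hypothetical names, not functions from this commit.

```python
from typing import Any, Callable, Mapping

# Keys checked by the routes in this diff before calling a query directly.
REQUIRED_KEYS = ("business_unit", "asset", "data_security_level", "data_type")


def has_table_params(parameters: Mapping[str, Any]) -> bool:
    """True when every table-identifying parameter is present and not None."""
    return all(parameters.get(key) is not None for key in REQUIRED_KEYS)


def fetch(
    name: str,
    parameters: Mapping[str, Any],
    direct_get: Callable[[Mapping[str, Any]], Any],
    lookup_get: Callable[[str, Mapping[str, Any]], Any],
) -> Any:
    """Query directly when the table is known, otherwise use the lookup."""
    if has_table_params(parameters):
        return direct_get(parameters)
    return lookup_get(name, parameters)


# Stand-in callables that only report which path was taken.
path = fetch(
    "circular_average",
    {"tag_names": ["TAG_A"]},
    direct_get=lambda params: "direct",
    lookup_get=lambda name, params: "lookup:" + name,
)
print(path)  # lookup:circular_average
```

Passing the two callables in keeps the sketch independent of any database connection, so the routing logic can be exercised on its own.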
Expand Down