API: Add tag mapping and batching (rtdip#788)

* API: Add tag mapping endpoint via databricks serving endpoint that makes business_unit, region, data_security_level and data_type optional if the endpoint is provided
* API: Replace dotenv in pydantic with pydantic settings. Includes required package update
* API: Add batch route, which allows users to send a list of requests
* API: Add additional tests for batch route
* API: Update json_response_batch to replace nan rows with None
* SDK: Update batch sdk function to use executor map in order to preserve order
* API: Update max threadpool workers to use environment variables and default to cpu count-1
* API: Update pydantic validator to validate conditionally based on environment variable. Update tests to capture/test change.
* API: Update readme with new environment variable descriptions
* API: Reintroduce examples in models for data_type
* API: Update lookup_before_get to ensure remains as dataframe and also update test imports
* Format code with black
* Additional formatting with black

---------

Signed-off-by: ummer-shell <[email protected]>
Co-authored-by: Ummer Taahir <Ummer Taahir>
GBBBAS · Jul 25, 2024 · b14cc22
1 parent 3fdd12a
commit b14cc22
Showing 26 changed files with 2,043 additions and 53 deletions.
diff --git a/src/api/README.md b/src/api/README.md
@@ -34,6 +34,34 @@ Ensure that you setup the **local.settings.json** file with the relevant paramet
 |---------|-------|
 |DATABRICKS_SQL_SERVER_HOSTNAME|adb-xxxxx.x.azuredatabricks.net|
 |DATABRICKS_SQL_HTTP_PATH|/sql/1.0/warehouses/xxx|
+|DATABRICKS_SERVING_ENDPOINT|https://adb-xxxxx.x.azuredatabricks.net/serving-endpoints/xxxxxxx/invocations|
+|BATCH_THREADPOOL_WORKERS|3|
+|LOOKUP_THREADPOOL_WORKERS|10|
+
+### Information:
+
+DATABRICKS_SERVING_ENDPOINT 
+- **This is an optional parameter**
+- This represents a Databricks feature serving endpoint, which is used to create lower-latency look-ups of Databricks tables.
+- In this API, this is used to map tagnames to their respective "CatalogName", "SchemaName" and "DataTable"
+- This enables the parameters business_unit, asset, data_security_level and data_type to be optional, thereby reducing user friction in querying data.
+- Because these parameters are optional, custom validation logic based on the presence (or absence) of the mapping endpoint is implemented in models.py via pydantic.
+- For more information on feature serving endpoints please see: https://docs.databricks.com/en/machine-learning/feature-store/feature-function-serving.html
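
The conditional validation described above could look roughly like the sketch below. It is a plain-Python stand-in for the pydantic validator, not the code in models.py: the function name `validate_query_params` and the `environ` argument are hypothetical, and the field list is assumed from the check used in the route handlers.

```python
import os
from typing import Optional

# Assumed field list, taken from the check used in the route handlers
TABLE_PATH_FIELDS = ("business_unit", "asset", "data_security_level", "data_type")


def validate_query_params(params: dict, environ: Optional[dict] = None) -> dict:
    """Require the table-path fields only when no mapping endpoint is configured."""
    environ = os.environ if environ is None else environ
    if environ.get("DATABRICKS_SERVING_ENDPOINT"):
        # Tags can be mapped to their tables by the serving endpoint,
        # so the table-path fields may be omitted.
        return params
    missing = [field for field in TABLE_PATH_FIELDS if params.get(field) is None]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    return params


# With an endpoint configured, a tag-only query passes validation
with_endpoint = {"DATABRICKS_SERVING_ENDPOINT": "https://example.invalid/invocations"}
print(validate_query_params({"tag_name": ["TAG_A"]}, with_endpoint))

# Without one, the same query is rejected
try:
    validate_query_params({"tag_name": ["TAG_A"]}, {})
except ValueError as error:
    print(error)
```

Passing the environment in as an argument is only a convenience for the example; it keeps the check testable without touching the real process environment.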
+
+LOOKUP_THREADPOOL_WORKERS
+- **This is an optional parameter**
+- In the event of a query with multiple tags residing in multiple tables, the API will query these tables separately and concatenate the results.
+- This parameter sets the number of workers used to parallelise these requests.
+- This defaults to 3 if it is not defined in the .env.
+
+BATCH_THREADPOOL_WORKERS 
+- **This is an optional parameter**
+- This represents the number of workers for parallelisation of requests in a batch sent to the /batch route.
+- This defaults to the cpu count minus one if not defined in the .env.
+
+Please note that the batch API route calls the lookup under the hood by default. Therefore, if there are many requests, each requiring multiple tables, the total number of threads can reach BATCH_THREADPOOL_WORKERS * LOOKUP_THREADPOOL_WORKERS.
+For example, 10 requests in a batch, each querying 3 tables, can produce up to 30 simultaneous queries if the worker limits allow it.
+Therefore, it is recommended to set both parameters explicitly to tune performance.
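
To make the thread bound concrete, here is a minimal sketch of two nested thread pools. It assumes each batch request opens its own inner pool for its table queries, which the worker product above implies but this excerpt does not show. `fake_table_query` and `fake_request` are hypothetical stand-ins, not RTDIP functions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

BATCH_WORKERS = 2   # stand-in for BATCH_THREADPOOL_WORKERS
LOOKUP_WORKERS = 3  # stand-in for LOOKUP_THREADPOOL_WORKERS

lock = threading.Lock()
state = {"active": 0, "peak": 0}


def fake_table_query(table):
    """Hypothetical stand-in for one table query; tracks peak concurrency."""
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.05)  # pretend to wait on the warehouse
    with lock:
        state["active"] -= 1
    return table


def fake_request(tables):
    """One batch request fanning out over its tables with its own inner pool."""
    with ThreadPoolExecutor(max_workers=LOOKUP_WORKERS) as inner:
        return list(inner.map(fake_table_query, tables))


# 10 requests, each touching 3 tables, as in the README example
batch = [[f"request{i}_table{j}" for j in range(3)] for i in range(10)]
with ThreadPoolExecutor(max_workers=BATCH_WORKERS) as outer:
    results = list(outer.map(fake_request, batch))

limit = BATCH_WORKERS * LOOKUP_WORKERS
print(f"peak concurrent queries: {state['peak']} (limit {limit})")
```

With 2 batch workers and 3 lookup workers the peak cannot exceed 6, even though the batch contains 30 table queries in total. The worker settings cap the concurrency, not the batch size.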
 
 Please also ensure to install all the turbodbc requirements for your machine by reviewing the [installation instructions](https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html) of turbodbc. On a macbook, this includes executing the following commands:
 

diff --git a/src/api/requirements.txt b/src/api/requirements.txt
@@ -21,4 +21,4 @@ googleapis-common-protos>=1.56.4
 langchain>=0.2.0,<0.3.0
 langchain-community>=0.2.0,<0.3.0
 openai==1.13.3
-pyjwt==2.8.0
+pyjwt==2.8.0
diff --git a/src/api/v1/__init__.py b/src/api/v1/__init__.py
@@ -30,6 +30,7 @@
     circular_average,
     circular_standard_deviation,
     summary,
+    batch,
 )
 from src.api.auth.azuread import oauth2_scheme
 

diff --git a/src/api/v1/batch.py b/src/api/v1/batch.py
@@ -0,0 +1,144 @@
+# Copyright 2022 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import numpy as np
+import os
+from fastapi import HTTPException, Depends, Body  # , JSONResponse
+
+from src.api.v1.models import (
+    BaseQueryParams,
+    BaseHeaders,
+    BatchBodyParams,
+    BatchResponse,
+    LimitOffsetQueryParams,
+    HTTPError,
+)
+from src.api.auth.azuread import oauth2_scheme
+from src.api.v1.common import (
+    common_api_setup_tasks,
+    json_response_batch,
+    lookup_before_get,
+)
+from src.api.FastAPIApp import api_v1_router
+from concurrent.futures import ThreadPoolExecutor
+
+
+ROUTE_FUNCTION_MAPPING = {
+    "/api/v1/events/raw": "raw",
+    "/api/v1/events/latest": "latest",
+    "/api/v1/events/resample": "resample",
+    "/api/v1/events/plot": "plot",
+    "/api/v1/events/interpolate": "interpolate",
+    "/api/v1/events/interpolationattime": "interpolationattime",
+    "/api/v1/events/circularaverage": "circularaverage",
+    "/api/v1/events/circularstandarddeviation": "circularstandarddeviation",
+    "/api/v1/events/timeweightedaverage": "timeweightedaverage",
+    "/api/v1/events/summary": "summary",
+    "/api/v1/events/metadata": "metadata",
+    "/api/v1/sql/execute": "execute",
+}
+
+
+async def batch_events_get(
+    base_query_parameters, base_headers, batch_query_parameters, limit_offset_parameters
+):
+    try:
+        (connection, parameters) = common_api_setup_tasks(
+            base_query_parameters=base_query_parameters,
+            base_headers=base_headers,
+        )
+
+        # Validate the parameters
+        parsed_requests = []
+        for request in batch_query_parameters.requests:
+            # If required, combine request body and parameters:
+            parameters = request["params"]
+            if request["method"] == "POST":
+                if request["body"] is None:
+                    raise Exception(
+                        "Incorrectly formatted request provided: All POST requests require a body"
+                    )
+                parameters = {**parameters, **request["body"]}
+
+            # Map the url to a specific function
+            try:
+                func = ROUTE_FUNCTION_MAPPING[request["url"]]
+            except KeyError:
+                raise Exception(
+                    "Unsupported url: Only relative base urls are supported. Please provide any parameters in the params key"
+                )
+
+            # Rename tag_name to tag_names, if required
+            if "tag_name" in parameters.keys():
+                parameters["tag_names"] = parameters.pop("tag_name")
+
+            # Append to array
+            parsed_requests.append({"func": func, "parameters": parameters})
+
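+        # Obtain max workers from environment var (cast to int, since env values are strings),
+        # otherwise default to one less than cpu count, with a floor of 1
+        default_workers = max(1, (os.cpu_count() or 2) - 1)
+        max_workers = int(os.environ.get("BATCH_THREADPOOL_WORKERS", default_workers))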
+
+        # Request the data for each concurrently with threadpool
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+            # Use executor.map to preserve order
+            results = executor.map(
+                lambda arguments: lookup_before_get(*arguments),
+                [
+                    (parsed_request["func"], connection, parsed_request["parameters"])
+                    for parsed_request in parsed_requests
+                ],
+            )
+
+        return json_response_batch(results)
+
+    except Exception as e:
+        print(e)
+        logging.error(str(e))
+        raise HTTPException(status_code=400, detail=str(e))
+
+
+post_description = """
+## Batch 
+
+Retrieval of timeseries data via a POST method to enable providing a list of requests including the route and parameters
+"""
+
+
+@api_v1_router.post(
+    path="/events/batch",
+    name="Batch POST",
+    description=post_description,
+    tags=["Events"],
+    dependencies=[Depends(oauth2_scheme)],
+    responses={200: {"model": BatchResponse}, 400: {"model": HTTPError}},
+    openapi_extra={
+        "externalDocs": {
+            "description": "RTDIP Batch Query Documentation",
+            "url": "https://www.rtdip.io/sdk/code-reference/query/functions/time_series/batch/",
+        }
+    },
+)
+async def batch_post(
+    base_query_parameters: BaseQueryParams = Depends(),
+    batch_query_parameters: BatchBodyParams = Body(default=...),
+    base_headers: BaseHeaders = Depends(),
+    limit_offset_query_parameters: LimitOffsetQueryParams = Depends(),
+):
+    return await batch_events_get(
+        base_query_parameters,
+        base_headers,
+        batch_query_parameters,
+        limit_offset_query_parameters,
+    )
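
For illustration, the sketch below mirrors the parsing loop in `batch_events_get` and shows why `executor.map` is used: results come back in request order even when later requests finish first. The keys `url`, `method`, `params` and `body` are the ones the handler reads. The tag names, dates, `parse_request` and `fake_lookup` are hypothetical placeholders, and no RTDIP code is called.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Subset of ROUTE_FUNCTION_MAPPING from batch.py
ROUTES = {
    "/api/v1/events/raw": "raw",
    "/api/v1/events/latest": "latest",
    "/api/v1/events/summary": "summary",
}

# Hypothetical request body; tag names and dates are placeholders
batch_body = {
    "requests": [
        {"url": "/api/v1/events/raw", "method": "GET",
         "params": {"tag_name": ["TAG_A"], "start_date": "2024-01-01"}},
        {"url": "/api/v1/events/summary", "method": "POST",
         "params": {"start_date": "2024-01-01"},
         "body": {"tag_name": ["TAG_B", "TAG_C"]}},
        {"url": "/api/v1/events/latest", "method": "GET",
         "params": {"tag_name": ["TAG_D"]}},
    ]
}


def parse_request(request):
    """Merge params and body, map the url to a function name, rename tag_name."""
    parameters = dict(request["params"])
    if request["method"] == "POST":
        if request.get("body") is None:
            raise ValueError("All POST requests require a body")
        parameters = {**parameters, **request["body"]}
    if "tag_name" in parameters:
        parameters["tag_names"] = parameters.pop("tag_name")
    return ROUTES[request["url"]], parameters


DELAYS = {"raw": 0.15, "summary": 0.08, "latest": 0.0}  # first request is slowest
finished = []  # completion order, for comparison with the result order


def fake_lookup(func_name, parameters):
    """Hypothetical stand-in for lookup_before_get."""
    time.sleep(DELAYS[func_name])
    finished.append(func_name)
    return f"{func_name}:{','.join(parameters['tag_names'])}"


parsed = [parse_request(request) for request in batch_body["requests"]]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(lambda args: fake_lookup(*args), parsed))

print(results)   # ['raw:TAG_A', 'summary:TAG_B,TAG_C', 'latest:TAG_D']
print(finished)  # completion order depends on timing
```

Because `executor.map` yields results in the order of its inputs, the caller can match each response to its request by position, without carrying an identifier through the worker threads.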
diff --git a/src/api/v1/circular_average.py b/src/api/v1/circular_average.py
@@ -32,7 +32,7 @@
     PivotQueryParams,
     LimitOffsetQueryParams,
 )
-from src.api.v1.common import common_api_setup_tasks, json_response
+from src.api.v1.common import common_api_setup_tasks, json_response, lookup_before_get
 
 
 def circular_average_events_get(
@@ -55,7 +55,15 @@ def circular_average_events_get(
             base_headers=base_headers,
         )
 
-        data = circular_average.get(connection, parameters)
+        if all(
+            (key in parameters and parameters[key] is not None)
+            for key in ["business_unit", "asset", "data_security_level", "data_type"]
+        ):
+            # if all required params are present, run normally
+            data = circular_average.get(connection, parameters)
+        else:
+            # else wrap in lookup function that finds tablenames and runs function (if multiple tables, handles concurrent requests)
+            data = lookup_before_get("circular_average", connection, parameters)
 
         return json_response(data, limit_offset_parameters)
     except Exception as e:
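
The same presence check is repeated in each route handler this commit touches. One way it could be factored out is sketched below; `has_full_table_path` and `fetch` are hypothetical helper names that do not exist in the RTDIP codebase, and the two lambdas only stand in for a query module's `get` and for `lookup_before_get`.

```python
REQUIRED_KEYS = ("business_unit", "asset", "data_security_level", "data_type")


def has_full_table_path(parameters):
    """True when every key needed to build the table name is present and not None."""
    return all(parameters.get(key) is not None for key in REQUIRED_KEYS)


def fetch(name, direct_get, lookup, connection, parameters):
    """Query directly when the table path is complete, otherwise go via the lookup."""
    if has_full_table_path(parameters):
        return direct_get(connection, parameters)
    return lookup(name, connection, parameters)


# Stand-ins for circular_average.get and lookup_before_get
direct = lambda connection, parameters: "direct"
via_lookup = lambda name, connection, parameters: f"lookup:{name}"

full = {
    "business_unit": "bu",
    "asset": "asset",
    "data_security_level": "restricted",
    "data_type": "float",
    "tag_names": ["TAG_A"],
}
partial = {"tag_names": ["TAG_A"], "business_unit": None}

print(fetch("circular_average", direct, via_lookup, None, full))     # direct
print(fetch("circular_average", direct, via_lookup, None, partial))  # lookup:circular_average
```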

diff --git a/src/api/v1/circular_standard_deviation.py b/src/api/v1/circular_standard_deviation.py
@@ -33,7 +33,7 @@
     LimitOffsetQueryParams,
     CircularAverageQueryParams,
 )
-from src.api.v1.common import common_api_setup_tasks, json_response
+from src.api.v1.common import common_api_setup_tasks, json_response, lookup_before_get
 
 
 def circular_standard_deviation_events_get(
@@ -56,7 +56,17 @@ def circular_standard_deviation_events_get(
             base_headers=base_headers,
         )
 
-        data = circular_standard_deviation.get(connection, parameters)
+        if all(
+            (key in parameters and parameters[key] is not None)
+            for key in ["business_unit", "asset", "data_security_level", "data_type"]
+        ):
+            # if all required params are present, run normally
+            data = circular_standard_deviation.get(connection, parameters)
+        else:
+            # else wrap in lookup function that finds tablenames and runs function (if multiple tables, handles concurrent requests)
+            data = lookup_before_get(
+                "circular_standard_deviation", connection, parameters
+            )
 
         return json_response(data, limit_offset_parameters)
     except Exception as e: