Add SQLAlchemy & BigQuery sources #1062

Draft · wants to merge 1 commit into main

Conversation

amaloney
Collaborator

resolves #1051

Adds the following source objects

  • SQLAlchemy
  • BigQuery

@amaloney amaloney self-assigned this Feb 18, 2025

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 0% with 78 lines in your changes missing coverage. Please review.

Project coverage is 57.20%. Comparing base (1dd6cee) to head (a3b1062).

Files with missing lines       Patch %   Lines
lumen/sources/bigquery.py      0.00%     53 Missing ⚠️
lumen/sources/sqlalchemy.py    0.00%     25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
- Coverage   57.51%   57.20%   -0.32%     
==========================================
  Files         109      111       +2     
  Lines       14291    14369      +78     
==========================================
  Hits         8220     8220              
- Misses       6071     6149      +78     

class SQLAlchemySource(BaseSQLSource):
    driver = param.String(default=None, doc="SQL driver.")
Contributor

I think it'd be wise to have a way to pass in a URL, perhaps as a from_url classmethod.

Member

+1
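
The from_url idea suggested above could look roughly like the sketch below. It is only an illustration: `SQLAlchemySourceSketch` and its fields are hypothetical stand-ins (a plain dataclass rather than a `param`-based Lumen source), and the URL is parsed with the standard library so the example runs without SQLAlchemy installed. A real implementation would more likely delegate to SQLAlchemy's own URL handling, such as `sqlalchemy.engine.make_url`.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import unquote, urlsplit


@dataclass
class SQLAlchemySourceSketch:
    """Hypothetical stand-in for the PR's SQLAlchemySource (no param, no engine)."""

    dialect: str
    driver: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    database: Optional[str] = None

    @classmethod
    def from_url(cls, url: str) -> "SQLAlchemySourceSketch":
        """Build a source from a 'dialect+driver://user:pass@host:port/db' URL."""
        parts = urlsplit(url)
        if not parts.scheme:
            raise ValueError(f"Not a database URL: {url!r}")
        # The scheme carries the dialect and, optionally, the driver.
        dialect, _, driver = parts.scheme.partition("+")
        return cls(
            dialect=dialect,
            driver=driver or None,
            username=unquote(parts.username) if parts.username else None,
            password=unquote(parts.password) if parts.password else None,
            host=parts.hostname,
            port=parts.port,
            database=parts.path.lstrip("/") or None,
        )


# Example URL with made-up credentials and host.
source = SQLAlchemySourceSketch.from_url(
    "postgresql+psycopg2://reader:s3cret@db.example.com:5432/analytics"
)
```

Keeping the individual parameters on the class and adding from_url as an alternative constructor means existing keyword-based construction keeps working.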

Member

@philippjfr philippjfr left a comment

Thanks for the PR, looks like a great start. I do have a few initial comments:

At the highest level I'm not quite seeing how this correctly implements the BaseSQLSource APIs, i.e. at minimum I would have expected an implementation of

  • execute: This is meant for executing a SQL query and returning the result as a DataFrame. Your run_query method seems to come close, but I'm guessing you have to wrap its result in pd.DataFrame.from_records.
  • get_tables: This should return a list of valid tables.

Additionally, I would expect an API somewhat similar to other sources, where you can define a tables parameter that accepts either a list of tables (so as to limit which tables you can access) or a dictionary mapping from a table name alias to a SQL expression.
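
To make the requested surface concrete, here is a minimal sketch of a source with a tables option, get_tables, get_sql_expr, and execute. Everything in it is an assumption made for illustration: `SQLiteSourceSketch` is a hypothetical class backed by the standard library's sqlite3, not Lumen's BaseSQLSource, and execute returns a list of row records instead of a DataFrame so the example has no third-party dependencies. In a real source those records are the kind of input one would hand to pd.DataFrame.from_records, as the review suggests.

```python
import sqlite3
from typing import Dict, List, Union

Tables = Union[List[str], Dict[str, str]]


class SQLiteSourceSketch:
    """Hypothetical stand-in for a SQL source; not Lumen's BaseSQLSource."""

    def __init__(self, connection: sqlite3.Connection, tables: Tables):
        self._conn = connection
        # Normalize both accepted forms to {name or alias: SQL expression}.
        if isinstance(tables, dict):
            self._tables = dict(tables)
        else:
            self._tables = {name: f'SELECT * FROM "{name}"' for name in tables}

    def get_tables(self) -> List[str]:
        """Return only the tables (or aliases) this source exposes."""
        return list(self._tables)

    def get_sql_expr(self, table: str) -> str:
        """Return the SQL expression behind a table name or alias."""
        if table not in self._tables:
            raise KeyError(f"Table {table!r} is not exposed by this source")
        return self._tables[table]

    def execute(self, sql_query: str) -> List[dict]:
        """Run a query and return row records (one dict per row)."""
        cursor = self._conn.execute(sql_query)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.execute("CREATE TABLE payroll_secrets (note TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ada", 120.0), ("Grace", 95.0), ("Linus", 80.0)],
)

# List form: restricts access to the named tables.
restricted = SQLiteSourceSketch(conn, tables=["employees"])

# Dict form: maps an alias to an arbitrary SQL expression.
aliased = SQLiteSourceSketch(
    conn,
    tables={"top_earners": "SELECT name FROM employees WHERE salary > 90 ORDER BY name"},
)
records = aliased.execute(aliased.get_sql_expr("top_earners"))
```

Normalizing the list form into the dict form up front keeps get_tables and get_sql_expr identical for both cases.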

I appreciate the documentation is a little sparse, so please don't hesitate to reach out to me or Andrew to clarify things about the API.

I think there's also a misunderstanding of the role of get_schema. The additional schema information you are getting from BigQuery is quite helpful, but I'd consider that part metadata. The schema in Lumen specifically refers to a dictionary that contains the type of each column and its min-max and unique values.
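
As a rough illustration of that distinction, the sketch below derives a per-column dictionary holding a type plus either min/max bounds or the unique values. The function name `sketch_schema` and the key names ("type", "min", "max", "enum") are assumptions chosen for readability; they are not necessarily the exact keys Lumen's get_schema emits. Descriptive details from the database, such as column descriptions, would live in a separate metadata structure.

```python
from typing import Any, Dict, List


def sketch_schema(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarize each column as a type plus min/max (numbers) or unique values (other)."""
    schema: Dict[str, Dict[str, Any]] = {}
    columns = records[0].keys() if records else []
    for column in columns:
        # Ignore missing values when computing bounds and unique values.
        values = [row[column] for row in records if row[column] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            schema[column] = {"type": "number", "min": min(values), "max": max(values)}
        else:
            schema[column] = {"type": "string", "enum": sorted({str(v) for v in values})}
    return schema


# Made-up rows, shaped like the records a query might return.
rows = [
    {"name": "Ada", "salary": 120.0},
    {"name": "Grace", "salary": 95.0},
    {"name": "Ada", "salary": 80.0},
]
schema = sketch_schema(rows)
```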

@amaloney
Collaborator Author

This is very helpful information. From the discussion I need to implement the following.

  • add a from_url method to the SQLAlchemy class
  • implement the methods required from the BaseSQLSource class (listed below). I definitely overlooked these requirements when focusing on the BigQuery class, which only required the immediate URL implementation.
    • get_sql_expr
    • create_sql_expr_source
    • execute -> pd.DataFrame
    • get_tables -> list[str]: I don't see this in the base class, but I'll add it so others don't make the same mistake I did
    • add a tables = param.ClassSelector(...)
  • rename the current get_schema method to get_metadata, and add a get_schema method that more closely follows the ones found in, e.g., DuckDB or IntakeBaseSource

For documentation purposes I think I will also update the Lumen AI -> How to Guides -> Custom Data Sources to include a more in-depth discussion on the methods a dev needs to write in order to follow the API specs found in the base class.

  • Update Lumen AI -> How to Guides -> Custom Data Sources for a better DX.
  • Add docstrings to methods I personally missed, to ensure they are not overlooked by others in the future.

Any other items y'all can think of will be added to the above to-do lists.

@ahuang11
Contributor

One other thing is tests (terribly lacking in the ai/ directory, but they should be maintained for source/).

Not sure how hard it is to set up a MySQL server, but this seems useful for testing locally (probably not in CI?): https://docs.getwren.ai/oss/getting_started/sample_data/hr
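
For tests that should also run in CI without any database server, one possible approach is an in-memory SQLite database. This is an assumption added here for illustration, not something proposed in the thread, and `list_tables` is a hypothetical helper standing in for whatever method ends up under test.

```python
import sqlite3
import unittest


def list_tables(conn: sqlite3.Connection) -> list:
    """Hypothetical helper under test: names of the tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


class TestListTables(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test; nothing to install or start.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE employees (name TEXT)")
        self.conn.execute("CREATE TABLE departments (name TEXT)")

    def tearDown(self):
        self.conn.close()

    def test_lists_tables_sorted(self):
        self.assertEqual(list_tables(self.conn), ["departments", "employees"])


# Run the test case programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestListTables)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Dialect-specific behavior, such as MySQL or BigQuery SQL quirks, would still need the local server setup mentioned above or a mocked client.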

Development

Successfully merging this pull request may close these issues.

SQLAlchemy source for Lumen AI
3 participants