Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Dataframe to be converted into views which can be used in register_table #1016

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Feb 6, 2025

Which issue does this PR close?

Closes #1004.

Rationale for this change

Currently, Datafusion supports views via ViewTable, allowing logical plans to be registered as tables. However, this feature is not exposed in the Python bindings. This PR addresses that gap by enabling Dataframes to be converted to view, enabling users to create views programmatically without relying on raw SQL strings.

What changes are included in this PR?

Adds DataFrame.into_view which can then be used in
register_table.

Are there any user-facing changes?

Yes, Python users will now be able to convert dataframes into view, then register_table with the view.

@kosiew kosiew changed the title View Enable Dataframe to be converted into views which can be used in register_table Feb 6, 2025
Comment on lines +12 to +16
view = df_filtered.into_view()

assert view.kind == "view"

ctx.register_table("view1", view)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is modelled after how into_view is used in datafusion:

async fn with_column_renamed_ambiguous() -> Result<()> {
    let df = test_table().await?.select_columns(&["c1", "c2", "c3"])?;
    let ctx = SessionContext::new();

    let table = df.into_view();
    ctx.register_table("t1", table.clone())?;
    ctx.register_table("t2", table)?;

@kosiew kosiew marked this pull request as ready for review February 7, 2025 01:50
Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valuable addition. Thank you!

I do have a few concerns below, and I'm happy to help us come to a common solution.

Comment on lines +55 to +70
#[pyclass(name = "TableProvider", module = "datafusion")]
pub struct PyTableProvider {
provider: Arc<dyn TableProvider>,
}

impl PyTableProvider {
pub fn new(provider: Arc<dyn TableProvider>) -> Self {
Self { provider }
}

pub fn as_table(&self) -> PyTable {
let table_provider: Arc<dyn TableProvider> = self.provider.clone();
PyTable::new(table_provider)
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this is a good idea, but I'm worried about causing confusion with a table provider created from a view and a table provider that is passed from an external source using pycapsule. I can imagine a user would think that a table provider object from one place can be used with another. That is, if I create a table provider with into_view I should be able to register it with the session context. Now, I don't think that operation is strictly necssary but I do expect it would cause some confusion.

What I think we want to do is to have a single common PyTableProvider that can be created either via a pycapsule or into_view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a single common PyTableProvider that can be created either via a pycapsule or into_view.

Do you mean a constructor that takes a pycapsule argument, then extract provider to use in
PyTableProvider::new(provider)?

Can I check how I can obtain the provider from pub struct PyCapsule(PyAny)?

src/dataframe.rs Outdated
@@ -156,6 +174,20 @@ impl PyDataFrame {
PyArrowType(self.df.schema().into())
}

/// Convert this DataFrame into a Table that can be used in register_table
fn _into_view(&self) -> PyDataFusionResult<PyTable> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend disabling that specific clippy warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +84 to +90
## Registering a DataFrame as a View

You can use the `into_view` method to convert a DataFrame into a view and register it with the context.

```python
from datafusion import SessionContext, col, literal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very good, but would be more helpful if moved into the appropriate docs section so it goes into the online documentation rather than the readme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a view.rst for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support registering views
2 participants