-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add load_as methods for pyarrow dataset and table #240
base: main
Are you sure you want to change the base?
Conversation
@goodwillpunning @linzhou-db From the build logs I can see that the Seems like there are some API inconsistencies the pinned version 4.x which is causing build failure on GitHub but locally test cases are passing. I also verified with versions 5.x to 10.x and was not able to reproduce the issue. Can you please unpin or upgrade this |
Thanks @chitralverma , will take a look once back in Jan. |
try: | ||
import re | ||
|
||
decimal_pattern = re.compile(r"(\([^\)]+\))") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add comment with examples that this pattern could handle and not?
and struct_field["type"]["type"] == "struct" | ||
for struct_field in element_type["fields"] | ||
): | ||
raise TypeError("Nested StructType cannot be converted to PyArrow type.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Double Nested cannot ..."?
isinstance(struct_field["type"], dict) and struct_field["type"]["type"] == "struct" | ||
for struct_field in f_type["fields"] | ||
): | ||
raise TypeError("Nested StructType cannot be converted to PyArrow type.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double nested?
def test_pyarrow_schema_base(): | ||
base_schema_dict = { | ||
"type": "struct", | ||
"fields": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cover all types in this test?
@@ -71,19 +79,112 @@ def limit(self, limit: Optional[int]) -> "DeltaSharingReader": | |||
timestamp=self._timestamp | |||
) | |||
|
|||
def to_pandas(self) -> pd.DataFrame: | |||
def _get_response(self) -> ListFilesInTableResponse: | |||
response = self._rest_client.list_files_in_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can directly return self._rest_client.list...
?
tbl: PyArrowTable = ds.head(left, **pyarrow_tbl_options) | ||
pa_tables.append(tbl) | ||
left -= tbl.num_rows | ||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a hard limit? and does it require exact limit
number of rows returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it results exactly the number of rows asked for. but does it file by file instead as I saw that in practice this is faster than just calling .head()
on the pyarrow table.
So this kind of mimics the implementation thats done in _to_pandas()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. I wonder we don't have to fail if we got a few more rows?
same_scheme = False | ||
break | ||
|
||
assert same_scheme, "All files did not follow the same URL scheme." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "All files should follow the same URL scheme" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: what's an example of this failure? And is it possible to add a test case to cover it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think the delta server can return files from different places for for the same table but I added this case just in case if in the future delta sharing turns in to a cross cloud service (some data in s3 and some data in GCS).
Another reason for adding this was that we dont have to initialize FSSPEC FS for each path if they all follow the same scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't foresee delta sharing support a single table from multiple clouds any time soon.
Plus no test coverage on the code, could we rather turn this into a TODO.
assert ds.count_rows() == 0 | ||
|
||
|
||
def test_to_pyarrow_table_non_partitioned(tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this test and test_to_pyarrow_dataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats for pyarrow dataset (lazy, faster) and this is for pyarrow table (eager).
internally pyarrow implementation relies on dataset implementation in the PR
@@ -1,7 +1,7 @@ | |||
# Dependencies. When you update don't forget to update setup.py. | |||
pandas | |||
pyarrow>=4.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why are these removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove temporarily to see if this is causing the build to fail. I will add it again before the PR is completely ready.
please see my original comment regarding this.
Also what's your thought on loading cdf in pyarrow? is it something not needed for now? |
I would prefer to raise a separate PR for the CDF to keep things simple and concise, this is just for the data. |
@chitralverma @linzhou-db can we revive this PR? |
Adds separate implementations for
load_as_pyarrow_table
andload_as_pyarrow_dataset
that allows users to read delta sharing tables as pyarrow table and dataset respectively.closes #238