✨feat(source-microsoft-sharepoint): add all sites iteration #55912

Open
wants to merge 9 commits into
base: master

Conversation

aldogonzalez8
Contributor

@aldogonzalez8 aldogonzalez8 commented Mar 22, 2025

What

Enhanced the Microsoft SharePoint connector to iterate over all SharePoint site drives.

The implementation supports:

  • Default site drives (when no site URL is provided)
  • All site drives across the tenant (when site URL ends with 'sharepoint.com/sites/') ✨NEW!
  • Specific site drives (when a full site URL is provided)
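
The three cases above could be distinguished roughly as follows. This is a minimal sketch, not the connector's actual code: `classify_site_url` and the `SiteScope` enum are hypothetical names, and the only rule taken from the description is the `sharepoint.com/sites/` suffix check.

```python
from enum import Enum
from typing import Optional


class SiteScope(Enum):
    """Hypothetical labels for the three configurations described above."""

    DEFAULT_SITE = "default"
    ALL_SITES = "all"
    SPECIFIC_SITE = "specific"


def classify_site_url(site_url: Optional[str]) -> SiteScope:
    """Map a configured site URL to one of the three cases (illustrative only)."""
    if not site_url:
        # No site URL provided: fall back to the default site.
        return SiteScope.DEFAULT_SITE
    if site_url.endswith("sharepoint.com/sites/"):
        # URL stops at ".../sites/": iterate over every site in the tenant.
        return SiteScope.ALL_SITES
    # Anything else is treated as the full URL of one specific site.
    return SiteScope.SPECIFIC_SITE
```

For example, `classify_site_url("https://contoso.sharepoint.com/sites/")` would select the all-sites path, where `contoso` is a placeholder tenant name.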

Follow-up to #54658
Resolves #37003

How

  • Refactored the get_site_drive method in SourceMicrosoftSharePointStreamReader to handle different site URL patterns
  • Updated the authentication flow to properly use tenant-specific scopes when needed
  • Implemented comprehensive test cases to verify the correct behavior for each scenario

Review guide

  1. airbyte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py - Focus on the get_site_drive method implementation
  2. airbyte-integrations/connectors/source-microsoft-sharepoint/unit_tests/test_stream_reader.py - Check the test_get_site_drive test cases

User Impact

  • Users can now more easily configure the connector to access drives from:
    • Default SharePoint site
    • All sites in their tenant
    • A specific SharePoint site

No breaking changes to existing functionality.

Can this PR be safely reverted and rolled back?

[x] YES 💚
[ ] NO ❌

@aldogonzalez8 aldogonzalez8 self-assigned this Mar 22, 2025

vercel bot commented Mar 22, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: airbyte-docs · Status: 🛑 Canceled · Updated (UTC): Mar 23, 2025 2:31am

@aldogonzalez8 aldogonzalez8 changed the title Aldogonzalez8/source microsoft sharepoint/add support iteration ✨feat(source-microsoft-sharepoint): add all sites iteration Mar 22, 2025
Contributor Author

aldogonzalez8 commented Mar 23, 2025

/approve-regression-tests

Check job output.

✅ Approving regression tests

Contributor

@maxi297 maxi297 left a comment

I added a lot of comments regarding the structure, but the functionality seems to make sense to me. Feel free to push back on any of these, as I'm not sure I fully understand the context.

  Office365-REST-Python-Client = "==2.5.5"
  smart-open = "==6.4.0"
- airbyte-cdk = {extras = ["file-based"], version = "^6"}
+ airbyte-cdk = {extras = ["file-based"], version = "^6.38.5"}

Should we set this back to ^6 so that it gets the CDK updates?

    return site


def get_site_prefix(site: Site):

Can we add return types for these added functions?
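
As an illustration of the request, return annotations might look like the following. This is a sketch under assumptions: `SiteStub` stands in for the library's `Site` class, the tuple shape is inferred only from the `_, root_site_prefix = get_site_prefix(...)` call quoted later in this review, and the URL parsing is hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlparse


@dataclass
class SiteStub:
    """Hypothetical stand-in for the client library's Site object."""

    web_url: str


def get_site_prefix(site: SiteStub) -> Tuple[str, str]:
    """Return (site_url, tenant_prefix); the parsing here is illustrative only."""
    host = urlparse(site.web_url).netloc  # e.g. "contoso.sharepoint.com"
    tenant_prefix = host.split(".")[0]  # first label of the host name
    return site.web_url, tenant_prefix
```

With the annotation in place, callers and type checkers can see that the function yields a two-element tuple of strings.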


@lru_cache(maxsize=None)
def get_site(graph_client: GraphClient, site_url: str = None):
    if site_url:

nit: I know some tools flag not returning immediately as a code smell (example). I would refactor it like this:

def get_site(graph_client: GraphClient, site_url: str = None):
    if site_url:
        return execute_query_with_retry(graph_client.sites.get_by_url(site_url))
    return execute_query_with_retry(graph_client.sites.root.get())

@@ -103,6 +137,20 @@ def get_access_token(self):
        # Directly fetch a new access token from the auth_client each time it's called
        return self.auth_client._get_access_token()["access_token"]

    def get_token_response_object(self, tenant_prefix: str = None) -> Callable:

Since tenant_prefix defaults to None, it seems this should be typed as optional. However, I don't see a case in the current code where tenant_prefix is not provided. What am I missing?

I see that _get_scope supports this case, but I'm not sure when it can happen.
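
To make the question concrete, a signature that matches a `None` default would use `Optional[str]`. The sketch below is an assumption about what a `_get_scope`-style helper could do, not the PR's code: `get_scope` is a hypothetical standalone function, and the two `.default` scope strings are the usual Microsoft identity platform forms for Microsoft Graph and for a SharePoint tenant host.

```python
from typing import Optional


def get_scope(tenant_prefix: Optional[str] = None) -> str:
    """Pick an OAuth scope; hypothetical helper for illustration only."""
    if tenant_prefix:
        # Tenant-specific scope for the SharePoint host of that tenant.
        return f"https://{tenant_prefix}.sharepoint.com/.default"
    # No tenant prefix given: fall back to the Microsoft Graph scope.
    return "https://graph.microsoft.com/.default"
```

If no caller ever omits `tenant_prefix`, the alternative is to drop the default and make the parameter a required `str`.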

        for site_drive in drives:
            all_sites_drives.add_child(site_drive)
        return all_sites_drives

    def get_site_drive(self):

I know this is outside the scope of the PR, but can we add typing here?

    else:
        result = reader.get_site_drive()

    if expected_call == "drives":

open question: It seems like we have three distinct tests that we tried to fit into a single parametrize. I fear that if one of those cases changes, we will need to update this test, and that will affect the other cases only because they share the same test, not because their behavior has changed. WDYT?
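
One way to read this suggestion is one test function per scenario, so that a change to one path touches only its own test. The sketch below is illustrative and rests on assumptions: `FakeReader` is a hypothetical stand-in for the stream reader, and the returned labels are invented for the example rather than taken from the PR's tests.

```python
from typing import Optional


class FakeReader:
    """Hypothetical stand-in for the stream reader; reports which path ran."""

    def __init__(self, site_url: Optional[str] = None) -> None:
        self.site_url = site_url

    def get_site_drive(self) -> str:
        if not self.site_url:
            return "default_drives"
        if self.site_url.endswith("sharepoint.com/sites/"):
            return "all_sites_drives"
        return "specific_site_drives"


def test_default_site_drives() -> None:
    # Only the "no site URL" behavior is asserted here.
    assert FakeReader().get_site_drive() == "default_drives"


def test_all_sites_drives() -> None:
    # Only the "iterate over all sites" behavior is asserted here.
    reader = FakeReader("https://contoso.sharepoint.com/sites/")
    assert reader.get_site_drive() == "all_sites_drives"


def test_specific_site_drives() -> None:
    # Only the "single site" behavior is asserted here.
    reader = FakeReader("https://contoso.sharepoint.com/sites/marketing")
    assert reader.get_site_drive() == "specific_site_drives"
```

Each function carries its own setup and assertion, so the `if expected_call == ...` branching inside a shared test body is no longer needed.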



@lru_cache(maxsize=None)
def get_site(graph_client: GraphClient, site_url: str = None):

It seems like get_site is always called with get_site_prefix. Should we make this private to avoid exposing more things than needed?



@lru_cache(maxsize=None)
def get_site(graph_client: GraphClient, site_url: str = None):

I'm not sure how lru_cache works, but GraphClient does not seem to implement __eq__, so I assume the cache will only be hit when the exact same object reference is passed. Is this what we want here?
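
The behavior in question can be checked with a small experiment. In Python, a class that defines neither `__eq__` nor `__hash__` is hashed and compared by identity, and `functools.lru_cache` keys on its arguments, so two distinct but equivalent client objects produce separate cache entries. `FakeGraphClient` below is a hypothetical stand-in, not the real `GraphClient`.

```python
from functools import lru_cache
from typing import Optional, Tuple


class FakeGraphClient:
    """Stand-in client that defines neither __eq__ nor __hash__."""

    def __init__(self, tenant: str) -> None:
        self.tenant = tenant


@lru_cache(maxsize=None)
def get_site(client: FakeGraphClient, site_url: Optional[str] = None) -> Tuple[str, str]:
    # Placeholder for the real lookup; returns a value derived from the inputs.
    return client.tenant, site_url or "root"


client_a = FakeGraphClient("contoso")
client_b = FakeGraphClient("contoso")  # same data, different object

get_site(client_a)  # first call with client_a: cache miss
get_site(client_a)  # same object again: cache hit
get_site(client_b)  # equivalent but distinct object: cache miss

info = get_site.cache_info()
print(info.hits, info.misses, info.currsize)
```

So the cache helps only as long as the same client instance is reused across calls, and every cached entry also keeps its client object alive for the lifetime of the cache.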


        return found_sites

    def get_drives_from_sites(self, sites: List[MutableMapping[str, Any]]) -> EntityCollection:

This seems like a private method, right? Should we prefix the name with _?

            List[MutableMapping[str, Any]]: A list of site information.
        """
        _, root_site_prefix = get_site_prefix(get_site(self.one_drive_client))
        ctx = self.get_client_context()

It seems like get_site_prefix is also called within get_client_context. Since get_client_context should be private, should we pass the output of get_site_prefix as arguments to get_client_context? That would prevent us from calling it twice in the same context.
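
A rough sketch of that idea: resolve the site URL and prefix once, then hand both values to the private helper. Every name here is a hypothetical stand-in (`lookup_site_prefix` replaces `get_site_prefix(get_site(...))`, and the returned dict replaces the real client context), so this shows only the call pattern.

```python
from typing import Dict, List, Tuple

lookups: List[str] = []  # records each lookup so the call count is visible


def lookup_site_prefix() -> Tuple[str, str]:
    """Hypothetical stand-in for get_site_prefix(get_site(...))."""
    lookups.append("lookup")
    return "https://contoso.sharepoint.com", "contoso"


def _get_client_context(site_url: str, tenant_prefix: str) -> Dict[str, str]:
    """Private helper that receives resolved values instead of looking them up."""
    return {"base_url": site_url, "tenant_prefix": tenant_prefix}


def get_all_sites_context() -> Dict[str, str]:
    # Resolve once, then reuse the result for the client context.
    site_url, tenant_prefix = lookup_site_prefix()
    return _get_client_context(site_url, tenant_prefix)
```

Calling `get_all_sites_context()` appends a single entry to `lookups`, whereas a helper that resolved the prefix internally would add a second one.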

Development

Successfully merging this pull request may close these issues.

[source-sharepoint] connector fetches documents only from the main site
3 participants