✨feat(source-microsoft-sharepoint): add all sites iteration #55912

aldogonzalez8 · 2025-03-22T12:16:26Z

What

Enhanced the Microsoft SharePoint connector to iterate over the drives of all SharePoint sites.

The implementation supports:

- Default site drives (when no site URL is provided)
- All site drives across the tenant (when site URL ends with 'sharepoint.com/sites/') ✨NEW!
- Specific site drives (when a full site URL is provided)
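The three cases above can be sketched as a small dispatch on the configured site URL. This is only an illustration of the rule as described here, not the connector's actual code: `classify_site_url` and the returned labels are hypothetical names, and the only detail taken from the description is the `sharepoint.com/sites/` suffix.

```python
from typing import Optional

# Suffix quoted in the PR description; everything else here is illustrative.
ALL_SITES_SUFFIX = "sharepoint.com/sites/"


def classify_site_url(site_url: Optional[str]) -> str:
    """Return which of the three drive-discovery modes a site URL selects."""
    if not site_url:
        # No URL configured: fall back to the default site drives.
        return "default"
    if site_url.strip().endswith(ALL_SITES_SUFFIX):
        # URL stops at the sites prefix: iterate every site in the tenant.
        return "all_sites"
    # Anything else is treated as a full URL to one specific site.
    return "specific"
```

For example, `classify_site_url("https://contoso.sharepoint.com/sites/marketing")` selects the specific-site mode, while the same URL cut off after `/sites/` selects the all-sites mode.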

Follow-up to #54658
Resolves #37003

How

- Refactored the get_site_drive method in SourceMicrosoftSharePointStreamReader to handle different site URL patterns
- Updated the authentication flow to properly use tenant-specific scopes when needed
- Implemented comprehensive test cases to verify the correct behavior for each scenario

Review guide

airbyte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py - Focus on the get_site_drive method implementation
airbyte-integrations/connectors/source-microsoft-sharepoint/unit_tests/test_stream_reader.py - Check the test_get_site_drive test cases

User Impact

Users can now more easily configure the connector to access drives from:
- Default SharePoint site
- All sites in their tenant
- A specific SharePoint site

No breaking changes to existing functionality

Can this PR be safely reverted and rolled back?

[x] YES 💚
[ ] NO ❌

vercel · 2025-03-22T12:16:31Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	🛑 Canceled (Inspect)			Mar 23, 2025 2:31am

aldogonzalez8 · 2025-03-23T20:54:22Z

/approve-regression-tests

Check job output.

✅ Approving regression tests

maxi297

I added a lot of comments regarding the structure, but the functionality seems to make sense to me. Feel free to push back on any of these, as I'm not sure I fully understand the context.

maxi297 · 2025-03-24T15:20:20Z

airbyte-integrations/connectors/source-microsoft-sharepoint/pyproject.toml

 Office365-REST-Python-Client = "==2.5.5"
 smart-open = "==6.4.0"
-airbyte-cdk = {extras = ["file-based"], version = "^6"}
+airbyte-cdk = {extras = ["file-based"], version = "^6.38.5"}


Should we set this back to ^6 so that it gets the CDK updates?

maxi297 · 2025-03-24T15:28:16Z

...yte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/utils.py

+    return site
+
+
+def get_site_prefix(site: Site):


Can we add return types for these added functions?

maxi297 · 2025-03-24T15:35:06Z

...yte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/utils.py

+
+@lru_cache(maxsize=None)
+def get_site(graph_client: GraphClient, site_url: str = None):
+    if site_url:


nit: I know some tools would consider not returning immediately as a code smell (example). I would refactor this as such:

def get_site(graph_client: GraphClient, site_url: str = None):
    if site_url:
        return execute_query_with_retry(graph_client.sites.get_by_url(site_url))
    return execute_query_with_retry(graph_client.sites.root.get())

maxi297 · 2025-03-24T15:41:23Z

...grations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py

@@ -103,6 +137,20 @@ def get_access_token(self):
        # Directly fetch a new access token from the auth_client each time it's called
        return self.auth_client._get_access_token()["access_token"]

+    def get_token_response_object(self, tenant_prefix: str = None) -> Callable:


It seems like if we default the value to None, the parameter should be typed as Optional. However, I don't see a case where tenant_prefix is not provided in the current code. What am I missing?

I see the _get_scope supports this case but I'm not sure when this can happen
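To make the question concrete, here is a minimal sketch of what an explicitly optional `tenant_prefix` could look like. It is an assumption, not the connector's `_get_scope`: `build_scopes` is a hypothetical name, and the two scope strings are the usual Microsoft identity platform `.default` forms, which may differ from what the connector really requests.

```python
from typing import List, Optional

# Assumed fallback scope when no tenant prefix is known.
GRAPH_SCOPE = "https://graph.microsoft.com/.default"


def build_scopes(tenant_prefix: Optional[str] = None) -> List[str]:
    """Pick a tenant-specific SharePoint scope when a prefix is given, else the Graph scope."""
    if tenant_prefix:
        return [f"https://{tenant_prefix}.sharepoint.com/.default"]
    return [GRAPH_SCOPE]
```

If no caller ever omits the prefix, the alternative is to drop the default and make the parameter a required `str`, which removes the unreachable branch.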

maxi297 · 2025-03-24T15:42:35Z

...grations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py

+            for site_drive in drives:
+                all_sites_drives.add_child(site_drive)
+        return all_sites_drives
+
    def get_site_drive(self):


I know this was outside of the scope of the PR but can we add typing here?

maxi297 · 2025-03-24T15:46:11Z

airbyte-integrations/connectors/source-microsoft-sharepoint/unit_tests/test_stream_reader.py

+            else:
+                result = reader.get_site_drive()
+
+                if expected_call == "drives":


open question: It seems like we have three distinct tests that we tried to fit into a single parametrize. I fear that if one of those cases changes, we would need to update this test, and that would impact the other cases just because they share the same test, not because their behavior has also changed. WDYT?
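One way the split could look, sketched with a stand-in class rather than the real stream reader. `FakeReader` and the returned labels are hypothetical; the point is only that each scenario gets its own test function with its own setup and assertion, so changing one case does not touch the others.

```python
from typing import List, Optional


class FakeReader:
    """Hypothetical stand-in for the stream reader; records which branch ran."""

    def __init__(self, site_url: Optional[str]) -> None:
        self.site_url = site_url
        self.calls: List[str] = []

    def get_site_drive(self) -> str:
        if not self.site_url:
            self.calls.append("default_drives")
        elif self.site_url.endswith("sharepoint.com/sites/"):
            self.calls.append("all_sites_drives")
        else:
            self.calls.append("specific_site_drives")
        return self.calls[-1]


def test_default_site_when_no_url() -> None:
    assert FakeReader(None).get_site_drive() == "default_drives"


def test_all_sites_when_url_ends_with_sites_prefix() -> None:
    reader = FakeReader("https://contoso.sharepoint.com/sites/")
    assert reader.get_site_drive() == "all_sites_drives"


def test_specific_site_when_full_url() -> None:
    reader = FakeReader("https://contoso.sharepoint.com/sites/marketing")
    assert reader.get_site_drive() == "specific_site_drives"
```

The trade-off is some repeated setup in exchange for failures that point at exactly one scenario.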

maxi297 · 2025-03-24T16:06:00Z

...yte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/utils.py

+
+
+@lru_cache(maxsize=None)
+def get_site(graph_client: GraphClient, site_url: str = None):


It seems like get_site is always called with get_site_prefix. Should we make this private to avoid exposing more things than needed?

maxi297 · 2025-03-24T16:06:45Z

...yte-integrations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/utils.py

+
+
+@lru_cache(maxsize=None)
+def get_site(graph_client: GraphClient, site_url: str = None):


I'm not sure how lru_cache works, but it seems like GraphClient does not implement __eq__, so I assume the cache will only be hit when the same object reference is passed. Is this what we want here?
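The behaviour in question can be checked in isolation. `functools.lru_cache` builds its key from the hashes of the arguments, and a class that defines neither `__eq__` nor `__hash__` is hashed and compared by identity. The sketch below uses a hypothetical `FakeClient` in place of GraphClient, on the assumption that GraphClient does not override either method.

```python
from functools import lru_cache


class FakeClient:
    """Hypothetical stand-in for GraphClient: defines neither __eq__ nor __hash__."""

    def __init__(self, tenant: str) -> None:
        self.tenant = tenant


@lru_cache(maxsize=None)
def get_site(client: FakeClient, site_url: str = "") -> str:
    # Stands in for the network call that the cache is meant to avoid.
    return f"{client.tenant}:{site_url or 'root'}"


first = FakeClient("contoso")
second = FakeClient("contoso")  # same data, different object

get_site(first, "a")   # miss: first time this (object, url) pair is seen
get_site(first, "a")   # hit: same object reference, same url
get_site(second, "a")  # miss: equal-looking client, but a different identity
info = get_site.cache_info()
```

So the cache only helps when the same client instance is reused. Note also that the cache holds a strong reference to every client it has seen, which keeps those objects alive for as long as the entries remain.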

maxi297 · 2025-03-24T16:08:44Z

...grations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py

+
+        return found_sites
+
+    def get_drives_from_sites(self, sites: List[MutableMapping[str, Any]]) -> EntityCollection:


This seems like a private method, right? Should we prefix the name with _?

maxi297 · 2025-03-24T16:10:42Z

...grations/connectors/source-microsoft-sharepoint/source_microsoft_sharepoint/stream_reader.py

+            List[MutableMapping[str, Any]]: A list of site information.
+        """
+        _, root_site_prefix = get_site_prefix(get_site(self.one_drive_client))
+        ctx = self.get_client_context()


It seems like get_site_prefix is also called within get_client_context. Since get_client_context should be private, should we pass the output of get_site_prefix as an argument to get_client_context? It would prevent us from calling it twice in the same context.
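A rough sketch of the suggested shape, with the prefix resolved once and handed to the private context builder. Every name here (`Reader`, `_resolve_root_site_prefix`, `_get_client_context`, `get_all_sites`) is hypothetical and the bodies are placeholders; only the call structure is the point.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reader:
    """Hypothetical reader; `lookups` counts how often the prefix is resolved."""

    lookups: int = 0

    def _resolve_root_site_prefix(self) -> str:
        # Placeholder for get_site_prefix(get_site(...)).
        self.lookups += 1
        return "contoso"

    def _get_client_context(self, root_site_prefix: str) -> str:
        # Accepts the prefix as an argument instead of resolving it again.
        return f"https://{root_site_prefix}.sharepoint.com"

    def get_all_sites(self) -> List[str]:
        root_site_prefix = self._resolve_root_site_prefix()
        ctx = self._get_client_context(root_site_prefix)
        return [f"{ctx}/sites/example"]


reader = Reader()
sites = reader.get_all_sites()
```

With this layout the prefix lookup happens once per call to `get_all_sites`, regardless of whether the lookup itself is cached.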

aldogonzalez8 added 2 commits March 22, 2025 06:06

sourece-microsoft-sharepoint: add feature to iterate through all sites

1b3db72

sourece-microsoft-sharepoint: ruff format

8966435

aldogonzalez8 self-assigned this Mar 22, 2025

octavia-squidington-iii added the connectors/source/microsoft-sharepoint label Mar 22, 2025

sourece-microsoft-sharepoint: update release information

78846ad

aldogonzalez8 changed the title ~~Aldogonzalez8/source microsoft sharepoint/add support iteration~~ ✨feat(source-microsoft-sharepoint): add all sites iteration Mar 22, 2025

sourece-microsoft-sharepoint: minor changes in docstrings

db1f266

vercel bot deployed to Preview March 22, 2025 12:49 View deployment

aldogonzalez8 added 3 commits March 22, 2025 14:33

sourece-microsoft-sharepoint: increase testing

7fa6cd7

sourece-microsoft-sharepoint: merge from master

47633fc

sourece-microsoft-sharepoint: remove extra comments

c17a7b3

aldogonzalez8 mentioned this pull request Mar 22, 2025

✨feat(source-microsoft-sharepoint): Provide ability to sync other sites than Main sharepoint site #54658

Merged

2 tasks

aldogonzalez8 added 2 commits March 22, 2025 15:03

sourece-microsoft-sharepoint: bump cdk

03ef494

sourece-microsoft-sharepoint: remove unnecesary comments

d271e1e

aldogonzalez8 requested a review from a team March 22, 2025 23:04

vercel bot deployed to Preview March 23, 2025 02:31 View deployment

maxi297 reviewed Mar 24, 2025
