Skip to content
This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

Add verify_all_repo_access config parameter #8

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

koszti
Copy link
Contributor

@koszti koszti commented Aug 26, 2021

Problem

If lot of repositories selected then verifying access to all of them takes lot of time in discovery mode. In case of 2000 repositories it takes more than 10 minutes, consumes calls from API rate limit and not allowing to deploy new config changes quickly in parallel.

Proposed changes

Introducing verify_all_repo_access configurable parameter that defaults to True. It it' False then it verifies only one repository access from the selected list and not all of them. If no access to a certain repository then the tap will fail at sync time and not at discovery time making the discovery and config-import time much quicker.

Types of changes

What types of changes does your code introduce to PipelineWise?

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@Samira-El
Copy link
Contributor

I see the point of this, it can be useful. I'm concerned tho that disabling this feature would end up delaying catching problematic repos and risking the health of sync, because the latter would not be able to proceed unless the access is fixed as opposed to keeping the sync healthy and having a failing discovery.

I also think that this goes against what the discovery mode is meant to do.

@@ -349,6 +349,8 @@ def verify_access_for_repo(config):
session.headers.update({'authorization': 'token ' + access_token, 'per_page': '1', 'page': '1'})

repositories = get_repos_from_config(config)
if not config.get('verify_all_repo_access', True) and len(repositories) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to avoid the loop if that config is set instead of mutating the param passed.
Something like:

verifify_all = config.get('verify_all_repo_access', True)
if verify_all:
    for repo in repositories:
        verify_access(repo)
else: 
    [...]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks addressed by b4f9e5a

It's no longer mutating the passed parameter however it keeps the loop to avoid repeating code. Wdyt?

Copy link
Contributor

@diegofurtados diegofurtados left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other issue could be if you have diff repositories and don't have access to a few of them.
could happen in the same or diff org

@koszti
Copy link
Contributor Author

koszti commented Aug 26, 2021

I see the point of this, it can be useful. I'm concerned tho that disabling this feature would end up delaying catching problematic repos and risking the health of sync, because the latter would not be able to proceed unless the access is fixed as opposed to keeping the sync healthy and having a failing discovery.

I also think that this goes against what the discovery mode is meant to do.

fair points. The question is if we are happy that the current ppw config import is running for 15-20 minutes? Also it consumes 2000 api calls from the max 5000/hour and can cause the parallel running tap to fail by exceeding rate limit. For the same reason the parallel running tap can also cause importing new config and the discovery to fail.

We have a lot of config changes a day, and each triggers a full discovery that requires 2000 api requests. Currently we can’t run the tap in business hours only once a day at night otherwise the tap or importing new ppw config is expected to fail. Running the tap multiple times a day is not a requirement tho.

Having more than 5000 call rate limit is not supported by personal access tokens, and if we need more than that then we need to authenticate the tap as github app that gives 15000 requests/hour.

This feature is optional, defaults to the current behaviour and it only adds an extra option that we not necessarily have to use.

@koszti
Copy link
Contributor Author

koszti commented Aug 26, 2021

Other issue could be if you have diff repositories and don't have access to a few of them.
could happen in the same or diff org

in that case the tap fails at runtime and not when importing the config. The two things are running at different time so it’s already possible that we have access when importing the config but have no access when running the tap.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants