-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🥅 455 - Error handling and retries + job reporting #463
Conversation
src/jobs/ega/services/permissions.ts
Outdated
message: permissions.message, | ||
datasetId: datasetAccessionId, | ||
}); | ||
// TODO: add a max number of retries to prevent endless loop, then set paging = false? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling out my comment here - with the request retry setup for 504 and connection reset errors in the EgaClient, i'm wondering if any "max retries" should be added there instead? So far in testing I haven't run into a scenario where pagination never stops 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should probably setup a mechanism to prevent infinite loops on your retry mechanisms. You can give them a default max retries and then also accept a parameter to control from here how many max retries to allow.
} | ||
logger.info('Completed processing permissions for all DACO approved users.'); | ||
const endTime = new Date(); | ||
const timeElapsed = moment(endTime).diff(startTime, 'minutes'); | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My report gathering here and in revokePermissions
feels messy, I am open to suggestions. My main goal in the results data is to track whether the expected outcomes match the actual results, i.e.
- the total number of EgaUsers inputted to
processPermissionsForApprovedUsers
matches the total number of users successfully processed - the total number of Datasets inputted to
removeExpiredPermissions
matches the total number of datasets successfully processed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation may feel messy because it is representing 3 different outcomes with only two states:
- errors during job
- complete with incorrect number of processed users
- complete with correct numbers
But you are only reporting: success/failure.
This also doesn't capture your second bullet, the number of datasets vs number processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha that makes it clearer. Maybe my third state can be something like INCOMPLETE
, so i would have SUCCESS
, INCOMPLETE
or FAILURE
.
For the second bullet point, I do have the total number of datasets recorded at the top level of the report, as datasetsCount
. Does it make the report clearer to add this to the permissionsRevoked
portion of the report, since that is the baseline i'm measuring against (maybe the same could be said for approvedEgaUsersCount
and the permissionsCreated
portion)
@@ -146,7 +84,13 @@ export const egaApiClient = async () => { | |||
|
|||
const getAccessToken = async (): Promise<IdpToken> => { | |||
if (currentToken) { | |||
return currentToken; | |||
const tokenIsExpired = await tokenExpired(currentToken); | |||
if (tokenIsExpired) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at where this tokenExpiry check is called, and I'm not sure if it is useful? since the only places getAccessToken
is called is on the initial EgaClient creation, and after a 401 occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for describing this, I think you are correct in this context. As written, we don't call this before using the EGA axios client, instead we store the current token in the ega axios client and then update it with the fresh token after we retrieve it and that keeps our client up to date.
Since this is a cron-job, the client is created, then constantly used, and then the job ends - that makes this pattern acceptable. If this client was being used in a long lived server where the client could sit unused for a while and then the token go stale, then we would be better served to always call getAccessToken
before using the EGA axios client. In that alternate use case, this expired check would be useful. As this is just a cron job though, I'm fine with it as is.
I went back and forth on this message like 3 times so I'll probably come up with some new ideas by the time you read this.....
src/jobs/ega/axios/egaClient.ts
Outdated
return currentToken; | ||
const tokenIsExpired = await tokenExpired(currentToken); | ||
if (tokenIsExpired) { | ||
logger.info('token is expired'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is debug level logging...
@@ -146,7 +84,13 @@ export const egaApiClient = async () => { | |||
|
|||
const getAccessToken = async (): Promise<IdpToken> => { | |||
if (currentToken) { | |||
return currentToken; | |||
const tokenIsExpired = await tokenExpired(currentToken); | |||
if (tokenIsExpired) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for describing this, I think you are correct in this context. As written, we don't call this before using the EGA axios client, instead we store the current token in the ega axios client and then update it with the fresh token after we retrieve it and that keeps our client up to date.
Since this is a cron-job, the client is created, then constantly used, and then the job ends - that makes this pattern acceptable. If this client was being used in a long lived server where the client could sit unused for a while and then the token go stale, then we would be better served to always call getAccessToken
before using the EGA axios client. In that alternate use case, this expired check would be useful. As this is just a cron job though, I'm fine with it as is.
I went back and forth on this message like 3 times so I'll probably come up with some new ideas by the time you read this.....
// set new token on original request that had the 401 error | ||
error.config.headers['Authorization'] = refreshedBearerToken; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we modifying the original request now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is modifying the original request config
(error.config
) to have the refreshed token, so that when apiAxiosClient.request
is returned on line 145 it has the new token value in the authorization header
src/jobs/ega/axios/egaClient.ts
Outdated
`${CLIENT_NAME} - ${error.response.status} - ${error.response.statusText} - retrying original request.`, | ||
); | ||
// retry original request. this response error shouldn't be an issue because throttling is in place | ||
return apiAxiosClient.request(error.config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apiAxiosClient.request
is not throttled. This request can occur in addition to all the throttled requests, no?
This holds true for every use of client.request(). Should we throw a wrapper over that to throttle it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on slack - will throttle the axios methods directly, and remove the throttle
call on the egaClient methods
src/jobs/ega/axios/idpClient.ts
Outdated
* @param token IdpToken | ||
* @returns Promise<boolean> | ||
*/ | ||
export const tokenExpired = async (token: IdpToken): Promise<boolean> => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convention is probably to name this isTokenExpired
but this is a minor quibble. The description should probably state somethings like "returns true if token is expired, otherwise returns false. the token may be invalid and still return false"
src/jobs/ega/fetchPublicKey.ts
Outdated
console.error('Keycloak realm info not provided in config, aborting fetch attempt.'); | ||
return undefined; | ||
} | ||
console.info(`Fetching public key from Keycloak realm ${authRealmName}.`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug level logging
} | ||
logger.info('Completed processing permissions for all DACO approved users.'); | ||
const endTime = new Date(); | ||
const timeElapsed = moment(endTime).diff(startTime, 'minutes'); | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation may feel messy because it is representing 3 different outcomes with only two states:
- errors during job
- complete with incorrect number of processed users
- complete with correct numbers
But you are only reporting: success/failure.
This also doesn't capture your second bullet, the number of datasets vs number processed.
src/jobs/ega/services/permissions.ts
Outdated
message: permissions.message, | ||
datasetId: datasetAccessionId, | ||
}); | ||
// TODO: add a max number of retries to prevent endless loop, then set paging = false? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should probably setup a mechanism to prevent infinite loops on your retry mechanisms. You can give them a default max retries and then also accept a parameter to control from here how many max retries to allow.
src/jobs/ega/services/permissions.ts
Outdated
@@ -199,16 +292,76 @@ export const processPermissionsForDataset = async ( | |||
for await (const requests of chunkedRevokeRequests) { | |||
const revokeResponse = await client.revokePermissions(requests); | |||
if (isSuccess(revokeResponse)) { | |||
logger.info( | |||
logger.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is info logging. We are recording an action we took that changed state and we may want to find a record of in the future.
src/jobs/ega/axios/egaClient.ts
Outdated
baseURL: apiUrl, | ||
headers: { | ||
'Content-Type': 'application/json', | ||
}, | ||
}); | ||
axiosRetry(client, { retries: DEFAULT_RETRIES }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added axios-retry
and set max retries to 3. When i ran this locally i didn't encounter any retryable errors beyond this number, but the job report on completion didn't indicate that any user/dataset request was missed. All the totals lined up to what I expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you run or write a test to see how this behaves when it does fail 3+ times?
baseURL: authHost, | ||
}); | ||
axiosRetry(client, { retries: DEFAULT_RETRIES }); | ||
return client; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added same retry strategy here
@@ -31,6 +41,34 @@ import { | |||
createRevokePermissionRequest, | |||
} from '../utils'; | |||
|
|||
/** | |||
* Parse completionStatus for a reconciliation step, based on the number of successfully processed items vs total expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added an INCOMPLETE
status to reflect a scenario where no errors are recorded, but the expected totals do not line up, i.e. the number of permissions for a dataset does not match the approved users count. This could mean a request was dropped somewhere in the process, but did not register as an error, like a failed retry.
@@ -113,10 +193,16 @@ export const processPermissionsForApprovedUsers = async ( | |||
(perm) => perm.dataset_accession_id, | |||
); | |||
const datasetsRequiringPermissions = datasets.map((dataset) => dataset.accession_id); | |||
// if a dataset is removed from the incoming datasets list argument, i.e. the user has more dataset permissions than the incoming list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just left this comment here, to remind me (and perhaps others) how this call to difference()
should behave. Not sure how likely it is that a dataset would be removed!
@@ -39,12 +41,50 @@ const JOB_NAME = 'RECONCILE_EGA_PERMISSIONS'; | |||
* 3) Retrieve corresponding list of users from EGA API | |||
* 4) Create permissions, on each dataset, for each user on the DACO approved list, if no existing permission is found | |||
* 5) Process existing permissions for each dataset + revoke those which belong to users not on the DACO approved list | |||
* 6) Return completed JobReport | |||
* @returns Promise<JobReport<ReconciliationJobReport>> | |||
* @example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added this example report to show how the completionStatus lines up with the details
from each of the create permissions/delete permissions process steps
src/jobs/ega/types/constants.ts
Outdated
@@ -21,3 +21,4 @@ | |||
export const DEFAULT_LIMIT = 100; | |||
export const DEFAULT_OFFSET = DEFAULT_LIMIT; | |||
export const EGA_MAX_REQUEST_SIZE = 5000; | |||
export const DEFAULT_RETRIES = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this is is configurable in the env to overwrite this value, but this is a sensible default so that isn't immediately needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since i'm pushing other changes i can add this too, may as well!
// throttled axios methods | ||
const throttledDelete = throttle(apiAxiosClient.delete); | ||
const throttledGet = throttle(apiAxiosClient.get); | ||
const throttledPost = throttle(apiAxiosClient.post); | ||
const throttledPut = throttle(apiAxiosClient.put); | ||
const throttledGenericRequest = throttle(apiAxiosClient.request); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Work for #455
Adds retry logic for timeout and connection hangup errors + adds JobReport data. JobReport data includes start/end time/time elapsed per job step, so it's clear which steps take the most time to complete.
EgaClient
error.response
beforeerror.request
, because a 401 error triggers both! If theerror.request
is checked first, the refresh logic is never reached.error.response
statusECONNRESET
request error, which is less frequent in this client but does occur.ERR_BAD_REQUEST
handling forerror.request
. These will not be retriedgetAccessToken
is checked to see if it is expiredaxios-retry
to set max allowed retries for a request. Defaults to 3IdpClient
decodeToken
function, usingjsonwebtoken
ECONNRESET
error handling. This occurs frequently on the first attempt to retrieve an access tokenaxios-retry
to client, same default as in ega clientMain job
JobReport
. This will be returned torunAllJobs
.Permissions Service
limit
to 100 to speed up requestsNote: no retries added here. Adding permissions can probably suffice with just reporting errors, and any permissions missed can be picked up on the next job run. For revoking permissions, it might make sense to include one retry block, if there are any errors reported relating to revoke requests that failed? The error data for revoking includes the datasetId, so it could be a matter of retrying on specific datasets, as needed.
Main batch job
egaPermissionsReconciliation
to main job triggered by/jobs/batch-transitions
. Added with a feature flag so that when enabled, the olderapprovedUsersEmail
will not run, and vice versa. Can remove this completely once testing is complete.Types
permissionsCreated
includes details on number of permissions expected to be granted, and permissions successfully createdpermissionsRevoked
includes details on number of datasets successfully processed (results match with expected number of revocations) and number of datasets with permissions revokedConfiguration
egaReconciliationEnabled
toconfig.ts
. This will enable theegaPermissionsReconcilation
job and disable theapprovedUsersEmail
jobpublicKey
from Ega's auth serversecrets.ts
, to enable parsing of the access token expiryDocumentation
FEATURE_EGA_RECONCILIATION_ENABLED
toREADME
,.env.example