🥅 455 - Error handling and retries + job reporting #463

Merged
merged 8 commits into add-refresh-token-flow from 455-implement-error-retry
Nov 26, 2024

Conversation

anncatton
Contributor

@anncatton anncatton commented Oct 20, 2024

Work for #455

Adds retry logic for timeout and connection-hangup errors, plus JobReport data. The JobReport includes start time, end time, and time elapsed per job step, so it's clear which steps take the most time to complete.

EgaClient

  • splits up this client from the IdpClient for clarity
  • cleans up error handling in the Axios interceptors. Moves the check for error.response before error.request, because a 401 error triggers both! If error.request is checked first, the refresh logic is never reached.
  • adds retry handling for 504 error.response status
  • adds retry handling for the ECONNRESET request error, which is less frequent in this client but does occur.
  • adds ERR_BAD_REQUEST handling for error.request. These will not be retried
  • adds token expiry verification. An existing token in getAccessToken is checked to see if it is expired
  • adds axios-retry to set max allowed retries for a request. Defaults to 3
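The interceptor ordering described above can be sketched as a pure classifier. This is illustrative only (the function name and shapes are hypothetical, not the PR's actual interceptor code); the key point is that error.response must be checked before error.request, since a 401 populates both:

```typescript
// Hypothetical classifier mirroring the interceptor's decision order.
type AxiosErrorLike = {
  response?: { status: number };
  request?: unknown;
  code?: string;
};

type ErrorAction = 'REFRESH_TOKEN' | 'RETRY' | 'FAIL';

function classifyEgaError(error: AxiosErrorLike): ErrorAction {
  // 1) Response errors first: a 401 carries BOTH response and request,
  //    so checking request first would skip the token-refresh path.
  if (error.response) {
    if (error.response.status === 401) return 'REFRESH_TOKEN';
    if (error.response.status === 504) return 'RETRY'; // gateway timeout
    return 'FAIL';
  }
  // 2) Request errors: connection resets are retried; bad requests are not.
  if (error.request) {
    if (error.code === 'ECONNRESET') return 'RETRY';
    if (error.code === 'ERR_BAD_REQUEST') return 'FAIL';
  }
  return 'FAIL';
}
```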

IdpClient

  • splits from EgaClient for code clarity
  • adds decodeToken function, using jsonwebtoken
  • adds ECONNRESET error handling. This occurs frequently on the first attempt to retrieve an access token
  • other errors are not retried, but logged, then rejected
  • adds axios-retry to client, same default as in ega client
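The IdpClient retry behavior above (retry only ECONNRESET, log and reject everything else) is what axios-retry's retries and retryCondition options provide. A dependency-free sketch of the same idea, with a hypothetical helper name:

```typescript
// Sketch only: retry an async operation when it fails with ECONNRESET,
// up to a bounded attempt count; any other error rejects immediately.
const DEFAULT_RETRIES = 3;

async function withEconnresetRetry<T>(
  op: () => Promise<T>,
  retries: number = DEFAULT_RETRIES,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const code = (err as { code?: string }).code;
      // Other errors are not retried: rethrow so the caller can log + reject.
      if (code !== 'ECONNRESET') throw err;
    }
  }
  throw lastError;
}
```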

Main job

  • updates function to return completed JobReport. This will be returned to runAllJobs.
  • aborts job if fetch for datasets fails; the job cannot proceed without this data

Permissions Service

  • adds job reporting data, including error reporting
  • increases pagination limit to 100 to speed up requests
    Note: no retries added here. Adding permissions can probably suffice with just reporting errors, and any permissions missed can be picked up on the next job run. For revoking permissions, it might make sense to include one retry block if any errors are reported for revoke requests that failed. The error data for revoking includes the datasetId, so it could be a matter of retrying on specific datasets, as needed.
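The per-dataset retry idea floated in the note above is not implemented in this PR, but since revoke errors carry the datasetId, it could look roughly like this (all names hypothetical):

```typescript
// Purely illustrative: retry revocation once per dataset that reported an
// error, returning the datasetIds that still failed afterwards.
type RevokeError = { message: string; datasetId: string };

async function retryFailedRevokes(
  errors: RevokeError[],
  revokeForDataset: (datasetId: string) => Promise<boolean>,
): Promise<string[]> {
  const stillFailing: string[] = [];
  // The revoke error data includes the datasetId, so retries can target
  // specific datasets rather than re-running the whole job.
  const datasetIds = Array.from(new Set(errors.map((e) => e.datasetId)));
  for (const id of datasetIds) {
    const ok = await revokeForDataset(id).catch(() => false);
    if (!ok) stillFailing.push(id);
  }
  return stillFailing;
}
```

Anything still failing after the single retry would just be reported, to be picked up on the next job run.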

Main batch job

  • adds egaPermissionsReconciliation to main job triggered by /jobs/batch-transitions. Added with a feature flag so that when enabled, the older approvedUsersEmail will not run, and vice versa. Can remove this completely once testing is complete.

Types

  • adds types to support job result reporting. Each process reports start/end time + completion status
  • permissionsCreated includes details on number of permissions expected to be granted, and permissions successfully created
  • permissionsRevoked includes details on number of datasets successfully processed (results match with expected number of revocations) and number of datasets with permissions revoked
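The report shapes described above might look roughly like this (assumed field names, not the PR's exact type definitions; the tri-state completion status reflects the discussion later in this thread):

```typescript
// Assumed sketch of the job-report types.
type CompletionStatus = 'SUCCESS' | 'INCOMPLETE' | 'FAILURE';

interface StepReport {
  startTime: Date;
  endTime: Date;
  timeElapsed: number; // minutes
  completionStatus: CompletionStatus;
}

// Details on permissions expected to be granted vs successfully created.
interface PermissionsCreatedReport extends StepReport {
  numExpectedPermissions: number;
  numGrantedPermissions: number;
}

// Details on datasets successfully processed (results match the expected
// number of revocations) vs datasets with permissions actually revoked.
interface PermissionsRevokedReport extends StepReport {
  numDatasetsProcessed: number;
  numDatasetsWithPermissionsRevoked: number;
}
```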

Configuration

  • adds egaReconciliationEnabled to config.ts. This will enable the egaPermissionsReconciliation job and disable the approvedUsersEmail job
  • adds function to support fetching the publicKey from Ega's auth server
  • adds public key value to secrets.ts, to enable parsing of the access token expiry

Documentation

  • adds FEATURE_EGA_RECONCILIATION_ENABLED to README, .env.example

@anncatton anncatton changed the title 🥅 455 - Error handling + retry 🥅 455 - Error handling and retries + job reporting Oct 21, 2024
message: permissions.message,
datasetId: datasetAccessionId,
});
// TODO: add a max number of retries to prevent endless loop, then set paging = false?
Contributor Author

Calling out my comment here - with the request retry setup for 504 and connection reset errors in the EgaClient, I'm wondering if any "max retries" should be added there instead? So far in testing I haven't run into a scenario where pagination never stops 🤔

Member

Yes, we should probably set up a mechanism to prevent infinite loops in your retry mechanisms. You can give them a default max retries and then also accept a parameter to control from here how many max retries to allow.
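The reviewer's suggestion could be sketched like this (hypothetical names): a paging loop that takes a caller-controlled max-retries parameter with a sensible default, so a persistently failing page fetch cannot loop forever:

```typescript
// Illustrative only: bounded retries inside a pagination loop.
type Page<T> = { items: T[]; done: boolean };

async function fetchAllPages<T>(
  fetchPage: (offset: number) => Promise<Page<T>>,
  maxRetries: number = 3, // default, overridable by the caller
): Promise<T[]> {
  const results: T[] = [];
  let offset = 0;
  let failures = 0;
  let paging = true;
  while (paging) {
    try {
      const page = await fetchPage(offset);
      results.push(...page.items);
      offset += page.items.length;
      failures = 0; // reset the counter after any successful page
      if (page.done) paging = false;
    } catch {
      failures++;
      // give up rather than retry endlessly
      if (failures > maxRetries) paging = false;
    }
  }
  return results;
}
```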

}
logger.info('Completed processing permissions for all DACO approved users.');
const endTime = new Date();
const timeElapsed = moment(endTime).diff(startTime, 'minutes');
return {
Contributor Author

My report gathering here and in revokePermissions feels messy; I am open to suggestions. My main goal in the results data is to track whether the expected outcomes match the actual results, i.e.

  • the total number of EgaUsers inputted to processPermissionsForApprovedUsers matches the total number of users successfully processed
  • the total number of Datasets inputted to removeExpiredPermissions matches the total number of datasets successfully processed

Member
@joneubank joneubank Oct 22, 2024

The current implementation may feel messy because it is representing 3 different outcomes with only two states:

  1. errors during job
  2. complete with incorrect number of processed users
  3. complete with correct numbers

But you are only reporting: success/failure.

This also doesn't capture your second bullet, the number of datasets vs number processed.

Contributor Author

Aha, that makes it clearer. Maybe my third state can be something like INCOMPLETE, so I would have SUCCESS, INCOMPLETE, or FAILURE.

For the second bullet point, I do have the total number of datasets recorded at the top level of the report, as datasetsCount. Does it make the report clearer to add this to the permissionsRevoked portion of the report, since that is the baseline I'm measuring against? (Maybe the same could be said for approvedEgaUsersCount and the permissionsCreated portion.)

@@ -146,7 +84,13 @@ export const egaApiClient = async () => {

const getAccessToken = async (): Promise<IdpToken> => {
if (currentToken) {
return currentToken;
const tokenIsExpired = await tokenExpired(currentToken);
if (tokenIsExpired) {
Contributor Author

I'm looking at where this tokenExpiry check is called, and I'm not sure if it is useful? since the only places getAccessToken is called is on the initial EgaClient creation, and after a 401 occurs.

Member

Thanks for describing this, I think you are correct in this context. As written, we don't call this before using the EGA axios client, instead we store the current token in the ega axios client and then update it with the fresh token after we retrieve it and that keeps our client up to date.

Since this is a cron-job, the client is created, then constantly used, and then the job ends - that makes this pattern acceptable. If this client was being used in a long lived server where the client could sit unused for a while and then the token go stale, then we would be better served to always call getAccessToken before using the EGA axios client. In that alternate use case, this expired check would be useful. As this is just a cron job though, I'm fine with it as is.

I went back and forth on this message like 3 times so I'll probably come up with some new ideas by the time you read this.....

return currentToken;
const tokenIsExpired = await tokenExpired(currentToken);
if (tokenIsExpired) {
logger.info('token is expired');
Member

I think this is debug level logging...


Comment on lines +140 to +141
// set new token on original request that had the 401 error
error.config.headers['Authorization'] = refreshedBearerToken;
Member

Why are we modifying the original request now?

Contributor Author

This is modifying the original request config (error.config) to have the refreshed token, so that when apiAxiosClient.request is returned on line 145, it has the new token value in the Authorization header.

`${CLIENT_NAME} - ${error.response.status} - ${error.response.statusText} - retrying original request.`,
);
// retry original request. this response error shouldn't be an issue because throttling is in place
return apiAxiosClient.request(error.config);
Member

apiAxiosClient.request is not throttled. This request can occur in addition to all the throttled requests, no?

This holds true for every use of client.request(). Should we throw a wrapper over that to throttle it?

Contributor Author

As discussed on slack - will throttle the axios methods directly, and remove the throttle call on the egaClient methods

* @param token IdpToken
* @returns Promise<boolean>
*/
export const tokenExpired = async (token: IdpToken): Promise<boolean> => {
Member

Convention is probably to name this isTokenExpired, but this is a minor quibble. The description should probably state something like: "returns true if token is expired, otherwise returns false; the token may be invalid and still return false".
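An illustrative version of the check under the reviewer's suggested name and contract. The PR decodes the token with jsonwebtoken; this sketch decodes the JWT payload by hand so it stays dependency-free, and (as the suggested doc comment says) an invalid token may still return false:

```typescript
// Sketch: true if the token's exp claim is in the past, otherwise false.
// An invalid/undecodable token returns false rather than throwing.
const isTokenExpired = (accessToken: string): boolean => {
  const payloadPart = accessToken.split('.')[1];
  if (!payloadPart) return false;
  try {
    const payload = JSON.parse(
      Buffer.from(payloadPart, 'base64url').toString('utf8'),
    );
    if (typeof payload.exp !== 'number') return false;
    // exp is seconds since epoch; Date.now() is milliseconds
    return payload.exp * 1000 < Date.now();
  } catch {
    return false;
  }
};
```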

console.error('Keycloak realm info not provided in config, aborting fetch attempt.');
return undefined;
}
console.info(`Fetching public key from Keycloak realm ${authRealmName}.`);
Member

debug level logging


@@ -199,16 +292,76 @@ export const processPermissionsForDataset = async (
for await (const requests of chunkedRevokeRequests) {
const revokeResponse = await client.revokePermissions(requests);
if (isSuccess(revokeResponse)) {
logger.info(
logger.debug(
Member

This is info logging. We are recording an action we took that changed state, and that we may want to find a record of in the future.

baseURL: apiUrl,
headers: {
'Content-Type': 'application/json',
},
});
axiosRetry(client, { retries: DEFAULT_RETRIES });
Contributor Author

Added axios-retry and set max retries to 3. When I ran this locally I didn't encounter any retryable errors beyond this number, and the job report on completion didn't indicate that any user/dataset request was missed. All the totals lined up with what I expected.

Member

Did you run or write a test to see how this behaves when it does fail 3+ times?

baseURL: authHost,
});
axiosRetry(client, { retries: DEFAULT_RETRIES });
return client;
Contributor Author

added same retry strategy here

@@ -31,6 +41,34 @@ import {
createRevokePermissionRequest,
} from '../utils';

/**
* Parse completionStatus for a reconciliation step, based on the number of successfully processed items vs total expected
Contributor Author

added an INCOMPLETE status to reflect a scenario where no errors are recorded but the expected totals do not line up, i.e. the number of permissions for a dataset does not match the approved users count. This could mean a request was dropped somewhere in the process but did not register as an error, like a failed retry.
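The parsing described in the doc comment above could be a small pure function along these lines (the signature is assumed, not copied from the PR):

```typescript
// Sketch: derive a step's completion status from its error count and
// processed-vs-expected totals.
type CompletionStatus = 'SUCCESS' | 'INCOMPLETE' | 'FAILURE';

function parseCompletionStatus(
  errorCount: number,
  processedCount: number,
  expectedCount: number,
): CompletionStatus {
  if (errorCount > 0) return 'FAILURE';
  // No errors recorded, but the totals do not line up: a request may have
  // been dropped without registering as an error (e.g. a failed retry).
  if (processedCount !== expectedCount) return 'INCOMPLETE';
  return 'SUCCESS';
}
```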

@@ -113,10 +193,16 @@ export const processPermissionsForApprovedUsers = async (
(perm) => perm.dataset_accession_id,
);
const datasetsRequiringPermissions = datasets.map((dataset) => dataset.accession_id);
// if a dataset is removed from the incoming datasets list argument, i.e. the user has more dataset permissions than the incoming list,
Contributor Author

just left this comment here, to remind me (and perhaps others) how this call to difference() should behave. Not sure how likely it is that a dataset would be removed!

@@ -39,12 +41,50 @@ const JOB_NAME = 'RECONCILE_EGA_PERMISSIONS';
* 3) Retrieve corresponding list of users from EGA API
* 4) Create permissions, on each dataset, for each user on the DACO approved list, if no existing permission is found
* 5) Process existing permissions for each dataset + revoke those which belong to users not on the DACO approved list
* 6) Return completed JobReport
* @returns Promise<JobReport<ReconciliationJobReport>>
* @example
Contributor Author

added this example report to show how the completionStatus lines up with the details from each of the create permissions/delete permissions process steps

@@ -21,3 +21,4 @@
export const DEFAULT_LIMIT = 100;
export const DEFAULT_OFFSET = DEFAULT_LIMIT;
export const EGA_MAX_REQUEST_SIZE = 5000;
export const DEFAULT_RETRIES = 3;
Member

Ideally this is configurable in the env to override this value, but this is a sensible default so that isn't immediately needed.

Contributor Author

Since I'm pushing other changes I can add this too, may as well!

Comment on lines +124 to +130
// throttled axios methods
const throttledDelete = throttle(apiAxiosClient.delete);
const throttledGet = throttle(apiAxiosClient.get);
const throttledPost = throttle(apiAxiosClient.post);
const throttledPut = throttle(apiAxiosClient.put);
const throttledGenericRequest = throttle(apiAxiosClient.request);

Member

👍

@anncatton anncatton merged commit afbb2c0 into add-refresh-token-flow Nov 26, 2024
2 checks passed
@anncatton anncatton deleted the 455-implement-error-retry branch November 26, 2024 21:35