🥅 455 - Error handling and retries + job reporting #463

Merged
merged 8 commits into add-refresh-token-flow from 455-implement-error-retry
Nov 26, 2024

Conversation

anncatton
Contributor

@anncatton anncatton commented Oct 20, 2024

Work for #455

Adds retry logic for timeout and connection-hangup errors, plus JobReport data. The JobReport includes start time, end time, and time elapsed per job step, so it's clear which steps take the most time to complete.

EgaClient

  • splits up this client from the IdpClient for clarity
  • cleans up error handling in the Axios interceptors. Moves the check for error.response before error.request, because a 401 error triggers both! If error.request is checked first, the refresh logic is never reached.
  • adds retry handling for 504 error.response status
  • adds retry handling for the ECONNRESET request error, which is less frequent in this client but does occur.
  • adds ERR_BAD_REQUEST handling for error.request. These will not be retried
  • adds token expiry verification. An existing token in getAccessToken is checked to see if it is expired
  • adds axios-retry to set max allowed retries for a request. Defaults to 3
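The interceptor ordering described above can be sketched as a pure classifier. This is illustrative only (the function name and shapes are hypothetical, not the PR's actual interceptor code); the key point is that error.response must be checked before error.request, since a 401 populates both:

```typescript
// Hypothetical classifier mirroring the interceptor's decision order.
type AxiosErrorLike = {
  response?: { status: number };
  request?: unknown;
  code?: string;
};

type ErrorAction = 'REFRESH_TOKEN' | 'RETRY' | 'FAIL';

function classifyEgaError(error: AxiosErrorLike): ErrorAction {
  // 1) Response errors first: a 401 carries BOTH response and request,
  //    so checking request first would skip the token-refresh path.
  if (error.response) {
    if (error.response.status === 401) return 'REFRESH_TOKEN';
    if (error.response.status === 504) return 'RETRY'; // gateway timeout
    return 'FAIL';
  }
  // 2) Request errors: connection resets are retried; bad requests are not.
  if (error.request) {
    if (error.code === 'ECONNRESET') return 'RETRY';
    if (error.code === 'ERR_BAD_REQUEST') return 'FAIL';
  }
  return 'FAIL';
}
```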

IdpClient

  • splits from EgaClient for code clarity
  • adds decodeToken function, using jsonwebtoken
  • adds ECONNRESET error handling. This occurs frequently on the first attempt to retrieve an access token
  • other errors are not retried, but logged, then rejected
  • adds axios-retry to client, same default as in ega client
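The IdpClient retry behavior above (retry only ECONNRESET, log and reject everything else) is what axios-retry's retries and retryCondition options provide. A dependency-free sketch of the same idea, with a hypothetical helper name:

```typescript
// Sketch only: retry an async operation when it fails with ECONNRESET,
// up to a bounded attempt count; any other error rejects immediately.
const DEFAULT_RETRIES = 3;

async function withEconnresetRetry<T>(
  op: () => Promise<T>,
  retries: number = DEFAULT_RETRIES,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const code = (err as { code?: string }).code;
      // Other errors are not retried: rethrow so the caller can log + reject.
      if (code !== 'ECONNRESET') throw err;
    }
  }
  throw lastError;
}
```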

Main job

  • updates function to return completed JobReport. This will be returned to runAllJobs.
  • aborts job if fetch for datasets fails; the job cannot proceed without this data

Permissions Service

  • adds job reporting data, including error reporting
  • increases pagination limit to 100 to speed up requests
    Note: no retries added here. Adding permissions can probably suffice with just reporting errors, and any permissions missed can be picked up on the next job run. For revoking permissions, it might make sense to include one retry block if any errors are reported for revoke requests that failed. The error data for revoking includes the datasetId, so it could be a matter of retrying on specific datasets, as needed.
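The per-dataset retry idea floated in the note above is not implemented in this PR, but since revoke errors carry the datasetId, it could look roughly like this (all names hypothetical):

```typescript
// Purely illustrative: retry revocation once per dataset that reported an
// error, returning the datasetIds that still failed afterwards.
type RevokeError = { message: string; datasetId: string };

async function retryFailedRevokes(
  errors: RevokeError[],
  revokeForDataset: (datasetId: string) => Promise<boolean>,
): Promise<string[]> {
  const stillFailing: string[] = [];
  // The revoke error data includes the datasetId, so retries can target
  // specific datasets rather than re-running the whole job.
  const datasetIds = Array.from(new Set(errors.map((e) => e.datasetId)));
  for (const id of datasetIds) {
    const ok = await revokeForDataset(id).catch(() => false);
    if (!ok) stillFailing.push(id);
  }
  return stillFailing;
}
```

Anything still failing after the single retry would just be reported, to be picked up on the next job run.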

Main batch job

  • adds egaPermissionsReconciliation to main job triggered by /jobs/batch-transitions. Added with a feature flag so that when enabled, the older approvedUsersEmail will not run, and vice versa. Can remove this completely once testing is complete.

Types

  • adds types to support job result reporting. Each process reports start/end time + completion status
  • permissionsCreated includes details on number of permissions expected to be granted, and permissions successfully created
  • permissionsRevoked includes details on number of datasets successfully processed (results match with expected number of revocations) and number of datasets with permissions revoked
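The report shapes described above might look roughly like this (assumed field names, not the PR's exact type definitions; the tri-state completion status reflects the discussion later in this thread):

```typescript
// Assumed sketch of the job-report types.
type CompletionStatus = 'SUCCESS' | 'INCOMPLETE' | 'FAILURE';

interface StepReport {
  startTime: Date;
  endTime: Date;
  timeElapsed: number; // minutes
  completionStatus: CompletionStatus;
}

// Details on permissions expected to be granted vs successfully created.
interface PermissionsCreatedReport extends StepReport {
  numExpectedPermissions: number;
  numGrantedPermissions: number;
}

// Details on datasets successfully processed (results match the expected
// number of revocations) vs datasets with permissions actually revoked.
interface PermissionsRevokedReport extends StepReport {
  numDatasetsProcessed: number;
  numDatasetsWithPermissionsRevoked: number;
}
```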

Configuration

  • adds egaReconciliationEnabled to config.ts. This will enable the egaPermissionsReconciliation job and disable the approvedUsersEmail job
  • adds function to support fetching the publicKey from Ega's auth server
  • adds public key value to secrets.ts, to enable parsing of the access token expiry

Documentation

  • adds FEATURE_EGA_RECONCILIATION_ENABLED to README, .env.example

@anncatton anncatton changed the title 🥅 455 - Error handling + retry 🥅 455 - Error handling and retries + job reporting Oct 21, 2024
message: permissions.message,
datasetId: datasetAccessionId,
});
// TODO: add a max number of retries to prevent endless loop, then set paging = false?
Contributor Author

Calling out my comment here - with the request retry setup for 504 and connection reset errors in the EgaClient, I'm wondering if any "max retries" should be added there instead? So far in testing I haven't run into a scenario where pagination never stops 🤔

Member

Yes, we should probably set up a mechanism to prevent infinite loops in your retry mechanisms. You can give them a default max retries and then also accept a parameter to control from here how many max retries to allow.
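The reviewer's suggestion could be sketched like this (hypothetical names): a paging loop that takes a caller-controlled max-retries parameter with a sensible default, so a persistently failing page fetch cannot loop forever:

```typescript
// Illustrative only: bounded retries inside a pagination loop.
type Page<T> = { items: T[]; done: boolean };

async function fetchAllPages<T>(
  fetchPage: (offset: number) => Promise<Page<T>>,
  maxRetries: number = 3, // default, overridable by the caller
): Promise<T[]> {
  const results: T[] = [];
  let offset = 0;
  let failures = 0;
  let paging = true;
  while (paging) {
    try {
      const page = await fetchPage(offset);
      results.push(...page.items);
      offset += page.items.length;
      failures = 0; // reset the counter after any successful page
      if (page.done) paging = false;
    } catch {
      failures++;
      // give up rather than retry endlessly
      if (failures > maxRetries) paging = false;
    }
  }
  return results;
}
```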

}
logger.info('Completed processing permissions for all DACO approved users.');
const endTime = new Date();
const timeElapsed = moment(endTime).diff(startTime, 'minutes');
return {
Contributor Author

My report gathering here and in revokePermissions feels messy; I am open to suggestions. My main goal in the results data is to track whether the expected outcomes match the actual results, i.e.

  • the total number of EgaUsers inputted to processPermissionsForApprovedUsers matches the total number of users successfully processed
  • the total number of Datasets inputted to removeExpiredPermissions matches the total number of datasets successfully processed

Member
@joneubank joneubank Oct 22, 2024

The current implementation may feel messy because it is representing 3 different outcomes with only two states:

  1. errors during job
  2. complete with incorrect number of processed users
  3. complete with correct numbers

But you are only reporting: success/failure.

This also doesn't capture your second bullet, the number of datasets vs number processed.

Contributor Author

Aha, that makes it clearer. Maybe my third state can be something like INCOMPLETE, so I would have SUCCESS, INCOMPLETE, or FAILURE.

For the second bullet point, I do have the total number of datasets recorded at the top level of the report, as datasetsCount. Does it make the report clearer to add this to the permissionsRevoked portion of the report, since that is the baseline I'm measuring against? (Maybe the same could be said for approvedEgaUsersCount and the permissionsCreated portion.)

@@ -146,7 +84,13 @@ export const egaApiClient = async () => {

const getAccessToken = async (): Promise<IdpToken> => {
if (currentToken) {
return currentToken;
const tokenIsExpired = await tokenExpired(currentToken);
if (tokenIsExpired) {
Contributor Author

I'm looking at where this tokenExpiry check is called, and I'm not sure if it is useful? since the only places getAccessToken is called is on the initial EgaClient creation, and after a 401 occurs.

Member

Thanks for describing this, I think you are correct in this context. As written, we don't call this before using the EGA axios client, instead we store the current token in the ega axios client and then update it with the fresh token after we retrieve it and that keeps our client up to date.

Since this is a cron-job, the client is created, then constantly used, and then the job ends - that makes this pattern acceptable. If this client was being used in a long lived server where the client could sit unused for a while and then the token go stale, then we would be better served to always call getAccessToken before using the EGA axios client. In that alternate use case, this expired check would be useful. As this is just a cron job though, I'm fine with it as is.

I went back and forth on this message like 3 times so I'll probably come up with some new ideas by the time you read this.....

return currentToken;
const tokenIsExpired = await tokenExpired(currentToken);
if (tokenIsExpired) {
logger.info('token is expired');
Member

I think this is debug level logging...


Comment on lines +140 to +141
// set new token on original request that had the 401 error
error.config.headers['Authorization'] = refreshedBearerToken;
Member

Why are we modifying the original request now?

Contributor Author

This is modifying the original request config (error.config) to have the refreshed token, so that when apiAxiosClient.request is returned on line 145, it has the new token value in the Authorization header.

`${CLIENT_NAME} - ${error.response.status} - ${error.response.statusText} - retrying original request.`,
);
// retry original request. this response error shouldn't be an issue because throttling is in place
return apiAxiosClient.request(error.config);
Member

apiAxiosClient.request is not throttled. This request can occur in addition to all the throttled requests, no?

This holds true for every use of client.request(). Should we throw a wrapper over that to throttle it?

Contributor Author

As discussed on slack - will throttle the axios methods directly, and remove the throttle call on the egaClient methods

* @param token IdpToken
* @returns Promise<boolean>
*/
export const tokenExpired = async (token: IdpToken): Promise<boolean> => {
Member

Convention is probably to name this isTokenExpired, but this is a minor quibble. The description should probably state something like: "returns true if token is expired, otherwise returns false; the token may be invalid and still return false".
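An illustrative version of the check under the reviewer's suggested name and contract. The PR decodes the token with jsonwebtoken; this sketch decodes the JWT payload by hand so it stays dependency-free, and (as the suggested doc comment says) an invalid token may still return false:

```typescript
// Sketch: true if the token's exp claim is in the past, otherwise false.
// An invalid/undecodable token returns false rather than throwing.
const isTokenExpired = (accessToken: string): boolean => {
  const payloadPart = accessToken.split('.')[1];
  if (!payloadPart) return false;
  try {
    const payload = JSON.parse(
      Buffer.from(payloadPart, 'base64url').toString('utf8'),
    );
    if (typeof payload.exp !== 'number') return false;
    // exp is seconds since epoch; Date.now() is milliseconds
    return payload.exp * 1000 < Date.now();
  } catch {
    return false;
  }
};
```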

console.error('Keycloak realm info not provided in config, aborting fetch attempt.');
return undefined;
}
console.info(`Fetching public key from Keycloak realm ${authRealmName}.`);
Member

debug level logging


@@ -199,16 +292,76 @@ export const processPermissionsForDataset = async (
for await (const requests of chunkedRevokeRequests) {
const revokeResponse = await client.revokePermissions(requests);
if (isSuccess(revokeResponse)) {
logger.info(
logger.debug(
Member

This is info logging. We are recording an action we took that changed state, and that we may want to find a record of in the future.

baseURL: apiUrl,
headers: {
'Content-Type': 'application/json',
},
});
axiosRetry(client, { retries: DEFAULT_RETRIES });
Contributor Author

Added axios-retry and set max retries to 3. When I ran this locally I didn't encounter any retryable errors beyond this number, and the job report on completion didn't indicate that any user/dataset request was missed. All the totals lined up with what I expected.

Member

Did you run or write a test to see how this behaves when it does fail 3+ times?

baseURL: authHost,
});
axiosRetry(client, { retries: DEFAULT_RETRIES });
return client;
Contributor Author

added same retry strategy here

@@ -31,6 +41,34 @@ import {
createRevokePermissionRequest,
} from '../utils';

/**
* Parse completionStatus for a reconciliation step, based on the number of successfully processed items vs total expected
Contributor Author

added an INCOMPLETE status to reflect a scenario where no errors are recorded but the expected totals do not line up, i.e. the number of permissions for a dataset does not match the approved users count. This could mean a request was dropped somewhere in the process but did not register as an error, like a failed retry.
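The parsing described in the doc comment above could be a small pure function along these lines (the signature is assumed, not copied from the PR):

```typescript
// Sketch: derive a step's completion status from its error count and
// processed-vs-expected totals.
type CompletionStatus = 'SUCCESS' | 'INCOMPLETE' | 'FAILURE';

function parseCompletionStatus(
  errorCount: number,
  processedCount: number,
  expectedCount: number,
): CompletionStatus {
  if (errorCount > 0) return 'FAILURE';
  // No errors recorded, but the totals do not line up: a request may have
  // been dropped without registering as an error (e.g. a failed retry).
  if (processedCount !== expectedCount) return 'INCOMPLETE';
  return 'SUCCESS';
}
```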

@@ -113,10 +193,16 @@ export const processPermissionsForApprovedUsers = async (
(perm) => perm.dataset_accession_id,
);
const datasetsRequiringPermissions = datasets.map((dataset) => dataset.accession_id);
// if a dataset is removed from the incoming datasets list argument, i.e. the user has more dataset permissions than the incoming list,
Contributor Author

just left this comment here, to remind me (and perhaps others) how this call to difference() should behave. Not sure how likely it is that a dataset would be removed!

@@ -39,12 +41,50 @@ const JOB_NAME = 'RECONCILE_EGA_PERMISSIONS';
* 3) Retrieve corresponding list of users from EGA API
* 4) Create permissions, on each dataset, for each user on the DACO approved list, if no existing permission is found
* 5) Process existing permissions for each dataset + revoke those which belong to users not on the DACO approved list
* 6) Return completed JobReport
* @returns Promise<JobReport<ReconciliationJobReport>>
* @example
Contributor Author

added this example report to show how the completionStatus lines up with the details from each of the create permissions/delete permissions process steps

@@ -21,3 +21,4 @@
export const DEFAULT_LIMIT = 100;
export const DEFAULT_OFFSET = DEFAULT_LIMIT;
export const EGA_MAX_REQUEST_SIZE = 5000;
export const DEFAULT_RETRIES = 3;
Member

Ideally this is configurable in the env to override this value, but this is a sensible default so that isn't immediately needed.

Contributor Author

Since I'm pushing other changes I can add this too, may as well!

Comment on lines +124 to +130
// throttled axios methods
const throttledDelete = throttle(apiAxiosClient.delete);
const throttledGet = throttle(apiAxiosClient.get);
const throttledPost = throttle(apiAxiosClient.post);
const throttledPut = throttle(apiAxiosClient.put);
const throttledGenericRequest = throttle(apiAxiosClient.request);

Member

👍

@anncatton anncatton merged commit afbb2c0 into add-refresh-token-flow Nov 26, 2024
2 checks passed
@anncatton anncatton deleted the 455-implement-error-retry branch November 26, 2024 21:35