[FLINK-37730][rest] Add client method for getting JM exception history #26525

vsantwana · 2025-05-02T07:33:01Z

What is the purpose of the change


This PR is part of a larger change in the Flink Kubernetes Operator. As outlined in this JIRA ticket, the idea is to expose the Job Manager exceptions as Kubernetes Events.

This PR adds an API method to RestClusterClient, which will then be used in the operator's reconciler to keep track of the exceptions.

Brief change log

RestClusterClient exposes a new method, getJobExceptions, which takes a JobID.
The new method is unit tested in RestClusterClientTest.
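
For illustration, a minimal sketch of how a reconciler might consume such a future and surface each exception only once. This is not code from the PR or from the operator: ExceptionEntry and ExceptionTracker are hypothetical stand-ins, the client is modelled as a plain function from a job id string to a future, and the real method returns CompletableFuture<JobExceptionsInfoWithHistory> for a JobID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical stand-in for one exception history entry (not a Flink class). */
final class ExceptionEntry {
    final String name;
    final long timestamp;

    ExceptionEntry(String name, long timestamp) {
        this.name = name;
        this.timestamp = timestamp;
    }

    /** Key used for de-duplication; assumes name plus timestamp identifies an entry. */
    String key() {
        return timestamp + ":" + name;
    }
}

/** Hypothetical tracker: remembers which entries were already reported. */
final class ExceptionTracker {
    // Not thread-safe; assumes a single reconcile thread and one tracker per job.
    private final Set<String> seen = new HashSet<>();

    /**
     * Fetches the history through the supplied client function and returns
     * only the entries that no earlier call has returned.
     */
    CompletableFuture<List<ExceptionEntry>> pollNew(
            String jobId,
            Function<String, CompletableFuture<List<ExceptionEntry>>> client) {
        return client.apply(jobId).thenApply(history -> {
            List<ExceptionEntry> fresh = new ArrayList<>();
            for (ExceptionEntry entry : history) {
                // Set.add returns false when the key was already present.
                if (seen.add(entry.key())) {
                    fresh.add(entry);
                }
            }
            return fresh;
        });
    }
}
```

In the operator, the function argument would presumably wrap the new getJobExceptions call, and each entry returned by pollNew would be turned into a Kubernetes Event.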

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.


Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2025-05-02T07:41:55Z

CI report:

a25b036 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

vsantwana · 2025-05-02T10:08:38Z

@flinkbot run azure

davidradl · 2025-05-05T21:44:40Z

flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java

+     * @param jobID The job id
+     * @return Job exceptions
+     */
+    public CompletableFuture<JobExceptionsInfoWithHistory> getJobExceptions(JobID jobID) {


I am curious what happens if there are loops or repeated exceptions: will this cause an issue with the REST call? Is there a case for a paging API here to avoid large responses?
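
To make the paging idea concrete, a minimal sketch of offset/limit slicing over an already fetched history. HistoryPaging and its page method are hypothetical names, not part of Flink or of this PR, and slicing on the client side would not shrink the REST response itself. A real paging API would need equivalent parameters on the server-side handler.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper illustrating offset/limit paging (not a Flink class). */
final class HistoryPaging {
    private HistoryPaging() {}

    /** Returns at most {@code limit} items starting at {@code offset}, as a copy. */
    static <T> List<T> page(List<T> items, int offset, int limit) {
        if (offset < 0 || limit < 0) {
            throw new IllegalArgumentException("offset and limit must be non-negative");
        }
        if (offset >= items.size()) {
            return Collections.emptyList();
        }
        // Compute the end index in long arithmetic so offset + limit cannot overflow.
        int end = (int) Math.min((long) items.size(), (long) offset + limit);
        return new ArrayList<>(items.subList(offset, end));
    }
}
```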

rmetzger · 2025-05-06T09:50:26Z

Thanks for the PR. Looks like CI is failing: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=67465&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=11740

[FLINK-37730][rest] Add client method for getting JM exception history

e809fd2

[FLINK-37730][test] Fix failing test

a25b036
