Azure.Core pipeline has a very high failure rate and we need to get this green. #42287

Open
pallavit opened this issue Feb 29, 2024 · 7 comments
Labels: Azure.Core.TestFramework, Client
Comments

@pallavit
Contributor

Azure.Core pipelines currently have a pass rate of about 50%, which impedes work and progress. We need to get this closer to 100%.
https://dev.azure.com/azure-sdk/internal/_pipeline/analytics/stageawareoutcome?definitionId=1180&contextType=build has more information.

@github-actions bot added the Client and needs-team-triage labels on Feb 29, 2024
@annelo-msft
Member

Per offline discussion, we should not remove tests from the pipeline to achieve this goal, because doing so allows build breaks to be introduced by core/generator changes, and those take a lot of time to track down.

@jsquire removed the needs-team-triage label on Feb 29, 2024
@jsquire
Member

jsquire commented Feb 29, 2024

I think one thing to consider is allowing a pipeline-specific override for the global test timeout, so that large suites such as Core and ResourceManager can offset the non-determinism of scheduling delays.
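As an illustration of the override idea, here is a minimal Python sketch, not the test framework's actual implementation. It assumes the override arrives through an environment variable; the variable name `TEST_GLOBAL_TIMEOUT_SECONDS` and the function `resolve_global_timeout` are hypothetical. The 10-second default mirrors the global limit reported in the test logs in this issue.

```python
import os

# Default mirrors the 10-second global limit reported in this issue's logs.
DEFAULT_GLOBAL_TIMEOUT_SECONDS = 10.0

# Hypothetical variable name, chosen only for this sketch.
OVERRIDE_ENV_VAR = "TEST_GLOBAL_TIMEOUT_SECONDS"


def resolve_global_timeout(env=None, default=DEFAULT_GLOBAL_TIMEOUT_SECONDS):
    """Return the per-test time limit, preferring a pipeline-supplied override."""
    env = os.environ if env is None else env
    raw = env.get(OVERRIDE_ENV_VAR, "").strip()
    if not raw:
        return default
    try:
        value = float(raw)
    except ValueError:
        # Ignore a malformed override rather than failing the whole run.
        return default
    # Non-positive (or NaN) values make no sense as a limit; fall back.
    return value if value > 0 else default
```

A pipeline for a large suite could then set the variable in its own definition, while every other pipeline keeps the default.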

@christothes
Member

christothes commented Feb 29, 2024

I looked into this a bit and noticed something interesting. The failures seem to correlate with TestProxy failures in the test logs. In speaking with @scbedd, he thinks this issue could be related. Fixing it would address the 500s, but there is still the issue of proxy cloning of recordings causing delays that are not the fault of the test itself.

I'm going to investigate whether it is feasible to time the proxy initialization phase and subtract that from the test's total time somehow.
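To make the "time the setup phase and subtract it" idea concrete, here is a rough Python sketch. It only illustrates the technique and is not how Azure.Core.TestFramework does it; the class `ChargedTimeStopwatch` and the helper `exceeded_limit` are hypothetical names.

```python
import time
from contextlib import contextmanager


class ChargedTimeStopwatch:
    """Tracks elapsed time and the portion spent in excluded phases."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._excluded = 0.0

    @contextmanager
    def excluded(self):
        """Time spent inside this block is not charged to the test."""
        began = self._clock()
        try:
            yield
        finally:
            self._excluded += self._clock() - began

    def charged_seconds(self):
        """Elapsed time since construction, minus all excluded spans."""
        return (self._clock() - self._start) - self._excluded


def exceeded_limit(stopwatch, limit_seconds):
    """True when the charged (not wall-clock) time is over the limit."""
    return stopwatch.charged_seconds() > limit_seconds
```

Wrapping proxy initialization in `with stopwatch.excluded():` would keep slow recording restores from counting against the test's limit.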

An alternate approach suggested by @scbedd is to batch restore all test assets somehow prior to normal test execution. That seems more involved, but it would also ensure that tests are not charged for proxy setup time.
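A batch restore step could look roughly like the Python sketch below. It assumes recordings are referenced by `assets.json` files under the repository root, and it takes the actual restore action as a caller-supplied function `restore_one` (hypothetical; for example, something that asks the test proxy to restore that file). Nothing here reflects the real pipeline scripts.

```python
from pathlib import Path


def find_assets_files(root):
    """Return every assets.json under root, sorted for a stable restore order."""
    return sorted(Path(root).rglob("assets.json"))


def batch_restore(root, restore_one):
    """Restore each assets file once, up front; return the files that failed."""
    failures = []
    for assets_file in find_assets_files(root):
        try:
            restore_one(assets_file)
        except Exception as error:
            # Keep going so one bad restore does not block the rest.
            failures.append((assets_file, error))
    return failures
```

Running this before the test step means each test finds its recordings already on disk, so restore time never falls inside a test's timing window.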

Example of proxy failures related to timeouts:

2024-02-28T22:45:25.3252590Z   Failed GetDevice_NotFound [20 s]
2024-02-28T22:45:25.3259380Z   Error Message:
2024-02-28T22:45:25.3261600Z    The test timed out twice:
2024-02-28T22:45:25.3263450Z First attempt: Azure.RequestFailedException : Service request failed.
2024-02-28T22:45:25.3265460Z Status: 500 (Internal Server Error)
2024-02-28T22:45:25.3266820Z 
2024-02-28T22:45:25.3268240Z Headers:
2024-02-28T22:45:25.3269730Z Date: Wed, 28 Feb 2024 22:45:14 GMT
2024-02-28T22:45:25.3271220Z Server: Kestrel
2024-02-28T22:45:25.3272950Z Content-Length: 0
2024-02-28T22:45:25.3274270Z 
2024-02-28T22:45:25.3275710Z TearDown : Azure.Core.TestFramework.TestTimeoutException : Test exceeded global time limit of 10 seconds. Duration: 00:00:10.6440430 
2024-02-28T22:45:25.3277590Z Second attempt: Azure.RequestFailedException : Service request failed.
2024-02-28T22:45:25.3279110Z Status: 500 (Internal Server Error)
2024-02-28T22:45:25.3280670Z 
2024-02-28T22:45:25.3282770Z Headers:
2024-02-28T22:45:25.3284040Z Date: Wed, 28 Feb 2024 22:45:24 GMT
2024-02-28T22:45:25.3285320Z Server: Kestrel
2024-02-28T22:45:25.3287080Z Content-Length: 0
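To check how often timeouts coincide with proxy 500s across many logs, a small scanner along these lines could help. This is a sketch that assumes log lines shaped like the excerpt above; the function name `summarize_failures` and the result layout are made up for illustration.

```python
import re

TIMEOUT_RE = re.compile(
    r"Test exceeded global time limit of (?P<limit>\d+) seconds\. "
    r"Duration: (?P<h>\d+):(?P<m>\d+):(?P<s>\d+(?:\.\d+)?)"
)
FAILED_RE = re.compile(r"\bFailed (?P<name>\S+) \[")
STATUS_500_RE = re.compile(r"Status: 500\b")


def summarize_failures(lines):
    """Group log lines by failed test; collect HTTP 500 counts and timeouts."""
    results = {}
    current = None
    for line in lines:
        failed = FAILED_RE.search(line)
        if failed:
            # A "Failed <TestName> [..]" line starts a new failure record.
            current = results.setdefault(
                failed["name"], {"status_500": 0, "timeouts": []}
            )
            continue
        if current is None:
            continue
        if STATUS_500_RE.search(line):
            current["status_500"] += 1
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            seconds = (
                int(timeout["h"]) * 3600
                + int(timeout["m"]) * 60
                + float(timeout["s"])
            )
            # Store (configured limit, observed duration) in seconds.
            current["timeouts"].append((int(timeout["limit"]), seconds))
    return results
```

A test with both a non-zero `status_500` count and a recorded timeout is a candidate for "proxy failure, not test failure".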

@christothes
Member

christothes commented Feb 29, 2024

Looks like we already subtract the proxy setup time from the test execution time

So let's wait to see if fixing Azure/azure-sdk-tools#7784 addresses the majority of cases.

@scbedd
Member

scbedd commented Mar 19, 2024

> Looks like we already subtract the proxy setup time from the test execution time
>
> So let's wait to see if fixing Azure/azure-sdk-tools#7784 addresses the majority of cases.

Hey @christothes, I don't know if it totally addressed your issues, but your Mac builds definitely stabilized a ton after the update!

#42758

@christothes
Member

Here is a Kusto query of the percentage of tests that have timed out over the last 9 days. I believe the fix was merged late on Friday the 15th. Looks good to me!

[image: Kusto query result]
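The query itself is not included in the thread. As a stand-in, the Python sketch below shows the kind of aggregation described: the share of test executions per day that timed out. The input shape (pairs of day and a timed-out flag) and the function name `timeout_percentage_by_day` are assumptions.

```python
from collections import defaultdict


def timeout_percentage_by_day(results):
    """results: iterable of (day, timed_out) pairs, one per test execution.

    Returns {day: percentage of that day's executions that timed out},
    ordered by day.
    """
    totals = defaultdict(int)
    timed_out = defaultdict(int)
    for day, did_time_out in results:
        totals[day] += 1
        if did_time_out:
            timed_out[day] += 1
    return {
        day: 100.0 * timed_out[day] / totals[day] for day in sorted(totals)
    }
```

Comparing the per-day values before and after a fix lands gives the same before/after view as a time chart.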

@pallavit
Contributor Author

The data looks promising!

@annelo-msft annelo-msft added this to the Backlog milestone May 20, 2024