Multiple issues with Service Fabric Runtime #10920

Closed
2 of 15 tasks
scale-tone opened this issue Nov 7, 2024 · 13 comments

@scale-tone

scale-tone commented Nov 7, 2024

Description

We're running integration tests on a Service Fabric dev cluster provisioned on an Azure DevOps build pipeline.
We're using an internal Windows Server 2022-based agent pool.
Everything worked until this Saturday, 02.11.2024 (2 November 2024).

Before that we were getting this image: 20240922
Since Saturday we have been getting this image: 20241021

Since Saturday the dev cluster has failed to reach a healthy state because FaultAnalysisService (a non-configurable part of the Service Fabric runtime) keeps failing. We have no visibility into why exactly it is failing.

This repo says we should have this (two-year-old) version of the Service Fabric runtime: 9.1.1436.9590.

That is not the case: the Service Fabric runtime that actually appears on our agents now is this (two-month-old) one: 9.1.2718.9590. We established that by inspecting FabricHost.exe on an agent.
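The comparison described above can be made mechanical. The following is a minimal sketch in Python, assuming the documented version is copied by hand from the image README and the installed version is read from the FabricHost.exe file properties. The helper names `parse_version` and `compare_versions` are hypothetical and not part of any Service Fabric tooling.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '9.1.1436.9590' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def compare_versions(documented: str, actual: str) -> str:
    """Describe how the installed version relates to the documented one."""
    doc, act = parse_version(documented), parse_version(actual)
    if act == doc:
        return "match"
    return "newer than documented" if act > doc else "older than documented"


if __name__ == "__main__":
    # Values quoted in this issue: README version vs. version found on the agent.
    print(compare_versions("9.1.1436.9590", "9.1.2718.9590"))
```

Comparing integer tuples rather than raw strings avoids lexicographic surprises (for example, "9.10" sorting before "9.9").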

We cannot prove or disprove that the SF runtime version is the actual culprit (we cannot go back and try the previous one, because we always get the latest agent image and cannot control its version), but it looks highly likely.

Question 1: is there any workaround for our failing SF cluster? E.g. is there a way to override which SF runtime version is used? (Bear in mind that the SF runtime installer requires administrator privileges, so simply running it as part of the pipeline does not work.)

Question 2: why does this repo's change history not reflect the actual picture, and can this be fixed?

Question 3: is there a chance of having the SF runtime updated on the agent image? I cannot say which exact version it should be updated to (since we have no way to try them out), but could it at least be reverted to the previous, stable one?

Platforms affected

  • [x] Azure DevOps
  • [ ] GitHub Actions - Standard Runners
  • [ ] GitHub Actions - Larger Runners

Runner images affected

  • [ ] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] Ubuntu 24.04
  • [ ] macOS 12
  • [ ] macOS 13
  • [ ] macOS 13 Arm64
  • [ ] macOS 14
  • [ ] macOS 14 Arm64
  • [ ] macOS 15
  • [ ] macOS 15 Arm64
  • [ ] Windows Server 2019
  • [x] Windows Server 2022

Image version and build link

20241021

image

Is it regression?

yes

Expected behavior

The SF dev cluster starts and reaches a healthy state on a build agent.

Actual behavior

The SF dev cluster never reaches a healthy state (we waited for up to 1 hour).

Repro steps

  1. Run a build agent from this image: 20241021
  2. Set up a Service Fabric dev cluster on it.
  3. Observe that it never reaches a healthy state (by periodically querying it with PowerShell).
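
The periodic check in step 3 can be sketched as a generic polling loop. This is an illustrative Python sketch, not the script used in the pipeline: `wait_until_healthy` and its `probe` callable are hypothetical, and `probe` is assumed to return the cluster's aggregated health state as a string such as "Ok", "Warning" or "Error" (for example, by wrapping a PowerShell health query).

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], str],
    timeout_s: float = 3600.0,   # the issue reports waiting for up to 1 hour
    interval_s: float = 30.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it reports "Ok" or the timeout elapses.

    Returns True if the cluster became healthy, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe() == "Ok":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. In a pipeline, a False result would typically be turned into a failed step.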
@vidyasagarnimmagaddi
Contributor

Hi @scale-tone, we're looking into this issue and will update you as soon as possible. Thank you!

@subir0071
Contributor

Due to deployment issues on some of our VMs, the Windows 2022 image versions are inconsistent.
We will circle back once we have ironed out the discrepancies.

@subir0071
Contributor

Hello @scale-tone, could you please check whether you are still seeing the version mismatch?

@scale-tone
Author

> Could you please check if you are still facing version mismatch.

Oh, yes. The agent's version is now 20241021.1.0:

image

Yet the Service Fabric runtime version is still the same faulty one:

image

And, predictably, it fails in the same way.

@subir0071, what's the ETA for fixing this issue?
We have been unable to run our integration tests properly for 3 weeks now.

@subir0071
Contributor

This screenshot is from the VM control panel of the windows-2022 runner image:
image

This screenshot is from the fabrichost.exe properties on the windows-2022 runner image. The Service Fabric version is the same as the one listed in the README file for windows-2022:
image

Please run the workflow below to display the Service Fabric version via GitHub Actions (GHA):

name: Check Service Fabric Host Version
on:
  workflow_dispatch: # Manually trigger the workflow
jobs:
  check-version:
    runs-on: windows-2022
    steps:
    - name: Check for Service Fabric version
      shell: pwsh
      run: |
        # Path to the fabrichost.exe (usually located in the Service Fabric installation directory)
        $fabrichostPath = "C:\Program Files\Microsoft Service Fabric\bin\fabrichost.exe"
        
        # Check if fabrichost.exe exists
        if (Test-Path $fabrichostPath) {
            # Get version of fabrichost.exe using PowerShell
            $fileVersion = (Get-Item $fabrichostPath).VersionInfo.FileVersion
            Write-Host "fabrichost.exe version: $fileVersion"
        } else {
            Write-Host "fabrichost.exe not found at the expected location."
        }

I ran the same workflow and got the same output.

@scale-tone
Author

> Please execute the below workflow to display the Service Fabric version via GHA -

@subir0071, we are not using GitHub Actions.
It's great to know that it works well there, but we are not using it.

As I said, we're using a dedicated internal agent pool (maintained by our sister team) in Azure DevOps.
Can we have the agent image fixed there, so that we can continue running our integration tests?

@subir0071
Contributor

Thanks for your quick response.

My apologies for the confusion.
The agent images are the same for Azure DevOps and GitHub Actions,
so there should be no difference in the version.

However, let me check a few more aspects of this.

@scale-tone
Author

@subir0071, look, in your test you're getting image version 20241113.3.0,
which is much newer than what I reported above and, yes, looks promising.
What does it take for us to get your version on our build agents? Will it land there naturally after a while? If so, when?

@subir0071
Contributor

subir0071 commented Nov 26, 2024

It seems the Azure DevOps agent pool that you are using is not maintained by Azure.
Otherwise, you would have been moved to the latest image version automatically.

Kindly check with the agent pool administrator (the sister team).
They should be able to help you.

@scale-tone
Copy link
Author

We checked with the pool administrator, and he confirmed that the pool takes images directly from this repo.

Just checked again. The image version we're getting right now is '20241125.1.0', and the Service Fabric runtime version there is still the unstable one:
image

How can this be explained?

@subir0071
Contributor

This is a very unusual issue.
I have started an internal discussion with the relevant team to understand how the version is getting changed.

Could you check whether any other tools or packages are getting updated as well?

@subir0071
Contributor

Thanks, @mmrazik, for the analysis of this issue.
Hi @scale-tone, it seems the agent in use is not a Microsoft-hosted agent.

@mmrazik

mmrazik commented Dec 4, 2024

@subir0071 feel free to close this issue. I clarified this with @scale-tone 1:1.
