Rollout Creation with workloadRef Causes Service Downtime #4065

Open
Raghu7997 opened this issue Jan 20, 2025 · 0 comments
Labels
bug (Something isn't working)

Comments

@Raghu7997

I am running a service in a Kubernetes cluster and implementing a Canary deployment using Argo CD and Argo Rollouts. As per the documentation, there are two ways to migrate from a Kubernetes Deployment to a Canary deployment:

  1. Convert an existing Deployment resource to a Rollout resource.
  2. Reference an existing Deployment from a Rollout using the workloadRef field.

I chose the second option (workloadRef) for its reliability. Below are the specifications for my Rollout, VirtualService, and DestinationRule:

Rollout Specification:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
  namespace: engineering
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
      app.kubernetes.io/instance: my-service-app-chart
      app.kubernetes.io/name: my-service
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
    scaleDown: onsuccess
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: my-service-canary-ds
            stableSubsetName: stable
          virtualService:
            name: my-service-canary-vs
            routes:
              - primary
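The referenced Deployment manifest is not included above. For completeness, a minimal sketch of what it is assumed to look like is shown below; only the name, namespace, replica count, and labels are taken from the Rollout specification, while the container image, port, and readiness probe values are placeholders.

Deployment Specification (assumed sketch):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: engineering
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
      app.kubernetes.io/instance: my-service-app-chart
      app.kubernetes.io/name: my-service
  template:
    metadata:
      labels:
        app: my-service
        app.kubernetes.io/instance: my-service-app-chart
        app.kubernetes.io/name: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0.0  # placeholder image
          ports:
            - containerPort: 8080                       # placeholder port
          readinessProbe:
            httpGet:
              path: /healthz                            # placeholder path
              port: 8080
            initialDelaySeconds: 120                    # reflects the ~120s the pods need to become healthy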

VirtualService Specification:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-canary-vs
  namespace: engineering
spec:
  hosts:
    - my-service.engineering.svc.cluster.local
    - my-service.engineering.svc
    - my-service.engineering
  http:
    - name: primary
      route:
        - destination:
            host: my-service
            subset: stable
          weight: 100
        - destination:
            host: my-service
            subset: canary
          weight: 0

DestinationRule Specification:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-canary-ds
  namespace: engineering
spec:
  host: my-service
  subsets:
    - labels:
        app: my-service
      name: canary
    - labels:
        app: my-service
      name: stable
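Both the VirtualService and the DestinationRule target host my-service, so a plain Kubernetes Service selecting the same pods is assumed. A minimal sketch is shown below; the port numbers are placeholders.

Service Specification (assumed sketch):

apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: engineering
spec:
  selector:
    app: my-service
  ports:
    - name: http
      port: 80            # placeholder service port
      targetPort: 8080    # placeholder container port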

The Problem
When I apply the above specifications:

  1. The Rollout controller creates a new ReplicaSet and new pods.
  2. The DestinationRule subsets are immediately updated to include the label rollouts-pod-template-hash for both stable and canary.
  3. Both subsets (stable and canary) end up with the same rollouts-pod-template-hash value, which corresponds to the newly created ReplicaSet.

Resulting DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-canary-ds
  namespace: engineering
spec:
  host: my-service
  subsets:
    - labels:
        app: my-service
        rollouts-pod-template-hash: 5c8796875b
      name: canary
    - labels:
        app: my-service
        rollouts-pod-template-hash: 5c8796875b
      name: stable

As a result, all traffic is immediately routed to the new pods, which require at least 120 seconds to become healthy. This leads to service unavailability during the rollout.

Expected Behavior

  1. The stable subset in the DestinationRule should point to the old ReplicaSet (existing Deployment).
  2. The canary subset in the DestinationRule should point to the new ReplicaSet created by the Rollout.
  3. Traffic should only be routed to canary once the pods are healthy.
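For illustration, the DestinationRule expected during an in-progress canary step would look roughly like the sketch below, with the stable subset pinned to the old ReplicaSet's hash and the canary subset pointing at the new one. Both hash values here are hypothetical.

Expected DestinationRule (illustrative sketch):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-canary-ds
  namespace: engineering
spec:
  host: my-service
  subsets:
    - labels:
        app: my-service
        rollouts-pod-template-hash: 7d9f6c4b8d  # hypothetical hash of the new ReplicaSet
      name: canary
    - labels:
        app: my-service
        rollouts-pod-template-hash: 6b5d4f9c7a  # hypothetical hash of the old ReplicaSet
      name: stable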

Environment
Argo Rollouts version: 1.7.2
Istio version: 1.19
Kubernetes version: 1.30

Steps to Reproduce

  1. Create a Deployment with the above configuration.
  2. Create a Rollout using the workloadRef field, VirtualService, and DestinationRule as shown above.
  3. Observe the DestinationRule subsets immediately after the Rollout creates a new ReplicaSet.

Request for Assistance

How can I ensure that:

  1. The stable subset refers to the old ReplicaSet, and the canary subset refers to the new ReplicaSet during a rollout?
  2. Traffic is only routed to the canary subset once the pods are fully healthy?
Raghu7997 added the bug label on Jan 20, 2025