Bug 1948087

Summary: kube-storage-version-migrator should not set Available=False _NoMigratorPod on updates
Product: OpenShift Container Platform
Component: kube-storage-version-migrator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: W. Trevor King <wking>
Assignee: Stefan Schimanski <sttts>
QA Contact: Rahul Gangwar <rgangwar>
CC: jchaloup, kewang, mfojtik, sanchezl, wlewis
Keywords: Upgrades
Whiteboard: tag-ci
Type: Bug
Last Closed: 2022-08-25 12:31:01 UTC

Description W. Trevor King 2021-04-10 00:44:33 UTC
From CI runs like [1]:

  : [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
    Run #0: Failed	0s
    1 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available

With:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-storage-version-migrator condition/Available'
Apr 09 13:23:49.846 E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/_NoMigratorPod changed: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:24:03.175 W clusteroperator/kube-storage-version-migrator condition/Available status/True reason/AsExpected changed: All is well
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/kube-storage-version-migrator+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152

Comment 1 Scott Dodson 2021-07-14 18:01:43 UTC
This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport this to 4.8 as soon as reasonable. We really need to get rid of the negative signal that we generate during upgrades when operators go degraded during normal operations.

Comment 4 Jan Chaloupka 2021-09-03 07:55:41 UTC
Checking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1432988446850289664:
```
4 unexpected clusteroperator state transitions during e2e test run 

Sep 01 09:20:16.134 - 206ms E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved
1 tests failed during this blip (2021-09-01 09:20:16.134607583 +0000 UTC to 2021-09-01 09:20:16.134607583 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Sep 01 10:03:16.976 - 9s    E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
1 tests failed during this blip (2021-09-01 10:03:16.976376214 +0000 UTC to 2021-09-01 10:03:16.976376214 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
```

From loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221630447200000%22,%221630533599000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade%2F1432988446850289664%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-kube-storage-version-migrator-operator%5C%22%22%7D%5D):
```
2021-09-01 11:20:16	
I0901 09:20:16.314719       1 status_controller.go:211] clusteroperator/kube-storage-version-migrator diff {"status":{"conditions":[{"lastTransitionTime":"2021-09-01T09:01:59Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-09-01T09:01:59Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}

2021-09-01 11:20:16	
I0901 09:20:16.086073       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-storage-version-migrator-operator", Name:"kube-storage-version-migrator-operator", UID:"ba007b65-7e7a-46f3-ab29-6cbea1b7f264", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-storage-version-migrator changed: Degraded message changed from "All is well" to "TargetDegraded: \"deployments\": etcdserver: leader changed\nTargetDegraded: ",Progressing changed from False to True ("Progressing: syncing openshift-kube-storage-version-migrator resources: \"deployments\": etcdserver: leader changed"),Available changed from True to False ("Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved")
```

This was a short blip (0.228646s) during which condition/Available flipped to False. client-go returned the "etcdserver: leader changed" error to the operator instead of retrying.
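
The blip length is the gap between the two klog timestamps in the operator log lines quoted above: the event that reports Available changing from True to False, and the status diff that shows Available back at True. A minimal sketch of that arithmetic, with the timestamps copied from those lines and the variable names chosen only for illustration:

```python
from datetime import datetime

# klog timestamps copied from the two operator log lines quoted above.
# Both are from the same day, so only the time of day is parsed.
went_unavailable = datetime.strptime("09:20:16.086073", "%H:%M:%S.%f")
recovered = datetime.strptime("09:20:16.314719", "%H:%M:%S.%f")

# Difference between the two timestamps, in seconds.
blip_seconds = (recovered - went_unavailable).total_seconds()
print(f"{blip_seconds:.6f}s")
```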

Solutions:
- have client-go retry when etcd reports "etcdserver: leader changed"
- have the operator detect the etcd leader change and retry without changing condition/Available to False
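
The second option can be sketched roughly as follows. This is an illustration of the idea only, not the operator's code: the real operator is written in Go, every name here (`is_transient`, `get_with_retry`, `next_available`) is hypothetical, and matching on the error string is an assumption made to keep the sketch self-contained.

```python
import time

# Assumption: the transient error is recognised by the message seen in the
# operator log above. A real operator would more likely inspect typed errors.
TRANSIENT_MARKER = "etcdserver: leader changed"


def is_transient(err):
    """Return True if the error looks like an etcd leader change."""
    return TRANSIENT_MARKER in str(err)


def get_with_retry(fetch, attempts=3, delay_seconds=0.1, sleep=time.sleep):
    """Call fetch(), retrying only on transient errors.

    Returns (result, None) on success, or (None, last_error) after a
    non-transient error or once all attempts are used up.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(), None
        except Exception as err:
            last_error = err
            if not is_transient(err):
                break
            if attempt < attempts - 1:
                sleep(delay_seconds)
    return None, last_error


def next_available(previous, fetch_ready_replicas, **retry_kwargs):
    """Compute the next Available status (True/False) for the operator.

    A transient lookup failure keeps the previous status instead of
    flipping Available to False.
    """
    replicas, err = get_with_retry(fetch_ready_replicas, **retry_kwargs)
    if err is not None:
        return previous if is_transient(err) else False
    return replicas > 0


# Hypothetical lookup that fails once with the transient error, then succeeds.
calls = {"n": 0}


def flaky_lookup():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("etcdserver: leader changed")
    return 1


print(next_available(True, flaky_lookup, sleep=lambda s: None))
```

In this sketch a lookup that keeps failing with the transient error leaves the previous status in place indefinitely. A real implementation would probably bound how long a stale status may be reported.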

Comment 5 Wally 2021-09-20 20:15:19 UTC
Pushing to 4.10.0.

Comment 10 W. Trevor King 2023-10-03 23:24:32 UTC
Moved to https://issues.redhat.com/browse/OCPBUGS-20062