Bug 1948087 - kube-storage-version-migrator should not set Available=False _NoMigratorPod on updates
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-storage-version-migrator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Rahul Gangwar
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
Reported: 2021-04-10 00:44 UTC by W. Trevor King
Modified: 2023-10-03 23:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-25 12:31:01 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-04-10 00:44:33 UTC
From CI runs like [1]:

  : [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
    Run #0: Failed	0s
    1 unexpected clusteroperator state transitions during e2e test run 

    Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available

With:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-storage-version-migrator condition/Available'
Apr 09 13:23:49.846 E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/_NoMigratorPod changed: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:23:49.846 - 13s   E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
Apr 09 13:24:03.175 W clusteroperator/kube-storage-version-migrator condition/Available status/True reason/AsExpected changed: All is well
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available

This failure is very common in periodic upgrade jobs over the past 24 hours:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/kube-storage-version-migrator+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 94% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152

Comment 1 Scott Dodson 2021-07-14 18:01:43 UTC
This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as reasonable. We really need to get rid of the negative signal that we generate during upgrades when operators go degraded during normal operations.

Comment 4 Jan Chaloupka 2021-09-03 07:55:41 UTC
Checking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1432988446850289664:
```
4 unexpected clusteroperator state transitions during e2e test run 

Sep 01 09:20:16.134 - 206ms E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved
1 tests failed during this blip (2021-09-01 09:20:16.134607583 +0000 UTC to 2021-09-01 09:20:16.134607583 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Sep 01 10:03:16.976 - 9s    E clusteroperator/kube-storage-version-migrator condition/Available status/False reason/Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
1 tests failed during this blip (2021-09-01 10:03:16.976376214 +0000 UTC to 2021-09-01 10:03:16.976376214 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
```

From loki (https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221630447200000%22,%221630533599000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade%2F1432988446850289664%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-kube-storage-version-migrator-operator%5C%22%22%7D%5D):
```
2021-09-01 11:20:16	
I0901 09:20:16.314719       1 status_controller.go:211] clusteroperator/kube-storage-version-migrator diff {"status":{"conditions":[{"lastTransitionTime":"2021-09-01T09:01:59Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-09-01T09:20:16Z","message":"All is well","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-09-01T09:01:59Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}

2021-09-01 11:20:16	
I0901 09:20:16.086073       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-storage-version-migrator-operator", Name:"kube-storage-version-migrator-operator", UID:"ba007b65-7e7a-46f3-ab29-6cbea1b7f264", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-storage-version-migrator changed: Degraded message changed from "All is well" to "TargetDegraded: \"deployments\": etcdserver: leader changed\nTargetDegraded: ",Progressing changed from False to True ("Progressing: syncing openshift-kube-storage-version-migrator resources: \"deployments\": etcdserver: leader changed"),Available changed from True to False ("Available: deployment/migrator.openshift-kube-storage-version-migrator: could not be retrieved")
```

This was a short blip (0.228646s) during which condition/Available flipped to False. client-go returned the "etcdserver: leader changed" error instead of retrying.

Solutions:
- have client-go retry when etcd reports a leader change
- have the operator detect the leader-change error and retry without setting condition/Available to False

Comment 5 Wally 2021-09-20 20:15:19 UTC
Pushing to 4.10.0.

Comment 10 W. Trevor King 2023-10-03 23:24:32 UTC
Moved to https://issues.redhat.com/browse/OCPBUGS-20062

