Bug 1945443
Summary: | operator-lifecycle-manager-packageserver flaps Available=False with no reason or message | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
Component: | OLM | Assignee: | Ben Luddy <bluddy> | |
OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | bandrade, bluddy, davegord, nhale | |
Version: | 4.8 | Keywords: | Triaged, Upgrades | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1955761 1958285 1959009 (view as bug list) | Environment: | [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available |
Last Closed: | 2021-07-27 22:57:00 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1959009 |
Description
W. Trevor King
2021-03-31 23:33:37 UTC
Mar 31 22:22:06.723 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed: { "lastTransitionTime": "2021-03-31T22:22:06Z", "lastUpdateTime": "2021-03-31T22:22:06Z", "message": "installing: waiting for deployment packageserver to become ready: Waiting for rollout to finish: 1 of 2 updated replicas are available...\n", "phase": "Failed", "reason": "ComponentUnhealthy" }

Effectively, if there are fewer than 2 available packageserver pods, the CSV flips to Failed. I changed this upstream so that the deployment availability test reads from the Available deployment condition directly (instead of the direct comparison .status.AvailableReplicas < .status.UpdatedReplicas), so the packageserver CSV won't flap if its deployment is still available.

Verification is easy: observe that the packageserver CSV's .status.phase remains "Succeeded" while deleting one packageserver pod.

Confirmed that the packageserver deployment object now has a rollingUpdate strategy with maxSurge and maxUnavailable as proposed.

OCP: 4.8.0-0.nightly-2021-05-03-072623

oc get deployment packageserver -n openshift-operator-lifecycle-manager -o yaml | grep "spec:" -A 20
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: packageserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate

Marking as VERIFIED.
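To make the difference between the two availability tests concrete, here is a minimal Python sketch. It is not the OLM implementation (which is written in Go); the function names and the sample status dictionary are hypothetical, and only the field names (availableReplicas, updatedReplicas, and the Available condition) come from the Kubernetes Deployment status discussed above. The sample assumes that, with replicas: 2 and maxUnavailable: 1, the Deployment controller still reports Available=True when one pod is missing.

```python
# Simplified, hypothetical model of a Deployment's .status for illustration only.

def available_by_replica_count(status):
    """Old-style test: unavailable whenever availableReplicas < updatedReplicas."""
    return status.get("availableReplicas", 0) >= status.get("updatedReplicas", 0)


def available_by_condition(status):
    """New-style test: read the Deployment's own Available condition."""
    for cond in status.get("conditions", []):
        if cond["type"] == "Available":
            return cond["status"] == "True"
    # No Available condition reported yet: treat as unavailable.
    return False


# Assumed snapshot right after one of the two packageserver pods is deleted.
status = {
    "replicas": 2,
    "updatedReplicas": 2,
    "availableReplicas": 1,
    "conditions": [{"type": "Available", "status": "True"}],
}

print("replica-count test:", available_by_replica_count(status))
print("condition test:", available_by_condition(status))
```

Under these assumptions the replica-count test reports the deployment as unavailable (which is what would flip the CSV to Failed), while the condition-based test keeps it available.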
Verification in comment 6 was pretty narrow, and we still see a lot of Available=False issues in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Foperator-lifecycle-manager-packageserver+condition%2FAvailable+status%2FFalse&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-'
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 108% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 94% failed, 100% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
release-openshift-origin-installer-launch-aws (all) - 75 runs, 33% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Some of those include 4.7, and it doesn't look like this fix was backported. But there are a number of jobs in there that are 4.8 or 4.9 only, so I'd have expected a fix for this bug to keep this operator from going Available=False in those jobs.
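For readers who want to separate the 4.8/4.9-only jobs from the ones that involve 4.7, the following Python sketch parses result lines in the format shown above. The helper names are hypothetical, and the relationship impact ≈ failed% × match% (capped at 100) is inferred from the figures in this output, not taken from the CI search tool's documentation.

```python
import re

# Matches one result line of the CI search output quoted above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse(line):
    """Return the fields of one result line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    return {
        "job": fields["job"],
        "runs": int(fields["runs"]),
        "failed": int(fields["failed"]),
        "match": int(fields["match"]),
        "impact": int(fields["impact"]),
    }


def estimated_impact(failed, match):
    """Assumed relationship: impact is failed% times match%, capped at 100."""
    return min(100, round(failed * match / 100))


sample = [
    "periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 108% of failures match = 100% impact",
    "periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 56% failed, 20% of failures match = 11% impact",
    "periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 100% of failures match = 100% impact",
]

rows = [parse(line) for line in sample]
# Keep only jobs whose name does not mention 4.7 (a crude, name-based filter).
without_47 = [r for r in rows if r is not None and "4.7" not in r["job"]]

for r in without_47:
    print(r["job"], r["impact"], estimated_impact(r["failed"], r["match"]))
```

The name-based filter is deliberately simple; a job that starts from 4.7 without saying so in its name would slip through.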
Did you want me to re-open this bug, or open a new one for further investigation?

The patch attached to this BZ addresses one cause of CO availability flaps. I cloned another issue to track the empty reason/message -- https://bugzilla.redhat.com/show_bug.cgi?id=1955761 -- which will make it easier to distinguish one cause from another. I checked several recent jobs from the linked CI search, and none of them have the signature of the issue fixed by this patch (https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c3). I do see evidence of the second likely cause (examples in https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c4), which is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1953715 and is being worked on now.

The patch to populate reason/message on the CO condition merged yesterday, which made it easier to find a third cause of this flapping. Opened https://bugzilla.redhat.com/show_bug.cgi?id=1958285 to track it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438