Bug 1945443
| Summary: | operator-lifecycle-manager-packageserver flaps Available=False with no reason or message | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | bandrade, bluddy, davegord, nhale |
| Version: | 4.8 | Keywords: | Triaged, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1955761 1958285 1959009 (view as bug list) | Environment: | [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available |
| Last Closed: | 2021-07-27 22:57:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1959009 | | |
Description
W. Trevor King
2021-03-31 23:33:37 UTC
```
Mar 31 22:22:06.723 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
{
  "lastTransitionTime": "2021-03-31T22:22:06Z",
  "lastUpdateTime": "2021-03-31T22:22:06Z",
  "message": "installing: waiting for deployment packageserver to become ready: Waiting for rollout to finish: 1 of 2 updated replicas are available...\n",
  "phase": "Failed",
  "reason": "ComponentUnhealthy"
}
```
Effectively, if there are fewer than 2 available packageserver pods, the CSV flips to Failed.

I changed this upstream so that the deployment availability test reads from the Available deployment condition directly (instead of the direct comparison .status.AvailableReplicas < .status.UpdatedReplicas), so the packageserver CSV won't flap if its deployment is still available.

Verification is easy: observe that the packageserver CSV's .status.phase remains "Succeeded" while deleting one packageserver pod.
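For illustration only, here is a minimal Go sketch of the two availability checks described above, built on the k8s.io/api types. The function names are hypothetical and this is not the actual OLM code; it just contrasts the old replica-count comparison with reading the Deployment's Available condition:

```go
// Sketch of the availability check change described above (illustrative,
// not the actual OLM implementation).
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// Old behavior (roughly): report unhealthy whenever fewer replicas are
// available than have been updated, which is transiently true during any
// rolling update or single pod restart of a 2-replica deployment.
func availableByReplicaCount(d *appsv1.Deployment) bool {
	return d.Status.AvailableReplicas >= d.Status.UpdatedReplicas
}

// New behavior (roughly): defer to the Deployment controller's own
// Available condition, which honors the deployment's maxUnavailable
// budget and stays True while the required minimum is still serving.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

With the rolling-update budget shown in the deployment spec below (replicas: 2, maxUnavailable: 1), the Deployment's Available condition stays True while a single pod restarts, so a CSV driven by that condition no longer flips to Failed.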
Confirmed that the packageserver deployment object now has a rollingUpdate strategy with maxSurge and maxUnavailable as proposed.
OCP: 4.8.0-0.nightly-2021-05-03-072623
```
$ oc get deployment packageserver -n openshift-operator-lifecycle-manager -o yaml | grep "spec:" -A 20
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: packageserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
```
Marking as VERIFIED.
Verification in comment 6 was pretty narrow, and we still see a lot of Available=False issues in CI:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Foperator-lifecycle-manager-packageserver+condition%2FAvailable+status%2FFalse&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-'
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 108% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 94% failed, 100% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
release-openshift-origin-installer-launch-aws (all) - 75 runs, 33% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
```

Some of those include 4.7, and it doesn't look like this fix was backported. But there are a number of jobs in there that are 4.8 or 4.9 only, so I'd have expected a fix for this bug to keep this operator from going Available=False in those jobs. Did you want me to re-open this bug, or open a new one for further investigation?

The patch attached to this BZ addresses one cause of CO availability flaps. I cloned another issue to track the empty reason/message -- https://bugzilla.redhat.com/show_bug.cgi?id=1955761 -- which will make it easier to distinguish one cause from another.

I checked several recent jobs from the linked CI search, and none of them have the signature of the issue fixed by this patch (https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c3). I do see evidence of the second likely cause (examples in https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c4), which is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1953715 and is being worked on now.

The patch to populate reason/message on the CO condition merged yesterday, which made it easier to find a third cause of this flapping. Opened https://bugzilla.redhat.com/show_bug.cgi?id=1958285 to track it.
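For context, "populating reason/message on the CO condition" means the Available=False condition carries a non-empty Reason and Message instead of the blank values seen in the original report. The following Go sketch only illustrates that shape using the github.com/openshift/api config/v1 types; the reason string and function name are invented here and this is not the actual patch tracked in bug 1955761:

```go
// Illustrative sketch of an Available condition with Reason/Message set.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// availableCondition is a hypothetical helper: if the packageserver CSV is
// not Succeeded, surface its phase and message on the ClusterOperator's
// Available condition instead of reporting False with empty fields.
func availableCondition(csvPhase, csvMessage string) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if csvPhase != "Succeeded" {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "ClusterServiceVersionNotSucceeded" // hypothetical reason string
		cond.Message = fmt.Sprintf("ClusterServiceVersion packageserver is in phase %s: %s", csvPhase, csvMessage)
	}
	return cond
}

func main() {
	c := availableCondition("Failed", "installing: waiting for deployment packageserver to become ready")
	fmt.Printf("%s=%s reason=%q message=%q\n", c.Type, c.Status, c.Reason, c.Message)
}
```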
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438