In a recent CI run from 4.8.0-0.ci-2021-03-31-054135 to 4.8.0-0.ci-2021-03-31-202831, the API server went down (not great), but the package server was flapping Available without explaining why [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1377357414700945408/build-log.txt | grep 'clusteroperator/operator-lifecycle-manager-packageserver'
Mar 31 21:54:08.286 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
Mar 31 21:54:08.286 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/True changed: Working toward 0.17.0
Mar 31 21:54:57.249 W clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/True changed:
Mar 31 21:54:57.249 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/False changed: Deployed version 0.17.0
Mar 31 21:54:57.249 I clusteroperator/operator-lifecycle-manager-packageserver versions: operator 4.8.0-0.ci-2021-03-31-054135 -> 4.8.0-0.ci-2021-03-31-202831
Mar 31 22:08:32.663 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
Mar 31 22:08:32.801 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/True changed: Working toward 0.17.0
Mar 31 22:08:54.980 W clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/True changed:
Mar 31 22:08:54.980 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/False changed: Deployed version 0.17.0
Mar 31 22:15:18.755 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
Mar 31 22:15:19.493 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/True changed: Working toward 0.17.0
Mar 31 22:15:49.711 W clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/True changed:
Mar 31 22:15:49.711 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/False changed: Deployed version 0.17.0
Mar 31 22:22:06.723 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
Mar 31 22:22:28.248 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/True changed: Working toward 0.17.0
Mar 31 22:22:33.667 W clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/True changed:
Mar 31 22:22:33.667 W clusteroperator/operator-lifecycle-manager-packageserver condition/Progressing status/False changed: Deployed version 0.17.0

It should set a reason and message explaining Available=False, because that's a "page the admin at midnight" level condition.
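For illustration, a minimal sketch of the kind of populated condition being asked for, assuming the openshift/api config/v1 types; the Reason and Message strings below are made up for the example, not OLM's actual values:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical Available=False payload with Reason and Message filled in, so an
	// admin paged on this condition can see why the operator is unavailable.
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		Reason:             "ClusterServiceVersionNotSucceeded", // illustrative only
		Message:            "ClusterServiceVersion packageserver is in phase Failed: waiting for deployment packageserver to become ready",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s reason=%s\nmessage: %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}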
Seems reasonably frequent:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Foperator-lifecycle-manager-packageserver+condition%2FAvailable+status%2FFalse&maxAge=24h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 18 runs, 61% failed, 155% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp (all) - 19 runs, 89% failed, 6% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 18 runs, 89% failed, 100% of failures match = 89% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 6 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 6 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 7 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-cri-o-cri-o-master-e2e-gcp (all) - 13 runs, 31% failed, 25% of failures match = 8% impact
..
pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-upgrade (all) - 5 runs, 40% failed, 150% of failures match = 60% impact
rehearse-17027-pull-ci-openshift-cluster-api-provider-baremetal-master-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
...
rehearse-17352-pull-ci-openshift-cluster-api-provider-baremetal-master-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 15 runs, 73% failed, 82% of failures match = 60% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.5-to-4.6-to-4.7-to-4.8-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

I'm guessing this is 4.8 based on that, but once folks understand the issue it should be easier to tell whether it has always been this way or is a regression.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1377357414700945408
Mar 31 22:22:06.723 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:

{
  "lastTransitionTime": "2021-03-31T22:22:06Z",
  "lastUpdateTime": "2021-03-31T22:22:06Z",
  "message": "installing: waiting for deployment packageserver to become ready: Waiting for rollout to finish: 1 of 2 updated replicas are available...\n",
  "phase": "Failed",
  "reason": "ComponentUnhealthy"
}

Effectively, if there are fewer than 2 available packageserver pods, the CSV flips to Failed. I changed this upstream so that the deployment availability test reads from the deployment's Available condition directly (instead of the direct comparison .status.AvailableReplicas < .status.UpdatedReplicas), so the packageserver CSV won't flap if its deployment is still available. Verification is easy: observe that the packageserver CSV's .status.phase remains "Succeeded" while deleting one packageserver pod.
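To make the change concrete, here is a minimal sketch under the assumption that the check operates on an appsv1.Deployment; this is not the literal upstream patch, just an illustration of trusting the Deployment's own Available condition instead of comparing replica counts:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable trusts the Deployment controller's own Available condition
// (which honors the rolling-update strategy's maxUnavailable) rather than
// requiring availableReplicas >= updatedReplicas, so a single pod being replaced
// does not flip the owning CSV to Failed.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false // no Available condition reported yet: treat as unavailable
}

func main() {
	// Mid-rollout state: 1 of 2 updated replicas ready, but the Deployment is still
	// within its maxUnavailable budget, so its Available condition stays True.
	d := &appsv1.Deployment{
		Status: appsv1.DeploymentStatus{
			UpdatedReplicas:   2,
			AvailableReplicas: 1,
			Conditions: []appsv1.DeploymentCondition{
				{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Println(deploymentAvailable(d)) // true; the old replica comparison would have reported unhealthy
}

With maxUnavailable: 1 on a 2-replica deployment, the Deployment controller keeps Available=True while one pod is being replaced, which is exactly the situation the packageserver CSV should tolerate.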
Confirmed that the packageserver deployment now has a rollingUpdate strategy with maxSurge and maxUnavailable set as proposed.

OCP: 4.8.0-0.nightly-2021-05-03-072623

$ oc get deployment packageserver -n openshift-operator-lifecycle-manager -o yaml | grep "spec:" -A 20
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: packageserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate

Marking as VERIFIED.
Verification in comment 6 was pretty narrow, and we still see a lot of Available=False issues in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Foperator-lifecycle-manager-packageserver+condition%2FAvailable+status%2FFalse&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-'
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 108% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 94% failed, 100% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
release-openshift-origin-installer-launch-aws (all) - 75 runs, 33% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Some of those include 4.7, and it doesn't look like this fix was backported. But there are a number of jobs in there that are 4.8 or 4.9 only, so I'd have expected a fix for this bug to keep this operator from going Available=False in those jobs. Did you want me to re-open this bug, or open a new one for further investigation?
The patch attached to this BZ addresses one cause of CO availability flaps. I cloned another issue to track the empty reason/message -- https://bugzilla.redhat.com/show_bug.cgi?id=1955761 -- which will make it easier to distinguish one cause from another. I checked several recent jobs from the linked CI search, and none of them have the signature of the issue fixed by this patch (https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c3). I do see evidence of the second likely cause (examples in https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c4), which is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1953715 and is being worked on now.
The patch to populate reason/message on the CO condition merged yesterday, which made it easier to find a third cause of this flapping. Opened https://bugzilla.redhat.com/show_bug.cgi?id=1958285 to track it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438