Bug 1945443
| Summary: | operator-lifecycle-manager-packageserver flaps Available=False with no reason or message | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | OLM | Assignee: | Ben Luddy <bluddy> |
| OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | bandrade, bluddy, davegord, nhale |
| Version: | 4.8 | Keywords: | Triaged, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1955761 1958285 1959009 (view as bug list) | Environment: | [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available |
| Last Closed: | 2021-07-27 22:57:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1959009 | | |
Description
W. Trevor King
2021-03-31 23:33:37 UTC
```
Mar 31 22:22:06.723 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False changed:
{
  "lastTransitionTime": "2021-03-31T22:22:06Z",
  "lastUpdateTime": "2021-03-31T22:22:06Z",
  "message": "installing: waiting for deployment packageserver to become ready: Waiting for rollout to finish: 1 of 2 updated replicas are available...\n",
  "phase": "Failed",
  "reason": "ComponentUnhealthy"
}
```
Effectively, if there are fewer than 2 available packageserver pods, the CSV flips to Failed.

I changed this upstream so that the deployment availability test reads from the Available deployment condition directly (instead of the direct comparison .status.AvailableReplicas < .status.UpdatedReplicas), so the packageserver CSV won't flap if its deployment is still available.

Verification is easy: observe that the packageserver CSV's .status.phase remains "Succeeded" while deleting one packageserver pod.
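For illustration only, here is a minimal Go sketch of the two availability checks described above, built on the k8s.io/api types. The function names are hypothetical and this is not the actual OLM code; it just contrasts the old replica-count comparison with reading the Deployment's Available condition:

```go
// Sketch of the availability check change described above (illustrative,
// not the actual OLM implementation).
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// Old behavior (roughly): report unhealthy whenever fewer replicas are
// available than have been updated, which is transiently true during any
// rolling update or single pod restart of a 2-replica deployment.
func availableByReplicaCount(d *appsv1.Deployment) bool {
	return d.Status.AvailableReplicas >= d.Status.UpdatedReplicas
}

// New behavior (roughly): defer to the Deployment controller's own
// Available condition, which honors the deployment's maxUnavailable
// budget and stays True while the required minimum is still serving.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

With the rolling-update budget shown in the deployment spec below (replicas: 2, maxUnavailable: 1), the Deployment's Available condition stays True while a single pod restarts, so a CSV driven by that condition no longer flips to Failed.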
Confirmed that the packageserver deployment object now has a rollingUpdate strategy with maxSurge and maxUnavailable as proposed.
OCP: 4.8.0-0.nightly-2021-05-03-072623
```
$ oc get deployment packageserver -n openshift-operator-lifecycle-manager -o yaml | grep "spec:" -A 20
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: packageserver
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
```
Marking as VERIFIED.
Verification in comment 6 was pretty narrow, and we still see a lot of Available=False issues in CI:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Foperator-lifecycle-manager-packageserver+condition%2FAvailable+status%2FFalse&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-'
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 108% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 17 runs, 94% failed, 100% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 13 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 88% failed, 114% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 16 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 88% failed, 57% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 15 runs, 93% failed, 14% of failures match = 13% impact
release-openshift-origin-installer-launch-aws (all) - 75 runs, 33% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-compact-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
```

Some of those include 4.7, and it doesn't look like this fix was backported. But there are a number of jobs in there that are 4.8 or 4.9 only, so I'd have expected a fix for this bug to keep this operator from going Available=False in those jobs. Did you want me to re-open this bug, or open a new one for further investigation?

The patch attached to this BZ addresses one cause of CO availability flaps. I cloned another issue to track the empty reason/message -- https://bugzilla.redhat.com/show_bug.cgi?id=1955761 -- which will make it easier to distinguish one cause from another.

I checked several recent jobs from the linked CI search, and none of them have the signature of the issue fixed by this patch (https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c3). I do see evidence of the second likely cause (examples in https://bugzilla.redhat.com/show_bug.cgi?id=1945443#c4), which is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1953715 and is being worked on now.

The patch to populate reason/message on the CO condition merged yesterday, which made it easier to find a third cause of this flapping. Opened https://bugzilla.redhat.com/show_bug.cgi?id=1958285 to track it.
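For context, "populating reason/message on the CO condition" means the Available=False condition carries a non-empty Reason and Message instead of the blank values seen in the original report. The following Go sketch only illustrates that shape using the github.com/openshift/api config/v1 types; the reason string and function name are invented here and this is not the actual patch tracked in bug 1955761:

```go
// Illustrative sketch of an Available condition with Reason/Message set.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// availableCondition is a hypothetical helper: if the packageserver CSV is
// not Succeeded, surface its phase and message on the ClusterOperator's
// Available condition instead of reporting False with empty fields.
func availableCondition(csvPhase, csvMessage string) configv1.ClusterOperatorStatusCondition {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
	}
	if csvPhase != "Succeeded" {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "ClusterServiceVersionNotSucceeded" // hypothetical reason string
		cond.Message = fmt.Sprintf("ClusterServiceVersion packageserver is in phase %s: %s", csvPhase, csvMessage)
	}
	return cond
}

func main() {
	c := availableCondition("Failed", "installing: waiting for deployment packageserver to become ready")
	fmt.Printf("%s=%s reason=%q message=%q\n", c.Type, c.Status, c.Reason, c.Message)
}
```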
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438