1824981 – [4.3 upgrade][alert]Failed to install Operator packageserver version 0.13.0. Reason-ComponentUnhealthy

Summary: [4.3 upgrade][alert]Failed to install Operator packageserver version 0.13.0. ...

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Evan Cordell
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-16 19:10 UTC by Hongkai Liu
Modified:	2024-06-13 22:34 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-12 23:34:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1809232	1	None	None	None	2023-09-14 05:53:41 UTC
Red Hat Bugzilla	1811343	0	unspecified	CLOSED	GCP e2e release run reported high number of etcd changes alert	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1824983	0	high	CLOSED	[4.3 upgrade][alert] KubePodCrashLooping: Pod openshift-sdn/sdn-24wfh (sdn) is restarting 0.42 times / 5 minutes.	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1824986	0	low	CLOSED	[4.3 upgrade][alert]: etcdMembersDown	2023-09-14 05:55:34 UTC
Red Hat Bugzilla	1824988	0	medium	CLOSED	[4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default a...	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1824996	0	medium	CLOSED	[4.3 upgrade][alert] KubeNodeUnreachable: ip-10-0-159-123.ec2.internal is unreachable and some workloads may be reschedu...	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1825000	0	medium	CLOSED	[4.3 upgrade][alert] etcdHighNumberOfLeaderChanges: etcd cluster "etcd": 7.5 leader changes within the last 15 minutes.	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1825001	0	low	CLOSED	[4.3 upgrade][alert] KubeAPILatencyHigh: The API server has an abnormal latency of 0.3257691935407682 seconds for POST p...	2023-10-06 19:41:38 UTC
Red Hat Bugzilla	1825003	0	low	CLOSED	[4.3 upgrade][clusterversion]Unclear message: Unable to apply ...: an unknown error has occurred	2021-03-16 16:13:31 UTC
Red Hat Bugzilla	1825006	0	low	CLOSED	[4.3 upgrade][clusterversion] scary: Unable to apply ...: the cluster operator ... has not yet successfully rolled out	2022-05-06 12:29:29 UTC
Red Hat Bugzilla	1825008	0	unspecified	CLOSED	[4.3 upgrade][clusterverion]the cluster operator machine-config has not yet successfully rolled out	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1828630	0	medium	CLOSED	[4.3 upgrade][alert]AggregatedAPIErrors	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1828631	0	low	CLOSED	[4.3 upgrade][clusterverion]the cluster operator openshift-apiserver is degraded	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1828633	0	unspecified	CLOSED	[4.3 upgrade][alert]ImagePruningDisabled image-registry-operator	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1830390	0	medium	CLOSED	[4.4 upgrade][alert]Cluster operator machine-config has been degraded for 10 mins: RenderConfigFailed	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1846397	0	medium	CLOSED	[4.4 upgrade][alert] AlertmanagerConfigInconsistent	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1851281	0	unspecified	CLOSED	[4.4 upgrade][alert]Deployment openshift-machine-config-operator/etcd-quorum-guard has not matched the expected number o...	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1889540	0	unspecified	CLOSED	[4.5 upgrade][alert]CloudCredentialOperatorDown	2021-02-24 15:27:08 UTC
Red Hat Bugzilla	1889541	0	medium	CLOSED	[4.5 upgrade]console is not accessible during cluster upgrade	2021-02-22 00:41:40 UTC

Description Hongkai Liu 2020-04-16 19:10:40 UTC

During the upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice).
But those alerts and messages are too frightening.

I would like to create a bug for each of those so that I can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587057909431000


[FIRING:1] FailingOperator olm-operator-metrics (https-metrics openshift-operator-lifecycle-manager 10.129.0.73:8081 packageserver openshift-operator-lifecycle-manager Failed olm-operator-68cdbb77d9-x87nh openshift-monitoring/k8s ComponentUnhealthy olm-operator-metrics info 0.13.0)
Failed to install Operator packageserver version 0.13.0. Reason-ComponentUnhealthy

Comment 2 Hongkai Liu 2020-04-16 19:48:35 UTC

The version before upgrade: 4.3.0-0.nightly-2020-03-23-130439

Comment 3 Hongkai Liu 2020-04-16 20:18:40 UTC

In general, it would be much more comforting if
1. no alerts would be fired if upgrade is considered successful.
2. the status of clusterversion showed a nicer message while the upgrade is still in progress, instead of "Unable to apply 4.3.0-0.nightly-2020-04-13-190424: the cluster operator machine-config has not yet successfully rolled out"

Comment 4 Clayton Coleman 2020-04-16 20:21:40 UTC

In general, all components that can should not fire an alert on short-term disruption that is within safe bounds, especially during an upgrade.

On 2 we should potentially tolerate that one longer.

Comment 5 W. Trevor King 2020-04-16 21:49:45 UTC

(In reply to Hongkai Liu from comment #0)
> oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Largely unrelated, but just to get the word out in close proximity to anyone mentioning --force, you would be much safer using by-digest pullspecs for the reasons described in [1,2], both of which are in flight to land client-side guards/warnings around this.

[1]: https://github.com/openshift/oc/pull/390
[2]: https://github.com/openshift/oc/pull/238

Comment 6 W. Trevor King 2020-04-16 21:52:11 UTC

(In reply to Hongkai Liu from comment #3)
> 1. no alerts would be fired if upgrade is considered successful.

[1] is in flight so we can enforce this, at least for update environments that we cover in CI.

[1]: https://github.com/openshift/origin/pull/24786
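
As an illustration only of what "no alerts fired during a successful upgrade" could look like as a check (this is not the implementation in the pull request above), here is a minimal Python sketch. It assumes the alerts were already fetched in the shape returned by the Prometheus /api/v1/alerts endpoint (a list of objects with "labels" and "state"). The function name unexpected_firing and the allow-list are hypothetical; Watchdog is allowed because it is designed to fire continuously.

```python
from typing import Dict, Iterable, List

# Assumption: only the always-on Watchdog alert is acceptable during an upgrade.
ALWAYS_FIRING = frozenset({"Watchdog"})


def unexpected_firing(alerts: Iterable[Dict], allowed=ALWAYS_FIRING) -> List[str]:
    """Return the sorted names of firing alerts that are not in the allow-list.

    Each alert is expected to look like an entry of data.alerts in a
    Prometheus /api/v1/alerts response: {"labels": {...}, "state": "..."}.
    Alerts in the "pending" state are ignored because they have not fired yet.
    """
    names = {
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    }
    return sorted(names - set(allowed))


if __name__ == "__main__":
    # Hypothetical sample data modeled on the alert quoted in the description.
    sample = [
        {"labels": {"alertname": "Watchdog"}, "state": "firing"},
        {"labels": {"alertname": "FailingOperator", "severity": "info"}, "state": "firing"},
        {"labels": {"alertname": "KubeAPILatencyHigh"}, "state": "pending"},
    ]
    problems = unexpected_firing(sample)
    if problems:
        raise SystemExit("alerts fired during upgrade: " + ", ".join(problems))
```

A CI job could run such a check after the upgrade reports success and fail the run if the returned list is non-empty.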

Comment 7 Hongkai Liu 2020-04-17 17:52:24 UTC

Thanks, Trevor.
I will bug you before the next upgrade about:
- the oc cli version
- how to get the sha for an upgrade and how to use it in the oc-adm-update command.
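
For the second item, a by-digest pullspec has the form repository@sha256:<digest> instead of repository:tag. The Python sketch below only shows the string handling and is not an official procedure. It assumes the digest has already been obtained separately (for example, from the "Digest:" line that `oc adm release info <pullspec>` prints). The helper names to_digest_pullspec and upgrade_argv are hypothetical, and the digest in the usage example is an all-zero placeholder, not a real release digest. The command flags are the ones used in comment #0, minus --force.

```python
import re
from typing import List

# A sha256 digest is "sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def to_digest_pullspec(pullspec: str, digest: str) -> str:
    """Replace the tag (or an existing digest) of a pullspec with the given digest."""
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    repo = pullspec.split("@", 1)[0]  # drop an existing digest, if present
    # A colon after the last slash separates the tag; an earlier colon
    # belongs to a registry port such as "localhost:5000".
    if repo.rfind(":") > repo.rfind("/"):
        repo = repo[: repo.rfind(":")]
    return f"{repo}@{digest}"


def upgrade_argv(context: str, pullspec: str) -> List[str]:
    """Build the oc argument list, using the flags from comment #0 without --force."""
    return [
        "oc", "--context", context,
        "adm", "upgrade", "--allow-explicit-upgrade",
        "--to-image", pullspec,
    ]


if __name__ == "__main__":
    placeholder = "sha256:" + "0" * 64  # placeholder, NOT a real release digest
    spec = to_digest_pullspec(
        "registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424",
        placeholder,
    )
    print(" ".join(upgrade_argv("build01", spec)))
```

Because the digest pins the exact image content, the cluster cannot be pointed at a different payload if the tag is later moved.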

Comment 8 Clayton Coleman 2020-04-20 18:00:15 UTC

This is more than low severity.  It caused a representative customer admin team to panic and assume our product was faulty.

Comment 13 W. Trevor King 2020-06-02 18:26:48 UTC

Updates should be zero-downtime.  If the root cause here is an API outage, then assign this bug to the API team or whoever is responsible for the API outage.  Or use this bug to improve the condition's reason/message, because currently "Failed to install Operator packageserver version 0.13.0. Reason-ComponentUnhealthy" does not sound like "API outage" to me.  OLM should clearly explain why it's failing, so it's clear that another component is responsible for the degradation.  It's not OLM's responsibility to raise timeouts to work around bugs in other components.
