Description of problem: The last three 4.11 nightly payloads have all failed, each with a metal-ipi-ovn-ipv6 job failure. In each, the cluster fails to install, complaining:

level=info msg=Waiting up to 1h0m0s (until 3:54AM) for the cluster at https://api.ostest.test.metalkube.org:6443 to initialize...
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, kube-apiserver, monitoring, openshift-samples
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, kube-apiserver, monitoring, openshift-samples
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 746 of 777 done (96% complete)
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, insights, monitoring
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 768 of 777 done (98% complete)
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, insights
level=debug msg=Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-03-10-220936: 773 of 777 done (99% complete)
level=debug msg=Still waiting for the cluster to initialize: Cluster operator insights is not available
bash: line 52: 16671 Killed timeout -s 9 105m make

Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1502214830407290880
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1502099122117677056
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1501915647955701760

Using the last of the three links above for analysis.

The insights operator in its log reports:

I0310 15:45:44.183372 1 operator.go:201] The last pod state is unhealthy

This code originates from:
https://github.com/openshift/insights-operator/blob/b67157fc871fe846850235334b7eaa6ca3a74547/pkg/controller/operator.go#L201

This seems to indicate it is unhappy about some container status on the insights pod itself. The status for that pod can be seen here, but bear in mind it was captured after the install failed, and the state could have changed since:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1501915647955701760/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods.json

We have no e2e intervals graph as this cluster did not make it past install.
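For context on what that log line implies, here is a minimal sketch (not the operator's actual code, and written only from the behavior described above) of the kind of self-check suggested by operator.go:201: inspect the operator's own pod and treat it as unhealthy if a container has restarted or last terminated with a non-zero exit code. The label selector and the exact unhealthy criteria are assumptions for illustration; only the openshift-insights namespace is taken from the product.

// sketch.go - hypothetical illustration, NOT the insights-operator implementation
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// lastPodStateUnhealthy reports whether any insights-operator pod container
// looks unhealthy: a prior termination with a non-zero exit code, or restarts.
func lastPodStateUnhealthy(ctx context.Context, client kubernetes.Interface) (bool, error) {
	pods, err := client.CoreV1().Pods("openshift-insights").List(ctx, metav1.ListOptions{
		LabelSelector: "app=insights-operator", // assumed label, for illustration only
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if term := cs.LastTerminationState.Terminated; term != nil && term.ExitCode != 0 {
				return true, nil
			}
			if cs.RestartCount > 0 {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	unhealthy, err := lastPodStateUnhealthy(context.Background(), client)
	if err != nil {
		panic(err)
	}
	if unhealthy {
		// Mirrors the message seen in the failed runs.
		fmt.Println("The last pod state is unhealthy")
	}
}

If a check along these lines feeds the operator's Available/Degraded conditions, a transient container restart on the insights pod would be enough to keep the operator reporting unavailable and hold the install at 99%, which matches what the installer log shows.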
This problem only affects metal-ipi; insights install success is very close to 100% on all other variants: https://sippy.ci.openshift.org/sippy-ng/install/4.11/operators

As this blocks all nightly payloads for every part of the org that depends on them, TRT believes this bug should be marked urgent; we need to get the payloads flowing again as soon as possible. I am unsure whether this is a metal problem or an insights problem. Starting with insights, but I will notify the metal teams as well.
We believe we've identified the problematic PR and opened a revert: https://github.com/openshift/insights-operator/pull/592

This is the second time this PR has caused a regression and had to be reverted; we need to make sure it works on metal-ipi before it goes in again.
"cluster bootstrap should succeed" test shows this problem resolving as of around Mar 11. I think this is safe to verify.
Verified on 4.11.0-0.ci-2022-03-15-032841. The test works as expected.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069