Bug 2018965 - e2e-metal-ipi-upgrade is permafailing in 4.10
Summary: e2e-metal-ipi-upgrade is permafailing in 4.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Arda Guclu
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-01 11:34 UTC by Stephen Benjamin
Modified: 2022-03-10 16:24 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade=all
Last Closed: 2022-03-10 16:23:41 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:24:03 UTC)

Description Stephen Benjamin 2021-11-01 11:34:29 UTC
periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade

is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade


Jobs are failing with:

event happened 52 times, something is wrong: ns/openshift-kube-controller-manager pod/kube-controller-manager-master-2 node/master-2 - reason/BackOff Back-off restarting failed container

Example run:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade/1454701490869374976


The logs for the kube-controller-manager-recovery-controller on master-2 are scrolling this:

2021-10-31T08:09:40.342954618Z ++ ss -Htanop '(' sport = 9443 ')'
2021-10-31T08:09:40.347027579Z + '[' -n 'LISTEN 0      128    *:9443 *:*' ']'
2021-10-31T08:09:40.347082466Z + sleep 1
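
That trace reads like a wait-for-port loop: the script keeps checking whether anything is listening on TCP 9443 and sleeps while something is. A rough reconstruction of that behaviour, inferred from the trace above rather than copied from the actual static pod manifest:

# Hypothetical reconstruction of the loop implied by the trace above
# (not the real recovery-controller startup script).
# It blocks until nothing on the node is listening on TCP port 9443.
while [ -n "$(ss -Htanop '(' sport = 9443 ')')" ]; do
  sleep 1   # another process still holds 9443, so this never completes
done
# Only once the port is free does the script move on and start the controller.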


The other two control plane nodes report logs like this:

2021-10-31T07:27:04.940049797Z I1031 07:27:04.940010       1 leaderelection.go:248] attempting to acquire leader lease openshift-kube-controller-manager/cert-recovery-controller-lock...
2021-10-31T07:45:47.307456099Z E1031 07:45:47.307324       1 leaderelection.go:330] error retrieving resource lock openshift-kube-controller-manager/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cert-recovery-controller-lock?timeout=1m47s": dial tcp [::1]:6443: connect: connection refused
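
Those errors are the leader-election client failing to reach the node-local apiserver on localhost:6443. For anyone poking at this from a shell on the affected node, a couple of illustrative checks (how you get the shell, e.g. oc debug node/<name>, is up to you):

# Is anything listening on the local apiserver port?
ss -Htln '(' sport = 6443 ')'
# Does the endpoint respond at all?
curl -ks https://localhost:6443/healthz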


The operator reports:


2021-10-31T08:09:43.561392468Z I1031 08:09:43.555955       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"31940718-fc55-4c5a-b4ad-23c95667c430", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-master-2 container \"kube-controller-manager-recovery-controller\" is terminated: Error: 9443 *:*' ']'\nStaticPodsDegraded: + sleep 1\nStaticPodsDegraded: ++ ss -Htanop '(' sport = 9443 ')'\nStaticPodsDegraded: + '[' -n 'LISTEN 0      128    *:9443 *:*' ']'\nStaticPodsDegraded: "

Comment 2 Bob Fournier 2021-11-02 12:58:46 UTC
Arda has a fix to change the webhook port number being used (https://github.com/openshift/cluster-baremetal-operator/pull/213), which was thought to be the source of the problem; however, the PR's upgrade job still failed with the same error. So although that fix is necessary, there may be something else going on.

Comment 3 Stephen Benjamin 2021-11-02 14:04:35 UTC
It does not look like all the references were fixed:


~  git  cluster-baremetal-operator $ grep -r 9443 .
./config/profiles/default/manager_webhook_patch.yaml:        - containerPort: 9443
./config/webhook/service.yaml:      targetPort: 9443
./manifests/0000_31_cluster-baremetal-operator_03_webhookservice.yaml:    targetPort: 9443
./manifests/0000_31_cluster-baremetal-operator_06_deployment.yaml:        - containerPort: 9443
./vendor/github.com/prometheus/procfs/fixtures.ttar:trans 706 944304 0
./vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:var DefaultPort = 9443
./vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:	// It will be defaulted to 9443 if unspecified.
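
Two of those hits are only the controller-runtime default port in vendored code. A narrower, illustrative search that skips vendor/ makes the remaining first-party references easier to see (same checkout as above):

grep -rn --exclude-dir=vendor 9443 .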

Comment 4 Arda Guclu 2021-11-02 14:15:17 UTC
BMO is using the masters' IP addresses, but cluster-baremetal-operator uses 10.*.*.* IP addresses and runs for a long time (maybe that's why it does not cause a port conflict). I think there is no need to change the above configurations to fix this bug, but in the long term we should change to a different port number.
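
One way to sanity-check that on a live cluster would be to look at which local addresses the 9443 listeners are actually bound to on a control-plane node, e.g. from a root shell on the node (illustrative only):

# List every TCP listener on 9443 with its bound address and owning process
ss -Htlnp '(' sport = 9443 ')'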

Comment 5 Arda Guclu 2021-11-04 08:22:34 UTC
According to the latest job results at https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade, upgrade jobs are passing after the fix.

I'm closing this bug.

Comment 8 errata-xmlrpc 2022-03-10 16:23:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

