periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade is failing frequently in CI, see:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade

Jobs are failing with:

event happened 52 times, something is wrong: ns/openshift-kube-controller-manager pod/kube-controller-manager-master-2 node/master-2 - reason/BackOff Back-off restarting failed container

Example run:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade/1454701490869374976

The kube-controller-manager-recovery-controller log on master-2 is scrolling this:

2021-10-31T08:09:40.342954618Z ++ ss -Htanop '(' sport = 9443 ')'
2021-10-31T08:09:40.347027579Z + '[' -n 'LISTEN 0 128 *:9443 *:*' ']'
2021-10-31T08:09:40.347082466Z + sleep 1

The other two control plane nodes report logs like this:

2021-10-31T07:27:04.940049797Z I1031 07:27:04.940010 1 leaderelection.go:248] attempting to acquire leader lease openshift-kube-controller-manager/cert-recovery-controller-lock...
2021-10-31T07:45:47.307456099Z E1031 07:45:47.307324 1 leaderelection.go:330] error retrieving resource lock openshift-kube-controller-manager/cert-recovery-controller-lock: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cert-recovery-controller-lock?timeout=1m47s": dial tcp [::1]:6443: connect: connection refused

The operator reports:

2021-10-31T08:09:43.561392468Z I1031 08:09:43.555955 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"31940718-fc55-4c5a-b4ad-23c95667c430", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-master-2 container \"kube-controller-manager-recovery-controller\" is terminated: Error: 9443 *:*' ']'\nStaticPodsDegraded: + sleep 1\nStaticPodsDegraded: ++ ss -Htanop '(' sport = 9443 ')'\nStaticPodsDegraded: + '[' -n 'LISTEN 0 128 *:9443 *:*' ']'\nStaticPodsDegraded: "
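The trace shows the recovery container stuck in its pre-start wait: it polls ss until nothing is listening on 9443 and only then starts its own server, so any long-lived listener on that port keeps the container from ever becoming ready, which is what produces the BackOff events above. A minimal bash sketch of that loop, reconstructed from the trace (hypothetical; the real script is generated into the kube-controller-manager static pod definition):

#!/bin/bash
# Reconstructed from the shell trace above (hypothetical sketch; the real
# entrypoint ships in the kube-controller-manager static pod manifest).
# Block until nothing is listening on 9443, then start our own listener.
while [ -n "$(ss -Htanop '(' sport = 9443 ')')" ]; do
  # Port 9443 is still owned by another process; retry in one second.
  sleep 1
done
# ...start the cert-recovery-controller, which serves on 9443...

While another process holds 9443 permanently, the loop never exits, and the container ends up in the back-off restart cycle reported by the event above.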
Arda has a fix that changes the webhook port number being used: https://github.com/openshift/cluster-baremetal-operator/pull/213. The webhook port was thought to be the source of the problem, but the PR's upgrade job still failed with the same error. So although that fix is necessary, there may be something else going on as well.
It does not look like all the references were fixed:

~/git/cluster-baremetal-operator $ grep -r 9443 .
./config/profiles/default/manager_webhook_patch.yaml: - containerPort: 9443
./config/webhook/service.yaml: targetPort: 9443
./manifests/0000_31_cluster-baremetal-operator_03_webhookservice.yaml: targetPort: 9443
./manifests/0000_31_cluster-baremetal-operator_06_deployment.yaml: - containerPort: 9443
./vendor/github.com/prometheus/procfs/fixtures.ttar:trans 706 944304 0
./vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go:var DefaultPort = 9443
./vendor/sigs.k8s.io/controller-runtime/pkg/webhook/server.go: // It will be defaulted to 9443 if unspecified.
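The two vendor/ hits are upstream content: the procfs fixtures.ttar line only matches because 9443 is a substring of 944304, and the controller-runtime lines are the library's default webhook port, which is why the operator serves on 9443 whenever no port is set explicitly. The four config/ and manifests/ references are the ones that would need to move together with the PR. A hedged sketch of sweeping them in one pass (the replacement port here is a made-up example value, not one chosen in this bug):

#!/bin/bash
# Hypothetical sweep of the remaining non-vendor 9443 references;
# vendor/ is upstream controller-runtime/procfs and must not be edited.
NEWPORT=9447  # example value only; the actual port was not decided here
grep -rl '9443' config/ manifests/ | xargs sed -i "s/\b9443\b/${NEWPORT}/g"

Afterwards, grep -r 9443 . | grep -v vendor should come back empty.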
BMO (baremetal-operator) binds to the masters' host IP addresses, while cluster-baremetal-operator binds to pod IPs in the 10.*.*.* range and has been running that way for a long time (maybe that is why it does not cause a port conflict). I don't think the configurations above need to change to fix this bug, but in the long term we should move to a different port number.
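One way to check that reasoning on a control-plane node is to list every 9443 listener together with its bound address and owning process, reusing the same ss filter the recovery controller runs (a hedged sketch; -l/-p add listener and process details and require root):

# List all 9443 listeners with bound address and owning process.
# A pod-IP bind (10.x.x.x:9443) can coexist with a host-IP bind; a
# wildcard bind (*:9443) is what trips the recovery controller's check.
sudo ss -Htlnp '(' sport = 9443 ')'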
According to the latest job history at https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-ipi-upgrade, the upgrade jobs are passing after the fix, so I'm closing this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056