Description of problem:
OCP Bare Metal IPI has an incorrect static pod template file for keepalived running on non-master nodes. After an upgrade to 4.6.19 the customer noticed that only the keepalived container was running, while the keepalived-monitor container kept crashing with readinessProbe failures, and this was happening only on non-master nodes. After checking the nodes, machineConfig, and pods, I noticed the workers have the same readinessProbe as the masters, trying to curl /readyz on the node IP on port 6443, which on workers will fail and crash the container.

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
Every time

Steps to Reproduce:
1. Install OCP on Bare Metal using the IPI method

Actual results:
Worker keepalived static pods carry the same readinessProbe as the master pods, so the probe always fails on workers. Confirmed on openshift-vsphere-infra: the equivalent static pods running on workers there don't have this readinessProbe, which is how it should also be for openshift-kni-infra.

Expected results:
There should be separate template files for the keepalived pods that run on masters and on workers, as already seems to be the case for the vSphere IPI.

Additional info:
https://github.com/openshift/machine-config-operator/blob/release-4.6/templates/common/baremetal/files/baremetal-keepalived.yaml
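For reference, the probe described above has roughly the following shape inside the keepalived-monitor container of the static pod manifest. This is a hedged sketch only; the exact command, paths, and timings are illustrative and not copied from the linked baremetal-keepalived.yaml template:

```yaml
# Illustrative sketch, not the actual template contents.
readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # /readyz is served by the kube-apiserver, which listens on :6443
    # only on master nodes, so this check can never succeed on a worker.
    - curl -k -s https://<node-ip>:6443/readyz   # <node-ip> is a placeholder
  initialDelaySeconds: 10
  periodSeconds: 10
```

Because workers run no kube-apiserver, the curl fails on every probe interval, which is consistent with the per-worker readinessProbe failures described above.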
I was able to make this work for my customer with the workaround below: https://access.redhat.com/solutions/5892851
This is very odd. The readinessProbe is incorrect, but it should also be harmless. A readiness probe will never trigger a restart of a container on its own, and Kubernetes doesn't route any traffic to these pods, so the ready status is irrelevant. I deployed a 4.6.19 cluster locally and my worker keepalived-monitors are fine despite the readiness probe failures. What version were they upgrading from? Maybe there is some sort of odd interaction happening on upgrade that is crashing the monitor. We can certainly backport the change to remove the readiness probe, but I'm concerned that isn't the underlying problem here. Adding a machine config to remove the readiness probe may have fixed the problem only because applying it triggers a restart of the node.
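To make the probe semantics above concrete: in Kubernetes, only a livenessProbe failure causes the kubelet to restart a container; a failing readinessProbe merely marks the container NotReady. A minimal contrast sketch (generic example, not taken from the actual template):

```yaml
# Generic illustration of probe semantics; names and ports are made up.
containers:
- name: example
  livenessProbe:        # failure here -> kubelet restarts the container
    httpGet:
      path: /healthz
      port: 8080
  readinessProbe:       # failure here -> container marked NotReady only;
    httpGet:            # no restart, and for a static pod with no Service
      path: /readyz     # in front of it, no traffic is affected either
      port: 8080
```

Since the keepalived-monitor container has only a readinessProbe at issue, its probe failures alone should never crash or restart it.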
Never mind, I see that's in the customer case. Sorry for the noise. It further suggests that the readiness probe is a red herring though since the same probe would have been in 4.6.16. I'll see if I can reproduce it by doing that upgrade.
Okay, after reading the case more carefully I think I see what is happening here. The "1/2" shown in the oc get pods output is readiness, not whether the containers are running. In this case the containers are all running correctly, but because of the incorrect readiness probe the monitor will never show as ready. This should not cause any functional issues, and it doesn't sound like it did. If I've misunderstood anything please let me know ASAP, but otherwise I will go ahead and backport the readiness probe removal to avoid similar confusion in the future. Contrary to what I said earlier, it doesn't appear there is anything else needed here.
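To illustrate the "1/2" point: the READY column counts the containerStatuses entries with ready: true, independently of whether each container is running. A hypothetical .status fragment for one of these worker pods (not real cluster output) would look like:

```yaml
# Hypothetical pod .status fragment, for illustration only.
containerStatuses:
- name: keepalived
  ready: true            # no failing readiness probe
  state:
    running: {}
- name: keepalived-monitor
  ready: false           # readiness probe failing, but...
  state:
    running: {}          # ...the container is still running normally
```

oc would render this pod as 1/2 with STATUS Running, which matches what the case showed: a cosmetic readiness issue, not a crashing container.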
Verified on the OCP build:
Cluster version is 4.6.0-0.nightly-2021-06-19-071833

[kni@provisionhost-0-0 ~]$ oc get pods -o wide -n openshift-kni-infra | grep keepalived-worker
keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::91   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>
keepalived-worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::64   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>

[kni@provisionhost-0-0 ~]$ oc get events -n openshift-kni-infra
No "Readiness probe failed" warnings found
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.36 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2498