Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1940594

Summary: [BUG] Incorrect manifest for keepalived static pod running on bare metal IPI worker nodes
Product: OpenShift Container Platform
Reporter: Andre Costa <andcosta>
Component: Machine Config Operator
Assignee: MCO Team <team-mco>
Machine Config Operator sub component: Machine Config Operator
QA Contact: Aleksandra Malykhin <amalykhi>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, bperkins, mkrejci
Version: 4.6.z
Keywords: Triaged
Target Milestone: ---
Target Release: 4.6.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-29 06:26:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1941803
Bug Blocks:

Description Andre Costa 2021-03-18 17:02:15 UTC
Description of problem:
OCP bare metal IPI ships an incorrect template file for the keepalived static pod on non-master nodes.
After an upgrade to 4.6.19, the customer noticed that only the keepalived container was running, while the keepalived-monitor container kept crashing with readinessProbe failures; this happened only on non-master nodes.
After checking the nodes, MachineConfig, and pods, I noticed the workers have the same readinessProbe as the masters, curling /readyz on the node IP on port 6443; on workers this fails and the container crashes.
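For reference, the offending probe in the 4.6 baremetal template looks roughly like the following. This is a reconstructed sketch for illustration, not a verbatim copy of the linked template; the curl of /readyz on the API port 6443 is the part that can only succeed on control-plane nodes:

```yaml
# Sketch (not verbatim) of the readinessProbe rendered into the
# keepalived-monitor container on all nodes by the baremetal template.
# Port 6443 is the kube-apiserver port, so the request only succeeds
# on masters; on workers the probe always fails.
readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - curl -k "https://<node IP>:6443/readyz"   # <node IP> stands in for the templated node address
  initialDelaySeconds: 10
```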

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
Every time

Steps to Reproduce:
1. Install OCP on Bare metal using the IPI method

Actual results:
On worker nodes the keepalived-monitor container fails its readinessProbe, because workers get the same probe as masters.

Expected results:
There should be different template files for the keepalived pod depending on whether it runs on masters or workers, as appears to be the case for vSphere IPI. Confirmed on openshift-vsphere-infra: the static pods running on workers there don't have the readinessProbe, which should also be the case for openshift-kni-infra.

Additional info:
https://github.com/openshift/machine-config-operator/blob/release-4.6/templates/common/baremetal/files/baremetal-keepalived.yaml

Comment 1 Andre Costa 2021-03-18 17:14:49 UTC
I was able to make this work for my customer with the workaround below:

https://access.redhat.com/solutions/5892851
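The linked solution is access-restricted, so purely to illustrate the mechanism it relies on: a MachineConfig targeting the worker pool can overwrite the rendered keepalived static pod manifest with one that omits the readinessProbe. The sketch below is an assumption for illustration only, not the contents of the KCS article; the manifest path, MachineConfig name, and the `<ENCODED_MANIFEST>` placeholder are all hypothetical:

```yaml
# Illustrative sketch only -- NOT the contents of the KCS article.
# Overwrites the worker keepalived static pod manifest (path assumed)
# with a base64-encoded copy that has the readinessProbe removed.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-keepalived-no-readinessprobe
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/kubernetes/manifests/keepalived.yaml
        mode: 0644
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,<ENCODED_MANIFEST>
```

Note that applying any MachineConfig makes the MCO reboot the nodes in the affected pool, which is worth keeping in mind when judging whether the probe removal or the reboot itself cleared the symptom.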

Comment 2 Ben Nemec 2021-03-19 19:50:39 UTC
This is very odd. The readinessProbe is incorrect, but it should also be harmless. A readiness probe will never trigger a restart of a container on its own, and kubernetes doesn't route any traffic to these pods so the ready status is irrelevant. I deployed a 4.6.19 cluster locally and my worker keepalived-monitors are fine despite the readiness probe failures.

What version were they upgrading from? Maybe there is some sort of odd interaction happening on upgrade that is crashing the monitor. We can certainly backport the change to remove the readiness probe, but I'm concerned that isn't the underlying problem here. Adding a machine config to remove the readiness probe may have fixed the problem because it triggers a restart of the node.

Comment 3 Ben Nemec 2021-03-19 19:53:30 UTC
Never mind, I see that's in the customer case. Sorry for the noise. It further suggests that the readiness probe is a red herring though since the same probe would have been in 4.6.16. I'll see if I can reproduce it by doing that upgrade.

Comment 5 Ben Nemec 2021-03-22 19:08:29 UTC
Okay, after reading the case more carefully I think I see what is happening here. The "1/2" shown in the oc get pods output is readiness, not whether the containers are running. In this case the containers are all running correctly, but because of the incorrect readiness probe the monitor will never show as ready. This should not cause any functional issues, and it doesn't sound like it was.

If I've misunderstood anything please let me know ASAP, but otherwise I will go ahead and backport the readiness probe removal to avoid similar confusion in the future. Contrary to what I said earlier it doesn't appear there is anything else needed here.

Comment 9 Aleksandra Malykhin 2021-06-22 05:00:05 UTC
Verified on the OCP build:
Cluster version is 4.6.0-0.nightly-2021-06-19-071833

[kni@provisionhost-0-0 ~]$ oc get pods -o wide -n openshift-kni-infra | grep keepalived-worker
keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com       2/2     Running   1          14h   fd2e:6f44:5dd8::91   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
keepalived-worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com       2/2     Running   1          14h   fd2e:6f44:5dd8::64   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>


[kni@provisionhost-0-0 ~]$ oc get events -n openshift-kni-infra

No "Readiness probe failed" warnings found

Comment 11 errata-xmlrpc 2021-06-29 06:26:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.36 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2498