Description of problem:
OCP Bare Metal IPI has an incorrect static pod template file for keepalived running on non-master nodes. After an upgrade to 4.6.19 the customer noticed that only the keepalived container was running, while the keepalived-monitor container kept crashing with readinessProbe failures, and this was happening only on non-master nodes. After checking the nodes, machineConfig, and pods, I noticed the workers have the same readinessProbe as the masters, trying to curl /readyz on the node IP on port 6443, which on workers will fail and crash the container.

Version-Release number of selected component (if applicable):
OCP 4.6.z

How reproducible:
Every time

Steps to Reproduce:
1. Install OCP on Bare Metal using the IPI method

Actual results:
Worker keepalived static pods carry the same readinessProbe as the master pods, so the probe always fails on workers. Confirmed on openshift-vsphere-infra: the equivalent static pods running on workers there don't have this readinessProbe, which is how it should also be for openshift-kni-infra.

Expected results:
There should be separate template files for the keepalived pods that run on masters and on workers, as already seems to be the case for the vSphere IPI.

Additional info:
https://github.com/openshift/machine-config-operator/blob/release-4.6/templates/common/baremetal/files/baremetal-keepalived.yaml
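For reference, the probe described above has roughly the following shape inside the keepalived-monitor container of the static pod manifest. This is a hedged sketch only; the exact command, paths, and timings are illustrative and not copied from the linked baremetal-keepalived.yaml template:

```yaml
# Illustrative sketch, not the actual template contents.
readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # /readyz is served by the kube-apiserver, which listens on :6443
    # only on master nodes, so this check can never succeed on a worker.
    - curl -k -s https://<node-ip>:6443/readyz   # <node-ip> is a placeholder
  initialDelaySeconds: 10
  periodSeconds: 10
```

Because workers run no kube-apiserver, the curl fails on every probe interval, which is consistent with the per-worker readinessProbe failures described above.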
I was able to make this work for my customer with the workaround below: https://access.redhat.com/solutions/5892851
This is very odd. The readinessProbe is incorrect, but it should also be harmless. A readiness probe will never trigger a restart of a container on its own, and Kubernetes doesn't route any traffic to these pods, so the ready status is irrelevant. I deployed a 4.6.19 cluster locally and my worker keepalived-monitors are fine despite the readiness probe failures. What version were they upgrading from? Maybe there is some sort of odd interaction happening on upgrade that is crashing the monitor. We can certainly backport the change to remove the readiness probe, but I'm concerned that isn't the underlying problem here. Adding a machine config to remove the readiness probe may have fixed the problem only because applying it triggers a restart of the node.
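To make the probe semantics above concrete: in Kubernetes, only a livenessProbe failure causes the kubelet to restart a container; a failing readinessProbe merely marks the container NotReady. A minimal contrast sketch (generic example, not taken from the actual template):

```yaml
# Generic illustration of probe semantics; names and ports are made up.
containers:
- name: example
  livenessProbe:        # failure here -> kubelet restarts the container
    httpGet:
      path: /healthz
      port: 8080
  readinessProbe:       # failure here -> container marked NotReady only;
    httpGet:            # no restart, and for a static pod with no Service
      path: /readyz     # in front of it, no traffic is affected either
      port: 8080
```

Since the keepalived-monitor container has only a readinessProbe at issue, its probe failures alone should never crash or restart it.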
Never mind, I see that's in the customer case. Sorry for the noise. It further suggests that the readiness probe is a red herring though since the same probe would have been in 4.6.16. I'll see if I can reproduce it by doing that upgrade.
Okay, after reading the case more carefully I think I see what is happening here. The "1/2" shown in the oc get pods output is readiness, not whether the containers are running. In this case the containers are all running correctly, but because of the incorrect readiness probe the monitor will never show as ready. This should not cause any functional issues, and it doesn't sound like it did. If I've misunderstood anything please let me know ASAP, but otherwise I will go ahead and backport the readiness probe removal to avoid similar confusion in the future. Contrary to what I said earlier, it doesn't appear there is anything else needed here.
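To illustrate the "1/2" point: the READY column counts the containerStatuses entries with ready: true, independently of whether each container is running. A hypothetical .status fragment for one of these worker pods (not real cluster output) would look like:

```yaml
# Hypothetical pod .status fragment, for illustration only.
containerStatuses:
- name: keepalived
  ready: true            # no failing readiness probe
  state:
    running: {}
- name: keepalived-monitor
  ready: false           # readiness probe failing, but...
  state:
    running: {}          # ...the container is still running normally
```

oc would render this pod as 1/2 with STATUS Running, which matches what the case showed: a cosmetic readiness issue, not a crashing container.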
Verified on the OCP build:
Cluster version is 4.6.0-0.nightly-2021-06-19-071833

[kni@provisionhost-0-0 ~]$ oc get pods -o wide -n openshift-kni-infra | grep keepalived-worker
keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::91   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>
keepalived-worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::64   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>

[kni@provisionhost-0-0 ~]$ oc get events -n openshift-kni-infra
No "Readiness probe failed" warnings found
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.36 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2498