Bug 1940594
| Summary: | [BUG] Incorrect manifest for keepalived static pod running on bare metal IPI worker nodes | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Andre Costa <andcosta> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Aleksandra Malykhin <amalykhi> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, bperkins, mkrejci |
| Version: | 4.6.z | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.6.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-29 06:26:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1941803 | | |
| Bug Blocks: | | | |
Description
Andre Costa
2021-03-18 17:02:15 UTC
I was able to make this work on my customer's cluster with the workaround below: https://access.redhat.com/solutions/5892851

This is very odd. The readinessProbe is incorrect, but it should also be harmless. A readiness probe will never trigger a restart of a container on its own, and Kubernetes doesn't route any traffic to these pods, so the ready status is irrelevant. I deployed a 4.6.19 cluster locally and my worker keepalived-monitors are fine despite the readiness probe failures. What version were they upgrading from? Maybe there is some sort of odd interaction happening on upgrade that is crashing the monitor. We can certainly backport the change to remove the readiness probe, but I'm concerned that isn't the underlying problem here. Adding a machine config to remove the readiness probe may have fixed the problem because it triggers a restart of the node.

Never mind, I see that's in the customer case. Sorry for the noise. It further suggests that the readiness probe is a red herring, though, since the same probe would have been in 4.6.16. I'll see if I can reproduce it by doing that upgrade.

Okay, after reading the case more carefully, I think I see what is happening here. The "1/2" shown in the oc get pods output is readiness, not whether the containers are running. In this case the containers are all running correctly, but because of the incorrect readiness probe the monitor will never show as ready. This should not cause any functional issues, and it doesn't sound like it has. If I've misunderstood anything please let me know ASAP, but otherwise I will go ahead and backport the readiness probe removal to avoid similar confusion in the future.

Contrary to what I said earlier, it doesn't appear there is anything else needed here.
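The readiness-versus-liveness distinction discussed above can be illustrated with standard Kubernetes pod spec fields. This is a minimal sketch for illustration only, not the actual keepalived manifest shipped by the MCO; the pod name and image below are placeholders:

```yaml
# Sketch of probe semantics, assuming a generic static pod (hypothetical names).
# A failing readinessProbe only flips the container's READY status (the "1/2"
# column in `oc get pods`); the kubelet never restarts a container because of
# it. Only a failing livenessProbe triggers a container restart.
apiVersion: v1
kind: Pod
metadata:
  name: keepalived-example            # hypothetical name for illustration
spec:
  containers:
  - name: keepalived-monitor
    image: example.invalid/keepalived-monitor:latest  # placeholder image
    readinessProbe:                   # affects the READY column only
      exec:
        command: ["/bin/true"]
    # No livenessProbe is defined, so a probe failure can never restart
    # this container; removing the readinessProbe only removes the
    # misleading "not ready" status.
```

This is why removing the probe is a cosmetic fix: the containers were already Running, and nothing routes traffic to these pods based on readiness.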
Verified on the OCP build: Cluster version is 4.6.0-0.nightly-2021-06-19-071833

```
[kni@provisionhost-0-0 ~]$ oc get pods -o wide -n openshift-kni-infra | grep keepalived-worker
keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::91   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>
keepalived-worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   2/2   Running   1   14h   fd2e:6f44:5dd8::64   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>   <none>

[kni@provisionhost-0-0 ~]$ oc get events -n openshift-kni-infra
```

No "Readiness probe failed" warnings found.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.36 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2498