Bug 2089775 - keepalived can keep ingress VIP on wrong node under certain circumstances
Summary: keepalived can keep ingress VIP on wrong node under certain circumstances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.10
Hardware: All
OS: All
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Ben Nemec
QA Contact: Silvia Serafini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-24 12:10 UTC by nsmirnov
Modified: 2022-08-10 11:13 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:13:40 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3156 0 None open [on-prem] Bug 2089775: Fix regexp in keepalived script chk_default_ingress.sh 2022-05-24 16:35:39 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:13:54 UTC

Description nsmirnov 2022-05-24 12:10:01 UTC
Description of problem:
If the string representation of one node's IP address is a substring of another node's IP address, keepalived can make the wrong decision about which node is currently running ingress.

E.g., in this scenario:
nodeA is 10.12.11.10
nodeB is 10.12.11.11
nodeC is 10.12.11.101
nodeD is 10.12.11.102
So, nodeA's IP as a string is a prefix of nodeC's and nodeD's IPs.
On my (failed) installation, keepalived decided that ingress was running on nodeA, but it was actually running on another node.
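The failure mode can be illustrated with a plain substring match versus an anchored match. This is a hypothetical sketch of the pitfall, not the actual chk_default_ingress.sh from the MCO templates; the node IPs are the example values from the scenario above:

```shell
#!/bin/sh
# Hypothetical illustration of the substring-match pitfall behind this bug.
# Assumed example data: IPs of the nodes actually running the ingress router.
ROUTER_IPS="10.12.11.101
10.12.11.102"
NODE_IP="10.12.11.10"   # this node's IP (nodeA in the scenario above)

# Buggy check: a plain substring match. "10.12.11.10" occurs inside
# "10.12.11.101", so nodeA wrongly concludes it hosts the ingress router.
if printf '%s\n' "$ROUTER_IPS" | grep -q "$NODE_IP"; then
    echo "buggy check: node claims ingress VIP"
fi

# Fixed check: escape the dots (regex wildcards) and anchor the pattern to
# the whole line, so only an exact IP match succeeds.
escaped=$(printf '%s' "$NODE_IP" | sed 's/\./\\./g')
if printf '%s\n' "$ROUTER_IPS" | grep -q "^${escaped}$"; then
    echo "fixed check: node claims ingress VIP"
else
    echo "fixed check: node does not claim ingress VIP"
fi
```

With these inputs the buggy check reports a false positive, while the anchored check correctly reports that the node does not host the ingress router.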

Version-Release number of MCO (Machine Config Operator) (if applicable):
Current

Platform (AWS, VSphere, Metal, etc.):
Any

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:
I created a cluster with master node IPs x.x.x.10-.12, and worker node IPs x.x.x.100-102.

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. I created a cluster with master node IPs x.x.x.10-.12, and worker node IPs x.x.x.100-102.

Actual results:
keepalived was keeping ingress VIP on master node x.x.x.10, although real ingress is on worker node x.x.x.100.

Expected results:
keepalived should keep the VIP on the node which runs ingress.

Additional info:
I proposed PR: https://github.com/openshift/machine-config-operator/pull/3156
And created issue: https://github.com/openshift/machine-config-operator/issues/3155
The cluster with the issue was destroyed because I needed to complete my task, but the problem will clearly recur under similar circumstances.

1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

2. If a must-gather is unavailable, please provide the output of:

$ oc get co machine-config -o yaml

$ oc get mcp (and oc describe mcp/${degraded_pool} if pools are degraded)

$ oc get mc

$ oc get pod -n openshift-machine-config-operator

$ oc get node -o wide

3. If a node is not accessible via API, please provide console/journal/kubelet logs of the problematic node

4. Are there RHEL nodes on the cluster? If yes, please upload the whole Ansible logs or Jenkins job

Comment 5 Silvia Serafini 2022-06-21 13:38:10 UTC
cluster 3 master + 3 workers

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         50s     Cluster version is 4.11.0-0.nightly-2022-06-21-040754

master node IPs x.x.x.11-13, and worker node IPs x.x.x.110-112

ingressVIP: 192.168.123.10


[kni@provisionhost-0-0 ~]$ ssh core@192.168.123.10 -- hostname -s
worker-0-0

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                NODE                                              NOMINATED NODE   READINESS GATES
router-default-5d7fbdd474-sd72q   1/1     Running   0          24m   192.168.123.111   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-default-5d7fbdd474-xwq6k   1/1     Running   0          24m   192.168.123.110   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>

[core@worker-0-0 ~]$ sudo cat /var/log/containers/keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com_openshift-kni-infra_keepalived-3e26b364e065e94a5dfb5743a6722f5ea2b69780c6c266209fd170acd6c22083.log | grep chk_default_ingress
2022-06-21T13:06:13.567216113+00:00 stderr F Tue Jun 21 13:06:13 2022: Script `chk_default_ingress` now returning 1
2022-06-21T13:06:13.567252575+00:00 stderr F Tue Jun 21 13:06:13 2022: VRRP_Script(chk_default_ingress) failed (exited with status 1)
2022-06-21T13:06:23.514668223+00:00 stderr F Tue Jun 21 13:06:23 2022: Script `chk_default_ingress` now returning 0
2022-06-21T13:06:33.482149496+00:00 stderr F Tue Jun 21 13:06:33 2022: VRRP_Script(chk_default_ingress) succeeded
2022-06-21T13:06:53.674191003+00:00 stderr F Tue Jun 21 13:06:53 2022: VRRP_Script(chk_default_ingress) considered successful on reload


The ingressVIP moved to worker-0-0, where the default router is running.
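For context, keepalived holds or releases the ingress VIP based on the exit status of the tracked check script seen in the log above. A minimal vrrp_script stanza of the kind involved here (path and interval/weight values are assumed for illustration, not taken from the cluster) looks like:

```
vrrp_script chk_default_ingress {
    script "/etc/keepalived/chk_default_ingress.sh"
    interval 10   # run the check every 10 seconds
    weight 50     # adjust VRRP priority when the check succeeds
}
```

When the script exits 0 the node's effective VRRP priority rises and it can claim the VIP; a non-zero exit (the "returning 1" lines above) lowers it, so a false match in the check script directly misplaces the VIP.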

Comment 6 errata-xmlrpc 2022-08-10 11:13:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

