Description of problem:
If the string representation of one node's IP address is a prefix of the string representation of another node's IP address, keepalived makes the wrong decision about which node is running ingress at the time. E.g., in this scenario:
nodeA is 10.12.11.10
nodeB is 10.12.11.11
nodeC is 10.12.11.101
nodeD is 10.12.11.102
nodeA's IP as a string is a prefix of nodeC's and nodeD's IPs. On my (failed) installation, keepalived decided that ingress was running on nodeA, but it was actually running on another node.

Version-Release number of MCO (Machine Config Operator) (if applicable): Current

Platform (AWS, VSphere, Metal, etc.): Any

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible:
I created a cluster with master node IPs x.x.x.10-.12 and worker node IPs x.x.x.100-.102.

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:
1. Create a cluster with master node IPs x.x.x.10-.12 and worker node IPs x.x.x.100-.102.

Actual results:
keepalived kept the ingress VIP on master node x.x.x.10, although the real ingress was on worker node x.x.x.100.

Expected results:
keepalived should keep the VIP on the node which runs ingress.

Additional info:
I proposed a PR: https://github.com/openshift/machine-config-operator/pull/3156
And created an issue: https://github.com/openshift/machine-config-operator/issues/3155
The cluster with the issue was destroyed because I needed to complete my task, but the problem will clearly recur in similar circumstances.

1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.
2.
If a must-gather is unavailable, please provide the output of:
$ oc get co machine-config -o yaml
$ oc get mcp (and oc describe mcp/${degraded_pool} if pools are degraded)
$ oc get mc
$ oc get pod -n openshift-machine-config-operator
$ oc get node -o wide
3. If a node is not accessible via API, please provide console/journal/kubelet logs of the problematic node
4. Are there RHEL nodes on the cluster? If yes, please upload the whole Ansible logs or the Jenkins job
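The false match described above can be sketched with a plain grep, using the example addresses from the scenario (this is an illustration of the matching behavior, not the actual keepalived check script):

```shell
# A plain grep for nodeA's IP matches three of the four lines:
# 10.12.11.10 is a string prefix of 10.12.11.101 and 10.12.11.102,
# and the unescaped "." in the pattern matches any character.
printf '10.12.11.10\n10.12.11.11\n10.12.11.101\n10.12.11.102\n' \
  | grep '10.12.11.10'
```

Any health check keyed on such a grep can attribute another node's state to nodeA.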
This is a bug in https://github.com/openshift/machine-config-operator/blob/031234ceb6f641ade2aa7d4176000960080a9e09/templates/common/on-prem/files/keepalived-script-default-ingress.yaml#L6
We need to tighten up the grep so it doesn't match incorrectly.
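One way to tighten the match (an illustrative sketch, not necessarily the exact change made in PR 3156) is to treat the IP as a fixed string and require word boundaries:

```shell
# Fixed-string matching (-F) keeps the dots literal, and -w requires the
# IP to appear as a whole word, so 10.12.11.10 no longer matches
# 10.12.11.101 or 10.12.11.102.
printf '10.12.11.10\n10.12.11.101\n10.12.11.102\n' \
  | grep -w -F '10.12.11.10'
```

With -w, a match ending just before another digit (as in 10.12.11.101) is rejected, so only the exact address line survives.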
Verified on a cluster with 3 masters + 3 workers:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-21-040754   True        False         50s     Cluster version is 4.11.0-0.nightly-2022-06-21-040754

master node IPs x.x.x.11-13, worker node IPs x.x.x.110-112
ingressVIP: 192.168.123.10

[kni@provisionhost-0-0 ~]$ ssh core@192.168.123.10 -- hostname -s
worker-0-0

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                NODE                                              NOMINATED NODE   READINESS GATES
router-default-5d7fbdd474-sd72q   1/1     Running   0          24m   192.168.123.111   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>
router-default-5d7fbdd474-xwq6k   1/1     Running   0          24m   192.168.123.110   worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   <none>           <none>

[core@worker-0-0 ~]$ sudo cat /var/log/containers/keepalived-worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com_openshift-kni-infra_keepalived-3e26b364e065e94a5dfb5743a6722f5ea2b69780c6c266209fd170acd6c22083.log | grep chk_default_ingress
2022-06-21T13:06:13.567216113+00:00 stderr F Tue Jun 21 13:06:13 2022: Script `chk_default_ingress` now returning 1
2022-06-21T13:06:13.567252575+00:00 stderr F Tue Jun 21 13:06:13 2022: VRRP_Script(chk_default_ingress) failed (exited with status 1)
2022-06-21T13:06:23.514668223+00:00 stderr F Tue Jun 21 13:06:23 2022: Script `chk_default_ingress` now returning 0
2022-06-21T13:06:33.482149496+00:00 stderr F Tue Jun 21 13:06:33 2022: VRRP_Script(chk_default_ingress) succeeded
2022-06-21T13:06:53.674191003+00:00 stderr F Tue Jun 21 13:06:53 2022: VRRP_Script(chk_default_ingress) considered successful on reload

The ingressVIP moved to worker-0-0, where the default router is running.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069