Description of problem:

On baremetal deployments we occasionally see incorrect OVN logical router policies being created/orphaned on the ovn_cluster_router, specifically related to the ingress/api VIPs (more information about those VIPs can be found here: https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md).

For example, consider the ingress VIP 192.168.123.112. It lives on node worker-0-0's br-ex interface:

```
Worker-0-0 MAC: 52:54:00:e8:d0:ba

5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:e8:d0:ba brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.146/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 2561sec preferred_lft 2561sec
    inet 192.168.123.112/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::b9a2:1089:d28b:61d8/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```

Now look at all the LRPs matching on traffic destined for that VIP:

```
[root@master-0-0 ~]# ovn-nbctl find logical_router_policy | grep -B 5 -A 10 192.168.123.112
_uuid               : b53f7be1-882c-4a3b-8aa9-10d9f728bfc7
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-2\" && ip4.dst == 192.168.123.112 /* master-0-2 */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004

_uuid               : 10162a62-6f53-4190-96ce-7cb81acf686f
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-worker-0-0\" && ip4.dst == 192.168.123.112 /* worker-0-0 */"
nexthop             : []
nexthops            : ["10.130.0.2"]
options             : {}
priority            : 1004

_uuid               : 2d3e2453-e937-4138-a417-cebaf8af094c
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-0\" && ip4.dst == 192.168.123.112 /* master-0-0 */"
nexthop             : []
nexthops            : ["10.128.0.2"]
options             : {}
priority            : 1004

_uuid               : 84bcbae9-ef66-4b75-a9c0-c82083295743
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-1\" && ip4.dst == 192.168.123.112 /* master-0-1 */"
nexthop             : []
nexthops            : ["10.129.0.2"]
options             : {}
priority            : 1004
```

These LRPs steer traffic destined for the VIP (192.168.123.112) out the ovn-k8s-mp0 interface on each respective node. This is correct for node worker-0-0, where the VIP actually lives, i.e.

```
_uuid               : 10162a62-6f53-4190-96ce-7cb81acf686f
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-worker-0-0\" && ip4.dst == 192.168.123.112 /* worker-0-0 */"
nexthop             : []
nexthops            : ["10.130.0.2"]
options             : {}
priority            : 1004
```

However, for the other nodes (where the VIP does not live) this is incorrect and results in the following path for Pod -> VIP traffic:

Pod -> ovn-k8s-mp0 -> (node where the VIP lives)

The packet is never SNATed, so it arrives at the VIP with srcIP == PodIP, and the return traffic then tries to go back through the OVN architecture rather than via the underlay. This is incorrect and needs to be fixed.
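A quick way to spot this state is to list every reroute policy for the VIP and compare it against the node that actually carries the address on br-ex. This is a sketch using the VIP from this report, run from an ovnkube-master pod as above:

```
# Sketch (VIP from this report): list every logical router policy matching
# the ingress VIP. In a healthy cluster only the node that carries the VIP
# on br-ex should show up; any other "rtos-<node>" inport is one of the
# orphaned LRPs described above.
VIP=192.168.123.112
ovn-nbctl find logical_router_policy | grep -B 5 -A 10 "$VIP"
```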
Further evidence can be seen in stale node annotations. For example, the VIP (192.168.123.112) does not live on master-0-0, but it still shows up under `host-addresses` in the annotation for that node:

```
[root@ocp-edge33 ~]# oc get node master-0-0 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/host-addresses: '["192.168.123.112","192.168.123.145"]'
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_master-0-0","mac-address":"52:54:00:71:3f:52","ip-addresses":["192.168.123.145/24"],"ip-address":"192.168.123.145/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: aeb1a889-ffd6-49d9-920b-1caefa7498ba
    k8s.ovn.org/node-mgmt-port-mac-address: c2:2a:a8:cd:03:dc
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.145/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.128.0.0/23"}'
    k8s.ovn.org/topology-version: "4"
    machine.openshift.io/machine: openshift-machine-api/ocp-cluster-edge33-0-pg74x-master-0
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-master-1bf3edea9ea1b3d6aa628f8cf5d28304
    machineconfiguration.openshift.io/desiredConfig: rendered-master-1bf3edea9ea1b3d6aa628f8cf5d28304
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
```

Version-Release number of selected component (if applicable):
4.9.0

How reproducible:
Not 100% of the time; it appears to depend on when the VIPs are created and how many times they move between nodes.

Steps to Reproduce:
1. This can be reproduced on any OCP baremetal cluster, and is most likely reproducible on upstream kind as well.

Expected results:
The correct LRPs are programmed: only the node that currently hosts the VIP has a reroute policy for it.
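A quick way to check for the stale annotation described above is to compare a node's host-addresses annotation against what is actually configured on its br-ex interface. This is a sketch using the node name and VIP from this report:

```
# Sketch (node name and VIP from this report): compare the host-addresses
# annotation against the addresses actually present on the node's br-ex.
oc describe node master-0-0 | grep k8s.ovn.org/host-addresses
oc debug node/master-0-0 -- chroot /host ip -4 addr show br-ex
# If the annotation lists 192.168.123.112 but br-ex does not carry it, the
# annotation is stale and the orphaned LRPs shown above are the result.
```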
There is something in the keepalived configuration that causes unnecessary failovers.

From the must-gather attached in comment 1, it seems a bit odd that the vrrp instances for virtual_router_id 219 are all configured as BACKUP and with the same priority. That would undoubtedly cause a bit of unnecessary instability in the election process until a VRRP router becomes the master. Another source of instability is that the config keeps changing unicast_peer instead of writing it once. Can that be improved as well?

See: https://gist.github.com/flavio-fernandes/37891111061b2264a105d72b2daf4a46

All nodes have the exact same state + priority for virtual_router_id 219:

2021-10-01T04:42:06.646698988Z time="2021-10-01T04:42:06Z" level=info msg="vrrp_instance ocp-cluster-edge33-0_INGRESS {"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" state BACKUP"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" interface br-ex"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" virtual_router_id 219"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" priority 20"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" advert_int 1"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" "

Will follow up on this issue in a separate bug.
(In reply to ffernand from comment #2)
> From the must-gather attached in comment 1, it seems a bit odd that the
> vrrp instances for virtual_router_id 219 are all configured as BACKUP and
> with the same priority. That would undoubtedly cause a bit of unnecessary
> instability in the election process until a vrrp router becomes the master.

It does cause a bit of churn, but we don't want keepalived to start up assuming it's the master. After the initial deployment, in most cases there will already be a master when keepalived starts on a given node. Also, keepalived breaks priority ties using the node IP, so while the priority is configured the same everywhere, in practice it isn't: the VIP will prefer whichever node has the highest IP address.

> Another source of instability is that the config keeps changing unicast_peer
> instead of writing it once. Can that be improved as well?

The peer list is updated dynamically as nodes are added to or removed from the cluster, so there is no way to write it once and never update it. I believe we already have logic to limit the number of updates, but some amount of churn is inevitable.
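To see the rendered configuration keepalived is actually running with on a node (rather than the monitor's log output quoted above), something like the following sketch can be used; the on-host config path is an assumption based on typical baremetal deployments and may differ between releases:

```
# Sketch: dump the vrrp_instance stanzas from the rendered keepalived config
# on one node. The path /etc/keepalived/keepalived.conf is an assumption.
oc debug node/worker-0-0 -- chroot /host \
  grep -A 8 'vrrp_instance' /etc/keepalived/keepalived.conf
```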
*** Bug 2018276 has been marked as a duplicate of this bug. ***
Verified with 4.10.0-0.ci-2021-11-11-033953 on IPI vSphere.

ingressIP: 172.31.248.140

1. At first, the VIP is located on worker node qe-huirwang11-xf8sn-worker-djj6h.

$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.249.31"]

sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 017a67c2-deb6-44b9-b004-614f8e0399b8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004

2. Rebooted node qe-huirwang11-xf8sn-worker-djj6h; the ingress VIP moved to qe-huirwang11-xf8sn-worker-tzzgv.

$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.31"]

$ oc rsh -n openshift-ovn-kubernetes ovnkube-master-6tkdq
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-6tkdq -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 0782feb7-d211-4150-a7f1-c3ced1972faa
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-tzzgv\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-tzzgv */"
nexthop             : []
nexthops            : ["10.131.0.2"]
options             : {}
priority            : 1004
sh-4.4#

3. Rebooted the worker node hosting the ingress VIP a couple more times; looks good. Rebooted all the nodes; the output below also looks good.
$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.249.31"]

sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 02815b33-7f64-431f-9553-1efba202bac8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004
sh-4.4#
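As an extra sanity check on the verification above, the reroute nexthop of the single remaining LRP should be the ovn-k8s-mp0 management-port address, i.e. the .2 of the hosting node's subnet. A sketch, using the node that holds the VIP in the output above:

```
# Sketch: read the hosting node's subnet annotation and compare it to the
# LRP nexthop. Node name taken from the output above; the exact subnet
# value mentioned in the comment is an assumption.
oc get node qe-huirwang11-xf8sn-worker-djj6h \
  -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-subnets}'; echo
# A node subnet of 10.128.2.0/23 would correspond to the nexthop 10.128.2.2
# seen in the policy above.
```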
Following [1], we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

One thing that might help for working out resolved versions is walking the 'Blocks' tree from this bug, which is more complicated than normal. A number of changes were associated with this bug, and most of those went back to 4.9 with bug 2022042, shipped in 4.9.9. One commit was left out and is making its way back in bug 2027485, about to ship in a coming 4.9.z. The main fix continued back to 4.8 in bug 2022043, shipped in 4.8.22, with the dangling commit getting back in bug 2027487, about to ship in a coming 4.8.z. Looks like nothing has made it back to 4.7 yet; perhaps it was never affected.

[1]: https://github.com/openshift/enhancements/tree/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle#impact-statement-request
(In reply to W. Trevor King from comment #37)
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?

Any bare-metal cluster where a failover takes place. More specifically, this bug affects the node annotation related to the VIP addresses when the node that went down had the VIP configured at the time.

> What is the impact? Is it serious enough to warrant blocking edges?

This issue causes packets destined for the VIP address to go to the wrong node, due to the stale node annotation.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

In order to fix this, the node annotation would have to be repaired manually, or the VIP moved back to the original node after it comes back online.

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

It is not a regression. It has been broken since 4.7 (or earlier).
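For reference, a sketch of the manual annotation repair mentioned above, using the master-0-0 values from the description; ovnkube normally owns this annotation, so this should only be treated as a stop-gap workaround on affected versions:

```
# Sketch: drop the stale VIP (192.168.123.112) from master-0-0's
# host-addresses list, leaving only the node's own address. Values are
# illustrative and taken from the description of this bug.
oc annotate node master-0-0 --overwrite \
  k8s.ovn.org/host-addresses='["192.168.123.145"]'
```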
That sounds pretty serious. Looks like OVN support went GA in 4.6 [1], which is still in EUS until 2022-10-27 [2]. But checking with Flavio out of band, this bug impacted all of 4.6 as well, so it's still not going to regress when updating between any currently supported versions, and we wouldn't help keep anyone safer by blocking updates. I'm dropping UpgradeBlocker. [1]: https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-ovn-kubernetes-ga [2]: https://access.redhat.com/support/policy/updates/openshift#dates
Also rolling 'Version' back to 4.6, so the range of affected versions is more clear. 'Target Release' for this bug stays 4.10.0, since that's where the fixes we're tracking here landed. Backports have their own bugs in the Blocks tree, as described in comment 37.
Looks like all of the BZs it depends on are closed. Should we close this one?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days