Bug 2009873 - Stale Logical Router Policies and Annotations for a given node [NEEDINFO]
Summary: Stale Logical Router Policies and Annotations for a given node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
: 4.10.0
Assignee: ffernand
QA Contact: Anurag saxena
URL:
Whiteboard:
: 2018276 (view as bug list)
Depends On:
Blocks: 2022042 2027485
TreeView+ depends on / blocked
 
Reported: 2021-10-01 20:03 UTC by Andrew Stoycos
Modified: 2022-03-10 16:17 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2022042 2022043 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:16:28 UTC
Target Upstream Version:
anusaxen: needinfo-
akanevsk: needinfo? (ffernand)
hshukla: needinfo? (anusaxen)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 843 0 None Merged Bug 2009873: [4.10.0] Avoid stale annotations by re-subscribing to netlink 2021-12-01 15:26:45 UTC
Github ovn-org ovn-kubernetes pull 2521 0 None Merged Bug 1998515: ovn-kubernetes repeatedly updates host-addresses annotation on ipv6/dual-stack hosts 2021-10-05 02:05:25 UTC
Github ovn-org ovn-kubernetes pull 2534 0 None Merged Bug 2009873: Stale annotations for a given node 2021-10-25 18:00:44 UTC
Github ovn-org ovn-kubernetes pull 2614 0 None Merged Bug 2018276: Avoid stale annotations by re-subscribing to netlink 2021-11-03 17:54:37 UTC
Github ovn-org ovn-kubernetes pull 2657 0 None Merged Ensure node host address annotations are in sync with api server 2021-11-19 20:43:50 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:16:59 UTC

Description Andrew Stoycos 2021-10-01 20:03:47 UTC
Description of problem:

On baremetal deployments we occasionally see incorrect OVN Logical Router policies being created/ orphaned in a cluster on the ovn_cluster_router, specifically related to to ingress/api VIPs (more information about those VIPs can be found here https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md) 

For example Consider the ingress VIP 192.168.123.112

It lives on the node worker-0-0's br-ex interface 

```
Worker-0-0

MAC: 52:54:00:e8:d0:ba

5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:e8:d0:ba brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.146/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 2561sec preferred_lft 2561sec
    inet 192.168.123.112/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::b9a2:1089:d28b:61d8/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
```

And look at all the LRPs matching on traffic destined for that VIP 

[root@master-0-0 ~]# ovn-nbctl find logical_router_policy | grep -B 5 -A 10 192.168.123.112
_uuid               : b53f7be1-882c-4a3b-8aa9-10d9f728bfc7
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-2\" && ip4.dst == 192.168.123.112 /* master-0-2 */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004


_uuid               : 10162a62-6f53-4190-96ce-7cb81acf686f
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-worker-0-0\" && ip4.dst == 192.168.123.112 /* worker-0-0 */"
nexthop             : []
nexthops            : ["10.130.0.2"]
options             : {}
priority            : 1004

_uuid               : 2d3e2453-e937-4138-a417-cebaf8af094c
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-0\" && ip4.dst == 192.168.123.112 /* master-0-0 */"
nexthop             : []
nexthops            : ["10.128.0.2"]
options             : {}
priority            : 1004


_uuid               : 84bcbae9-ef66-4b75-a9c0-c82083295743
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-master-0-1\" && ip4.dst == 192.168.123.112 /* master-0-1 */"
nexthop             : []
nexthops            : ["10.129.0.2"]
options             : {}
priority            : 1004

```

These LRPs steer traffic destined for the VIP(192.168.123.112) out the ovn-k8s-mp0 interface on each relative node.  This is correct for node worker-0-0 (where the VIP actually lives)  

i.e 

```

_uuid               : 10162a62-6f53-4190-96ce-7cb81acf686f
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-worker-0-0\" && ip4.dst == 192.168.123.112 /* worker-0-0 */"
nexthop             : []
nexthops            : ["10.130.0.2"]
options             : {}
priority            : 1004
```

However for other nodes(where the VIP does not live) this is incorrect and results in the following scenario for Pod  -> VIP traffic 

Pod -> ovn-k8's-mp0 -> (Node where VIP lives)

However now the packet is never SNATed so it arrives at the VIP with srcIP==PodIP resulting in return traffic trying to go through the OVN architecture rather than back via the underlay. 

This is incorrect and needs to be fixed. 

Further evidence can be seen with stale node annotations... For example the VIP(192.168.123.112) does not live on master-0-0 but it still shows up under `host-addresses` in the annotation for that node 

```
[root@ocp-edge33 ~]# oc get node master-0-0 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    k8s.ovn.org/host-addresses: '["192.168.123.112","192.168.123.145"]'
    k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_master-0-0","mac-address":"52:54:00:71:3f:52","ip-addresses":["192.168.123.145/24"],"ip-address":"192.168.123.145/24","next-hops":["192.168.123.1"],"next-hop":"192.168.123.1","node-port-enable":"true","vlan-id":"0"}}'
    k8s.ovn.org/node-chassis-id: aeb1a889-ffd6-49d9-920b-1caefa7498ba
    k8s.ovn.org/node-mgmt-port-mac-address: c2:2a:a8:cd:03:dc
    k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"192.168.123.145/24"}'
    k8s.ovn.org/node-subnets: '{"default":"10.128.0.0/23"}'
    k8s.ovn.org/topology-version: "4"
    machine.openshift.io/machine: openshift-machine-api/ocp-cluster-edge33-0-pg74x-master-0
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-master-1bf3edea9ea1b3d6aa628f8cf5d28304
    machineconfiguration.openshift.io/desiredConfig: rendered-master-1bf3edea9ea1b3d6aa628f8cf5d28304
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
```


Version-Release number of selected component (if applicable):

4.9.0 

How reproducible:

Not 100% of the time, I think it has to do with when the VIPs are created/ how many times they move 

Steps to Reproduce:
1. This can be reproduced on any OCP baremetal, and Is most likely re-producible on upstream kind as well 


Expected results:
The Correct LRPs be programmed

Comment 2 ffernand 2021-10-05 01:39:34 UTC
Something with keep-alived configuration that causes unnecessary failovers.

From the must-gather attached in comment 1, it seems a bit odd that virtual_router_id 21 are all configured as backup and with the same priority.
That would undoubtedly cause a bit of unnecessary instability in the election process until a vrrp router becomes the master.
Another source of instability is that the config keeps changing unicast_peer instead of writing once. Can that be improved as well?

See:  https://gist.github.com/flavio-fernandes/37891111061b2264a105d72b2daf4a46

All nodes have the same exact state + priority for virtual_router_id 219.

2021-10-01T04:42:06.646698988Z time="2021-10-01T04:42:06Z" level=info msg="vrrp_instance ocp-cluster-edge33-0_INGRESS {"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg="    state BACKUP"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg="    interface br-ex"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg="    virtual_router_id 219"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg="    priority 20"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg="    advert_int 1"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg="    "

Will follow up on this issue at a separate bug.

Comment 4 Ben Nemec 2021-10-05 15:32:10 UTC
(In reply to ffernand from comment #2)
> Something with keep-alived configuration that causes unnecessary failovers.
> 
> From the must-gather attached in comment 1, it seems a bit odd that
> virtual_router_id 21 are all configured as backup and with the same priority.
> That would undoubtedly cause a bit of unnecessary instability in the
> election process until a vrrp router becomes the master.

It does cause a bit of churn, but we don't want to have keepalived start up assuming it's the master. After initial deployment, in most cases there will already be a master when keepalived starts on a given node.

Also, keepalived breaks priority ties using the node IP so while the priority is set the same, in practice it actually isn't. The VIP will prefer whichever node has the highest IP address.

> Another source of instability is that the config keeps changing unicast_peer
> instead of writing once. Can that be improved as well?

The peer list is updated dynamically as nodes are added/removed in the cluster. There's no way to write it once and never update. I believe we already have logic to limit the number of updates, but some amount of churn is inevitable.

Comment 6 Tim Rozet 2021-11-01 14:14:19 UTC
*** Bug 2018276 has been marked as a duplicate of this bug. ***

Comment 19 huirwang 2021-11-11 07:21:10 UTC
Verified it with 4.10.0-0.ci-2021-11-11-033953 on IPI vsphere.

 ingressIP: 172.31.248.140

1. At first, it is located on worker node qe-huirwang11-xf8sn-worker-djj6h.
$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.249.31"]

sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140

_uuid               : 017a67c2-deb6-44b9-b004-614f8e0399b8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004

2. Then reboot node qe-huirwang11-xf8sn-worker-djj6h,ingressVIP was moved to qe-huirwang11-xf8sn-worker-tzzgv
$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.31"]

$ oc rsh -n openshift-ovn-kubernetes ovnkube-master-6tkdq
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-6tkdq -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140

_uuid               : 0782feb7-d211-4150-a7f1-c3ced1972faa
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-tzzgv\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-tzzgv */"
nexthop             : []
nexthops            : ["10.131.0.2"]
options             : {}
priority            : 1004

sh-4.4#

3. Reboot the worker node which hosted the ingress VIP in turn for a couple of times, looks good. Reboot all the nodes, looks good as well.
$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
                    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.46"]
qe-huirwang11-xf8sn-master-1
                    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
                    k8s.ovn.org/host-addresses: ["172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
                    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
                    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
                    k8s.ovn.org/host-addresses: ["172.31.249.31"]


sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140

_uuid               : 02815b33-7f64-431f-9553-1efba202bac8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004

sh-4.4#

Comment 37 W. Trevor King 2021-12-10 19:39:42 UTC
Following [1], we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

One thing that might help for resolved versions, is waking the 'Blocks' tree from this bug, which is more complicated than normal.  A number of changes were associated this bug, and most of those went back to 4.9 with bug 2022042, shipped in 4.9.9.  One commit was left out, and is making its way back in bug 2027485, about to ship in a coming 4.9.z.  The main fix continued back to 4.8 in bug 2022043, shipped in 4.8.22, with the dangling commit getting back in bug 2027487, about to ship in a coming 4.8.z.  Looks like nothing has made it back to 4.7 yet; perhaps it was never affected.

[1]: https://github.com/openshift/enhancements/tree/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle#impact-statement-request

Comment 38 ffernand 2021-12-10 19:49:33 UTC
(In reply to W. Trevor King from comment #37)
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?

Any bare-metal cluster where a failover takes place. More specifically,
this bug affects the node annotation related to the VIP addresses when
the node that went down currently had the VIP configured.

> What is the impact? Is it serious enough to warrant blocking edges?

This issue would cause packets destined to the VIP address to go to the
wrong node, due to the stale node annotation.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

In order to fix this, node annotation would have to be repaired manually,
or VIP should be set back to the original node after it came online.

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

It is not a regression. It has been broken since 4.7 (or earlier).

Comment 39 W. Trevor King 2021-12-10 20:04:00 UTC
That sounds pretty serious.  Looks like OVN support went GA in 4.6 [1], which is still in EUS until 2022-10-27 [2].  But checking with Flavio out of band, this bug impacted all of 4.6 as well, so it's still not going to regress when updating between any currently supported versions, and we wouldn't help keep anyone safer by blocking updates.  I'm dropping UpgradeBlocker.

[1]: https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-ovn-kubernetes-ga
[2]: https://access.redhat.com/support/policy/updates/openshift#dates

Comment 40 W. Trevor King 2021-12-10 20:07:04 UTC
Also rolling 'Version' back to 4.6, so the range of affected versions is more clear.  'Target Release' for this bug stays 4.10.0, since that's where the fixes we're tracking here landed.  Backports have their own bugs in the Blocks tree, as described in comment 37.

Comment 42 Arkady Kanevsky 2022-02-04 20:45:25 UTC
Looks like all of the BZs it depends on are closed.
Should we close this one?

Comment 45 errata-xmlrpc 2022-03-10 16:16:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056


Note You need to log in before you can comment on or make changes to this bug.