Bug 2009873
Summary: | Stale Logical Router Policies and Annotations for a given node | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andrew Stoycos <astoycos>
Component: | Networking | Assignee: | ffernand <ffernand>
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | akanevsk, anbhat, anusaxen, ashsharm, bnemec, dkiselev, ffernand, hshukla, huirwang, jfindysz, kpelc, mapandey, openshift-bugs-escalate, skharat, trozet, wking, zzhao
Version: | 4.6 | Keywords: | FastFix, Triaged
Target Milestone: | --- | Flags: | anusaxen: needinfo-
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 2022042 2022043 (view as bug list) | Environment: |
Last Closed: | 2022-03-10 16:16:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 2022042, 2027485 | |
Description
Andrew Stoycos, 2021-10-01 20:03:47 UTC
Something in the keepalived configuration causes unnecessary failovers. From the must-gather attached in comment 1, it seems a bit odd that the instances for virtual_router_id 219 are all configured as BACKUP with the same priority. That would undoubtedly cause a bit of unnecessary instability in the election process until a VRRP router becomes the master. Another source of instability is that the config keeps changing unicast_peer instead of being written once. Can that be improved as well?

See: https://gist.github.com/flavio-fernandes/37891111061b2264a105d72b2daf4a46

All nodes have the same exact state + priority for virtual_router_id 219:

2021-10-01T04:42:06.646698988Z time="2021-10-01T04:42:06Z" level=info msg="vrrp_instance ocp-cluster-edge33-0_INGRESS {"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" state BACKUP"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" interface br-ex"
2021-10-01T04:42:06.646708352Z time="2021-10-01T04:42:06Z" level=info msg=" virtual_router_id 219"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" priority 20"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" advert_int 1"
2021-10-01T04:42:06.646717707Z time="2021-10-01T04:42:06Z" level=info msg=" "

Will follow up on this issue in a separate bug.

(In reply to ffernand from comment #2)
> Something in the keepalived configuration causes unnecessary failovers.
>
> From the must-gather attached in comment 1, it seems a bit odd that the
> instances for virtual_router_id 219 are all configured as BACKUP with the
> same priority. That would undoubtedly cause a bit of unnecessary instability
> in the election process until a VRRP router becomes the master.

It does cause a bit of churn, but we don't want to have keepalived start up assuming it's the master. After initial deployment, in most cases there will already be a master when keepalived starts on a given node. Also, keepalived breaks priority ties using the node IP, so while the priority is set the same, in practice it actually isn't: the VIP will prefer whichever node has the highest IP address.

> Another source of instability is that the config keeps changing unicast_peer
> instead of being written once. Can that be improved as well?

The peer list is updated dynamically as nodes are added to or removed from the cluster. There's no way to write it once and never update it. I believe we already have logic to limit the number of updates, but some amount of churn is inevitable.

*** Bug 2018276 has been marked as a duplicate of this bug. ***
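For anyone triaging similar VRRP churn, a minimal sketch of how the rendered keepalived configuration and election activity could be inspected on a node. The config path is the stock keepalived location, and the infra namespace and pod naming are assumptions that differ by platform (for example openshift-kni-infra on bare metal, openshift-vsphere-infra on vSphere):

# Dump the keepalived config as rendered on the host (stock keepalived path assumed)
$ oc debug node/<node-name> -- chroot /host cat /etc/keepalived/keepalived.conf

# Look for VRRP state transitions in the keepalived static pod on that node
# (namespace and pod naming are assumptions; adjust for the platform in use)
$ oc -n openshift-vsphere-infra logs keepalived-<node-name> -c keepalived \
    | grep -iE 'Entering (MASTER|BACKUP) STATE'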
Verified with 4.10.0-0.ci-2021-11-11-033953 on IPI vSphere. ingressIP: 172.31.248.140

1. At first, the ingress VIP is located on worker node qe-huirwang11-xf8sn-worker-djj6h.

$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
    k8s.ovn.org/host-addresses: ["172.31.249.31"]

sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 017a67c2-deb6-44b9-b004-614f8e0399b8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004

2. Then rebooted node qe-huirwang11-xf8sn-worker-djj6h; the ingress VIP moved to qe-huirwang11-xf8sn-worker-tzzgv.

$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
    k8s.ovn.org/host-addresses: ["172.31.249.46"]
qe-huirwang11-xf8sn-master-1
    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
    k8s.ovn.org/host-addresses: ["172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.31"]

$ oc rsh -n openshift-ovn-kubernetes ovnkube-master-6tkdq
Defaulting container name to northd.
Use 'oc describe pod/ovnkube-master-6tkdq -n openshift-ovn-kubernetes' to see all of the containers in this pod.
sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 0782feb7-d211-4150-a7f1-c3ced1972faa
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-tzzgv\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-tzzgv */"
nexthop             : []
nexthops            : ["10.131.0.2"]
options             : {}
priority            : 1004
sh-4.4#
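To repeat this cross-check after any failover, compare the node whose k8s.ovn.org/host-addresses annotation carries the VIP with the node named in the priority-1004 reroute policy. A minimal sketch, reusing the VIP and the ovnkube-master pod name from this cluster (both values are environment-specific assumptions):

VIP=172.31.248.140

# Which node's annotation currently claims the VIP?
$ for i in $(oc get nodes -o name); do echo "$i"; oc describe "$i" | grep k8s.ovn.org/host-addresses; done | grep -B1 "$VIP"

# Which node does the OVN reroute policy point at? The /* ... */ comment in the
# match column should name the same node as the annotation above.
$ oc exec -n openshift-ovn-kubernetes ovnkube-master-6tkdq -c northd -- \
    ovn-nbctl find logical_router_policy priority=1004 | grep "$VIP"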
3. Rebooted the worker node hosting the ingress VIP in turn a couple more times; looks good. Rebooted all the nodes; looks good as well.

$ for i in $(oc get nodes -o wide |awk '{print $1}' |sed '1d'); do echo "$i" && oc describe node "$i" | grep k8s.ovn.org/host-addresses; done
qe-huirwang11-xf8sn-master-0
    k8s.ovn.org/host-addresses: ["172.31.248.139","172.31.249.46"]
qe-huirwang11-xf8sn-master-1
    k8s.ovn.org/host-addresses: ["172.31.249.62"]
qe-huirwang11-xf8sn-master-2
    k8s.ovn.org/host-addresses: ["172.31.249.97"]
qe-huirwang11-xf8sn-worker-djj6h
    k8s.ovn.org/host-addresses: ["172.31.248.140","172.31.249.12"]
qe-huirwang11-xf8sn-worker-f2sjx
    k8s.ovn.org/host-addresses: ["172.31.249.81"]
qe-huirwang11-xf8sn-worker-tzzgv
    k8s.ovn.org/host-addresses: ["172.31.249.31"]

sh-4.4# ovn-nbctl find logical_router_policy | grep -B 4 -A 5 172.31.248.140
_uuid               : 02815b33-7f64-431f-9553-1efba202bac8
action              : reroute
external_ids        : {}
match               : "inport == \"rtos-qe-huirwang11-xf8sn-worker-djj6h\" && ip4.dst == 172.31.248.140 /* qe-huirwang11-xf8sn-worker-djj6h */"
nexthop             : []
nexthops            : ["10.128.2.2"]
options             : {}
priority            : 1004
sh-4.4#

Following [1], we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

[1]: https://github.com/openshift/enhancements/tree/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle#impact-statement-request

One thing that might help for resolved versions is walking the 'Blocks' tree from this bug, which is more complicated than normal. A number of changes were associated with this bug, and most of those went back to 4.9 with bug 2022042, shipped in 4.9.9. One commit was left out and is making its way back in bug 2027485, about to ship in a coming 4.9.z. The main fix continued back to 4.8 in bug 2022043, shipped in 4.8.22, with the dangling commit getting back in bug 2027487, about to ship in a coming 4.8.z. Looks like nothing has made it back to 4.7 yet; perhaps it was never affected.

(In reply to W. Trevor King from comment #37)
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?

Any bare-metal cluster where a failover takes place. More specifically, this bug affects the node annotation related to the VIP addresses when the node that went down currently had the VIP configured.

> What is the impact? Is it serious enough to warrant blocking edges?

This issue would cause packets destined to the VIP address to go to the wrong node, due to the stale node annotation.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

In order to fix this, the node annotation would have to be repaired manually, or the VIP would have to be set back to the original node after it comes back online (see the sketch below).

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

It is not a regression. It has been broken since 4.7 (or earlier).
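For the manual annotation repair mentioned above, a hedged sketch for illustration only: the stale k8s.ovn.org/host-addresses annotation on the node that previously held the VIP would be overwritten so it no longer lists the VIP. The node name and address list below are placeholders taken from the verification output earlier, and since ovn-kubernetes owns this annotation, any manual edit is a stopgap that its controllers may later overwrite:

# Placeholder values: drop the VIP (172.31.248.140) from the stale annotation on the
# node that previously held it, keeping only the addresses actually present on that node.
$ oc annotate node qe-huirwang11-xf8sn-worker-djj6h --overwrite \
    k8s.ovn.org/host-addresses='["172.31.249.12"]'

# Check whether the priority-1004 reroute policies get reconciled afterwards
$ oc exec -n openshift-ovn-kubernetes ovnkube-master-6tkdq -c northd -- \
    ovn-nbctl find logical_router_policy priority=1004 | grep 172.31.248.140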
That sounds pretty serious. Looks like OVN support went GA in 4.6 [1], which is still in EUS until 2022-10-27 [2]. But checking with Flavio out of band, this bug impacted all of 4.6 as well, so it's still not going to regress when updating between any currently supported versions, and we wouldn't help keep anyone safer by blocking updates. I'm dropping UpgradeBlocker.

[1]: https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-ovn-kubernetes-ga
[2]: https://access.redhat.com/support/policy/updates/openshift#dates

Also rolling 'Version' back to 4.6, so the range of affected versions is more clear. 'Target Release' for this bug stays 4.10.0, since that's where the fixes we're tracking here landed. Backports have their own bugs in the Blocks tree, as described in comment 37.

Looks like all of the BZs it depends on are closed. Should we close this one?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.