Bug 1978797 - external gateway pod deletes may not clean up ECMP routes
Summary: external gateway pod deletes may not clean up ECMP routes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Tim Rozet
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 1974430 (view as bug list)
Depends On:
Blocks: 2004269
Reported: 2021-07-02 18:39 UTC by Tim Rozet
Modified: 2021-10-18 17:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2004269 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:38:01 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 658 0 None None None 2021-08-23 13:24:26 UTC
Github ovn-org ovn-kubernetes pull 2302 0 None closed Fixes stale routes after external gateway pods delete/update 2021-08-23 13:24:27 UTC
Github ovn-org ovn-kubernetes pull 2348 0 None None None 2021-08-23 13:24:31 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:38:22 UTC

Description Tim Rozet 2021-07-02 18:39:38 UTC
Description of problem:
There are a couple of paths where deleting a pod that acts as an external gateway does not clean up routes in the served namespaces:

1) The pod is host-networked and acts as an external gateway pod. When it is deleted, the routes are not removed for pods in the served external gateway namespaces.

2) The pod acts as an external gateway pod and is removed while ovnkube-master is restarting, or after the ovnkube-master cache has somehow become invalid. The routes are then not removed for pods in the served external gateway namespaces.

Comment 1 Tim Rozet 2021-07-02 21:48:09 UTC
3) More of a corner case: an exgw pod fails to be added, but its route still gets programmed into OVN. If the pod is then updated with a change to its exgw annotations, the new annotation is added as a route in OVN, but the old route is not removed.

4) ovnkube-master is restarted, then the exgw pod is deleted. The stale ECMP routes are not removed. This is caused by a bug where OVN complains about "duplicate nexthop" while the cache is being recreated at ovnkube-master startup. We had a workaround for this, but it was checking for the wrong message.

5) ovnkube-master is restarted and, while it is down, the exgw pod is deleted. This is the hardest case to fix because we get no event. We need to come up with a sync method for this.

I've posted a PR that will fix cases 1-4: https://github.com/ovn-org/ovn-kubernetes/pull/2302

Comment 2 Tim Rozet 2021-07-02 21:50:53 UTC
Note: 5 is really the same as number 2. I just found that there were two issues causing it to occur, and I fixed one of them.

Comment 3 Andreas Karis 2021-07-03 07:34:47 UTC
*** Bug 1974430 has been marked as a duplicate of this bug. ***

Comment 4 Tim Rozet 2021-08-06 21:52:36 UTC
Updated https://github.com/ovn-org/ovn-kubernetes/pull/2348 to handle the final case. It is under review.
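For readers unfamiliar with why the final case needs a sync: when the gateway pod is deleted while ovnkube-master is down, no delete event arrives, so stale routes can only be found by comparing what is in OVN against what the live pods justify. The Go sketch below shows that comparison in its simplest form. It is not the implementation from the PR. The route type, the staleRoutes function, and the flat set of live next hops are all hypothetical simplifications.

```go
package main

import "fmt"

// route is a simplified stand-in for one ECMP static route on a gateway router.
type route struct {
	Router  string // e.g. "GR_master-0-1"
	Prefix  string // served pod IP, e.g. "fd01:0:0:1::2b/128"
	NextHop string // external gateway IP
}

// staleRoutes returns every route found in OVN whose next hop is not in
// liveNextHops, the set of gateway IPs still backed by an existing pod.
func staleRoutes(inOVN []route, liveNextHops map[string]bool) []route {
	var stale []route
	for _, r := range inOVN {
		if !liveNextHops[r.NextHop] {
			stale = append(stale, r)
		}
	}
	return stale
}

func main() {
	inOVN := []route{
		{"GR_master-0-1", "fd01:0:0:1::2b/128", "fd2e:6f44:5dd8::8a"},
		{"GR_master-0-1", "fd01:0:0:1::2b/128", "fd2e:6f44:5dd8::8f"},
	}
	// Assume only the ::8a gateway still has a pod behind it.
	live := map[string]bool{"fd2e:6f44:5dd8::8a": true}
	for _, r := range staleRoutes(inOVN, live) {
		fmt.Printf("stale: %s %s via %s\n", r.Router, r.Prefix, r.NextHop)
	}
}
```

Running such a pass once at startup, before event handling begins, would cover deletions that happened while the master was not watching.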

Comment 5 Surya Seetharaman 2021-08-12 07:31:56 UTC
https://github.com/ovn-org/ovn-kubernetes/pull/2348 has been merged. We need to open a downstream merge PR to get this in.

Comment 7 Ross Brattain 2021-09-01 21:19:20 UTC
Verified in 4.9.0-0.nightly-2021-08-31-123131

I0901 20:49:39.933792       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:49:39.934041       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:49:59.881049       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 8 items received
I0901 20:50:06.108886       1 node_tracker.go:162] Processing possible switch / router updates for node master-0-2
I0901 20:50:06.115200       1 node_tracker.go:162] Processing possible switch / router updates for node master-0-0
I0901 20:50:10.829808       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:50:10.829878       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:50:10.839163       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:50:10.839310       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:50:10.849736       1 egressgw.go:176] Deleting routes for external gateway pod: testpod1, for namespace(s) exgw
2021-09-01T20:50:10.854Z|06283|unixctl|DBG|received request run["--if-exists","--policy=src-ip","--","lr-route-del","GR_master-0-1","fd01:0:0:1::2b/128","fd2e:6f44:5dd8::8a"], id=0
2021-09-01T20:50:10.855Z|06284|ovn_dbctl|INFO|Running command run --if-exists --policy=src-ip -- lr-route-del GR_master-0-1 fd01:0:0:1::2b/128 fd2e:6f44:5dd8::8a
2021-09-01T20:50:10.862Z|06285|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.865Z|06286|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=bfd","--","find","Logical_Router_Static_Route","output_port=rtoe-GR_master-0-1","nexthop=\"fd2e:6f44:5dd8::8a\"","bfd!=[]"], id=0
2021-09-01T20:50:10.865Z|06287|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=bfd -- find Logical_Router_Static_Route output_port=rtoe-GR_master-0-1 "nexthop=\"fd2e:6f44:5dd8::8a\"" bfd!=[]
2021-09-01T20:50:10.865Z|06288|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.869Z|06289|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=_uuid","--","find","BFD","logical_port=rtoe-GR_master-0-1","dst_ip=\"fd2e:6f44:5dd8::8a\""], id=0
2021-09-01T20:50:10.870Z|06290|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=_uuid -- find BFD logical_port=rtoe-GR_master-0-1 "dst_ip=\"fd2e:6f44:5dd8::8a\""
2021-09-01T20:50:10.870Z|06291|unixctl|DBG|replying with success, id=0: ""
I0901 20:50:10.870985       1 egressgw.go:548] Did not find bfd entry for rtoe-GR_master-0-1 fd2e:6f44:5dd8::8a
2021-09-01T20:50:10.875Z|06292|unixctl|DBG|received request run["--if-exists","--policy=src-ip","--","lr-route-del","GR_master-0-1","fd01:0:0:1::2b/128","fd2e:6f44:5dd8::8f"], id=0
2021-09-01T20:50:10.875Z|06293|ovn_dbctl|INFO|Running command run --if-exists --policy=src-ip -- lr-route-del GR_master-0-1 fd01:0:0:1::2b/128 fd2e:6f44:5dd8::8f
2021-09-01T20:50:10.879Z|06294|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.883Z|06295|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=bfd","--","find","Logical_Router_Static_Route","output_port=rtoe-GR_master-0-1","nexthop=\"fd2e:6f44:5dd8::8f\"","bfd!=[]"], id=0
2021-09-01T20:50:10.883Z|06296|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=bfd -- find Logical_Router_Static_Route output_port=rtoe-GR_master-0-1 "nexthop=\"fd2e:6f44:5dd8::8f\"" bfd!=[]
2021-09-01T20:50:10.883Z|06297|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.887Z|06298|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=_uuid","--","find","BFD","logical_port=rtoe-GR_master-0-1","dst_ip=\"fd2e:6f44:5dd8::8f\""], id=0
2021-09-01T20:50:10.887Z|06299|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=_uuid -- find BFD logical_port=rtoe-GR_master-0-1 "dst_ip=\"fd2e:6f44:5dd8::8f\""
2021-09-01T20:50:10.887Z|06300|unixctl|DBG|replying with success, id=0: ""
I0901 20:50:10.887943       1 egressgw.go:548] Did not find bfd entry for rtoe-GR_master-0-1 fd2e:6f44:5dd8::8f
I0901 20:50:22.888957       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 14 items received
I0901 20:50:24.347027       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1beta1.EndpointSlice total 23 items received
W0901 20:50:24.348964       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
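The log above shows one lr-route-del call per next hop for the served pod IP. As a reading aid, the Go sketch below rebuilds those argument lists from a router name, a prefix, and the next hops. The function name routeDelArgs is hypothetical and this is not how ovn-kubernetes constructs the call. The flags and their order are copied from the logged commands.

```go
package main

import (
	"fmt"
	"strings"
)

// routeDelArgs returns one argument list per next hop, in the same shape as
// the "lr-route-del" requests in the log: --if-exists --policy=src-ip --
// lr-route-del <router> <prefix> <nexthop>.
func routeDelArgs(router, prefix string, nextHops []string) [][]string {
	cmds := make([][]string, 0, len(nextHops))
	for _, nh := range nextHops {
		cmds = append(cmds, []string{
			"--if-exists", "--policy=src-ip", "--",
			"lr-route-del", router, prefix, nh,
		})
	}
	return cmds
}

func main() {
	cmds := routeDelArgs("GR_master-0-1", "fd01:0:0:1::2b/128",
		[]string{"fd2e:6f44:5dd8::8a", "fd2e:6f44:5dd8::8f"})
	for _, c := range cmds {
		fmt.Println(strings.Join(c, " "))
	}
}
```

Each printed line corresponds to one "Running command run ..." entry in the log, minus the leading "run".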

Comment 10 errata-xmlrpc 2021-10-18 17:38:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

