Bug 1978797

Summary: external gateway pod deletes may not clean up ECMP routes
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.6
Target Release: 4.9.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Tim Rozet <trozet>
Assignee: Tim Rozet <trozet>
QA Contact: Ross Brattain <rbrattai>
CC: aconstan, akaris, bhershbe, dblack, kholtz, surya, zzhao
Last Closed: 2021-10-18 17:38:01 UTC
Bug Blocks: 2004269

Description Tim Rozet 2021-07-02 18:39:38 UTC
Description of problem:
There are a couple of paths where deleting a pod acting as an external gateway will not clean up routes in the served namespaces:

1) If a host-networked pod acting as an external gateway pod is deleted, the routes will not be removed for pods in the served external gateway namespaces.

2) If a pod is acting as an external gateway pod, and it is removed while ovnkube-master is restarting, or the ovnkube-master cache has somehow become invalid, then routes will not be removed for pods in served external gateway namespaces.

Comment 1 Tim Rozet 2021-07-02 21:48:09 UTC
3) More of a corner case: an exgw pod fails to be added, but its route still gets programmed into OVN. If the pod is then updated with a change to its exgw annotations, the new annotation will be added as a route in OVN, but the old route will not be removed.

4) ovnkube-master is restarted, then the exgw pod is deleted. The stale ECMP routes are not removed. This is because of a bug where OVN complains about "duplicate nexthop" while the cache is being recreated at ovnkube-master startup. We had a workaround to handle this, but it was checking for the wrong message.

5) ovnkube-master is restarted, and while it is down, the exgw pod is deleted. This is the hardest case to fix, because we get no event. We need to come up with a sync method for this.

I've posted a PR that will fix cases 1-4: https://github.com/ovn-org/ovn-kubernetes/pull/2302
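
To make case 3 concrete: the cleanup amounts to a set difference between the next hops already programmed and the next hops named by the updated annotation. The following is a minimal sketch only, not the code from the PR. The function name staleNextHops is hypothetical, and it assumes the old next hops are read back from OVN rather than from a cache entry that may never have been created. The third address in main is made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// staleNextHops returns the next hops present in oldHops but absent from
// newHops, i.e. the routes that should be deleted after an annotation change.
// Hypothetical helper for illustration; not taken from ovn-kubernetes.
func staleNextHops(oldHops, newHops []string) []string {
	keep := make(map[string]struct{}, len(newHops))
	for _, h := range newHops {
		keep[h] = struct{}{}
	}
	var stale []string
	for _, h := range oldHops {
		if _, ok := keep[h]; !ok {
			stale = append(stale, h)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}

func main() {
	// Next hops assumed to be programmed in OVN before the update.
	oldHops := []string{"fd2e:6f44:5dd8::8a", "fd2e:6f44:5dd8::8f"}
	// Next hops named by the updated annotation (::90 is illustrative).
	newHops := []string{"fd2e:6f44:5dd8::8f", "fd2e:6f44:5dd8::90"}
	fmt.Println("routes to remove for next hops:", staleNextHops(oldHops, newHops))
}
```

Computing the difference from what is actually in OVN, instead of from cached state, is what would keep a failed add from hiding the old route.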

Comment 2 Tim Rozet 2021-07-02 21:50:53 UTC
Note: case 5 is really the same as case 2. I found there were two issues causing it to occur, and I have fixed one of them.

Comment 3 Andreas Karis 2021-07-03 07:34:47 UTC
*** Bug 1974430 has been marked as a duplicate of this bug. ***

Comment 4 Tim Rozet 2021-08-06 21:52:36 UTC
Updated https://github.com/ovn-org/ovn-kubernetes/pull/2348 to handle the final case. It is under review.
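
For the final case (the exgw pod is deleted while ovnkube-master is down, so no delete event arrives), one possible shape for a startup sync is to compare the routes found in OVN against the gateway IPs of pods that still exist. This is a rough sketch under stated assumptions, not the implementation in the PR. The Route type and staleRoutes function are hypothetical, and the input is assumed to contain only ECMP routes that were created for external gateways.

```go
package main

import "fmt"

// Route is a hypothetical, simplified view of a logical router static route.
type Route struct {
	Router  string // gateway router, e.g. GR_master-0-1
	Prefix  string // source pod IP prefix the route applies to
	NextHop string // external gateway IP
}

// staleRoutes returns the programmed routes whose next hop does not belong
// to any gateway pod that currently exists.
func staleRoutes(programmed []Route, liveGatewayIPs []string) []Route {
	live := make(map[string]struct{}, len(liveGatewayIPs))
	for _, ip := range liveGatewayIPs {
		live[ip] = struct{}{}
	}
	var stale []Route
	for _, r := range programmed {
		if _, ok := live[r.NextHop]; !ok {
			stale = append(stale, r)
		}
	}
	return stale
}

func main() {
	programmed := []Route{
		{Router: "GR_master-0-1", Prefix: "fd01:0:0:1::2b/128", NextHop: "fd2e:6f44:5dd8::8a"},
		{Router: "GR_master-0-1", Prefix: "fd01:0:0:1::2b/128", NextHop: "fd2e:6f44:5dd8::8f"},
	}
	// In this example only ::8f still belongs to a running gateway pod.
	for _, r := range staleRoutes(programmed, []string{"fd2e:6f44:5dd8::8f"}) {
		fmt.Printf("stale: %s %s via %s\n", r.Router, r.Prefix, r.NextHop)
	}
}
```

Running such a comparison once at startup would not depend on having observed the delete event, which is the property this case needs.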

Comment 5 Surya Seetharaman 2021-08-12 07:31:56 UTC
https://github.com/ovn-org/ovn-kubernetes/pull/2348 has been merged. We need to open a downstream merge PR to get this in.

Comment 7 Ross Brattain 2021-09-01 21:19:20 UTC
Verified in 4.9.0-0.nightly-2021-08-31-123131

I0901 20:49:39.933792       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:49:39.934041       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:49:59.881049       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 8 items received
I0901 20:50:06.108886       1 node_tracker.go:162] Processing possible switch / router updates for node master-0-2
I0901 20:50:06.115200       1 node_tracker.go:162] Processing possible switch / router updates for node master-0-0
I0901 20:50:10.829808       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:50:10.829878       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:50:10.839163       1 egressgw.go:54] External gateway pod: testpod1, detected for namespace(s) exgw
I0901 20:50:10.839310       1 egressgw.go:85] Adding routes for external gateway pod: testpod1, next hops: "fd2e:6f44:5dd8::8a,fd2e:6f44:5dd8::8f", namespace: exgw, bfd-enabled: false
I0901 20:50:10.849736       1 egressgw.go:176] Deleting routes for external gateway pod: testpod1, for namespace(s) exgw
2021-09-01T20:50:10.854Z|06283|unixctl|DBG|received request run["--if-exists","--policy=src-ip","--","lr-route-del","GR_master-0-1","fd01:0:0:1::2b/128","fd2e:6f44:5dd8::8a"], id=0
2021-09-01T20:50:10.855Z|06284|ovn_dbctl|INFO|Running command run --if-exists --policy=src-ip -- lr-route-del GR_master-0-1 fd01:0:0:1::2b/128 fd2e:6f44:5dd8::8a
2021-09-01T20:50:10.862Z|06285|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.865Z|06286|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=bfd","--","find","Logical_Router_Static_Route","output_port=rtoe-GR_master-0-1","nexthop=\"fd2e:6f44:5dd8::8a\"","bfd!=[]"], id=0
2021-09-01T20:50:10.865Z|06287|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=bfd -- find Logical_Router_Static_Route output_port=rtoe-GR_master-0-1 "nexthop=\"fd2e:6f44:5dd8::8a\"" bfd!=[]
2021-09-01T20:50:10.865Z|06288|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.869Z|06289|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=_uuid","--","find","BFD","logical_port=rtoe-GR_master-0-1","dst_ip=\"fd2e:6f44:5dd8::8a\""], id=0
2021-09-01T20:50:10.870Z|06290|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=_uuid -- find BFD logical_port=rtoe-GR_master-0-1 "dst_ip=\"fd2e:6f44:5dd8::8a\""
2021-09-01T20:50:10.870Z|06291|unixctl|DBG|replying with success, id=0: ""
I0901 20:50:10.870985       1 egressgw.go:548] Did not find bfd entry for rtoe-GR_master-0-1 fd2e:6f44:5dd8::8a
2021-09-01T20:50:10.875Z|06292|unixctl|DBG|received request run["--if-exists","--policy=src-ip","--","lr-route-del","GR_master-0-1","fd01:0:0:1::2b/128","fd2e:6f44:5dd8::8f"], id=0
2021-09-01T20:50:10.875Z|06293|ovn_dbctl|INFO|Running command run --if-exists --policy=src-ip -- lr-route-del GR_master-0-1 fd01:0:0:1::2b/128 fd2e:6f44:5dd8::8f
2021-09-01T20:50:10.879Z|06294|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.883Z|06295|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=bfd","--","find","Logical_Router_Static_Route","output_port=rtoe-GR_master-0-1","nexthop=\"fd2e:6f44:5dd8::8f\"","bfd!=[]"], id=0
2021-09-01T20:50:10.883Z|06296|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=bfd -- find Logical_Router_Static_Route output_port=rtoe-GR_master-0-1 "nexthop=\"fd2e:6f44:5dd8::8f\"" bfd!=[]
2021-09-01T20:50:10.883Z|06297|unixctl|DBG|replying with success, id=0: ""
2021-09-01T20:50:10.887Z|06298|unixctl|DBG|received request run["--format=csv","--data=bare","--no-headings","--columns=_uuid","--","find","BFD","logical_port=rtoe-GR_master-0-1","dst_ip=\"fd2e:6f44:5dd8::8f\""], id=0
2021-09-01T20:50:10.887Z|06299|ovn_dbctl|DBG|Running command run --format=csv --data=bare --no-headings --columns=_uuid -- find BFD logical_port=rtoe-GR_master-0-1 "dst_ip=\"fd2e:6f44:5dd8::8f\""
2021-09-01T20:50:10.887Z|06300|unixctl|DBG|replying with success, id=0: ""
I0901 20:50:10.887943       1 egressgw.go:548] Did not find bfd entry for rtoe-GR_master-0-1 fd2e:6f44:5dd8::8f
I0901 20:50:22.888957       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 14 items received
I0901 20:50:24.347027       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1beta1.EndpointSlice total 23 items received
W0901 20:50:24.348964       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
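
For readers following the log, each deleted ECMP route corresponds to one lr-route-del request per next hop. The sketch below only reproduces the argument order visible in the unixctl request lines above. The helper name routeDelArgs is hypothetical, and whether ovnkube-master assembles the arguments this way internally is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// routeDelArgs builds the argument list as it appears in the logged
// "received request run[...]" lines: options, "--", then the command.
// Hypothetical helper; it only mirrors the log, not ovn-kubernetes code.
func routeDelArgs(router, podPrefix, nextHop string) []string {
	return []string{
		"--if-exists", "--policy=src-ip", "--",
		"lr-route-del", router, podPrefix, nextHop,
	}
}

func main() {
	// Values copied from the verification log above.
	nextHops := []string{"fd2e:6f44:5dd8::8a", "fd2e:6f44:5dd8::8f"}
	for _, nh := range nextHops {
		args := routeDelArgs("GR_master-0-1", "fd01:0:0:1::2b/128", nh)
		fmt.Println(strings.Join(args, " "))
	}
}
```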

Comment 10 errata-xmlrpc 2021-10-18 17:38:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759