Bug 2022536 - WebScale: duplicate ecmp next hop error caused by multiple of the same gateway IPs in ovnkube cache
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Duplicates: 2027854
Depends On:
Blocks: 2058683
Reported: 2021-11-11 22:28 UTC by Nabeel Cocker
Modified: 2022-04-11 14:56 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2058683
Environment:
Last Closed: 2022-03-10 16:26:48 UTC
Target Upstream Version:
Embargoed:


Attachments
ovnkube-master leader logs (1.95 MB, application/zip)
2021-11-11 22:28 UTC, Nabeel Cocker


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 895 | 0 | None | Merged | Bug 2022536: Validate ExGW Cache and fix cache keys | 2022-01-20 10:27:22 UTC
Github | ovn-org ovn-kubernetes pull 2722 | 0 | None | Merged | Multiple ExGW cache validation/improvements | 2022-01-20 10:27:22 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:27:10 UTC

Description Nabeel Cocker 2021-11-11 22:28:27 UTC
Created attachment 1841293
ovnkube-master leader logs

Description of problem:

Pods were stuck in the creating state or in a crash loop. Describing the pods showed duplicate ECMP next-hop errors, and pod setup eventually timed out waiting for OVS flows:

Warning  FailedCreatePodSandBox  3m53s (x367 over 145m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_splunk-operator-5f74449b7c-vd99x_thingspace_146197bb-2021-4858-a5b8-2b4689b33494_0(1705c5031d6e8319ea820b68fa8cb441d88d53899e4dd8c63320377adba59095): error adding pod thingspace_splunk-operator-5f74449b7c-vd99x to CNI network "multus-cni-network": [thingspace/splunk-operator-5f74449b7c-vd99x:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[thingspace/splunk-operator-5f74449b7c-vd99x 1705c5031d6e8319ea820b68fa8cb441d88d53899e4dd8c63320377adba59095] [thingspace/splunk-operator-5f74449b7c-vd99x 1705c5031d6e8319ea820b68fa8cb441d88d53899e4dd8c63320377adba59095] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
'
  Warning  ErrorAddingLogicalPort  3m45s (x148 over 149m)  controlplane  unable to add external gwStr src-ip route to GR router, stderr:"ovn-nbctl: duplicate nexthop for the same ECMP route\n", err:&{%!!(MISSING)g(string=OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist --policy=src-ip --ecmp-symmetric-reply lr-route-add GR_worker-148 192.168.9.236/32 198.19.16.25' failed: exit status 1)}w

ovnkube-node routes:




Version-Release number of selected component (if applicable):

OCP 4.7.24

sh-4.4# ovn-nbctl lr-route-list GR_worker-148
IPv4 Routes
            192.168.9.176               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.177               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.181               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.187                198.19.3.9 src-ip ecmp-symmetric-reply
            192.168.9.192               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.202               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.203               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.228               198.19.16.1 src-ip ecmp-symmetric-reply
            192.168.9.236              198.19.16.25 src-ip ecmp-symmetric-reply
           192.168.0.0/16                100.64.0.1 dst-ip
                0.0.0.0/0              10.75.69.129 dst-ip rtoe-GR_worker-148




I1111 20:33:40.812500       1 ovn.go:584] [146197bb-2021-4858-a5b8-2b4689b33494/thingspace/splunk-operator-5f74449b7c-vd99x] retry pod setup
I1111 20:33:40.812554       1 pods.go:338] LSP already exists for port: thingspace_splunk-operator-5f74449b7c-vd99x
I1111 20:33:40.823426       1 pods.go:302] [thingspace/splunk-operator-5f74449b7c-vd99x] addLogicalPort took 10.890013ms
I1111 20:33:40.823507       1 ovn.go:590] [146197bb-2021-4858-a5b8-2b4689b33494/thingspace/splunk-operator-5f74449b7c-vd99x] setup retry failed; will try again later
I1111 20:33:40.823618       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"thingspace", Name:"splunk-operator-5f74449b7c-vd99x", UID:"146197bb-2021-4858-a5b8-2b4689b33494", APIVersion:"v1", ResourceVersion:"815718841", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' unable to add external gwStr src-ip route to GR router, stderr:"ovn-nbctl: duplicate nexthop for the same ECMP route\n", err:&{%!!(MISSING)g(string=OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist --policy=src-ip --ecmp-symmetric-reply lr-route-add GR_worker-148 192.168.9.236/32 198.19.16.25' failed: exit status 1)}w
I1111 20:34:40.891795       1 ovn.go:584] [146197bb-2021-4858-a5b8-2b4689b33494/thingspace/splunk-operator-5f74449b7c-vd99x] retry pod setup
I1111 20:34:40.891976       1 pods.go:338] LSP already exists for port: thingspace_splunk-operator-5f74449b7c-vd99x
I1111 20:34:40.902447       1 pods.go:302] [thingspace/splunk-operator-5f74449b7c-vd99x] addLogicalPort took 10.634355ms
I1111 20:34:40.902513       1 ovn.go:590] [146197bb-2021-4858-a5b8-2b4689b33494/thingspace/splunk-operator-5f74449b7c-vd99x] setup retry failed; will try again later
I1111 20:34:40.902552       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"thingspace", Name:"splunk-operator-5f74449b7c-vd99x", UID:"146197bb-2021-4858-a5b8-2b4689b33494", APIVersion:"v1", ResourceVersion:"815718841", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' unable to add external gwStr src-ip route to GR router, stderr:"ovn-nbctl: duplicate nexthop for the same ECMP route\n", err:&{%!!(MISSING)g(string=OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist --policy=src-ip --ecmp-symmetric-reply lr-route-add GR_worker-148 192.168.9.236/32 198.19.16.25' failed: exit status 1)}w


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Surya Seetharaman 2021-12-03 10:11:29 UTC
*** Bug 2027854 has been marked as a duplicate of this bug. ***

Comment 9 arajapa 2022-02-02 16:16:48 UTC
Hi,

Any update on how one can verify this bz?

Thanks,
Alan

Comment 10 Tim Rozet 2022-02-02 20:05:08 UTC
There are a few things to verify with these fixes:

Scenario 1:
1. app pod is created in ns foo
2. exgwAPod is created in ns exgw1 (172.0.1.1), serving ns foo
3. exgwAPod is created in ns exgw2 (172.0.1.2), serving ns foo
4. Verify there is an ECMP route for both 172.0.1.1 and 172.0.1.2.
5. Delete the exgw pods, verify that both routes are removed.

Scenario 2:
1. app pod is created in ns foo
2. exgwAPod is created in ns exgw1 (172.0.1.1), serving ns foo
3. exgwAPod is created in ns exgw2 (172.0.1.1), serving ns foo (duplicate IP in annotation)
4. verify that only one ECMP route exists
5. verify there is a log present "unable to add pod: exgw2/exgwAPod as external gateway for namespace: foo"
6. delete both exgwAPods, ensure there is no ECMP route present afterwards
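
The behavior Scenario 2 checks can be sketched roughly as follows. This is an illustration only, not the code from the linked pull requests: the `gatewayCache` type and its `add`/`remove` methods are hypothetical names, and the sketch assumes the cache is keyed by served namespace and gateway IP so that a second pod advertising an IP that is already in use is rejected with the log message quoted in step 5.

```go
package main

import "fmt"

// gatewayCache is a hypothetical structure (not the ovn-kubernetes type):
// served namespace -> gateway IP -> owning pod ("namespace/name").
type gatewayCache map[string]map[string]string

// add records pod as the owner of ip for servedNS. It refuses an IP that a
// different pod already owns, so only one ECMP route per next hop is created.
func (c gatewayCache) add(servedNS, pod, ip string) error {
	owners := c[servedNS]
	if owners == nil {
		owners = make(map[string]string)
		c[servedNS] = owners
	}
	if owner, ok := owners[ip]; ok && owner != pod {
		return fmt.Errorf("unable to add pod: %s as external gateway for namespace: %s", pod, servedNS)
	}
	owners[ip] = pod
	return nil
}

// remove drops every IP owned by pod and returns those IPs so the caller
// can delete the matching routes.
func (c gatewayCache) remove(servedNS, pod string) []string {
	var removed []string
	for ip, owner := range c[servedNS] {
		if owner == pod {
			delete(c[servedNS], ip)
			removed = append(removed, ip)
		}
	}
	return removed
}

func main() {
	c := gatewayCache{}
	fmt.Println(c.add("foo", "exgw1/exgwAPod", "172.0.1.1")) // first owner is accepted
	fmt.Println(c.add("foo", "exgw2/exgwAPod", "172.0.1.1")) // duplicate IP is rejected
	c.remove("foo", "exgw2/exgwAPod")
	c.remove("foo", "exgw1/exgwAPod")
	fmt.Println(len(c["foo"])) // no gateway IPs left after both pods are deleted
}
```

In this sketch, deleting the rejected pod removes nothing because it never owned the IP, and deleting the first pod clears the only entry, which matches the "no ECMP route present afterwards" check in step 6.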

Scenario 3:
1. app pod is created in ns foo
2. annotate the ns foo namespace with the exgw annotation, but use duplicate IPs: k8s.ovn.org/routing-external-gws: 172.18.0.4,172.18.0.5,172.18.0.4
3. verify that there are only 2 ECMP routes present (no duplicates)
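
For Scenario 3, the expected outcome amounts to de-duplicating the comma-separated annotation value before any routes are added. A minimal sketch under that assumption is below; `parseGatewayIPs` is a hypothetical helper written for this illustration and is not taken from ovn-kubernetes.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseGatewayIPs is a hypothetical helper (not the ovn-kubernetes code).
// It splits a comma-separated annotation value, validates each entry as an
// IP address, and drops repeats while keeping first-seen order.
func parseGatewayIPs(annotation string) ([]string, error) {
	seen := make(map[string]struct{})
	var out []string
	for _, field := range strings.Split(annotation, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		ip := net.ParseIP(field)
		if ip == nil {
			return nil, fmt.Errorf("invalid gateway IP %q", field)
		}
		key := ip.String() // canonical form, so equal addresses compare equal
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, key)
	}
	return out, nil
}

func main() {
	ips, err := parseGatewayIPs("172.18.0.4,172.18.0.5,172.18.0.4")
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // prints [172.18.0.4 172.18.0.5]
}
```

With the annotation from step 2, the helper yields two unique next hops, so at most two ECMP routes would be requested.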

Comment 13 arajapa 2022-02-02 20:55:00 UTC
or is there something you need from us?

Comment 20 errata-xmlrpc 2022-03-10 16:26:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

