Bug 2060549
| Field | Value |
| --- | --- |
| Summary | ErrorAddingLogicalPort: duplicate IP found in ECMP Pod route cache! |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Version | 4.10 |
| Target Release | 4.11.0 |
| Hardware | All |
| OS | Unspecified |
| Type | Bug |
| Severity | high |
| Priority | high |
| Status | CLOSED ERRATA |
| Reporter | Murali Krishnasamy <murali> |
| Assignee | Tim Rozet <trozet> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | rkhan, rsevilla, surya |
| Bug Blocks | 2061804, 2063863 |
| Last Closed | 2022-08-10 10:52:11 UTC |
Description
Murali Krishnasamy 2022-03-03 18:35:08 UTC
Comment 2 (Surya Seetharaman)

Spoke to Tim. As we can see from the logs below, we add the key to the cache, and the actual transact happens much later; if that transact fails, we retry the pod setup, try to add the pod to the cache again, and hit the duplicate pod IP error.

Tim might be able to explain it better, so assigning this to him. We had a few options for fixing this, but since this situation is unlikely to happen often in user environments, we went with the simplest fix: checking for the case where the same pod is being retried.

============================== Round 1 ==============================

The external gateway route add fails when the NB transaction times out during OVN leader re-election:

```
I0303 15:45:44.617308 1 pods.go:332] [serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52] creating logical port for pod on switch worker003-fc640
I0303 15:45:44.617409 1 kube.go:73] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.16.48/23"],"mac_address":"0a:58:0a:80:10:30","gateway_ips":["10.128.16.1"],"ip_address":"10.128.16.48/23","gateway_ip":"10.128.16.1"}}] on pod serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52
==
I0303 15:45:44.629398 1 egressgw.go:155] External gateway pod: dep-serving-33-4-serving-job-66f7948f5d-dzx52, detected for namespace(s) served-ns-33
I0303 15:45:44.629444 1 egressgw.go:203] Adding routes for external gateway pod: dep-serving-33-4-serving-job-66f7948f5d-dzx52, next hops: "192.168.222.33", namespace: served-ns-33, bfd-enabled: true
==
2022-03-03T15:45:44.630Z|00065|ovsdb_cs|INFO|ssl:192.168.216.12:9641: clustered database server is not cluster leader; trying another server
2022-03-03T15:45:44.630Z|00066|reconnect|INFO|ssl:192.168.216.12:9641: connection attempt timed out
2022-03-03T15:45:44.630Z|00067|reconnect|INFO|ssl:192.168.216.11:9641: connecting...
2022-03-03T15:45:44.634Z|00068|reconnect|INFO|ssl:192.168.216.11:9641: connected
====
I0303 15:45:44.732939 1 client.go:1053] "msg"="endpoint lost leader, reconnecting" "database"="OVN_Northbound" "endpoint"="ssl:192.168.216.12:9641" "sid"="d627d068-0e72-4397-80f4-22b922bee61e"
I0303 15:45:44.733110 1 client.go:1137] "msg"="connection lost, reconnecting" "database"="OVN_Northbound" "endpoint"="ssl:192.168.216.10:9641"
I0303 15:45:44.733660 1 transact.go:41] Configuring OVN: [{Op:insert Table:BFD Row:map[dst_ip:192.168.222.33 logical_port:rtoe-GR_worker023-fc640] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996875} {Op:insert Table:Logical_Router_Static_Route Row:map[bfd:{GoSet:[{GoUUID:u2596996875}]} ip_prefix:10.130.30.8/32 nexthop:192.168.222.33 options:{GoMap:map[ecmp_symmetric_reply:true]} output_port:{GoSet:[rtoe-GR_worker023-fc640]} policy:{GoSet:[src-ip]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996876} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:static_routes Mutator:insert Value:{GoSet:[{GoUUID:u2596996876}]}}] Timeout:<nil> Where:[where column _uuid == {37882d04-18f4-4b88-b87e-0d245145be44}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
===
2022-03-03T15:45:52.863Z|00069|ovsdb_cs|INFO|ssl:192.168.216.11:9641: clustered database server is not cluster leader; trying another server
2022-03-03T15:45:52.863Z|00070|reconnect|INFO|ssl:192.168.216.11:9641: connection attempt timed out
2022-03-03T15:45:52.863Z|00071|reconnect|INFO|ssl:192.168.216.10:9641: connecting...
2022-03-03T15:45:52.869Z|00072|reconnect|INFO|ssl:192.168.216.10:9641: connected
2022-03-03T15:45:52.870Z|00073|ovsdb_cs|INFO|ssl:192.168.216.10:9641: clustered database server is not cluster leader; trying another server
2022-03-03T15:45:52.870Z|00074|reconnect|INFO|ssl:192.168.216.10:9641: connection attempt timed out
2022-03-03T15:45:52.870Z|00075|reconnect|INFO|ssl:192.168.216.12:9641: connecting...
2022-03-03T15:45:52.874Z|00076|reconnect|INFO|ssl:192.168.216.12:9641: connected
I0303 15:46:04.734211 1 pods.go:322] [serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52] addLogicalPort took 20.116904396s
====
E0303 15:46:04.734238 1 ovn.go:648] failed to handle external GW check: unable to add src-ip route to GR router, err: error in transact with ops [{Op:insert Table:BFD Row:map[dst_ip:192.168.222.33 logical_port:rtoe-GR_worker023-fc640] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996875} {Op:insert Table:Logical_Router_Static_Route Row:map[bfd:{GoSet:[{GoUUID:u2596996875}]} ip_prefix:10.130.30.8/32 nexthop:192.168.222.33 options:{GoMap:map[ecmp_symmetric_reply:true]} output_port:{GoSet:[rtoe-GR_worker023-fc640]} policy:{GoSet:[src-ip]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996876} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:static_routes Mutator:insert Value:{GoSet:[{GoUUID:u2596996876}]}}] Timeout:<nil> Where:[where column _uuid == {37882d04-18f4-4b88-b87e-0d245145be44}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection
```

============================== Round 2 ==============================

The retry then finds the gateway IP already in the cache and fails with the duplicate-IP error:

```
I0303 15:46:04.734287 1 pods.go:332] [serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52] creating logical port for pod on switch worker003-fc640
I0303 15:46:04.734406 1 client.go:1248] "msg"="warning: unable to ensure cache consistency for reading" "database"="OVN_Northbound"
I0303 15:46:04.734408 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"serving-ns-33", Name:"dep-serving-33-4-serving-job-66f7948f5d-dzx52", UID:"1b25bb96-66f8-401d-b063-1c90eec18ad5", APIVersion:"v1", ResourceVersion:"5071427", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to handle external GW check: unable to add src-ip route to GR router, err: error in transact with ops [{Op:insert Table:BFD Row:map[dst_ip:192.168.222.33 logical_port:rtoe-GR_worker023-fc640] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996875} {Op:insert Table:Logical_Router_Static_Route Row:map[bfd:{GoSet:[{GoUUID:u2596996875}]} ip_prefix:10.130.30.8/32 nexthop:192.168.222.33 options:{GoMap:map[ecmp_symmetric_reply:true]} output_port:{GoSet:[rtoe-GR_worker023-fc640]} policy:{GoSet:[src-ip]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996876} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:static_routes Mutator:insert Value:{GoSet:[{GoUUID:u2596996876}]}}] Timeout:<nil> Where:[where column _uuid == {37882d04-18f4-4b88-b87e-0d245145be44}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection
=========
I0303 15:46:05.162958 1 egressgw.go:155] External gateway pod: dep-serving-33-4-serving-job-66f7948f5d-dzx52, detected for namespace(s) served-ns-33
I0303 15:46:05.162994 1 pods.go:322] [serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52] addLogicalPort took 428.709552ms
E0303 15:46:05.163003 1 ovn.go:648] failed to handle external GW check: unable to add pod: serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52 as external gateway for namespace: served-ns-33, error: duplicate IP found in ECMP Pod route cache! IP: "192.168.222.33", first pod: "dep-serving-33-4-serving-job-66f7948f5d-dzx52", second pod: "serving-ns-33_dep-serving-33-4-serving-job-66f7948f5d-dzx52"
I0303 15:46:05.163010 1 egressgw.go:155] External gateway pod: dep-serving-33-2-serving-job-dc6c6d4c8-9jnnr, detected for namespace(s) served-ns-33
I0303 15:46:05.163043 1 pods.go:332] [serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52] creating logical port for pod on switch worker003-fc640
I0303 15:46:05.163108 1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"serving-ns-33", Name:"dep-serving-33-4-serving-job-66f7948f5d-dzx52", UID:"1b25bb96-66f8-401d-b063-1c90eec18ad5", APIVersion:"v1", ResourceVersion:"5071434", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to handle external GW check: unable to add pod: serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52 as external gateway for namespace: served-ns-33, error: duplicate IP found in ECMP Pod route cache! IP: "192.168.222.33", first pod: "dep-serving-33-4-serving-job-66f7948f5d-dzx52", second pod: "serving-ns-33_dep-serving-33-4-serving-job-66f7948f5d-dzx52"
```
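To make the failure ordering concrete, here is a minimal, self-contained Go sketch of the sequence comment 2 describes. All names here (routeCache, addPodExternalGW, errTransact, withSamePodCheck) are hypothetical stand-ins, not the actual ovn-kubernetes code: the gateway IP is cached before the NB transaction commits, so when the transaction fails and pod setup is retried, the second attempt finds the IP already cached and reports a duplicate. The withSamePodCheck flag models the kind of same-pod retry check comment 2 names as the chosen fix (presumably related to the PR tested later in this bug).

```go
package main

import (
	"errors"
	"fmt"
)

// routeCache maps a next-hop gateway IP to the pod that claimed it.
type routeCache map[string]string

var errTransact = errors.New("context deadline exceeded: while awaiting reconnection")

// addPodExternalGW mimics the problematic ordering: the cache is updated
// first, and the OVN NB transaction only happens afterwards. transactOK
// simulates whether that transaction succeeds.
func addPodExternalGW(cache routeCache, pod, gwIP string, transactOK, withSamePodCheck bool) error {
	if owner, exists := cache[gwIP]; exists && !(withSamePodCheck && owner == pod) {
		return fmt.Errorf("duplicate IP found in ECMP Pod route cache! IP: %q, first pod: %q, second pod: %q",
			gwIP, owner, pod)
	}
	cache[gwIP] = pod // cached before the transaction commits
	if !transactOK {
		// The transaction fails (e.g. NB leader election during reconnect),
		// but the cache entry is left behind for the retry to trip over.
		return errTransact
	}
	return nil
}

func main() {
	pod := "serving-ns-33/dep-serving-33-4-serving-job-66f7948f5d-dzx52"
	gw := "192.168.222.33"

	buggy := routeCache{}
	fmt.Println("round 1:", addPodExternalGW(buggy, pod, gw, false, false)) // transact fails, IP stays cached
	fmt.Println("round 2:", addPodExternalGW(buggy, pod, gw, true, false))  // retry hits the duplicate-IP error

	fixed := routeCache{}
	fmt.Println("fixed, round 1:", addPodExternalGW(fixed, pod, gw, false, true))
	fmt.Println("fixed, round 2:", addPodExternalGW(fixed, pod, gw, true, true)) // same-pod retry is tolerated
}
```

Note that the two "pods" in the real error above are the same pod recorded under two different key formats ("dep-serving-33-4-serving-job-66f7948f5d-dzx52" vs. "serving-ns-33_dep-serving-33-4-serving-job-66f7948f5d-dzx52"), consistent with a retry of the same pod rather than a genuine IP conflict.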
---

(In reply to Surya Seetharaman from comment #2)
> Spoke to Tim. As we can see from the logs below, we add the key to the
> cache, and the actual transact happens much later; if that transact fails,
> we retry the pod setup, try to add the pod to the cache again, and hit the
> duplicate pod IP error.
>
> Tim might be able to explain it better, so assigning this to him. We had a
> few options for fixing this, but since this situation is unlikely to happen
> often in user environments, we went with the simplest fix: checking for the
> case where the same pod is being retried.

Wait, I take this back. My understanding is that if we fail for whatever reason during an addLogicalPort of an exgw serving pod, we'd hit this bug, i.e. any adds of routes/policies/BFD after that add-to-cache event. Maybe it's not a "rare" event :)

---

Tested this patch (https://github.com/openshift/ovn-kubernetes/pull/990) and it worked. Able to churn the ICNI2 workload on a 120-node environment without the addLogicalPort error.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069