Bug 2053922 - [OCP 4.8][OVN] pod interface: error while waiting on OVS.Interface.external-ids
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Tim Rozet
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-13 08:39 UTC by Aaron Park
Modified: 2022-08-10 10:50 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:49:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/README.md (last updated 2022-05-13 03:20:27 UTC)
Github openshift/ovn-kubernetes pull 966 (Merged): Bug 2048538: [DownstreamMerge] 2-22-22 (last updated 2022-02-25 16:07:30 UTC)
Github ovn-org/ovn-kubernetes pull 2821 (open): Fixes thread safety with LB healthcheck (last updated 2022-02-14 18:14:28 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:50:11 UTC)

Description Aaron Park 2022-02-13 08:39:24 UTC
Description of problem:


Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.8.13
Server Version: 4.8.21
Kubernetes Version: v1.21.5+6a39d04
Baremetal / UPI / IPv4 single

# master : 3 / worker : 45
$ omg get nodes | grep -v NAME | wc -l
48

How reproducible:
After migrating from OpenShift SDN to OVN-Kubernetes, some pods are not created at all, and other pods take a long time to be created.

[root@ipv4-svt-bastion ~]# oc get po -A -owide | grep d204-30
common-smf-1            smf-pfcpc-6bd4dfd666-k26x7                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-smf-1            smf-pfcpc-6bd4dfd666-svd7n                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-smf-2            smf-pfcpc-77f469db86-fgg4t                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-smf-2            smf-pfcpc-77f469db86-tdd2z                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-smf-2            smf-pfcpc-8596cfff55-85jmk                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-1            svc-upf-pfcpc-udp-b866b7465-4kp99          0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-1            upf-pfcpc-d49c477b7-5rmqd                  0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-1            upf-pfcpc-d49c477b7-7l2pk                  0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-2            svc-upf-pfcpc-udp-56f46c449b-7d6cg         0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-2            upf-pfcpc-5d7f669d48-56thw                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-2            upf-pfcpc-5d7f669d48-7jhgb                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
common-upf-2            upf-pfcpc-686c5fb57d-hkkzq                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-1       smf-pfcpc-55b9d84875-4dzmz                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-1       smf-pfcpc-55b9d84875-cgplf                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-1       smf-pfcpc-5878475f46-2tffk                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-2       smf-pfcpc-55b9d84875-c24fq                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-2       smf-pfcpc-55b9d84875-nrnvt                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-2       smf-pfcpc-566d749b97-gjzrc                 0/3     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-3       smf-pfcpc-55b9d84875-25868                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>
global-smf-auto-3       smf-pfcpc-55b9d84875-6b8gr                 0/2     ContainerCreating   0          16m     <none>      d204-30.core-svt.samsung.net      <none>           <none>

The engineering team quickly fixed the bug for this issue at the time, so Samsung upgraded to 4.8.12 and the problem was resolved.
* Samsung tested as below, and no error occurred:
200 Pods: Pass (within 1 minute)
300 Pods: Pass (within 2 minutes)
400 Pods: Pass (within 15 minutes)

But they are now running 4.8.21 and the same error is occurring again. There are about 2,000 pods.

I tried deleting the ovn-controller container and the ovnkube-node pod, but neither helped.

Steps to Reproduce:
1.
2.
3.

Actual results:

Pods remain in the ContainerCreating state.

Expected results:

Pods transition to the Running state.

Additional info:

Comment 2 Tim Rozet 2022-02-14 16:01:41 UTC
Noticed a crash in the sosreport:
2021-12-30T14:45:03.315231013+09:00 stderr F I1230 05:45:03.315219  142946 port_claim.go:40] Opening socket for service: global-amfmme-auto-2/amf-anintf-sctp, port: 48819 and protocol SCTP
2021-12-30T14:45:03.315231013+09:00 stderr P fatal error: concurrent map read and map write
2021-12-30T14:45:03.315241233+09:00 stderr F
2021-12-30T14:45:03.318508461+09:00 stderr F
2021-12-30T14:45:03.318508461+09:00 stderr F goroutine 322 [running]:
2021-12-30T14:45:03.318508461+09:00 stderr F runtime.throw(0x1b2bac6, 0x21)
2021-12-30T14:45:03.318508461+09:00 stderr F    /usr/lib/golang/src/runtime/panic.go:1117 +0x72 fp=0xc000811b60 sp=0xc000811b30 pc=0x43e612
2021-12-30T14:45:03.318565098+09:00 stderr F runtime.mapaccess2(0x18e4000, 0xc0018c5e90, 0xc000811bf0, 0x2921980, 0x0)
2021-12-30T14:45:03.318565098+09:00 stderr P    /usr/lib/golang/src/runtime/map.go:469 +0x255 fp=0xc000811ba0 sp=0xc000811b60 pc=0x415c55
2021-12-30T14:45:03.318589278+09:00 stderr F
2021-12-30T14:45:03.318589278+09:00 stderr F github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*loadBalancerHealthChecker).AddEndpoints(0xc0018c5ef0, 0xc0004d9180)
2021-12-30T14:45:03.318589278+09:00 stderr P    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/healthcheck.go:
2021-12-30T14:45:03.318611932+09:00 stderr F 61 +0x8c fp=0xc000811c20 sp=0xc000811ba0 pc=0x16a5b8c
2021-12-30T14:45:03.318632601+09:00 stderr F github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*gateway).AddEndpoints(...)
2021-12-30T14:45:03.318632601+09:00 stderr F    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/gateway.go:107
2021-12-30T14:45:03.318653771+09:00 stderr F github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*gateway).Init.func4(0x1ade440, 0xc0004d9180)
2021-12-30T14:45:03.318653771+09:00 stderr P    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/gateway.go:
2021-12-30T14:45:03.318674902+09:00 stderr P 147 +0x5d fp=0xc000811c48 sp=0xc000811c20 pc=0x16b851d
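
The fatal error above is a Go data race: the endpoints handler reads the load balancer health checker's services map while another handler writes to it concurrently. Below is a minimal sketch of the pattern and of the straightforward fix, serializing map access with a mutex, which is the approach of the upstream PR linked in comment 4. The type and field names are illustrative approximations, not the exact ovn-kubernetes source.

// Illustrative sketch of the race and a mutex-based fix; the actual
// change is in ovn-org/ovn-kubernetes#2821.
package node

import "sync"

type namespacedName struct{ Namespace, Name string }

type loadBalancerHealthChecker struct {
	sync.Mutex                          // serializes access to the maps below
	services  map[namespacedName]uint16 // service -> healthcheck node port
	endpoints map[namespacedName]int    // service -> local endpoint count
}

func newLoadBalancerHealthChecker() *loadBalancerHealthChecker {
	return &loadBalancerHealthChecker{
		services:  make(map[namespacedName]uint16),
		endpoints: make(map[namespacedName]int),
	}
}

// AddService runs on the service handler goroutine.
func (l *loadBalancerHealthChecker) AddService(name namespacedName, port uint16) {
	l.Lock() // without the lock, a concurrent read of l.services is fatal
	defer l.Unlock()
	l.services[name] = port
}

// AddEndpoints runs on the endpoints handler goroutine; an unguarded
// read of l.services here is where the crash in the stack trace fired.
func (l *loadBalancerHealthChecker) AddEndpoints(name namespacedName, localEndpoints int) {
	l.Lock()
	defer l.Unlock()
	if _, exists := l.services[name]; exists {
		l.endpoints[name] = localEndpoints
	}
}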

Comment 3 Tim Rozet 2022-02-14 17:33:31 UTC
ovn-controller is constantly claiming/releasing the port:
ovn-controller/0.log:2021-12-30T14:46:23.417027986+09:00 stderr F 2021-12-30T05:46:23Z|00239|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:46:23.417027986+09:00 stderr F 2021-12-30T05:46:23Z|00240|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:46:35.031080312+09:00 stderr F 2021-12-30T05:46:35Z|00335|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:46:55.837041123+09:00 stderr F 2021-12-30T05:46:55Z|00424|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:46:55.837041123+09:00 stderr F 2021-12-30T05:46:55Z|00425|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:47:17.749638683+09:00 stderr F 2021-12-30T05:47:17Z|00455|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:47:39.552058987+09:00 stderr F 2021-12-30T05:47:39Z|00558|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:47:39.552058987+09:00 stderr F 2021-12-30T05:47:39Z|00559|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:48:04.129982838+09:00 stderr F 2021-12-30T05:48:04Z|00625|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:48:15.395657797+09:00 stderr F 2021-12-30T05:48:15Z|00661|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
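
This flapping lines up with the error in the bug summary: the CNI side only reports the pod interface as ready once ovn-controller has bound the lport and stamped the OVS interface, so a port that keeps getting released never passes that check and the pod stays in ContainerCreating. A hypothetical, much-simplified sketch of that kind of wait loop follows; the helper name and the exact external-ids key are illustrative, not the actual CNI source.

// Hypothetical sketch of a CNI-style wait on OVS.Interface.external-ids;
// not the actual ovn-kubernetes implementation.
package node

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForPodInterface polls OVS until ovn-controller marks the pod's
// interface as installed (the state the flapping lport above never
// settles into) or gives up after the timeout.
func waitForPodInterface(ifaceName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("ovs-vsctl", "--if-exists", "get",
			"Interface", ifaceName, "external-ids:ovn-installed").CombinedOutput()
		if err == nil && strings.Contains(string(out), "true") {
			return nil // the port is bound on this chassis and wired up
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting on OVS.Interface.external-ids for %s", ifaceName)
}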

ovn-controller on this node is showing high CPU usage with ovsdb and br-int:
2021-12-30T14:54:42.590818277+09:00 stderr F 2021-12-30T05:54:42Z|02010|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (94% CPU usage)
2021-12-30T14:54:42.590827300+09:00 stderr F 2021-12-30T05:54:42Z|02011|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (94% CPU usage)
2021-12-30T14:54:42.590835523+09:00 stderr F 2021-12-30T05:54:42Z|02012|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.23.102.73:50280<->172.23.102.2:9642) at lib/stream-ssl.c:832 (94% CPU usage)

2021-12-30T14:47:27.297051860+09:00 stderr F 2021-12-30T05:47:27Z|00478|timeval|WARN|Unreasonably long 9558ms poll interval (9373ms user, 123ms system)
2021-12-30T14:47:38.828866531+09:00 stderr F 2021-12-30T05:47:38Z|00498|timeval|WARN|Unreasonably long 10956ms poll interval (9985ms user, 909ms system)
2021-12-30T14:47:51.369877464+09:00 stderr F 2021-12-30T05:47:51Z|00587|timeval|WARN|Unreasonably long 11834ms poll interval (10422ms user, 1344ms system)
2021-12-30T14:48:03.434296880+09:00 stderr F 2021-12-30T05:48:03Z|00615|timeval|WARN|Unreasonably long 11178ms poll interval (10029ms user, 1086ms system)
2021-12-30T14:48:14.305150447+09:00 stderr F 2021-12-30T05:48:14Z|00638|timeval|WARN|Unreasonably long 10187ms poll interval (9772ms user, 356ms system)
2021-12-30T14:48:26.588210331+09:00 stderr F 2021-12-30T05:48:26Z|00696|timeval|WARN|Unreasonably long 11203ms poll interval (10235ms user, 905ms system)
2021-12-30T14:48:38.540734685+09:00 stderr F 2021-12-30T05:48:38Z|00719|timeval|WARN|Unreasonably long 11285ms poll interval (10270ms user, 945ms system)
2021-12-30T14:48:48.905387049+09:00 stderr F 2021-12-30T05:48:48Z|00786|timeval|WARN|Unreasonably long 9954ms poll interval (9616ms user, 278ms system)
2021-12-30T14:49:00.888985408+09:00 stderr F 2021-12-30T05:49:00Z|00820|timeval|WARN|Unreasonably long 11242ms poll interval (10178ms user, 1002ms system)
2021-12-30T14:49:13.374769826+09:00 stderr F 2021-12-30T05:49:13Z|00864|timeval|WARN|Unreasonably long 11658ms poll interval (10526ms user, 1065ms system)
2021-12-30T14:49:25.359753639+09:00 stderr F 2021-12-30T05:49:25Z|00897|timeval|WARN|Unreasonably long 11063ms poll interval (10065ms user, 930ms system)
2021-12-30T14:49:37.771741720+09:00 stderr F 2021-12-30T05:49:37Z|00924|timeval|WARN|Unreasonably long 11841ms poll interval (10537ms user, 1232ms system)
2021-12-30T14:49:49.789399230+09:00 stderr F 2021-12-30T05:49:49Z|01015|timeval|WARN|Unreasonably long 11079ms poll interval (10020ms user, 996ms system)
2021-12-30T14:50:00.839961382+09:00 stderr F 2021-12-30T05:50:00Z|01039|timeval|WARN|Unreasonably long 10437ms poll interval (10061ms user, 314ms system)
2021-12-30T14:50:12.056502934+09:00 stderr F 2021-12-30T05:50:12Z|01069|timeval|WARN|Unreasonably long 10320ms poll interval (10013ms user, 248ms system)
2021-12-30T14:50:24.838590556+09:00 stderr F 2021-12-30T05:50:24Z|01124|timeval|WARN|Unreasonably long 11373ms poll interval (10389ms user, 917ms system)
2021-12-30T14:50:37.692246916+09:00 stderr F 2021-12-30T05:50:37Z|01156|timeval|WARN|Unreasonably long 11695ms poll interval (10568ms user, 1055ms system)
2021-12-30T14:50:50.684069165+09:00 stderr F 2021-12-30T05:50:50Z|01221|timeval|WARN|Unreasonably long 12053ms poll interval (10628ms user, 1354ms system)
2021-12-30T14:51:03.396585955+09:00 stderr F 2021-12-30T05:51:03Z|01269|timeval|WARN|Unreasonably long 11645ms poll interval (10435ms user, 1149ms system)
2021-12-30T14:51:15.783168987+09:00 stderr F 2021-12-30T05:51:15Z|01305|timeval|WARN|Unreasonably long 11687ms poll interval (10610ms user, 1005ms system)
2021-12-30T14:51:28.208336804+09:00 stderr F 2021-12-30T05:51:28Z|01342|timeval|WARN|Unreasonably long 11510ms poll interval (10431ms user, 1006ms system)
2021-12-30T14:51:40.659551277+09:00 stderr F 2021-12-30T05:51:40Z|01382|timeval|WARN|Unreasonably long 11572ms poll interval (10491ms user, 1018ms system)
2021-12-30T14:51:53.243245067+09:00 stderr F 2021-12-30T05:51:53Z|01418|timeval|WARN|Unreasonably long 11656ms poll interval (10543ms user, 1042ms system)
2021-12-30T14:52:05.892526675+09:00 stderr F 2021-12-30T05:52:05Z|01458|timeval|WARN|Unreasonably long 11661ms poll interval (10590ms user, 998ms system)


OVN version:
2021-12-30T14:44:21.915462456+09:00 stderr F 2021-12-30T05:44:21Z|00004|main|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]

Comment 4 Tim Rozet 2022-02-14 18:14:14 UTC
Posted a fix for the ovnkube crash here: https://github.com/ovn-org/ovn-kubernetes/pull/2821

Comment 5 Tim Rozet 2022-02-21 15:09:45 UTC
Dumitru from the OVN team analyzed the ovn-controller issue. From his analysis, he believes this is the fix:
https://github.com/openshift/ovn-kubernetes/commit/2431758477c9709c25c07a64c03d49656ce30505

This fix is present in both local and shared gateway mode in versions > 4.10. It is only present for shared gateway mode in 4.9, and is not available at all in 4.8. Will look into the feasibility of backporting it.

Comment 6 Tim Rozet 2022-02-25 16:53:08 UTC
The OVN fix we would need for local gateway mode support in versions <= 4.9:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220104115034.142846-1-sangana.abhiram@nutanix.com/

The other option is backporting the removal of the DGP (distributed gateway port).

Comment 11 errata-xmlrpc 2022-08-10 10:49:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

