Description of problem:

After changing SDN to OVN-Kubernetes, some pods are not created. In addition, some other pods take a long time to be created.

[root@ipv4-svt-bastion ~]# oc get po -A -owide | grep d204-30
common-smf-1        smf-pfcpc-6bd4dfd666-k26x7           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-smf-1        smf-pfcpc-6bd4dfd666-svd7n           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-smf-2        smf-pfcpc-77f469db86-fgg4t           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-smf-2        smf-pfcpc-77f469db86-tdd2z           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-smf-2        smf-pfcpc-8596cfff55-85jmk           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-1        svc-upf-pfcpc-udp-b866b7465-4kp99    0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-1        upf-pfcpc-d49c477b7-5rmqd            0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-1        upf-pfcpc-d49c477b7-7l2pk            0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-2        svc-upf-pfcpc-udp-56f46c449b-7d6cg   0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-2        upf-pfcpc-5d7f669d48-56thw           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-2        upf-pfcpc-5d7f669d48-7jhgb           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
common-upf-2        upf-pfcpc-686c5fb57d-hkkzq           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-1   smf-pfcpc-55b9d84875-4dzmz           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-1   smf-pfcpc-55b9d84875-cgplf           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-1   smf-pfcpc-5878475f46-2tffk           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-2   smf-pfcpc-55b9d84875-c24fq           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-2   smf-pfcpc-55b9d84875-nrnvt           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-2   smf-pfcpc-566d749b97-gjzrc           0/3   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-3   smf-pfcpc-55b9d84875-25868           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>
global-smf-auto-3   smf-pfcpc-55b9d84875-6b8gr           0/2   ContainerCreating   0   16m   <none>   d204-30.core-svt.samsung.net   <none>   <none>

The engineering team quickly fixed the bug for this issue, so Samsung upgraded to 4.8.12 and the problem was solved.

* Samsung tested as below, and no error occurred:
  - 200 pods: Pass (within 1 minute)
  - 300 pods: Pass (within 2 minutes)
  - 400 pods: Pass (within 15 minutes)

But they are now running 4.8.21 and the same error is occurring again. There are about 2000 pods. Deleting the ovn-controller container or the ovnkube-node pod did not help.

Version-Release number of selected component (if applicable):

# oc version
Client Version: 4.8.13
Server Version: 4.8.21
Kubernetes Version: v1.21.5+6a39d04

Bare metal / UPI / IPv4 single stack
# master: 3 / worker: 45
$ omg get nodes | grep -v NAME | wc -l
48

How reproducible:
See above.

Steps to Reproduce:
1.
2.
3.

Actual results:
The pods remain in the ContainerCreating state.

Expected results:
The pods change to the Running state.

Additional info:
Notice a crash in the ovnkube-node logs in the sosreport (2021-12-30T05:45:03Z):

I1230 05:45:03.315219  142946 port_claim.go:40] Opening socket for service: global-amfmme-auto-2/amf-anintf-sctp, port: 48819 and protocol SCTP
fatal error: concurrent map read and map write

goroutine 322 [running]:
runtime.throw(0x1b2bac6, 0x21)
        /usr/lib/golang/src/runtime/panic.go:1117 +0x72 fp=0xc000811b60 sp=0xc000811b30 pc=0x43e612
runtime.mapaccess2(0x18e4000, 0xc0018c5e90, 0xc000811bf0, 0x2921980, 0x0)
        /usr/lib/golang/src/runtime/map.go:469 +0x255 fp=0xc000811ba0 sp=0xc000811b60 pc=0x415c55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*loadBalancerHealthChecker).AddEndpoints(0xc0018c5ef0, 0xc0004d9180)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/healthcheck.go:61 +0x8c fp=0xc000811c20 sp=0xc000811ba0 pc=0x16a5b8c
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*gateway).AddEndpoints(...)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/gateway.go:107
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*gateway).Init.func4(0x1ade440, 0xc0004d9180)
        /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/gateway.go:147 +0x5d fp=0xc000811c48 sp=0xc000811c20 pc=0x16b851d
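For context on the crash: the trace shows (*loadBalancerHealthChecker).AddEndpoints doing a map lookup (runtime.mapaccess2) while another goroutine writes to the same map, and the Go runtime aborts the whole process when it detects a concurrent map read and write. Below is a minimal sketch of that failure class and of the usual mutex-based guard; the type and field names are illustrative assumptions, not the actual ovn-kubernetes implementation or the linked fix.

package main

import "sync"

// Minimal sketch of the race class behind "fatal error: concurrent map read
// and map write". Names here are illustrative, not the real
// loadBalancerHealthChecker code.
type healthChecker struct {
        mu        sync.Mutex       // guards services and endpoints
        services  map[string]int32 // service name -> health-check node port
        endpoints map[string]int   // service name -> local endpoint count
}

// Without the mutex, AddEndpoints (a map read) racing with AddService
// (a map write) on another goroutine crashes the process.
func (h *healthChecker) AddService(name string, port int32) {
        h.mu.Lock()
        defer h.mu.Unlock()
        h.services[name] = port
}

func (h *healthChecker) AddEndpoints(name string, localCount int) {
        h.mu.Lock()
        defer h.mu.Unlock()
        if _, ok := h.services[name]; ok { // the kind of read that panicked in the trace
                h.endpoints[name] = localCount
        }
}

func main() {
        h := &healthChecker{
                services:  map[string]int32{},
                endpoints: map[string]int{},
        }
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
                wg.Add(2)
                go func() { defer wg.Done(); h.AddService("svc", 48819) }()
                go func() { defer wg.Done(); h.AddEndpoints("svc", 1) }()
        }
        wg.Wait()
}

Running the unguarded variant under "go run -race" flags the race even when it does not crash; with the mutex in place the two callers serialize and the runtime panic seen above cannot occur.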
ovn-controller is constantly claiming/releasing the port:

ovn-controller/0.log:2021-12-30T14:46:23.417027986+09:00 stderr F 2021-12-30T05:46:23Z|00239|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:46:23.417027986+09:00 stderr F 2021-12-30T05:46:23Z|00240|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:46:35.031080312+09:00 stderr F 2021-12-30T05:46:35Z|00335|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:46:55.837041123+09:00 stderr F 2021-12-30T05:46:55Z|00424|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:46:55.837041123+09:00 stderr F 2021-12-30T05:46:55Z|00425|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:47:17.749638683+09:00 stderr F 2021-12-30T05:47:17Z|00455|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:47:39.552058987+09:00 stderr F 2021-12-30T05:47:39Z|00558|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller/0.log:2021-12-30T14:47:39.552058987+09:00 stderr F 2021-12-30T05:47:39Z|00559|binding|INFO|ntte-smf-1_smf-pfcpc-857b877946-db7mc: Claiming 0a:58:80:1f:00:80 128.31.0.128
ovn-controller/0.log:2021-12-30T14:48:04.129982838+09:00 stderr F 2021-12-30T05:48:04Z|00625|binding|INFO|Releasing lport ntte-smf-1_smf-pfcpc-857b877946-db7mc from this chassis.
ovn-controller/0.log:2021-12-30T14:48:15.395657797+09:00 stderr F 2021-12-30T05:48:15Z|00661|binding|INFO|Claiming lport ntte-smf-1_smf-pfcpc-857b877946-db7mc for this chassis.
ovn-controller on this node is showing high CPU usage with ovsdb and br-int:

2021-12-30T14:54:42.590818277+09:00 stderr F 2021-12-30T05:54:42Z|02010|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (94% CPU usage)
2021-12-30T14:54:42.590827300+09:00 stderr F 2021-12-30T05:54:42Z|02011|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (94% CPU usage)
2021-12-30T14:54:42.590835523+09:00 stderr F 2021-12-30T05:54:42Z|02012|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.23.102.73:50280<->172.23.102.2:9642) at lib/stream-ssl.c:832 (94% CPU usage)

2021-12-30T14:47:27.297051860+09:00 stderr F 2021-12-30T05:47:27Z|00478|timeval|WARN|Unreasonably long 9558ms poll interval (9373ms user, 123ms system)
2021-12-30T14:47:38.828866531+09:00 stderr F 2021-12-30T05:47:38Z|00498|timeval|WARN|Unreasonably long 10956ms poll interval (9985ms user, 909ms system)
2021-12-30T14:47:51.369877464+09:00 stderr F 2021-12-30T05:47:51Z|00587|timeval|WARN|Unreasonably long 11834ms poll interval (10422ms user, 1344ms system)
2021-12-30T14:48:03.434296880+09:00 stderr F 2021-12-30T05:48:03Z|00615|timeval|WARN|Unreasonably long 11178ms poll interval (10029ms user, 1086ms system)
2021-12-30T14:48:14.305150447+09:00 stderr F 2021-12-30T05:48:14Z|00638|timeval|WARN|Unreasonably long 10187ms poll interval (9772ms user, 356ms system)
2021-12-30T14:48:26.588210331+09:00 stderr F 2021-12-30T05:48:26Z|00696|timeval|WARN|Unreasonably long 11203ms poll interval (10235ms user, 905ms system)
2021-12-30T14:48:38.540734685+09:00 stderr F 2021-12-30T05:48:38Z|00719|timeval|WARN|Unreasonably long 11285ms poll interval (10270ms user, 945ms system)
2021-12-30T14:48:48.905387049+09:00 stderr F 2021-12-30T05:48:48Z|00786|timeval|WARN|Unreasonably long 9954ms poll interval (9616ms user, 278ms system)
2021-12-30T14:49:00.888985408+09:00 stderr F 2021-12-30T05:49:00Z|00820|timeval|WARN|Unreasonably long 11242ms poll interval (10178ms user, 1002ms system)
2021-12-30T14:49:13.374769826+09:00 stderr F 2021-12-30T05:49:13Z|00864|timeval|WARN|Unreasonably long 11658ms poll interval (10526ms user, 1065ms system)
2021-12-30T14:49:25.359753639+09:00 stderr F 2021-12-30T05:49:25Z|00897|timeval|WARN|Unreasonably long 11063ms poll interval (10065ms user, 930ms system)
2021-12-30T14:49:37.771741720+09:00 stderr F 2021-12-30T05:49:37Z|00924|timeval|WARN|Unreasonably long 11841ms poll interval (10537ms user, 1232ms system)
2021-12-30T14:49:49.789399230+09:00 stderr F 2021-12-30T05:49:49Z|01015|timeval|WARN|Unreasonably long 11079ms poll interval (10020ms user, 996ms system)
2021-12-30T14:50:00.839961382+09:00 stderr F 2021-12-30T05:50:00Z|01039|timeval|WARN|Unreasonably long 10437ms poll interval (10061ms user, 314ms system)
2021-12-30T14:50:12.056502934+09:00 stderr F 2021-12-30T05:50:12Z|01069|timeval|WARN|Unreasonably long 10320ms poll interval (10013ms user, 248ms system)
2021-12-30T14:50:24.838590556+09:00 stderr F 2021-12-30T05:50:24Z|01124|timeval|WARN|Unreasonably long 11373ms poll interval (10389ms user, 917ms system)
2021-12-30T14:50:37.692246916+09:00 stderr F 2021-12-30T05:50:37Z|01156|timeval|WARN|Unreasonably long 11695ms poll interval (10568ms user, 1055ms system)
2021-12-30T14:50:50.684069165+09:00 stderr F 2021-12-30T05:50:50Z|01221|timeval|WARN|Unreasonably long 12053ms poll interval (10628ms user, 1354ms system)
2021-12-30T14:51:03.396585955+09:00 stderr F 2021-12-30T05:51:03Z|01269|timeval|WARN|Unreasonably long 11645ms poll interval (10435ms user, 1149ms system)
2021-12-30T14:51:15.783168987+09:00 stderr F 2021-12-30T05:51:15Z|01305|timeval|WARN|Unreasonably long 11687ms poll interval (10610ms user, 1005ms system)
2021-12-30T14:51:28.208336804+09:00 stderr F 2021-12-30T05:51:28Z|01342|timeval|WARN|Unreasonably long 11510ms poll interval (10431ms user, 1006ms system)
2021-12-30T14:51:40.659551277+09:00 stderr F 2021-12-30T05:51:40Z|01382|timeval|WARN|Unreasonably long 11572ms poll interval (10491ms user, 1018ms system)
2021-12-30T14:51:53.243245067+09:00 stderr F 2021-12-30T05:51:53Z|01418|timeval|WARN|Unreasonably long 11656ms poll interval (10543ms user, 1042ms system)
2021-12-30T14:52:05.892526675+09:00 stderr F 2021-12-30T05:52:05Z|01458|timeval|WARN|Unreasonably long 11661ms poll interval (10590ms user, 998ms system)

version:
2021-12-30T14:44:21.915462456+09:00 stderr F 2021-12-30T05:44:21Z|00004|main|INFO|OVN internal version is : [20.12.0-20.17.0-56.1]
Posted a fix for the ovnkube crash here: https://github.com/ovn-org/ovn-kubernetes/pull/2821
Dumitru from the OVN team analyzed the ovn-controller issue. Based on his analysis, he believes this is the fix: https://github.com/openshift/ovn-kubernetes/commit/2431758477c9709c25c07a64c03d49656ce30505

The fix is present in both local and shared gateway mode in versions > 4.10, only in shared gateway mode in 4.9, and not at all in 4.8. We will look into the feasibility of backporting it.
The OVN fix we would need for local gateway support in versions <= 4.9: https://patchwork.ozlabs.org/project/openvswitch/patch/20220104115034.142846-1-sangana.abhiram@nutanix.com/

The other option is to backport the removal of the DGP (distributed gateway port).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069