Hide Forgot
Description of problem: We believe this causes packet drops or latency that causes an application in OpenShift to crash. The Cluster Operator Networking is degraded and following pattern of "Unreasonably long time spent on polling" and many minor faults reported. Version-Release number of selected component (if applicable): 4.5.16 We are seeing regular OVN faults and polling interval stuck for 4 seconds every roughly 5 minutes intervals. 2021-04-08T18:00:02.715074869Z 2021-04-08T18:00:02Z|1633233|timeval|WARN|faults: 980 minor, 0 major 2021-04-08T18:00:06.380304864Z 2021-04-08T18:00:06Z|1633239|timeval|WARN|faults: 138 minor, 0 major 2021-04-08T18:00:10.142240431Z 2021-04-08T18:00:10Z|1633246|timeval|WARN|faults: 782 minor, 0 major 2021-04-08T18:04:54.926945661Z 2021-04-08T18:04:54Z|1633262|timeval|WARN|faults: 1770 minor, 0 major 2021-04-08T18:04:58.653385889Z 2021-04-08T18:04:58Z|1633269|timeval|WARN|faults: 338 minor, 0 major 2021-04-08T18:05:02.457484642Z 2021-04-08T18:05:02Z|1633278|timeval|WARN|faults: 933 minor, 0 major 2021-04-08T18:05:06.273114677Z 2021-04-08T18:05:06Z|1633286|timeval|WARN|faults: 1 minor, 0 major 2021-04-08T18:05:10.043856974Z 2021-04-08T18:05:10Z|1633292|timeval|WARN|faults: 452 minor, 0 major 2021-04-08T18:09:54.98602567Z 2021-04-08T18:09:54Z|1633301|timeval|WARN|faults: 1986 minor, 0 major 2021-04-08T18:09:58.75175974Z 2021-04-08T18:09:58Z|1633308|timeval|WARN|faults: 594 minor, 0 major 2021-04-08T18:10:02.503164696Z 2021-04-08T18:10:02Z|1633313|timeval|WARN|faults: 981 minor, 0 major 2021-04-08T18:10:06.314634985Z 2021-04-08T18:10:06Z|1633319|timeval|WARN|faults: 137 minor, 0 major 2021-04-08T18:10:10.150085862Z 2021-04-08T18:10:10Z|1633324|timeval|WARN|faults: 772 minor, 0 major 2021-04-08T18:14:54.788238875Z 2021-04-08T18:14:54Z|1633331|timeval|WARN|faults: 1781 minor, 0 major 2021-04-08T18:14:58.551843931Z 2021-04-08T18:14:58Z|1633337|timeval|WARN|faults: 924 minor, 0 major 2021-04-08T18:15:02.313595211Z 2021-04-08T18:15:02Z|1633344|timeval|WARN|faults: 481 minor, 0 major 2021-04-08T18:15:06.268042727Z 2021-04-08T18:15:06Z|1633352|timeval|WARN|faults: 133 minor, 0 major 2021-04-08T18:19:54.991310581Z 2021-04-08T18:19:54Z|1633362|timeval|WARN|faults: 2045 minor, 0 major 2021-04-08T18:19:58.735163092Z 2021-04-08T18:19:58Z|1633368|timeval|WARN|faults: 924 minor, 0 major 2021-04-08T18:20:02.523545829Z 2021-04-08T18:20:02Z|1633373|timeval|WARN|faults: 473 minor, 0 major 2021-04-08T18:20:06.355446491Z 2021-04-08T18:20:06Z|1633379|timeval|WARN|faults: 773 minor, 0 major 2021-04-08T18:20:10.133747946Z 2021-04-08T18:20:10Z|1633384|timeval|WARN|faults: 330 minor, 0 major 2021-04-08T18:24:55.104693942Z 2021-04-08T18:24:55Z|1633395|timeval|WARN|faults: 1606 minor, 0 major 2021-04-08T18:24:58.935811519Z 2021-04-08T18:24:58Z|1633401|timeval|WARN|faults: 1089 minor, 0 major 2021-04-08T18:25:02.639788749Z 2021-04-08T18:25:02Z|1633406|timeval|WARN|faults: 496 minor, 0 major 2021-04-08T18:25:06.427825717Z 2021-04-08T18:25:06Z|1633412|timeval|WARN|faults: 261 minor, 0 major 2021-04-08T18:25:10.228325615Z 2021-04-08T18:25:10Z|1633417|timeval|WARN|faults: 644 minor, 0 major 2021-04-08T18:00:02.715074869Z 2021-04-08T18:00:02Z|1633232|timeval|WARN|Unreasonably long 3713ms poll interval (3690ms user, 0ms system) 2021-04-08T18:00:06.380279371Z 2021-04-08T18:00:06Z|1633238|timeval|WARN|Unreasonably long 3664ms poll interval (3640ms user, 0ms system) 2021-04-08T18:00:10.142240431Z 2021-04-08T18:00:10Z|1633245|timeval|WARN|Unreasonably long 3762ms poll interval (3740ms user, 0ms system) 2021-04-08T18:04:54.926905224Z 2021-04-08T18:04:54Z|1633261|timeval|WARN|Unreasonably long 3862ms poll interval (3837ms user, 1ms system) 2021-04-08T18:04:58.653339573Z 2021-04-08T18:04:58Z|1633268|timeval|WARN|Unreasonably long 3726ms poll interval (3705ms user, 0ms system) 2021-04-08T18:05:02.457451673Z 2021-04-08T18:05:02Z|1633277|timeval|WARN|Unreasonably long 3804ms poll interval (3780ms user, 1ms system) 2021-04-08T18:05:06.273079183Z 2021-04-08T18:05:06Z|1633285|timeval|WARN|Unreasonably long 3813ms poll interval (3790ms user, 0ms system) 2021-04-08T18:05:10.043811869Z 2021-04-08T18:05:10Z|1633291|timeval|WARN|Unreasonably long 3771ms poll interval (3747ms user, 2ms system) 2021-04-08T18:09:54.985966323Z 2021-04-08T18:09:54Z|1633300|timeval|WARN|Unreasonably long 3833ms poll interval (3809ms user, 1ms system) 2021-04-08T18:09:58.75175974Z 2021-04-08T18:09:58Z|1633307|timeval|WARN|Unreasonably long 3766ms poll interval (3741ms user, 1ms system) 2021-04-08T18:10:02.503164696Z 2021-04-08T18:10:02Z|1633312|timeval|WARN|Unreasonably long 3751ms poll interval (3727ms user, 0ms system) 2021-04-08T18:10:06.314634985Z 2021-04-08T18:10:06Z|1633318|timeval|WARN|Unreasonably long 3810ms poll interval (3785ms user, 1ms system) 2021-04-08T18:10:10.150057959Z 2021-04-08T18:10:10Z|1633323|timeval|WARN|Unreasonably long 3835ms poll interval (3810ms user, 0ms system) 2021-04-08T18:14:54.788238875Z 2021-04-08T18:14:54Z|1633330|timeval|WARN|Unreasonably long 3736ms poll interval (3719ms user, 1ms system) 2021-04-08T18:14:58.551843931Z 2021-04-08T18:14:58Z|1633336|timeval|WARN|Unreasonably long 3764ms poll interval (3743ms user, 1ms system) 2021-04-08T18:15:02.313595211Z 2021-04-08T18:15:02Z|1633343|timeval|WARN|Unreasonably long 3761ms poll interval (3735ms user, 2ms system) 2021-04-08T18:15:06.267997289Z 2021-04-08T18:15:06Z|1633351|timeval|WARN|Unreasonably long 3786ms poll interval (3761ms user, 1ms system) 2021-04-08T18:19:54.991272828Z 2021-04-08T18:19:54Z|1633361|timeval|WARN|Unreasonably long 3804ms poll interval (3779ms user, 2ms system) 2021-04-08T18:19:58.735127639Z 2021-04-08T18:19:58Z|1633367|timeval|WARN|Unreasonably long 3744ms poll interval (3722ms user, 0ms system) 2021-04-08T18:20:02.523506603Z 2021-04-08T18:20:02Z|1633372|timeval|WARN|Unreasonably long 3788ms poll interval (3763ms user, 2ms system) 2021-04-08T18:20:06.35541336Z 2021-04-08T18:20:06Z|1633378|timeval|WARN|Unreasonably long 3831ms poll interval (3806ms user, 0ms system) 2021-04-08T18:20:10.133747946Z 2021-04-08T18:20:10Z|1633383|timeval|WARN|Unreasonably long 3779ms poll interval (3757ms user, 0ms system) 2021-04-08T18:24:55.104664739Z 2021-04-08T18:24:55Z|1633394|timeval|WARN|Unreasonably long 3801ms poll interval (3775ms user, 1ms system) 2021-04-08T18:24:58.935782105Z 2021-04-08T18:24:58Z|1633400|timeval|WARN|Unreasonably long 3831ms poll interval (3806ms user, 0ms system) 2021-04-08T18:25:02.639753565Z 2021-04-08T18:25:02Z|1633405|timeval|WARN|Unreasonably long 3704ms poll interval (3680ms user, 1ms system) 2021-04-08T18:25:06.427775317Z 2021-04-08T18:25:06Z|1633411|timeval|WARN|Unreasonably long 3788ms poll interval (3762ms user, 0ms system) 2021-04-08T18:25:10.228294012Z 2021-04-08T18:25:10Z|1633416|timeval|WARN|Unreasonably long 3800ms poll interval (3775ms user, 0ms system) On Average there are 4000ms poll intervals and around 400 minor faults (ranging from 1 minor fault to up to 2300 minor faults) Also CPU usage report is at peak: 2021-04-08T18:15:06.268094833Z 2021-04-08T18:15:06Z|1633354|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (100% CPU usage) 2021-04-08T18:15:06.268121267Z 2021-04-08T18:15:06Z|1633355|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage) 2021-04-08T18:15:06.437193983Z 2021-04-08T18:15:06Z|1633356|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage) 2021-04-08T18:19:54.991422758Z 2021-04-08T18:19:54Z|1633366|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (65% CPU usage) 2021-04-08T18:19:58.735249779Z 2021-04-08T18:19:58Z|1633371|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) 2021-04-08T18:20:02.52364114Z 2021-04-08T18:20:02Z|1633376|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage) 2021-04-08T18:20:02.524290378Z 2021-04-08T18:20:02Z|1633377|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage) 2021-04-08T18:20:06.355500295Z 2021-04-08T18:20:06Z|1633381|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) 2021-04-08T18:20:06.35553017Z 2021-04-08T18:20:06Z|1633382|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (100% CPU usage) 2021-04-08T18:20:10.133779051Z 2021-04-08T18:20:10Z|1633386|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage) 2021-04-08T18:20:11.456759906Z 2021-04-08T18:20:11Z|1633387|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage) 2021-04-08T18:20:11.628415988Z 2021-04-08T18:20:11Z|1633388|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage) 2021-04-08T18:20:12.761901284Z 2021-04-08T18:20:12Z|1633389|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage) 2021-04-08T18:24:55.104809983Z 2021-04-08T18:24:55Z|1633399|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (99% CPU usage) 2021-04-08T18:24:58.93589203Z 2021-04-08T18:24:58Z|1633404|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage) 2021-04-08T18:25:02.63986941Z 2021-04-08T18:25:02Z|1633409|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage) 2021-04-08T18:25:02.640442529Z 2021-04-08T18:25:02Z|1633410|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (99% CPU usage) 2021-04-08T18:25:06.427872765Z 2021-04-08T18:25:06Z|1633414|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage) 2021-04-08T18:25:06.427901603Z 2021-04-08T18:25:06Z|1633415|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage) 2021-04-08T18:25:10.228377383Z 2021-04-08T18:25:10Z|1633419|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage) Due to this problem, it may be the problem behind other issues such as network operator in degraded state: omg get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.5.16 True False 31m Error while reconciling 4.5.16: an unknown error has occurred: MultipleErrors omg get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.5.16 True False False 1d cloud-credential 4.5.16 True False False 107d cluster-autoscaler 4.5.16 True False False 107d config-operator 4.5.16 True False False 107d console 4.5.16 True False False 1d csi-snapshot-controller 4.5.16 True False False 20d dns 4.5.16 True True False 2d etcd 4.5.16 True False False 20d image-registry 4.5.16 True False False 107d ingress 4.5.16 True False False 45d insights 4.5.16 True False False 208d kube-apiserver 4.5.16 True False False 1d kube-controller-manager 4.5.16 True False False 20d kube-scheduler 4.5.16 True False False 20d kube-storage-version-migrator 4.5.16 True False False 20d machine-api 4.5.16 True False False 107d machine-approver 4.5.16 True False False 107d machine-config 4.5.16 False False True 2d marketplace 4.5.16 True False False 20d monitoring 4.5.16 False True True 24s network 4.5.16 True True True 2d node-tuning 4.5.16 True False False 90d openshift-apiserver 4.5.16 True False False 2d openshift-controller-manager 4.5.16 True False False 12d openshift-samples 4.5.16 True False False 107d operator-lifecycle-manager 4.5.16 True False False 107d operator-lifecycle-manager-catalog 4.5.16 True False False 107d operator-lifecycle-manager-packageserver 4.5.16 True False False 20d service-ca 4.5.16 True False False 20d storage 4.5.16 True False False 107d
Note it was also seen that: So far on the connectivity side I found a bunch of these (about as early as logging starts on node 06), about every minute: Apr 08 18:20:24 worker-006 hyperkube[4011520]: F0224 13:59:03.392898 8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1 Apr 08 18:21:44 worker-006 hyperkube[4011520]: F0224 13:59:03.392898 8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1 This is odd because this is just ovnkube setting the value in local OVSDB.
The 5 seconds interval resonates with https://bugzilla.redhat.com/show_bug.cgi?id=1908921 which was fixed in 4.5.27
The dropped traffic every 5 minutes was due to stale ports for a previous instance of the pod on another node. This was causing ovn-controllers on two different nodes to thrash as they both kept binding the port, and that will cause traffic drop. The root cause of this was kubelet was not sending the events on the stale node, covered by: https://bugzilla.redhat.com/show_bug.cgi?id=1948052 However, we can make this more robust on the SDN side, by specifically telling OVN where the current pod should be bound. This will prevent the thrashing in the event that there are stale ports leftover for whatever reason. Additionally, after fixing the above issue by restarting kubelet the affected worker; TCP connections were being closed every hour. This was determined to be due to Istio's default behavior with idle connections and not an OVN issue.
Verified on 4.8.0-0.nightly-2021-04-20-032026 requested-chassis is set. sh-4.4# ovn-nbctl --no-leader-only --format=csv list Logical_Switch_Port | grep test-rc 124b0b93-9a97-4d83-a1ae-92c3b83a7784,"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jwq7p,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],"""""",true 854ef321-b993-4d78-b167-5045f6841bdd,"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vxj2k,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],"""""",true 0040de26-dac8-4160-8630-56129d15b290,"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-gzm87,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],"""""",true 14140fc7-8516-4e56-8890-b6708aa3ca9d,"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vtcch,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],"""""",true 8023b0cf-3992-49bf-8521-edb6193835d3,"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-rn2qp,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],"""""",true 7e57a8d7-7f75-4486-9c1b-364855aca405,"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-9r45q,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],"""""",true 9603d659-dcd2-4996-9a6b-a8107e325985,"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jljjx,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],"""""",true ca7e6f5c-44cd-4aeb-a806-e00fbdb22776,"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-dqvqx,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],"""""",true a7d0335c-8bc7-4e1c-851a-4bde4292b907,"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-fw296,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],"""""",true cbf227f9-2c77-4d75-8abb-241c5e29f530,"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-xn5db,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],"""""",true
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438