Bug 1947712

Summary: [OVN] Many faults and Polling interval stuck for 4 seconds every roughly 5 minutes intervals.
Product: OpenShift Container Platform Reporter: Robin Cernin <rcernin>
Component: NetworkingAssignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: aconstan, akaris, dblack, fpaoline, gdiotte, memodi, mmethot, pmannidi, rbrattai, trozet
Version: 4.5   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1950432 (view as bug list) Environment:
Last Closed: 2021-07-27 22:58:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1950432    

Description Robin Cernin 2021-04-09 02:04:06 UTC
Description of problem:

We believe this causes packet drops or latency that causes an application in OpenShift to crash. The Cluster Operator Networking is degraded and following pattern of "Unreasonably long time spent on polling" and many minor faults reported.


Version-Release number of selected component (if applicable):
4.5.16

We are seeing regular OVN faults and polling interval stuck for 4 seconds every roughly 5 minutes intervals.

2021-04-08T18:00:02.715074869Z 2021-04-08T18:00:02Z|1633233|timeval|WARN|faults: 980 minor, 0 major
2021-04-08T18:00:06.380304864Z 2021-04-08T18:00:06Z|1633239|timeval|WARN|faults: 138 minor, 0 major
2021-04-08T18:00:10.142240431Z 2021-04-08T18:00:10Z|1633246|timeval|WARN|faults: 782 minor, 0 major
2021-04-08T18:04:54.926945661Z 2021-04-08T18:04:54Z|1633262|timeval|WARN|faults: 1770 minor, 0 major
2021-04-08T18:04:58.653385889Z 2021-04-08T18:04:58Z|1633269|timeval|WARN|faults: 338 minor, 0 major
2021-04-08T18:05:02.457484642Z 2021-04-08T18:05:02Z|1633278|timeval|WARN|faults: 933 minor, 0 major
2021-04-08T18:05:06.273114677Z 2021-04-08T18:05:06Z|1633286|timeval|WARN|faults: 1 minor, 0 major
2021-04-08T18:05:10.043856974Z 2021-04-08T18:05:10Z|1633292|timeval|WARN|faults: 452 minor, 0 major
2021-04-08T18:09:54.98602567Z 2021-04-08T18:09:54Z|1633301|timeval|WARN|faults: 1986 minor, 0 major
2021-04-08T18:09:58.75175974Z 2021-04-08T18:09:58Z|1633308|timeval|WARN|faults: 594 minor, 0 major
2021-04-08T18:10:02.503164696Z 2021-04-08T18:10:02Z|1633313|timeval|WARN|faults: 981 minor, 0 major
2021-04-08T18:10:06.314634985Z 2021-04-08T18:10:06Z|1633319|timeval|WARN|faults: 137 minor, 0 major
2021-04-08T18:10:10.150085862Z 2021-04-08T18:10:10Z|1633324|timeval|WARN|faults: 772 minor, 0 major
2021-04-08T18:14:54.788238875Z 2021-04-08T18:14:54Z|1633331|timeval|WARN|faults: 1781 minor, 0 major
2021-04-08T18:14:58.551843931Z 2021-04-08T18:14:58Z|1633337|timeval|WARN|faults: 924 minor, 0 major
2021-04-08T18:15:02.313595211Z 2021-04-08T18:15:02Z|1633344|timeval|WARN|faults: 481 minor, 0 major
2021-04-08T18:15:06.268042727Z 2021-04-08T18:15:06Z|1633352|timeval|WARN|faults: 133 minor, 0 major
2021-04-08T18:19:54.991310581Z 2021-04-08T18:19:54Z|1633362|timeval|WARN|faults: 2045 minor, 0 major
2021-04-08T18:19:58.735163092Z 2021-04-08T18:19:58Z|1633368|timeval|WARN|faults: 924 minor, 0 major
2021-04-08T18:20:02.523545829Z 2021-04-08T18:20:02Z|1633373|timeval|WARN|faults: 473 minor, 0 major
2021-04-08T18:20:06.355446491Z 2021-04-08T18:20:06Z|1633379|timeval|WARN|faults: 773 minor, 0 major
2021-04-08T18:20:10.133747946Z 2021-04-08T18:20:10Z|1633384|timeval|WARN|faults: 330 minor, 0 major
2021-04-08T18:24:55.104693942Z 2021-04-08T18:24:55Z|1633395|timeval|WARN|faults: 1606 minor, 0 major
2021-04-08T18:24:58.935811519Z 2021-04-08T18:24:58Z|1633401|timeval|WARN|faults: 1089 minor, 0 major
2021-04-08T18:25:02.639788749Z 2021-04-08T18:25:02Z|1633406|timeval|WARN|faults: 496 minor, 0 major
2021-04-08T18:25:06.427825717Z 2021-04-08T18:25:06Z|1633412|timeval|WARN|faults: 261 minor, 0 major
2021-04-08T18:25:10.228325615Z 2021-04-08T18:25:10Z|1633417|timeval|WARN|faults: 644 minor, 0 major


2021-04-08T18:00:02.715074869Z 2021-04-08T18:00:02Z|1633232|timeval|WARN|Unreasonably long 3713ms poll interval (3690ms user, 0ms system)
2021-04-08T18:00:06.380279371Z 2021-04-08T18:00:06Z|1633238|timeval|WARN|Unreasonably long 3664ms poll interval (3640ms user, 0ms system)
2021-04-08T18:00:10.142240431Z 2021-04-08T18:00:10Z|1633245|timeval|WARN|Unreasonably long 3762ms poll interval (3740ms user, 0ms system)
2021-04-08T18:04:54.926905224Z 2021-04-08T18:04:54Z|1633261|timeval|WARN|Unreasonably long 3862ms poll interval (3837ms user, 1ms system)
2021-04-08T18:04:58.653339573Z 2021-04-08T18:04:58Z|1633268|timeval|WARN|Unreasonably long 3726ms poll interval (3705ms user, 0ms system)
2021-04-08T18:05:02.457451673Z 2021-04-08T18:05:02Z|1633277|timeval|WARN|Unreasonably long 3804ms poll interval (3780ms user, 1ms system)
2021-04-08T18:05:06.273079183Z 2021-04-08T18:05:06Z|1633285|timeval|WARN|Unreasonably long 3813ms poll interval (3790ms user, 0ms system)
2021-04-08T18:05:10.043811869Z 2021-04-08T18:05:10Z|1633291|timeval|WARN|Unreasonably long 3771ms poll interval (3747ms user, 2ms system)
2021-04-08T18:09:54.985966323Z 2021-04-08T18:09:54Z|1633300|timeval|WARN|Unreasonably long 3833ms poll interval (3809ms user, 1ms system)
2021-04-08T18:09:58.75175974Z 2021-04-08T18:09:58Z|1633307|timeval|WARN|Unreasonably long 3766ms poll interval (3741ms user, 1ms system)
2021-04-08T18:10:02.503164696Z 2021-04-08T18:10:02Z|1633312|timeval|WARN|Unreasonably long 3751ms poll interval (3727ms user, 0ms system)
2021-04-08T18:10:06.314634985Z 2021-04-08T18:10:06Z|1633318|timeval|WARN|Unreasonably long 3810ms poll interval (3785ms user, 1ms system)
2021-04-08T18:10:10.150057959Z 2021-04-08T18:10:10Z|1633323|timeval|WARN|Unreasonably long 3835ms poll interval (3810ms user, 0ms system)
2021-04-08T18:14:54.788238875Z 2021-04-08T18:14:54Z|1633330|timeval|WARN|Unreasonably long 3736ms poll interval (3719ms user, 1ms system)
2021-04-08T18:14:58.551843931Z 2021-04-08T18:14:58Z|1633336|timeval|WARN|Unreasonably long 3764ms poll interval (3743ms user, 1ms system)
2021-04-08T18:15:02.313595211Z 2021-04-08T18:15:02Z|1633343|timeval|WARN|Unreasonably long 3761ms poll interval (3735ms user, 2ms system)
2021-04-08T18:15:06.267997289Z 2021-04-08T18:15:06Z|1633351|timeval|WARN|Unreasonably long 3786ms poll interval (3761ms user, 1ms system)
2021-04-08T18:19:54.991272828Z 2021-04-08T18:19:54Z|1633361|timeval|WARN|Unreasonably long 3804ms poll interval (3779ms user, 2ms system)
2021-04-08T18:19:58.735127639Z 2021-04-08T18:19:58Z|1633367|timeval|WARN|Unreasonably long 3744ms poll interval (3722ms user, 0ms system)
2021-04-08T18:20:02.523506603Z 2021-04-08T18:20:02Z|1633372|timeval|WARN|Unreasonably long 3788ms poll interval (3763ms user, 2ms system)
2021-04-08T18:20:06.35541336Z 2021-04-08T18:20:06Z|1633378|timeval|WARN|Unreasonably long 3831ms poll interval (3806ms user, 0ms system)
2021-04-08T18:20:10.133747946Z 2021-04-08T18:20:10Z|1633383|timeval|WARN|Unreasonably long 3779ms poll interval (3757ms user, 0ms system)
2021-04-08T18:24:55.104664739Z 2021-04-08T18:24:55Z|1633394|timeval|WARN|Unreasonably long 3801ms poll interval (3775ms user, 1ms system)
2021-04-08T18:24:58.935782105Z 2021-04-08T18:24:58Z|1633400|timeval|WARN|Unreasonably long 3831ms poll interval (3806ms user, 0ms system)
2021-04-08T18:25:02.639753565Z 2021-04-08T18:25:02Z|1633405|timeval|WARN|Unreasonably long 3704ms poll interval (3680ms user, 1ms system)
2021-04-08T18:25:06.427775317Z 2021-04-08T18:25:06Z|1633411|timeval|WARN|Unreasonably long 3788ms poll interval (3762ms user, 0ms system)
2021-04-08T18:25:10.228294012Z 2021-04-08T18:25:10Z|1633416|timeval|WARN|Unreasonably long 3800ms poll interval (3775ms user, 0ms system)

On Average there are 4000ms poll intervals and around 400 minor faults (ranging from 1 minor fault to up to 2300 minor faults)

Also CPU usage report is at peak:

2021-04-08T18:15:06.268094833Z 2021-04-08T18:15:06Z|1633354|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-04-08T18:15:06.268121267Z 2021-04-08T18:15:06Z|1633355|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage)
2021-04-08T18:15:06.437193983Z 2021-04-08T18:15:06Z|1633356|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (100% CPU usage)
2021-04-08T18:19:54.991422758Z 2021-04-08T18:19:54Z|1633366|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (65% CPU usage)
2021-04-08T18:19:58.735249779Z 2021-04-08T18:19:58Z|1633371|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2021-04-08T18:20:02.52364114Z 2021-04-08T18:20:02Z|1633376|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2021-04-08T18:20:02.524290378Z 2021-04-08T18:20:02Z|1633377|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
2021-04-08T18:20:06.355500295Z 2021-04-08T18:20:06Z|1633381|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2021-04-08T18:20:06.35553017Z 2021-04-08T18:20:06Z|1633382|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-04-08T18:20:10.133779051Z 2021-04-08T18:20:10Z|1633386|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2021-04-08T18:20:11.456759906Z 2021-04-08T18:20:11Z|1633387|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-04-08T18:20:11.628415988Z 2021-04-08T18:20:11Z|1633388|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (101% CPU usage)
2021-04-08T18:20:12.761901284Z 2021-04-08T18:20:12Z|1633389|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-04-08T18:24:55.104809983Z 2021-04-08T18:24:55Z|1633399|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-04-08T18:24:58.93589203Z 2021-04-08T18:24:58Z|1633404|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (100% CPU usage)
2021-04-08T18:25:02.63986941Z 2021-04-08T18:25:02Z|1633409|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)
2021-04-08T18:25:02.640442529Z 2021-04-08T18:25:02Z|1633410|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (99% CPU usage)
2021-04-08T18:25:06.427872765Z 2021-04-08T18:25:06Z|1633414|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
2021-04-08T18:25:06.427901603Z 2021-04-08T18:25:06Z|1633415|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (x.x.x.x:40154<->x.x.x.y:9642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-04-08T18:25:10.228377383Z 2021-04-08T18:25:10Z|1633419|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (99% CPU usage)

Due to this problem, it may be the problem behind other issues such as network operator in degraded state:

omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.5.16   True       False        31m    Error while reconciling 4.5.16: an unknown error has occurred: MultipleErrors

omg get co
NAME                                      VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
authentication                            4.5.16   True       False        False     1d
cloud-credential                          4.5.16   True       False        False     107d
cluster-autoscaler                        4.5.16   True       False        False     107d
config-operator                           4.5.16   True       False        False     107d
console                                   4.5.16   True       False        False     1d
csi-snapshot-controller                   4.5.16   True       False        False     20d
dns                                       4.5.16   True       True         False     2d
etcd                                      4.5.16   True       False        False     20d
image-registry                            4.5.16   True       False        False     107d
ingress                                   4.5.16   True       False        False     45d
insights                                  4.5.16   True       False        False     208d
kube-apiserver                            4.5.16   True       False        False     1d
kube-controller-manager                   4.5.16   True       False        False     20d
kube-scheduler                            4.5.16   True       False        False     20d
kube-storage-version-migrator             4.5.16   True       False        False     20d
machine-api                               4.5.16   True       False        False     107d
machine-approver                          4.5.16   True       False        False     107d
machine-config                            4.5.16   False      False        True      2d
marketplace                               4.5.16   True       False        False     20d
monitoring                                4.5.16   False      True         True      24s
network                                   4.5.16   True       True         True      2d
node-tuning                               4.5.16   True       False        False     90d
openshift-apiserver                       4.5.16   True       False        False     2d
openshift-controller-manager              4.5.16   True       False        False     12d
openshift-samples                         4.5.16   True       False        False     107d
operator-lifecycle-manager                4.5.16   True       False        False     107d
operator-lifecycle-manager-catalog        4.5.16   True       False        False     107d
operator-lifecycle-manager-packageserver  4.5.16   True       False        False     20d
service-ca                                4.5.16   True       False        False     20d
storage                                   4.5.16   True       False        False     107d

Comment 2 Tim Rozet 2021-04-09 14:06:19 UTC
Note it was also seen that:
So far on the connectivity side I found a bunch of these (about as early as logging starts on node 06), about every minute:

Apr 08 18:20:24 worker-006 hyperkube[4011520]: F0224 13:59:03.392898    8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1
Apr 08 18:21:44 worker-006 hyperkube[4011520]: F0224 13:59:03.392898    8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1

This is odd because this is just ovnkube setting the value in local OVSDB.

Comment 3 Federico Paolinelli 2021-04-09 14:23:21 UTC
The 5 seconds interval resonates with https://bugzilla.redhat.com/show_bug.cgi?id=1908921 which was fixed in 4.5.27

Comment 14 Tim Rozet 2021-04-14 13:32:38 UTC
The dropped traffic every 5 minutes was due to stale ports for a previous instance of the pod on another node. This was causing ovn-controllers on two different nodes to thrash as they both kept binding the port, and that will cause traffic drop. The root cause of this was kubelet was not sending the events on the stale node, covered by:
https://bugzilla.redhat.com/show_bug.cgi?id=1948052

However, we can make this more robust on the SDN side, by specifically telling OVN where the current pod should be bound. This will prevent the thrashing in the event that there are stale ports leftover for whatever reason.

Additionally, after fixing the above issue by restarting kubelet the affected worker; TCP connections were being closed every hour. This was determined to be due to Istio's default behavior with idle connections and not an OVN issue.

Comment 16 Ross Brattain 2021-04-21 00:01:59 UTC


Verified on 4.8.0-0.nightly-2021-04-20-032026

requested-chassis is set.

sh-4.4# ovn-nbctl --no-leader-only --format=csv     list Logical_Switch_Port | grep test-rc
124b0b93-9a97-4d83-a1ae-92c3b83a7784,"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jwq7p,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],"""""",true
854ef321-b993-4d78-b167-5045f6841bdd,"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vxj2k,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],"""""",true
0040de26-dac8-4160-8630-56129d15b290,"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-gzm87,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],"""""",true
14140fc7-8516-4e56-8890-b6708aa3ca9d,"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vtcch,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],"""""",true
8023b0cf-3992-49bf-8521-edb6193835d3,"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-rn2qp,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],"""""",true
7e57a8d7-7f75-4486-9c1b-364855aca405,"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-9r45q,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],"""""",true
9603d659-dcd2-4996-9a6b-a8107e325985,"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jljjx,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],"""""",true
ca7e6f5c-44cd-4aeb-a806-e00fbdb22776,"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-dqvqx,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],"""""",true
a7d0335c-8bc7-4e1c-851a-4bde4292b907,"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-fw296,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],"""""",true
cbf227f9-2c77-4d75-8abb-241c5e29f530,"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-xn5db,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],"""""",true

Comment 19 errata-xmlrpc 2021-07-27 22:58:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 20 Red Hat Bugzilla 2023-09-15 01:04:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days