Description of problem:
As per Dan Williams' request from https://bugzilla.redhat.com/show_bug.cgi?id=1959352#c38, opening a new BZ.

Version-Release number of selected component (if applicable):
OCP 4.8 and 4.9

How reproducible:
100%

Steps to Reproduce:
See comment below

Actual results:

Expected results:

Additional info:
Looking at the timings, there is an outage of almost an hour during which pods are not getting created. For example, these two pods are created at about the same time, around 09:12:

2022-02-07T09:12:00.572809071Z I0207 09:12:00.572755 1 pods.go:282] [kubelet-density/kubelet-density-1000] addLogicalPort took 69.908103ms, OVN Execute time 13.269817ms, pod Annotation time: 53.815214ms
2022-02-07T09:12:00.602623915Z I0207 09:12:00.602561 1 pods.go:282] [kubelet-density/kubelet-density-1001] addLogicalPort took 45.126374ms, OVN Execute time 17.719173ms, pod Annotation time: 24.597634ms

In CNI we can see the add requests start around this time (on two different nodes):

2022-02-07T09:12:01.041782603Z I0207 09:12:01.041729 4592 cni.go:169] [kubelet-density/kubelet-density-1000 248afd1f9a85fd39d5a9af20cf7ba9aa58907920b6444481cea2115c4b7a710a] ADD starting CNI request [kubelet-density/kubelet-density-1000 248afd1f9a85fd39d5a9af20cf7ba9aa58907920b6444481cea2115c4b7a710a]
2022-02-07T09:12:01.074982189Z I0207 09:12:01.074933 4751 cni.go:169] [kubelet-density/kubelet-density-1001 263c9dec1635cb12f642f98ebd6717df498ab43255c8d653da594ffb3439c820] ADD starting CNI request [kubelet-density/kubelet-density-1001 263c9dec1635cb12f642f98ebd6717df498ab43255c8d653da594ffb3439c820]

However, neither of them completes until almost an hour later, around 10:04:

2022-02-07T10:04:36.391561647Z I0207 10:04:36.391502 4592 cni.go:190] [kubelet-density/kubelet-density-1000 c67f4174a09ca9176bc811a9420fcfa2246a07150e8467ede221a9d258062c34] ADD finished CNI request [kubelet-density/kubelet-density-1000 c67f4174a09ca9176bc811a9420fcfa2246a07150e8467ede221a9d258062c34], result "{\"interfaces\":[{\"name\":\"c67f4174a09ca91\",\"mac\":\"46:bd:48:70:d8:49\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:02:1d:6d\",\"sandbox\":\"/var/run/netns/e52ddbeb-119d-4c7f-a88e-57ef157ce337\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.2.29.109/23\",\"gateway\":\"10.2.28.1\"}],\"dns\":{}}", err <nil>
2022-02-07T10:04:41.088582137Z I0207 10:04:41.086894 4751 cni.go:190] [kubelet-density/kubelet-density-1001 8fbc12de4fcc39800118e1be6407da9c1c3293407cfee68e5412d3248cc27ecf] ADD finished CNI request [kubelet-density/kubelet-density-1001 8fbc12de4fcc39800118e1be6407da9c1c3293407cfee68e5412d3248cc27ecf], result "{\"interfaces\":[{\"name\":\"8fbc12de4fcc398\",\"mac\":\"3e:37:83:c1:83:97\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:02:21:00\",\"sandbox\":\"/var/run/netns/820d91e5-6e8f-491e-8496-79cfa842bdb7\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.2.33.0/23\",\"gateway\":\"10.2.32.1\"}],\"dns\":{}}", err <nil>

Looking at ovn-controller on each node, there is nothing in the log for this period of time, indicating ovn-controller is getting no updates from SBDB. Looking at SBDB reveals very high CPU usage and disconnects from northd:

2022-02-07T09:11:28.600592364Z 2022-02-07T09:11:28Z|05778|timeval|WARN|context switches: 0 voluntary, 331 involuntary
2022-02-07T09:11:28.602739922Z 2022-02-07T09:11:28Z|05779|coverage|INFO|Skipping details of duplicate event coverage for hash=03003a1e
2022-02-07T09:18:12.990317236Z 2022-02-07T09:18:12Z|05780|ovsdb|INFO|OVN_Southbound: Database compaction took 3291ms
2022-02-07T09:18:12.995595460Z 2022-02-07T09:18:12Z|05781|timeval|WARN|Unreasonably long 3296ms poll interval (2701ms user, 375ms system)
2022-02-07T09:18:12.998561646Z 2022-02-07T09:18:12Z|05782|timeval|WARN|faults: 63764 minor, 0 major
2022-02-07T09:18:13.001288515Z 2022-02-07T09:18:13Z|05783|timeval|WARN|disk: 0 reads, 105792 writes
2022-02-07T09:18:13.003741584Z 2022-02-07T09:18:13Z|05784|timeval|WARN|context switches: 15 voluntary, 879 involuntary
2022-02-07T09:18:13.005990399Z 2022-02-07T09:18:13Z|05785|coverage|INFO|Skipping details of duplicate event coverage for hash=03003a1e
2022-02-07T09:18:13.008356347Z 2022-02-07T09:18:13Z|05786|poll_loop|INFO|Dropped 36 log messages in last 969 seconds (most recently, 966 seconds ago) due to excessive rate
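The length of the stall can be read straight off the pod-1000 ADD timestamps above; a small sketch (timestamp strings copied verbatim from the logs, with nanoseconds truncated to microseconds so Python's `datetime` can parse them):

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # Drop the trailing 'Z' and truncate the 9-digit nanosecond field to
    # 6 digits, since datetime.fromisoformat only accepts microseconds.
    base, frac = ts.rstrip("Z").split(".")
    return datetime.fromisoformat(f"{base}.{frac[:6]}")

start = parse_ts("2022-02-07T09:12:01.041782603Z")   # ADD starting
finish = parse_ts("2022-02-07T10:04:36.391561647Z")  # ADD finished
stall = (finish - start).total_seconds()
print(f"CNI ADD for kubelet-density-1000 took {stall / 60:.1f} minutes")
```

That works out to roughly 52.6 minutes for a request that normally completes in well under a second.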
2022-02-07T09:18:13.010728062Z 2022-02-07T09:18:13Z|05787|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (0.0.0.0:9642<->) at ../lib/stream-ssl.c:968 (93% CPU usage)
2022-02-07T09:18:13.075353470Z 2022-02-07T09:18:13Z|05788|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:18:13.075465735Z 2022-02-07T09:18:13Z|05789|jsonrpc|WARN|ssl:10.200.131.171:42884: receive error: Protocol error
2022-02-07T09:18:13.075731921Z 2022-02-07T09:18:13Z|05790|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:274 (93% CPU usage)
2022-02-07T09:18:13.075984623Z 2022-02-07T09:18:13Z|05791|reconnect|WARN|ssl:10.200.131.171:42884: connection dropped (Protocol error)
2022-02-07T09:18:13.076063621Z 2022-02-07T09:18:13Z|05792|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 36 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (93% CPU usage)
2022-02-07T09:18:13.248967852Z 2022-02-07T09:18:13Z|05793|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (10.200.131.62:9644<->10.200.131.171:54262) at ../lib/stream-ssl.c:832 (93% CPU usage)
2022-02-07T09:18:15.497974485Z 2022-02-07T09:18:15Z|05794|poll_loop|INFO|wakeup due to 2249-ms timeout at ../ovsdb/ovsdb-server.c:296 (93% CPU usage)
2022-02-07T09:18:28.987622051Z 2022-02-07T09:18:28Z|05795|raft|INFO|received leadership transfer from c6bb in term 420
2022-02-07T09:18:28.988157878Z 2022-02-07T09:18:28Z|05796|raft|INFO|term 421: starting election
2022-02-07T09:18:28.995259167Z 2022-02-07T09:18:28Z|05797|raft|INFO|term 421: elected leader by 2+ of 3 servers
2022-02-07T09:31:41.859362155Z 2022-02-07T09:31:41Z|05798|reconnect|ERR|ssl:10.200.131.108:48496: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:32:17.732327327Z 2022-02-07T09:32:17Z|05799|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:32:17.732327327Z 2022-02-07T09:32:17Z|05800|jsonrpc|WARN|ssl:10.200.131.171:46652: receive error: Protocol error
2022-02-07T09:32:17.732555534Z 2022-02-07T09:32:17Z|05801|reconnect|WARN|ssl:10.200.131.171:46652: connection dropped (Protocol error)
2022-02-07T09:38:31.877447920Z 2022-02-07T09:38:31Z|05802|reconnect|ERR|ssl:10.200.131.171:50034: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:39:26.875325527Z 2022-02-07T09:39:26Z|05803|reconnect|ERR|ssl:10.200.131.62:37918: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:39:46.232969447Z 2022-02-07T09:39:46Z|05804|reconnect|ERR|ssl:10.200.131.108:42020: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05805|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05806|jsonrpc|WARN|ssl:10.200.131.108:54656: receive error: Protocol error
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05807|reconnect|WARN|ssl:10.200.131.108:54656: connection dropped (Protocol error)
2022-02-07T09:54:04.292241130Z 2022-02-07T09:54:04Z|05808|reconnect|ERR|ssl:10.200.131.62:39800: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:56:55.260563846Z 2022-02-07T09:56:55Z|05809|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:56:55.260563846Z 2022-02-07T09:56:55Z|05810|jsonrpc|WARN|ssl:10.200.131.108:48958: receive error: Protocol error
2022-02-07T09:56:55.260739256Z 2022-02-07T09:56:55Z|05811|reconnect|WARN|ssl:10.200.131.108:48958: connection dropped (Protocol error)
2022-02-07T10:04:17.568359915Z 2022-02-07T10:04:17Z|05812|timeval|WARN|Unreasonably long 1492ms poll interval (1191ms user, 271ms system)
2022-02-07T10:04:17.581604716Z 2022-02-07T10:04:17Z|05813|timeval|WARN|faults: 59468 minor, 0 major
2022-02-07T10:04:17.582705226Z 2022-02-07T10:04:17Z|05814|timeval|WARN|disk: 0 reads, 42920 writes

Also, northd is not responding to inactivity probes, and when it reconnects to nbdb I see an extremely large poll interval:
2022-02-07T09:46:41.016747743Z 2022-02-07T09:46:41Z|87345|timeval|WARN|Unreasonably long 792299ms poll interval (776638ms user, 2939ms system)
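These timeval warnings can be scraped out of the ovsdb-server and northd logs mechanically to find the worst stalls; a minimal sketch (the regex and helper are illustrative, not part of any OVN tooling, and the sample lines are copied from the excerpts above):

```python
import re

# Match the stall duration in "Unreasonably long Nms poll interval" warnings.
STALL_RE = re.compile(r"Unreasonably long (\d+)ms poll interval")

def stalls_seconds(lines):
    """Return every reported poll-interval stall, converted to seconds."""
    return [int(m.group(1)) / 1000 for line in lines if (m := STALL_RE.search(line))]

log_lines = [
    "2022-02-07T09:18:12Z|05781|timeval|WARN|Unreasonably long 3296ms poll interval (2701ms user, 375ms system)",
    "2022-02-07T09:46:41Z|87345|timeval|WARN|Unreasonably long 792299ms poll interval (776638ms user, 2939ms system)",
]
worst = max(stalls_seconds(log_lines))
# The 792299ms northd stall is over 13 minutes of blocked main loop,
# far beyond the 180-second inactivity probe, which is consistent with
# the probe-timeout disconnects seen in the SBDB log.
print(f"worst stall: {worst / 60:.1f} min")
```

A main loop blocked for that long cannot answer inactivity probes, so the peers tear down the connection; the disconnect storm above is a symptom, not the cause.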
The OVN team believes the root cause is https://github.com/ovn-org/ovn/commit/e7d6d9922fc4d3c7365990ac8bac26a380e26341. The versions of OVN in 4.10 and 4.11 already contain this fix; we will need to bump the version in 4.9. We will scale test on 4.11 first to ensure this fix resolves the problem.
Verified that this issue is resolved in 4.11.0-0.nightly-2022-02-24-054925.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.