Description of problem:
As per Dan Williams' request from https://bugzilla.redhat.com/show_bug.cgi?id=1959352#c38, opening a new BZ.

Version-Release number of selected component (if applicable):
OCP 4.8 and 4.9

How reproducible:
100%

Steps to Reproduce:
See comment below

Actual results:

Expected results:

Additional info:
Looking at the timings, there is an outage of almost an hour during which pods are not getting created. For example, these two pods are created at about the same time, around 09:12:

2022-02-07T09:12:00.572809071Z I0207 09:12:00.572755 1 pods.go:282] [kubelet-density/kubelet-density-1000] addLogicalPort took 69.908103ms, OVN Execute time 13.269817ms, pod Annotation time: 53.815214ms
2022-02-07T09:12:00.602623915Z I0207 09:12:00.602561 1 pods.go:282] [kubelet-density/kubelet-density-1001] addLogicalPort took 45.126374ms, OVN Execute time 17.719173ms, pod Annotation time: 24.597634ms

In CNI we can see the add requests start around this time (on two different nodes):

2022-02-07T09:12:01.041782603Z I0207 09:12:01.041729 4592 cni.go:169] [kubelet-density/kubelet-density-1000 248afd1f9a85fd39d5a9af20cf7ba9aa58907920b6444481cea2115c4b7a710a] ADD starting CNI request [kubelet-density/kubelet-density-1000 248afd1f9a85fd39d5a9af20cf7ba9aa58907920b6444481cea2115c4b7a710a]
2022-02-07T09:12:01.074982189Z I0207 09:12:01.074933 4751 cni.go:169] [kubelet-density/kubelet-density-1001 263c9dec1635cb12f642f98ebd6717df498ab43255c8d653da594ffb3439c820] ADD starting CNI request [kubelet-density/kubelet-density-1001 263c9dec1635cb12f642f98ebd6717df498ab43255c8d653da594ffb3439c820]

However, neither of them completes until almost an hour later, around 10:04:

2022-02-07T10:04:36.391561647Z I0207 10:04:36.391502 4592 cni.go:190] [kubelet-density/kubelet-density-1000 c67f4174a09ca9176bc811a9420fcfa2246a07150e8467ede221a9d258062c34] ADD finished CNI request [kubelet-density/kubelet-density-1000 c67f4174a09ca9176bc811a9420fcfa2246a07150e8467ede221a9d258062c34], result "{\"interfaces\":[{\"name\":\"c67f4174a09ca91\",\"mac\":\"46:bd:48:70:d8:49\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:02:1d:6d\",\"sandbox\":\"/var/run/netns/e52ddbeb-119d-4c7f-a88e-57ef157ce337\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.2.29.109/23\",\"gateway\":\"10.2.28.1\"}],\"dns\":{}}", err <nil>
2022-02-07T10:04:41.088582137Z I0207 10:04:41.086894 4751 cni.go:190] [kubelet-density/kubelet-density-1001 8fbc12de4fcc39800118e1be6407da9c1c3293407cfee68e5412d3248cc27ecf] ADD finished CNI request [kubelet-density/kubelet-density-1001 8fbc12de4fcc39800118e1be6407da9c1c3293407cfee68e5412d3248cc27ecf], result "{\"interfaces\":[{\"name\":\"8fbc12de4fcc398\",\"mac\":\"3e:37:83:c1:83:97\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:02:21:00\",\"sandbox\":\"/var/run/netns/820d91e5-6e8f-491e-8496-79cfa842bdb7\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.2.33.0/23\",\"gateway\":\"10.2.32.1\"}],\"dns\":{}}", err <nil>

Looking at ovn-controller on each node, there is nothing in the log for this period of time, indicating ovn-controller is getting no updates from SBDB. Looking at SBDB reveals very high CPU usage and disconnects from northd:

2022-02-07T09:11:28.600592364Z 2022-02-07T09:11:28Z|05778|timeval|WARN|context switches: 0 voluntary, 331 involuntary
2022-02-07T09:11:28.602739922Z 2022-02-07T09:11:28Z|05779|coverage|INFO|Skipping details of duplicate event coverage for hash=03003a1e
2022-02-07T09:18:12.990317236Z 2022-02-07T09:18:12Z|05780|ovsdb|INFO|OVN_Southbound: Database compaction took 3291ms
2022-02-07T09:18:12.995595460Z 2022-02-07T09:18:12Z|05781|timeval|WARN|Unreasonably long 3296ms poll interval (2701ms user, 375ms system)
2022-02-07T09:18:12.998561646Z 2022-02-07T09:18:12Z|05782|timeval|WARN|faults: 63764 minor, 0 major
2022-02-07T09:18:13.001288515Z 2022-02-07T09:18:13Z|05783|timeval|WARN|disk: 0 reads, 105792 writes
2022-02-07T09:18:13.003741584Z 2022-02-07T09:18:13Z|05784|timeval|WARN|context switches: 15 voluntary, 879 involuntary
2022-02-07T09:18:13.005990399Z 2022-02-07T09:18:13Z|05785|coverage|INFO|Skipping details of duplicate event coverage for hash=03003a1e
2022-02-07T09:18:13.008356347Z 2022-02-07T09:18:13Z|05786|poll_loop|INFO|Dropped 36 log messages in last 969 seconds (most recently, 966 seconds ago) due to excessive rate
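The length of the stall can be read straight off the pod-1000 ADD timestamps above; a small sketch (timestamp strings copied verbatim from the logs, with nanoseconds truncated to microseconds so Python's `datetime` can parse them):

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # Drop the trailing 'Z' and truncate the 9-digit nanosecond field to
    # 6 digits, since datetime.fromisoformat only accepts microseconds.
    base, frac = ts.rstrip("Z").split(".")
    return datetime.fromisoformat(f"{base}.{frac[:6]}")

start = parse_ts("2022-02-07T09:12:01.041782603Z")   # ADD starting
finish = parse_ts("2022-02-07T10:04:36.391561647Z")  # ADD finished
stall = (finish - start).total_seconds()
print(f"CNI ADD for kubelet-density-1000 took {stall / 60:.1f} minutes")
```

That works out to roughly 52.6 minutes for a request that normally completes in well under a second.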
2022-02-07T09:18:13.010728062Z 2022-02-07T09:18:13Z|05787|poll_loop|INFO|wakeup due to [POLLIN] on fd 13 (0.0.0.0:9642<->) at ../lib/stream-ssl.c:968 (93% CPU usage)
2022-02-07T09:18:13.075353470Z 2022-02-07T09:18:13Z|05788|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:18:13.075465735Z 2022-02-07T09:18:13Z|05789|jsonrpc|WARN|ssl:10.200.131.171:42884: receive error: Protocol error
2022-02-07T09:18:13.075731921Z 2022-02-07T09:18:13Z|05790|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:274 (93% CPU usage)
2022-02-07T09:18:13.075984623Z 2022-02-07T09:18:13Z|05791|reconnect|WARN|ssl:10.200.131.171:42884: connection dropped (Protocol error)
2022-02-07T09:18:13.076063621Z 2022-02-07T09:18:13Z|05792|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 36 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (93% CPU usage)
2022-02-07T09:18:13.248967852Z 2022-02-07T09:18:13Z|05793|poll_loop|INFO|wakeup due to [POLLIN] on fd 25 (10.200.131.62:9644<->10.200.131.171:54262) at ../lib/stream-ssl.c:832 (93% CPU usage)
2022-02-07T09:18:15.497974485Z 2022-02-07T09:18:15Z|05794|poll_loop|INFO|wakeup due to 2249-ms timeout at ../ovsdb/ovsdb-server.c:296 (93% CPU usage)
2022-02-07T09:18:28.987622051Z 2022-02-07T09:18:28Z|05795|raft|INFO|received leadership transfer from c6bb in term 420
2022-02-07T09:18:28.988157878Z 2022-02-07T09:18:28Z|05796|raft|INFO|term 421: starting election
2022-02-07T09:18:28.995259167Z 2022-02-07T09:18:28Z|05797|raft|INFO|term 421: elected leader by 2+ of 3 servers
2022-02-07T09:31:41.859362155Z 2022-02-07T09:31:41Z|05798|reconnect|ERR|ssl:10.200.131.108:48496: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:32:17.732327327Z 2022-02-07T09:32:17Z|05799|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:32:17.732327327Z 2022-02-07T09:32:17Z|05800|jsonrpc|WARN|ssl:10.200.131.171:46652: receive error: Protocol error
2022-02-07T09:32:17.732555534Z 2022-02-07T09:32:17Z|05801|reconnect|WARN|ssl:10.200.131.171:46652: connection dropped (Protocol error)
2022-02-07T09:38:31.877447920Z 2022-02-07T09:38:31Z|05802|reconnect|ERR|ssl:10.200.131.171:50034: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:39:26.875325527Z 2022-02-07T09:39:26Z|05803|reconnect|ERR|ssl:10.200.131.62:37918: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:39:46.232969447Z 2022-02-07T09:39:46Z|05804|reconnect|ERR|ssl:10.200.131.108:42020: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05805|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05806|jsonrpc|WARN|ssl:10.200.131.108:54656: receive error: Protocol error
2022-02-07T09:48:47.660046399Z 2022-02-07T09:48:47Z|05807|reconnect|WARN|ssl:10.200.131.108:54656: connection dropped (Protocol error)
2022-02-07T09:54:04.292241130Z 2022-02-07T09:54:04Z|05808|reconnect|ERR|ssl:10.200.131.62:39800: no response to inactivity probe after 180 seconds, disconnecting
2022-02-07T09:56:55.260563846Z 2022-02-07T09:56:55Z|05809|stream_ssl|WARN|SSL_accept: system error (Success)
2022-02-07T09:56:55.260563846Z 2022-02-07T09:56:55Z|05810|jsonrpc|WARN|ssl:10.200.131.108:48958: receive error: Protocol error
2022-02-07T09:56:55.260739256Z 2022-02-07T09:56:55Z|05811|reconnect|WARN|ssl:10.200.131.108:48958: connection dropped (Protocol error)
2022-02-07T10:04:17.568359915Z 2022-02-07T10:04:17Z|05812|timeval|WARN|Unreasonably long 1492ms poll interval (1191ms user, 271ms system)
2022-02-07T10:04:17.581604716Z 2022-02-07T10:04:17Z|05813|timeval|WARN|faults: 59468 minor, 0 major
2022-02-07T10:04:17.582705226Z 2022-02-07T10:04:17Z|05814|timeval|WARN|disk: 0 reads, 42920 writes

Also, northd is not responding to inactivity probes, and when it reconnects to nbdb I see an extremely large poll interval:
2022-02-07T09:46:41.016747743Z 2022-02-07T09:46:41Z|87345|timeval|WARN|Unreasonably long 792299ms poll interval (776638ms user, 2939ms system)
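These timeval warnings can be scraped out of the ovsdb-server and northd logs mechanically to find the worst stalls; a minimal sketch (the regex and helper are illustrative, not part of any OVN tooling, and the sample lines are copied from the excerpts above):

```python
import re

# Match the stall duration in "Unreasonably long Nms poll interval" warnings.
STALL_RE = re.compile(r"Unreasonably long (\d+)ms poll interval")

def stalls_seconds(lines):
    """Return every reported poll-interval stall, converted to seconds."""
    return [int(m.group(1)) / 1000 for line in lines if (m := STALL_RE.search(line))]

log_lines = [
    "2022-02-07T09:18:12Z|05781|timeval|WARN|Unreasonably long 3296ms poll interval (2701ms user, 375ms system)",
    "2022-02-07T09:46:41Z|87345|timeval|WARN|Unreasonably long 792299ms poll interval (776638ms user, 2939ms system)",
]
worst = max(stalls_seconds(log_lines))
# The 792299ms northd stall is over 13 minutes of blocked main loop,
# far beyond the 180-second inactivity probe, which is consistent with
# the probe-timeout disconnects seen in the SBDB log.
print(f"worst stall: {worst / 60:.1f} min")
```

A main loop blocked for that long cannot answer inactivity probes, so the peers tear down the connection; the disconnect storm above is a symptom, not the cause.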
The OVN team believes the root cause is https://github.com/ovn-org/ovn/commit/e7d6d9922fc4d3c7365990ac8bac26a380e26341. The versions of OVN in 4.10 and 4.11 already contain this fix; we will need to bump the version in 4.9. We will scale test on 4.11 first to ensure this fix resolves the problem.
Verified that this issue is resolved in 4.11.0-0.nightly-2022-02-24-054925.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.