1820737 – Under condition of heavy pod creation: "failed to set annotation on pod ... not found"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Aniket Bhat
QA Contact:	Raul Sevilla
Docs Contact:
URL:
Whiteboard:	aos-scalability-44,SDN-CI-IMPACT
Duplicates (2):	1801890 1822353
Depends On:
Blocks:	1838259

Reported:	2020-04-03 18:30 UTC by Raul Sevilla
Modified:	2021-02-09 11:11 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1838259
Environment:
Last Closed:	2020-07-13 17:25:33 UTC
Target Upstream Version:
Embargoed:

Attachments
ovn-controller.log (11.79 MB, application/gzip) 2020-05-07 17:46 UTC, Joe Talerico
northd log (3.78 MB, application/gzip) 2020-05-07 17:46 UTC, Joe Talerico
must gather (11.59 MB, application/gzip) 2020-05-07 18:29 UTC, Joe Talerico

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovn-kubernetes pull 172	0	None	closed	Bug 1820737: scale: Enable parallel pod creation	2021-02-17 05:35:36 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:25:56 UTC

Comment 4 Aniket Bhat 2020-04-20 22:01:10 UTC

*** Bug 1822353 has been marked as a duplicate of this bug. ***

Comment 6 Dan Williams 2020-04-27 15:08:08 UTC

Can we get a retest with latest ovn-kubernetes in 4.4? We've made some significant changes since April 3rd.

Comment 8 Joe Talerico 2020-05-07 17:46:23 UTC

Created attachment 1686263 [details]
ovn-controller.log

Comment 9 Joe Talerico 2020-05-07 17:46:44 UTC

Created attachment 1686264 [details]
northd log

Comment 10 Joe Talerico 2020-05-07 18:29:29 UTC

Created attachment 1686282 [details]
must gather

Comment 11 Dan Williams 2020-05-07 20:18:47 UTC

So in this case (annotation on pod not found) we would want both ovnkube-node logs for the node with the problem, and ovnkube-master logs from the leader. This is purely an issue between ovnkube master/node.
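
As a rough illustration of the log request above, the sketch below prints (it does not run) the `oc logs` commands for the two pods in question. The function name is hypothetical, and the container names `ovnkube-node` and `ovnkube-master` are assumptions that are not stated in this bug; only the `openshift-ovn-kubernetes` namespace appears elsewhere in this report. Identifying which ovnkube-master pod is the current leader is left to the reader.

```shell
#!/bin/sh
# Sketch only: prints the `oc logs` commands instead of running them.
# ASSUMPTION: the containers are named "ovnkube-node" and "ovnkube-master".
ovnkube_log_cmds() {
  node_pod=$1    # ovnkube-node pod running on the node with the problem
  master_pod=$2  # ovnkube-master pod that is currently the leader
  ns=openshift-ovn-kubernetes
  printf 'oc logs -n %s %s -c ovnkube-node\n' "$ns" "$node_pod"
  printf 'oc logs -n %s %s -c ovnkube-master\n' "$ns" "$master_pod"
}
```

The printed commands can be reviewed and then pasted into a shell with cluster access, redirecting each to a file for attachment.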

Comment 20 Aniket Bhat 2020-05-20 13:56:23 UTC

*** Bug 1801890 has been marked as a duplicate of this bug. ***

Comment 24 Weibin Liang 2020-05-26 23:27:59 UTC

After creating the requested resources in about 300 namespaces, my cluster is no longer accessible and I get the errors below:

[weliang@weliang tools]$
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
[weliang@weliang tools]$ oc get project
Unable to connect to the server: EOF
[weliang@weliang tools]$ 
[weliang@weliang tools]$ oc get project
NAME      AGE
cluster   3h2m

The VM type in my cluster is m1.xlarge. Tomorrow I will continue testing on other cluster types.

Comment 26 Weibin Liang 2020-05-27 17:36:05 UTC

Hi Marko,

The fix is in the latest v4.5 image; it has not been backported to v4.4 yet.
Could you run the same test on a v4.5 image?

Thanks,
Weibin

Comment 27 zhaozhanqi 2020-05-28 02:27:07 UTC

This issue can also be reproduced in 4.5.0-0.nightly-2020-05-27-103700; see https://bugzilla.redhat.com/show_bug.cgi?id=1836376#c8

Comment 32 Dustin Black 2020-05-29 17:43:37 UTC

We've hit the "failed to get pod annotation" error consistently on 4.4.0-nightly-2020-05-27-032519 with a request of only 50 pods, which I don't think qualifies as a "thundering herd" case. I actually wonder if we shouldn't have closed BZ 1801890 as a dupe of this one and should instead be tracking the conditions separately since we can trigger this with a very modest load request.

Comment 34 Aniket Bhat 2020-06-04 14:07:44 UTC

Since the fix is already in master and also in 4.5 since this was committed before it was branched, can we close this bug out?

@Joe, @Raul, wdyt?

Comment 35 Joe Talerico 2020-06-04 16:25:10 UTC

(In reply to Aniket Bhat from comment #34)
> Since the fix is already in master and also in 4.5 since this was committed
> before it was branched, can we close this bug out?
> 
> @Joe, @Raul, wdyt?

Which "fix"? IMHO this was the jsonrpc + raft election modification. 

My concern is that the "fix" we are discussing here is just the jsonrpc change, but we also need to ensure that the CNO / OVNKube deployment sets the raft election timer.

Comment 36 Dan Williams 2020-06-05 19:04:05 UTC

Let's have two bugs for this. We use this bug for the IPAM and a different one for raft/jsonrpc.

Comment 37 Dan Williams 2020-06-05 19:10:39 UTC

To be clear, there were three fixes related to heavy pod creation:

1) jsonrpc inactivity timeout on raft connections  (NOT MERGED, in openvswitch2.13-2.13.0-24 and later RPM)
2) do IPAM in ovnkube master, not ovn-northd (NOT MERGED, https://github.com/ovn-org/ovn-kubernetes/pull/1365 )
3) parallel pod handlers in ovnkube-master (MERGED 4.5/4.6)

Comment 38 Dan Williams 2020-06-05 19:16:40 UTC

4) raft election timer change

Comment 39 Dan Williams 2020-06-05 19:26:06 UTC

[edit/update] To be clear, there were four fixes related to heavy pod creation:

1) jsonrpc inactivity timeout on raft connections  (NOT MERGED, in openvswitch2.13-2.13.0-24 and later RPM)
2) do IPAM in ovnkube master, not ovn-northd (NOT MERGED, https://github.com/ovn-org/ovn-kubernetes/pull/1365 )
3) parallel pod handlers in ovnkube-master (MERGED 4.5/4.6)
4) raft election timer change (https://github.com/openshift/cluster-network-operator/pull/615 )

Comment 40 Joe Talerico 2020-06-05 19:35:02 UTC

From our testing, stability is greatly improved with the two items below (OCP 4.5 with OVNKube). We tested with 200 nodes, 1000 namespaces, and 10000 pods, and confirmed the result on two different environments over many iterations of the same test.

1) jsonrpc patch to OVN - https://patchwork.ozlabs.org/project/openvswitch/patch/20200331002104.26230-1-zhewang@nvidia.com/
2) Changing the RAFT election timer in OVN sbdb and nbdb. ( https://github.com/ovn-org/ovn-kubernetes/pull/1276/files , https://github.com/openshift/cluster-network-operator/pull/615 )* 

The IPAM solution is a nice-to-have that reduces the "pod annotation" symptom.


* The sbdb/nbdb timer settings:
 oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb --  ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 2000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb --  ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 4000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb --  ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 8000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb --  ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 10000

 oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 4000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 8000
 oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb --  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 16000
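
The stepped values above (2000, 4000, 8000, then the final value) are consistent with the OVSDB raft rule, as I understand it, that `cluster/change-election-timer` accepts a new value of at most double the current one. A minimal sketch of generating such a sequence follows; the function names are hypothetical, and the starting value of 1000 ms is an assumed default rather than something stated in this bug. The second function only prints the commands in the same form as those above and does not run them.

```shell
#!/bin/sh
# Print the election-timer values (ms) needed to move from CURRENT to
# TARGET when each change may at most double the current value.
election_timer_steps() {
  current=$1
  target=$2
  # Guard against a non-positive start, which would never reach the target.
  if [ "$current" -le 0 ]; then
    return 1
  fi
  steps=""
  while [ "$current" -lt "$target" ]; do
    next=$((current * 2))
    if [ "$next" -gt "$target" ]; then
      next=$target
    fi
    steps="${steps:+$steps }$next"
    current=$next
  done
  printf '%s\n' "$steps"
}

# Print (not run) one `oc exec ... ovn-appctl` command per step.
# Arguments: pod, container, control socket path, database, from-ms, to-ms.
print_timer_cmds() {
  pod=$1; container=$2; ctl=$3; db=$4; from=$5; to=$6
  for ms in $(election_timer_steps "$from" "$to"); do
    printf 'oc exec -n openshift-ovn-kubernetes %s -c %s -- ovn-appctl -t %s cluster/change-election-timer %s %s\n' \
      "$pod" "$container" "$ctl" "$db" "$ms"
  done
}
```

For example, `election_timer_steps 1000 10000` prints `2000 4000 8000 10000`, matching the Northbound sequence above, and `election_timer_steps 1000 16000` prints `2000 4000 8000 16000`, matching the Southbound one.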

Comment 41 Aniket Bhat 2020-06-08 14:31:45 UTC

Can we close this bug? @dcbw, @joe

Comment 42 Aniket Bhat 2020-06-08 19:05:17 UTC

The parallel pod handler fix, which addresses the annotation issue as originally seen, together with the timer tweaks, is all that is needed in 4.5 from the ovn-kubernetes standpoint. Marking this bug as ON_QA and changing the target release to 4.5.

Comment 43 Rashid Khan 2020-06-10 14:24:35 UTC

Hi Aniket, who is tasked with the backport to 4.4.z? Just curious.

Comment 44 Dan Williams 2020-06-15 19:10:33 UTC

@rashid 4.4.z is merged and verified as well via https://github.com/openshift/ovn-kubernetes/pull/173

Comment 45 errata-xmlrpc 2020-07-13 17:25:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409