*** Bug 1822353 has been marked as a duplicate of this bug. ***
Can we get a retest with latest ovn-kubernetes in 4.4? We've made some significant changes since April 3rd.
Created attachment 1686263 [details] ovn-controller.log
Created attachment 1686264 [details] northd log
Created attachment 1686282 [details] must gather
So in this case (pod annotation not found) we would want both the ovnkube-node logs from the node with the problem and the ovnkube-master logs from the leader. This is purely an issue between ovnkube master and node.
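For reference, a minimal sketch of how those two logs could be collected. The pod names here are placeholders (look up the real ones first with `oc get pods -n openshift-ovn-kubernetes -o wide`), and the commands are printed rather than executed so they can be reviewed; drop the `echo` to run them for real:

```shell
# Sketch, assuming the usual openshift-ovn-kubernetes namespace.
# Pod names below are placeholders -- substitute the ovnkube-node pod
# running on the affected node and the current ovnkube-master leader.
NS=openshift-ovn-kubernetes
NODE_POD=ovnkube-node-xxxxx      # placeholder: pod on the affected node
MASTER_POD=ovnkube-master-yyyyy  # placeholder: current master leader

# Print the collection commands (remove "echo" to actually gather the logs):
echo "oc logs -n $NS $NODE_POD -c ovnkube-node"
echo "oc logs -n $NS $MASTER_POD -c ovnkube-master"
```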
*** Bug 1801890 has been marked as a duplicate of this bug. ***
After creating those request resources in about 300 namespaces, my cluster is no longer accessible and I get the errors below:

[weliang@weliang tools]$
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
unable to recognize Get https://api.weliang-267.qe.devcluster.openshift.com:6443/api?timeout=32s: EOF
[weliang@weliang tools]$ oc get project
Unable to connect to the server: EOF
[weliang@weliang tools]$
[weliang@weliang tools]$ oc get project
NAME      AGE
cluster   3h2m

The VM type in my cluster is m1.xlarge; tomorrow I will continue testing on different types of clusters.
Hi Marko,

The fix is in the latest v4.5 image; it has not been backported to v4.4 yet. Could you run the same test on a v4.5 image?

Thanks,
Weibin
this issue also can be reproduced in 4.5.0-0.nightly-2020-05-27-103700 see https://bugzilla.redhat.com/show_bug.cgi?id=1836376#c8
We've hit the "failed to get pod annotation" error consistently on 4.4.0-nightly-2020-05-27-032519 with a request of only 50 pods, which I don't think qualifies as a "thundering herd" case. I actually wonder if we shouldn't have closed BZ 1801890 as a dupe of this one and should instead be tracking the conditions separately since we can trigger this with a very modest load request.
Since the fix is already in master and also in 4.5 since this was committed before it was branched, can we close this bug out? @Joe, @Raul, wdyt?
(In reply to Aniket Bhat from comment #34)
> Since the fix is already in master and also in 4.5 since this was committed
> before it was branched, can we close this bug out?
>
> @Joe, @Raul, wdyt?

Which "fix"? IMHO this was the jsonrpc + raft election modification. My concern is that the "fix" we are discussing here is just the jsonrpc change; we also need to ensure the CNO / OVNKube deployment sets the raft election timer.
Let's have two bugs for this. We use this bug for the IPAM and a different one for raft/jsonrpc.
To be clear, there were three fixes related to heavy pod creation:

1) jsonrpc inactivity timeout on raft connections (NOT MERGED, in openvswitch2.13-2.13.0-24 and later RPM)
2) do IPAM in ovnkube-master, not ovn-northd (NOT MERGED, https://github.com/ovn-org/ovn-kubernetes/pull/1365 )
3) parallel pod handlers in ovnkube-master (MERGED 4.5/4.6)
4) raft election timer change
[edit/update] To be clear, there were four fixes related to heavy pod creation:

1) jsonrpc inactivity timeout on raft connections (NOT MERGED, in openvswitch2.13-2.13.0-24 and later RPM)
2) do IPAM in ovnkube-master, not ovn-northd (NOT MERGED, https://github.com/ovn-org/ovn-kubernetes/pull/1365 )
3) parallel pod handlers in ovnkube-master (MERGED 4.5/4.6)
4) raft election timer change (https://github.com/openshift/cluster-network-operator/pull/615 )
From our testing, with the two line items below the stability is greatly improved (OCP 4.5 w/ OVNKube). Our testing was with 200 nodes, 1000 namespaces, and 10000 pods. This was confirmed on two different environments with many iterations of the same test.

1) jsonrpc patch to OVN - https://patchwork.ozlabs.org/project/openvswitch/patch/20200331002104.26230-1-zhewang@nvidia.com/
2) Changing the RAFT election timer in the OVN sbdb and nbdb ( https://github.com/ovn-org/ovn-kubernetes/pull/1276/files , https://github.com/openshift/cluster-network-operator/pull/615 )*

The IPAM solution is a nice-to-have to reduce the "pod annotation" symptom.

* The sbdb/nbdb timer settings:

oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 2000
oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 4000
oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 8000
oc exec -n openshift-ovn-kubernetes ovnkube-master-p9ztr -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/change-election-timer OVN_Northbound 10000
oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 2000
oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 4000
oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 8000
oc exec -n openshift-ovn-kubernetes ovnkube-master-xzkh8 -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 16000
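A note on why the timer has to be raised in steps: as I understand it, the ovsdb raft implementation accepts at most a doubling of the election timer per `cluster/change-election-timer` call, hence the 2000/4000/8000/16000 sequence. A small sketch that generates the southbound command sequence (it only prints the commands so the steps can be reviewed; drop the `echo` to execute them, and substitute your own sbdb pod name):

```shell
# Sketch: print the stepped election-timer commands for the southbound DB.
# Assumption: raft allows at most doubling the timer per call, so we step
# 2000 -> 4000 -> 8000 -> 16000. Pod name is the one from this cluster.
NS=openshift-ovn-kubernetes
SB_POD=ovnkube-master-xzkh8   # substitute your own sbdb pod
TARGET=16000
t=2000
while [ "$t" -le "$TARGET" ]; do
  echo "oc exec -n $NS $SB_POD -c sbdb -- ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound $t"
  t=$((t * 2))
done
```

The current timer value can be checked afterwards with `ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound`.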
Can we close this bug? @dcbw, @joe
The parallel pod handler fix, which is what addresses the annotations issue as originally seen, along with the timer tweaks, is all that is needed in 4.5 from the ovn-kubernetes standpoint. Marking this bug as ON_QA and changing the target release to 4.5.
Hi Aniket, who is tasked with the backport to 4.4.z? Just curious.
@rashid 4.4.z is merged and verified as well via https://github.com/openshift/ovn-kubernetes/pull/173
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409