Bug 1822353

Summary: Under heavy object creation see error: "failed to set annotation on pod ... not found"
Product: OpenShift Container Platform
Reporter: agopi
Component: Networking
Assignee: Aniket Bhat <anbhat>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: anbhat, dblack, jtaleric, nelluri, rsevilla
Version: 4.3.z
Keywords: Performance
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-43
Last Closed: 2020-04-20 22:01:10 UTC
Type: Bug
Attachments: rss_usage_ovnkube_apiserver

Description agopi 2020-04-08 19:49:46 UTC
Created attachment 1677357
rss_usage_ovnkube_apiserver


Description of problem:

When creating a large number of objects across 1000 namespaces on a cluster with 100 worker nodes, the following error appears:

Warning  FailedCreatePodSandBox  25m  kubelet, ip-10-0-191-46.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_masterverticalovn1000proj100nodes700_b602b62f-4fc4-44fe-a606-201ab1300a57_0(a33d1aeee61f8ccf6ade592dc50d9465b517486fd3bf7448dbdbd04d4ae5bfe5): Multus: [masterverticalovn1000proj100nodes700/deploymentconfig0-1-deploy]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[masterverticalovn1000proj100nodes700/deploymentconfig0-1-deploy] failed to get pod annotation: timed out waiting for the condition
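
For triage at this scale it can help to pull the namespace, pod, and failure reason out of each event message rather than reading them by eye. The following is a minimal sketch, not part of the original report: the function name parse_sandbox_failure and the regular expression are assumptions derived only from the single event shown above, so other failure messages may not match.

```python
import re

# Pattern assumed from the one FailedCreatePodSandBox event in this report;
# it is illustrative and may not cover other CNI failure messages.
EVENT_RE = re.compile(
    r'error adding container to network "(?P<network>[^"]+)"'
    r'.*?CNI request failed with status (?P<status>\d+): '
    r"'\[(?P<namespace>[^/\]]+)/(?P<pod>[^\]]+)\] (?P<reason>.*)",
    re.DOTALL,
)


def parse_sandbox_failure(message):
    """Return the CNI failure fields of an event message, or None if no match."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Drop trailing padding and a closing quote, if the message has one.
    fields["reason"] = fields["reason"].strip().rstrip("'")
    return fields
```

Feeding every event message through this function and counting the distinct (namespace, pod) pairs gives a rough measure of how many pods hit the annotation timeout.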


A few moments later, ovnkube-master starts failing its readiness probe:

 Warning  Unhealthy  23m (x10 over 5d11h)  kubelet, ip-10-0-128-5.us-west-2.compute.internal  Readiness probe failed: command timed out  


Over time, the ovnkube-master container in the ovnkube-master pod consumes a large amount of memory and then hits OOM. Soon afterwards the API server does the same (as shown in the attached graph).




Version-Release number of selected component (if applicable): 4.3.9


Additional info:

Comment 1 Aniket Bhat 2020-04-20 22:01:10 UTC

*** This bug has been marked as a duplicate of bug 1820737 ***