Description of problem:
During the pod density test, pods are stuck in the ContainerCreating state due to a failing CNI request.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-11-052508

How reproducible:
100%

Steps to Reproduce:
1. Scale up the cluster to 20 worker nodes.
2. Create 2000 projects (200 per node):
   - git clone https://github.com/openshift/svt.git
   - cd svt/openshift_scalability
   - touch test.yaml
   - vim test.yaml
```yaml
projects:
  - num: 2000
    basename: svt-
    templates:
      - num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```
   - cp $KUBECONFIG ~/.kube/config
   - python cluster-loader.py -f test.yaml -p 5
3. Delete the projects: oc delete project -l purpose=test
4. Change the number of projects to 4000: vim test.yaml
5. Create 4000 projects: python cluster-loader.py -f test.yaml -p 5

Actual results:
Pods are stuck with ContainerCreating status:
events:
Warning FailedCreatePodSandBox 95m kubelet, ip-10-0-148-115.us-west-2.compute.internal Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_svt-1682_9f936fa9-0b64-4f35-89c5-a1095288dbf3_0(adb0256597951e768b417f453cf6640eb15737b9c64d1facf1541d6f4ae9910c): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

Expected results:
All pods are created without problems.

Additional info:
For the same reason, I can't collect oc adm must-gather (its pod can't be created).
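To put a number on the failure described above, a minimal sketch (not part of the svt tooling) could count the pods whose containers are waiting with reason ContainerCreating, grouped by node. The function names `stuck_pods`, `stuck_per_node`, and `fetch_pod_list` are hypothetical, and the sketch assumes `oc` is on the PATH and logged in to the cluster under test.

```python
import json
import subprocess
from collections import Counter


def stuck_pods(pod_list, reason="ContainerCreating"):
    """Return (namespace, name, node) for each pod that has at least one
    container or init container waiting with the given reason."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status") or {}
        statuses = (status.get("containerStatuses") or []) + (
            status.get("initContainerStatuses") or []
        )
        waiting_reasons = [
            ((cs.get("state") or {}).get("waiting") or {}).get("reason")
            for cs in statuses
        ]
        if reason in waiting_reasons:
            meta = pod.get("metadata") or {}
            node = (pod.get("spec") or {}).get("nodeName")
            stuck.append((meta.get("namespace"), meta.get("name"), node))
    return stuck


def stuck_per_node(pod_list):
    """Count stuck pods per node, to see whether the failures cluster."""
    return Counter(node for _, _, node in stuck_pods(pod_list))


def fetch_pod_list():
    """Assumption: `oc` is installed and authenticated against the cluster."""
    result = subprocess.run(
        ["oc", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `stuck_per_node(fetch_pod_list()).most_common()` after step 5 would list nodes by number of stuck pods, which may help tell a cluster-wide annotation timeout apart from a problem on a few nodes.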
@Ben How should we track scalability-related bugs such as this one? By tagging them?
/Alex
Hi Simon,
Could you re-test with the newer version of OVN? We've had a lot of performance improvements land recently, and we suspect the issue might already be resolved.
Thanks in advance!
-Alex
Retest negative. The same issue.

oc get clusterversions
4.4.0-0.nightly-2020-03-02-201804

ovnkube version 0.3.0
ovn-controller (Open vSwitch) 2.12.0
OpenFlow versions 0x4:0x4
Marking TestBlocker for PerfScale pod density tests.
Dan, Aniket thinks you have some PRs in flight that help with this. When they land, can you get someone on our team to test this and then, if it is good, get Joe to kick off a new scale test (after a backport)? Moved to 4.5, but any fix for this is a strong candidate for a 4.4 (or 4.3) backport.
Hi skordas,
Can I move the QE contact to you so you can verify this bug once the issue is fixed? Thanks.
Will this be fixed in 4.4 before release? If yes, we should have a bug to track it for 4.4.
It's highly likely that both the OVN and ovnkube scalability changes have fixed this issue (e.g., monitor-all and some ovnkube master changes). Can we retest scaling to 200 nodes?