Bug 1801890

Summary: [OVN] Failed to create pod network sandbox due to failed CNI requests (when scaling to 200 pods per node)
Product: OpenShift Container Platform Reporter: Simon <skordas>
Component: Networking Assignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes QA Contact: Simon <skordas>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: aconstan, anbhat, bbennett, dblack, mifiedle, mkarg, pportant, rkhan, wsun
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-44,SDN-CI-IMPACT
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-20 13:56:23 UTC Type: Bug

Description Simon 2020-02-11 20:26:02 UTC
Description of problem:

During the pod density test, pods get stuck in the ContainerCreating state due to failing CNI requests.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-11-052508

How reproducible:
100%

Steps to Reproduce:
1. Scale up cluster to 20 working nodes.
2. Create 2000 projects (200 per node):
  - git clone https://github.com/openshift/svt.git
  - cd svt/openshift_scalability
  - touch test.yaml
  - vim test.yaml

```yaml
projects:
  - num: 2000
    basename: svt-
    templates:
      -
        num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```

  - cp $KUBECONFIG ~/.kube/config
  - python cluster-loader.py -f test.yaml -p 5

3. Delete projects: oc delete project -l purpose=test
4. Change the number of projects to 4000 in test.yaml: vim test.yaml
5. Create 4000 projects: python cluster-loader.py -f test.yaml -p 5 (a consolidated sketch of these steps follows below)
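
For convenience, a consolidated sketch of steps 1-5. It assumes a bash shell, a working oc login with $KUBECONFIG set, and the Python dependencies for cluster-loader.py; the purpose=test label used for cleanup is the one relied on in step 3:

```bash
#!/usr/bin/env bash
# Hedged reproduction sketch combining the steps above.
# Assumes the cluster has already been scaled to 20 worker nodes (step 1).
set -euo pipefail

git clone https://github.com/openshift/svt.git
cd svt/openshift_scalability

# cluster-loader.py reads ~/.kube/config.
cp "$KUBECONFIG" ~/.kube/config

# cluster-loader config from step 2: 2000 projects, one pause deployment config each.
cat > test.yaml <<'EOF'
projects:
  - num: 2000
    basename: svt-
    templates:
      -
        num: 1
        file: ./content/deployment-config-1rep-pause-template.json
EOF

python cluster-loader.py -f test.yaml -p 5

# Step 3: clean up, then repeat with 4000 projects (steps 4-5).
oc delete project -l purpose=test
sed -i 's/num: 2000/num: 4000/' test.yaml
python cluster-loader.py -f test.yaml -p 5
```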

Actual results:
Pods are stuck with ContainerCreating status:
events:

  Warning  FailedCreatePodSandBox  95m        kubelet, ip-10-0-148-115.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_svt-1682_9f936fa9-0b64-4f35-89c5-a1095288dbf3_0(adb0256597951e768b417f453cf6640eb15737b9c64d1facf1541d6f4ae9910c): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
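
The "failed to get pod annotation: timed out waiting for the condition" text indicates the CNI plugin gave up waiting for the pod annotation that ovnkube-master is supposed to write. A minimal diagnostic sketch, using the pod and namespace from the event above; the k8s.ovn.org/pod-networks annotation key is an assumption for this release:

```bash
# Check whether ovnkube-master ever wrote the pod-networks annotation on a stuck pod.
# (Annotation key k8s.ovn.org/pod-networks is assumed; pod/namespace are taken from the event above.)
oc get pod deploymentconfig0-1-deploy -n svt-1682 \
  -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}{"\n"}'

# Count pods still stuck in ContainerCreating across the cluster.
oc get pods --all-namespaces --no-headers | grep -c ContainerCreating
```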
 

Expected results:
All pods are created without problems.

Additional info:
For the same reason I can't run oc adm must-gather (the must-gather pod can't be created).
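
Since must-gather can't schedule its pod, one possible fallback is to pull logs from components that are already running. A hedged sketch; the container names and the openshift-ovn-kubernetes pod layout are assumptions for this release:

```bash
# Kubelet logs from the node reporting FailedCreatePodSandBox.
oc adm node-logs ip-10-0-148-115.us-west-2.compute.internal -u kubelet > kubelet.log

# Logs from already-running OVN components (fill in pod names from the listing).
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod> -c ovnkube-master > ovnkube-master.log
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller > ovn-controller.log
```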

Comment 3 Alexander Constantinescu 2020-02-17 16:02:24 UTC
@Ben

How should we track scalability-related bugs such as this one? By tagging them?

/Alex

Comment 5 Alexander Constantinescu 2020-03-02 16:00:07 UTC
Hi Simon

Could you re-test with the newer version of OVN? We've had a lot of performance improvements coming in recently and we suspect the issue might have been resolved. 

Thanks in advance!

-Alex

Comment 6 Simon 2020-03-03 18:12:18 UTC
Retest negative: the same issue persists.

oc get clusterversions
4.4.0-0.nightly-2020-03-02-201804

ovnkube version 0.3.0
ovn-controller (Open vSwitch) 2.12.0
OpenFlow versions 0x4:0x4

Comment 7 Mike Fiedler 2020-03-04 14:20:30 UTC
Marking TestBlocker for PerfScale pod density tests.

Comment 9 Ben Bennett 2020-03-05 14:37:42 UTC
Dan, Aniket thinks you have some PRs in flight that help with this. When they land, can you get someone on our team to test this and, if it is good, get Joe to kick off a new scale test (after a backport)?

Moved to 4.5, but any fix to this is a strong candidate for a 4.4 (or 4.3) backport.

Comment 10 zhaozhanqi 2020-03-06 01:58:55 UTC
Hi skordas,
Can I move the QE contact to you to verify this bug once the issue is fixed? Thanks.

Comment 11 Wei Sun 2020-03-06 04:13:59 UTC
Will this be fixed in 4.4 before release? If yes, we should have a bug to track it for 4.4.

Comment 12 Dan Williams 2020-05-01 18:29:16 UTC
It's highly likely that both OVN and ovnkube scalability changes have fixed this issue (e.g., monitor-all and some ovnkube-master changes). Can we retest scaling to 200 pods per node?
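
A hedged sketch for checking, before a retest, whether the monitor-all optimization mentioned above is actually enabled on a node; the app=ovnkube-node label, the ovn-controller container name, and the external_ids:ovn-monitor-all key are assumptions about this release's layout:

```bash
# Pick one ovnkube-node pod and read ovn-controller's monitor-all setting from OVS.
NODE_POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o name | head -1)
oc exec -n openshift-ovn-kubernetes "$NODE_POD" -c ovn-controller -- \
  ovs-vsctl --if-exists get Open_vSwitch . external_ids:ovn-monitor-all
```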