Bug 1801890

Summary: [OVN] Failed to create pod network sandbox due to failed CNI requests (when scaling to 200 pods per node)
Product: OpenShift Container Platform Reporter: Simon <skordas>
Component: Networking Assignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes QA Contact: Simon <skordas>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: aconstan, anbhat, bbennett, dblack, mifiedle, mkarg, pportant, rkhan, wsun
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-44,SDN-CI-IMPACT
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-20 13:56:23 UTC Type: Bug

Description Simon 2020-02-11 20:26:02 UTC
Description of problem:

During the pod density test, pods get stuck in the ContainerCreating state due to failing CNI requests.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-11-052508

How reproducible:
100%

Steps to Reproduce:
1. Scale up cluster to 20 working nodes.
2. Create 2000 projects (200 per node):
  - git clone https://github.com/openshift/svt.git
  - cd svt/openshift_scalability
  - touch test.yaml
  - vim test.yaml

```yaml
projects:
  - num: 2000
    basename: svt-
    templates:
      -
        num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```

  - cp $KUBECONFIG ~/.kube/config
  - python cluster-loader.py -f test.yaml -p 5

3. Delete projects: oc delete project -l purpose=test
4. Change the number of projects to 4000 in test.yaml: vim test.yaml
5. Create 4000 projects: python cluster-loader.py -f test.yaml -p 5 (a consolidated sketch of these steps follows below)
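
For convenience, a consolidated sketch of steps 1-5. It assumes a bash shell, a working oc login with $KUBECONFIG set, and the Python dependencies for cluster-loader.py; the purpose=test label used for cleanup is the one relied on in step 3:

```bash
#!/usr/bin/env bash
# Hedged reproduction sketch combining the steps above.
# Assumes the cluster has already been scaled to 20 worker nodes (step 1).
set -euo pipefail

git clone https://github.com/openshift/svt.git
cd svt/openshift_scalability

# cluster-loader.py reads ~/.kube/config.
cp "$KUBECONFIG" ~/.kube/config

# cluster-loader config from step 2: 2000 projects, one pause deployment config each.
cat > test.yaml <<'EOF'
projects:
  - num: 2000
    basename: svt-
    templates:
      -
        num: 1
        file: ./content/deployment-config-1rep-pause-template.json
EOF

python cluster-loader.py -f test.yaml -p 5

# Step 3: clean up, then repeat with 4000 projects (steps 4-5).
oc delete project -l purpose=test
sed -i 's/num: 2000/num: 4000/' test.yaml
python cluster-loader.py -f test.yaml -p 5
```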

Actual results:
Pods are stuck with ContainerCreating status:
events:

  Warning  FailedCreatePodSandBox  95m        kubelet, ip-10-0-148-115.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_svt-1682_9f936fa9-0b64-4f35-89c5-a1095288dbf3_0(adb0256597951e768b417f453cf6640eb15737b9c64d1facf1541d6f4ae9910c): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
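
The "failed to get pod annotation: timed out waiting for the condition" text indicates the CNI plugin gave up waiting for the pod annotation that ovnkube-master is supposed to write. A minimal diagnostic sketch, using the pod and namespace from the event above; the k8s.ovn.org/pod-networks annotation key is an assumption for this release:

```bash
# Check whether ovnkube-master ever wrote the pod-networks annotation on a stuck pod.
# (Annotation key k8s.ovn.org/pod-networks is assumed; pod/namespace are taken from the event above.)
oc get pod deploymentconfig0-1-deploy -n svt-1682 \
  -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}{"\n"}'

# Count pods still stuck in ContainerCreating across the cluster.
oc get pods --all-namespaces --no-headers | grep -c ContainerCreating
```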
 

Expected results:
All pods are created without problems.

Additional info:
For the same reason I can't run oc adm must-gather (the must-gather pod can't be created).
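
Since must-gather can't schedule its pod, one possible fallback is to pull logs from components that are already running. A hedged sketch; the container names and the openshift-ovn-kubernetes pod layout are assumptions for this release:

```bash
# Kubelet logs from the node reporting FailedCreatePodSandBox.
oc adm node-logs ip-10-0-148-115.us-west-2.compute.internal -u kubelet > kubelet.log

# Logs from already-running OVN components (fill in pod names from the listing).
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod> -c ovnkube-master > ovnkube-master.log
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller > ovn-controller.log
```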

Comment 3 Alexander Constantinescu 2020-02-17 16:02:24 UTC
@Ben

How should we track scalability-related bugs such as this one? By tagging them?

/Alex

Comment 5 Alexander Constantinescu 2020-03-02 16:00:07 UTC
Hi Simon

Could you re-test with the newer version of OVN? We've had a lot of performance improvements coming in recently and we suspect the issue might have been resolved. 

Thanks in advance!

-Alex

Comment 6 Simon 2020-03-03 18:12:18 UTC
Retest negative: the same issue persists.

oc get clusterversions
4.4.0-0.nightly-2020-03-02-201804

ovnkube version 0.3.0
ovn-controller (Open vSwitch) 2.12.0
OpenFlow versions 0x4:0x4

Comment 7 Mike Fiedler 2020-03-04 14:20:30 UTC
Marking TestBlocker for PerfScale pod density tests.

Comment 9 Ben Bennett 2020-03-05 14:37:42 UTC
Dan, Aniket thinks you have some PRs in flight that help with this. When they land, can you get someone on our team to test this and, if it is good, get Joe to kick off a new scale test (after a backport)?

Moved to 4.5, but any fix to this is a strong candidate for a 4.4 (or 4.3) backport.

Comment 10 zhaozhanqi 2020-03-06 01:58:55 UTC
Hi skordas,
Can I move the QE contact to you to verify this bug once the issue is fixed? Thanks.

Comment 11 Wei Sun 2020-03-06 04:13:59 UTC
Will this be fixed in 4.4 before release? If yes, we should have a bug to track it for 4.4.

Comment 12 Dan Williams 2020-05-01 18:29:16 UTC
It's highly likely that both OVN and ovnkube scalability changes have fixed this issue (e.g., monitor-all and some ovnkube-master changes). Can we retest scaling to 200 pods per node?
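
A hedged sketch for checking, before a retest, whether the monitor-all optimization mentioned above is actually enabled on a node; the app=ovnkube-node label, the ovn-controller container name, and the external_ids:ovn-monitor-all key are assumptions about this release's layout:

```bash
# Pick one ovnkube-node pod and read ovn-controller's monitor-all setting from OVS.
NODE_POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o name | head -1)
oc exec -n openshift-ovn-kubernetes "$NODE_POD" -c ovn-controller -- \
  ovs-vsctl --if-exists get Open_vSwitch . external_ids:ovn-monitor-all
```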