Bug 1801890 - [OVN] Failed to create pod network sandbox due to failed CNI requests (when scaling to 200 pods per node)
Keywords:
Status: CLOSED DUPLICATE of bug 1820737
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Dan Williams
QA Contact: Simon
URL:
Whiteboard: aos-scalability-44,SDN-CI-IMPACT
Depends On:
Blocks:
 
Reported: 2020-02-11 20:26 UTC by Simon
Modified: 2020-08-04 14:20 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 13:56:23 UTC
Target Upstream Version:
Embargoed:



Description Simon 2020-02-11 20:26:02 UTC
Description of problem:

During the pod density test, pods are stuck in the ContainerCreating state due to failing CNI requests.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-11-052508

How reproducible:
100%

Steps to Reproduce:
1. Scale the cluster up to 20 worker nodes.
2. Create 2000 projects (100 pods per node at this stage; step 5 scales to 200 per node):
  - git clone https://github.com/openshift/svt.git
  - cd svt/openshift_scalability
  - Create test.yaml with the following content:

```yaml
projects:
  - num: 2000
    basename: svt-
    templates:
      -
        num: 1
        file: ./content/deployment-config-1rep-pause-template.json
```

  - cp $KUBECONFIG ~/.kube/config
  - python cluster-loader.py -f test.yaml -p 5

3. Delete the projects: oc delete project -l purpose=test
4. Change the number of projects to 4000: vim test.yaml
5. Create 4000 projects (see the monitoring sketch below):
python cluster-loader.py -f test.yaml -p 5
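
While the load runs, a minimal sketch for spotting stuck pods (plain oc plus awk; the pod and namespace names in the second command are placeholders, not from this bug):

```bash
# Count pods stuck in ContainerCreating across all namespaces
# (STATUS is the 4th column of `oc get pods` output)
oc get pods --all-namespaces --no-headers | awk '$4 == "ContainerCreating"' | wc -l

# Dump the recent events for one stuck pod (name/namespace are placeholders)
oc describe pod <pod-name> -n <namespace> | tail -n 20
```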

Actual results:
Pods are stuck in the ContainerCreating state. Example event:

  Warning  FailedCreatePodSandBox  95m        kubelet, ip-10-0-148-115.us-west-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_svt-1682_9f936fa9-0b64-4f35-89c5-a1095288dbf3_0(adb0256597951e768b417f453cf6640eb15737b9c64d1facf1541d6f4ae9910c): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition
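
For context, ovn-k8s-cni-overlay blocks until ovnkube-master writes the pod's network details into a pod annotation, so the "failed to get pod annotation" timeout above means that annotation never arrived. A quick check for whether a stuck pod ever received it, assuming the k8s.ovn.org/pod-networks annotation key used by upstream ovn-kubernetes (pod name and namespace are placeholders):

```bash
# Empty output means ovnkube-master never annotated the pod
oc get pod <pod-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}'
```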
 

Expected results:
All pods are created successfully.

Additional info:
For the same reason, I can't run oc adm must-gather (its pod can't be created either).
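
A possible node-level workaround while pod creation is broken: oc adm node-logs reads journals straight from the node without scheduling a pod (the unit names below are assumptions about which services matter here):

```bash
# Pull kubelet and CRI-O journals from the affected node, no pod required
oc adm node-logs ip-10-0-148-115.us-west-2.compute.internal -u kubelet
oc adm node-logs ip-10-0-148-115.us-west-2.compute.internal -u crio
```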

Comment 3 Alexander Constantinescu 2020-02-17 16:02:24 UTC
@Ben

How should we track scalability-related bugs such as this one? By tagging them?

/Alex

Comment 5 Alexander Constantinescu 2020-03-02 16:00:07 UTC
Hi Simon

Could you re-test with the newer version of OVN? A lot of performance improvements have landed recently, and we suspect the issue might have been resolved.

Thanks in advance!

-Alex

Comment 6 Simon 2020-03-03 18:12:18 UTC
Retest was negative; the same issue persists.

oc get clusterversions
4.4.0-0.nightly-2020-03-02-201804

ovnkube version 0.3.0
ovn-controller (Open vSwitch) 2.12.0
OpenFlow versions 0x4:0x4

Comment 7 Mike Fiedler 2020-03-04 14:20:30 UTC
Marking TestBlocker for PerfScale pod density tests.

Comment 9 Ben Bennett 2020-03-05 14:37:42 UTC
Dan, Aniket thinks you have some PRs in flight that help with this. When they land, can you get someone on our team to test this and, if it looks good, get Joe to kick off a new scale test (after a backport)?

Moved to 4.5, but any fix to this is a strong candidate for a 4.4 (or 4.3) backport.

Comment 10 zhaozhanqi 2020-03-06 01:58:55 UTC
Hi skordas,
Can I move the QA contact to you to verify this bug once the issue is fixed? Thanks.

Comment 11 Wei Sun 2020-03-06 04:13:59 UTC
Will this be fixed in 4.4 before release? If yes, we should have a bug to track it for 4.4.

Comment 12 Dan Williams 2020-05-01 18:29:16 UTC
It's highly likely that both OVN and ovnkube scalability changes have fixed this issue (e.g., monitor-all and some ovnkube-master changes). Can we retest scaling to 200 pods per node?
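
For whoever retests: a quick sketch for confirming the monitor-all change is active on a node, assuming shell access to ovs-vsctl there (external_ids:ovn-monitor-all is the standard ovn-controller knob; how you reach the node depends on the deployment):

```bash
# "true" disables conditional (per-lport) monitoring of the southbound DB,
# trading node memory for lower ovsdb-server load at scale
ovs-vsctl --if-exists get Open_vSwitch . external_ids:ovn-monitor-all
```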

