Bug 1652535 - It will use up the IPs in the subnet range when the secondary NIC failed to be setup on pod via Multus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Douglas Smith
QA Contact: Meng Bo
URL:
Whiteboard: multus
Depends On:
Blocks:
 
Reported: 2018-11-22 10:35 UTC by Meng Bo
Modified: 2019-06-04 10:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:41:02 UTC
Target Upstream Version:
Embargoed:


Attachments
multus_log (3.07 MB, text/plain)
2018-11-22 10:35 UTC, Meng Bo


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:41:08 UTC

Description Meng Bo 2018-11-22 10:35:12 UTC
Created attachment 1507911 [details]
multus_log

Description of problem:
When a pod with multiple interfaces is created via Multus, Multus first calls CMD_ADD for openshift-sdn and then calls the secondary CNI plugin.
But when the secondary CNI plugin fails to set up the interface for any reason, Multus does not call CMD_DEL for openshift-sdn. The IP allocated by openshift-sdn for each failed attempt is therefore never released, and the IP pool on the node is eventually exhausted.

Version-Release number of selected component (if applicable):
v4.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster

2. Enable the multus-cni for the cluster with the following steps
 a) Create CRD for network-attachment-definition
 b) Create the clusterrole/clusterrolebinding/serviceaccount for the multus daemonset
 c) Create the configmap that uses openshift-sdn as the master plugin
    # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/ConfigMap-openshift-sdn-delegates.yaml
 d) Create the multus cni daemonset with the configmap above
    # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/DaemonSet-multus.yaml

3. Create the net-attach-def resource for the macvlan plugin
# oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/NetworkAttachmentDefinitions/macvlan-bridge.yaml

4. Make sure that the macvlan CNI binary is NOT present on any of the nodes

5. Try to create a pod with macvlan as the secondary NIC
apiVersion: v1
kind: Pod
metadata:
  name: macvlan-bridge-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-bridge-conf
spec:
  containers:
  - name: macvlan-bridge-pod
    image: docker.io/bmeng/centos-network

6. Check the IPAM dir for openshift-sdn on the node after a while 

Actual results:
The pod never becomes ready, and IPAM files keep being created until the IP pool is exhausted.

Expected results:
Multus should release the IP via openshift-sdn when it fails to add the secondary interface.

Additional info:
The pod events show:
Events:
  Type     Reason                  Age                 From                             Message
  ----     ------                  ----                ----                             -------
  Normal   Scheduled               33m                 default-scheduler                Successfully assigned default/macvlan-bridge-pod to ocp40-node.bmeng.local
  Warning  FailedCreatePodSandBox  33m                 kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "ad6791abf5ff926d926b64a511b4344fae17eb77bb292f75c61d0a19c259f44f" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "ad6791abf5ff926d926b64a511b4344fae17eb77bb292f75c61d0a19c259f44f" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]
  Warning  FailedCreatePodSandBox  33m                 kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "48831226e858e450c9427bbe19a7546bbf04f675976dc427986309a8c279a35a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "48831226e858e450c9427bbe19a7546bbf04f675976dc427986309a8c279a35a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]
  Warning  FailedCreatePodSandBox  33m                 kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d10568ea752f1f29baf1561874f9722da4921671eff5aa51cd45e9aa84a27c4a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "d10568ea752f1f29baf1561874f9722da4921671eff5aa51cd45e9aa84a27c4a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]


Full multus log attached.

Comment 1 Casey Callendrello 2018-11-23 15:30:01 UTC
Very good find!

Over to Feng.

Comment 2 Douglas Smith 2018-11-28 21:38:57 UTC
Just an FYI that I've been able to successfully replicate the issue -- thank you A TON for the detailed instructions to make that very easy.

For my own record, I was able to watch the number of IP assignment files grow on the node the pod is scheduled to with:

watch -n1 "ls -1 /var/lib/cni/networks/openshift-sdn | wc -l"

Comment 3 Douglas Smith 2018-11-29 20:23:02 UTC
I've been able to isolate the issue, and have a patch upstream in this pull request: https://github.com/intel/multus-cni/pull/201

In short, the problem is caused by these lines: https://github.com/intel/multus-cni/blob/4aa1d212f1f98a4ad24b8694bf5d8871f9ab7d99/multus/multus.go#L259-L261

The return statement there prevents the default network plugin from ever being deleted when the delete call fails for the plugin that failed to start in the first place.

I've kicked off a conversation with the other maintainers to begin to get this pull request merged in.

Comment 4 Douglas Smith 2018-11-30 15:38:17 UTC
We've got the pull request merged, and the upstream docker image should reflect the change. I'll work on merging outstanding patches into our release branch and bring the changes down early next week.

Thanks again for the great find on this one.

Comment 5 Casey Callendrello 2019-03-06 14:33:46 UTC
Can we mark this as MODIFIED? Are the changes in the openshift tree?

Comment 6 Douglas Smith 2019-03-06 15:48:51 UTC
Changes are downstream and available. Thanks Casey, updated to modified.

Comment 9 Meng Bo 2019-03-14 08:20:41 UTC
Tested on OCP build 4.0.0-0.nightly-2019-03-14-040908

Issue has been fixed.

Comment 11 errata-xmlrpc 2019-06-04 10:41:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

