Created attachment 1507911 [details]
multus_log

Description of problem:
When creating a pod with multiple interfaces via Multus, Multus first calls CMD_ADD for openshift-sdn and then calls the secondary CNI plugin. But when the secondary CNI plugin fails to set up its interface for any reason, Multus does not call CMD_DEL for openshift-sdn. This eventually exhausts the IP pool on the node.

Version-Release number of selected component (if applicable):
v4.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster
2. Enable multus-cni for the cluster with the following steps:
   a) Create the CRD for network-attachment-definition
   b) Create the clusterrole/clusterrolebinding/serviceaccount for the multus daemonset
   c) Create the configmap which uses openshift-sdn as the master plugin
      # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/ConfigMap-openshift-sdn-delegates.yaml
   d) Create the multus cni daemonset with the configmap above
      # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/DaemonSet-multus.yaml
3. Create the net-attach-def resource for the macvlan plugin
   # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/multus-cni/NetworkAttachmentDefinitions/macvlan-bridge.yaml
4. Make sure the macvlan CNI binary is NOT present on any of the nodes
5. Try to create a pod with macvlan as the secondary NIC:

   apiVersion: v1
   kind: Pod
   metadata:
     name: macvlan-bridge-pod
     annotations:
       k8s.v1.cni.cncf.io/networks: macvlan-bridge-conf
   spec:
     containers:
     - name: macvlan-bridge-pod
       image: docker.io/bmeng/centos-network

6. Check the IPAM dir for openshift-sdn on the node after a while

Actual results:
The pod never becomes ready, and IPAM files keep being generated until the IP pool is exhausted.

Expected results:
The IP should be released via openshift-sdn when Multus fails to add the secondary interface.
Additional info:
The pod events show:

Events:
  Type     Reason                  Age  From                             Message
  ----     ------                  ---  ----                             -------
  Normal   Scheduled               33m  default-scheduler                Successfully assigned default/macvlan-bridge-pod to ocp40-node.bmeng.local
  Warning  FailedCreatePodSandBox  33m  kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "ad6791abf5ff926d926b64a511b4344fae17eb77bb292f75c61d0a19c259f44f" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "ad6791abf5ff926d926b64a511b4344fae17eb77bb292f75c61d0a19c259f44f" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]
  Warning  FailedCreatePodSandBox  33m  kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "48831226e858e450c9427bbe19a7546bbf04f675976dc427986309a8c279a35a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "48831226e858e450c9427bbe19a7546bbf04f675976dc427986309a8c279a35a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]
  Warning  FailedCreatePodSandBox  33m  kubelet, ocp40-node.bmeng.local  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d10568ea752f1f29baf1561874f9722da4921671eff5aa51cd45e9aa84a27c4a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to set up pod "macvlan-bridge-pod_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin], failed to clean up sandbox container "d10568ea752f1f29baf1561874f9722da4921671eff5aa51cd45e9aa84a27c4a" network for pod "macvlan-bridge-pod": NetworkPlugin cni failed to teardown pod "macvlan-bridge-pod_default" network: Multus: error in invoke Delegate del - "macvlan": failed to find plugin "macvlan" in path [/opt/cni/bin]]

Full multus log attached.
Very good find! Over to Feng.
Just an FYI that I've been able to successfully replicate the issue -- thank you A TON for the detailed instructions that made it very easy. For my own record, I was able to watch the number of IP assignment files grow on the node the pod is scheduled to with:

  watch -n1 "ls -1 /var/lib/cni/networks/openshift-sdn | wc -l"
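For context on why that count maps directly to leaked leases: openshift-sdn's host-local-style IPAM keeps one file per allocated IP in that directory, so counting files counts outstanding leases. A minimal sketch of the same check (the IPAM_DIR override is only there so the snippet can be tried outside a cluster node; on an actual node the default path applies):

```shell
# Host-local IPAM state dir for the openshift-sdn network on the node;
# set IPAM_DIR to another directory for a dry run off-node.
IPAM_DIR="${IPAM_DIR:-/var/lib/cni/networks/openshift-sdn}"

# One file per assigned IP; a steadily growing count means leaked leases.
ls -1 "$IPAM_DIR" 2>/dev/null | wc -l
```

With the bug present, this count climbs on every failed sandbox creation until the subnet is exhausted.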
I've been able to isolate the issue, and have a patch upstream in this pull request:
https://github.com/intel/multus-cni/pull/201

In short, the area causing the problem is these lines:
https://github.com/intel/multus-cni/blob/4aa1d212f1f98a4ad24b8694bf5d8871f9ab7d99/multus/multus.go#L259-L261

The return statement there means the deletion of the default network plugin never happens when deleting the plugin that failed to start also fails. I've kicked off a conversation with the other maintainers to begin getting this pull request merged.
We've got the pull request merged, and the upstream docker image should reflect the change. I'll work on merging outstanding patches into our release branch and bring the changes down early next week. Thanks again for the great find on this one.
Can we mark this as MODIFIED? Are the changes in the openshift tree?
Changes are downstream and available. Thanks Casey, updated to modified.
Tested on OCP build 4.0.0-0.nightly-2019-03-14-040908. The issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758