Bug 1679036
Summary: | [Multus-cni] The multus-cni cannot call openshift-sdn to clean up the ipam file when the pod falls into failed status | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Meng Bo <bmeng>
Component: | Networking | Assignee: | Douglas Smith <dosmith>
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng>
Severity: | medium | Docs Contact: |
Priority: | high | |
Version: | 4.1.0 | CC: | aos-bugs, cdc
Target Milestone: | --- | |
Target Release: | 4.1.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Last Closed: | 2019-06-04 10:44:14 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Description (Meng Bo, 2019-02-20 08:03:27 UTC)
Good catch. This is definitely a release blocker.

Meng Bo, thanks a bunch for the detailed instructions on replicating the issue. I'm able to replicate it with the given instructions. I hacked in my own logging, like so:

```
# cat /etc/kubernetes/cni/net.d/00-multus.conf
{
  "name": "multus-cni-network",
  "type": "multus",
  "logFile": "/var/log/multus.log",
  "logLevel": "debug",
  "namespaceIsolation": true,
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.2.0",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
```

Additionally, I used a node label and node selector to assign the pod to a particular node; there is more detail in my investigation notes here: https://gist.github.com/dougbtv/31b53730afc11eeffee30f30907d1060

There were no logs on deletion. My next step is to look into how and why that happens, but it is almost as if Multus was never called.

FYI, you can stop the network operator and do your own customizations for development. The instructions are at https://github.com/openshift/cluster-network-operator#stopping-the-deployed-operators

I've also been able to replicate this in an upstream Kubernetes lab, and I've filed an upstream issue here: https://github.com/intel/multus-cni/issues/267

I've been able to isolate the issue: Multus returns too early in the `cmdDel` function when it cannot find the netns, and therefore never calls the delegated CNI plugin during delete. My fix simply logs a warning to the debug logs and continues, allowing the delegates to be called. Proposed fix: https://github.com/intel/multus-cni/pull/269

The pull request has landed upstream and has been merged downstream; it should be available in the next build of the downstream image.

Can this be marked as MODIFIED? Has this been brought downstream?

Thanks Casey, it has indeed been brought downstream; I've marked it as MODIFIED.
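The control-flow change described above can be illustrated with a minimal sketch. This is not the actual Multus source; `cmdDel`, `errNoNetns`, and the delegate plumbing here are invented stand-ins that only show the pattern: on a missing netns, warn and fall through so the delegate plugin still receives the DEL and can release its IPAM allocation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errNoNetns simulates the "cannot open netns" condition hit during a CNI DEL
// for a pod whose network namespace is already gone (e.g. a failed pod).
// Hypothetical value, for illustration only.
var errNoNetns = errors.New("failed to open netns")

// cmdDel sketches the fixed behavior: a netns error produces a warning instead
// of an early return, so every delegate (e.g. openshift-sdn) is still invoked.
// It returns the names of the delegates it invoked.
func cmdDel(netnsErr error, delegates []string) []string {
	if netnsErr != nil {
		// Before the fix, Multus returned here, skipping delegate cleanup
		// and leaking the IPAM file. After the fix: warn and continue.
		fmt.Fprintf(os.Stderr, "warning: %v; continuing with delegate DEL\n", netnsErr)
	}
	invoked := make([]string, 0, len(delegates))
	for _, name := range delegates {
		fmt.Printf("DEL delegated to %s\n", name)
		invoked = append(invoked, name)
	}
	return invoked
}

func main() {
	// Even though the netns lookup failed, openshift-sdn still gets the DEL.
	cmdDel(errNoNetns, []string{"openshift-sdn"})
}
```

This mirrors why the IPAM file was left behind: cleanup lives in the delegate, so any path that skips the delegate on DEL leaks the allocation.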
Tested on 4.0.0-0.nightly-2019-03-14-040908. The issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758