Bug 1679036 - [Multus-cni] The multus-cni cannot call openshift-sdn to clean up the ipam file when the pod falls into failed status
Summary: [Multus-cni] The multus-cni cannot call openshift-sdn to clean up the ipam file when the pod falls into failed status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Douglas Smith
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-02-20 08:03 UTC by Meng Bo
Modified: 2019-06-04 10:44 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:14 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:44:20 UTC)

Description Meng Bo 2019-02-20 08:03:27 UTC
Description of problem:
https://bugzilla.redhat.com/show_bug.cgi?id=1652535

This is similar to the above bug, but different. The above bug has been fixed in the latest build.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-02-19-024716

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with Multus enabled
# openshift-install create cluster

2. Log in to the cluster as a regular user

3. Try to create a pod which will fall into a failed status
Here is an example of such a pod:
{
        "kind": "Pod",
        "apiVersion":"v1",
        "metadata": {
                "name": "fail-pod",
                "labels": {
                        "name": "test-pods"
                }
        },
        "spec": {
                "containers": [{
                        "command": [ "/bin/bash", "-c", "sleep 5 ; false" ],
                        "name": "fail-pod",
                        "image": "bmeng/hello-openshift"
                }]
        }
}

4. Delete the pod after it runs into an unhealthy status
# oc get po -o wide 
NAME       READY     STATUS    RESTARTS   AGE       IP             NODE                                              NOMINATED NODE
fail-pod   0/1       Error     0          18s       10.131.0.224   ip-10-0-130-169.ap-northeast-1.compute.internal   <none>
# oc delete po fail-pod
pod "fail-pod" deleted

5. Check the IP files on the pod's node
# ls /var/lib/cni/networks/openshift-sdn


Actual results:
The IP file for the failed pod can still be found on the node; it was not deleted.
If this happens multiple times, all the IPs in the node's range will eventually be occupied and no new pods can be created on this node.
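
A rough way to spot the leak on the node (a sketch, assuming openshift-sdn keeps one host-local reservation file per IP in the directory above, plus bookkeeping files whose names vary by version):

```
# NODE=ip-10-0-130-169.ap-northeast-1.compute.internal   # example node from step 4
# oc get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE | awk 'NR>1 {print $7}' | sort > /tmp/live-ips
# ls /var/lib/cni/networks/openshift-sdn | grep -vE 'last_reserved|lock' | sort > /tmp/reserved-ips
# comm -13 /tmp/live-ips /tmp/reserved-ips    # reservations with no matching running pod
```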

Expected results:
Multus should call openshift-sdn to tear down the network properly and remove the IP files for the failed pods.


Additional info:
I cannot provide the Multus log since it is managed by the network operator now.

Comment 1 Casey Callendrello 2019-02-20 14:13:08 UTC
Good catch. This is definitely a release blocker.

Comment 2 Douglas Smith 2019-02-20 21:16:42 UTC
Meng Bo -- thanks a bunch for the detailed instructions on replicating the issue.

I'm able to replicate the issue with the given instructions.

I did hack in my own logging, like so:

```
# cat /etc/kubernetes/cni/net.d/00-multus.conf 
{ "name": "multus-cni-network", "type": "multus", "logFile": "/var/log/multus.log", "logLevel": "debug", "namespaceIsolation": true, "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.2.0", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
```

Additionally, I used a node label and node selector to assign the pod to a particular node; there is more detail in my notes about the investigation here: https://gist.github.com/dougbtv/31b53730afc11eeffee30f30907d1060
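
For reference, pinning the test pod to one node can look roughly like this (the label key is made up for illustration; the gist has the actual details):

```
# oc label node ip-10-0-130-169.ap-northeast-1.compute.internal multus-debug=true
```

and then a matching selector in the pod spec:

```
"nodeSelector": { "multus-debug": "true" }
```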

There were no logs on deletion. My next steps are to look into how / why that's happening, but it's almost as if Multus was never called.

Comment 3 Casey Callendrello 2019-02-21 09:24:40 UTC
FYI, you can stop the network operator and do your own customizations for development. The instructions are at https://github.com/openshift/cluster-network-operator#stopping-the-deployed-operators
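
From memory of those instructions (a sketch only; the linked README is authoritative), that amounts to scaling the cluster-version operator and the network operator down to zero before applying your own manifests:

```
# oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
# oc scale --replicas 0 -n openshift-network-operator deployments/network-operator
```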

Comment 4 Douglas Smith 2019-02-21 12:51:04 UTC
I've also been able to replicate this in an upstream Kubernetes lab, and I've filed an upstream issue here @ https://github.com/intel/multus-cni/issues/267

Comment 5 Douglas Smith 2019-02-21 19:47:13 UTC
I've been able to isolate the issue: Multus returns too early in the `cmdDel` function when it cannot find the netns, and therefore never calls the delegated CNI plugin during delete. My fix logs a warning to the debug logs and continues so the delegates are still called.

Proposed fix @ https://github.com/intel/multus-cni/pull/269
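
For illustration, a minimal Go sketch of that control-flow change (the types and helpers here are stand-ins, not the actual multus-cni code; see the PR above for the real fix):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// delegate is an illustrative stand-in for a delegated CNI plugin config
// (e.g. openshift-sdn); it is not the real multus-cni type.
type delegate struct {
	Name string
}

// netnsExists is a hypothetical check for the pod's network namespace path.
func netnsExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// delegateDel stands in for invoking the delegate's CNI DEL command, which is
// where openshift-sdn would release the pod's host-local IPAM reservation.
func delegateDel(d delegate, containerID string) error {
	log.Printf("DEL via %s for container %s", d.Name, containerID)
	return nil
}

// cmdDel sketches the fixed control flow: a missing netns is logged as a
// warning instead of aborting, so the delegates still run their DEL and the
// stale IP file under /var/lib/cni/networks/openshift-sdn gets removed.
func cmdDel(netnsPath, containerID string, delegates []delegate) error {
	if netnsPath == "" || !netnsExists(netnsPath) {
		// Before the fix, this branch returned early and skipped the delegates.
		log.Printf("warning: netns %q not found for container %s; calling delegates anyway",
			netnsPath, containerID)
	}

	var firstErr error
	for _, d := range delegates {
		if err := delegateDel(d, containerID); err != nil && firstErr == nil {
			firstErr = fmt.Errorf("delegate %s: %v", d.Name, err)
		}
	}
	return firstErr
}

func main() {
	// The pod has already exited, so its netns is gone, but DEL still reaches openshift-sdn.
	if err := cmdDel("/var/run/netns/already-gone", "abc123", []delegate{{Name: "openshift-sdn"}}); err != nil {
		log.Fatal(err)
	}
}
```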

Comment 6 Douglas Smith 2019-02-22 15:13:04 UTC
The pull request has landed upstream and has been merged downstream; it should be available in the next build of the downstream image.

Comment 7 Casey Callendrello 2019-03-06 14:35:59 UTC
Can this be marked as MODIFIED? Has this been brought downstream?

Comment 8 Douglas Smith 2019-03-06 15:47:07 UTC
Thanks Casey, it has indeed been brought downstream; I've marked it as MODIFIED.

Comment 11 Meng Bo 2019-03-14 08:26:17 UTC
Tested on 4.0.0-0.nightly-2019-03-14-040908

The issue has been fixed.

Comment 13 errata-xmlrpc 2019-06-04 10:44:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

