Description of problem:
https://bugzilla.redhat.com/show_bug.cgi?id=1652535
This is similar to the above bug, but different. The above bug has been fixed in the latest build.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-02-19-024716

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP cluster with multus enabled
# openshift-install create cluster
2. Log in to the cluster as a regular user
3. Create a pod that will end up in a failed state. Here is an example of such a pod:
{
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {
        "name": "fail-pod",
        "labels": {
            "name": "test-pods"
        }
    },
    "spec": {
        "containers": [{
            "command": ["/bin/bash", "-c", "sleep 5 ; false"],
            "name": "fail-pod",
            "image": "bmeng/hello-openshift"
        }]
    }
}
4. Delete the pod after it enters the Error state
# oc get po -o wide
NAME       READY   STATUS   RESTARTS   AGE   IP             NODE                                              NOMINATED NODE
fail-pod   0/1     Error    0          18s   10.131.0.224   ip-10-0-130-169.ap-northeast-1.compute.internal   <none>
# oc delete po fail-pod
pod "fail-pod" deleted
5. Check the IP reservation files on the pod's node
# ls /var/lib/cni/networks/openshift-sdn

Actual results:
The IP file for the failed pod can still be found on the node and is not deleted. If this happens enough times, all the IPs in the node's range get used up and no new pods can be created on that node.

Expected results:
Multus should call openshift-sdn to tear the pod down properly, removing the IP files for failed pods.

Additional info:
I cannot provide the multus log since multus is managed by the network operator now. A small sketch for inspecting the leftover reservation files on the node follows below.
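For anyone trying to confirm the leak, here is a minimal sketch (not part of the bug report's tooling) that lists the reservation files left behind on the node. It assumes the openshift-sdn IPAM store behaves like the upstream host-local plugin: each file under /var/lib/cni/networks/openshift-sdn is named after a reserved IP, and its first line is the ID of the container that reserved it.

```
// leak-check.go: list leftover IPAM reservation files on a node.
// Assumption: host-local-style store, one file per reserved IP, first line = container ID.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dir := "/var/lib/cni/networks/openshift-sdn"
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintf(os.Stderr, "cannot read %s: %v\n", dir, err)
		os.Exit(1)
	}
	for _, e := range entries {
		// Skip the allocator's bookkeeping files; keep only per-IP reservations.
		if e.IsDir() || strings.HasPrefix(e.Name(), "last_reserved_ip") || e.Name() == "lock" {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			continue
		}
		// First line of the reservation file is the container ID (assumption noted above).
		id := strings.SplitN(strings.TrimSpace(string(data)), "\n", 2)[0]
		fmt.Printf("IP %s reserved by container %s\n", e.Name(), id)
	}
}
```

Run it on the node after deleting the failed pod; if the deleted pod's IP still shows up in the output, the DEL never reached the delegate's IPAM.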
Good catch. This is definitely a release blocker.
Meng Bo -- thanks a bunch for the detailed instructions on replicating the issue. I'm able to reproduce it with them. I hacked in my own logging, like so:

```
# cat /etc/kubernetes/cni/net.d/00-multus.conf
{
  "name": "multus-cni-network",
  "type": "multus",
  "logFile": "/var/log/multus.log",
  "logLevel": "debug",
  "namespaceIsolation": true,
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.2.0",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
```

Additionally, I used a node label and node selector to pin the pod to a particular node; more detail is in my investigation notes here: https://gist.github.com/dougbtv/31b53730afc11eeffee30f30907d1060

There were no logs on deletion. My next step is to look into how / why that's happening, but it's almost as if Multus was never called for the delete.
FYI, you can stop the network operator and do your own customizations for development. The instructions are at https://github.com/openshift/cluster-network-operator#stopping-the-deployed-operators
I've also been able to reproduce this in an upstream Kubernetes lab, and I've filed an upstream issue here: https://github.com/intel/multus-cni/issues/267
I've been able to isolate the issue: Multus returns too early from `cmdDel` when it cannot find the netns, so the delegated CNI plugins are never called during delete. My fix just logs a warning and continues, allowing the delegates to be called. Proposed fix @ https://github.com/intel/multus-cni/pull/269
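To make the shape of the change concrete, here is a minimal, self-contained Go sketch of the pattern described above. It is not the actual patch from the pull request; the `teardown` and `delegateDel` names are invented for illustration, and it only models the control flow: a missing netns is downgraded to a warning so the delegate DELs (and therefore IPAM cleanup) still run.

```
// Illustrative sketch only -- not multus-cni's real API.
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// delegateDel stands in for invoking a delegate CNI plugin's DEL command
// (e.g. openshift-sdn releasing the pod's IP reservation).
type delegateDel func(containerID string) error

// teardown mirrors the fixed cmdDel flow: a missing network namespace is logged
// as a warning instead of aborting, so every delegate DEL still runs.
func teardown(netnsPath, containerID string, delegates []delegateDel) error {
	if _, err := os.Stat(netnsPath); err != nil {
		// Before the fix, returning here skipped the delegates entirely,
		// leaking the per-IP files under /var/lib/cni/networks/<network>.
		log.Printf("WARNING: cannot find netns %q: %v -- calling delegate DELs anyway", netnsPath, err)
	}

	var errs []error
	for _, del := range delegates {
		if err := del(containerID); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func main() {
	// Example: the netns is already gone (the failed pod's sandbox was torn down),
	// but the delegate DEL must still run to release the IP.
	releaseIP := func(id string) error {
		fmt.Printf("delegate DEL: releasing IP reservation for container %s\n", id)
		return nil
	}
	if err := teardown("/var/run/netns/does-not-exist", "abc123", []delegateDel{releaseIP}); err != nil {
		log.Fatal(err)
	}
}
```

The design point is that the delegate's IPAM reservation lives on disk, so it can (and must) be released even when the pod's network namespace no longer exists; treating DEL as best-effort lets that cleanup happen.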
The pull request landed upstream and has been merged downstream; it should be available in the next build of the downstream image.
Can this be marked as MODIFIED? Has this been brought downstream?
Thanks Casey -- it has indeed been brought downstream, and I've marked it as MODIFIED.
Tested on 4.0.0-0.nightly-2019-03-14-040908. The issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758