Bug 1737551 - Multus: Err in tearing down failed plugins: cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/87891/ns/net": no such file or directory
Summary: Multus: Err in tearing down failed plugins: cannot set "openshift-sdn" ifname...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 1.4.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Sebastian Scheinkman
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-05 15:55 UTC by Qixuan Wang
Modified: 2019-08-06 06:43 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-06 06:43:50 UTC
Target Upstream Version:
Embargoed:



Description Qixuan Wang 2019-08-05 15:55:03 UTC
Description of problem:
Created a PVC, then created a Pod referencing the PVC. The container in the Pod ran the command `touch /mnt/SUCCESS && exit 0 || exit 1` and the Pod then completed, which was expected from the storage angle because a file named SUCCESS was created in the backend storage. But I got this warning in the events:
 
FailedCreatePodSandBox  1s    kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-g_qwang2_99a32ac7-b793-11e9-a860-fa163e026b79_1(cd418d6c2d28e9a505b496e443e6fc365d5f0d5fdb019565544203d56bdea947): Multus: Err in tearing down failed plugins: cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/87891/ns/net": no such file or directory


My questions are:
1. There is no "openshift-multus" namespace in my cluster, and no pod provides multiple NICs. Is Multus supported in CNV 1.4?
2. There are kube-multus pods running in the "kube-system" namespace. What are these pods for?
3. Why does the Pod need Multus in this case, rather than other network interfaces?
4. Why is a pod sandbox re-created when the container runs `touch /mnt/SUCCESS && exit 0 || exit 1`? If it ran `dd if=/dev/zero of=/tmp/test.data bs=5M count=100 2>&1`, I wouldn't see this warning; everything seemed fine.


Version-Release number of selected component (if applicable):
CNV v1.4.1


How reproducible:
100%


Steps to Reproduce:
1. Use this manifest to create a PVC 
$ cat pvc-gluster.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-gluster
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
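
Then create it (assuming the manifest is saved as pvc-gluster.yaml, matching the cat above):
$ oc create -f pvc-gluster.yaml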

2. Verify the PVC and PV are bound
[cloud-user@cnv-executor-qwang-cnv14-master-a9e382-1 ~]$ oc get pvc
NAME          STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
pvc-gluster   Bound     pvc-3cd5ac36-b793-11e9-a860-fa163e026b79   1Gi        RWO            glusterfs-storage   5m

3. Create a Pod referencing the PVC
$ cat pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: pod-g
spec:
  containers:
  - name: pod-test
    image: gcr.io/google_containers/busybox:1.24
    command:
      - "/bin/sh"
    args:
      - "-c"
      - "touch /mnt/SUCCESS && exit 0 || exit 1"
    volumeMounts:
      - name: g-pvc
        mountPath: "/mnt"
  restartPolicy: "OnFailure"
  volumes:
    - name: g-pvc
      persistentVolumeClaim:
        claimName: pvc-gluster

4. Check the Pod status/logs
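
For example (the describe command matches the output under Actual results; oc logs is the usual companion check, assumed here):
$ oc describe pod pod-g
$ oc logs pod-g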


Actual results:
4. [cloud-user@cnv-executor-qwang-cnv14-master-a9e382-1 ~]$ oc describe pod pod-g
Name:               pod-g
Namespace:          qwang2
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-qwang-cnv14-node-a9e382-1.example.com/172.16.0.29
Start Time:         Mon, 05 Aug 2019 11:13:37 -0400
Labels:             <none>
Annotations:        k8s.v1.cni.cncf.io/networks-status=
                    openshift.io/scc=anyuid
Status:             Succeeded
IP:                 10.130.0.22
Containers:
  pod-test:
    Container ID:  cri-o://ec7fbea43772096815f75b60d900ae4cb5fced0e4b4e82fbf385386c0c0e5b65
    Image:         gcr.io/google_containers/busybox:1.24
    Image ID:      gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      touch /mnt/SUCCESS && exit 0 || exit 1
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 05 Aug 2019 11:13:46 -0400
      Finished:     Mon, 05 Aug 2019 11:13:46 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt from g-pvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sfzc5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  g-pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-gluster
    ReadOnly:   false
  default-token-sfzc5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sfzc5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type     Reason                  Age   From                                                         Message
  ----     ------                  ----  ----                                                         -------
  Normal   Scheduled               21s   default-scheduler                                            Successfully assigned qwang2/pod-g to cnv-executor-qwang-cnv14-node-a9e382-1.example.com
  Normal   Pulled                  12s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Container image "gcr.io/google_containers/busybox:1.24" already present on machine
  Normal   Created                 12s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Created container
  Normal   Started                 12s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Started container
  Normal   SandboxChanged          11s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  1s    kubelet, cnv-executor-qwang-cnv14-node-a9e382-1.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-g_qwang2_99a32ac7-b793-11e9-a860-fa163e026b79_1(cd418d6c2d28e9a505b496e443e6fc365d5f0d5fdb019565544203d56bdea947): Multus: Err in tearing down failed plugins: cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/87891/ns/net": no such file or directory


Expected results:
The warning should not appear.


Additional info:
[cloud-user@cnv-executor-qwang-cnv14-master-a9e382-1 ~]$ oc get pods -o wide -n openshift-sdn
NAME        READY     STATUS    RESTARTS   AGE       IP            NODE                                                   NOMINATED NODE
ovs-mvmfh   1/1       Running   0          1h        172.16.0.14   cnv-executor-qwang-cnv14-node-a9e382-2.example.com     <none>
ovs-qcvhd   1/1       Running   0          1h        172.16.0.29   cnv-executor-qwang-cnv14-node-a9e382-1.example.com     <none>
ovs-zwswg   1/1       Running   0          1h        172.16.0.17   cnv-executor-qwang-cnv14-master-a9e382-1.example.com   <none>
sdn-5cs88   1/1       Running   0          1h        172.16.0.14   cnv-executor-qwang-cnv14-node-a9e382-2.example.com     <none>
sdn-846tp   1/1       Running   0          1h        172.16.0.29   cnv-executor-qwang-cnv14-node-a9e382-1.example.com     <none>
sdn-8ndpg   1/1       Running   0          1h        172.16.0.17   cnv-executor-qwang-cnv14-master-a9e382-1.example.com   <none>

Comment 1 Dan Kenigsberg 2019-08-05 16:36:32 UTC
Would you grant Sebastian credentials to your cluster (over email)?
Multus must keep working in CNV-1.4.1.

Comment 2 Sebastian Scheinkman 2019-08-05 17:51:26 UTC
I think this is a known issue in Multus that was fixed by using a local cache.

The problem is that the network namespace is removed before Multus runs the delete function.
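
As a quick illustration (not from the original report): once the sandbox process has exited, the same Statfs failure can be seen from a shell on the node, using the PID from the error message:
$ stat /proc/87891/ns/net
stat: cannot stat '/proc/87891/ns/net': No such file or directory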

I also see that the annotation

Annotations:        k8s.v1.cni.cncf.io/networks-status=

is empty, which is a bit weird.

Can you try to start a pod that doesn't finish immediately and check the annotation? (See the sketch after the timestamps below.)

Also, the start and finish times are equal; it looks like the pod exited too quickly for the teardown:

State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 05 Aug 2019 11:13:46 -0400
      Finished:     Mon, 05 Aug 2019 11:13:46 -0400
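
A minimal sketch of such a pod, assuming only the name and args change from the reporter's manifest (volume stanzas omitted for brevity; the loop is the one used in comment 3 below):

kind: Pod
apiVersion: v1
metadata:
  name: pod-g-3
spec:
  containers:
  - name: pod-test
    image: gcr.io/google_containers/busybox:1.24
    command:
      - "/bin/sh"
    args:
      - "-c"
      - "max=10; for i in `seq 2 $max`; do echo $i; sleep 5; done"
  restartPolicy: "OnFailure"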

Comment 3 Qixuan Wang 2019-08-06 05:19:28 UTC
A comparatively long-running command doesn't hit this problem.

I noticed that while the Pod was running, it had a network annotation:
k8s.v1.cni.cncf.io/networks-status=[{
    "name": "openshift.1",
    "ips": [
        "10.129.0.25"
    ],
    "default": true,
    "dns": {}
}]

Once the command finished and the Pod completed, it became k8s.v1.cni.cncf.io/networks-status= (empty).
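
One quick way to watch the annotation flip (a sketch; the pod name matches the output below):
$ oc get pod pod-g-3 -o yaml | grep networks-status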


Following is the same Pod as shown by `oc describe` while it was Running vs. after it Completed:
[cloud-user@cnv-executor-qwang-cnv14-master-a9e382-1 ~]$ oc describe pod pod-g-3
Name:               pod-g-3
Namespace:          qwang3
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-qwang-cnv14-node-a9e382-2.example.com/172.16.0.14
Start Time:         Tue, 06 Aug 2019 01:13:37 -0400
Labels:             <none>
Annotations:        k8s.v1.cni.cncf.io/networks-status=[{
    "name": "openshift.1",
    "ips": [
        "10.129.0.25"
    ],
    "default": true,
    "dns": {}
}]
         openshift.io/scc=anyuid
Status:  Running
IP:      10.129.0.25
Containers:
  pod-test:
    Container ID:  cri-o://ec7465913012d050bff4f30279bfefce401f493439f3f99ef46406740ea1ab15
    Image:         gcr.io/google_containers/busybox:1.24
    Image ID:      gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      max=10; for i in `seq 2 $max`; do echo $i; sleep 5; done
    State:          Running
      Started:      Tue, 06 Aug 2019 01:13:44 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt from g-pvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4qf2v (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  g-pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-gluster
    ReadOnly:   false
  default-token-4qf2v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4qf2v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type    Reason     Age   From                                                         Message
  ----    ------     ----  ----                                                         -------
  Normal  Scheduled  10s   default-scheduler                                            Successfully assigned qwang3/pod-g-3 to cnv-executor-qwang-cnv14-node-a9e382-2.example.com
  Normal  Pulled     1s    kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Container image "gcr.io/google_containers/busybox:1.24" already present on machine
  Normal  Created    1s    kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Created container
  Normal  Started    1s    kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Started container


[cloud-user@cnv-executor-qwang-cnv14-master-a9e382-1 ~]$ oc describe pod pod-g-3
Name:               pod-g-3
Namespace:          qwang3
Priority:           0
PriorityClassName:  <none>
Node:               cnv-executor-qwang-cnv14-node-a9e382-2.example.com/172.16.0.14
Start Time:         Tue, 06 Aug 2019 01:13:37 -0400
Labels:             <none>
Annotations:        k8s.v1.cni.cncf.io/networks-status=
                    openshift.io/scc=anyuid
Status:             Succeeded
IP:                 10.129.0.25
Containers:
  pod-test:
    Container ID:  cri-o://ec7465913012d050bff4f30279bfefce401f493439f3f99ef46406740ea1ab15
    Image:         gcr.io/google_containers/busybox:1.24
    Image ID:      gcr.io/google_containers/busybox@sha256:4bdd623e848417d96127e16037743f0cd8b528c026e9175e22a84f639eca58ff
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      max=10; for i in `seq 2 $max`; do echo $i; sleep 5; done
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 06 Aug 2019 01:13:44 -0400
      Finished:     Tue, 06 Aug 2019 01:14:29 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt from g-pvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4qf2v (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  g-pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-gluster
    ReadOnly:   false
  default-token-4qf2v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4qf2v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     <none>
Events:
  Type    Reason     Age   From                                                         Message
  ----    ------     ----  ----                                                         -------
  Normal  Scheduled  1m    default-scheduler                                            Successfully assigned qwang3/pod-g-3 to cnv-executor-qwang-cnv14-node-a9e382-2.example.com
  Normal  Pulled     58s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Container image "gcr.io/google_containers/busybox:1.24" already present on machine
  Normal  Created    58s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Created container
  Normal  Started    58s   kubelet, cnv-executor-qwang-cnv14-node-a9e382-2.example.com  Started container

Comment 4 Dan Kenigsberg 2019-08-06 06:43:50 UTC
Thanks, Qixuan and Sebastian.
Based on your report, I understand that this is a known multus/openshift bug that is solved in OCP4, so I'm closing this bug.

