OCP 4.4: Node Feature Discovery nfd-master and nfd-worker pods terminating in openshift-nfd namespace - FailedMount "Unable to attach or mount volumes" errors
Description of problem:
This is happening on an AWS m5.xlarge IPI OCP 4.4 rc2 cluster (3 master and 3 worker nodes). We deployed the Node Feature Discovery (NFD) operator and operand successfully in the openshift-nfd namespace. We then created a new machineset to add a GPU-enabled g3.4xlarge node with 1 NVIDIA GPU and ran a gpu-burn workload for 3-5 days. The GPU workload executes successfully and continuously, but we started to see nfd-worker and nfd-master pods terminating on one master node and on the GPU worker node.
# oc get pods -n openshift-nfd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nfd-master-f8htp 1/1 Running 0 5d12h 10.130.0.69 ip-10-0-158-168.us-west-2.compute.internal <none> <none>
nfd-master-r2wgk 0/1 Terminating 0 0s <none> ip-10-0-129-51.us-west-2.compute.internal <none> <none>
nfd-master-r48j9 1/1 Running 0 5d12h 10.129.0.49 ip-10-0-173-202.us-west-2.compute.internal <none> <none>
nfd-operator-5df4df5b67-nxx9v 1/1 Running 0 5d12h 10.128.0.30 ip-10-0-129-51.us-west-2.compute.internal <none> <none>
nfd-worker-ftgvk 0/1 Terminating 0 3s 10.0.134.51 ip-10-0-134-51.us-west-2.compute.internal <none> <none>
nfd-worker-ghw6x 1/1 Running 1 5d12h 10.0.129.161 ip-10-0-129-161.us-west-2.compute.internal <none> <none>
nfd-worker-pj7mh 1/1 Running 1 5d12h 10.0.171.52 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
nfd-worker-sg9nh 1/1 Running 1 5d12h 10.0.153.12 ip-10-0-153-12.us-west-2.compute.internal <none> <none>
In the openshift-nfd namespace events, we see errors:
179m Warning Failed pod/nfd-master-qkhsh Error: container create failed: time="2020-03-31T22:16:09Z" level=warning msg="exit status 1"
time="2020-03-31T22:16:09Z" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp\\\\\\\" to rootfs \\\\\\\"/var/lib/containers/storage/overlay/343b160b42df836d100e534e7884d1f2f4d00715c6effa871fbc7ecef16ed4ec/merged\\\\\\\" at \\\\\\\"/var/run/secrets/kubernetes.io/serviceaccount\\\\\\\" caused \\\\\\\"stat /var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp: no such file or directory\\\\\\\"\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp\\\" to rootfs \\\"/var/lib/containers/storage/overlay/343b160b42df836d100e534e7884d1f2f4d00715c6effa871fbc7ecef16ed4ec/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp: no such file or directory\\\"\""
<unknown> Normal Scheduled pod/nfd-master-qksjg Successfully assigned openshift-nfd/nfd-master-qksjg to ip-10-0-129-51.us-west-2.compute.internal
136m Normal Pulled pod/nfd-master-qksjg Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
136m Warning Failed pod/nfd-master-qksjg Error: cannot find volume "nfd-master-token-btjjp" to mount into container "nfd-master"
<unknown> Normal Scheduled pod/nfd-master-ql8qf Successfully assigned openshift-nfd/nfd-master-ql8qf to ip-10-0-129-51.us-west-2.compute.internal
126m Normal Pulled pod/nfd-master-ql8qf Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
126m Warning Failed pod/nfd-master-ql8qf Error: cannot find volume "nfd-master-token-btjjp" to mount into container "nfd-master"
Similar events for the nfd-worker pod:
<unknown> Normal Scheduled pod/nfd-worker-zz5rz Successfully assigned openshift-nfd/nfd-worker-zz5rz to ip-10-0-134-51.us-west-2.compute.internal
113m Normal Pulled pod/nfd-worker-zz5rz Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
113m Normal Created pod/nfd-worker-zz5rz Created container nfd-worker
113m Normal Started pod/nfd-worker-zz5rz Started container nfd-worker
<unknown> Normal Scheduled pod/nfd-worker-zzjhv Successfully assigned openshift-nfd/nfd-worker-zzjhv to ip-10-0-134-51.us-west-2.compute.internal
38m Warning FailedMount pod/nfd-worker-zzjhv Unable to attach or mount volumes: unmounted volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot], unattached volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot]: timed out waiting for the condition
31s Normal SuccessfulCreate daemonset/nfd-worker (combined from similar events): Created pod: nfd-worker-sf977
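For reference, the listings above were collected with commands along these lines (a sketch; the exact flags may have differed):
# list recent events in the affected namespace, newest last
oc get events -n openshift-nfd --sort-by='.lastTimestamp'
# inspect one of the terminating pods and its volume/mount conditions
oc describe pod nfd-worker-zzjhv -n openshift-nfd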
Version-Release number of selected component (if applicable):
Server Version: 4.4.0-rc.2
Kubernetes Version: v1.17.1
How reproducible:
Happened once on this multi-day test
Steps to Reproduce:
1. IPI install of OCP 4.4.0-rc.2 on AWS, 3 master and 3 worker nodes, m5.xlarge
2. Deploy NFD from GitHub
cd $GOPATH/src/github.com/openshift
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
git checkout release-4.4
PULLPOLICY=Always make deploy
oc get pods -n openshift-nfd
oc describe node | grep feature
3. Create a new machineset to add a g3.4xlarge GPU-enabled worker node to one of the worker zones (a command sketch follows this step)
Make sure the new node gets labeled with feature.node.kubernetes.io/pci-10de.present=true
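The machineset was created in the usual way, by copying an existing worker machineset; the sketch below assumes that approach, and the machineset and file names are placeholders rather than the exact ones used:
# list the existing worker machinesets
oc get machineset -n openshift-machine-api
# copy one as a template for the GPU machineset (placeholder name)
oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# edit gpu-machineset.yaml: give it a new .metadata.name and set
# .spec.template.spec.providerSpec.value.instanceType to g3.4xlarge
oc apply -f gpu-machineset.yaml
# once the node joins and NFD labels it, verify the PCI vendor label is present
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true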
4. Deploy Special Resource Operator:
cd $GOPATH/src/github.com/openshift-psap
git clone https://github.com/openshift-psap/special-resource-operator.git
cd special-resource-operator
make deploy
oc get pods -n openshift-sro
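As a sanity check (a sketch, assuming the NVIDIA device plugin's default extended resource name), the GPU node should advertise nvidia.com/gpu once the driver stack deployed by SRO is up:
# the GPU node should report the nvidia.com/gpu extended resource in its capacity/allocatable
oc describe node -l feature.node.kubernetes.io/pci-10de.present=true | grep -i nvidia.com/gpu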
5. Run GPU workload for 3-5 days
oc new-project test-gpu-burn
oc create -f gpu-burn.yaml
gpu-burn.yaml content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-burn-entrypoint
data:
  entrypoint.sh: |-
    #!/bin/bash
    NUM_GPUS=$(nvidia-smi -L | wc -l)
    if [ $NUM_GPUS -eq 0 ]; then
      echo "ERROR No GPUs found"
      exit 1
    fi
    /usr/local/bin/gpu-burn 300
    if [ ! $? -eq 0 ]; then
      exit 1
    fi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: gpu-burn-daemonset
  name: gpu-burn-daemonset
spec:
  selector:
    matchLabels:
      app: gpu-burn-daemonset
  template:
    metadata:
      labels:
        app: gpu-burn-daemonset
    spec:
      tolerations:
      - operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: quay.io/openshift-psap/gpu-burn
        imagePullPolicy: Always
        name: gpu-burn-ctr
        command: ["/bin/entrypoint.sh"]
        volumeMounts:
        - name: entrypoint
          mountPath: /bin/entrypoint.sh
          readOnly: true
          subPath: entrypoint.sh
      volumes:
      - name: entrypoint
        configMap:
          defaultMode: 0700
          name: gpu-burn-entrypoint
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        feature.node.kubernetes.io/pci-10de.present: "true"
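To confirm the workload kept running, checks along these lines were used (a sketch):
# the daemonset should schedule exactly one pod, on the GPU node
oc get pods -n test-gpu-burn -o wide
# spot-check the gpu-burn output from the daemonset's pods
oc logs -n test-gpu-burn -l app=gpu-burn-daemonset --tail=20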
6. After 4-5 days, we started seeing nfd-master and nfd-worker pods terminating and being re-spun.
Actual results:
The NFD operator and operand pods deployed successfully, and SRO deployed the NVIDIA drivers successfully.
The gpu-burn workload executed successfully on the GPU node for several days, and we could see GPU metrics in the Grafana dashboard and GPU alerts in the OCP console Monitoring tab.
After 4-5 days, we started seeing nfd-master and nfd-worker pods terminating and being re-deployed.
Events show pods with FailedMount errors:
38m Warning FailedMount pod/nfd-worker-zzjhv Unable to attach or mount volumes: unmounted volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot], unattached volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot]: timed out waiting for the condition
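A diagnostic sketch (assumed commands, not part of the original reproduction) for checking whether the service-account token secret named in the mount error still exists:
# the FailedMount error references this token secret; check whether it is still present
oc get secret nfd-worker-token-rgpm6 -n openshift-nfd
# list all token secrets in the namespace for comparison
oc get secrets -n openshift-nfd | grep token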
Expected results:
No nfd-master or nfd-worker pods should be terminating; they should be running the entire time, and we should not be seeing FailedMount errors.
Additional info:
Link to must-gather, node logs, and oc logs in next comment
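Those logs were gathered with commands along these lines (a sketch; exact invocations may have differed, and the daemonset names for the operand logs are assumed):
# cluster-wide diagnostic bundle
oc adm must-gather
# kubelet journal from the two nodes where the pods were terminating
oc adm node-logs ip-10-0-134-51.us-west-2.compute.internal -u kubelet > kubelet-gpu-worker.log
oc adm node-logs ip-10-0-129-51.us-west-2.compute.internal -u kubelet > kubelet-master.log
# NFD operand logs (assuming daemonsets named nfd-worker and nfd-master)
oc logs -n openshift-nfd ds/nfd-worker
oc logs -n openshift-nfd ds/nfd-master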