OCP 4.4: Node Feature Discovery nfd-master and nfd-worker pods terminating in openshift-nfd namespace - FailedMount "Unable to attach or mount volumes" errors
Description of problem:
This is happening on an AWS m5.xlarge IPI OCP 4.4 rc2 cluster (3 master and 3 worker nodes). We deployed the Node Feature Discovery (NFD) operator and operand successfully in the openshift-nfd namespace. We then created a new machineset to add a GPU-enabled g3.4xlarge node with 1 NVIDIA GPU and ran a gpu-burn workload for 3-5 days. The GPU workload executes successfully and continuously, but we started to see nfd-worker and nfd-master pods terminating on one master node and on the GPU worker node.
# oc get pods -n openshift-nfd -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nfd-master-f8htp 1/1 Running 0 5d12h 10.130.0.69 ip-10-0-158-168.us-west-2.compute.internal <none> <none>
nfd-master-r2wgk 0/1 Terminating 0 0s <none> ip-10-0-129-51.us-west-2.compute.internal <none> <none>
nfd-master-r48j9 1/1 Running 0 5d12h 10.129.0.49 ip-10-0-173-202.us-west-2.compute.internal <none> <none>
nfd-operator-5df4df5b67-nxx9v 1/1 Running 0 5d12h 10.128.0.30 ip-10-0-129-51.us-west-2.compute.internal <none> <none>
nfd-worker-ftgvk 0/1 Terminating 0 3s 10.0.134.51 ip-10-0-134-51.us-west-2.compute.internal <none> <none>
nfd-worker-ghw6x 1/1 Running 1 5d12h 10.0.129.161 ip-10-0-129-161.us-west-2.compute.internal <none> <none>
nfd-worker-pj7mh 1/1 Running 1 5d12h 10.0.171.52 ip-10-0-171-52.us-west-2.compute.internal <none> <none>
nfd-worker-sg9nh 1/1 Running 1 5d12h 10.0.153.12 ip-10-0-153-12.us-west-2.compute.internal <none> <none>
In the openshift-nfd namespace events, we see errors:
179m Warning Failed pod/nfd-master-qkhsh Error: container create failed: time="2020-03-31T22:16:09Z" level=warning msg="exit status 1"
time="2020-03-31T22:16:09Z" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp\\\\\\\" to rootfs \\\\\\\"/var/lib/containers/storage/overlay/343b160b42df836d100e534e7884d1f2f4d00715c6effa871fbc7ecef16ed4ec/merged\\\\\\\" at \\\\\\\"/var/run/secrets/kubernetes.io/serviceaccount\\\\\\\" caused \\\\\\\"stat /var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp: no such file or directory\\\\\\\"\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp\\\" to rootfs \\\"/var/lib/containers/storage/overlay/343b160b42df836d100e534e7884d1f2f4d00715c6effa871fbc7ecef16ed4ec/merged\\\" at \\\"/var/run/secrets/kubernetes.io/serviceaccount\\\" caused \\\"stat /var/lib/kubelet/pods/2b358b43-f065-48e8-b028-a3fadbf428a8/volumes/kubernetes.io~secret/nfd-master-token-btjjp: no such file or directory\\\"\""
<unknown> Normal Scheduled pod/nfd-master-qksjg Successfully assigned openshift-nfd/nfd-master-qksjg to ip-10-0-129-51.us-west-2.compute.internal
136m Normal Pulled pod/nfd-master-qksjg Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
136m Warning Failed pod/nfd-master-qksjg Error: cannot find volume "nfd-master-token-btjjp" to mount into container "nfd-master"
<unknown> Normal Scheduled pod/nfd-master-ql8qf Successfully assigned openshift-nfd/nfd-master-ql8qf to ip-10-0-129-51.us-west-2.compute.internal
126m Normal Pulled pod/nfd-master-ql8qf Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
126m Warning Failed pod/nfd-master-ql8qf Error: cannot find volume "nfd-master-token-btjjp" to mount into container "nfd-master"
Similar events for the nfd-worker pod:
<unknown> Normal Scheduled pod/nfd-worker-zz5rz Successfully assigned openshift-nfd/nfd-worker-zz5rz to ip-10-0-134-51.us-west-2.compute.internal
113m Normal Pulled pod/nfd-worker-zz5rz Container image "quay.io/zvonkok/node-feature-discovery:v4.2" already present on machine
113m Normal Created pod/nfd-worker-zz5rz Created container nfd-worker
113m Normal Started pod/nfd-worker-zz5rz Started container nfd-worker
<unknown> Normal Scheduled pod/nfd-worker-zzjhv Successfully assigned openshift-nfd/nfd-worker-zzjhv to ip-10-0-134-51.us-west-2.compute.internal
38m Warning FailedMount pod/nfd-worker-zzjhv Unable to attach or mount volumes: unmounted volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot], unattached volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot]: timed out waiting for the condition
31s Normal SuccessfulCreate daemonset/nfd-worker (combined from similar events): Created pod: nfd-worker-sf977
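For reference, the listings above were collected with commands along these lines (a sketch; the exact flags may have differed):
# list recent events in the affected namespace, newest last
oc get events -n openshift-nfd --sort-by='.lastTimestamp'
# inspect one of the terminating pods and its volume/mount conditions
oc describe pod nfd-worker-zzjhv -n openshift-nfd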
Version-Release number of selected component (if applicable):
Server Version: 4.4.0-rc.2
Kubernetes Version: v1.17.1
How reproducible:
Happened once on this multi-day test
Steps to Reproduce:
1. IPI install of OCP 4.4.0-rc.2 on AWS, 3 master and 3 worker nodes, m5.xlarge
2. Deploy NFD from GitHub
cd $GOPATH/src/github.com/openshift
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
git checkout release-4.4
PULLPOLICY=Always make deploy
oc get pods -n openshift-nfd
oc describe node | grep feature
3. Create a new machineset to add a g3.4xlarge GPU-enabled worker node to one of the worker zones (a command sketch follows this step)
Make sure the new node gets labeled with feature.node.kubernetes.io/pci-10de.present=true
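The machineset was created in the usual way, by copying an existing worker machineset; the sketch below assumes that approach, and the machineset and file names are placeholders rather than the exact ones used:
# list the existing worker machinesets
oc get machineset -n openshift-machine-api
# copy one as a template for the GPU machineset (placeholder name)
oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# edit gpu-machineset.yaml: give it a new .metadata.name and set
# .spec.template.spec.providerSpec.value.instanceType to g3.4xlarge
oc apply -f gpu-machineset.yaml
# once the node joins and NFD labels it, verify the PCI vendor label is present
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true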
4. Deploy Special Resource Operator:
cd $GOPATH/src/github.com/openshift-psap
git clone https://github.com/openshift-psap/special-resource-operator.git
cd special-resource-operator
make deploy
oc get pods -n openshift-sro
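As a sanity check (a sketch, assuming the NVIDIA device plugin's default extended resource name), the GPU node should advertise nvidia.com/gpu once the driver stack deployed by SRO is up:
# the GPU node should report the nvidia.com/gpu extended resource in its capacity/allocatable
oc describe node -l feature.node.kubernetes.io/pci-10de.present=true | grep -i nvidia.com/gpu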
5. Run GPU workload for 3-5 days
oc new-project test-gpu-burn
oc create -f gpu-burn.yaml
gpu-burn.yaml content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-burn-entrypoint
data:
  entrypoint.sh: |-
    #!/bin/bash
    NUM_GPUS=$(nvidia-smi -L | wc -l)
    if [ $NUM_GPUS -eq 0 ]; then
      echo "ERROR No GPUs found"
      exit 1
    fi
    /usr/local/bin/gpu-burn 300
    if [ ! $? -eq 0 ]; then
      exit 1
    fi
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: gpu-burn-daemonset
  name: gpu-burn-daemonset
spec:
  selector:
    matchLabels:
      app: gpu-burn-daemonset
  template:
    metadata:
      labels:
        app: gpu-burn-daemonset
    spec:
      tolerations:
      - operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: quay.io/openshift-psap/gpu-burn
        imagePullPolicy: Always
        name: gpu-burn-ctr
        command: ["/bin/entrypoint.sh"]
        volumeMounts:
        - name: entrypoint
          mountPath: /bin/entrypoint.sh
          readOnly: true
          subPath: entrypoint.sh
      volumes:
      - name: entrypoint
        configMap:
          defaultMode: 0700
          name: gpu-burn-entrypoint
      nodeSelector:
        node-role.kubernetes.io/worker: ""
        feature.node.kubernetes.io/pci-10de.present: "true"
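To confirm the workload kept running, checks along these lines were used (a sketch):
# the daemonset should schedule exactly one pod, on the GPU node
oc get pods -n test-gpu-burn -o wide
# spot-check the gpu-burn output from the daemonset's pods
oc logs -n test-gpu-burn -l app=gpu-burn-daemonset --tail=20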
6. After 4-5 days, we started seeing nfd-master and nfd-worker pods terminating and being re-spun.
Actual results:
The NFD operator and operand pods deployed successfully, and SRO deployed the NVIDIA drivers successfully.
The gpu-burn workload executed successfully on the GPU node for several days, and we could see GPU metrics in the Grafana dashboard and GPU alerts in the OCP console Monitoring tab.
After 4-5 days, we started seeing nfd-master and nfd-worker pods terminating and being re-deployed.
Events show pods with FailedMount errors:
38m Warning FailedMount pod/nfd-worker-zzjhv Unable to attach or mount volumes: unmounted volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot], unattached volumes=[host-os-release host-sys config nfd-hooks nfd-features nfd-worker-token-rgpm6 host-boot]: timed out waiting for the condition
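A diagnostic sketch (assumed commands, not part of the original reproduction) for checking whether the service-account token secret named in the mount error still exists:
# the FailedMount error references this token secret; check whether it is still present
oc get secret nfd-worker-token-rgpm6 -n openshift-nfd
# list all token secrets in the namespace for comparison
oc get secrets -n openshift-nfd | grep token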
Expected results:
No nfd-master or nfd-worker pods should be terminating; they should be running the entire time, and we should not be seeing FailedMount errors.
Additional info:
Link to must-gather, node logs, and oc logs in next comment
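Those logs were gathered with commands along these lines (a sketch; exact invocations may have differed, and the daemonset names for the operand logs are assumed):
# cluster-wide diagnostic bundle
oc adm must-gather
# kubelet journal from the two nodes where the pods were terminating
oc adm node-logs ip-10-0-134-51.us-west-2.compute.internal -u kubelet > kubelet-gpu-worker.log
oc adm node-logs ip-10-0-129-51.us-west-2.compute.internal -u kubelet > kubelet-master.log
# NFD operand logs (assuming daemonsets named nfd-worker and nfd-master)
oc logs -n openshift-nfd ds/nfd-worker
oc logs -n openshift-nfd ds/nfd-master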