Description of problem:
We ran a CNV performance test with 21 nodes and 1900 VMs. It seems that after a few days a node lost its connection to the remote NFS storage. The connection was later restored, but OCP did not reconnect the volumes and we now see "Stale file handle" errors.

Version-Release number of selected component (if applicable):
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-12-223649   True        False         6d20h   Cluster version is 4.6.0-0.nightly-2020-10-12-223649

How reproducible:
Always

Steps to Reproduce:
1. Connect a remote NFS storage.
2. Create a PVC on this NFS share.
3. Start a VMI that is attached to this PVC.
4. Disconnect and reconnect the NFS.

Actual results:
"Stale file handle" errors on the volume.

Expected results:
The volume should be reconnected.

Must-gather link:
https://drive.google.com/file/d/1dDAX3xn3t93wAt8nIA0dj332TGE2C7CI/view?usp=sharing

Additional info:
Detailed investigation of the issue: we ran the CNV performance test with 21 nodes and 1900 VMs, and after some time I started to see the issue below. A few VMIs were stuck in Scheduling:

[root@f12-h17-b07-5039ms ~]# oc get vmi | grep -v Run
NAME      AGE   PHASE        IP   NODENAME
vm-1883   11h   Scheduling
vm-1887   11h   Scheduling
vm-1890   11h   Scheduling
vm-1894   11h   Scheduling
vm-1898   10h   Scheduling

So I checked which nodes they failed on:

[root@f12-h17-b07-5039ms ~]# oc get vmi | grep -v Run | grep vm | awk '{print $1}' | xargs oc describe vmi | grep hostname | grep rdu2 | awk '{print $2}'
f12-h29-b08-5039ms.rdu2.scalelab.redhat.com
f12-h32-b04-5039ms.rdu2.scalelab.redhat.com
f12-h32-b07-5039ms.rdu2.scalelab.redhat.com

The problem with these nodes is that their kvm device counts dropped to 0. They were not 0 before: I ran 63 VMs on each of them a few days ago and yesterday tried to add 2 more:

[root@f12-h17-b07-5039ms ~]# oc get vmi | grep -v Run | grep vm | awk '{print $1}' | xargs oc describe vmi | grep hostname | grep rdu2 | awk '{print $2}' | xargs oc describe node | grep "devices.kubevirt.io/kvm"
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm   63   63
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm   63   63
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm:  0
  devices.kubevirt.io/kvm   63   63

The kvm lines are printed in this order: capacity, allocatable, allocated. So capacity and allocatable dropped to 0.

This is how a healthy node looks:

[root@f12-h17-b07-5039ms ~]# oc describe node f12-h29-b07-5039ms.rdu2.scalelab.redhat.com | grep "devices.kubevirt.io/kvm"
  devices.kubevirt.io/kvm:  250
  devices.kubevirt.io/kvm:  250
  devices.kubevirt.io/kvm   65

I then checked the state of the virt-handler pods, since the nodes get their device count from the virt-handler DaemonSet (where I configured max-devices to 250), and found 3 that are crashing:

[root@f12-h17-b07-5039ms ~]# oc get po -n openshift-cnv | grep virt-handler | grep -v "1/1"
virt-handler-gnjzw   0/1   Running            166   3d23h
virt-handler-tb89f   0/1   CrashLoopBackOff   143   11h
virt-handler-zdzx2   0/1   Running            164   3d23h

Then I checked which nodes they are running on:

[root@f12-h17-b07-5039ms ~]# oc get po -n openshift-cnv | grep virt-handler | grep -v "1/1" | awk '{print $1}' | xargs oc describe po -n openshift-cnv | grep Node:
Node:   f12-h32-b04-5039ms.rdu2.scalelab.redhat.com/10.1.36.90
Node:   f12-h29-b08-5039ms.rdu2.scalelab.redhat.com/10.1.37.220
Node:   f12-h32-b07-5039ms.rdu2.scalelab.redhat.com/10.1.36.41

These are exactly the nodes whose kvm counts went to 0.
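For reference, a quick way to confirm the allocatable kvm device count on a single node is a jsonpath query; the node name below is just one of the affected nodes from the output above:

oc get node f12-h32-b04-5039ms.rdu2.scalelab.redhat.com \
  -o jsonpath='{.status.allocatable.devices\.kubevirt\.io/kvm}{"\n"}'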
The virt-handler pods were OK before, because over the weekend 63 VMs started successfully on each of these nodes; they only began to fail afterwards.

I checked the virt-handler pod events:

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulled     161m (x131 over 3d23h)  kubelet  Container image "registry.redhat.io/container-native-virtualization/virt-handler@sha256:4f4efa082893fbb28f40023fc2a23a3e2cf91e522d52b2fd5c54144a6dda1676" already present on machine
  Normal   Created    15m (x162 over 3d23h)   kubelet  Created container virt-handler
  Warning  Unhealthy  6m6s (x1555 over 12h)   kubelet  Readiness probe failed: Get "https://10.128.6.8:8443/healthz": dial tcp 10.128.6.8:8443: connect: connection refused
  Warning  BackOff    63s (x1934 over 12h)    kubelet  Back-off restarting failed container

I also saw that kubectl lost connection with these nodes:

[root@f12-h17-b07-5039ms ~]# kubectl top nodes
NAME                                          CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
f12-h26-b08-5039ms.rdu2.scalelab.redhat.com   2651m        35%         41669Mi         65%
f12-h29-b01-5039ms.rdu2.scalelab.redhat.com   1478m        19%         6516Mi          10%
f12-h29-b02-5039ms.rdu2.scalelab.redhat.com   1587m        21%         6235Mi          9%
f12-h29-b03-5039ms.rdu2.scalelab.redhat.com   741m         9%          5275Mi          8%
f12-h29-b04-5039ms.rdu2.scalelab.redhat.com   836m         11%         4945Mi          7%
f12-h29-b05-5039ms.rdu2.scalelab.redhat.com   823m         10%         6216Mi          9%
f12-h29-b06-5039ms.rdu2.scalelab.redhat.com   793m         10%         5601Mi          8%
f12-h29-b07-5039ms.rdu2.scalelab.redhat.com   674m         8%          42949Mi         67%
f12-h32-b01-5039ms.rdu2.scalelab.redhat.com   5549m        73%         42859Mi         67%
f12-h32-b02-5039ms.rdu2.scalelab.redhat.com   625m         8%          42821Mi         67%
f12-h32-b03-5039ms.rdu2.scalelab.redhat.com   4834m        64%         42869Mi         67%
f12-h32-b05-5039ms.rdu2.scalelab.redhat.com   1826m        24%         42787Mi         67%
f12-h32-b06-5039ms.rdu2.scalelab.redhat.com   677m         9%          42942Mi         67%
f12-h32-b08-5039ms.rdu2.scalelab.redhat.com   719m         9%          42224Mi         66%
f13-h17-b01-5039ms.rdu2.scalelab.redhat.com   663m         8%          42294Mi         66%
f25-h01-000-r730xd.rdu2.scalelab.redhat.com   1679m        4%          87659Mi         68%
f25-h03-000-r730xd.rdu2.scalelab.redhat.com   1786m        4%          87926Mi         68%
f25-h05-000-r730xd.rdu2.scalelab.redhat.com   1736m        4%          87938Mi         68%
f25-h07-000-r730xd.rdu2.scalelab.redhat.com   4644m        11%         94268Mi         73%
f25-h09-000-r730xd.rdu2.scalelab.redhat.com   3178m        8%          94948Mi         74%
f25-h11-000-r730xd.rdu2.scalelab.redhat.com   3773m        9%          88239Mi         69%
f25-h13-000-r730xd.rdu2.scalelab.redhat.com   2347m        5%          89956Mi         70%
f25-h15-000-r730xd.rdu2.scalelab.redhat.com   2279m        5%          87878Mi         68%
f25-h17-000-r730xd.rdu2.scalelab.redhat.com   2359m        5%          89354Mi         69%
f12-h29-b08-5039ms.rdu2.scalelab.redhat.com   <unknown>    <unknown>   <unknown>       <unknown>
f12-h32-b04-5039ms.rdu2.scalelab.redhat.com   <unknown>    <unknown>   <unknown>       <unknown>
f12-h32-b07-5039ms.rdu2.scalelab.redhat.com   <unknown>    <unknown>   <unknown>       <unknown>

I checked the stats on one of the problematic nodes; this looks like the root cause:

[core@f12-h32-b04-5039ms ~]$ top
top - 09:51:18 up 5 days, 13:27, 1 user, load average: 32.10, 35.15, 35.71
Tasks: 961 total, 4 running, 957 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.0 us, 0.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.1 si, 0.0 st

So I checked the journal:

[core@f12-h32-b04-5039ms ~]$ journalctl -f | grep -i error
Oct 19 09:58:01 f12-h32-b04-5039ms.rdu2.scalelab.redhat.com hyperkube[1908]: I1019 09:58:01.100570 1908 volume_stat_calculator.go:112] Failed to calculate volume metrics for pod virt-launcher-vm-1537-rgpdx_default(34bf33c7-6178-447f-b604-8e066446e499) volume rootdisk: failed to get FsInfo due to error stale NFS file handle

[core@f12-h32-b04-5039ms ~]$ mount | grep nfs | grep 4bf33c7-6178-447f-b604-8e066446e499
10.9.96.21:/scaletest2/qe-scale-cnv25-cluster/pv1537 on /var/lib/kubelet/pods/34bf33c7-6178-447f-b604-8e066446e499/volumes/kubernetes.io~nfs/nfs-pv1537 type nfs4 (rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.1.36.90,local_lock=none,addr=10.9.96.21)

[core@f12-h32-b04-5039ms ~]$ sudo ls /var/lib/kubelet/pods/34bf33c7-6178-447f-b604-8e066446e499/volumes/kubernetes.io~nfs/nfs-pv1537
ls: cannot access '/var/lib/kubelet/pods/34bf33c7-6178-447f-b604-8e066446e499/volumes/kubernetes.io~nfs/nfs-pv1537': Stale file handle

So it looks like OCP does not reconnect the NFS mount after the connection is lost; the NFS export itself (on a NetApp) is up and all the other nodes are still connected to it.
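For reference, a stale NFSv4 mount like this one usually has to be cleared manually on the node by forcing an unmount and remounting. A rough sketch, using the paths and options from the mount output above (not a verified or supported workaround, just what one would try by hand):

[core@f12-h32-b04-5039ms ~]$ sudo umount -f -l /var/lib/kubelet/pods/34bf33c7-6178-447f-b604-8e066446e499/volumes/kubernetes.io~nfs/nfs-pv1537
[core@f12-h32-b04-5039ms ~]$ sudo mount -t nfs4 -o rw,hard,vers=4.0,timeo=600,retrans=2 \
    10.9.96.21:/scaletest2/qe-scale-cnv25-cluster/pv1537 \
    /var/lib/kubelet/pods/34bf33c7-6178-447f-b604-8e066446e499/volumes/kubernetes.io~nfs/nfs-pv1537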
Kubernetes / OCP does not monitor storage / mount health; its work ends with mount(). Applications (pods) are supposed to have their own readiness / liveness probes to ensure the application as a whole works, including its mounted storage volumes. Implementing a mount check in kubelet would be *very* complicated. You can file an RFE.
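To illustrate that suggestion, here is a minimal sketch of a pod-level liveness probe that fails once its NFS-backed volume goes stale. The image, mount path and the PVC name nfs-pv1537-claim are hypothetical, only loosely based on this report:

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nfs-probe-example
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
    livenessProbe:
      # stat returns non-zero (e.g. ESTALE) once the mount goes stale,
      # so kubelet reports the failure and restarts the container
      # (the node-level NFS mount itself would still need to be remounted).
      exec:
        command: ["stat", "/data"]
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: nfs-pv1537-claim   # hypothetical PVC bound to the NFS PV
EOF

This sketch uses a plain pod for simplicity; the point is that the workload consuming the volume, not kubelet, detects the broken mount.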
Jan, from what you say I understand why NFS-dependent workloads running on the node would stop performing. But why is it acceptable that the node becomes unknown? If the node stayed up, health probes could report the specific problematic workload and have it rescheduled; other workloads could keep running and new workloads could be added. Is there a way to address this bug other than mount health checks?
Nodes being <unknown> in "oc top nodes" is odd; it may not be related to the NFS issues at all. Do you have a cluster must-gather + kubelet logs from the node? Anything interesting in the kubelet logs? What does "oc describe node" say?
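For reference, the requested data can be collected with the standard commands; the node name below is just one of the affected nodes from this report:

oc adm must-gather --dest-dir=./must-gather
oc adm node-logs f12-h32-b04-5039ms.rdu2.scalelab.redhat.com -u kubelet > kubelet.log
oc describe node f12-h32-b04-5039ms.rdu2.scalelab.redhat.com > node-describe.txt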
This bug will not be fixed in the upcoming sprint.