Bug 1889734
| Summary: | Nodes become unavailable during CNV performance tests | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | guy chen <guchen> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, danken, fdeutsch, jokerman, jsafrane, pkliczew |
| Version: | 4.6.z | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-10 15:09:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
guy chen
2020-10-20 13:31:38 UTC
Kubernetes / OCP does not monitor storage / mount health; its work ends at mount(). Applications (pods) are supposed to have their own readiness / liveness probes to ensure the application as a whole works, including its mounted storage volumes. Implementing a mount health check in the kubelet would be *very* complicated. You can file an RFE.

Jan, from what you say I understand why an NFS-dependent workload running on the node would stop performing. But why is it acceptable that the node becomes unknown? If the node stayed up, health probes could report the specific problematic workload and have it rescheduled; other workloads could keep running; new workloads could be added. Is there a way to address this bug other than mount health checks?

Nodes showing as <unknown> in "oc top nodes" is odd; it may not be related to NFS issues at all. Do you have a cluster must-gather and kubelet logs from the node? Is there anything interesting in the kubelet logs? What does "oc describe node" say?

This bug will not be fixed in the upcoming sprint.
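For illustration, a minimal sketch of the kind of per-pod liveness probe suggested above, assuming a workload that mounts an NFS export at /data; the pod name, image, NFS server, and paths are hypothetical. A stat of the mount point errors or exceeds the probe timeout when the NFS server stops responding, so the failure surfaces at the pod level instead of the workload hanging silently.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nfs-workload          # hypothetical example names throughout
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: data
      mountPath: /data
    livenessProbe:
      exec:
        # stat hangs or errors when the NFS server is unreachable;
        # timeoutSeconds bounds how long the probe may take
        command: ["stat", "/data"]
      initialDelaySeconds: 30
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
  volumes:
  - name: data
    nfs:
      server: nfs.example.com
      path: /exports/data
```

This only reports the health of the pod's own mount; it does not address the node going <unknown>, which is the separate question raised above.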
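The data requested above can be gathered with standard OpenShift 4.x client commands; a sketch, assuming cluster-admin access and that the affected node name is known (the node name is a placeholder):

```shell
# Cluster-wide diagnostic archive for the bug report
oc adm must-gather

# Node conditions, taints, and resource pressure for the affected node
oc describe node <affected-node>

# Kubelet journal from the affected node
oc adm node-logs <affected-node> -u kubelet
```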