Bug 1681175
Summary: | [Scale out testing] VMs fail when more than 2500 VMIs are up on the system | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | guy chen <guchen>
Component: | Virtualization | Assignee: | Roman Mohr <rmohr>
Status: | CLOSED CURRENTRELEASE | QA Contact: | guy chen <guchen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 1.4 | CC: | cnv-qe-bugs, fdeutsch, fgarciad, guchen, ipinto, ncredi, pousley, rmohr, sgordon, sgott
Target Milestone: | --- | Keywords: | Performance
Target Release: | 2.1.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | hco-bundle-registry-container-v2.1.0-34 | Doc Type: | Known Issue
Doc Text: |
Cause: The kubelet crashes or is restarted.
Consequence: The kubelet reports zero (0) kvm devices, and VMs are not scheduled on the affected node.
$ oc get node $NODE | grep devices.kubevirt
devices.kubevirt.io/kvm: 0
Workaround (if any): Kill the relevant virt-handler pod on the affected node. It is then started again automatically.
Result: The kubelet starts reporting the correct number of available kvm devices again:
devices.kubevirt.io/kvm: 100
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2019-11-04 15:08:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Description
guy chen
2019-02-25 16:02:57 UTC
Added logs of app-node-12 showing that devices.kubevirt.io went from 110 to 0, collected with `oc get event | grep app-node-12.scale-ci.example.com`.

Hostname: app-node-12.scale-ci.example.com
devices.kubevirt.io/kvm: 0

Guy, in the last comment: how many VMs were running on app-node-12.scale-ci.example.com when it had 0 kvm devices?

I cannot say exactly, but I estimate around 30-40 VMs according to when it failed.

It looks like it is not a load-related issue. With no VMs on the system, the KVM capacity of more and more nodes changes to 0. Attached is the virt-handler log of app-node-2; I see the errors "server was unable to return a response in the time allotted" there. According to the issue below, could network issues in the lab be causing the error? https://github.com/kubernetes/kubernetes/issues/60957

Guy, I see in the environment that on some nodes virt-handler restarted one or multiple times. Is there a correlation between these virt-handlers and the nodes where you see the capacity of zero? Also, did the kubelet restart on the nodes with KVM capacity 0? There can be a bug in the device plugin manager (like this one: https://github.com/kubernetes/kubernetes/issues/62773), or a bug in the kubevirt device plugins. Further, I investigated these warnings:

> {"component":"virt-handler","level":"warning","msg":"kubevirt.io/kubevirt/pkg/virt-handler/vm.go:514: watch of *v1.VirtualMachineInstance ended with: The resourceVersion for the provided watch is too old.","pos":"reflector.go:341","timestamp":"2019-02-28T07:23:13.502518Z"}

They only appear every now and then (minutes in between). That seems to be pretty normal and not a concern if it does not happen frequently: https://github.com/kubernetes/kubernetes/issues/22024

There is no correlation between the restarts of virt-handler and zero kvm capacity. All nodes except app-node-44.scale-ci.example.com currently have zero kvm capacity, yet only 40 out of 122 virt-handler pods were restarted:

    root@master-0: ~ # oc get pods --all-namespaces -o wide | grep virt-handler | grep -v "0 8d" | grep -c virt-handler
    40
    root@master-0: ~ # oc get node | grep -c app-node
    122

Guy, thanks for checking. I think I found it. It is related to kubelet restarts. To quote the Kubernetes manual at https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-implementation:

> A device plugin is expected to detect kubelet restarts and re-register itself with the new kubelet instance. In the current implementation, a new kubelet instance deletes all the existing Unix sockets under /var/lib/kubelet/device-plugins when it starts. A device plugin can monitor the deletion of its Unix socket and re-register itself upon such an event.

virt-handler does not re-register itself.

I cannot find the kubelet log, not in /var/log/kubelet.log or in the journal; can you share how to obtain it? Also, no load was on the system, so I wonder why the kubelet restarts...

Pan, can we add a known issue for this bug to CNV 1.4 (including the workaround)? Details are provided in "Doc Text".

Fix is here: https://github.com/kubevirt/kubevirt/pull/2068
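For illustration, here is a minimal sketch of the re-registration pattern the Kubernetes device-plugin docs quote above describes: watch for the deletion of the plugin's own socket under /var/lib/kubelet/device-plugins and register again when it disappears. This is not the actual virt-handler code from the PR above; it assumes the fsnotify library, and the socket name and `registerWithKubelet` helper are hypothetical.

```go
// Sketch of the kubelet-restart detection pattern for device plugins:
// a restarting kubelet deletes every socket under socketDir, so the
// removal of our own socket is the signal to re-register.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const socketDir = "/var/lib/kubelet/device-plugins"
const pluginSocket = "kubevirt-kvm.sock" // hypothetical socket name

func registerWithKubelet() {
	// Placeholder: dial the kubelet's registration socket, call the
	// device-plugin Registration gRPC API, and (re)serve pluginSocket.
	log.Printf("registering %s with the kubelet", pluginSocket)
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add(socketDir); err != nil {
		log.Fatal(err)
	}

	registerWithKubelet()

	for {
		select {
		case event := <-watcher.Events:
			// Our socket vanished: the kubelet restarted, re-register.
			if event.Op&fsnotify.Remove == fsnotify.Remove &&
				filepath.Base(event.Name) == pluginSocket {
				registerWithKubelet()
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```

The key design point, per the docs quoted above, is that socket deletion is a reliable restart signal because the new kubelet instance always wipes the directory on startup.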
> I cannot find the kubelet log, not in /var/log/kubelet.log or in the journal; can you share how to obtain it?
> Also, no load was on the system, so I wonder why the kubelet restarts...
@Guy, yes there is nothing specific in the logs. I could finally connect the dots because of this:
* The virt-handler logs just showed that it started the device plugins
* The device-plugin socket files were gone (that happens when the kubelet starts again)
* The kubelet reported 0 devices
* When you restart virt-handler, the socket files reappear and the devices are offered again.
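For completeness, the check behind `oc get node $NODE | grep devices.kubevirt` can also be done programmatically. A minimal client-go sketch, assuming in-cluster configuration and RBAC permission to read nodes (the setup is illustrative, not part of the original report):

```go
// Read a node's allocatable devices.kubevirt.io/kvm count; a healthy
// device plugin shows a non-zero value, an affected node shows 0.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodeName := os.Args[1]
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Extended resources registered by device plugins show up in
	// the node's Allocatable resource list.
	kvm := node.Status.Allocatable[corev1.ResourceName("devices.kubevirt.io/kvm")]
	fmt.Printf("%s devices.kubevirt.io/kvm: %s\n", nodeName, kvm.String())
}
```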
Hi Fabian, thanks for the info. Can you review the following text?

    * If the kubelet on a node crashes or restarts, this causes the kubelet to incorrectly report 0 KVM devices. Virtual machines are not properly scheduled on affected nodes.
    +
    Verify the number of devices that the kubelet reports by running `$ oc get node $NODE | grep devices.kubevirt`. The output on an affected node shows `devices.kubevirt.io/kvm: 0`. https://bugzilla.redhat.com/show_bug.cgi?id=1681175[(BZ#1681175)]
    +
    [NOTE]
    ====
    As a workaround, kill the relevant `virt-handler` pod on the affected node. The pod automatically restarts, and the kubelet reports the correct number of available KVM devices.
    ====

Passing the needinfo on to Roman. Roman, is this correct?

Looks correct.

Known Issue merged: https://github.com/openshift/openshift-docs/pull/13989
It is published at the bottom of the release notes here: https://docs.openshift.com/container-platform/3.11/cnv_release_notes/cnv_release_notes.html
Let me know if anything else is needed. Thanks!

This was tested by restarting the node and verifying that the KVM devices stayed at 110 and did not drop to zero.

Versions tested:
virt-operator:v2.1.0-12
hyperconverged-cluster-operator:v2.1.0-16
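As a companion to the documented workaround, here is a hedged client-go sketch that deletes the virt-handler pod on an affected node so that its DaemonSet recreates it. The namespace and the `kubevirt.io=virt-handler` label selector are assumptions and must be adjusted to the actual installation:

```go
// Delete the virt-handler pod on one node; the DaemonSet controller
// starts a replacement, which re-registers the device plugin.
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	nodeName := os.Args[1]
	namespace := "kubevirt" // assumption; may be e.g. openshift-cnv

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Find the virt-handler pod running on the affected node.
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "kubevirt.io=virt-handler", // assumed label
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Deleting the pod is enough; no further action is required.
	for _, pod := range pods.Items {
		log.Printf("deleting pod %s/%s", namespace, pod.Name)
		if err := client.CoreV1().Pods(namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
	}
}
```

In practice, the `oc delete pod` one-liner from the known issue is simpler; the sketch only shows how the node-scoped lookup works.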