Bug 1681175
Summary: | [Scale out testing] VMs fail when more than 2500 VMIs are up on the system | |
---|---|---|---
Product: | Container Native Virtualization (CNV) | Reporter: | guy chen <guchen>
Component: | Virtualization | Assignee: | Roman Mohr <rmohr>
Status: | CLOSED CURRENTRELEASE | QA Contact: | guy chen <guchen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 1.4 | CC: | cnv-qe-bugs, fdeutsch, fgarciad, guchen, ipinto, ncredi, pousley, rmohr, sgordon, sgott
Target Milestone: | --- | Keywords: | Performance
Target Release: | 2.1.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | hco-bundle-registry-container-v2.1.0-34 | Doc Type: | Known Issue
Doc Text: |
Cause: The kubelet crashes or is restarted.
Consequence: The kubelet reports zero (0) kvm devices, and VMs are not scheduled on the affected node.
$ oc get node $NODE | grep devices.kubevirt
devices.kubevirt.io/kvm: 0
Workaround (if any): Kill the relevant virt-handler pod on the affected node. It is then started again automatically.
Result: The kubelet starts reporting the correct number of available kvm devices again:
devices.kubevirt.io/kvm: 100
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2019-11-04 15:08:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Description
guy chen
2019-02-25 16:02:57 UTC
Added logs of app-node-12 showing that devices.kubevirt.io went from 110 to 0, collected with `oc get event | grep app-node-12.scale-ci.example.com`.

Hostname: app-node-12.scale-ci.example.com
devices.kubevirt.io/kvm: 0

Guy, in the last comment: how many VMs were running on app-node-12.scale-ci.example.com when it had 0 kvm devices?

I cannot say exactly, but I estimate around 30-40 VMs according to when it failed.

It looks like it is not a load-related issue. With no VMs on the system, the KVM capacity of more and more nodes changes to 0. Attached is the virt-handler log of app-node-2; I see the errors "server was unable to return a response in the time allotted" there. According to the issue below, could network issues in the lab be causing the error? https://github.com/kubernetes/kubernetes/issues/60957

Guy, I see in the environment that on some nodes virt-handler restarted one or multiple times. Is there a correlation between these virt-handlers and the nodes where you see the capacity of zero? Also, did the kubelet restart on the nodes with KVM capacity 0? There can be a bug in the device plugin manager (like this one: https://github.com/kubernetes/kubernetes/issues/62773), or a bug in the kubevirt device plugins. Further, I investigated these warnings:

> {"component":"virt-handler","level":"warning","msg":"kubevirt.io/kubevirt/pkg/virt-handler/vm.go:514: watch of *v1.VirtualMachineInstance ended with: The resourceVersion for the provided watch is too old.","pos":"reflector.go:341","timestamp":"2019-02-28T07:23:13.502518Z"}

They only appear every now and then (minutes in between). That seems to be pretty normal and not a concern if it does not happen frequently: https://github.com/kubernetes/kubernetes/issues/22024

There is no correlation between the restarts of virt-handler and zero kvm capacity. All nodes except app-node-44.scale-ci.example.com currently have zero kvm capacity, yet only 40 out of 122 virt-handler pods were restarted:

    root@master-0: ~ # oc get pods --all-namespaces -o wide | grep virt-handler | grep -v "0 8d" | grep -c virt-handler
    40
    root@master-0: ~ # oc get node | grep -c app-node
    122

Guy, thanks for checking. I think I found it. It is related to kubelet restarts. To quote the Kubernetes manual at https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-implementation:

> A device plugin is expected to detect kubelet restarts and re-register itself with the new kubelet instance. In the current implementation, a new kubelet instance deletes all the existing Unix sockets under /var/lib/kubelet/device-plugins when it starts. A device plugin can monitor the deletion of its Unix socket and re-register itself upon such an event.

virt-handler does not re-register itself.

I cannot find the kubelet log, not in /var/log/kubelet.log or in the journal; can you share how to obtain it? Also, no load was on the system, so I wonder why the kubelet restarts...

Pan, can we add a known issue for this bug to CNV 1.4 (including the workaround)? Details are provided in "Doc Text".

Fix is here: https://github.com/kubevirt/kubevirt/pull/2068
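For illustration, here is a minimal sketch of the re-registration pattern the Kubernetes device-plugin docs quote above describes: watch for the deletion of the plugin's own socket under /var/lib/kubelet/device-plugins and register again when it disappears. This is not the actual virt-handler code from the PR above; it assumes the fsnotify library, and the socket name and `registerWithKubelet` helper are hypothetical.

```go
// Sketch of the kubelet-restart detection pattern for device plugins:
// a restarting kubelet deletes every socket under socketDir, so the
// removal of our own socket is the signal to re-register.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const socketDir = "/var/lib/kubelet/device-plugins"
const pluginSocket = "kubevirt-kvm.sock" // hypothetical socket name

func registerWithKubelet() {
	// Placeholder: dial the kubelet's registration socket, call the
	// device-plugin Registration gRPC API, and (re)serve pluginSocket.
	log.Printf("registering %s with the kubelet", pluginSocket)
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add(socketDir); err != nil {
		log.Fatal(err)
	}

	registerWithKubelet()

	for {
		select {
		case event := <-watcher.Events:
			// Our socket vanished: the kubelet restarted, re-register.
			if event.Op&fsnotify.Remove == fsnotify.Remove &&
				filepath.Base(event.Name) == pluginSocket {
				registerWithKubelet()
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```

The key design point, per the docs quoted above, is that socket deletion is a reliable restart signal because the new kubelet instance always wipes the directory on startup.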
> I cannot find the kubelet log, not in /var/log/kubelet.log or in the journal; can you share how to obtain it?
> Also, no load was on the system, so I wonder why the kubelet restarts...
@Guy, yes there is nothing specific in the logs. I could finally connect the dots because of this:
* The virt-handler logs just showed that it started the device plugins
* The device-plugin socket files were gone (that happens when the kubelet starts again)
* The kubelet reported 0 devices
* When you restart virt-handler, the socket files reappear and the devices are offered again.
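For completeness, the check behind `oc get node $NODE | grep devices.kubevirt` can also be done programmatically. A minimal client-go sketch, assuming in-cluster configuration and RBAC permission to read nodes (the setup is illustrative, not part of the original report):

```go
// Read a node's allocatable devices.kubevirt.io/kvm count; a healthy
// device plugin shows a non-zero value, an affected node shows 0.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodeName := os.Args[1]
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Extended resources registered by device plugins show up in
	// the node's Allocatable resource list.
	kvm := node.Status.Allocatable[corev1.ResourceName("devices.kubevirt.io/kvm")]
	fmt.Printf("%s devices.kubevirt.io/kvm: %s\n", nodeName, kvm.String())
}
```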
Hi Fabian, thanks for the info. Can you review the following text?

    * If the kubelet on a node crashes or restarts, this causes the kubelet to incorrectly report 0 KVM devices. Virtual machines are not properly scheduled on affected nodes.
    +
    Verify the number of devices that the kubelet reports by running `$ oc get node $NODE | grep devices.kubevirt`. The output on an affected node shows `devices.kubevirt.io/kvm: 0`. https://bugzilla.redhat.com/show_bug.cgi?id=1681175[(BZ#1681175)]
    +
    [NOTE]
    ====
    As a workaround, kill the relevant `virt-handler` pod on the affected node. The pod automatically restarts, and the kubelet reports the correct number of available KVM devices.
    ====

Passing the needinfo on to Roman. Roman, is this correct?

Looks correct.

Known Issue merged: https://github.com/openshift/openshift-docs/pull/13989
It is published at the bottom of the release notes here: https://docs.openshift.com/container-platform/3.11/cnv_release_notes/cnv_release_notes.html
Let me know if anything else is needed. Thanks!

This was tested by restarting the node and verifying that the KVM devices stayed at 110 and did not drop to zero.

Versions tested:
virt-operator:v2.1.0-12
hyperconverged-cluster-operator:v2.1.0-16
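As a companion to the documented workaround, here is a hedged client-go sketch that deletes the virt-handler pod on an affected node so that its DaemonSet recreates it. The namespace and the `kubevirt.io=virt-handler` label selector are assumptions and must be adjusted to the actual installation:

```go
// Delete the virt-handler pod on one node; the DaemonSet controller
// starts a replacement, which re-registers the device plugin.
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	nodeName := os.Args[1]
	namespace := "kubevirt" // assumption; may be e.g. openshift-cnv

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Find the virt-handler pod running on the affected node.
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "kubevirt.io=virt-handler", // assumed label
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Deleting the pod is enough; no further action is required.
	for _, pod := range pods.Items {
		log.Printf("deleting pod %s/%s", namespace, pod.Name)
		if err := client.CoreV1().Pods(namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
	}
}
```

In practice, the `oc delete pod` one-liner from the known issue is simpler; the sketch only shows how the node-scoped lookup works.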