Description of problem:
There is no monitoring of the QEMU VM to report when QEMU is stuck. For example, when a VM loses its connection to storage (NFS with a hard mount), there is no timeout, and KubeVirt never reports the VM as non-responsive.

Version-Release number of selected component (if applicable):
2.5.0

How reproducible:
Always

Steps to Reproduce:
1. Create a VM with a PV on NFS.
2. Stop the NFS service.

Actual results:
The VM is reported as running forever.

Expected results:
The VM is reported as non-responsive after the qemu timeout.

Additional info:
For example, oVirt marks a VM unresponsive if the monitor times out: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py#L1814

Libvirt logs:
{"component":"virt-launcher","level":"info","msg":"Thread 31 (virNetServerHandleJob) is now running job remoteDispatchConnectGetAllDomainStats","pos":"virThreadJobSet:94","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"conn=0x7fefd8001560, stats=0x3e, retStats=0x7fefeeefe7e0, flags=0x10","pos":"virConnectGetAllDomainStats:11715","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de593340","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593340(name=stack) driver=QEMU perm=3","pos":"virAccessManagerCheckConnect:223","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593050(name=none) driver=QEMU perm=3","pos":"virAccessManagerCheckConnect:223","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x5609de593340","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x7fefcc045370","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de593340","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593340(name=stack) driver=QEMU domain=0x7fefcc00da00 perm=1","pos":"virAccessManagerCheckDomain:239","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593050(name=none) driver=QEMU domain=0x7fefcc00da00 perm=1","pos":"virAccessManagerCheckDomain:239","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x5609de593340","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x7fefa0009a40","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"Starting job: job=query agentJob=none asyncJob=none (vm=0x7fefcc045370 name=default_vm-1, current job=query agentJob=none async=none)","pos":"qemuDomainObjBeginJobInternal:9782","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting for job (vm=0x7fefcc045370 name=default_vm-1)","pos":"qemuDomainObjBeginJobInternal:9812","subcomponent":"libvirt","thread":"31","timestamp":"2020-12-01T09:04:40.813000Z"}
{"component":"virt-launcher","level":"warning","msg":"Cannot start job (query, none, none) for domain default_vm-1; current job is (query, none, none) owned by (34 remoteDispatchConnectGetAllDomainStats, 0 \u003cnull\u003e, 0 \u003cnull\u003e (flags=0x0)) for (124s, 0s, 0s)","pos":"qemuDomainObjBeginJobInternal:9896","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"error","msg":"Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)","pos":"qemuDomainObjBeginJobInternal:9918","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x7fefa0009a40","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x7fefcc045370","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x7fefd00116d0","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_DISPOSE: obj=0x7fefd00116d0","pos":"virObjectUnref:350","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"release domain 0x7fefd00116d0 default_vm-1 13da100f-c015-569f-9d70-09a7e63fe27d","pos":"virDomainDispose:323","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x7fefd8001560","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"Thread 35 (virNetServerHandleJob) finished job remoteDispatchDomainMemoryStats with ret=-1","pos":"virThreadJobClear:119","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x5609de596010","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x5609de599cb0","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de594390","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"28","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de599cb0","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"28","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de596010","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"28","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_UNREF: obj=0x5609de594390","pos":"virObjectUnref:348","subcomponent":"libvirt","thread":"28","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"Thread 35 (virNetServerHandleJob) is now running job remoteDispatchDomainMemoryStats","pos":"virThreadJobSet:94","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_NEW: obj=0x7fefd00116d0 classname=virDomain","pos":"virObjectNew:251","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x7fefd8001560","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"dom=0x7fefd00116d0, (VM: name=default_vm-1, uuid=13da100f-c015-569f-9d70-09a7e63fe27d), stats=0x7fefd0013fe0, nr_stats=13, flags=0x0","pos":"virDomainMemoryStats:5747","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x7fefcc045370","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"OBJECT_REF: obj=0x5609de593340","pos":"virObjectRef:386","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593340(name=stack) driver=QEMU domain=0x7fefcc00da00 perm=1","pos":"virAccessManagerCheckDomain:239","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
{"component":"virt-launcher","level":"info","msg":"manager=0x5609de593050(name=none) driver=QEMU domain=0x7fefcc00da00 perm=1","pos":"virAccessManagerCheckDomain:239","subcomponent":"libvirt","thread":"35","timestamp":"2020-12-01T09:04:40.831000Z"}
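The oVirt behavior linked above (mark a VM unresponsive when the monitor query times out) can be sketched as a watchdog that puts a deadline on the periodic stats call. Everything below is a hypothetical illustration, not KubeVirt or oVirt code; the function names and the 30-second timeout are made up for the example:

```python
# Hypothetical sketch of a qemu responsiveness watchdog, in the spirit of
# oVirt's monitor-timeout handling. None of these names come from KubeVirt.
import concurrent.futures
import time

QEMU_TIMEOUT = 30  # seconds; illustrative value, not a real default


def query_domain_stats():
    """Stand-in for a libvirt call such as virConnectGetAllDomainStats.

    With an NFS hard mount to a dead server, the real call can block
    indefinitely because qemu holds the state change lock (see the
    "cannot acquire state change lock" error in the log above).
    """
    time.sleep(0.01)  # pretend the query is fast and healthy
    return {"state": "running"}


def check_responsive(query, timeout=QEMU_TIMEOUT):
    """Run the stats query with a deadline; report NotResponding on timeout."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(query)
        try:
            future.result(timeout=timeout)
            return "Running"
        except concurrent.futures.TimeoutError:
            return "NotResponding"


print(check_responsive(query_domain_stats))          # healthy query
print(check_responsive(lambda: time.sleep(1), 0.1))  # stuck query
```

The key point is that the watchdog itself never blocks on the stuck call; it only observes that the deadline passed and changes the reported VM phase, exactly the behavior this bug asks for.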
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1901335?
That is a different problem. The problem here is specifically with NFS hard mounts: if the NFS connection is lost, I/O never times out. Bug 1901335 is about I/O errors in general.
Ondra, there is a PR (https://github.com/kubevirt/kubevirt/pull/5401) that addresses the somewhat related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1901335. That PR pauses the VM on I/O errors. Given this context, how do I/O errors specific to NFS differ from the general case?
Stu, we used a couple of methods to cause I/O errors. One of them was to break the connection between the NFS share and the server. In the past this scenario caused headaches in vdsm before it was handled correctly. In general, any I/O error should be handled in a similar way, but NFS seems to be tricky. I would just make sure that NFS is not causing any additional issues.
The BZ mentioned in Comment #4 addresses error handling at the KubeVirt level. If the error is not propagated at all, that is an issue at a lower level. Reassigning this to the Node component; please reassign if this is incorrect.
I don't think Node is the right component. We cover kubelet and CRI-O, and this bug doesn't seem to be an issue with either. What "lower level" are you looking for?
This is an issue with NFS PVCs not reporting any status/errors. I was under the impression that CRI-O was involved in that. Sorry for the confusion.
Ah, I believe that is the Storage team's department; moving it there.
I am not sure I understand the issue. A VM running in a Pod uses an NFS volume and the NFS server becomes unavailable? Kubernetes (kubelet) makes sure that the correct volume is mounted into the right pod, which probably happened in this case. After that, however, kubelet stays out of the data path; it is between the NFS filesystem / NFS client in the kernel and the NFS server. From what I remember, with a hard mount, I/O calls such as read() never return when the NFS server becomes unavailable. Is that the issue you are trying to solve? I am afraid OCP can't help much here.
Jan, please take a look at comment #2. The idea is to handle NFS issues: let the user know that there is a problem, and recover when the connection / NFS server is restored.
Kubernetes currently does not monitor mounted volumes. It is on our long-term TODO list (upstream), but it will take some time until it is fully implemented and graduates to GA; see https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/. It is alpha in 1.21 and will stay alpha in 1.22 for sure. In addition, it needs explicit support in the CSI driver. If you insist, we can turn this into an OCP RFE, though I am not sure that would speed things up.

What I recommend for applications (Pods) is a liveness probe that checks whether the application is running, including some code path that verifies the application can access the storage it needs. For VMs, it would make sense to have a probe that pokes all of a VM's volumes once in a while; the Pod would then be killed when it loses connection to the NFS server for a defined time. Alternatively, you can use soft NFS mounts plus a timeout, or do some monitoring in qemu / vdsm. Would any of the above be enough for you?
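The liveness-probe idea above could look roughly like the following; this is a minimal sketch under assumptions, not KubeVirt code. The path, timeout, and function names are made up for illustration. The probe pokes the volume under a SIGALRM deadline so that the probe itself cannot hang on a hard mount:

```python
# Hypothetical storage liveness check for a pod using an NFS-backed volume.
# The path and timeout values are illustrative, not KubeVirt defaults.
import os
import signal


class StorageTimeout(Exception):
    pass


def _raise_timeout(signum, frame):
    raise StorageTimeout()


def volume_is_alive(path, timeout_s=5):
    """Return True if `path` can be stat()ed within `timeout_s` seconds.

    On an NFS hard mount to a dead server, os.stat() would block forever;
    the SIGALRM turns that hang into a failed probe instead.
    """
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(timeout_s)
    try:
        os.stat(path)
        return True
    except (StorageTimeout, OSError):
        return False
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)


# A kubelet exec probe could run this script and exit non-zero on failure,
# e.g.: sys.exit(0 if volume_is_alive("/path/to/vm/volume") else 1)
```

The soft-mount alternative mentioned above works at the kernel level instead: with `-o soft` plus `timeo`/`retrans` options, the NFS client itself returns an I/O error after the retries are exhausted, rather than blocking forever.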
Based on your description, this indeed looks more like an RFE than a BZ. Please paste a reference link to where the RFE is tracked. Thanks!
I created https://issues.redhat.com/browse/RFE-2032. A link to a customer case would help a lot to prioritize the request!