Bug 1903065
Summary: | [CNV][Chaos] VM doesn't report when qemu is stuck | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ondra Machacek <omachace> |
Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
Storage sub component: | Storage | QA Contact: | Wei Duan <wduan> |
Status: | CLOSED DEFERRED | Docs Contact: | |
Severity: | medium | ||
Priority: | unspecified | CC: | aos-bugs, cnv-qe-bugs, jsafrane, kbidarka, omachace, pkliczew, sgott |
Version: | 4.6 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-30 09:27:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1908661 |
Description
Ondra Machacek
2020-12-01 09:06:45 UTC
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1901335?

That is a different problem. The problem here is mainly with the NFS hard mount. If the NFS connection is lost, it never times out. Bug 1901335 is about IO errors in general.

Ondra, there is a PR, https://github.com/kubevirt/kubevirt/pull/5401, that addresses the somewhat related BZ https://bugzilla.redhat.com/show_bug.cgi?id=1901335. That PR pauses the VM on IO errors. Given this context, how do IO errors specifically related to NFS differ from the general case?

Stu, we used a couple of methods to cause IO errors. One of them was to break the connection between the NFS share and the server. In the past this issue caused headaches in vdsm to handle correctly. In general, handling any IO errors should be done in a similar way, but NFS seems to be tricky. I would just make sure that NFS is not causing any additional issues.

The BZ mentioned in Comment #4 addresses error handling at the KubeVirt level. If the error is not propagated at all, that's an issue at a lower level. Re-assigned this to the Node component. Please re-assign if this is incorrect.

I don't think Node is the right component. We cover the kubelet and CRI-O, and this bug doesn't seem to be an issue with either. What "lower level" are you looking for?

This is an issue related to NFS PVCs not reporting any status/errors. I was under the impression that CRI-O was involved in that. Sorry for the confusion.

Ah, I believe that is Storage's department; moving it there.

I am not sure I understand the issue. A VM running in a Pod uses an NFS volume and the NFS server becomes unavailable? Kubernetes (the kubelet) makes sure that the correct volume is mounted into the right Pod, which probably happened in this case. However, after that, the kubelet stays out of the data path; it's between the NFS client in the kernel and the NFS server now. From what I remember, I/O calls such as read() never return when the NFS server becomes unavailable. Is that the issue you're trying to solve? I am afraid OCP can't help much here.

Jan, please take a look at comment #2. The idea is to handle NFS issues and let the user know that there is a problem, as well as to recover when the connection / NFS server is restored.

Kubernetes currently does not monitor mounted volumes. It's on our long-term TODO list (upstream); however, it will take some time until it's fully implemented and graduates to GA, see https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/. It's alpha in 1.21 and it will stay alpha in 1.22 for sure. In addition, it needs explicit support in a CSI driver. If you insist, we may turn this into an OCP RFE; I'm not sure it would speed things up.

What I recommend for applications (Pods) is a liveness probe that checks whether the application is running, including some code path that checks whether the application can access the storage it needs. For VMs, it would make sense to have a probe poke all of the VM's volumes once in a while. The Pod will then be killed when it loses the connection to the NFS server for a defined time. Alternatively, you can use soft NFS mounts plus a timeout, or do some monitoring in qemu / vdsm. Would any of the above be enough for you?

Based on your description, it indeed looks more like an RFE than a BZ. Please paste a reference link where the RFE is tracked. Thanks!

I created https://issues.redhat.com/browse/RFE-2032. A link to the customer case would help a lot in prioritizing the request!
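
To make the "soft NFS mounts + a timeout" suggestion above concrete, here is a minimal sketch of a PersistentVolume carrying NFS mount options. The volume name, server address, export path, and timeout values are all hypothetical and for illustration only. With a soft mount, a stuck I/O eventually returns an error to the guest/qemu instead of blocking forever as a hard mount does; the trade-off is a risk of data loss for in-flight writes, which is why hard is the NFS default.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-example            # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  # "soft" lets NFS I/O fail after roughly (timeo/10 seconds * retrans) of
  # retries instead of hanging indefinitely; values below are illustrative.
  mountOptions:
    - soft
    - timeo=150                   # 15s per retry (timeo is in deciseconds)
    - retrans=3
  nfs:
    server: nfs.example.com       # hypothetical NFS server
    path: /exports/vm-disks       # hypothetical export path
```

A PVC bound to this PV would then surface I/O errors to the VM, where the KubeVirt-level handling from the PR mentioned above (pausing the VM on IO errors) could take over.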
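
Similarly, a rough sketch of the liveness-probe idea from the same comment, shown here as a plain Pod for simplicity (for a KubeVirt VM the probe would go into the VMI/launcher spec instead). The image, names, mount path, and thresholds are hypothetical. The exec probe touches a file on the NFS-backed volume; when the NFS server is unreachable the command exceeds timeoutSeconds, and after failureThreshold consecutive failures the kubelet kills and restarts the container. Note that the kubelet must enforce exec probe timeouts (the default in recent Kubernetes releases) for this to work against a hard-mounted, hung NFS share.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vm-workload-example                 # hypothetical name
spec:
  containers:
    - name: workload
      image: registry.example.com/guest-workload:latest   # hypothetical image
      volumeMounts:
        - name: vm-disk
          mountPath: /disk
      # Probe a file on the NFS-backed volume once in a while; a lost NFS
      # connection makes the probe time out and eventually fail the container.
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "touch /disk/.probe && rm -f /disk/.probe"]
        initialDelaySeconds: 30
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
  volumes:
    - name: vm-disk
      persistentVolumeClaim:
        claimName: nfs-pvc-example           # hypothetical PVC name
```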