| Summary: | VM is not automatically unpaused after no space IO error on NFS | ||
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Katarzyna Jachim <kjachim> |
| Component: | General | Assignee: | Tal Nisan <tnisan> |
| Status: | CLOSED DEFERRED | QA Contact: | Avihai <aefrat> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | --- | CC: | acanan, amureini, bazulay, bugs, fsimonce, klaas, kshukla, lpeer, michal.skrivanek, mtessun, ncredi, nsoffer, rbalakri, rbarry, srevivo, tnisan, ylavi |
| Target Milestone: | --- | Keywords: | FutureFeature, Improvement, Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-18 20:22:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Katarzyna Jachim
2013-10-29 13:32:07 UTC
This is probably not related to bug 1003588. I haven't looked at the logs well enough but the error here should be ENOSPC (instead of EIO).

I'm not sure if the feature was supposed to cover also ENOSPC errors as they're not related to domain availability.

The solution for ENOSPC is not related at all to the solution proposed for EIO (domain state change).

Ayal, what do you think?

(In reply to Federico Simoncelli from comment #3)
> This is probably not related to bug 1003588. I haven't looked at the logs
> well enough but the error here should be ENOSPC (instead of EIO).
>
> I'm not sure if the feature was supposed to cover also ENOSPC errors as
> they're not related to domain availability.
>
> The solution for ENOSPC is not related at all to the solution proposed for
> EIO (domain state change).
>
> Ayal, what do you think?

Indeed, we have no mechanism for this on NFS atm (on block domains we automatically extend and resume). This is a gap we'll need to cover in 3.4.

This is not an RFE, it's a bug. We can discuss when to fix it and if, but it's not correct to keep it on future.

Tal, who's going to work on it for 4.1?

We don't have a mechanism for unpausing VMs on file storage, and our mechanism on block storage is broken as well; there is no way to detect storage domain state changes reliably.

Fixing this requires a major redesign, possibly for 4.2 if we start to work on it now.

(In reply to Nir Soffer from comment #12)
> We don't have a mechanism for unpausing VMs on file storage, and our
> mechanism on block storage is broken as well; there is no way to detect
> storage domain state changes reliably.
>
> Fixing this requires a major redesign, possibly for 4.2 if we start to work
> on it now.

We could just retry periodically, I believe.

Moving out all non blocker\exceptions.

(In reply to Yaniv Dary from comment #14)
> Moving out all non blocker\exceptions.

Yaniv, I don't think this makes sense for 4.1. Did you read comment 12?

We cannot fix problems like this at the last moment; this will only harm the stability of the product. These kinds of issues must be scheduled at the start of development for the next version.

(In reply to Nir Soffer from comment #12)
> We don't have a mechanism for unpausing VMs on file storage, and our
> mechanism on block storage is broken as well; there is no way to detect
> storage domain state changes reliably.
>
> Fixing this requires a major redesign, possibly for 4.2 if we start to work
> on it now.

We have seen no such requests from the field, and I don't see us putting any effort in this area - closing. If PM disagrees, please explain and reopen.

Based on comment 17, I think we can monitor the available space on file based storage domains, and remember the last value seen. When we move from value <= threshold to value > threshold, we can treat this as a change from invalid to valid state, and use the same mechanism to resume VMs paused because of ENOSPC.

This cannot be very robust, since we monitor only every 10 seconds, and if space is added or removed quickly, we may miss the low space state. But if we choose a large enough threshold, it may be good enough.

Hi Koutuk,

(In reply to Koutuk Shukla from comment #17)
> Hello,
>
> Re-opening this bug [...]
>
> Environment details:
> rhvm-4.2.6.4-0.1.el7ev.noarch
>

we will look at this in 4.4 as a feature request, but unpausing a VM could have harmful effects as well.
E.g. if you have a timer-sensitive application (even the kernel is one) you might have unforeseen effects in case of longer pauses (hangs, deadlocks, etc.).
As such, unpausing the VM unconditionally is not sensible.
So we will discuss how we can sensibly achieve this in 4.4 planning.

Cheers,
Martin

(In reply to Martin Tessun from comment #20)
> Hi Koutuk,
>
> (In reply to Koutuk Shukla from comment #17)
> > Hello,
> >
> > Re-opening this bug [...]
> >
> > Environment details:
> > rhvm-4.2.6.4-0.1.el7ev.noarch
> >
>
> we will look at this in 4.4 as a feature request, but unpausing a VM could
> have harmful effects as well.
> E.g. if you have a timer-sensitive application (even the kernel is one) you
> might have unforeseen effects in case of longer pauses (hangs, deadlocks,
> etc.).
> As such, unpausing the VM unconditionally is not sensible.
> So we will discuss how we can sensibly achieve this in 4.4 planning.

If I understood this correctly the functionality already exists for other storage types. If you decide this is not a good idea for NFS, why is it a good idea for other storage types?

>
> Cheers,
> Martin

(In reply to Klaas Demter from comment #21)
> (In reply to Martin Tessun from comment #20)
> > Hi Koutuk,
> >
> > (In reply to Koutuk Shukla from comment #17)
> > > Hello,
> > >
> > > Re-opening this bug [...]
> > >
> > > Environment details:
> > > rhvm-4.2.6.4-0.1.el7ev.noarch
> > >
> >
> > we will look at this in 4.4 as a feature request, but unpausing a VM could
> > have harmful effects as well.
> > E.g. if you have a timer-sensitive application (even the kernel is one) you
> > might have unforeseen effects in case of longer pauses (hangs, deadlocks,
> > etc.).
> > As such, unpausing the VM unconditionally is not sensible.
> > So we will discuss how we can sensibly achieve this in 4.4 planning.
>
> If I understood this correctly the functionality already exists for other
> storage types. If you decide this is not a good idea for NFS, why is it a
> good idea for other storage types?

To be fair it isn't. And afaik it should also not happen in case the ENOSPC condition lasts longer than a few moments.
The unpausing in LVM (block based) storage is mainly done to cover an extension happening on a thin volume while the guest writes data. These are the cases where the guest is automatically unpaused (afaik).

@Michal: Anything to add?

> >
> > Cheers,
> > Martin

As indicated by numerous comments (comment#20 and comment#23 in particular), as well as the age of this RFE without a reasonable mechanism, this bug will be closed.

Block storage is able to handle this somewhat better than NFS thanks to direct monitoring of the underlying storage and reliable communication of events. NFS is not at a point where we can rely on the underlying storage being ready, and alternate solutions for shared storage exist.

*** Bug 1693786 has been marked as a duplicate of this bug. ***
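
For readers following the thread, the threshold-crossing idea proposed in the comment that starts from comment 17 (remember the last free-space value seen, and treat a move from value <= threshold to value > threshold as an invalid-to-valid transition) might look roughly like the sketch below. This is only an illustration under stated assumptions: `FreeSpaceWatcher`, `update`, and `on_space_available` are hypothetical names and are not vdsm APIs, and the callback stands in for whatever mechanism would actually resume VMs paused on ENOSPC.

```python
from typing import Callable, Optional


class FreeSpaceWatcher:
    """Hypothetical helper (not part of vdsm): detects the moment free
    space on a file-based storage domain rises above a threshold."""

    def __init__(self, threshold: int, on_space_available: Callable[[], None]):
        self.threshold = threshold
        # Assumed hook: would resume VMs that were paused because of ENOSPC.
        self.on_space_available = on_space_available
        # Last sample seen; None until the first monitoring cycle reports.
        self.last_free: Optional[int] = None

    def update(self, free: int) -> bool:
        """Record a new free-space sample.

        Returns True (and fires the callback) only when the previous sample
        was <= threshold and the new one is > threshold.
        """
        crossed = (
            self.last_free is not None
            and self.last_free <= self.threshold
            and free > self.threshold
        )
        self.last_free = free
        if crossed:
            self.on_space_available()
        return crossed
```

As the same comment points out, a watcher fed only every 10 seconds can miss a short dip below the threshold entirely, so a sketch like this would only ever be as reliable as the sampling interval and the chosen threshold allow. It also resumes unconditionally on the transition, which is exactly the behaviour comment 20 warns may be unsafe after a long pause.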