Bug 1314160
Summary: | [RFE] Option to Let VM handle EIO on Direct LUNs, not pausing it | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | RFEs | Assignee: | Tal Nisan <tnisan> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Avihai <aefrat> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.5.7 | CC: | aefrat, germano, gveitmic, jortialc, kwolf, lsurette, mkalinin, mtessun, mwest, nashok, nsoffer, pelauter, rhodain, sirao, sraje, srevivo, tnisan |
Target Milestone: | ovirt-4.4.3-2 | Keywords: | FutureFeature, ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-09 16:20:50 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1862534 | ||
Bug Blocks: |
Description
Germano Veit Michel
2016-03-03 05:48:08 UTC
To share some background about the decisions on how things work in RHEV/qemu:

https://bugzilla.redhat.com/show_bug.cgi?id=1024428#c43
https://access.redhat.com/solutions/526303
https://bugzilla.redhat.com/show_bug.cgi?id=1064630

Not sure if this all still behaves the same way today, and it does not cover the direct LUN portion requested in this RFE. Specifically, check this comment on how to set up a disk in RHEV with the "report" behavior: https://bugzilla.redhat.com/show_bug.cgi?id=1064630#c20

Hopefully it still works the same way in RHV and will be good enough to at least provide a workaround.

Thanks Marina!

I've tested it in 4.2.6 and wrote a script to manage this without using the DB (the SQL edits are a bit ugly and customers need to open support cases). I also rewrote the KCS.

We can keep the RFE open to make this configurable via the UI, or to change the default behavior of direct LUNs as per comment #14. Otherwise this can be closed.

I would like to keep this RFE open, since having it in the UI is a good thing to have. Maybe we should change the component to UI then.

Germano, can you please open an RFE to have this script in ansible?

(In reply to Marina from comment #19)
> Germano, can you please open an RFE to have this script in ansible?

I was planning a much simpler script and also to submit a patch to ansible, but had to do it that way due to this: https://bugzilla.redhat.com/show_bug.cgi?id=1636331

Once it's fixed (or it is clarified what I am doing wrong), I'm planning to do both: simplify the script on the KCS and submit an ansible one.

The customer would prefer if this could be done online (no VM shutdown). Even better if the disk does not need to be detached and re-attached either.

*** Bug 1719166 has been marked as a duplicate of this bug. ***

(In reply to Germano Veit Michel from comment #17)
> Thanks Marina!
>
> I've tested it in 4.2.6, wrote a script to manage this without using the DB
> (the SQL edits are a bit ugly and customers need to open support cases). And
> also rewrote the KCS.

This KCS: https://access.redhat.com/solutions/526303 has a workaround showing the customer how to enable it via a script (instead of modifying the database) and should work for customers with direct LUNs.

> We can keep the RFE open to make this configurable via UI or change the
> default behavior of direct LUNs as per comment #14. Otherwise this can be
> closed.

It would be nice to have this behavior as the default for direct LUNs. It does not sound like a lot of work for dev, but more for testing. However, it makes sense to have this configuration. Tal, Avihai?

(In reply to Marina Kalinin from comment #25)
> This KCS: https://access.redhat.com/solutions/526303 has a workaround showing
> the customer how to enable it via a script (instead of modifying the
> database) and should work for customers with direct LUNs.
>
> It would be nice to have this behavior as the default for direct LUNs. It
> does not sound like a lot of work for dev, but more for testing. However,
> it makes sense to have this configuration.
> Tal, Avihai?

To answer this question I need more details on how this will be implemented by DEV/PM/GSS, such as:

What is the most common use case?

A clear verification scenario from DEV, after they announce what they can implement and how. If all that is needed is to check that the VM does not pause when using an RO LUN, then sure, but it looks like much more than that.

How should the VM handle an EIO error without pausing, and what should be the expected result when testing this part?
Is this going to be added to the existing DR (as we already have a DR solution/ansible script)?

As you can see, there are a lot of questions that rely on DEV/PM to answer/think about.

I think the request makes sense, but it conflicts with the way we handle errors on other types of disks.

For thin disks on block storage we must not propagate the error to the guest. When we get an ENOSPC error, vdsm extends the disk and resumes the VM. The guest should not be aware that a disk was extended; anything else will break the guest.

For thin disks on file storage, maybe the error was caused by a full disk on the storage server. The storage admin can fix the issue, and resuming the VM will then recover it. If we pass ENOSPC to the guest, the guest will be broken.

For preallocated disks or LUNs, ENOSPC should not be possible and we don't have any way to fix it. For other errors, I don't see why we should stop the VM or how this can help, so maybe propagating these errors should always be the default.

Kevin, what do you think?

You don't have to use the same defaults for every type of disk if different configurations make sense for different disk types.

The idea behind stopping the VM on I/O errors is for situations where, for example, the virtual disk is backed by a network connection and the network goes down temporarily (I'm sure you can think of other kinds of temporary failure, too). When a disk returns an error for a request, the guest usually assumes that its disk is broken and will never retry. If the VM is stopped instead, you can continue the guest as soon as the problem is fixed, and it will look to the guest as if there had never been a problem; but obviously the guest doesn't run and perform its job in the meantime.

There are probably valid use cases for both ways of responding to an I/O failure. This is what makes it policy, and why it's an option on the QEMU side.

Right, I don't think we can have any default that will work for all cases.
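For context, a hedged sketch of how the two policies discussed above look at the QEMU layer: QEMU accepts per-drive `werror=`/`rerror=` options (values include `report`, `stop`, `ignore`, `enospc`), and libvirt exposes the same knob as the `error_policy` attribute on a disk's `<driver>` element. The device path and the rest of the command line below are placeholders, not taken from this bug, and this is not a supported RHV invocation:

```shell
# Sketch only: illustrative qemu command-line fragments.
# /dev/mapper/mylun is a placeholder device.

# "stop" policy: pause the guest on I/O errors so it can be resumed
# transparently once the underlying problem is fixed.
qemu-system-x86_64 \
    -drive file=/dev/mapper/mylun,format=raw,werror=stop,rerror=stop

# "report" policy: propagate the error (e.g. EIO) to the guest and let
# its own multipath/filesystem logic handle it.
qemu-system-x86_64 \
    -drive file=/dev/mapper/mylun,format=raw,werror=report,rerror=report
```

In an RHV deployment these options are not set by hand; the engine/vdsm generate the libvirt domain XML, which is where the per-disk policy ends up.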
This must be configurable per disk, so users can configure the system as needed.

As there is a workaround that provides the requested functionality, closing this one.

The feature works since RHV 4.4.3, as this release includes a system-wide configuration to use the "report" error policy for direct LUNs. You can use something like:

engine-config -s PropagateDiskErrors=true

With this configuration, direct LUN disks will report errors to the guest instead of pausing the VM.
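A hedged sketch of applying and reading back the setting mentioned above, run on the engine host. The restart step is the usual requirement for `engine-config` changes rather than something stated in this bug:

```shell
# Enable the "report" error policy for direct LUNs system-wide (RHV >= 4.4.3)
engine-config -s PropagateDiskErrors=true

# engine-config changes normally take effect only after an engine restart
systemctl restart ovirt-engine

# Read back the current value to confirm
engine-config -g PropagateDiskErrors
```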