Red Hat Bugzilla – Bug 665820
[RFE] Send monitor event upon stuck storage
Last modified: 2017-06-07 07:03:33 EDT
*** Bug 711374 has been marked as a duplicate of this bug. ***
Please check https://bugzilla.redhat.com/show_bug.cgi?id=695082#c15 for a description of the root problem and why there's no simple way of fixing it.
This bug is related to bug 695082 - KVM: QEMU stuck when attempting to read from unreachable CD-ROM (disconnected 'nfs-iso-storage-domain'). See https://bugzilla.redhat.com/show_bug.cgi?id=695082#c15 for an analysis of the underlying problem, and why we can't avoid QEMU getting stuck, at least not in RHEL-6.
Unlike bug 695082, this bug doesn't ask for avoiding the hang. Instead, it asks for QEMU emitting an event when it happens. I'm afraid that isn't in the cards, either.
We can't know we got stuck until some time after getting stuck (the analysis referenced above explains why), so any detection has to be based on a timeout. The shorter that timeout, the greater the chance of turning a recoverable network hiccup into a non-recoverable virtual hardware failure. Thus, the timeout value is policy, and policy needs to be set by the management application.
However, I can't think of a way to implement a "detect system call got stuck on NFS" feature in QEMU that doesn't require major architectural surgery, with a considerable risk of introducing nasty bugs. That would be quite inappropriate for a RHEL minor release even if it weren't entirely infeasible.
Perhaps management applications could try one or more of the following to better deal with NFS outages:
* Monitor the NFS storage in the management application instead of relying on QEMU detecting and reporting outages in a timely manner.
* Poll QEMU to detect when it hangs. Pretty schlocky :)
* Use NFS mount options (e.g. soft together with timeo and retrans) to implement the desired timeout at the NFS level.
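The "poll QEMU" option could look roughly like the following Python sketch. It assumes the caller has a QMP socket of its own (for example one set up with QEMU's -qmp option) and uses only the QMP greeting, the qmp_capabilities handshake and the query-status command; the function names and the default timeout are hypothetical. A monitor that does not answer within the timeout is treated as hung, which is exactly the policy knob discussed above.

```python
import json
import socket


def _read_reply(f):
    """Return the next QMP message that is not an asynchronous event."""
    while True:
        line = f.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)  # tolerates QMP's trailing "\r\n"
        if "event" not in msg:  # skip async events such as STOP or BLOCK_IO_ERROR
            return msg


def qmp_is_responsive(sock: socket.socket, timeout: float = 5.0) -> bool:
    """Hypothetical health check: True if the monitor answers query-status.

    The timeout applies to each individual read or write, so the worst case
    is a small multiple of `timeout`.
    """
    sock.settimeout(timeout)
    f = sock.makefile("rw", encoding="utf-8", newline="\n")
    try:
        if "QMP" not in _read_reply(f):  # greeting banner
            return False
        for command in ("qmp_capabilities", "query-status"):
            f.write(json.dumps({"execute": command}) + "\n")
            f.flush()
            if "return" not in _read_reply(f):  # an {"error": ...} reply counts as failure
                return False
        return True
    except (OSError, ValueError):
        # socket timeouts and ConnectionError are OSError subclasses;
        # json.JSONDecodeError is a ValueError subclass.
        return False
    finally:
        f.close()  # closes the file wrapper only, not the socket
```

A caller would connect a socket.AF_UNIX socket to the monitor's socket path, pass it in, and decide on its own what to do after one or more False results. Note that this only tells you the monitor is stuck, not why.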
As said in comment 10, there's no way we can fix this at the QEMU level without major architectural changes and a considerable risk of introducing bugs. Maybe management layers can work around this, as suggested in comment 10.
Closing as WONTFIX.
(In reply to comment #11)
> As said in comment 10, there's no way we can fix this at the qemu level
> without major architectural changes and a considerable risk of introducing
> bugs. Maybe management layers can work around this, as suggested in comment
> 10.
> Closing as WONTFIX.
It may be fine to close it for RHEL 6.x, but we should keep it open for RHEL 7.
IMHO some QEMU threads shouldn't get stuck. One of them is the monitor.
The problem is that today the monitor is executed by the main loop and needs the global QEMU lock. If a thread that accesses the storage with the lock held (is there such a case?) gets stuck, the monitor will deadlock. The monitor should always be functional, and we should add an info monitor command that reports the outstanding I/O requests and the time since each was sent. With this data, management would be able to make the right decision.
Kevin/Markus, what's your take on the above?
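The proposed info command needs QEMU to remember when each outstanding request was issued. A minimal Python sketch of that bookkeeping follows; InFlightTracker, its methods and the request descriptions are hypothetical, not QEMU code. It deliberately ignores the hard part raised in this comment, namely that the monitor must be able to read this data without taking a lock that a stuck thread might hold, and it is not thread-safe. The age threshold is passed in by the caller, since the timeout is policy.

```python
import itertools
import time
from typing import Callable, Dict, List, Tuple


class InFlightTracker:
    """Hypothetical record of outstanding I/O requests and their submit times."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock                # injectable so the sketch is testable
        self._ids = itertools.count(1)
        self._pending: Dict[int, Tuple[str, float]] = {}

    def submit(self, description: str) -> int:
        """Record a request as sent; return an id to pass to complete()."""
        req_id = next(self._ids)
        self._pending[req_id] = (description, self._clock())
        return req_id

    def complete(self, req_id: int) -> None:
        """Forget a request once it has returned from the kernel."""
        self._pending.pop(req_id, None)

    def report(self, stuck_after: float) -> List[Tuple[int, str, float]]:
        """Return (id, description, age) for requests pending longer than
        `stuck_after` seconds, oldest first."""
        now = self._clock()
        rows = [(rid, desc, now - sent)
                for rid, (desc, sent) in self._pending.items()
                if now - sent > stuck_after]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

Management could then poll report() with its own threshold and decide whether to keep waiting or give up on the guest.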
Then I guess the monitor would have to run in a separate thread and make sure to call the block layer only from bottom halves that are executed in the I/O thread that can get stuck. This is easy to say, but I'm relatively sure that it's quite a massive change.
Also, what do you think is the right decision that management could take? If the maximum number of threads in the thread pool (or the maximum number of Linux AIO requests) is used up, you can either wait and hope that the situation will be fixed, or you can kill QEMU. I don't see many other options.
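The point about an exhausted worker limit can be seen with a small Python analogy. It uses concurrent.futures.ThreadPoolExecutor as a stand-in for QEMU's I/O worker pool; max_workers, the network_up event and the task functions are hypothetical. Once every worker is blocked, even a request that would complete immediately sits in the queue until the outage ends, which is the "wait and hope" choice described above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def demo_pool_exhaustion(max_workers: int = 2, grace: float = 0.2):
    """Return (quick_done_while_blocked, all_done_after_recovery).

    `network_up` models the NFS server becoming reachable again; the
    blocking tasks model system calls stuck in the kernel.
    """
    network_up = threading.Event()

    def stuck_io():
        network_up.wait()  # blocks like a read() on an unreachable hard-mounted export
        return "stuck request finished"

    def quick_io():
        return "quick request finished"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        stuck = [pool.submit(stuck_io) for _ in range(max_workers)]
        quick = pool.submit(quick_io)
        # Every worker is occupied, so the quick request stays queued.
        done, _ = wait([quick], timeout=grace)
        quick_done_while_blocked = quick in done
        network_up.set()  # the outage ends
        results = [f.result() for f in stuck + [quick]]
    return quick_done_while_blocked, len(results) == max_workers + 1
```

With the defaults this returns (False, True): nothing completes while the pool is saturated, and everything completes once the simulated network comes back.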
The bug is in ASSIGNED; is anything going on? The "right decision for management" is a matter of identifying whether there is any pending I/O still to be written. Then we can only wait, or pause the VM and migrate it to a different host without storage issues.
It's unlikely that anything is going to happen as long as we haven't even
defined what we want to have implemented in the end.
You cannot migrate the VM to a different host as long as requests are hanging
in the kernel because qemu must flush them first before it has a consistent
state that can be migrated.
That leaves you waiting until the network comes back, which you already do
today. Then the difference is merely that the monitor will still be responsive
while the block subsystem is blocked. I think we'll eventually get there with
the dataplane work, but that's of course a massive project and certainly not a
This doesn't look much like RHEL 7.0 material to me; it may even end up as WONTFIX.
*** Bug 1053561 has been marked as a duplicate of this bug. ***
Actually, even dataplane probably doesn't help much if the monitor then tries
accessing a hanging block device or draining its requests. Moving to 7.2, still
with the prospect of closing as WONTFIX eventually.
Ademar has started closing bugs like this as WONTFIX. For now, I prefer to defer to RHEL 7.3 since we do need to work on reducing code paths where QEMU blocks on pending I/O.
(In reply to Stefan Hajnoczi from comment #21)
> Ademar has started closing bugs like this as WONTFIX. For now, I prefer to
> defer to RHEL 7.3 since we do need to work on reducing code paths where QEMU
> blocks on pending I/O.
Still working on it as a long-term goal. Deferring it to 7.4.
Now actually deferring it to 7.4. Also moving it to Stefan, both because he is
the I/O path maintainer and because he requested to leave it open in comment 21.
Moving this long-term architectural BZ to RHEL 7.5.