Bug 1613010
| Summary: | Hibernate of host fails when KVM guest is running | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Paul Gozart <pgozart> |
| Component: | qemu-kvm | Assignee: | Hai Huang <hhuang> |
| Status: | CLOSED DEFERRED | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.5 | CC: | ahs3, berrange, cfergeau, chayang, fziglio, jsuchane, jsynacek, juzhang, jwright, mprivozn, systemd-maint-list, thozza, tpelka, virt-maint, xfu, yalzhang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1661276 | Environment: | |
| Last Closed: | 2019-10-08 15:59:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 980840, 1594286 | | |
Comment 4
Christophe Fergeau
2018-12-17 11:09:54 UTC
Libvirt already has logic which aims to inhibit shutdown while VMs are running. It should not be difficult to extend this to inhibit suspend too. We would need to make this a configurable policy for libvirtd, though.

I am sorry, I do not see the potential value in implementing such a feature. While with shutdown inhibition the goal is to correctly shutdown/suspend -> start/resume running guests, with host suspend inhibition you have two options:
a) it's off, which is the current situation: the host OS is responsible for gracefully suspending all running processes, including libvirt and qemu. The host kernel may get stuck if a (kernel) thread is not interruptible (or for other reasons).
b) it's on, then the host suspend would not proceed, as the guest would not be suspendable.
Host suspend is clearly a workstation/laptop use case; its inhibition may lead to undesired effects such as battery drain, etc. What is the expectation from this requirement? It would not provide successful host suspend with running VMs, would it? Ought we focus on the root cause of the kernel hang?

Where do we stand with this request please?

I agree with Jarda. What if pm-suspend would call 'virsh suspend' over every domain running and then 'virsh resume' over domains suspended earlier (note that there might be some domains suspended prior to issuing pm-suspend and we don't want to resume those). Looking at /usr/lib64/pm-utils/ there are a lot of scripts that are called on suspend/resume. If anything, there can be one for suspending/resuming libvirt domains. Alternatively, there might be a bug in KVM which would prevent host suspend. Moving to pm-utils for now to decide which way to go.

Suspend/resume is primarily handled in userspace by systemd. The pm-utils are there for backward compatibility. The project is dead upstream and the package was removed from both Fedora and RHEL-8. So the benefit of adding such a helper script to pm-utils is low.
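
For reference, the bookkeeping proposed in comment 9 (suspend only the domains that are running, then resume only those and leave previously paused domains alone) could look roughly like the sketch below. It is only an illustration: the function names are hypothetical, it assumes `virsh` is on the PATH and talks to the default connection, and it is not an actual pm-utils hook.

```python
import subprocess


def running_domains(virsh_list_output: str) -> list[str]:
    """Parse the output of `virsh list --name --state-running` into names."""
    return [line.strip() for line in virsh_list_output.splitlines() if line.strip()]


def suspend_running_domains() -> list[str]:
    """Pause every running domain and return the names that were paused.

    Domains that were already paused are not listed by --state-running,
    so they never end up in the returned list.
    """
    out = subprocess.run(
        ["virsh", "list", "--name", "--state-running"],
        capture_output=True, text=True, check=True,
    ).stdout
    names = running_domains(out)
    for name in names:
        subprocess.run(["virsh", "suspend", name], check=True)
    return names


def resume_domains(names: list[str]) -> None:
    """Resume only the domains recorded by suspend_running_domains()."""
    for name in names:
        subprocess.run(["virsh", "resume", name], check=True)
```

A hook would call `suspend_running_domains()` before the host sleeps, keep the returned list, and pass it to `resume_domains()` after wake-up.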
The proper components for this to be fixed in have already declined the request. What is systemd supposed to do with this? Is this another of those "we don't know where to put it" hacks that ends up in systemd?

(In reply to Jan Synacek from comment #13)
> The proper components for this to be fixed in have already declined the
> request. What is systemd supposed to do with this? Is this another of those
> "we don't know where to put it" hacks that ends up in systemd?

Let's start from the beginning, what do you think is the best place to fix this issue?

(In reply to Michal Privoznik from comment #9)
> I agree with Jarda. What if pm-suspend would call 'virsh suspend' over every
> domain running and then 'virsh resume' over domains suspended earlier (note
> that there might be some domains suspended prior to issuing pm-suspend and we
> don't want to resume those).

Calling "virsh suspend" and "virsh resume" doesn't do anything useful. The VMs' execution is already suspended by virtue of the host being suspended. The only useful thing would be to call the guest agent command to sync the clock upon resume, which merely requires some notification upon host resume. Being able to optionally inhibit suspend when VMs are running is the useful thing and that can be done in libvirt itself by talking to the systemd login manager over dbus, in the same way we inhibit shutdown already. I'm not seeing any need for extra features / changes in systemd here.

Please keep in mind that the problem reported by my TAM customer is not that the processing of the KVM guest is mishandled upon host hibernation, but rather that the KVM host freezes. The customer doesn't care so much whether the hibernation is ignored or the guest processing is interrupted, but they don't want this scenario to cause the host to lock up and require a hard reboot.

Obviously the KVM host should not freeze/hang in this situation and that smells like a kernel bug.
The issue is that S3/S4 of the host with VMs running is considered an unsupported scenario, so this bug ended up morphing into a way to disable suspend so that we don't get near the kernel bug in the first place.

(In reply to Daniel Berrangé from comment #18)
> Obviously the KVM host should not freeze/hang in this situation and that
> smells like a kernel bug.
>

It's a kernel bug, but in an unsupported scenario, therefore low priority (for the kernel team). We closed many S3/S4 BZs as "CLOSED/NOTABUG" in the past. This scenario is not in our test plans, so fixing it won't be that valuable.

> The issue is that S3/S4 of the host with VMs running is considered an
> unsupported scenario, so this bug ended up morphing into a way to disable
> suspend so that we don't get near the kernel bug in the first place.

... ^^^ that's a problem we can prevent: crashing the host is very bad, so let's prevent it from ever happening. Given that historically speaking we had many S3/S4 issues and decided to declare it unsupported and don't test for it, let's disable it.

(In reply to Michal Privoznik from comment #15)
> (In reply to Jan Synacek from comment #13)
> > The proper components for this to be fixed in have already declined the
> > request. What is systemd supposed to do with this? Is this another of those
> > "we don't know where to put it" hacks that ends up in systemd?
>
> Let's start from the beginning, what do you think is the best place to fix
> this issue?

See comment 19.

(In reply to Ademar Reis from comment #19)
> (In reply to Daniel Berrangé from comment #18)
> > Obviously the KVM host should not freeze/hang in this situation and that
> > smells like a kernel bug.
> >
>
> It's a kernel bug, but in an unsupported scenario, therefore low priority
> (for the kernel team). We closed many S3/S4 BZs as "CLOSED/NOTABUG" in the
> past. This scenario is not in our test plans, so fixing it won't be that
> valuable.
>
> > The issue is that S3/S4 of the host with VMs running is considered an
> > unsupported scenario, so this bug ended up morphing into a way to disable
> > suspend so that we don't get near the kernel bug in the first place.
>
> ... ^^^ that's a problem we can prevent: crashing the host is very bad, so
> let's prevent it from ever happening. Given that historically speaking we had
> many S3/S4 issues and decided to declare it unsupported and don't test for
> it, let's disable it.

Ademar, do you mean to disable suspend on libvirt level, that is - should libvirt do something which would prevent host suspend if there's a domain running?

Well, I've just tested this with 5.2.13-gentoo and was able to suspend & resume successfully with a KVM guest running. So did Jarda with RHEL-AV-8.1.0 and it worked. Therefore, prohibiting suspend in upstream libvirt looks too harsh to me because it obviously works, except for RHEL kernel. If anything, we can make this opt-in (since libvirt doesn't have a way to test if the kernel it's running under has the bug or not), at which point we would require users to change a config file, which no one is going to do. Our best shot would be a downstream only patch.

And one philosophical question, if this is an unsupported scenario and the component where the bug clearly lies in is refusing to fix it, why should libvirt? I don't want it to be a dump of bug workarounds.

(In reply to Michal Privoznik from comment #21)
> And one philosophical question, if this is an unsupported scenario and the
> component where the bug clearly lies in is refusing to fix it, why should
> libvirt? I don't want it to be a dump of bug workarounds.

That's not really the case here though. The original title / description of this bug report is that the host OS (probably kernel or KVM module) fails when suspending while VMs are running.
This probable kernel or KVM bug was never even investigated, nor has the kernel team rejected any request to fix it since they've never been asked thus far. The bug was (IMHO) mistakenly turned into an RFE to block suspend when VMs are running, and systemd quite reasonably rejected that request on the basis that libvirt can already register a suspend blocker if it desires. IMHO either we investigate the root problem with VMs running or we just admit this is going to be a WONTFIX.

I'm reverting this bug back to its original title & component assignment, since it is clearly not a systemd problem, and any RFE to libvirt blocking suspend should be considered separately from resolution of the actual customer problem report.

(In reply to Michal Privoznik from comment #21)
> (In reply to Ademar Reis from comment #19)
> > (In reply to Daniel Berrangé from comment #18)
> > > Obviously the KVM host should not freeze/hang in this situation and that
> > > smells like a kernel bug.
> > >
> >
> > It's a kernel bug, but in an unsupported scenario, therefore low priority
> > (for the kernel team). We closed many S3/S4 BZs as "CLOSED/NOTABUG" in the
> > past. This scenario is not in our test plans, so fixing it won't be that
> > valuable.
> >
> > > The issue is that S3/S4 of the host with VMs running is considered an
> > > unsupported scenario, so this bug ended up morphing into a way to disable
> > > suspend so that we don't get near the kernel bug in the first place.
> >
> > ... ^^^ that's a problem we can prevent: crashing the host is very bad, so
> > let's prevent it from ever happening. Given that historically speaking we had
> > many S3/S4 issues and decided to declare it unsupported and don't test for
> > it, let's disable it.
>
> Ademar, do you mean to disable suspend on libvirt level, that is - should
> libvirt do something which would prevent host suspend if there's a domain
> running?

Yes: libvirt disabling suspend of the host if a VM is running.
> Well, I've just tested this with 5.2.13-gentoo and was able to suspend &
> resume successfully with a KVM guest running. So did Jarda with
> RHEL-AV-8.1.0 and it worked. Therefore, prohibiting suspend in upstream
> libvirt looks too harsh to me because it obviously works, except for RHEL
> kernel. If anything, we can make this opt-in (since libvirt doesn't have a way
> to test if the kernel it's running under has the bug or not), at which point
> we would require users to change a config file, which no one is going to do.
> Our best shot would be a downstream only patch.

I'm not talking about libvirt unconditionally disabling S3/S4 upstream/everywhere. I'm talking about disabling it by default in downstream RHEL, with a configuration switch for users who want to enable it back (and end up in an unsupported state).

I know it works in many cases, but it's not reliable and we've decided long ago that we don't support S3/S4+KVM because of the multiple obscure bugs that customers hit. QE has not been testing it and we're not actively developing it. We closed and continue to close many S3/S4 bugs as WONTFIX. If you're interested in bug archaeology, please check this tracker: https://bugzilla.redhat.com/show_bug.cgi?id=923626

>
> And one philosophical question, if this is an unsupported scenario and the
> component where the bug clearly lies in is refusing to fix it, why should
> libvirt? I don't want it to be a dump of bug workarounds.

The story is much simpler than that: we don't support S3/S4 with KVM in RHEL and therefore we're not allocating resources to fix bugs or test it, so libvirt disables it by default in RHEL. Users who want to run an unsupported configuration can enable it. Upstream is a different story. We can and should re-evaluate our support statement of S3/S4 with KVM, but the truth is that right now it's not supported.

With all that said, I think it's time to close this RHEL7 BZ, as obviously we're way too late to fix it there.
We already have a RHEL8-AV RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1568487
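
As a rough illustration of the suspend inhibition discussed in this thread: libvirt would take the lock by calling the logind `Inhibit` method over D-Bus itself, whereas the sketch below wraps a command in the `systemd-inhibit` CLI, which holds the same kind of inhibitor lock for as long as the wrapped command runs. The helper name `inhibit_command` and the default `who`/`why` strings are made up for this example; this is not how libvirt implements it.

```python
import subprocess


def inhibit_command(cmd, what="sleep", who="vm-host-example",
                    why="Virtual machines are running", mode="block"):
    """Build an argv that runs `cmd` under a logind inhibitor lock.

    what="sleep" covers both suspend and hibernate; mode="block" refuses
    the operation outright, while mode="delay" only postpones it briefly.
    """
    return [
        "systemd-inhibit",
        f"--what={what}",
        f"--who={who}",
        f"--why={why}",
        f"--mode={mode}",
        *cmd,
    ]


def run_inhibited(cmd):
    """Run `cmd` while the lock is held (assumes systemd-inhibit is installed)."""
    return subprocess.run(inhibit_command(cmd), check=True)
```

While such a command is running, `systemd-inhibit --list` should show the lock, and the lock is released automatically when the wrapped command exits.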