Bug 1032695
Summary: libvirt: machines get killed when scopes are destroyed

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | libvirt |
| Version | 7.0 |
| Status | CLOSED CURRENTRELEASE |
| Reporter | Daniel Berrangé <berrange> |
| Assignee | Michal Privoznik <mprivozn> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| Severity | unspecified |
| Priority | unspecified |
| CC | acathrow, dallan, dyuan, eblake, harald, jdenemar, jrieden, juzhou, lpoetter, lsu, mprivozn, mzhan, shyu, sluo, zhwang, zpeng |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | libvirt-1.1.1-25.el7 |
| Doc Type | Bug Fix |
| Clone Of | 1031696 |
| | 1064976 (view as bug list) |
| Bug Depends On | 1064976 |
| Type | Bug |
| Last Closed | 2014-06-13 10:09:56 UTC |
Description (Daniel Berrangé, 2013-11-20 15:25:36 UTC)
*** Bug 961200 has been marked as a duplicate of this bug. ***

I'm getting the feeling that this might be a blocker. Lennart, do you have any bright idea, please?

(In reply to Daniel Berrange from comment #0)
> For this whole thing to work correctly, we need to ensure that
> scopes are not terminated prematurely. If we introduced a target
> like libvirt-ready.target, and made libvirt-guests.service be
> After=libvirt-ready.target, and made all the scopes be
> Before=libvirt-ready.target, I think the vms would have a chance
> to shutdown properly. But that's pretty complicated.
> And I'm not even sure how to do that properly. Any better
> ideas?

Lennart commented that Daniel analyzed it correctly.

It took us some time to discuss this, sorry for the delay. Sooo, here's what we'd propose:

In systemd we'll define a new generic target unit called "machines.target". Then, we'll change systemd-machined.service to implicitly add Before= dependencies on this target to the machine scopes.

In libvirt we'd change the libvirtd.service unit file to do After=machines.target. This would then result in the following ordering chain:

    machine-*.scope → machines.target → libvirtd.service

Now, during start-up this ordering would be mostly pointless, as the scopes would not exist in the dependency network before libvirtd.service actually creates them. However, at shutdown this logic would have an effect: in systemd the shutdown order is always the inverse of the start-up order. This hence results in this shutdown ordering:

    libvirtd.service → machines.target → machine-*.scope

Which means: first libvirt would shut down, taking the machine scopes down with it, simply by telling qemu to suspend/terminate them. Then machines.target would be shut down, and finally the remaining scopes (if there are any) would be removed.

Now, machines.target would be generically useful for the non-libvirt case too.
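The start-up/shutdown inversion described above can be sketched with a toy topological sort. This is plain Python for illustration only; the unit names and edges mirror the proposal, and nothing here is actual systemd code:

```python
# Toy model: the dict maps each unit to the set of units ordered
# Before= it, i.e. the units that must start earlier and stop later.
from graphlib import TopologicalSorter  # Python 3.9+

before_edges = {
    "machines.target": {"machine-rhel7.scope"},  # scope is Before= the target
    "libvirtd.service": {"machines.target"},     # target is Before= libvirtd
}

startup = list(TopologicalSorter(before_edges).static_order())
shutdown = list(reversed(startup))  # systemd stops units in inverse start order

print(" -> ".join(startup))
# machine-rhel7.scope -> machines.target -> libvirtd.service
print(" -> ".join(shutdown))
# libvirtd.service -> machines.target -> machine-rhel7.scope
```

The same inversion is why the "emergency" clean-up ordering discussed later in this bug falls out of the start-up dependencies automatically.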
For example, if people encapsulate docker or nspawn services in individual units, they could pull those in from machines.target, and things would somewhat make sense for the start-up case as well.

I hope this makes some sense. I will make the necessary changes to systemd upstream soon; we can then backport this to RHEL7.

Thanks Lennart. Just to clarify, we would be adding the After=machines.target to libvirt-guests.service, which is the optional service that does VM save/restore. libvirtd itself likely shouldn't have any additional dependency, since it isn't supposed to touch running qemu processes on service start/shutdown.

libvirt-guests.service has 'After=libvirtd.service', so I believe the ordering would still work as Lennart describes. On startup:

    machine-*.scope → machines.target → libvirt-guests.service → libvirtd.service

and on shutdown the reverse:

    libvirt-guests.service → libvirtd.service → machines.target → machine-*.scope

(In reply to Daniel Berrange from comment #7)
> libvirt-guests.service has 'After libvirtd.service' so I believe the
> ordering would still work as Lennart describes on startup
>
> machine-*.scope → machines.target → libvirt-guests.service →
> libvirtd.service
>
> and on shutdown the reverse
>
> libvirt-guests.service → libvirtd.service → machines.target →
> machine-*.scope

But if a user turns off libvirt-guests, should systemd wait for libvirt to stop before it kills all scopes at host shutdown time?
There's no reason it should, as far as I can tell.

(In reply to Cole Robinson from comment #8)
> (In reply to Daniel Berrange from comment #7)
> > libvirt-guests.service has 'After libvirtd.service' so I believe the
> > ordering would still work as Lennart describes on startup
> >
> > machine-*.scope → machines.target → libvirt-guests.service →
> > libvirtd.service
> >
> > and on shutdown the reverse
> >
> > libvirt-guests.service → libvirtd.service → machines.target →
> > machine-*.scope
>
> But if a user turns off libvirt-guests, should systemd wait for libvirt to
> stop before it kills all scopes at host shutdown time? there's no reason it
> should as far as I can tell

Oops, I got the second example the wrong way around. On shutdown it would be:

    libvirtd.service → libvirt-guests.service → machines.target → machine-*.scope

So if libvirt-guests.service were not activated, I don't believe libvirtd would block on machines.target.

(In reply to Lennart Poettering from comment #5)
> It took as some time to discuss this, sorry for the delay.
>
> Sooo, here's what we'd propose:
>
> In systemd we'll define a new generic target unit called "machines.target".
> Then, we'll change systemd-machined.service to implicitly add Before=
> dependencies to the machine scopes for this target.
>
> In libvirt we'd change the libvirt.service unit file to do
> After=machines.target. This would then result in the following ordering
> chain:
>
> machine-*.scope → machines.target → libvirtd.service
>
> Now, during start-up this ordering would be mostly pointless, as the scopes
> would not exist in the dependency network before libvirtd.service actually
> creates them. However, at shutdown this logic would have an effect: in
> systemd the shutdown order is always the inverse of the start-up order.
> This hence results in this shutdown ordering:
>
> libvirtd.service → machines.target → machine-*.scope
>
> Which means: first libvirt would shut down, taking down the machine scopes
> with them, simply by telling qemu to suspend/terminate them. Then,
> machines.target would be shut down, and finally the remaining scopes (if
> there are any) would be removed.
>
> Now, machines.target would be generically useful for the non-libvirt case
> too. For example, if people encapsulate docker or nspawn services in
> individual units, they could pull those in from machines.target and things
> would somewhat make sense for the start-up case as well.
>
> I hope this makes some sense. I will make the necessary changes to systemd
> upstream soon, we can then backport this to RHEL7.

So should I clone this bug for systemd too?

Sooo, after thinking about this for a couple more hours here at the hackfest, we came to the conclusion that "machines.target" is probably not a good idea after all, since it cannot properly distinguish clean termination of machines by libvirt from the "emergency" clean-up done by systemd should libvirt die abnormally. The machines.target above would only cover the "emergency" case, which is much less interesting than the clean termination case. Moreover, it actually enforces the "emergency" ordering even when a clean termination is done: a job that libvirt/machined would queue for termination of a machine scope would be delayed until after libvirt itself had completed, which is mostly a chance for deadlock and certainly not useful...

Instead, we want to propose a different approach that covers this case much more nicely: when creating a scope, we'd add an additional, optional property parameter, maybe called "ScopeManager" or so, which takes a string.
If specified, it should contain the bus name (a unique name, possibly a well-known name) of a peer that systemd will send a bus signal to, instead of sending SIGTERM to the scope's processes, when it would like to shut down the scope unit. libvirt would simply set this property parameter to its own unique name, and then when systemd wants the machine scopes to go away, it would be libvirt that gets asked this way to terminate the scopes, and systemd would not terminate them directly (well, subject to a timeout after which systemd would SIGKILL them...).

When the system is shut down, this would have the effect that systemd would send both SIGTERM to libvirt (since it wants to terminate libvirt itself) and a bus signal for each machine scope that is running, also to libvirt. (The SIGTERM and the bus signals would not be ordered though, but that should not be a problem, or would it?)

A nice effect of this is that libvirt would also get hooked into the shutdown logic of its machines if people use "systemctl stop machine-xyz.scope", thus streamlining the termination logic of its machines both during runtime and at shutdown time...

Does this make sense to you? I personally find this a much more convincing solution, since it puts libvirt into the right position of being the manager of its own scopes, which it should have been in the first place... systemd would never take things into its own hands anymore, except when libvirt for some reason did not react to the bus signal, and even then stepping in as a last resort would be up to libvirt to decide, just by using the already existing SendSIGKILL property...

Sorry for the back and forth on this!

This is the systemd side of things: http://cgit.freedesktop.org/systemd/systemd/commit/?id=2d4a39e759c4ab846ad8a546abeddd40bc8d736e

libvirt should now simply add a property "Controller" containing a string with the unique name to the array it passes to the CreateMachine() call.
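For illustration only, the property array handed to machined's CreateMachine() can be thought of as a list of (name, D-Bus signature, value) triples. This plain-Python sketch is not libvirt code; the slice name and bus name below are made up:

```python
# Plain-Python model of the properties argument to CreateMachine();
# values below are illustrative, not real libvirt data.
unique_bus_name = ":1.42"  # hypothetical unique bus name owned by libvirtd

properties = [
    ("Slice", "s", "machine.slice"),
    # The new optional property from the proposal: the bus peer that
    # systemd should signal instead of SIGTERM-ing the scope's
    # processes directly.
    ("Controller", "s", unique_bus_name),
]

controller = next(value for name, sig, value in properties
                  if name == "Controller")
print(controller)  # -> :1.42
```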
Then, it should subscribe to the RequestStop() signal coming from PID 1's scope unit, and when it gets it, it should shut down that specific machine in whatever way it likes. I'll add documentation about this to the Wiki shortly.

There's a new paragraph explaining this now at the end of this wiki text: http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

I also added the new signal to the dbus API description: http://www.freedesktop.org/wiki/Software/systemd/dbus/

It's admittedly terse. If you need more information, please ping me.

Now that I'm trying to implement this in libvirt, it seems to me that libvirt must take over the libvirt-guests script, because currently the functionality lives in that script. In the RHEL-6 world, whenever the host went down, the script was invoked and machines were (optionally) saved/shut down. Then, when the host came up again and the script was invoked at bootup, the machines were restored. No DBus involved. So now we have two options:

1) since libvirtd is already listening on dbus, move this piece of functionality into libvirtd
2) make the libvirt-guests script listen on dbus and set "Controller: libvirt-guests.service". This could mean the script hangs around forever.

BTW: From the documentation you're linking: "As before in either case this will be followed by SIGKILL to the scope unit processes after a timeout." This is highly critical to libvirt. What is the timeout? In the libvirt-guests script configuration we allow users to set an arbitrary timeout for saving/shutting down guests. We even allow them to turn the timeout off, meaning the host shutdown process is postponed until all guests are saved/shut down. Hence, we must make the timeout configurable in systemd too.

FWIW, I've always thought that this functionality should live inside libvirtd. For the QEMU session driver we have indeed added such a capability, albeit more limited in flexibility.
It triggers upon desktop session logout or host shutdown. This would clearly be a much larger job though, so probably not reasonable for 7.0.

Well, the current script works for all drivers (not just qemu); in fact you can define any URI that will shut down guests (and it can be a remote host, not the one that is shutting down). For example, if you have two hosts, hostA running VMs and hostB serving VM disks over NFS, libvirt-guests can be configured on hostB to shut down all the VMs on hostA, and there doesn't have to be any libvirtd running on hostB. If we merge libvirt-guests into libvirtd we will mangle not only this use case but also all the other systems that don't use systemd at all.

Anyway, if we are able to send a signal on a dying scope, can we just run an arbitrary script instead? And when I say 'we' I mean systemd.

(In reply to Michal Privoznik from comment #16)
> Well, the current script works for all drivers (not just qemu), in fact you
> can define any URI that whill shutdown guests (and it can be a remote host,
> not the one that is shutting down). For example if you have two hosts: hostA
> running VMs, hostB serving VM disks over NFS, libvirt-guests can be
> configured on hostB to shutdown all the VMs on hostA. And there doesn't have
> to be any libvirtd running on hostB. If we merge libvirt-guests into the
> libvirtd we will mangle not only this use case but all the other systems
> that doesn't use systemd at all.

Obviously we need this work to be done for all the stateful drivers in libvirtd. I can imagine this would involve some generic code in libvirtd and perhaps a handful of new driver APIs.

The ability to shut down VMs on host B when host A shuts down has always struck me as a particularly pointless feature. It is a solution in search of a problem, IMHO. The core important feature is shutting down VMs running in stateful libvirtd drivers. If we have to lose anything else in order to do the core feature well, then so be it.
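For reference, the remote use case described above is driven entirely by the script's configuration file. Here is a hedged sketch of /etc/sysconfig/libvirt-guests for it; the hostnames are made up, while URIS, ON_SHUTDOWN and SHUTDOWN_TIMEOUT are the variables the script documents:

```shell
# Hypothetical /etc/sysconfig/libvirt-guests on hostB, shutting down
# the guests running on hostA (hostnames are illustrative):
URIS='qemu+ssh://root@hostA/system'
ON_SHUTDOWN=shutdown     # or "suspend" to save guest state instead
SHUTDOWN_TIMEOUT=300     # seconds to wait for guests to shut down
```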
We shouldn't have to lose support for non-systemd hosts either. All that I see happening is that the libvirt-guests sysvinit script would become very small: it would just have to make an API call to libvirtd to trigger its job, or send a signal to it.

I still don't understand why we need to take the more complicated approach here. I think the best solution would be if scopes understood Before= and After= even for services. That is, in the scope we could have:

    Before=libvirt-guests.service

So when systemd computes the shutdown order, libvirt-guests.service will always come before any scope termination. In that case, libvirt-guests.service would kill some scopes, and we don't care about whatever scopes remain after the script has run. This approach is even better than having a Script= like I'm suggesting in comment 16, because scopes can be killed in parallel, in which case we would need to adapt libvirt-guests.

(In reply to Lennart Poettering from comment #11)
> libvirt would simply set this property parameter to its own unique name, and
> then when systemd wants the machine scopes to go away, it would be libvirt
> that gets asked this way to terminate the scopes, and systemd would not
> terminate them directly (well, subject to a timeout after which systemd
> would SIGKILL them...). When the system is shut down this would have the
> effect that systemd would send both SIGTERM to libvirt (since it wants to
> terminate libvirt itself), and a bus signal for each machine scope that is
> running, also to libvirt. (The SIGTERM and the bus signals would not be
> ordered though, but that should not be a problem, or would it?)

I think what you describe with Controller= could work, with 2 caveats:

- We would need to use a well-known bus name, not a unique bus name, since libvirtd needs to be able to restart without affecting guests, and such restarts would result in libvirt getting a new unique name.
I don't think this is a problem, since there's nothing in the code that prevents use of a well-known bus name. It is just a matter of libvirt registering org.libvirt.system on the bus at startup.

- We need to solve the SIGTERM vs. scope-bus-signals race. If SIGTERM can arrive at libvirt before the scope bus signals, then chances are libvirt will already have shut itself down before it is notified that scopes need to be shut down. AFAICT, this requires the ability for us to also set Before=libvirtd.service on the VM scopes, to ensure that the scopes will be scheduled for shutdown before libvirt ever gets a SIGTERM.

I think using Controller= and Before= on the scopes is the right long-term approach to fix this problem. For RHEL-7, however, I don't think we can do the work to support Controller=, and thus we need the ability to set After= on scopes, so we can control their shutdown ordering wrt libvirt-guests.service and libvirtd.service. So regardless of what approach we take for making this work, AFAICT we really do need support for After= and Before= on scopes.

After discussion with Lennart via email I found out why we were having problems with Before=/After= support. The key is that those properties must be provided as an array of strings, not a single comma-separated string. E.g. something like this:

    @@ -243,8 +243,9 @@ int virSystemdCreateMachine(const char *name,
                                iscontainer ? "container" : "vm",
                                (unsigned int)pidleader,
                                rootdir ? rootdir : "",
    -                           1, "Slice", "s",
    -                           slicename) < 0)
    +                           2,
    +                           "Slice", "s", slicename,
    +                           "Before", "as", 1, "libvirtd.service") < 0)
             goto cleanup;

         ret = 0;

(In reply to Daniel Berrange from comment #21)
> After discussion with Lennart via email I found out why we were having
> problems with Before=/After= support. The key is that those properties must
> be provided as an array of strings, not a single comma separated string.
>
> eg something like this
>
> @@ -243,8 +243,9 @@ int virSystemdCreateMachine(const char *name,
>                             iscontainer ? "container" : "vm",
>                             (unsigned int)pidleader,
>                             rootdir ? rootdir : "",
> -                           1, "Slice", "s",
> -                           slicename) < 0)
> +                           2,
> +                           "Slice", "s", slicename,
> +                           "Before", "as", 1, "libvirtd.service") < 0)
>          goto cleanup;
>
>      ret = 0;

I had reached this code [*] after reading the systemd source code (the documentation is just lacking) about a week ago. I even tried it out, but without any success. There's a difference between upstream and downstream systemd, and I think this falls into a grey area. Moreover, I've even tried setting "DefaultDependencies" to false too. No success either. Therefore I'm restoring the dependency back. And sorry for not updating the BZ.

* - in fact I was trying something like this:

    @@ -243,8 +243,9 @@ int virSystemdCreateMachine(const char *name,
                                iscontainer ? "container" : "vm",
                                (unsigned int)pidleader,
                                rootdir ? rootdir : "",
    -                           1, "Slice", "s",
    -                           slicename) < 0)
    +                           2,
    +                           "Slice", "s", slicename,
    +                           "Before", "as", 1, "libvirt-guests.service") < 0)
             goto cleanup;

         ret = 0;

and this:

    @@ -243,8 +243,10 @@ int virSystemdCreateMachine(const char *name,
                                iscontainer ? "container" : "vm",
                                (unsigned int)pidleader,
                                rootdir ? rootdir : "",
    -                           1, "Slice", "s",
    -                           slicename) < 0)
    +                           3,
    +                           "Slice", "s", slicename,
    +                           "DefaultDependencies", "b", 0,
    +                           "Before", "s", "libvirt-guests.service") < 0)

Interestingly, in this last case I get an I/O error when communicating on dbus, and nothing but a reboot can make it work again; neither reverting my patch nor rebuilding & restarting the libvirtd daemon helps.

Patches proposed upstream: https://www.redhat.com/archives/libvir-list/2014-February/msg01357.html

Moving to POST: http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-February/msg00732.html

Scratch build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7097732

Moving back to ASSIGNED to make sure we also modify the .spec file for a reproducible build.
Additional patches pending upstream: https://www.redhat.com/archives/libvir-list/2014-February/msg01559.html

It turns out that the RHEL 7 build already depends on systemd_daemon via an indirect dependency on udev; so backporting the additional patches won't change the binary, but it will make the resulting rpm deterministic for anyone rebuilding the rpm outside of the RHEL 7 build farm.

Okay, moving to POST once again: http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-February/msg00853.html

Hi Michal,
I met an issue where some guests can't be resumed and the libvirt-guests service is in failed status after I reboot the host. Please help have a look, thanks.

pkginfo:

    qemu-kvm-rhev-1.5.3-50.el7.x86_64
    kernel-3.10.0-97.el7.x86_64
    libvirt-1.1.1-25.el7.x86_64

steps:

1. Prepare 3 running guests and enable the libvirt-guests service

    # virsh list
     Id    Name                           State
    ----------------------------------------------------
     2     rhel7                          running
     3     rhel72                         running
     6     rhel75                         running

    # service libvirt-guests status
    Redirecting to /bin/systemctl status libvirt-guests.service
    libvirt-guests.service - Suspend Active Libvirt Guests
       Loaded: loaded (/usr/lib/systemd/system/libvirt-guests.service; enabled)
       Active: active (exited) since Mon 2014-03-03 19:59:11 CST; 3s ago
      Process: 3275 ExecStart=/usr/libexec/libvirt-guests.sh start (code=exited, status=0/SUCCESS)
     Main PID: 3275 (code=exited, status=0/SUCCESS)

2. Keep all configuration at its defaults in /etc/sysconfig/libvirt-guests

    ON_SHUTDOWN=suspend
    ON_BOOT=start

3. Reboot the *host*:

    # reboot

4. Check the guest status after the host reboot; one guest wasn't resumed

    # virsh list --all
     Id    Name                           State
    ----------------------------------------------------
     2     rhel7                          running
     3     rhel72                         running
     -     rhel75                         shut off

    # ll /var/lib/libvirt/qemu/save/
    total 994060
    -rw-------. 1 root root 1017913963 Mar  3 19:50 rhel75.save

5. Check the libvirt-guests status

    # systemctl status libvirt-guests -l
    libvirt-guests.service - Suspend Active Libvirt Guests
       Loaded: loaded (/usr/lib/systemd/system/libvirt-guests.service; enabled)
       Active: failed (Result: exit-code) since Mon 2014-03-03 19:55:33 CST; 1min 46s ago
      Process: 1800 ExecStart=/usr/libexec/libvirt-guests.sh start (code=exited, status=1/FAILURE)
     Main PID: 1800 (code=exited, status=1/FAILURE)
       CGroup: /system.slice/libvirt-guests.service

    Mar 03 19:53:51 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guests on default URI...
    Mar 03 19:53:55 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel7: done
    Mar 03 19:53:58 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel72: done
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel75: error: Failed to start domain rhel75
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: error: End of file while reading data: Input/output error
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: error: One or more references were leaked after disconnect from the hypervisor
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: error: Failed to reconnect to the hypervisor
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com systemd[1]: libvirt-guests.service: main process exited, code=exited, status=1/FAILURE
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com systemd[1]: Failed to start Suspend Active Libvirt Guests.
    Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com systemd[1]: Unit libvirt-guests.service entered failed state.

6. I'll attach the log info in the attachment.

Created attachment 869905 [details]
The log about libvirt-guests
(In reply to zhenfeng wang from comment #30)
> 4. Check the guest status after the host reboot, find one guest wasn't resumed
>
> # virsh list --all
>  Id    Name                           State
> ----------------------------------------------------
>  2     rhel7                          running
>  3     rhel72                         running
>  -     rhel75                         shut off
>
> # ll /var/lib/libvirt/qemu/save/
> total 994060
> -rw-------. 1 root root 1017913963 Mar  3 19:50 rhel75.save
>
> 5. Check the libvirt-guests status
>
> # systemctl status libvirt-guests -l
> libvirt-guests.service - Suspend Active Libvirt Guests
>    Loaded: loaded (/usr/lib/systemd/system/libvirt-guests.service; enabled)
>    Active: failed (Result: exit-code) since Mon 2014-03-03 19:55:33 CST; 1min 46s ago
>   Process: 1800 ExecStart=/usr/libexec/libvirt-guests.sh start (code=exited, status=1/FAILURE)
>  Main PID: 1800 (code=exited, status=1/FAILURE)
>    CGroup: /system.slice/libvirt-guests.service
>
> Mar 03 19:53:51 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guests on default URI...
> Mar 03 19:53:55 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel7: done
> Mar 03 19:53:58 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel72: done
> Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: Resuming guest rhel75: error: Failed to start domain rhel75
> Mar 03 19:55:33 ibm-x3650m3-07.qe.lab.eng.nay.redhat.com libvirt-guests.sh[1800]: error: End of file while reading data: Input/output error

This seems like libvirtd encountered a segmentation fault. Do you happen to have a coredump, or are you able to reproduce? What happens when you run 'virsh restore rhel75'?

BTW: wasn't rhel75 freshly booted up or shutting down prior to the reboot? Maybe we are not handling that correctly.

Michal

Hi Michal,
Sorry to tell you that I didn't find a coredump on my host, and I find libvirtd is in running status once the host has started completely; maybe libvirtd crashed during the host start process. I can often reproduce this issue when starting many guests (guest count >= 3), and can't reproduce it when starting only one guest. BTW, the rhel75 guest can be restored successfully with the virsh restore command, and the guest recovers to the place where it left off.

Checking the system log, I find the following record; hope it helps:

    Mar  4 14:32:06 ibm-x3650m3-07 systemd: Stopping Virtualization daemon...
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: libvirtd.service stopping timed out. Killing.
    Mar  4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: Resuming guest rhel75: error: Failed to start domain rhel75
    Mar  4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: End of file while reading data: Input/output error
    Mar  4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: One or more references were leaked after disconnect from the hypervisor
    Mar  4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: Failed to reconnect to the hypervisor
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: libvirtd.service: main process exited, code=killed, status=9/KILL
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: Unit libvirtd.service entered failed state.
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: Starting Virtualization daemon...
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: libvirt-guests.service: main process exited, code=exited, status=1/FAILURE
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: Failed to start Suspend Active Libvirt Guests.
    Mar  4 14:33:36 ibm-x3650m3-07 systemd: Unit libvirt-guests.service entered failed state.
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.005+0000: 2717: info : libvirt version: 1.1.1, package: 25.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2014-02-26-10:34:02, x86-017.build.eng.bos.redhat.com)
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.005+0000: 2717: debug : virLogParseOutputs:1346 : outputs=1:file:/var/log/libvirt/libvirtd.log
    Mar  4 14:33:37 ibm-x3650m3-07 systemd: Started Virtualization daemon.
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 3
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 5
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 6
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 7
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 8
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 9
    Mar  4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.023+0000: 2732: debug : virFileClose:90 : Closed fd 10

(In reply to zhenfeng wang from comment #33)
> Hi Michal
> Sorry to tell you that i didn't find the coredump in my host, and i find the
> libvirtd was in running status while the host start completely, maybe the
> libvirtd ever crash during the host start process. I can ofen reproduce this
> issue while start many guests (guest numbers >=3), can't reproduce it while
> only start one guest. BTW, The rhel75 guest can be restored successfully
> with the virsh restore command and the guest will recover to the place where
> it left
>
> Check the systemlog , i find the following record, hope it help you

It does indeed.

> Mar 4 14:32:06 ibm-x3650m3-07 systemd: Stopping Virtualization daemon...

Okay, so this is interesting. Why the heck is systemd *stopping* libvirtd at host boot up?

> Mar 4 14:33:36 ibm-x3650m3-07 systemd: libvirtd.service stopping timed out.
Yes, this times out because libvirtd is stuck resuming guests (that's why you hit this when guest count >= 3; I assume you have a slow disk or something (nfs?), so that resuming a guest takes 30 seconds or more).

> Killing.
> Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: Resuming guest rhel75:
> error: Failed to start domain rhel75
> Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: End of file while
> reading data: Input/output error

So after systemd times out waiting for libvirtd, it kills it. This releases a chain reaction, like libvirt-guests.sh losing the connection and thus not resuming the other guests.

> Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: One or more
> references were leaked after disconnect from the hypervisor
> Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: Failed to reconnect
> to the hypervisor
> Mar 4 14:33:36 ibm-x3650m3-07 systemd: libvirtd.service: main process
> exited, code=killed, status=9/KILL
> Mar 4 14:33:36 ibm-x3650m3-07 systemd: Unit libvirtd.service entered failed
> state.
> Mar 4 14:33:36 ibm-x3650m3-07 systemd: Starting Virtualization daemon...

And later, when libvirtd has been successfully killed, maybe just for the sheer joy of it, systemd decides to start it again.

> Mar 4 14:33:36 ibm-x3650m3-07 systemd: libvirt-guests.service: main process
> exited, code=exited, status=1/FAILURE
> Mar 4 14:33:36 ibm-x3650m3-07 systemd: Failed to start Suspend Active
> Libvirt Guests.
> Mar 4 14:33:36 ibm-x3650m3-07 systemd: Unit libvirt-guests.service entered
> failed state.
> Mar 4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.005+0000: 2717:
> info : libvirt version: 1.1.1, package: 25.el7 (Red Hat, Inc.
> <http://bugzilla.redhat.com/bugzilla>, 2014-02-26-10:34:02,
> x86-017.build.eng.bos.redhat.com)
> Mar 4 14:33:37 ibm-x3650m3-07 libvirtd: 2014-03-04 06:33:37.005+0000: 2717:
> debug : virLogParseOutputs:1346 :
> outputs=1:file:/var/log/libvirt/libvirtd.log
> Mar 4 14:33:37 ibm-x3650m3-07 systemd: Started Virtualization daemon.

Can you please try catching systemd debug logs and attaching them here? I believe this can be achieved by appending:

    systemd.log_level=debug systemd.log_target=kmsg log_buf_len=5M enforcing=0

to the kernel command line before booting. Then /var/log/messages and dmesg should contain the interesting bits.

(In reply to Michal Privoznik from comment #34)
> (In reply to zhenfeng wang from comment #33)
> > Hi Michal
> > Sorry to tell you that i didn't find the coredump in my host, and i find the
> > libvirtd was in running status while the host start completely, maybe the
> > libvirtd ever crash during the host start process. I can ofen reproduce this
> > issue while start many guests (guest numbers >=3), can't reproduce it while
> > only start one guest. BTW, The rhel75 guest can be restored successfully
> > with the virsh restore command and the guest will recover to the place where
> > it left
> >
> > Check the systemlog , i find the following record, hope it help you
>
> It does indeed.
>
> > Mar 4 14:32:06 ibm-x3650m3-07 systemd: Stopping Virtualization daemon...
>
> Okay, so this is interesting. Why the heck is systemd *stopping* libvirtd at
> host boot up?

Not clear about this strange phenomenon; I guess that maybe it is related to the timeout of the libvirtd initialization.

> > Mar 4 14:33:36 ibm-x3650m3-07 systemd: libvirtd.service stopping timed out.
>
> Yes, this timeouts as libvirtd is stuck resuming guests (that's why you hit
> this when guest count >= 3. I assume you have a slow disk or something
> (nfs?), so that resume of guest takes 30 seconds or more).
In fact, I didn't use NFS, and all the guests live on my local machine. My host's disk is a SATA disk whose read rate can reach more than 100 MB/s, so it shouldn't be a slow disk:

# hdparm -t /dev/sda1

/dev/sda1:
 Timing buffered disk reads: 356 MB in 3.01 seconds = 118.29 MB/sec

> > Killing.
> > Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: Resuming guest rhel75:
> > error: Failed to start domain rhel75
> > Mar 4 14:33:36 ibm-x3650m3-07 libvirt-guests.sh: error: End of file while
> > reading data: Input/output error
>
> So after systemd times out waiting for libvirtd, it kills it, which triggers
> a chain reaction: libvirt-guests.sh loses the connection and thus cannot
> resume the other guests.
[...]
> Can you please try catching systemd debug logs and attaching them here? I
> believe this can be achieved by appending:
>
> systemd.log_level=debug systemd.log_target=kmsg log_buf_len=5M enforcing=0
>
> to the kernel command line before booting. Then /var/log/messages and dmesg
> should contain the interesting bits.

Created attachment 870744 [details]
The dmesg info of libvirt-guests
Created attachment 870745 [details]
The syslog info of libvirt-guests
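The debug options requested above can also be made persistent across reboots on a RHEL 7 host by appending them to `GRUB_CMDLINE_LINUX` in /etc/default/grub and regenerating grub.cfg with `grub2-mkconfig -o /boot/grub2/grub.cfg`. A minimal sketch, operating on a temporary sample copy of the defaults file rather than the real /etc/default/grub:

```shell
# Sketch: append the systemd debug options to a sample GRUB defaults file.
# A real host would edit /etc/default/grub and then regenerate grub.cfg;
# here we work on a throwaway copy with illustrative contents.
grub_defaults=$(mktemp)
printf 'GRUB_CMDLINE_LINUX="rhgb quiet"\n' > "$grub_defaults"

# Insert the options just before the closing quote of GRUB_CMDLINE_LINUX.
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"$/\1 systemd.log_level=debug systemd.log_target=kmsg log_buf_len=5M enforcing=0"/' "$grub_defaults"

cat "$grub_defaults"
```

Unlike editing the command line at the boot prompt, this keeps the debug logging active on every subsequent boot until the options are removed again.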
zhenfeng,

This libvirtd.service restart looks spurious. Does your /etc/rc.local (or something similar) contain 'service libvirtd restart' (or an equivalent)?

Hi Michal,

I didn't find an /etc/rc.local file on my host; however, I found /etc/rc.d/rc.local, and it contains 'service libvirtd restart'. I'll attach this file, along with the 'journalctl -u libvirtd.service' and 'journalctl -u libvirt-guests.service' logs.

# grep 'service libvirtd restart' /etc/rc.d/rc.local
service libvirtd restart
service libvirtd restart

Created attachment 871180 [details]
The libvirtd and libvirt-guests logs output by the journalctl command
Created attachment 871181 [details]
The rc.local file on my host
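The attached rc.local can be scanned mechanically for the stray restarts Michal asked about. A small sketch against a throwaway sample file (the contents below are illustrative only; the real file on the reporter's host is /etc/rc.d/rc.local):

```shell
# Sketch: count stray 'service libvirtd restart' lines in an rc.local-style
# file. The sample is illustrative; a real check would point at
# /etc/rc.d/rc.local instead.
rcfile=$(mktemp)
cat > "$rcfile" <<'EOF'
#!/bin/sh
modprobe kvm
service libvirtd restart
service firewalld stop
service libvirtd restart
EOF

matches=$(grep -c 'service libvirtd restart' "$rcfile")
echo "found $matches libvirtd restart line(s)"   # → found 2 libvirtd restart line(s)
```

Any non-zero count means rc.local will fight with systemd's own lifecycle management of libvirtd.service at boot.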
(In reply to zhenfeng wang from comment #41)
> Created attachment 871181 [details]
> The rc.local file in my host

From the file:

#!/bin/sh
ntpdate clock.redhat.com
modprobe kvm
modprobe kvm-intel
modprobe kvm-amd
service rpcbind start
chkconfig rpcbind on
chmod 666 /dev/kvm
service libvirtd restart
setsebool -P virt_use_nfs 1
setenforce 1
/usr/libexec/iptables/iptables.init stop
service firewalld stop
service libvirtd restart

That explains why libvirtd is restarted (twice) and why the libvirt-guests script loses its connection and fails to resume some domains. In fact, it also explains why libvirtd fails to quit and systemd decides to kill it: when libvirtd is requested to stop in a certain state (while talking on the monitor), our event loop gets stuck. So I think you should remove 'service libvirtd restart' from rc.local and re-verify.

Hi Michal,

Thanks for your suggestion; everything works well after I removed 'service libvirtd restart' from rc.local. The following were my verification steps.

pkginfo:
libvirt-1.1.1-26.el7.x86_64
qemu-kvm-rhev-1.5.3-52.el7.x86_64
kernel-3.10.0-105.el7.x86_64

Steps

SETUP 1
1. Ensure the libvirtd and libvirt-guests services are in running status.
2. Make sure the libvirt-guests script is enabled for the next reboot:
# systemctl enable libvirt-guests.service
3. Start 4 guests; make sure autostart is disabled for all domains.
# virsh list --all
 Id    Name       State
----------------------------------------------------
 2     rhel71     running
 3     rhel72     running
 4     rhel73     running
 5     rhel75     running

Scenario 1
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=suspend
ON_BOOT=start
2. Restart the libvirt-guests service; once libvirt-guests started, all guests were still in the place where they left off.
# systemctl restart libvirt-guests
# virsh list --all
 Id    Name       State
----------------------------------------------------
 2     rhel71     running
 3     rhel72     running
 4     rhel73     running
 5     rhel75     running
3. Reboot the host; got the same result as in step 2.
# reboot
# virsh list --all

Scenario 2
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=suspend
ON_BOOT=ignore
2. Restart the libvirt-guests service. Once libvirt-guests started, all domains were in shutoff status, and there was a managedsave file under /var/lib/libvirt/qemu/save/ for each domain.
# systemctl restart libvirt-guests
# virsh list --all
 Id    Name       State
----------------------------------------------------
 -     rhel71     shut off
 -     rhel72     shut off
 -     rhel73     shut off
 -     rhel75     shut off
# ll /var/lib/libvirt/qemu/save/
total 2496084
-rw-------. 1 root root  263823173 Mar  7 19:15 rhel71.save
-rw-------. 1 root root  258106558 Mar  7 19:15 rhel72.save
-rw-------. 1 root root 1028830894 Mar  7 19:15 rhel73.save
-rw-------. 1 root root 1005223558 Mar  7 19:15 rhel75.save
When starting the guests, all guests went back to the place where they left off.
3. Reboot the host; got the same result as in step 2.
# reboot
# virsh list --all

Scenario 3
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=shutdown
ON_BOOT=ignore
2. Restart the libvirt-guests service. Once libvirt-guests started, all domains were in shutoff status, and there was no managedsave file under /var/lib/libvirt/qemu/save/ for any domain.
# systemctl restart libvirt-guests
# virsh list --all
 Id    Name       State
----------------------------------------------------
 -     rhel71     shut off
 -     rhel72     shut off
 -     rhel73     shut off
 -     rhel75     shut off
# ll /var/lib/libvirt/qemu/save/
total 0
When starting the guests, all guests had a fresh start.
3. Reboot the host; got the same result as in step 2.
# reboot
# virsh list --all

Scenario 4
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=shutdown
ON_BOOT=start
2. Restart the libvirt-guests service. Once libvirt-guests started, all domains were in running status; the guests did not go back to the place where they left off, they just had a fresh boot.
# systemctl restart libvirt-guests
# virsh list --all
 Id    Name       State
----------------------------------------------------
 -     rhel71     running
 -     rhel72     running
 -     rhel73     running
 -     rhel75     running
3. Reboot the host; got the same result as in step 2.
# reboot
# virsh list --all

SETUP 2
All steps were the same as in SETUP 1 except step 3: autostart was enabled for rhel71 and rhel72 and left disabled for rhel73 and rhel75.
# virsh autostart rhel71
# virsh autostart rhel72

Scenario 1
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=suspend
ON_BOOT=ignore
2. Restart the libvirt-guests service. Got the same result as in step 2 of SETUP 1, Scenario 2.
3. Reboot the host. Once the host had started completely, the rhel71 and rhel72 guests were in running status and stayed in the place where they left off. The rhel73 and rhel75 guests were in shutoff status, and I could see their save files under the /var/lib/libvirt/qemu/save folder.

Scenario 2
1. Edit the configuration in /etc/sysconfig/libvirt-guests:
ON_SHUTDOWN=shutdown
ON_BOOT=ignore
2. Restart the libvirt-guests service. Got the same result as in step 2 of SETUP 1, Scenario 3.
3. Reboot the host. Once the host had started completely, the rhel71 and rhel72 guests were in running status but did not stay in the place where they left off; they just had a fresh start. The rhel73 and rhel75 guests were in shutoff status, and I could not see any save files for them under the /var/lib/libvirt/qemu/save folder.

Based on the above info, marking this bug verified.

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
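The whole scenario matrix above reduces to two variables that the libvirt-guests.sh script sources from /etc/sysconfig/libvirt-guests: ON_SHUTDOWN selects what happens to running guests when the service stops, and ON_BOOT selects what happens when it starts. A minimal sketch of that sourcing against a sample file (the real file also carries other settings, such as URIS and SHUTDOWN_TIMEOUT):

```shell
# Sketch: source a sample sysconfig file the way libvirt-guests.sh sources
# /etc/sysconfig/libvirt-guests. The sample uses the Scenario 1 values.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
ON_SHUTDOWN=suspend
ON_BOOT=start
EOF

. "$cfg"
echo "on shutdown: $ON_SHUTDOWN, on boot: $ON_BOOT"   # → on shutdown: suspend, on boot: start
```

Swapping in the other verified values (shutdown/ignore etc.) reproduces the remaining scenarios.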