Bug 1416043
| Summary: | libvirtd loses connection to dbus and needs to be restarted | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andreas Bleischwitz <ableisch> |
| Component: | libvirt | Assignee: | Martin Kletzander <mkletzan> |
| Status: | CLOSED DUPLICATE | QA Contact: | yafu <yafu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.3 | CC: | ableisch, alexander.hass, amureini, berrange, dmardones, dmoessne, dyuan, eblake, erik, gwatson, henning.sackewitz, linux, mkardeh, mtessun, nsoffer, peter.engel, pzhang, rbalakri, rworkman, sbonazzo, tnisan, xuzhang, zpeng |
| Target Milestone: | pre-dev-freeze | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-21 10:04:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1420851 | | |
Description
Andreas Bleischwitz
2017-01-24 13:09:34 UTC
AFAIK, it is *not* supported to restart the DBus system bus - it is expected to remain running uninterrupted for the duration of the OS boot. The libdbus client library also does not expect applications to reconnect to dbus - it just provides a single global dbus connection handle and applications aren't supposed to try to recreate it. So if you do restart the DBus daemon for whatever reason, it is normal that applications using dbus will break and themselves need to be restarted.

IOW, I don't think there's anything that needs fixing in libvirt here. Simply don't ever stop the dbus system daemon. In the event that dbus was mistakenly restarted, then simply restart libvirtd too.

Maybe systemd should be smarter about automatically restarting services that depend on the system bus - but if so, that's a systemd bug report, not libvirtd (unless the libvirtd.service file needs a one-line addition to inform systemd that it does indeed depend on the system bus).

(In reply to Eric Blake from comment #5)
> Maybe systemd should be smarter about automatically restarting services that
> depend on the system bus - but if so, that's a systemd bug report, not
> libvirtd (unless the libvirtd.service file needs a one-line addition to
> inform systemd that it does indeed depend on the system bus)

If the dbus daemon has stopped/restarted, then likely the entire OS should be restarted rather than trying to fix it up by figuring out which services need restarting.

(In reply to Daniel Berrange from comment #9)
> (In reply to Eric Blake from comment #5)
> > Maybe systemd should be smarter about automatically restarting services that
> > depend on the system bus - but if so, that's a systemd bug report, not
> > libvirtd (unless the libvirtd.service file needs a one-line addition to
> > inform systemd that it does indeed depend on the system bus)
>
> If the dbus daemon has stopped/restarted, then likely the entire OS should be
> restarted rather than trying to fix it up by figuring out which services
> need restarting.

According to the process times, dbus hasn't been restarted, but somehow the libvirt-dbus connection got dropped.

Any idea how to proceed here? From my point of view:

Workaround: restart libvirt (could this be automated somehow?)
Solution: find the reason why the connection drops and resolve this (any ideas on how to debug this, especially as it does not happen this often?)

So any ideas how to move the analysis of this one forward?

Some snippets from the processes:

dbus  1165  0.0  0.0   45664  12456 ?  Ssl   2016   3:05 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root  2616  2.4  0.0  961808  34444 ?  S<sl 14:14   0:25 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
root  2636  0.3  0.0 1236080  21356 ?  Ssl  14:14   0:03 /usr/sbin/libvirtd --listen
vdsm  3638 97.4  0.0 3773852  65020 ?  S<sl 14:14  16:43 /usr/bin/python /usr/share/vdsm/vdsm
vdsm  3639  0.5  0.0  892392  22064 ?  Ssl  14:14   0:06 python /usr/sbin/momd -c /etc/vdsm/mom.conf
vdsm  3754  0.0  0.0  603812   1292 ?  S<l  14:14   0:00 /usr/libexec/ioprocess --read-pipe-fd 53 --write-pipe-fd 51 --max-threads 10 --max-queued-requests 10
vdsm  3814  0.0  0.0  382616   1132 ?  S<l  14:14   0:00 /usr/libexec/ioprocess --read-pipe-fd 71 --write-pipe-fd 70 --max-threads 10 --max-queued-requests 10

So dbus was started in 2016 and libvirt was (re-)started 2017-01-09 14:14.

Before the restart, libvirt, vdsm and dbus looked like:

root  1155  7.3  0.0 1543456  35308 ?  S<sl  2016  3301:27 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
dbus  1165  0.0  0.0   45532  12196 ?  Ssl   2016     3:05 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root  2052 21.1  0.0 1308424 285836 ?  Ssl   2016  9468:21 /usr/sbin/libvirtd --listen
vdsm  5529 30.6  0.0 5576832 431476 ?  S<sl  2016 13712:09 /usr/bin/python /usr/share/vdsm/vdsm
vdsm  5533  0.6  0.0 1039856 138616 ?  Ssl   2016   289:54 python /usr/sbin/momd -c /etc/vdsm/mom.conf

As the problem didn't occur earlier than 2017-01, I would expect that somehow the libvirt-dbus connection got closed/lost, which should probably not happen - or am I missing something here?

(In reply to Martin Tessun from comment #12)
> (In reply to Daniel Berrange from comment #9)
> > (In reply to Eric Blake from comment #5)
> > > Maybe systemd should be smarter about automatically restarting services that
> > > depend on the system bus - but if so, that's a systemd bug report, not
> > > libvirtd (unless the libvirtd.service file needs a one-line addition to
> > > inform systemd that it does indeed depend on the system bus)
> >
> > If the dbus daemon has stopped/restarted, then likely the entire OS should be
> > restarted rather than trying to fix it up by figuring out which services
> > need restarting.
>
> According to the process times, dbus hasn't been restarted, but somehow the
> libvirt-dbus connection got dropped.

UNIX domain sockets simply don't break on their own - one end must have terminated the connection. If dbus wasn't restarted, either dbus closed the socket, or libvirt mistakenly closed its end of the socket. Unless the scenario is reliably reproducible this is pretty much impossible to debug further.

> Any idea how to proceed here?

CLOSED -> WORKSFORME / INSUFFICIENT_DATA since AFAICT, there's not enough to show what's happening and without it being reliably reproducible, there's no way to get the required info.

(In reply to Daniel Berrange from comment #13)
> (In reply to Martin Tessun from comment #12)
> >
> > According to the process times, dbus hasn't been restarted, but somehow the
> > libvirt-dbus connection got dropped.
>
> UNIX domain sockets simply don't break on their own - one end must have
> terminated the connection. If dbus wasn't restarted, either dbus closed the
> socket, or libvirt mistakenly closed its end of the socket. Unless the
> scenario is reliably reproducible this is pretty much impossible to debug
> further.
>
> > Any idea how to proceed here?
>
> CLOSED -> WORKSFORME / INSUFFICIENT_DATA since AFAICT, there's not enough to
> show what's happening and without it being reliably reproducible, there's no
> way to get the required info.

Ok. So is there anything we can ask the affected customers to run/collect to be able to further debug this issue? I believe we just need to ask for the data we need to further debug the issue, so we can analyze it.

Thanks!
Martin

Hi All,

we had the same issue several more times on our systems in SAP LinuxLab. SAP colleagues are permanently running tests on that. As a duplicate of this bug, we closed https://bugzilla.redhat.com/show_bug.cgi?id=1422556 - VM occasionally losing disk.
I would like to share the info on our system:

----------------------------------------------------------------
"libvirt" and "qemu" versions are:

~# rpm -q libvirt
libvirt-2.0.0-10.el7_3.4.x86_64

~# rpm -qa | grep qemu-
qemu-img-rhev-2.6.0-28.el7_3.3.x86_64
qemu-kvm-common-rhev-2.6.0-28.el7_3.3.x86_64
qemu-kvm-tools-rhev-2.6.0-28.el7_3.3.x86_64
ipxe-roms-qemu-20160127-5.git6366fa7a.el7.noarch
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.4.x86_64
-------------------------------------------------------------

I attached the system logs, the vdsm log file and sosreport-LogCollector.tar.xz to Bug 1422556. Maybe these also help you to analyse the error.

Two questions would be:
1- Why is libvirtd getting into a state where the "ListActivatableName" error is produced?
2- What is the reason for the deactivated or missing disk?

Thanks and best regards,
Mota

(In reply to Martin Tessun from comment #14)
> (In reply to Daniel Berrange from comment #13)
> > (In reply to Martin Tessun from comment #12)
> > >
> > > According to the process times, dbus hasn't been restarted, but somehow the
> > > libvirt-dbus connection got dropped.
> >
> > UNIX domain sockets simply don't break on their own - one end must have
> > terminated the connection. If dbus wasn't restarted, either dbus closed the
> > socket, or libvirt mistakenly closed its end of the socket. Unless the
> > scenario is reliably reproducible this is pretty much impossible to debug
> > further.
> >
> > > Any idea how to proceed here?
> >
> > CLOSED -> WORKSFORME / INSUFFICIENT_DATA since AFAICT, there's not enough to
> > show what's happening and without it being reliably reproducible, there's no
> > way to get the required info.
>
> Ok. So is there anything we can ask the affected customers to run/collect to
> be able to further debug this issue?
> I believe we just need to ask for the data we need to further debug the
> issue, so we can analyze it.

Martin,

We got the issue again on our VM. Please let me know if you need any additional info regarding running/collecting any log files to debug and analyze the issue.

P.S. I tried to call you a couple of minutes ago.

>
> Thanks!
> Martin

Hi Mota,

I am not sure what is needed to further analyze this. I also asked for what is needed.

@Daniel: Anything that can be collected to get more information on this issue?

Thanks!
Martin

Assigning to Martin for further discussion about libvirtd's dependency on systemd's bus restart, as was discussed by Daniel and Eric in comment 4 and comment 5. Please note that we might not be able to find the root cause of this issue and may close it the same way Daniel did. If you are reopening it, please provide your justification for doing so.

Related changes in handling the connection to the systemd bus were recently proposed upstream by Jan Tomko:
https://www.redhat.com/archives/libvir-list/2017-February/msg01271.html

The only thing we can really do is turn up libvirtd debug logs to a high level, but that'll generate many 100's of MBs of logs in any host that's actively used. So identifying the problem in there will be very hard, if not impossible. Realistically we're only likely to be able to solve it if someone can identify a way to reliably reproduce the problem in a clean system.

(In reply to Daniel Berrange from comment #20)
> The only thing we can really do is turn up libvirtd debug logs to a high
> level, but that'll generate many 100's of MBs of logs in any host that's
> actively used. So identifying the problem in there will be very hard, if not
> impossible. Realistically we're only likely to be able to solve it if
> someone can identify a way to reliably reproduce the problem in a clean system.

Daniel,

the issue is reproduced on our different systems. So, I have not yet got a reply to comment #17. Anything that can be collected to get more information on this issue? I attached the system logs and sosreport-LogCollector to Bug 1422556.

Maybe there is a way to collect the debug log on only the host where the issue happened?

Regards,
Mota

Hi Mota,

(In reply to Mota Kardeh from comment #21)
> (In reply to Daniel Berrange from comment #20)
> > The only thing we can really do is turn up libvirtd debug logs to a high
> > level, but that'll generate many 100's of MBs of logs in any host that's
> > actively used. So identifying the problem in there will be very hard, if not
> > impossible. Realistically we're only likely to be able to solve it if
> > someone can identify a way to reliably reproduce the problem in a clean system.
>
> Daniel,
> the issue is reproduced on our different systems. So, I have not yet got a
> reply to comment #17.
> Anything that can be collected to get more information on this issue? I
> attached the system logs and sosreport-LogCollector to Bug 1422556.
> Maybe there is a way to collect the debug log on only the host where the
> issue happened?

as said by Daniel in comment 20, the thing to do is to turn up the libvirtd logs to debug level.
(In reply to Martin Tessun from comment #22)
> Please note that this will generate lots of logs, so be sure to have enough
> space for the logs available and don't forget to reduce the logging back to
> normal once the situation has been successfully captured.
>
> Also please note that identifying the problem in these logs might not be
> possible due to the amount of logs generated.
>
> The loglevel is defined in /etc/libvirt/libvirtd.log:

Martin,

thanks for your update. :-)
Maybe you mean /etc/libvirt/libvirtd.conf, because /etc/libvirt/libvirtd.log doesn't exist on the system.

> log_level=1
>
> Also the location should be set to a separate file, as systemd will probably
> not log everything. So setting
>
> log_outputs="1:file:/<FILE_PATH>

It would be great to know how this separate file can help us, because all the required log collector files are attached to Bug 1422556 and sosreport-LogCollector.tar.xz is available under:
https://drive.google.com/open?id=0BwWzNYssmDVabVo1THFHdXVFS2c

> should also be set.
>
> Please zip and upload the logs to a location we can access and notify us,
> once the logfiles are there.
>
> BTW: It is probably best to open a support case for this, so that a Support
> Engineer can have a look at the logs first.

Unfortunately we cannot open a support case for this, because we are software engineers here and that part of the support task belongs to somebody from the support team.

Thanks and best regards,
Mota

>
> > Regards,
> > Mota
>
> Cheers,
> Martin

(In reply to Mota Kardeh from comment #23)
> (In reply to Martin Tessun from comment #22)
> > The loglevel is defined in /etc/libvirt/libvirtd.log:
>
> Martin,
> thanks for your update. :-)
> Maybe you mean /etc/libvirt/libvirtd.conf, because /etc/libvirt/libvirtd.log
> doesn't exist on the system.

Yes, indeed. Sorry for the typo.

> > Also the location should be set to a separate file, as systemd will probably
> > not log everything. So setting
> >
> > log_outputs="1:file:/<FILE_PATH>
>
> It would be great to know how this separate file can help us, because all the
> required log collector files are attached to Bug 1422556 and
> sosreport-LogCollector.tar.xz is available under:

The debug level is the setting that changes things. As standard logging goes through syslog/systemd, you don't want this amount of data processed there. As such, I proposed writing the debug data to a separate file instead of syslog.

Does this make it clear?

Cheers,
Martin

Hello Martin,

Sorry for the delay. I asked Jaroslav, regarding the issue, what the best strategy is to continue with the process: shall we continue working on this bug or on the primary bug of our issue, Bug 1422556?

To get the debug log of libvirt, I ran the following steps:

- open /etc/libvirt/libvirtd.conf
- find & replace, or set, these variables:
  log_level = 1
  log_filters="3:remote 4:event 3:json 3:rpc"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"
- save and exit
- restart the libvirtd service:
  systemctl restart libvirtd.service

The libvirt log file of the system is available under:
https://drive.google.com/open?id=0BwWzNYssmDVaWE11bTVZQ0ZRSGM

Please let me know if you need any additional log files or info.

Thanks and best regards,
Mota

Since the BZs are going in the same direction, I'm closing this one as a duplicate so that we don't get confused even more.

*** This bug has been marked as a duplicate of bug 1422556 ***

I hate to comment on a closed bug, but I think we're running into this issue after enabling polkit access controls, and since the dupe bug is private, it's impossible to see ongoing progress.
Any hints or tips for mitigation?

(In reply to Robby Workman from comment #27)

Oh, the private settings. I don't know why the other bug is private, I'll ask about that.

Anyway, there is not much progress. It looks like the problem is in the dbus daemon, as the connection cannot break in any other place. It must be an overloaded system or a bug along the way, but the disconnection is not easy to achieve on a UNIX socket otherwise.

What version of dbus are you other folks running?

We're on 1.10.8 on the affected system, and in looking over the dbus git log for the 1.10.x branch, I see this commit, which went into 1.10.14:
commit 3f407671ecf821eb38ea7af5b160bfb93a9f4584
Author: Simon McVittie <simon.mcvittie.uk>
Date: Fri Nov 11 16:40:44 2016 +0000
Make uid 0 immune to pending_fd_timeout limit
This is a workaround for
<https://bugs.freedesktop.org/show_bug.cgi?id=95263>. If a service
sends a file descriptor sufficiently frequently that its queue of
messages never goes down to 0 fds pending, then it will eventually be
disconnected. logind is one such service.
We do not currently have a good solution for this: the proposed
patches either don't work, or reintroduce a denial of service
security vulnerability (CVE-2014-3637). Neither seems desirable.
However, we can avoid the worst symptoms by trusting uid 0 not to be
malicious.
...
That sounds like a possibility perhaps...
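For reference, pending_fd_timeout is one of the configurable limits of the system bus, so besides upgrading dbus it can also be raised locally while testing that theory. A minimal override might look like the sketch below; the file path and the 600000 ms value are illustrative assumptions, not settings taken from this report:

```xml
<!-- /etc/dbus-1/system-local.conf (hypothetical local override) -->
<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <!-- Milliseconds a connection may keep file descriptors queued before the
       bus disconnects it; raising it is one way to check whether the
       libvirtd disconnects are related to this limit. -->
  <limit name="pending_fd_timeout">600000</limit>
</busconfig>
```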
Nope, no change with dbus-1.10.18 now. Sigh. :/

I might be getting hopes up for nothing, but with libvirt-3.8.0 plus the patch from below, we've been up for over 24 hours with multiple virt-manager instances connected and monitoring with no lost connections. At the very least, this seems to have helped...
commit 65a983eca1f44eaf2a209ae232f72aa659184945
Author: Kothapally Madhu Pavan <kmp.ibm.com>
Date: Wed Oct 4 14:08:45 2017 +0530
util: Free a pointer in virPolkitCheckAuth
Free DBusMessage pointer in virPolkitCheckAuth
Signed-off-by: Kothapally Madhu Pavan <kmp.ibm.com>
diff --git a/src/util/virpolkit.c b/src/util/virpolkit.c
index c735ca9b1..4559431ba 100644
--- a/src/util/virpolkit.c
+++ b/src/util/virpolkit.c
@@ -138,6 +138,7 @@ int virPolkitCheckAuth(const char *actionid,
cleanup:
virStringListFreeCount(retdetails, nretdetails);
+ virDBusMessageUnref(reply);
return ret;
}
(In reply to Robby Workman from comment #31)

I don't think this has anything to do with it, but I'm glad to hear this is better for you now. Are you sure you did not upgrade anything else on the system?

(In reply to Robby Workman from comment #31)
> I might be getting hopes up for nothing, but with libvirt-3.8.0 plus the
> patch from below, we've been up for over 24 hours with multiple virt-manager
> instances connected and monitoring with no lost connections. At the very
> least, this seems to have helped...
>
> commit 65a983eca1f44eaf2a209ae232f72aa659184945
> Author: Kothapally Madhu Pavan <kmp.ibm.com>
> Date: Wed Oct 4 14:08:45 2017 +0530
>
>     util: Free a pointer in virPolkitCheckAuth
>
>     Free DBusMessage pointer in virPolkitCheckAuth
>
>     Signed-off-by: Kothapally Madhu Pavan <kmp.ibm.com>
>
> diff --git a/src/util/virpolkit.c b/src/util/virpolkit.c
> index c735ca9b1..4559431ba 100644
> --- a/src/util/virpolkit.c
> +++ b/src/util/virpolkit.c
> @@ -138,6 +138,7 @@ int virPolkitCheckAuth(const char *actionid,
>
>  cleanup:
>      virStringListFreeCount(retdetails, nretdetails);
> +    virDBusMessageUnref(reply);
>      return ret;
>  }

I don't see how this could resolve problems with the dbus connection, as it is merely a memory leak fix.

Note that the problem first appeared when we enabled polkit auth in libvirtd.conf with the following:

auth_unix_rw = "polkit"
access_drivers = [ "polkit" ]

Aside from routine distribution-provided updates, here's a timeline of upgraded packages since we first encountered this problem (I first commented on Jun 11):

Jun 15  /var/log/packages/polkit-0.113-x86_64-1_spike14.2
Jul 27  /var/log/packages/dbus-1.10.20-x86_64-1_spike14.2
Sep 19  /var/log/packages/libasr-1.0.2-x86_64-1_spike14.2
Sep 19  /var/log/packages/opensmtpd-6.0.2p1-x86_64-1_spike14.2
Oct  7  /var/log/packages/libvirt-3.8.0-x86_64-1_spike14.2
Oct  7  /var/log/packages/libvirt-python-3.8.0-x86_64-1_spike14.2
Oct  7  /var/log/packages/libvirt-glib-1.0.0-x86_64-2_spike14.2

The custom polkit package includes several memleak fixes from the git repo. This did not seem to have any effect on the bug.

The custom dbus package is an upgrade from 1.10.8 (the distro version). This did not seem to have any effect on the bug.

libasr and opensmtpd should be irrelevant. What's left are libvirt and friends, wherein the libvirt 3.8.0 package also includes that patch referenced above.

It may well not be the fix, but after digging through what seemed like infinite lines of debug output without a single clue being found, and knowing that there wasn't a problem at all without the polkit access controls in use (this machine is one of several virt hosts I have running almost identical setups other than polkit access controls), I've basically been watching the git logs of libvirt, polkit, and dbus with a large amount of hope that someone stumbles across more than I did.

On a slightly related note, when closing a bug as a dupe of another, it would be nice if the one remaining open isn't marked as private.

(In reply to Robby Workman from comment #34)

I asked for it to be publicized.
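Taken together, the libvirtd.conf settings mentioned in this report look roughly like the sketch below - the polkit lines are what preceded the disconnects described from comment 27 onwards, and the logging lines are the debug setup used while chasing the lost connection. The values are exactly those quoted in the comments above; whether they suit any other host is an assumption.

```
# /etc/libvirt/libvirtd.conf -- settings discussed in this report (illustrative)

# polkit access control; enabling these two lines is what preceded the
# lost dbus connections on the affected host
auth_unix_rw = "polkit"
access_drivers = [ "polkit" ]

# debug logging used while chasing the disconnect; this produces very large
# log files, so point it at a filesystem with enough free space and revert
# once the problem has been captured
log_level = 1
log_filters = "3:remote 4:event 3:json 3:rpc"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
```

A `systemctl restart libvirtd.service` is needed afterwards for the settings to take effect, as noted in comment 25.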