Description of problem:
In a 3-node pacemaker cluster, the following errors were seen in pacemaker.log:

lrmd: info: pcmk_dbus_timeout_dispatch: Timeout 0x208a340 expired
lrmd: info: pcmk_dbus_find_error: LoadUnit error 'org.freedesktop.DBus.Error.NoReply': Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
lrmd: error: systemd_loadunit_result: Unexpected DBus type, expected o in 's' instead of s

After these events the node was fenced.

Version-Release number of selected component (if applicable):
systemd-219-19.el7_2.4.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
kernel-3.10.0-327.10.1.el7.x86_64

How reproducible:
N/A

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Found a similar BZ, which might or might not be relevant:
https://bugzilla.redhat.com/show_bug.cgi?id=1139701
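To check how often this pattern occurs, the log can be scanned for the two DBus messages quoted in the description. The following is only a minimal sketch in Python 3: the regular expressions are derived from the log lines above, and the function name `scan` is made up for illustration. It is not part of pacemaker.

```python
import re

# Patterns derived from the log lines quoted in the description.
TIMEOUT = re.compile(r"pcmk_dbus_timeout_dispatch:\s+Timeout (0x[0-9a-fA-F]+) expired")
NOREPLY = re.compile(
    r"pcmk_dbus_find_error:\s+(\w+) error 'org\.freedesktop\.DBus\.Error\.NoReply'"
)


def scan(lines):
    """Return (kind, detail) tuples for DBus timeout evidence, in log order.

    kind is "timeout" (detail = timeout handle) or "noreply"
    (detail = the DBus method that got no reply, e.g. LoadUnit).
    """
    hits = []
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            hits.append(("timeout", match.group(1)))
            continue
        match = NOREPLY.search(line)
        if match:
            hits.append(("noreply", match.group(1)))
    return hits
```

Usage would be along the lines of `scan(open(path_to_pacemaker_log, errors="replace"))`, where the path is whatever location pacemaker.log has on the affected node.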
What was happening on this node at the time of the failure? It looks like systemd is overloaded for some reason.
Robin, The first two log messages in the Description indicate that one of pacemaker's interactions with DBus timed out, so the associated systemd resource action was considered failed. The node was fenced most likely because on-fail=fence for the action, whether explicitly configured or because this was a stop action, which defaults to fence. The last message ("Unexpected DBus type") is a bit of poor error logging by pacemaker, but the behavior is correct. That logging is fixed by upstream commit 7cdd84f6, which will be in RHEL 7.3. I'm not sure why the backups would suddenly start causing DBus timeouts; that's beyond pacemaker's scope, and we really don't have a window into it. If DBus actions are routinely expected to take longer now, you could try raising the op timeouts on affected systemd resources, although addressing the underlying cause is better of course. You could also look at the on-fail setting for the affected operation(s). BTW, the collab-shell links are not found; is the case still there?
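As an illustration of raising an op timeout, here is a small sketch that only builds the command line and does not run anything. It assumes the cluster is managed with pcs and its `pcs resource update <resource id> op <action> <options>` form; the resource name `my-systemd-svc` and the value `120s` are placeholders, not recommendations.

```python
import shlex


def pcs_update_op(resource_id, action, **options):
    """Build a `pcs resource update <id> op <action> ...` argument list.

    Python keyword names cannot contain "-", so on_fail becomes on-fail.
    """
    opts = [
        "{}={}".format(name.replace("_", "-"), value)
        for name, value in sorted(options.items())
    ]
    return ["pcs", "resource", "update", resource_id, "op", action] + opts


# Hypothetical resource name and timeout value, for illustration only.
cmd = pcs_update_op("my-systemd-svc", "stop", timeout="120s")
print(" ".join(shlex.quote(arg) for arg in cmd))
# prints: pcs resource update my-systemd-svc op stop timeout=120s
```

An `on_fail=...` keyword can be passed the same way if the on-fail setting of the operation is to be changed as well.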
Commit 7cdd84f6 is for pacemaker: https://github.com/ClusterLabs/pacemaker/commit/7cdd84f6 The function previously returned the wrong return code in this particular failure scenario. That had no adverse effect other than to needlessly run some checks that would then log that message.
Reassigning to DBus component for further investigation. Pacemaker is getting timeouts from the DBus API while processing a systemd resource. It is trying to call the LoadUnit method of org.freedesktop.systemd1.Manager.
org.freedesktop.systemd1.Manager is a systemd interface, so the problem is unlikely to be related to dbus-daemon.
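To see how long systemd takes to answer the same method outside of pacemaker, the call can be timed by hand. The sketch below (Python 3.5+, assumed to be available on the node) shells out to `busctl call` against the systemd manager object. The helper names and the 30 second limit are arbitrary choices for illustration and are not the timeout pacemaker itself uses.

```python
import subprocess
import time


def build_loadunit_cmd(unit):
    """Argument list for calling Manager.LoadUnit(unit) through busctl."""
    return [
        "busctl", "call",
        "org.freedesktop.systemd1",          # bus name
        "/org/freedesktop/systemd1",         # object path
        "org.freedesktop.systemd1.Manager",  # interface
        "LoadUnit", "s", unit,               # method, signature, argument
    ]


def time_loadunit(unit, timeout=30.0, run=subprocess.run):
    """Return (ok, seconds) for one LoadUnit call.

    `run` is injectable so the function can be exercised without a bus.
    ok is False if busctl fails or does not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        proc = run(
            build_loadunit_cmd(unit),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    return ok, time.monotonic() - start
```

Running `time_loadunit("<unit>.service")` in a loop while the node is under the suspected load would show whether LoadUnit replies slow down at those times.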
Do you have any logs from that system? Are you able to reproduce it?
Hi Lucas,

The customer hit this issue again. Could you please let me know what details you need in order to troubleshoot the DBus timeout issue?

Is it OK to just append a "debug" option to the end of the kernel cmdline? What is the exact journalctl command whose output you want collected if the issue reproduces?

Best Regards,
Chen
Just FYI:

A related, similar-looking issue was fixed before. [1]

On the customer's stack system, the pacemaker packages were updated at:

Mar 21 20:26:00

to pacemaker-1.1.13-10.el7_2.2.x86_64

After that yum update, their system seemed to be screwed up.

* I am not saying that `pacemaker-1.1.13-10.el7_2.2.x86_64` screwed up their system, but...

[1] https://access.redhat.com/solutions/1365513
(In reply to Shinobu KINJO from comment #29)
> Just FYI:
>
> Related issue looking similar was fixed before. [1]
>
> On customer's stack system, the pacemaker packages were updated at:
>
> Mar 21 20:26:00
>
> To pacemaker-1.1.13-10.el7_2.2.x86_64
>
> After that yum update, their system seemed to be screwed up.

If there has been no modification, change, or whatever on their system...

> * I will not mention that `pacemaker-1.1.13-10.el7_2.2.x86_64` screwed up
> their system but...
>
> [1] https://access.redhat.com/solutions/1365513
(In reply to Chen from comment #28)
> Hi Lucas,
>
> The customer hit this issue again. Would you please let me know the details
> in order to troubleshoot the DBus timeout issue ?
>
> Is it ok to just append a "debug" option at the end of the kernel cmdline ?
> What is the exact command of journalctl you want to collect if the issue
> reproduced ?
>
> Best Regards,
> Chen

Yep, just add "debug" to the end of the kernel cmdline and post the output of journalctl -b here after you reproduce the issue.
Hi Faiaz,

Does comment #32 look good to you for updating the case?

Best Regards,
Chen
When systemd loads a unit, it needs to read the unit file from disk. It seems possible that when the server is under heavy I/O load, systemd's read request is not processed soon enough, which in turn may cause the timeout on the pacemaker side. I am not sure we can fix this in any way other than bumping the timeout value.