Bug 1323068

Summary: systemd_loadunit_result: Unexpected DBus type, expected o in 's' instead of s
Product: Red Hat Enterprise Linux 7 Reporter: Robin Cernin <rcernin>
Component: systemdAssignee: systemd-maint
Status: CLOSED NOTABUG QA Contact: qe-baseos-daemons
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.2CC: abeekhof, bblaskov, cchen, cluster-maint, fahmed, jruemker, kgaillot, lnykryn, mfuruta, mschuppe, msekleta, mzheng, qguo, skinjo, systemd-maint-list, systemd-maint, tpelka, vaggarwa
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-14 17:02:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1298243    

Description Robin Cernin 2016-04-01 07:42:51 UTC
Description of problem:

In 3-node pacemaker cluster has seen the following error (seen in pacemaker.log):

lrmd:     info: pcmk_dbus_timeout_dispatch:        Timeout 0x208a340 expired
lrmd:     info: pcmk_dbus_find_error:      LoadUnit error 'org.freedesktop.DBus.Error.NoReply': Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security po
licy blocked the reply, the reply timeout expired, or the network connection was broken.
lrmd:    error: systemd_loadunit_result:   Unexpected DBus type, expected o in 's' instead of s

After these events the node was fenced.

Version-Release number of selected component (if applicable):

systemd-219-19.el7_2.4.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
kernel-3.10.0-327.10.1.el7.x86_64

How reproducible:

N/A

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Found similar BZ which might or might not be relevant https://bugzilla.redhat.com/show_bug.cgi?id=1139701

Comment 4 Andrew Beekhof 2016-04-04 04:35:36 UTC
What was happening on this node at the time of the failure?
It looks like systemd is overloaded for some reason.

Comment 6 Ken Gaillot 2016-04-05 21:16:01 UTC
Robin,

The first two log messages in the Description indicate that one of pacemaker's interactions with DBus timed out, so the associated systemd resource action was considered failed. The node was fenced most likely because on-fail=fence for the action, whether explicitly configured or because this was a stop action, which defaults to fence.

The last message ("Unexpected DBus type") is a bit of poor error logging by pacemaker, but the behavior is correct. That logging is fixed by upstream commit 7cdd84f6, which will be in RHEL 7.3.

I'm not sure why the backups would suddenly start causing DBus timeouts; that's beyond pacemaker's scope, and we really don't have a window into it. If DBus actions are routinely expected to take longer now, you could try raising the op timeouts on affected systemd resources, although addressing the underlying cause is better of course. You could also look at the on-fail setting for the affected operation(s).

BTW, the collab-shell links are not found; is the case still there?

Comment 8 Ken Gaillot 2016-04-07 21:16:02 UTC
Commit 7cdd84f6 is for pacemaker: https://github.com/ClusterLabs/pacemaker/commit/7cdd84f6

The function previously returned the wrong return code in this particular failure scenario. That had no adverse effect other than to needlessly run some checks that would then log that message.

Comment 9 Ken Gaillot 2016-04-07 21:22:06 UTC
Reassigning to DBus component for further investigation.

Pacemaker is getting timeouts from the DBus API while processing a systemd resource. It is trying to call the LoadUnit method of org.freedesktop.systemd1.Manager.

Comment 11 David King 2016-04-26 15:50:54 UTC
org.freedesktop.systemd1.Manager is a systemd interface, so the problem is unlikely to be related to dbus-daemon.

Comment 12 Lukáš Nykrýn 2016-04-26 17:28:45 UTC
DO you have some logs from that system? Are you able to reproduce it?

Comment 28 Chen 2016-06-15 02:12:18 UTC
Hi Lucas,

The customer hit this issue again. Would you please let me know the details in order to troubleshoot the DBus timeout issue ?

Is it ok to just append a "debug" option at the end of the kernel cmdline ?
What is the exact command of journalctl you want to collect if the issue reproduced ?

Best Regards,
Chen

Comment 29 Shinobu KINJO 2016-06-15 03:02:08 UTC
Just FYI:

Related issue looking similar was fixed before. [1]

On customer's stack system, the pacemaker packages were updated at:

Mar 21 20:26:00

To pacemaker-1.1.13-10.el7_2.2.x86_64

After that yum update, their system seemed to be screwed up.
* I will not mention that `pacemaker-1.1.13-10.el7_2.2.x86_64` screwed up their system but...

[1] https://access.redhat.com/solutions/1365513

Comment 30 Shinobu KINJO 2016-06-15 03:03:39 UTC
(In reply to Shinobu KINJO from comment #29)
> Just FYI:
> 
> Related issue looking similar was fixed before. [1]
> 
> On customer's stack system, the pacemaker packages were updated at:
> 
> Mar 21 20:26:00
> 
> To pacemaker-1.1.13-10.el7_2.2.x86_64
> 
> After that yum update, their system seemed to be screwed up.

If there has been no modification, change, or whatever on their system...

> * I will not mention that `pacemaker-1.1.13-10.el7_2.2.x86_64` screwed up
> their system but...
> 
> [1] https://access.redhat.com/solutions/1365513

Comment 32 Lukáš Nykrýn 2016-06-15 13:09:21 UTC
(In reply to Chen from comment #28)
> Hi Lucas,
> 
> The customer hit this issue again. Would you please let me know the details
> in order to troubleshoot the DBus timeout issue ?
> 
> Is it ok to just append a "debug" option at the end of the kernel cmdline ?
> What is the exact command of journalctl you want to collect if the issue
> reproduced ?
> 
> Best Regards,
> Chen

Yep just add "debug" to the end of kernel cmdline ad post here journalctl -b after you reproduce the issue.

Comment 34 Chen 2016-06-27 03:17:38 UTC
Hi Faiaz,

Does comment #32 look good to you to update the case ?

Best Regards,
Chen

Comment 42 Michal Sekletar 2016-07-20 07:41:09 UTC
When systemd is loading unit file it obviously needs to go and read unit file of the disk. It seem possible that when server is under heavy I/O load then systemd's read request is not processed soon enough and that may in turn cause timeout on pacemaker side. I am not sure if we can somehow fix this, except bumping timeout value.