Bug 1139701
| Summary: | systemd resources operations timeout and get unmanaged | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeff Dexter <jdexter> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.0 | CC: | abeekhof, benglish, cluster-maint, dmaley, dvossel, fdinitto, jherrman, jruemker, lyarwood, mjuricek, mnovacek, qguo, vanhoof, walters |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.12-4.el6 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, the dbus client API could become unable to process responses. This caused systemd resource management to fail and the managed resources to become unresponsive. With this update, the dbus connection to systemd remains stable and no longer leads to unresponsive resources. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1145740 (view as bug list) | Environment: | |
| Last Closed: | 2015-03-05 10:00:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1064025, 1145740 | | |
| Attachments: | | | |
Description
Jeff Dexter
2014-09-09 13:23:52 UTC
Created attachment 935701 [details]
Pacemaker log files; first signs of error at line 395
Need logs (preferably in a crm_report archive) that demonstrate the error in the description.

We have a reproducer for this issue. My analysis is that this issue exists in the dbus client API. The problem appears to be that after a period of time, the dbus client API is no longer able to process responses.

Steps to Reproduce:
1. Within the pacemaker source tree, apply the reproducer patch attached to this issue: patch -p1 < reproducer_v1.patch
2. ./autogen.sh && ./configure && make core scratch
3. ./scratch

From there the 'scratch' application will issue a continuous stream of dbus requests to systemd to retrieve info on a unit file. Within a minute the scratch application will hang and begin displaying errors indicating 'NoReply' was received for a request.

Analysis of 'NoReply' errors:

Using dbus-monitor and some pattern matching, I can watch the requests and responses while I run the reproducer. Once the reproducer starts to hang, I can see that every request the reproducer has made has a matching response. Eventually the reproducer reports this error:

    info: systemd_unit_by_name: Call to GetUnit failed: org.clusterlabs.pacemaker.NoReply

From the dbus logs, I can clearly see that the last GetUnit request has a reply though. After that error, the reproducer attempts to reload systemd using a dbus command. I can see the Reload command issued in the dbus logs, and I can see the systemd response. Oddly enough, the reproducer does not hang on this command even though the previous one, GetUnit, timed out. Tracing through the code, I can see this is because we don't process the response of the Reload command. After the reload we attempt to do a LoadUnit command, and once again, when we actually attempt to wait for the response, we get a 'NoReply' error. This indicates that the dbus connection still works even after the 'NoReply' error is encountered. Pacemaker uses the dbus client API directly.
From what I've gathered, this points to a bug in the client API that prevents us from being able to retrieve responses after a period of time. Here's the usage of the dbus client API that is in question: https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/dbus.c#L123

From what I gather, dbus_pending_call_block(pending) never gets a response and times out. When we call 'reply = dbus_pending_call_steal_reply(pending)', the resulting dbus_message_get_type(reply) returns DBUS_MESSAGE_TYPE_ERROR. The error string then indicates that the 'NoReply' error occurred.

Below are the relevant dbus-monitor logs showing that systemd does in fact reply:

    # dbus-monitor --system "interface=org.freedesktop.DBus.Properties" "interface=org.freedesktop.systemd1.Manager" "type=method_return" | grep -e "serial=.*member=GetUnit" -e "serial=.*member=Reload" -e "serial=.*member=LoadUnit" -e "serial=.*member=GetAll" -e "reply_serial="
    . . .
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53974 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=GetUnit
    method return sender=:1.2 -> dest=:1.150 reply_serial=53974
    <this is where the reproducer hangs>
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53975 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=Reload
    method return sender=:1.2 -> dest=:1.150 reply_serial=53975
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53976 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=LoadUnit
    method return sender=:1.2 -> dest=:1.150 reply_serial=53976

-- Vossel

I'm assigning this to the dbus team. We need to track down why the dbus client API stops processing replies. Let me know if there is anything I can do to help.

-- Vossel

Is the systemd RPM itself being updated on the host?
I think you might see this issue if you have in-flight requests, as (to my knowledge) it doesn't serialize its dbus pending queue.

(In reply to Colin Walters from comment #6)
> Is the systemd RPM itself being updated on the host? I think you might see
> this issue if you have in-flight requests, as (to my knowledge) it doesn't
> serialize its dbus pending queue.

No, the system is stable and only running the test. No other activities.

Ok. From a quick look at that code, it looks like you're reimplementing dbus_connection_send_with_reply_and_block()? Why?

(In reply to Colin Walters from comment #8)
> Ok.
>
> From a quick look at that code, it looks like you're reimplementing
> dbus_connection_send_with_reply_and_block()? Why?

I noticed the same thing. I actually re-implemented this using dbus_connection_send_with_reply_and_block() just to be certain that wasn't the issue. I got the exact same bad behavior. The only difference between our method and the reply_and_block function is the use of "dbus_connection_flush(connection);". We can move to using dbus_connection_send_with_reply_and_block(), but it has no bearing on this issue that I can see.

-- Vossel

In summary, bad:

    dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
    dbus_pending_call_block(pending);
    dbus_pending_call_steal_reply(pending);

good:

    dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
    dbus_pending_call_get_completed(pending);
    dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);

(In reply to David Vossel from comment #4)
> Steps to Reproduce:
> 1. Within the pacemaker source tree. Apply the reproducer patch attached to
> this issue.

I don't see a patch?
Or is it somewhere else?

(In reply to Andrew Beekhof from comment #10)
> In summary, bad:
>
> dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> dbus_pending_call_block(pending);
> dbus_pending_call_steal_reply(pending);

That's pretty much the normal dbus_connection_send_with_reply()...

> good:
>
> dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> dbus_pending_call_get_completed(pending);

I don't understand; _get_completed() just returns a boolean which you're ignoring?

> dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);

Yes, set_notify will make the operation async - which may be what you want anyway.

(In reply to Colin Walters from comment #11)
> (In reply to David Vossel from comment #4)
>
> > Steps to Reproduce:
> > 1. Within the pacemaker source tree. Apply the reproducer patch attached to
> > this issue.
>
> I don't see a patch? Or is it somewhere else?
>
> (In reply to Andrew Beekhof from comment #10)
>
> > In summary, bad:
> >
> > dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> > dbus_pending_call_block(pending);
> > dbus_pending_call_steal_reply(pending);
>
> That's pretty much the normal dbus_connection_send_with_reply()...
>
> > good:
> >
> > dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> > dbus_pending_call_get_completed(pending);
>
> I don't understand; _get_completed() just returns a boolean which you're
> ignoring?
>
> > dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);
>
> Yes, set_notify will make the operation async - which may be what you want
> anyway.

I'll clarify Andrew's previous comment. He was just saying that the problem lies in the synchronous part of the dbus API, and that processing the request/reply asynchronously appears to work. There is no patch for comment #10.
He was experimenting with the dbus API and wanted to share his observations. He's just listing the set of functions that appear to work asynchronously, and the functions that lead to the broken synchronous behavior we're seeing.

Let me know if there's anything I can do to help this move along. The reproducer is 100% consistent in displaying the problem we're encountering.

-- David

(In reply to David Vossel from comment #12)
> Let me know if there's anything I can do to help this move along. The
> reproducer is 100% consistent in displaying the problem we're
> encountering.

Where's the reproducer? I only see one log file attachment; I can't find "reproducer_v1.patch".

(In reply to Colin Walters from comment #13)
> (In reply to David Vossel from comment #12)
>
> > Let me know if there's anything I can do to help this move along. The
> > reproducer is 100% consistent in displaying the problem we're
> > encountering.
>
> Where's the reproducer? I only see one log file attachment; I can't find
> "reproducer_v1.patch".

oops, it should have been attached. I'll upload it real quick. :)

Created attachment 939370 [details]
reproducer v1

You can also see this patch here: http://paste.fedoraproject.org/132382/10345607/

The patch as it is depends on you having the abrtd unit file present on the system. If you don't have abrtd, you can change line 41 of the patch to list any unit file name and the issue will occur. I did a swap of s/abrtd/dbus/ and had the dbus API query systemd about the dbus daemon when I reproduced it.
-- Vossel

The example program appears to recurse infinitely, and presumably errors start when we blow the stack:

    ---Type <return> to continue, or q <return> to quit---
    #20547 0x00007ffff797b30d in operation_finalize (op=op@entry=0x6153d0) at services_linux.c:253
    #20548 0x00007ffff7980b96 in systemd_unit_exec (op=0x6153d0, synchronous=synchronous@entry=0) at systemd.c:531
    #20549 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20550 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20551 0x00007ffff797b30d in operation_finalize (op=op@entry=0x6150e0) at services_linux.c:253
    #20552 0x00007ffff7980b96 in systemd_unit_exec (op=0x6150e0, synchronous=synchronous@entry=0) at systemd.c:531
    #20553 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20554 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20555 0x00007ffff797b30d in operation_finalize (op=op@entry=0x613bf0) at services_linux.c:253
    #20556 0x00007ffff7980b96 in systemd_unit_exec (op=0x613bf0, synchronous=synchronous@entry=0) at systemd.c:531
    #20557 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20558 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20559 0x00007ffff797b30d in operation_finalize (op=op@entry=0x613940) at services_linux.c:253
    #20560 0x00007ffff7980b96 in systemd_unit_exec (op=0x613940, synchronous=synchronous@entry=0) at systemd.c:531
    #20561 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20562 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20563 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fff0) at services_linux.c:253
    #20564 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fff0, synchronous=synchronous@entry=0) at systemd.c:531
    #20565 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20566 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20567 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fd80) at services_linux.c:253
    #20568 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fd80, synchronous=synchronous@entry=0) at systemd.c:531
    #20569 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20570 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20571 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fc60) at services_linux.c:253
    #20572 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fc60, synchronous=synchronous@entry=0) at systemd.c:531
    #20573 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20574 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20575 0x00007ffff797b30d in operation_finalize (op=op@entry=0x609ad0) at services_linux.c:253
    #20576 0x00007ffff7980b96 in systemd_unit_exec (op=0x609ad0, synchronous=synchronous@entry=0) at systemd.c:531
    #20577 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20578 0x0000000000400cf3 in action_callback (op=<optimized out>, op@entry=0x0) at scratch.c:55
    #20579 0x0000000000400b30 in main (argc=1, argv=0x7fffffffde78) at scratch.c:64

So...reading some of the pacemaker code, I know we can't rearchitect it immediately, but:
Why does lib/common/mainloop.c exist? If you have a hard dependency on GLib, why not just have the APIs call it directly?
Why this comment?
/* Do NOT use g_child_watch_add() and friends, they rely on pthreads */
Sure but...that's clearly better than polling every 5 seconds per process.
On file descriptor stuff, you should check the GLib version and use g_unix_fd_add_full() if GLib is new enough. This API was added as part of the infrastructure to allow GLib to use epoll() which would be more efficient.
Anyways, the mainloop stuff matters because its integration with libdbus is quite critical. You could look at dbus-glib's code for this if you haven't.
Then if you want much better DBus bindings with very flexible sync/async APIs plus thread safety, there's GDBus in libgio, which will be available starting with RHEL 6.6 as well, although it was just backported for Evolution.
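The callback style in the "good" sequence discussed above (dbus_pending_call_set_notify) can be illustrated without any dbus dependency at all. Below is a minimal schematic of the completion pattern; every name in it is a hypothetical stand-in, not a real libdbus type:

```c
#include <assert.h>

/* Schematic model of callback-based reply completion, mirroring the shape
 * (not the implementation) of dbus_pending_call_set_notify().  All names
 * here are hypothetical stand-ins for the real libdbus types. */
typedef struct pending_call {
    int completed;                                /* has a reply arrived? */
    int reply;                                    /* reply payload once it has */
    void (*notify)(struct pending_call *, void *);
    void *user_data;
} pending_call;

/* Register a completion callback instead of blocking for the reply. */
static void pending_set_notify(pending_call *p,
                               void (*fn)(pending_call *, void *),
                               void *user_data)
{
    p->notify = fn;
    p->user_data = user_data;
    if (p->completed && p->notify)      /* reply may already have arrived */
        p->notify(p, p->user_data);
}

/* Called by the dispatch loop when this call's reply comes off the wire. */
static void pending_complete(pending_call *p, int reply)
{
    p->completed = 1;
    p->reply = reply;
    if (p->notify)
        p->notify(p, p->user_data);
}

/* Example notify callback: hand the reply back through user_data. */
static void on_reply(pending_call *p, void *user_data)
{
    *(int *)user_data = p->reply;
}
```

In this model, the "bad" sequence corresponds to spinning until `completed` is set; if the dispatching that would set it never runs, the caller stalls until the timeout fires. The notify style instead lets the main loop keep dispatching, which is why it sidesteps the stall.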
(In reply to Colin Walters from comment #17)
> Why does lib/common/mainloop.c exist? If you have a hard dependency on
> GLib, why not just have the APIs call it directly?

Unless I'm mistaken about the intent of your question, the answer is that most of the constructs do not exist in GLib, or rely on threads to function. There is no analog for mainloop_setup_trigger(), and GLib's mainloop doesn't know how to deal with libqb IPC connections, to name two. Also, it's more efficient than typing out the bits that make the API useful in every place we need it.

> Why this comment?
>
> /* Do NOT use g_child_watch_add() and friends, they rely on pthreads */
>
> Sure but...that's clearly better than polling every 5 seconds per process.

In an environment that uses fork(), which the lrmd does extensively, threads are infinitely worse than polling.

> On file descriptor stuff, you should check the GLib version and use
> g_unix_fd_add_full() if GLib is new enough. This API was added as part of
> the infrastructure to allow GLib to use epoll(), which would be more
> efficient.

Ah, good to know. We might be able to take advantage of that for RHEL 7.

> Anyways, the mainloop stuff matters because its integration with libdbus is
> quite critical.

Surely we should be able to use libdbus without it though?

> You could look at dbus-glib's code for this if you haven't.

I recall digging into that code quite deeply at one point. Any part in particular I missed?

> Then if you want much better DBus bindings with very flexible sync/async
> APIs plus thread safety, there's GDBus in libgio, which will be available
> starting with RHEL 6.6 as well, although it was just backported for Evolution.

Those bindings require threads, which are a non-starter in pacemaker. We used to use that API; it caused havoc.
See bug #1057697.

(In reply to Colin Walters from comment #16)
> The example program appears to recurse infinitely, and presumably errors
> start when we blow the stack:

My apologies, that was indeed sloppy of me for not checking this. It's not the cause of the bug though. I've improved the example so that it no longer recurses, and the result is the same.

New reproducer: http://ur1.ca/i7yv0

The stack should look more like this now (note that the number of calls hasn't changed):

    Reading symbols from /home/Development/sources/pacemaker/rhel-branches/.libs/lt-scratch...done.
    (gdb) break scratch.c:47
    Breakpoint 1 at 0x400de4: file scratch.c, line 47.
    (gdb) run
    Starting program: /home/Development/sources/pacemaker/rhel-branches/.libs/lt-scratch
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    ..........................
    25408 successful calls: 1 0
    Breakpoint 1, action_callback (op=0x61b540) at scratch.c:47
    47          exit(1);
    Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.9-1.fc20.x86_64 krb5-libs-1.11.5-4.fc20.x86_64 libcom_err-1.42.8-3.fc20.x86_64 libffi-3.0.13-5.fc20.x86_64 libqb-0.17.0-1.fc20.x86_64 libtasn1-3.3-2.fc20.x86_64 libxslt-1.1.28-5.fc20.x86_64 openssl-libs-1.0.1e-37.fc20.1.x86_64 trousers-0.3.11.2-3.fc20.x86_64
    (gdb) where
    #0  action_callback (op=0x61b540) at scratch.c:47
    #1  0x00007ffff7bb15fd in operation_finalize (op=op@entry=0x61b540) at services_linux.c:246
    #2  0x00007ffff7bb51ee in systemd_unit_exec (op=0x61b540, synchronous=synchronous@entry=0) at systemd.c:528
    #3  0x00007ffff7bb0508 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400d20 <action_callback>) at services.c:410
    #4  0x0000000000400e50 in do_action (user_data=<error reading variable: value has been optimized out>) at scratch.c:59
    #5  0x00007ffff7de5f53 in crm_trigger_dispatch (source=source@entry=0x60d2b0, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:105
    #6  0x0000003cbe0492a6 in g_main_dispatch (context=0x604300) at gmain.c:3066
    #7  g_main_context_dispatch (context=context@entry=0x604300) at gmain.c:3642
    #8  0x0000003cbe049628 in g_main_context_iterate (context=0x604300, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3713
    #9  0x0000003cbe049a3a in g_main_loop_run (loop=0x604c30) at gmain.c:3907
    #10 0x0000000000400c11 in main (argc=1, argv=0x7fffffffd7e8) at scratch.c:72

I rebuilt dbus with --enable-verbose-mode and captured a trace with DBUS_VERBOSE=1, which is 2 gigabytes, so I'm not going to post it here. However, something jumps out at me:
When the symptom occurs, we end up in a tight loop calling poll() repeatedly:
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
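That pattern is what level-triggered poll() produces when a readable descriptor is never drained: until someone actually read()s the data, every poll() immediately reports POLLIN again. A standalone illustration using an ordinary pipe (nothing dbus-specific; the function name is my own):

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Poll a readable fd 'rounds' times WITHOUT draining it, counting how many
 * polls report it ready.  With level-triggered poll(), an undrained fd is
 * reported ready every single time - the tight loop seen in the strace. */
static int count_immediate_ready(int fd, int rounds)
{
    int ready = 0;
    for (int i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            ready++;              /* still readable: poll returns at once */
    }
    return ready;
}
```

Write one byte into a pipe and every poll on the read end comes back ready instantly; drain the byte and the loop goes quiet. That matches the strace above, where fd 4 is repeatedly reported readable but never consumed.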
Digging into this a bit more, it has something to do with exceeding transport->max_live_messages_size:
$5 = {refcount = 25254, size_value = 66061848, unix_fd_value = 0, notify_size_guard_value = 66060288, notify_unix_fd_guard_value = 4096, notify_function = 0x7ffff773c810 <live_messages_notify>,
notify_data = 0x60dc80, notify_pending = 1}
(gdb) p transport->live_messages_s
There is no member named live_messages_s.
(gdb) p transport->live_messages
Display all 34320 possibilities? (y or n)
(gdb) p transport->max_live_messages_
There is no member named max_live_messages_.
(gdb) p transport->max_live_messages_size
$6 = 66060288
If you look at dbus-transport-socket.c:check_read_watch(), what's happening is we're trying to disable the read watch, under the theory that we need to drain the outgoing queue, but...
(still investigating)
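Why would a refused read turn into a spin rather than a clean stall? The disable only takes effect if the client's main-loop glue honors the watch toggle, which (as noted later in the thread) pacemaker did not implement at the time. A schematic model of that handshake, with all names as hypothetical stand-ins for the DBusWatch "toggled" machinery installed via dbus_connection_set_watch_functions():

```c
#include <assert.h>

/* Schematic model of the DBusWatch enable/disable handshake.  Hypothetical
 * names: the transport can ask for its read watch to be turned off, but
 * that only takes effect if the client's main-loop glue implements the
 * toggled callback. */
typedef struct {
    int transport_wants_read;   /* what the transport asked for */
    int loop_polls_fd;          /* whether the main loop still polls the fd */
    int implements_toggled;     /* did the client provide a toggled callback? */
} watch_model;

/* Transport side: request that reading be enabled or disabled. */
static void transport_toggle_watch(watch_model *w, int enabled)
{
    w->transport_wants_read = enabled;
    if (w->implements_toggled)
        w->loop_polls_fd = enabled;  /* toggled callback updates the loop */
    /* Otherwise the loop keeps polling an fd the transport refuses to
     * read from: poll() reports POLLIN forever and the process spins. */
}
```

With the toggle honored, disabling the watch stops the polling; without it, the transport and the main loop disagree, and poll() spins on a forever-readable fd.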
Created attachment 940204 [details]
[PATCH] dbus: Unref reply when we're done
This patch appears to help.
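Why an unref fixes timeouts rather than a plain memory leak: libdbus charges each incoming message the application still references against a per-connection quota (the max_live_messages_size seen in the gdb session above), and once the quota is exceeded it stops reading the socket, so later replies are never dispatched and blocking callers see NoReply. A deliberately simplified, self-contained model of that accounting (hypothetical names; the real bookkeeping is a counter inside the transport):

```c
#include <assert.h>

/* Deliberately simplified model of libdbus's live-incoming-message quota.
 * Hypothetical names: 'live_size' stands for the bytes of messages the
 * application still holds references to, 'max_live_size' for the quota. */
typedef struct {
    long live_size;      /* bytes of incoming messages the app still holds */
    long max_live_size;  /* quota; roughly 63 MiB by default */
} transport_model;

/* The transport only keeps reading the socket while under quota. */
static int reads_enabled(const transport_model *t)
{
    return t->live_size < t->max_live_size;
}

/* App takes ownership of a reply (as dbus_pending_call_steal_reply does). */
static void steal_reply(transport_model *t, long msg_size)
{
    t->live_size += msg_size;
}

/* App releases the reply (as dbus_message_unref does) - the essence of the
 * fix.  Skipping this makes live_size grow until reads stop and every
 * later call times out with NoReply. */
static void unref_reply(transport_model *t, long msg_size)
{
    t->live_size -= msg_size;
}
```

A caller that steals replies without ever unref'ing crosses the quota after enough calls and reads stop, matching the "works for a while, then NoReply forever" symptom; unref'ing after processing keeps live_size bounded. Note the un-released messages remain referenced, so valgrind sees no leak, consistent with the observation later in the thread.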
(BTW, Red Hat Bugzilla corrupts git patches - direct link to the raw patch, which is against pacemaker git master: https://bugzilla.redhat.com/attachment.cgi?id=940204)

(In reply to Andrew Beekhof from comment #18)
> Those bindings require threads which are a non-starter in pacemaker.

As a GLib upstream maintainer (and sometimes an upstream libdbus contributor), I can say that we're thread oriented. Unfortunately, fork and threads are effectively mutually exclusive, and given that one must be picked, and threads are more flexible, and also a model portable to Windows, that's going to be how the API is designed and the internals are tested.

I think you're going to be carrying a substantial maintenance burden trying to use libraries which are designed for thread safety, but not fork safety. (libdbus is unfortunately only partially thread safe, and doesn't even try to call pthread_atfork() to reset state at the moment.)

(In reply to Colin Walters from comment #23)
> (In reply to Andrew Beekhof from comment #18)
>
> > Those bindings require threads which are a non-starter in pacemaker.
>
> As a GLib upstream maintainer (and sometimes an upstream libdbus
> contributor), I can say that we're thread oriented. Unfortunately, fork and
> threads are effectively mutually exclusive, and given that one must be
> picked, and threads are more flexible, and also a model portable to Windows,
> that's going to be how the API is designed and the internals are tested.
>
> I think you're going to be carrying a substantial maintenance burden trying
> to use libraries which are designed for thread safety, but not fork safety.
> (libdbus is unfortunately only partially thread safe, and doesn't even try
> to call pthread_atfork() to reset state at the moment.)

If libdbus is neither fork() nor thread safe, then we have a support burden either way. It's also the only option for us, since the use of fork() is fundamental to our existence - imagine trying to implement systemd without it... same thing. We need to use it for every resource type except systemd. So it's actually systemd/dbus that is the odd one out in our corner of the world.

(In reply to Colin Walters from comment #21)
> Created attachment 940204 [details]
> [PATCH] dbus: Unref reply when we're done
>
> This patch appears to help.

Very much so. This is great!

Could you explain why the patch works though? I'd have expected a different kind of failure for a memory leak bug - i.e. the process size growing inordinately large and the process being killed, rather than operations timing out. We did consider a memory leak, but curiously valgrind didn't report anything when David ran it there.

(Taking the bug back since it's clearly an issue on our side.)

(In reply to Andrew Beekhof from comment #25)
> Could you explain why the patch works though?

I briefly mentioned this in my commit message: basically, libdbus thinks the incoming queue is stalled and tries to stop reading, but fails at that due to a different bug, in that you're not implementing the _toggle API of DBusWatch.

I've implemented the watch toggle callback now too, so the effect won't be silent if we see this again for any reason in the future.

I have verified that there are no more dbus NoReply errors with pacemaker-1.1.12-13.el7.x86_64 after a systemd resource with a monitor interval of 1s has been running continuously for three days.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html
We need to use it for every resource type except systemd. So it's actually Systemd/dbus that is the odd one out in our corner of world. (In reply to Colin Walters from comment #21) > Created attachment 940204 [details] > [PATCH] dbus: Unref reply when we're done > > This patch appears to help. Very much so. This is great! Could you explain why the patch works though? I'd have expected a different kind of failure for a memory leak bug - ie. the process size to grow inordinately large and be killed rather than operations timing out. We did consider a memory leak, but curiously valgrind didn't report anything when David ran it there. (Taking the bug back since its clearly an issue on our side) (In reply to Andrew Beekhof from comment #25) s is great! > > Could you explain why the patch works though? I briefly mentioned this in my commit message, basically libdbus thinks the incoming queue is stalled and tries to stop reading (but fails at that due to a different bug in that you're not implementing the _toggle API of DBusWatch). I've implemented the watch toggle callback now too. So the effect wont be silent if we see this again for any reason in the future. I have verified that with there are no more dbus noreply errors with pacemaker-1.1.12-13.el7.x86_64 after systemd resource with monitor interval 1s has been running continuously for three days. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0440.html |