Bug 1139701
| Summary: | systemd resources operations timeout and get unmanaged | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jeff Dexter <jdexter> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.0 | CC: | abeekhof, benglish, cluster-maint, dmaley, dvossel, fdinitto, jherrman, jruemker, lyarwood, mjuricek, mnovacek, qguo, vanhoof, walters |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.12-4.el6 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, the dbus client API could become unable to process responses. This caused systemd resource management to fail and the managed resources to become unresponsive. With this update, the dbus connection to systemd remains stable and no longer leads to unresponsive resources. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1145740 (view as bug list) | Environment: | |
| Last Closed: | 2015-03-05 10:00:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1064025, 1145740 | | |
| Attachments: | | | |
Description
Jeff Dexter
2014-09-09 13:23:52 UTC
Created attachment 935701 [details]
Pacemaker log files; first signs of error at line 395
Need logs (preferably in a crm_report archive) that demonstrate the error in the description.

We have a reproducer for this issue. My analysis is that this issue exists in the dbus client API. The problem appears to be that after a period of time, the dbus client API is no longer able to process responses.

Steps to Reproduce:
1. Within the pacemaker source tree, apply the reproducer patch attached to this issue: patch -p1 < reproducer_v1.patch
2. ./autogen.sh && ./configure && make core scratch
3. ./scratch

From there the 'scratch' application will issue a continuous stream of dbus requests to systemd to retrieve info on a unit file. Within a minute the scratch application will hang and begin displaying errors indicating 'NoReply' was received for a request.

Analysis of 'NoReply' errors:

Using dbus-monitor and some pattern matching, I can watch the requests and responses while I run the reproducer. Once the reproducer starts to hang, I can see that every request the reproducer has made has a matching response. Eventually the reproducer reports this error:

    info: systemd_unit_by_name: Call to GetUnit failed: org.clusterlabs.pacemaker.NoReply

From the dbus logs, I can clearly see that the last GetUnit request has a reply though. After that error, the reproducer attempts to reload systemd using a dbus command. I can see the Reload command issued in the dbus logs, and I can see the systemd response. Oddly enough, the reproducer does not hang on this command even though the previous one, GetUnit, timed out. Tracing through the code, I can see this is because we don't process the response of the Reload command. After the reload we attempt to do a LoadUnit command, and once again, when we actually attempt to wait for the response, we get a 'NoReply' error. This indicates that the dbus connection still works even after the 'NoReply' error is encountered. Pacemaker uses the dbus client API directly.
From what I've gathered, this points to a bug in the client API that prevents us from being able to retrieve responses after a period of time. Here's the usage of the dbus client API that is in question: https://github.com/ClusterLabs/pacemaker/blob/master/lib/services/dbus.c#L123

From what I gather, dbus_pending_call_block(pending) never gets a response and times out. When we call 'reply = dbus_pending_call_steal_reply(pending)', the resulting dbus_message_get_type(reply) returns DBUS_MESSAGE_TYPE_ERROR. The error string then indicates that the 'NoReply' error occurred.

Below are the relevant dbus-monitor logs showing that systemd does in fact reply:

    # dbus-monitor --system "interface=org.freedesktop.DBus.Properties" "interface=org.freedesktop.systemd1.Manager" "type=method_return" | grep -e "serial=.*member=GetUnit" -e "serial=.*member=Reload" -e "serial=.*member=LoadUnit" -e "serial=.*member=GetAll" -e "reply_serial="
    . . .
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53974 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=GetUnit
    method return sender=:1.2 -> dest=:1.150 reply_serial=53974
    <this is where the reproducer hangs>
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53975 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=Reload
    method return sender=:1.2 -> dest=:1.150 reply_serial=53975
    method call sender=:1.150 -> dest=org.freedesktop.systemd1 serial=53976 path=/org/freedesktop/systemd1; interface=org.freedesktop.systemd1.Manager; member=LoadUnit
    method return sender=:1.2 -> dest=:1.150 reply_serial=53976

-- Vossel

I'm assigning this to the dbus team. We need to track down why the dbus client API stops processing replies. Let me know if there is anything I can do to help.

-- Vossel

Is the systemd RPM itself being updated on the host?
I think you might see this issue if you have in-flight requests, as (to my knowledge) it doesn't serialize its dbus pending queue.

(In reply to Colin Walters from comment #6)
> Is the systemd RPM itself being updated on the host? I think you might see
> this issue if you have in-flight requests, as (to my knowledge) it doesn't
> serialize its dbus pending queue.

No, the system is stable and only running the test. No other activities.

Ok. From a quick look at that code, it looks like you're reimplementing dbus_connection_send_with_reply_and_block()? Why?

(In reply to Colin Walters from comment #8)
> Ok.
>
> From a quick look at that code, it looks like you're reimplementing
> dbus_connection_send_with_reply_and_block()? Why?

I noticed the same thing. I actually re-implemented this using dbus_connection_send_with_reply_and_block() just to be certain that wasn't the issue. I got the exact same bad behavior. The only difference between our method and the reply_and_block function is the use of "dbus_connection_flush(connection);". We can move to using dbus_connection_send_with_reply_and_block(), but it has no bearing on this issue that I can see.

-- Vossel

In summary, bad:

    dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
    dbus_pending_call_block(pending);
    dbus_pending_call_steal_reply(pending);

good:

    dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
    dbus_pending_call_get_completed(pending);
    dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);

(In reply to David Vossel from comment #4)
> Steps to Reproduce:
> 1. Within the pacemaker source tree. Apply the reproducer patch attached to
> this issue.

I don't see a patch?
Or is it somewhere else?

(In reply to Andrew Beekhof from comment #10)
> In summary, bad:
>
> dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> dbus_pending_call_block(pending);
> dbus_pending_call_steal_reply(pending);

That's pretty much the normal dbus_connection_send_with_reply()...

> good:
>
> dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> dbus_pending_call_get_completed(pending);

I don't understand; _get_completed() just returns a boolean which you're ignoring?

> dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);

Yes, set_notify will make the operation async - which may be what you want anyway.

(In reply to Colin Walters from comment #11)
> (In reply to David Vossel from comment #4)
>
> > Steps to Reproduce:
> > 1. Within the pacemaker source tree. Apply the reproducer patch attached to
> > this issue.
>
> I don't see a patch? Or is it somewhere else?
>
> (In reply to Andrew Beekhof from comment #10)
>
> > In summary, bad:
> >
> > dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> > dbus_pending_call_block(pending);
> > dbus_pending_call_steal_reply(pending);
>
> That's pretty much the normal dbus_connection_send_with_reply()...
>
> > good:
> >
> > dbus_connection_send_with_reply(connection, msg, &pending, DBUS_TIMEOUT_USE_DEFAULT);
> > dbus_pending_call_get_completed(pending);
>
> I don't understand; _get_completed() just returns a boolean which you're
> ignoring?
>
> > dbus_pending_call_set_notify(pending, our_callback_fn, user_data, NULL);
>
> Yes, set_notify will make the operation async - which may be what you want
> anyway.

I'll clarify Andrew's previous comment. He was just saying that the problem lies in the synchronous part of the dbus API, and that processing the request/reply asynchronously appears to work. There is no patch for comment #10.
He was experimenting with the dbus API and wanted to share his observations. He's just listing the set of functions that appear to work asynchronously, and the functions that lead to the broken synchronous behavior we're seeing.

Let me know if there's anything I can do to help this move along. The reproducer is 100% consistent in displaying the problem we're encountering.

-- David

(In reply to David Vossel from comment #12)
> Let me know if there's anything I can do to help this move along. The
> reproducer is 100% consistent in displaying the problem we're
> encountering.

Where's the reproducer? I only see one log file attachment; I can't find "reproducer_v1.patch".

(In reply to Colin Walters from comment #13)
> (In reply to David Vossel from comment #12)
>
> > Let me know if there's anything I can do to help this move along. The
> > reproducer is 100% consistent in displaying the problem we're
> > encountering.
>
> Where's the reproducer? I only see one log file attachment; I can't find
> "reproducer_v1.patch".

oops, it should have been attached. I'll upload it real quick. :)

Created attachment 939370 [details]
reproducer v1

You can also see this patch here: http://paste.fedoraproject.org/132382/10345607/

The patch as it is depends on you having the abrtd unit file present on the system. If you don't have abrtd, you can change line 41 of the patch to list any unit file name and the issue will occur. I did a swap of s/abrtd/dbus/ and had the dbus API query systemd about the dbus daemon when I reproduced it.
-- Vossel

The example program appears to recurse infinitely, and presumably errors start when we blow the stack:

    ---Type <return> to continue, or q <return> to quit---
    #20547 0x00007ffff797b30d in operation_finalize (op=op@entry=0x6153d0) at services_linux.c:253
    #20548 0x00007ffff7980b96 in systemd_unit_exec (op=0x6153d0, synchronous=synchronous@entry=0) at systemd.c:531
    #20549 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20550 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20551 0x00007ffff797b30d in operation_finalize (op=op@entry=0x6150e0) at services_linux.c:253
    #20552 0x00007ffff7980b96 in systemd_unit_exec (op=0x6150e0, synchronous=synchronous@entry=0) at systemd.c:531
    #20553 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20554 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20555 0x00007ffff797b30d in operation_finalize (op=op@entry=0x613bf0) at services_linux.c:253
    #20556 0x00007ffff7980b96 in systemd_unit_exec (op=0x613bf0, synchronous=synchronous@entry=0) at systemd.c:531
    #20557 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20558 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20559 0x00007ffff797b30d in operation_finalize (op=op@entry=0x613940) at services_linux.c:253
    #20560 0x00007ffff7980b96 in systemd_unit_exec (op=0x613940, synchronous=synchronous@entry=0) at systemd.c:531
    #20561 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20562 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20563 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fff0) at services_linux.c:253
    #20564 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fff0, synchronous=synchronous@entry=0) at systemd.c:531
    #20565 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20566 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20567 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fd80) at services_linux.c:253
    #20568 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fd80, synchronous=synchronous@entry=0) at systemd.c:531
    #20569 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20570 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20571 0x00007ffff797b30d in operation_finalize (op=op@entry=0x60fc60) at services_linux.c:253
    #20572 0x00007ffff7980b96 in systemd_unit_exec (op=0x60fc60, synchronous=synchronous@entry=0) at systemd.c:531
    #20573 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20574 0x0000000000400cf3 in action_callback (op=<optimized out>) at scratch.c:55
    #20575 0x00007ffff797b30d in operation_finalize (op=op@entry=0x609ad0) at services_linux.c:253
    #20576 0x00007ffff7980b96 in systemd_unit_exec (op=0x609ad0, synchronous=synchronous@entry=0) at systemd.c:531
    #20577 0x00007ffff797a072 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400c50 <action_callback>) at services.c:494
    #20578 0x0000000000400cf3 in action_callback (op=<optimized out>, op@entry=0x0) at scratch.c:55
    #20579 0x0000000000400b30 in main (argc=1, argv=0x7fffffffde78) at scratch.c:64

So...reading some of the pacemaker code, I know we can't rearchitect it immediately, but:
Why does lib/common/mainloop.c exist? If you have a hard dependency on GLib, why not just have the APIs call it directly?
Why this comment?
/* Do NOT use g_child_watch_add() and friends, they rely on pthreads */
Sure but...that's clearly better than polling every 5 seconds per process.
On file descriptor stuff, you should check the GLib version and use g_unix_fd_add_full() if GLib is new enough. This API was added as part of the infrastructure to allow GLib to use epoll() which would be more efficient.
Anyways, the mainloop stuff matters because its integration with libdbus is quite critical. You could look at dbus-glib's code for this if you haven't.
Then if you want much better DBus bindings with very flexible sync/async APIs plus thread safety, there's GDBus in libgio, which will be available starting with RHEL 6.6 as well, although it was just backported for Evolution.
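The callback style in the "good" sequence discussed above (dbus_pending_call_set_notify) can be illustrated without any dbus dependency at all. Below is a minimal schematic of the completion pattern; every name in it is a hypothetical stand-in, not a real libdbus type:

```c
#include <assert.h>

/* Schematic model of callback-based reply completion, mirroring the shape
 * (not the implementation) of dbus_pending_call_set_notify().  All names
 * here are hypothetical stand-ins for the real libdbus types. */
typedef struct pending_call {
    int completed;                                /* has a reply arrived? */
    int reply;                                    /* reply payload once it has */
    void (*notify)(struct pending_call *, void *);
    void *user_data;
} pending_call;

/* Register a completion callback instead of blocking for the reply. */
static void pending_set_notify(pending_call *p,
                               void (*fn)(pending_call *, void *),
                               void *user_data)
{
    p->notify = fn;
    p->user_data = user_data;
    if (p->completed && p->notify)      /* reply may already have arrived */
        p->notify(p, p->user_data);
}

/* Called by the dispatch loop when this call's reply comes off the wire. */
static void pending_complete(pending_call *p, int reply)
{
    p->completed = 1;
    p->reply = reply;
    if (p->notify)
        p->notify(p, p->user_data);
}

/* Example notify callback: hand the reply back through user_data. */
static void on_reply(pending_call *p, void *user_data)
{
    *(int *)user_data = p->reply;
}
```

In this model, the "bad" sequence corresponds to spinning until `completed` is set; if the dispatching that would set it never runs, the caller stalls until the timeout fires. The notify style instead lets the main loop keep dispatching, which is why it sidesteps the stall.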
(In reply to Colin Walters from comment #17)
> Why does lib/common/mainloop.c exist? If you have a hard dependency on
> GLib, why not just have the APIs call it directly?

Unless I'm mistaken about the intent of your question, the answer is that most of the constructs do not exist in GLib, or rely on threads to function. There is no analog for mainloop_setup_trigger(), and GLib's mainloop doesn't know how to deal with libqb IPC connections, to name two. Also, it's more efficient than typing out the bits that make the API useful in every place we need it.

> Why this comment?
>
> /* Do NOT use g_child_watch_add() and friends, they rely on pthreads */
>
> Sure but...that's clearly better than polling every 5 seconds per process.

In an environment that uses fork(), which the lrmd does extensively, threads are infinitely worse than polling.

> On file descriptor stuff, you should check the GLib version and use
> g_unix_fd_add_full() if GLib is new enough. This API was added as part of
> the infrastructure to allow GLib to use epoll(), which would be more
> efficient.

Ah, good to know. We might be able to take advantage of that for RHEL 7.

> Anyways, the mainloop stuff matters because its integration with libdbus is
> quite critical.

Surely we should be able to use libdbus without it though?

> You could look at dbus-glib's code for this if you haven't.

I recall digging into that code quite deeply at one point. Any part in particular I missed?

> Then if you want much better DBus bindings with very flexible sync/async
> APIs plus thread safety, there's GDBus in libgio, which will be available
> starting with RHEL 6.6 as well, although it was just backported for Evolution.

Those bindings require threads, which are a non-starter in pacemaker. We used to use that API; it caused havoc.
See bug #1057697.

(In reply to Colin Walters from comment #16)
> The example program appears to recurse infinitely, and presumably errors
> start when we blow the stack:

My apologies, that was indeed sloppy of me for not checking this. It's not the cause of the bug though. I've improved the example so that it no longer recurses, and the result is the same.

New reproducer: http://ur1.ca/i7yv0

The stack should look more like this now (note that the number of calls hasn't changed):

    Reading symbols from /home/Development/sources/pacemaker/rhel-branches/.libs/lt-scratch...done.
    (gdb) break scratch.c:47
    Breakpoint 1 at 0x400de4: file scratch.c, line 47.
    (gdb) run
    Starting program: /home/Development/sources/pacemaker/rhel-branches/.libs/lt-scratch
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    ..........................
    25408 successful calls: 1 0
    Breakpoint 1, action_callback (op=0x61b540) at scratch.c:47
    47          exit(1);
    Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.9-1.fc20.x86_64 krb5-libs-1.11.5-4.fc20.x86_64 libcom_err-1.42.8-3.fc20.x86_64 libffi-3.0.13-5.fc20.x86_64 libqb-0.17.0-1.fc20.x86_64 libtasn1-3.3-2.fc20.x86_64 libxslt-1.1.28-5.fc20.x86_64 openssl-libs-1.0.1e-37.fc20.1.x86_64 trousers-0.3.11.2-3.fc20.x86_64
    (gdb) where
    #0  action_callback (op=0x61b540) at scratch.c:47
    #1  0x00007ffff7bb15fd in operation_finalize (op=op@entry=0x61b540) at services_linux.c:246
    #2  0x00007ffff7bb51ee in systemd_unit_exec (op=0x61b540, synchronous=synchronous@entry=0) at systemd.c:528
    #3  0x00007ffff7bb0508 in services_action_async (op=<optimized out>, action_callback=action_callback@entry=0x400d20 <action_callback>) at services.c:410
    #4  0x0000000000400e50 in do_action (user_data=<error reading variable: value has been optimized out>) at scratch.c:59
    #5  0x00007ffff7de5f53 in crm_trigger_dispatch (source=source@entry=0x60d2b0, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:105
    #6  0x0000003cbe0492a6 in g_main_dispatch (context=0x604300) at gmain.c:3066
    #7  g_main_context_dispatch (context=context@entry=0x604300) at gmain.c:3642
    #8  0x0000003cbe049628 in g_main_context_iterate (context=0x604300, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3713
    #9  0x0000003cbe049a3a in g_main_loop_run (loop=0x604c30) at gmain.c:3907
    #10 0x0000000000400c11 in main (argc=1, argv=0x7fffffffd7e8) at scratch.c:72

I rebuilt dbus with --enable-verbose-mode and captured a trace with DBUS_VERBOSE=1, which is 2 gigabytes, so I'm not going to post it here. However, something jumps out at me:
When the symptom occurs, we end up in a tight loop calling poll() repeatedly:
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
11404 poll([{fd=4, events=POLLIN}], 1, 18216) = 1 ([{fd=4, revents=POLLIN}])
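That pattern is what level-triggered poll() produces when a readable descriptor is never drained: until someone actually read()s the data, every poll() immediately reports POLLIN again. A standalone illustration using an ordinary pipe (nothing dbus-specific; the function name is my own):

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Poll a readable fd 'rounds' times WITHOUT draining it, counting how many
 * polls report it ready.  With level-triggered poll(), an undrained fd is
 * reported ready every single time - the tight loop seen in the strace. */
static int count_immediate_ready(int fd, int rounds)
{
    int ready = 0;
    for (int i = 0; i < rounds; i++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            ready++;              /* still readable: poll returns at once */
    }
    return ready;
}
```

Write one byte into a pipe and every poll on the read end comes back ready instantly; drain the byte and the loop goes quiet. That matches the strace above, where fd 4 is repeatedly reported readable but never consumed.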
Digging into this a bit more, it has something to do with exceeding transport->max_live_messages_size:
$5 = {refcount = 25254, size_value = 66061848, unix_fd_value = 0, notify_size_guard_value = 66060288, notify_unix_fd_guard_value = 4096, notify_function = 0x7ffff773c810 <live_messages_notify>,
notify_data = 0x60dc80, notify_pending = 1}
(gdb) p transport->live_messages_s
There is no member named live_messages_s.
(gdb) p transport->live_messages
Display all 34320 possibilities? (y or n)
(gdb) p transport->max_live_messages_
There is no member named max_live_messages_.
(gdb) p transport->max_live_messages_size
$6 = 66060288
If you look at dbus-transport-socket.c:check_read_watch(), what's happening is we're trying to disable the read watch, under the theory that we need to drain the outgoing queue, but...
(still investigating)
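Why would a refused read turn into a spin rather than a clean stall? The disable only takes effect if the client's main-loop glue honors the watch toggle, which (as noted later in the thread) pacemaker did not implement at the time. A schematic model of that handshake, with all names as hypothetical stand-ins for the DBusWatch "toggled" machinery installed via dbus_connection_set_watch_functions():

```c
#include <assert.h>

/* Schematic model of the DBusWatch enable/disable handshake.  Hypothetical
 * names: the transport can ask for its read watch to be turned off, but
 * that only takes effect if the client's main-loop glue implements the
 * toggled callback. */
typedef struct {
    int transport_wants_read;   /* what the transport asked for */
    int loop_polls_fd;          /* whether the main loop still polls the fd */
    int implements_toggled;     /* did the client provide a toggled callback? */
} watch_model;

/* Transport side: request that reading be enabled or disabled. */
static void transport_toggle_watch(watch_model *w, int enabled)
{
    w->transport_wants_read = enabled;
    if (w->implements_toggled)
        w->loop_polls_fd = enabled;  /* toggled callback updates the loop */
    /* Otherwise the loop keeps polling an fd the transport refuses to
     * read from: poll() reports POLLIN forever and the process spins. */
}
```

With the toggle honored, disabling the watch stops the polling; without it, the transport and the main loop disagree, and poll() spins on a forever-readable fd.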
Created attachment 940204 [details]
[PATCH] dbus: Unref reply when we're done
This patch appears to help.
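Why an unref fixes timeouts rather than a plain memory leak: libdbus charges each incoming message the application still references against a per-connection quota (the max_live_messages_size seen in the gdb session above), and once the quota is exceeded it stops reading the socket, so later replies are never dispatched and blocking callers see NoReply. A deliberately simplified, self-contained model of that accounting (hypothetical names; the real bookkeeping is a counter inside the transport):

```c
#include <assert.h>

/* Deliberately simplified model of libdbus's live-incoming-message quota.
 * Hypothetical names: 'live_size' stands for the bytes of messages the
 * application still holds references to, 'max_live_size' for the quota. */
typedef struct {
    long live_size;      /* bytes of incoming messages the app still holds */
    long max_live_size;  /* quota; roughly 63 MiB by default */
} transport_model;

/* The transport only keeps reading the socket while under quota. */
static int reads_enabled(const transport_model *t)
{
    return t->live_size < t->max_live_size;
}

/* App takes ownership of a reply (as dbus_pending_call_steal_reply does). */
static void steal_reply(transport_model *t, long msg_size)
{
    t->live_size += msg_size;
}

/* App releases the reply (as dbus_message_unref does) - the essence of the
 * fix.  Skipping this makes live_size grow until reads stop and every
 * later call times out with NoReply. */
static void unref_reply(transport_model *t, long msg_size)
{
    t->live_size -= msg_size;
}
```

A caller that steals replies without ever unref'ing crosses the quota after enough calls and reads stop, matching the "works for a while, then NoReply forever" symptom; unref'ing after processing keeps live_size bounded. Note the un-released messages remain referenced, so valgrind sees no leak, consistent with the observation later in the thread.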
(BTW, Red Hat Bugzilla corrupts git patches - direct link to the raw patch, which is against pacemaker git master: https://bugzilla.redhat.com/attachment.cgi?id=940204)

(In reply to Andrew Beekhof from comment #18)
> Those bindings require threads which are a non-starter in pacemaker.

As a GLib upstream maintainer (and sometimes an upstream libdbus contributor), I can say that we're thread oriented. Unfortunately, fork and threads are effectively mutually exclusive, and given that one must be picked, and threads are more flexible, and also a model portable to Windows, that's going to be how the API is designed and the internals are tested.

I think you're going to be carrying a substantial maintenance burden trying to use libraries which are designed for thread safety, but not fork safety. (libdbus is unfortunately only partially thread safe, and doesn't even try to call pthread_atfork() to reset state at the moment.)

(In reply to Colin Walters from comment #23)
> (In reply to Andrew Beekhof from comment #18)
>
> > Those bindings require threads which are a non-starter in pacemaker.
>
> As a GLib upstream maintainer (and sometimes an upstream libdbus
> contributor), I can say that we're thread oriented. Unfortunately, fork and
> threads are effectively mutually exclusive, and given that one must be
> picked, and threads are more flexible, and also a model portable to Windows,
> that's going to be how the API is designed and the internals are tested.
>
> I think you're going to be carrying a substantial maintenance burden trying
> to use libraries which are designed for thread safety, but not fork safety.
> (libdbus is unfortunately only partially thread safe, and doesn't even try
> to call pthread_atfork() to reset state at the moment.)

If libdbus is neither fork() nor thread safe, then we have a support burden either way. It's also the only option for us, since the use of fork() is fundamental to our existence - imagine trying to implement systemd without it... same thing. We need to use it for every resource type except systemd. So it's actually systemd/dbus that is the odd one out in our corner of the world.

(In reply to Colin Walters from comment #21)
> Created attachment 940204 [details]
> [PATCH] dbus: Unref reply when we're done
>
> This patch appears to help.

Very much so. This is great!

Could you explain why the patch works though? I'd have expected a different kind of failure for a memory leak bug - i.e. the process size growing inordinately large and the process being killed, rather than operations timing out. We did consider a memory leak, but curiously valgrind didn't report anything when David ran it there.

(Taking the bug back since it's clearly an issue on our side.)

(In reply to Andrew Beekhof from comment #25)
> Could you explain why the patch works though?

I briefly mentioned this in my commit message: basically, libdbus thinks the incoming queue is stalled and tries to stop reading, but fails at that due to a different bug, in that you're not implementing the _toggle API of DBusWatch.

I've implemented the watch toggle callback now too, so the effect won't be silent if we see this again for any reason in the future.

I have verified that there are no more dbus NoReply errors with pacemaker-1.1.12-13.el7.x86_64 after a systemd resource with a monitor interval of 1s has been running continuously for three days.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html
We need to use it for every resource type except systemd. So it's actually Systemd/dbus that is the odd one out in our corner of world. (In reply to Colin Walters from comment #21) > Created attachment 940204 [details] > [PATCH] dbus: Unref reply when we're done > > This patch appears to help. Very much so. This is great! Could you explain why the patch works though? I'd have expected a different kind of failure for a memory leak bug - ie. the process size to grow inordinately large and be killed rather than operations timing out. We did consider a memory leak, but curiously valgrind didn't report anything when David ran it there. (Taking the bug back since its clearly an issue on our side) (In reply to Andrew Beekhof from comment #25) s is great! > > Could you explain why the patch works though? I briefly mentioned this in my commit message, basically libdbus thinks the incoming queue is stalled and tries to stop reading (but fails at that due to a different bug in that you're not implementing the _toggle API of DBusWatch). I've implemented the watch toggle callback now too. So the effect wont be silent if we see this again for any reason in the future. I have verified that with there are no more dbus noreply errors with pacemaker-1.1.12-13.el7.x86_64 after systemd resource with monitor interval 1s has been running continuously for three days. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0440.html |