Description of problem:

Operations with systemd resources tend to time out, and eventually the resources become unmanaged.

Version-Release number of selected component (if applicable):
pacemaker-1.1.11-1.fc19.x86_64
systemd-204-18.fc19.x86_64

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Last updated: Mon Mar 31 13:03:46 2014
Last change: Sun Mar 16 20:26:23 2014 via crmd on base1
Stack: corosync
Current DC: base1 (1) - partition with quorum
Version: 1.1.11-1.fc19-9d39a6b
2 Nodes configured
5 Resources configured

Online: [ base1 ]
OFFLINE: [ base2 ]

 db-ip            (ocf::heartbeat:IPaddr2):              Started base1
 watch-tpval-log  (systemd:ispadmin-watch-tpval-log):    FAILED base1 (unmanaged)
 check-timeout    (systemd:ispadmin-check-timeout):      FAILED base1 (unmanaged)
 radius-ip        (ocf::heartbeat:IPaddr2):              Started base1
 radiusd          (systemd:radiusd):                     FAILED base1 (unmanaged)

Failed actions:
    watch-tpval-log_stop_0 on base1 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Fri Mar 21 17:33:37 2014', queued=0ms, exec=0ms
    check-timeout_stop_0 on base1 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Fri Mar 21 17:33:37 2014', queued=0ms, exec=0ms
    radiusd_stop_0 on base1 'unknown error' (1): call=129, status=complete, last-rc-change='Mon Mar 31 12:23:16 2014', queued=0ms, exec=0ms

Expected results:

Additional info:
The most likely cause should have been resolved with 1.1.11... Can we get logs from this problem please?
Created attachment 889788 [details]
crm_report from bbox cluster

Here is the crm_report from the "bbox" cluster, which has similar problems. A report from the "base" cluster will also be available soon.
Created attachment 889792 [details]
crm_report from "base" cluster
Something is strange here...

1. I don't see any sign of status=Timed Out. All the failures have rc=1, which is "unknown error".
2. There are no logs at all from the lrmd. Perhaps configure a log file in corosync.conf.
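
E.g. something along these lines in the logging section of corosync.conf (the logfile path below is only an example, adjust as needed):

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        timestamp: on
    }

With corosync 2.x, pacemaker should pick up its log file from this section as well, so the lrmd messages should then end up in that file.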
We are seeing very similar problems on our pacemaker cluster on openSUSE 13.1 with pacemaker-1.1.11+git20140512.da5aaab-139.2.x86_64. It would seem that systemd resources are severely broken. More often than not the cluster fails to obtain results from systemd calls, resulting in unknown errors and time outs.

When it fails, systemd_async_dispatch was never called for the failed request. I will attach log file fragments of a failed and a working start request. One possible explanation could be that in pcmk_dbus_send the dbus_connection_send_with_reply call already has a reply before dbus_pending_call_set_notify registers the callback. I found one message on a mailing list suggesting adding a dbus_pending_call_completed call before the dbus_pending_call_set_notify [1], but could not find any code in other open source programs doing that. I tried changing the code to:

    if (dbus_pending_call_get_completed(pending)) {
        done(pending, user_data);
    } else {
        CRM_ASSERT(dbus_pending_call_set_notify(pending, done, user_data, NULL));
    }

But this leads to the crmd exiting with exit code 174 before doing anything useful. I stopped playing around with our production cluster after that. It is Sunday after all...

[1] http://marc.info/?l=freedesktop-dbus&m=131436328920692&w=2
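To make the suspected race a bit more concrete, here is a standalone libdbus sketch of the pattern I was trying to get to. The names send_async and on_reply are mine, not pacemaker's, and this deliberately leaves out the pcmk_dbus_* wrappers, so it only illustrates the idea, not the actual code in systemd.c:

    #include <dbus/dbus.h>
    #include <stdio.h>

    /* Hypothetical reply handler; in pacemaker this role is played by
     * systemd_async_dispatch(). */
    static void
    on_reply(DBusPendingCall *pending, void *user_data)
    {
        DBusMessage *reply = dbus_pending_call_steal_reply(pending);

        if (reply == NULL) {
            fprintf(stderr, "No reply received for %s\n", (const char *) user_data);
        } else {
            /* ... check the reply for a D-Bus error, read the return values ... */
            dbus_message_unref(reply);
        }
        dbus_pending_call_unref(pending);
    }

    /* Send msg asynchronously and make sure on_reply() runs even if the reply
     * was already delivered before we could register the notify callback. */
    static dbus_bool_t
    send_async(DBusConnection *conn, DBusMessage *msg, void *user_data)
    {
        DBusPendingCall *pending = NULL;

        if (!dbus_connection_send_with_reply(conn, msg, &pending, -1)
            || pending == NULL) {
            return FALSE;
        }

        if (dbus_pending_call_get_completed(pending)) {
            /* The reply raced us: dispatch it ourselves, because the notify
             * function registered below might never fire for it. */
            on_reply(pending, user_data);

        } else if (!dbus_pending_call_set_notify(pending, on_reply, user_data, NULL)) {
            dbus_pending_call_unref(pending);
            return FALSE;
        }

        return TRUE;
    }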
Created attachment 896793 [details]
log file fragment of a successful systemd resource start
Created attachment 896794 [details]
log file fragment of a failed systemd resource start
Corrections: we're still on openSUSE 12.3 on our servers.

Additional observations: pacemaker tries to issue a Reload command to systemd. This failed until I added the following lines to my /etc/dbus-1/system-local.conf:

    <policy user="hacluster">
      <allow own="org.freedesktop.systemd1"/>
      <allow send_destination="org.freedesktop.systemd1"
             send_interface="org.freedesktop.DBus.Manager"/>
    </policy>

This should be contained in the pacemaker package.

Also the GetUnit dbus call in systemd_unit_by_name will fail if the unit is not loaded. And the unit will not be loaded unless systemd is told to activate it or it is "Wanted" by some other unit or target which would cause systemd to start the unit automatically. To make matters more difficult to debug, systemd unloads units lazily after some timeout presumably. So it might work ok when doing some quick tests starting and stopping a service, but fails in real world tests with longer pauses in between. I guess having a look at how systemctl does these things could not hurt.

A note in the docs would be nice that indeed leaving out the ".service" in the systemd service name is supported. For a short time I thought that was the source of the problems when my manually reproduced dbus calls failed.
(In reply to nine from comment #8)
> Corrections: we're still on openSUSE 12.3 on our servers.
>
> Additional observations: pacemaker tries to issue a Reload command to
> systemd. This failed until I added the following lines to my
> /etc/dbus-1/system-local.conf:
> <policy user="hacluster">

The confusing thing about this is that the lrmd is running as root, not hacluster. So it's unclear why this would help.

> <allow own="org.freedesktop.systemd1"/>
> <allow send_destination="org.freedesktop.systemd1"
> send_interface="org.freedesktop.DBus.Manager"/>
> </policy>
> This should be contained in the pacemaker package.
>
> Also the GetUnit dbus call in systemd_unit_by_name will fail if the unit is
> not loaded.

Which is why it follows up with a call to LoadUnit if GetUnit fails.

> And the unit will not be loaded unless systemd is told to
> activate it or it is "Wanted" by some other unit or target which would cause
> systemd to start the unit automatically. To make matters more difficult to
> debug, systemd unloads units lazily after some timeout presumably.

We know :-(
It makes getting the list of available unit files nearly impossible.

> So it
> might work ok when doing some quick tests starting and stopping a service,
> but fails in real world tests with longer pauses in between. I guess having
> a look at how systemctl does these things could not hurt.
>
> A note in the docs would be nice that indeed leaving out the ".service" in
> the systemd service name is supported.

Good idea.

> For a short time I thought that was
> the source of the problems when my manually reproduced dbus calls failed.
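
For reference, the lookup order is roughly equivalent to the following standalone, blocking sketch (illustration only: unit_object_path is an illustrative name, and the real code in systemd.c goes through the async pcmk_dbus_* helpers rather than dbus_connection_send_with_reply_and_block):

    #include <dbus/dbus.h>
    #include <stdio.h>
    #include <string.h>

    #define SYSTEMD_NAME  "org.freedesktop.systemd1"
    #define SYSTEMD_PATH  "/org/freedesktop/systemd1"
    #define SYSTEMD_IFACE "org.freedesktop.systemd1.Manager"

    /* Resolve a unit's D-Bus object path: try GetUnit first and fall back to
     * LoadUnit when the unit isn't currently loaded. Returns a malloc'd path
     * or NULL on failure. */
    static char *
    unit_object_path(DBusConnection *conn, const char *unit)
    {
        const char *methods[] = { "GetUnit", "LoadUnit" };
        char *result = NULL;
        int lpc;

        for (lpc = 0; lpc < 2 && result == NULL; lpc++) {
            DBusError err;
            DBusMessage *reply = NULL;
            DBusMessage *msg = dbus_message_new_method_call(SYSTEMD_NAME,
                                                            SYSTEMD_PATH,
                                                            SYSTEMD_IFACE,
                                                            methods[lpc]);

            dbus_error_init(&err);
            dbus_message_append_args(msg, DBUS_TYPE_STRING, &unit,
                                     DBUS_TYPE_INVALID);

            reply = dbus_connection_send_with_reply_and_block(conn, msg, -1, &err);
            if (reply != NULL) {
                const char *path = NULL;

                if (dbus_message_get_args(reply, &err, DBUS_TYPE_OBJECT_PATH,
                                          &path, DBUS_TYPE_INVALID)) {
                    result = strdup(path);
                }
                dbus_message_unref(reply);

            } else {
                fprintf(stderr, "%s(%s) failed: %s\n",
                        methods[lpc], unit, err.message);
            }
            dbus_error_clear(&err);
            dbus_message_unref(msg);
        }
        return result;
    }

Called with e.g. "radiusd.service", this should return something like /org/freedesktop/systemd1/unit/radiusd_2eservice once the unit has been loaded.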
(In reply to nine from comment #5)
> We are seeing very similar problems on our pacemaker cluster on openSUSE
> 13.1 with pacemaker-1.1.11+git20140512.da5aaab-139.2.x86_64. It would seem
> like systemd resources are severely broken. More often than not the cluster
> fails to obtain results to systemd calls resulting in unknown errors and
> time outs.
>
> When it fails, systemd_async_dispatch was never called for the failed
> request. Will attach log file fragments of a failed and a working start
> request. One possible explanation could be that in pcmk_dbus_send the
> dbus_connection_send_with_reply call already has a reply before
> dbus_pending_call_set_notify registers the callback. I found one message on
> a mailing list suggesting adding a dbus_pending_call_completed call before
> the dbus_pending_call_set_notify [1], but could not find any code in other
> open source programs doing that. I tried changing the code to:
> if (dbus_pending_call_get_completed(pending)) {
>     done(pending, user_data);
> } else {
>     CRM_ASSERT(dbus_pending_call_set_notify(pending, done, user_data, NULL));
> }

Interesting patch. It certainly seems plausible.
Perhaps this additional patch might have been needed:

diff --git a/lib/services/systemd.c b/lib/services/systemd.c
index aa5870b..8b5d7ba 100644
--- a/lib/services/systemd.c
+++ b/lib/services/systemd.c
@@ -345,7 +345,10 @@ systemd_async_dispatch(DBusPendingCall *pending, void *user_data)
     if(pending) {
         reply = dbus_pending_call_steal_reply(pending);
     }
-    if(pcmk_dbus_find_error(op->action, pending, reply, &error)) {
+    if(reply == NULL) {
+        crm_err("No reply for %s action on %s", op->action, op->rsc);
+
+    } else if(pcmk_dbus_find_error(op->action, pending, reply, &error)) {
 
         /* ignore "already started" or "not running" errors */
         if (!systemd_mask_error(op, error.name)) {

> But this leads to the crmd exiting with exit code 174 before doing anything
> useful. I stopped playing around with our production cluster after that.

Good idea.

174 is particularly odd too, since it's not one of the standard error codes or one of pacemaker's custom ones. I would have been interested to see logs of this.

> It
> is Sunday after all...
>
> [1] http://marc.info/?l=freedesktop-dbus&m=131436328920692&w=2
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately four weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.