Bug 1127289
Summary: | core dumped: stonith-ng[21398]: error: crm_abort: crm_glib_handler: Forked child 12188 to record non-fatal assert at logging.c:73 : Source ID 132 was not found when attempting to remove it | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | michal novacek <mnovacek> | ||||
Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> | ||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 7.1 | CC: | cluster-maint, dvossel, fdinitto, jkortus, mlisik | ||||
Target Milestone: | rc | Keywords: | TestBlocker | ||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | pacemaker-1.1.12-1.el7 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2015-03-05 10:00:14 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
strange that abort leaves out the most useful piece of information - a stack trace

Here's the stack trace:

Thread 1 (Thread 0x7f0375dd9780 (LWP 21658)):
#0  0x00007f0373005989 in raise () from /lib64/libc.so.6
#1  0x00007f0373007098 in abort () from /lib64/libc.so.6
#2  0x00007f03759a888d in crm_abort () from /lib64/libcrmcommon.so.3
#3  0x00007f03759c37d7 in crm_glib_handler () from /lib64/libcrmcommon.so.3
#4  0x00007f03728d9a11 in g_logv () from /lib64/libglib-2.0.so.0
#5  0x00007f03728d9caf in g_log () from /lib64/libglib-2.0.so.0
#6  0x00007f03728d184c in g_source_remove () from /lib64/libglib-2.0.so.0
#7  0x00007f0375577d5d in stonith_action_clear_tracking_data () from /lib64/libstonithd.so.2
#8  0x00007f0375577dad in stonith_action_destroy () from /lib64/libstonithd.so.2
#9  0x00007f03759c0eb3 in child_death_dispatch () from /lib64/libcrmcommon.so.3
#10 0x00007f03759c0077 in crm_signal_dispatch () from /lib64/libcrmcommon.so.3
#11 0x00007f03728d29ea in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x00007f03728d2d38 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#13 0x00007f03728d300a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#14 0x00000000004034d8 in main ()

So it looks like the new version of g_source_remove() produces an error when the supplied ID has already been removed - instead of silently swallowing it as it did in the past.
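To illustrate the failure mode in the trace above, here is a minimal sketch in plain C. It does not use GLib: mock_source_add() and mock_source_remove() are hypothetical stand-ins for a timer registry that, like the newer g_source_remove() described above, complains when asked to remove an ID it no longer knows. struct action and its timer_sigterm field only loosely mirror the names in the trace. The sketch contrasts a cleanup routine that leaves a stale ID behind with one that zeroes the ID after removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical registry standing in for a GLib main context (NOT GLib itself). */
#define MAX_SOURCES 8
static bool live[MAX_SOURCES + 1];  /* index 0 unused: 0 means "no source" */
static unsigned next_id = 1;
static int warnings = 0;            /* counts "source not found" complaints */

/* Mock of registering a timer: returns a nonzero ID, or 0 when full. */
static unsigned mock_source_add(void)
{
    if (next_id > MAX_SOURCES) {
        return 0;
    }
    live[next_id] = true;
    return next_id++;
}

/* Mock of removing a timer: warns on an unknown or already-removed ID. */
static bool mock_source_remove(unsigned id)
{
    if (id == 0 || id > MAX_SOURCES || !live[id]) {
        fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
        warnings++;
        return false;
    }
    live[id] = false;
    return true;
}

/* Illustrative struct; only loosely modelled on the names in the trace. */
struct action {
    unsigned timer_sigterm;
};

/* Leaves the stale ID in place, so a second call removes it again. */
static void clear_timer_unsafe(struct action *a)
{
    assert(a != NULL);
    if (a->timer_sigterm > 0) {
        mock_source_remove(a->timer_sigterm);
    }
}

/* Zeroes the ID after removal, so a second call is a no-op. */
static void clear_timer_safe(struct action *a)
{
    assert(a != NULL);
    if (a->timer_sigterm > 0) {
        mock_source_remove(a->timer_sigterm);
        a->timer_sigterm = 0;
    }
}

/* Runs cleanup twice on one action and returns the number of warnings. */
int demo_unsafe(void)
{
    struct action a = { mock_source_add() };
    warnings = 0;
    clear_timer_unsafe(&a);
    clear_timer_unsafe(&a);
    return warnings;
}

int demo_safe(void)
{
    struct action a = { mock_source_add() };
    warnings = 0;
    clear_timer_safe(&a);
    clear_timer_safe(&a);
    return warnings;
}
```

Calling demo_unsafe() and demo_safe() from a small main() shows the difference: the first triggers the "not found" complaint on the second cleanup, the second does not. In the trace above, the equivalent complaint from glib goes through crm_glib_handler and crm_abort, which is what produces the core file.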
I believe this would solve the issue though:

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,
     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+        action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+        action->timer_sigkill = 0;
     }
     if (action->last_timeout_signo) {

Except those core files will have nothing to do with your reproducer. Two completely unrelated daemons...

At what time and to what node did you run step 2? I need to know that to make sense of the logs.

I hit the same bug. I can reproduce this with a simple cluster setup using the standard QA cluster setup scripts. After installation and cluster setup you can see cycling errors in the log messages.

Pacemaker version: pacemaker-1.1.10-32.el7_0.x86_64
RHEL build version: RHEL-7.1-20140812.n.0 Server x86_64

By "same bug" do you mean "process stalled forever" or the core dump? The fix for the core dump is in comment #4.

Is this with the live version of glib2? If so, I think we should do a z-stream.

I mean core dump.

(In reply to Miroslav Lisik from comment #10)
> I mean core dump.

What about the second half of the question to do with the glib version?

I don't understand what "live version" means. It happens with glib2-2.40.0-2.el7.x86_64.

(In reply to Miroslav Lisik from comment #12)
> I don't understand what "live version" means. It happens with
> glib2-2.40.0-2.el7.x86_64.

'live version' meaning one that we've shipped already, as opposed to something like a 'development build' that is unreleased for something like 7.1.

So, this core dump occurs only in the development build of 7.1. I think no z-stream is needed.

Released 7.0 with pacemaker-1.1.10-29.el7.x86_64 and glib2-2.36.3-5.el7.x86_64 is ok.
(In reply to Miroslav Lisik from comment #14)
> So, this core dump occurs only in development build of 7.1. I think no
> z-stream is needed.
>
> Released 7.0 with pacemaker-1.1.10-29.el7.x86_64 and
> glib2-2.36.3-5.el7.x86_64 is ok.

Thank goodness for that! The patch in comment #4 will be picked up for 7.1

Andrew, can you please give it a spin ASAP?

I have verified that the patch mentioned in comment #3 is included in pacemaker-1.1.12-11.el7.x86_64, that the source compiles correctly and that the overall behaviour of the pacemaker package looks correct.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html
Created attachment 924509 [details]
tarred directory created by abrt

Description of problem:
Core dumped when attempting multiple times (about ten) to run 'ssh root@node-in-cluster pcs cluster cib' from outside of the cluster.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-32.el7_0.x86_64

How reproducible:
always

Steps to Reproduce:
1. Have a running cluster.
2. From outside of the cluster, run 'ssh root@node-in-cluster pcs cluster cib' multiple times.

Actual results:
process stalled forever

Expected results:
see cluster cib xml