Bug 1127289

Summary: core dumped: stonith-ng[21398]: error: crm_abort: crm_glib_handler: Forked child 12188 to record non-fatal assert at logging.c:73 : Source ID 132 was not found when attempting to remove it
Product: Red Hat Enterprise Linux 7 Reporter: michal novacek <mnovacek>
Component: pacemaker Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.1 CC: cluster-maint, dvossel, fdinitto, jkortus, mlisik
Target Milestone: rc Keywords: TestBlocker
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-1.1.12-1.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-03-05 10:00:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
tared directory created by abrt none

Description michal novacek 2014-08-06 14:26:33 UTC
Created attachment 924509
tared directory created by abrt

Description of problem:
Core dumped after running 'ssh root@node-in-cluster pcs cluster cib'
multiple times (about ten) from outside of the cluster.

Version-Release number of selected component (if applicable):
pacemaker-1.1.10-32.el7_0.x86_64

How reproducible: always

Steps to Reproduce:
1. have a running cluster
2. from outside of the cluster run 'ssh root@node-in-cluster pcs cluster cib'
multiple times

Actual results: process stalled forever

Expected results: see cluster cib xml

Comment 3 Andrew Beekhof 2014-08-07 00:45:39 UTC
Strange that abrt leaves out the most useful piece of information: a stack trace.

Comment 4 Andrew Beekhof 2014-08-07 03:45:23 UTC
Here's the stack trace:

Thread 1 (Thread 0x7f0375dd9780 (LWP 21658)):
#0  0x00007f0373005989 in raise () from /lib64/libc.so.6
#1  0x00007f0373007098 in abort () from /lib64/libc.so.6
#2  0x00007f03759a888d in crm_abort () from /lib64/libcrmcommon.so.3
#3  0x00007f03759c37d7 in crm_glib_handler () from /lib64/libcrmcommon.so.3
#4  0x00007f03728d9a11 in g_logv () from /lib64/libglib-2.0.so.0
#5  0x00007f03728d9caf in g_log () from /lib64/libglib-2.0.so.0
#6  0x00007f03728d184c in g_source_remove () from /lib64/libglib-2.0.so.0
#7  0x00007f0375577d5d in stonith_action_clear_tracking_data () from /lib64/libstonithd.so.2
#8  0x00007f0375577dad in stonith_action_destroy () from /lib64/libstonithd.so.2
#9  0x00007f03759c0eb3 in child_death_dispatch () from /lib64/libcrmcommon.so.3
#10 0x00007f03759c0077 in crm_signal_dispatch () from /lib64/libcrmcommon.so.3
#11 0x00007f03728d29ea in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x00007f03728d2d38 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#13 0x00007f03728d300a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#14 0x00000000004034d8 in main ()

So it looks like the new version of g_source_remove() produces an error when the supplied ID has already been removed - instead of silently swallowing it as it did in the past.

I believe this would solve the issue though:

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,
 
     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+        action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+        action->timer_sigkill = 0;
     }
 
     if (action->last_timeout_signo) {
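
To illustrate why zeroing the stored ID helps, here is a minimal sketch that does not use GLib at all. It assumes, as the stack trace and the patch together suggest, that the timer cleanup can be reached twice for the same action (once from the done callback, once from the destroy path). Every name in it (mock_source_remove, mock_clear_timer, struct mock_action, demo_double_cleanup) is hypothetical and only models the "complain on unknown ID" behaviour described above; it is not the pacemaker or GLib implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a source table: NOT the real GLib API. */
#define MAX_SOURCES 8
static int live_sources[MAX_SOURCES];   /* 1 = ID is currently registered */
static int warnings_logged;             /* counts "not found" complaints */

/* Models the newer behaviour: complain when the ID is not registered. */
static int mock_source_remove(unsigned int id)
{
    if (id == 0 || id >= MAX_SOURCES || !live_sources[id]) {
        fprintf(stderr,
                "Source ID %u was not found when attempting to remove it\n",
                id);
        warnings_logged++;
        return 0;
    }
    live_sources[id] = 0;
    return 1;
}

/* Hypothetical, reduced version of the action structure. */
struct mock_action {
    unsigned int timer_sigterm;
};

/* Cleanup that may run from more than one code path. */
static void mock_clear_timer(struct mock_action *action, int reset_after_remove)
{
    assert(action != NULL);
    if (action->timer_sigterm > 0) {
        mock_source_remove(action->timer_sigterm);
        if (reset_after_remove) {
            /* The patched behaviour: forget the ID once it is removed. */
            action->timer_sigterm = 0;
        }
    }
}

/* Runs the cleanup twice and returns how many warnings were logged. */
int demo_double_cleanup(int reset_after_remove)
{
    struct mock_action action = { .timer_sigterm = 3 };

    live_sources[3] = 1;
    warnings_logged = 0;
    mock_clear_timer(&action, reset_after_remove);
    mock_clear_timer(&action, reset_after_remove);
    return warnings_logged;
}
```

In this model, calling demo_double_cleanup(0) logs one "not found" complaint on the second cleanup, while demo_double_cleanup(1) logs none, because the second cleanup sees an ID of 0 and skips the removal.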

Comment 5 Andrew Beekhof 2014-08-07 03:48:05 UTC
Except those core files will have nothing to do with your reproducer.
Two completely unrelated daemons...

Comment 6 Andrew Beekhof 2014-08-07 03:54:13 UTC
At what time and to what node did you run step 2?
I need to know that to make sense of the logs.

Comment 7 Miroslav Lisik 2014-08-13 15:16:25 UTC
I hit the same bug. I can reproduce it with a simple cluster created by the standard QA cluster setup scripts. After installation and cluster setup, the errors repeat in a cycle in the log messages.

Pacemaker version:
pacemaker-1.1.10-32.el7_0.x86_64

RHEL build version:
RHEL-7.1-20140812.n.0 Server x86_64

Comment 9 Andrew Beekhof 2014-08-15 01:08:12 UTC
By "same bug" do you mean "process stalled forever" or the core dump?
The fix for the core dump is in comment #4.

Is this with the live version of glib2?  
If so, I think we should do a z-stream.

Comment 10 Miroslav Lisik 2014-08-18 12:22:02 UTC
I mean core dump.

Comment 11 Andrew Beekhof 2014-08-19 11:41:21 UTC
(In reply to Miroslav Lisik from comment #10)
> I mean core dump.

What about the second half of the question to do with the glib version?

Comment 12 Miroslav Lisik 2014-08-19 13:34:10 UTC
I don't understand what "live version" means. It happens with glib2-2.40.0-2.el7.x86_64.

Comment 13 David Vossel 2014-08-19 14:05:55 UTC
(In reply to Miroslav Lisik from comment #12)
> I don't understand what "live version" means. It happens with
> glib2-2.40.0-2.el7.x86_64.

'live version' meaning one that we've shipped already, as opposed to something like a 'development build' that is unreleased for something like 7.1.

Comment 14 Miroslav Lisik 2014-08-19 15:48:24 UTC
So, this core dump occurs only in the development build of 7.1. I think no z-stream is needed.

Released 7.0 with pacemaker-1.1.10-29.el7.x86_64 and glib2-2.36.3-5.el7.x86_64 is ok.

Comment 15 Andrew Beekhof 2014-08-20 07:48:20 UTC
(In reply to Miroslav Lisik from comment #14)
> So, this core dump occurs only in the development build of 7.1. I think no
> z-stream is needed.
> 
> Released 7.0 with pacemaker-1.1.10-29.el7.x86_64 and
> glib2-2.36.3-5.el7.x86_64 is ok.

Thank goodness for that!
The patch in comment #4 will be picked up for 7.1

Comment 17 Fabio Massimo Di Nitto 2014-09-04 14:38:23 UTC
Andrew, can you please give it a spin ASAP?

Comment 20 michal novacek 2014-11-19 11:38:35 UTC
I have verified that the patch mentioned in comment #4 is included in
pacemaker-1.1.12-11.el7.x86_64, that the source compiles correctly, and that the overall behaviour of the pacemaker package looks correct.

Comment 22 errata-xmlrpc 2015-03-05 10:00:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html