Bug 1133952 - [abrt] pacemaker: crm_abort(): stonithd killed by SIGABRT
Summary: [abrt] pacemaker: crm_abort(): stonithd killed by SIGABRT
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.0
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
Whiteboard: abrt_hash:81e7c231edd24298e0f49715a05...
Duplicates: 1133943
Reported: 2014-08-26 14:06 UTC by Nate Straz
Modified: 2015-03-05 10:00 UTC
CC List: 6 users

Fixed In Version: pacemaker-1.1.12-4
Doc Type: Bug Fix
Last Closed: 2015-03-05 10:00:19 UTC


Attachments
All attached on 2014-08-26 14:06 UTC by Nate Straz:
File: backtrace (10.20 KB, text/plain)
File: cgroup (159 bytes, text/plain)
File: core_backtrace (4.54 KB, text/plain)
File: environ (466 bytes, text/plain)
File: limits (1.29 KB, text/plain)
File: maps (20.86 KB, text/plain)
File: open_fds (510 bytes, text/plain)
File: proc_pid_status (1.02 KB, text/plain)
File: sosreport.log (1.05 KB, text/plain)
File: tmpIFHSo2 (854 bytes, text/plain)


Links
Red Hat Product Errata RHBA-2015:0440: pacemaker bug fix and enhancement update (priority: normal, status: SHIPPED_LIVE, last updated 2015-03-05 14:37:57 UTC)

Description Nate Straz 2014-08-26 14:06:43 UTC
Description of problem:


Version-Release number of selected component:
pacemaker-1.1.10-32.el7_0

Additional info:
reporter:       libreport-2.1.11
backtrace_rating: 4
cmdline:        /usr/libexec/pacemaker/stonithd
crash_function: crm_abort
executable:     /usr/libexec/pacemaker/stonithd
kernel:         3.10.0-131.el7.x86_64
runlevel:       N 3
tmpBQyGN7:      2014-08-26 08:48:58,453 INFO: could not run 'tree /var/lib': command not found
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (7 frames)
 #2 crm_abort at /lib64/libcrmcommon.so.3
 #3 crm_glib_handler at /lib64/libcrmcommon.so.3
 #6 g_source_remove at gmain.c:2194
 #7 stonith_action_clear_tracking_data at /lib64/libstonithd.so.2
 #8 stonith_action_destroy at /lib64/libstonithd.so.2
 #9 child_death_dispatch at /lib64/libcrmcommon.so.3
 #10 crm_signal_dispatch at /lib64/libcrmcommon.so.3

Comment 1 Nate Straz 2014-08-26 14:06:45 UTC
Created attachment 931005
File: backtrace

Comment 2 Nate Straz 2014-08-26 14:06:45 UTC
Created attachment 931006
File: cgroup

Comment 3 Nate Straz 2014-08-26 14:06:46 UTC
Created attachment 931007
File: core_backtrace

Comment 4 Nate Straz 2014-08-26 14:06:47 UTC
Created attachment 931008
File: environ

Comment 5 Nate Straz 2014-08-26 14:06:48 UTC
Created attachment 931009
File: limits

Comment 6 Nate Straz 2014-08-26 14:06:49 UTC
Created attachment 931010
File: maps

Comment 7 Nate Straz 2014-08-26 14:06:50 UTC
Created attachment 931011
File: open_fds

Comment 8 Nate Straz 2014-08-26 14:06:50 UTC
Created attachment 931012
File: proc_pid_status

Comment 9 Nate Straz 2014-08-26 14:06:51 UTC
Created attachment 931013
File: sosreport.log

Comment 10 Nate Straz 2014-08-26 14:06:52 UTC
Created attachment 931014
File: tmpIFHSo2

Comment 12 Nate Straz 2014-08-26 18:03:01 UTC
*** Bug 1133943 has been marked as a duplicate of this bug. ***

Comment 13 Nate Straz 2014-08-26 18:07:20 UTC
This looks like a major issue that's causing other problems.  I've had a cluster up for 24 hours and I already have 1000+ cores from stonithd.

[root@host-077 cores]# pwd
/var/lib/pacemaker/cores
[root@host-077 cores]# ls | wc -l ; du -sh .
1298
3.1G    .


The log is filling with messages like these:
Aug 26 13:02:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3078 to record non-fatal assert at logging.c:73 : Source ID 25 was not found when attempting to remove it
Aug 26 13:02:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 25 was not found when attempting to remove it
Aug 26 13:02:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3079 to record non-fatal assert at logging.c:73 : Source ID 26 was not found when attempting to remove it
Aug 26 13:02:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 26 was not found when attempting to remove it
Aug 26 13:03:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3147 to record non-fatal assert at logging.c:73 : Source ID 27 was not found when attempting to remove it
Aug 26 13:03:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 27 was not found when attempting to remove it
Aug 26 13:03:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3148 to record non-fatal assert at logging.c:73 : Source ID 28 was not found when attempting to remove it
Aug 26 13:03:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 28 was not found when attempting to remove it
Aug 26 13:04:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3223 to record non-fatal assert at logging.c:73 : Source ID 29 was not found when attempting to remove it
Aug 26 13:04:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 29 was not found when attempting to remove it
Aug 26 13:04:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3224 to record non-fatal assert at logging.c:73 : Source ID 30 was not found when attempting to remove it
Aug 26 13:04:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 30 was not found when attempting to remove it
Aug 26 13:05:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3293 to record non-fatal assert at logging.c:73 : Source ID 31 was not found when attempting to remove it
Aug 26 13:05:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 31 was not found when attempting to remove it
Aug 26 13:05:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3294 to record non-fatal assert at logging.c:73 : Source ID 32 was not found when attempting to remove it
Aug 26 13:05:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 32 was not found when attempting to remove it
Aug 26 13:06:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3362 to record non-fatal assert at logging.c:73 : Source ID 33 was not found when attempting to remove it
Aug 26 13:06:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 33 was not found when attempting to remove it
Aug 26 13:06:15 host-077 stonith-ng[2409]: error: crm_abort: crm_glib_handler: Forked child 3363 to record non-fatal assert at logging.c:73 : Source ID 34 was not found when attempting to remove it
Aug 26 13:06:15 host-077 stonith-ng[2409]: error: crm_glib_handler: GLib: Source ID 34 was not found when attempting to remove it
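The backtrace in the description shows stonith_action_destroy() calling stonith_action_clear_tracking_data(), which calls g_source_remove(). The repeated message above is what GLib's g_source_remove() reports when the ID it is given is not (or is no longer) a registered source, and crm_glib_handler() turns that report into crm_abort(). A common way to avoid this class of bug is to zero a stored source ID as soon as the source goes away, so a later cleanup path never removes it a second time. The following is a minimal, self-contained C sketch of that guard under stated assumptions: it does not link against GLib, a small array stands in for the main loop's source table, and `source_add`, `source_remove`, `struct action_tracking`, `on_timeout`, `dispatch_timer` and `clear_tracking_data` are hypothetical names. It is not claimed to be the actual pacemaker fix.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an event loop's source table. IDs start at 1;
 * 0 is reserved to mean "no source", as GLib also does. */
#define MAX_SOURCES 64

static bool source_live[MAX_SOURCES + 1];
static unsigned next_id = 1;
static unsigned remove_failures = 0;  /* times a stale ID was removed */

/* Register a source and return its ID, or 0 if the table is full. */
static unsigned source_add(void)
{
    if (next_id > MAX_SOURCES) {
        return 0;
    }
    source_live[next_id] = true;
    return next_id++;
}

/* Remove a live source. Unknown or already-removed IDs are reported,
 * mirroring the message in the stonith-ng log. */
static bool source_remove(unsigned id)
{
    if (id == 0 || id > MAX_SOURCES || !source_live[id]) {
        fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
        remove_failures++;
        return false;
    }
    source_live[id] = false;
    return true;
}

/* Hypothetical per-action tracking data holding a timer source ID. */
struct action_tracking {
    unsigned timer_id;  /* 0 when no timer is registered */
};

/* Timer callback: returning false asks the loop to drop the source
 * (like returning G_SOURCE_REMOVE in GLib), so the stored ID becomes
 * stale and is forgotten here. */
static bool on_timeout(struct action_tracking *t)
{
    t->timer_id = 0;
    return false;
}

/* Simulated dispatch: run the callback and drop the source if asked. */
static void dispatch_timer(struct action_tracking *t)
{
    unsigned id = t->timer_id;
    if (!on_timeout(t)) {
        source_live[id] = false;
    }
}

/* Cleanup that is safe to call more than once: remove the timer only if
 * we still own it, then zero the ID so later calls are no-ops. */
static void clear_tracking_data(struct action_tracking *t)
{
    if (t->timer_id != 0) {
        source_remove(t->timer_id);
        t->timer_id = 0;
    }
}
```

Zeroing the ID both in the callback and in the cleanup function means the destroy path does not need to know whether the timer already fired.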

Comment 14 Andrew Beekhof 2014-08-26 22:01:15 UTC
I would expect this to be a dup of bug #1127289.  Can you confirm if your version of glib has been released to customers?

Comment 15 Nate Straz 2014-08-27 14:08:48 UTC
I'm running glibc-2.17-55.el7.x86_64

Comment 16 Nate Straz 2014-08-27 14:10:01 UTC
Sorry, wrong library.  I'm running glib2-2.40.0-2.el7.x86_64

Comment 17 Nate Straz 2014-08-27 14:50:53 UTC
It looks like the same issue to me.  There are a couple of failure scenarios that are causing a lot of trouble, so getting the fix in would be much appreciated.

1. When stonithd dumps core, abrtd runs os-prober.  If a mkfs.gfs2 is running at the time, this causes a mount attempt and a segfault in mkfs.

2. If I disable abrtd, cores from stonithd fill /var/lib/pacemaker/cores and cause a file-system-full condition.  More chaos follows after that.

Comment 18 Andrew Beekhof 2014-08-27 23:04:17 UTC
(In reply to Nate Straz from comment #16)
> Sorry, wrong library.  I'm running glib2-2.40.0-2.el7.x86_64

Is that what customers are using or something from brew?
In bug #1127289 it turned out not to be live, so we didn't prioritise an update.

Comment 19 Andrew Beekhof 2014-08-27 23:10:07 UTC
(In reply to Nate Straz from comment #17)
> It looks like the same issue to me.  There are a couple of failure scenarios
> that are causing a lot of trouble, so getting the fix in would be much
> appreciated.

I have to do a rebase before that deadline expires anyway, so it won't be far away.

The good news is that in addition to the fix, crm_glib_handler() has previously been updated to not produce core files by default anymore.  This will also get picked up by the rebase.

> 
> 1. When stonithd dumps core, abrtd runs os-prober.  If a mkfs.gfs2 is
> running at the time, this causes a mount attempt and a segfault in
> mkfs.

ouch

> 
> 2. If I disable abrtd, cores from stonithd fill /var/lib/pacemaker/cores
> and cause a file-system-full condition.  More chaos follows after that.

i bet :-(
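The log lines in Comment 13 ("Forked child ... to record non-fatal assert") and the note above that crm_glib_handler() no longer produces core files by default point to a common pattern: on a non-fatal assertion the daemon forks, the child calls abort() so a core of the current state is captured, and the parent carries on. That explains why stonithd kept running while cores accumulated, one per failed removal (and, plausibly, why abrt reported the forked copy as "stonithd killed by SIGABRT"). Below is a minimal POSIX C sketch of that pattern under stated assumptions: `nonfatal_assert()`, `record_cores` and `nonfatal_reports` are hypothetical names, this is not pacemaker's crm_abort() implementation, and `record_cores` defaults to false only to mirror the new default described in this comment.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical switch: core files are no longer produced by default,
 * so this sketch defaults to false. */
static bool record_cores = false;

/* Hypothetical counter of non-fatal assertions reported so far. */
static unsigned nonfatal_reports = 0;

/* Report a non-fatal assertion without stopping the daemon.
 * Returns the PID of the core-dumping child, 0 if no core was
 * requested, or -1 if fork() failed. */
static pid_t nonfatal_assert(const char *file, int line, const char *msg)
{
    nonfatal_reports++;

    if (!record_cores) {
        /* Log only; nothing is written to the cores directory. */
        fprintf(stderr, "non-fatal assert at %s:%d : %s\n", file, line, msg);
        return 0;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: a copy of the daemon's state. abort() raises SIGABRT,
         * which dumps core if core dumps are enabled. */
        abort();
    }
    if (pid < 0) {
        perror("fork");
        return -1;
    }

    fprintf(stderr, "Forked child %d to record non-fatal assert at %s:%d : %s\n",
            (int) pid, file, line, msg);
    /* Reap the child so it does not linger as a zombie. */
    waitpid(pid, NULL, 0);
    return pid;
}
```

With `record_cores` left false, a repeating assertion like the one in Comment 13 produces log lines only, rather than one core file per occurrence.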

Comment 24 Nate Straz 2014-11-19 14:35:15 UTC
Recent clusters are showing no core dumps in /var/lib/pacemaker/cores and no errors from stonith-ng.

[root@host-008 ~]# ls /var/lib/pacemaker/cores/
[root@host-008 ~]# rpm -q pacemaker
pacemaker-1.1.12-10.el7.x86_64
[root@host-008 ~]# grep stonith-ng /var/log/messages | grep error
[root@host-008 ~]#

Comment 26 errata-xmlrpc 2015-03-05 10:00:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0440.html

