Description of problem:
I had a crash last night in pengine:
Apr 10 03:01:20 clu4 crmd[4057]: error: crm_ipc_read: Connection to pengine failed
Apr 10 03:01:20 clu4 pacemakerd[4043]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=4056, core=128)
Apr 10 03:01:20 clu4 pacemakerd[4043]: notice: pcmk_process_exit: Respawning failed child process: pengine
Apr 10 03:01:20 clu4 crmd[4057]: error: mainloop_gio_callback: Connection to pengine[0x17bcff0] closed (I/O condition=25)
The following core dump and backtrace were deposited in the pacemaker/cores directory:
#0 0x0000003e2a0328a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003e2a034085 in abort () at abort.c:92
#2 0x0000003e2a02ba1e in __assert_fail_base (fmt=<value optimized out>, assertion=0x3e30819fbb "rc == 0", file=0x3e30819fb1 "log_dcs.c", line=<value optimized out>,
function=<value optimized out>) at assert.c:96
#3 0x0000003e2a02bae0 in __assert_fail (assertion=0x3e30819fbb "rc == 0", file=0x3e30819fb1 "log_dcs.c", line=70, function=0x3e3081a030 "_log_dcs_new_cs") at assert.c:105
#4 0x0000003e308143eb in _log_dcs_new_cs (function=0x3e32c46310 "native_color", filename=0x3e32c442bf "native.c", format=0x3e32428d00 "%s: %s allocation score on %s: %s",
priority=<value optimized out>, lineno=472, tags=0) at log_dcs.c:70
#5 0x0000003e308145e5 in qb_log_dcs_get (newly_created=0x7fff43a5715c, function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>,
priority=8 '\b', lineno=<value optimized out>, tags=0) at log_dcs.c:146
#6 0x0000003e30812ba9 in qb_log_callsite_get (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>, priority=<value optimized out>,
lineno=<value optimized out>, tags=0) at log.c:256
#7 0x0000003e308130ab in qb_log_from_external_source (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>,
priority=<value optimized out>, lineno=<value optimized out>, tags=<value optimized out>) at log.c:331
#8 0x0000003e32417315 in dump_node_scores_worker (level=9, file=0x3e32c442bf "native.c", function=0x3e32c46310 "native_color", line=472, rsc=0xa03a40,
comment=0x3e32c44548 "Post-coloc", nodes=0xc17800) at utils.c:189
#9 0x0000003e32c21135 in native_color (rsc=0xa03a40, prefer=0xc492c0, data_set=0x7fff43a57710) at native.c:472
#10 0x0000003e32c314fc in color_instance (rsc=0xa03a40, prefer=0xc492c0, all_coloc=<value optimized out>, data_set=0x7fff43a57710) at clone.c:430
#11 0x0000003e32c35549 in clone_color (rsc=0xb5c960, prefer=<value optimized out>, data_set=0x7fff43a57710) at clone.c:578
#12 0x0000003e32c21000 in native_color (rsc=0xd77150, prefer=0x0, data_set=0x7fff43a57710) at native.c:459
#13 0x0000003e32c12c0f in stage5 (data_set=0x7fff43a57710) at allocate.c:1130
#14 0x0000003e32c09b1d in do_calculations (data_set=0x7fff43a57710, xml_input=<value optimized out>, now=<value optimized out>) at pengine.c:247
#15 0x0000003e32c0a702 in process_pe_message (msg=0x115c8a0, xml_data=0x1106890, sender=0x9910c0) at pengine.c:126
#16 0x00000000004012be in pe_ipc_dispatch (c=0x9910c0, data=<value optimized out>, size=<value optimized out>) at main.c:74
#17 0x0000003e3080ebb4 in _process_request_ (c=0x9910c0, ms_timeout=10) at ipcs.c:647
#18 0x0000003e3080ef04 in qb_ipcs_dispatch_connection_request (fd=<value optimized out>, revents=<value optimized out>, data=0x9910c0) at ipcs.c:755
#19 0x0000003e31425240 in gio_read_socket (gio=<value optimized out>, condition=G_IO_IN, data=0x990a80) at mainloop.c:372
#20 0x0000003e2bc38f0e in g_main_dispatch (context=0x98e760) at gmain.c:1960
#21 IA__g_main_context_dispatch (context=0x98e760) at gmain.c:2513
#22 0x0000003e2bc3c938 in g_main_context_iterate (context=0x98e760, block=1, dispatch=1, self=<value optimized out>) at gmain.c:2591
#23 0x0000003e2bc3cd55 in IA__g_main_loop_run (loop=0x98cdb0) at gmain.c:2799
#24 0x00000000004014c8 in main (argc=1, argv=0x7fff43a57cf8) at main.c:174
This is apparently a bug in libqb and it's discussed here:
http://comments.gmane.org/gmane.linux.highavailability.pacemaker/15504
A patch for libqb 0.14.4 is available at:
https://github.com/asalkeld/libqb/commit/30a7871646c1f5bbb602e0a01f5550a4516b36f8
But that does not apply cleanly to 0.14.2 (which Red Hat ships).
How reproducible:
Not reproducible
Steps to Reproduce:
1. Use Pacemaker
Actual results:
Pengine crashes.
Expected results:
Pengine doesn't crash.
I think libqb will need a rebase in 6.5.
Until then, you can borrow an updated rpm from:
http://clusterlabs.org/rpm-test-next/rhel-6/
Also, for QE: the only reproducer is time.
libqb keeps creating duplicates and eventually uses up all the free memory, at which point the pengine process crashes.
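To illustrate the failure mode, here is a minimal sketch of a lookup-before-insert callsite cache. It is not libqb's code, and this report does not say exactly why libqb created the duplicates. The struct and function names (callsite, callsite_cache, callsite_get, callsite_cache_free) are hypothetical; only the lookup key (function, filename, format, priority, lineno) is borrowed from the arguments visible in frames #4 and #5 of the backtrace. The point is that if the lookup compares by content and finds the existing entry, the cache stays bounded by the number of distinct callsites instead of growing with every log call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callsite record for illustration; NOT libqb's real struct. */
struct callsite {
    char *function;
    char *filename;
    char *format;
    int priority;
    int lineno;
    struct callsite *next;
};

struct callsite_cache {
    struct callsite *head;
    size_t count;
};

/* Heap-copy a string so a cached entry does not depend on the caller's buffer. */
static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}

static void callsite_free(struct callsite *cs)
{
    free(cs->function);
    free(cs->filename);
    free(cs->format);
    free(cs);
}

/* Compare by string content, not by pointer identity. */
static int callsite_matches(const struct callsite *cs, const char *function,
                            const char *filename, const char *format,
                            int priority, int lineno)
{
    return cs->lineno == lineno && cs->priority == priority &&
           strcmp(cs->function, function) == 0 &&
           strcmp(cs->filename, filename) == 0 &&
           strcmp(cs->format, format) == 0;
}

/* Return the cached callsite, creating one only if no equal entry exists.
 * Returns NULL on allocation failure instead of asserting. */
struct callsite *callsite_get(struct callsite_cache *cache, const char *function,
                              const char *filename, const char *format,
                              int priority, int lineno, int *newly_created)
{
    struct callsite *cs;

    *newly_created = 0;
    for (cs = cache->head; cs != NULL; cs = cs->next) {
        if (callsite_matches(cs, function, filename, format, priority, lineno)) {
            return cs; /* reuse: no duplicate is created */
        }
    }

    cs = calloc(1, sizeof(*cs));
    if (cs == NULL) {
        return NULL;
    }
    cs->function = dup_str(function);
    cs->filename = dup_str(filename);
    cs->format = dup_str(format);
    if (cs->function == NULL || cs->filename == NULL || cs->format == NULL) {
        callsite_free(cs);
        return NULL;
    }
    cs->priority = priority;
    cs->lineno = lineno;
    cs->next = cache->head;
    cache->head = cs;
    cache->count++;
    *newly_created = 1;
    return cs;
}

/* Release every cached entry and reset the cache. */
void callsite_cache_free(struct callsite_cache *cache)
{
    while (cache->head != NULL) {
        struct callsite *next = cache->head->next;
        callsite_free(cache->head);
        cache->head = next;
    }
    cache->count = 0;
}
```

With this sketch, calling callsite_get repeatedly for the same "native_color" / "native.c" / line 472 message returns the same entry every time and leaves count at 1. A lookup that misses an existing entry would instead allocate a new record on every call, which matches the slow memory growth described above.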
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2013-1634.html