Bug 950403 - Pengine assert in qb_log_from_external_source()
Summary: Pengine assert in qb_log_from_external_source()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libqb
Version: 6.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 987355 1001491
Reported: 2013-04-10 08:17 UTC by Taneli Leppä
Modified: 2013-11-21 11:53 UTC (History)

Fixed In Version: libqb-0.16.0-2.el6
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-21 11:53:03 UTC
Target Upstream Version:



Links
System: Red Hat Product Errata
ID: RHBA-2013:1634
Priority: normal
Status: SHIPPED_LIVE
Summary: libqb bug fix and enhancement update
Last Updated: 2013-11-20 21:53:46 UTC

Description Taneli Leppä 2013-04-10 08:17:53 UTC
Description of problem:

I had a crash last night in pengine:

Apr 10 03:01:20 clu4 crmd[4057]:    error: crm_ipc_read: Connection to pengine failed
Apr 10 03:01:20 clu4 pacemakerd[4043]:   notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=4056, core=128)
Apr 10 03:01:20 clu4 pacemakerd[4043]:   notice: pcmk_process_exit: Respawning failed child process: pengine
Apr 10 03:01:20 clu4 crmd[4057]:    error: mainloop_gio_callback: Connection to pengine[0x17bcff0] closed (I/O condition=25)

A core dump was left in the pacemaker/cores directory. It gives the following backtrace:


#0  0x0000003e2a0328a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e2a034085 in abort () at abort.c:92
#2  0x0000003e2a02ba1e in __assert_fail_base (fmt=<value optimized out>, assertion=0x3e30819fbb "rc == 0", file=0x3e30819fb1 "log_dcs.c", line=<value optimized out>,
    function=<value optimized out>) at assert.c:96
#3  0x0000003e2a02bae0 in __assert_fail (assertion=0x3e30819fbb "rc == 0", file=0x3e30819fb1 "log_dcs.c", line=70, function=0x3e3081a030 "_log_dcs_new_cs") at assert.c:105
#4  0x0000003e308143eb in _log_dcs_new_cs (function=0x3e32c46310 "native_color", filename=0x3e32c442bf "native.c", format=0x3e32428d00 "%s: %s allocation score on %s: %s",
    priority=<value optimized out>, lineno=472, tags=0) at log_dcs.c:70
#5  0x0000003e308145e5 in qb_log_dcs_get (newly_created=0x7fff43a5715c, function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>,
    priority=8 '\b', lineno=<value optimized out>, tags=0) at log_dcs.c:146
#6  0x0000003e30812ba9 in qb_log_callsite_get (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>, priority=<value optimized out>,
    lineno=<value optimized out>, tags=0) at log.c:256
#7  0x0000003e308130ab in qb_log_from_external_source (function=<value optimized out>, filename=<value optimized out>, format=<value optimized out>,
    priority=<value optimized out>, lineno=<value optimized out>, tags=<value optimized out>) at log.c:331
#8  0x0000003e32417315 in dump_node_scores_worker (level=9, file=0x3e32c442bf "native.c", function=0x3e32c46310 "native_color", line=472, rsc=0xa03a40,
    comment=0x3e32c44548 "Post-coloc", nodes=0xc17800) at utils.c:189
#9  0x0000003e32c21135 in native_color (rsc=0xa03a40, prefer=0xc492c0, data_set=0x7fff43a57710) at native.c:472
#10 0x0000003e32c314fc in color_instance (rsc=0xa03a40, prefer=0xc492c0, all_coloc=<value optimized out>, data_set=0x7fff43a57710) at clone.c:430
#11 0x0000003e32c35549 in clone_color (rsc=0xb5c960, prefer=<value optimized out>, data_set=0x7fff43a57710) at clone.c:578
#12 0x0000003e32c21000 in native_color (rsc=0xd77150, prefer=0x0, data_set=0x7fff43a57710) at native.c:459
#13 0x0000003e32c12c0f in stage5 (data_set=0x7fff43a57710) at allocate.c:1130
#14 0x0000003e32c09b1d in do_calculations (data_set=0x7fff43a57710, xml_input=<value optimized out>, now=<value optimized out>) at pengine.c:247
#15 0x0000003e32c0a702 in process_pe_message (msg=0x115c8a0, xml_data=0x1106890, sender=0x9910c0) at pengine.c:126
#16 0x00000000004012be in pe_ipc_dispatch (c=0x9910c0, data=<value optimized out>, size=<value optimized out>) at main.c:74
#17 0x0000003e3080ebb4 in _process_request_ (c=0x9910c0, ms_timeout=10) at ipcs.c:647
#18 0x0000003e3080ef04 in qb_ipcs_dispatch_connection_request (fd=<value optimized out>, revents=<value optimized out>, data=0x9910c0) at ipcs.c:755
#19 0x0000003e31425240 in gio_read_socket (gio=<value optimized out>, condition=G_IO_IN, data=0x990a80) at mainloop.c:372
#20 0x0000003e2bc38f0e in g_main_dispatch (context=0x98e760) at gmain.c:1960
#21 IA__g_main_context_dispatch (context=0x98e760) at gmain.c:2513
#22 0x0000003e2bc3c938 in g_main_context_iterate (context=0x98e760, block=1, dispatch=1, self=<value optimized out>) at gmain.c:2591
#23 0x0000003e2bc3cd55 in IA__g_main_loop_run (loop=0x98cdb0) at gmain.c:2799
#24 0x00000000004014c8 in main (argc=1, argv=0x7fff43a57cf8) at main.c:174

This is apparently a bug in libqb and it's discussed here:

http://comments.gmane.org/gmane.linux.highavailability.pacemaker/15504

A patch for libqb 0.14.4 is available at:

https://github.com/asalkeld/libqb/commit/30a7871646c1f5bbb602e0a01f5550a4516b36f8

But that does not apply cleanly to 0.14.2 (which Red Hat ships).


How reproducible:
Not reproducible

Steps to Reproduce:
1. Use Pacemaker
  
Actual results:
Pengine crashes.

Expected results:
Pengine doesn't crash.

Comment 1 Andrew Beekhof 2013-04-11 22:54:39 UTC
Yep, that's a libqb bug. Reassigning.

Comment 2 Taneli Leppä 2013-04-22 14:06:20 UTC
Got hit by another one of these crashes last night.

Comment 5 Andrew Beekhof 2013-06-03 01:58:13 UTC
I think libqb will need a rebase in 6.5.

Until then, you can borrow an updated rpm from:
   http://clusterlabs.org/rpm-test-next/rhel-6/


Also, for QE: the only reproducer is time.

libqb keeps creating duplicates, which eventually use up all the free memory, at which point the pengine process crashes.
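
To illustrate the kind of duplication described here: the backtrace in the description ends in _log_dcs_new_cs(), reached via qb_log_callsite_get(), so the duplicates are presumably log callsite records. The sketch below is not libqb's code and does not reproduce the upstream patch; struct callsite, callsite_get() and callsite_count are hypothetical names. It only shows the general idea of a registry that compares a callsite's strings by content before allocating, so repeated logging from one location reuses a single record, and that returns NULL on allocation failure instead of asserting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callsite record for illustration; not libqb's real struct. */
struct callsite {
    char *function;
    char *filename;
    char *format;
    uint8_t priority;
    uint32_t lineno;
    struct callsite *next;
};

static struct callsite *callsites; /* head of the registry list */
size_t callsite_count;             /* number of records allocated so far */

/* Copy a string so the record does not depend on the caller's buffer. */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy != NULL) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* Compare by content, not by pointer: the same text may arrive at
 * different addresses on each call. */
static int callsite_matches(const struct callsite *cs, const char *function,
                            const char *filename, const char *format,
                            uint8_t priority, uint32_t lineno)
{
    return cs->lineno == lineno && cs->priority == priority
        && strcmp(cs->function, function) == 0
        && strcmp(cs->filename, filename) == 0
        && strcmp(cs->format, format) == 0;
}

/* Return the existing record for this location, or create one.
 * Returns NULL if memory cannot be allocated. */
struct callsite *callsite_get(const char *function, const char *filename,
                              const char *format, uint8_t priority,
                              uint32_t lineno, int *newly_created)
{
    struct callsite *cs;

    assert(newly_created != NULL); /* caller error, not a runtime failure */

    for (cs = callsites; cs != NULL; cs = cs->next) {
        if (callsite_matches(cs, function, filename, format,
                             priority, lineno)) {
            *newly_created = 0;
            return cs; /* reuse: no new allocation */
        }
    }

    cs = calloc(1, sizeof(*cs));
    if (cs == NULL) {
        return NULL;
    }
    cs->function = copy_string(function);
    cs->filename = copy_string(filename);
    cs->format = copy_string(format);
    if (cs->function == NULL || cs->filename == NULL || cs->format == NULL) {
        free(cs->function);
        free(cs->filename);
        free(cs->format);
        free(cs);
        return NULL;
    }
    cs->priority = priority;
    cs->lineno = lineno;
    cs->next = callsites;
    callsites = cs;
    callsite_count++;
    *newly_created = 1;
    return cs;
}
```

With a lookup like this, calling callsite_get() repeatedly for native.c:472 leaves callsite_count at 1. If the lookup misses (for example, by comparing pointers where the strings are supplied dynamically), every call allocates a new record and memory use grows with uptime, which would match "the reproducer is time". A linked list keeps the sketch short; a real registry would want a faster lookup.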

Comment 6 Taneli Leppä 2013-06-05 06:35:12 UTC
Is the updated RPM libqb-0.14.4-7.38.07c9.dirty.el6.x86_64.rpm? Seems kind of old.

Comment 7 Andrew Beekhof 2013-06-07 07:00:01 UTC
It's new enough to have the fix.

Comment 16 Andrew Beekhof 2013-07-26 01:52:38 UTC
Dropping TechPreview keyword due to bug #987355

Comment 20 errata-xmlrpc 2013-11-21 11:53:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1634.html

