Bug 1459276

Summary: cib dumped core during cluster stop
Product: Red Hat Enterprise Linux 7
Reporter: Miroslav Lisik <mlisik>
Component: libqb
Assignee: Jan Pokorný [poki] <jpokorny>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 7.4
CC: ccaulfie, cfeist, cluster-maint, jfriesse, jpokorny, kgaillot, kwenning, mjuricek, tlavigne
Target Milestone: rc
Keywords: Regression, Reopened
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libqb-1.0.1-4.el7
Clone(s): 1473695 (view as bug list)
Last Closed: 2017-08-01 18:00:14 UTC
Type: Bug
Bug Blocks: 1473695
Attachments:
  coredump file (flags: none)
  sosreport archive (flags: none)

Description Miroslav Lisik 2017-06-06 17:14:07 UTC
Created attachment 1285473 [details]
coredump file

Description of problem:

Cib dumped core during cluster stop.


Version-Release number of selected component (if applicable):
pacemaker-1.1.16-10.el7.x86_64
libqb-1.0.1-3.el7.x86_64
corosync-2.4.0-9.el7.x86_64

How reproducible:
always


Steps to Reproduce:

1. Stop the pacemaker cluster using the pcs utility:

pcs cluster stop --all


Actual results:

Coredump observed.

Expected results:

No coredump.

Additional info:

snippet from /var/log/messages:

Jun  6 18:23:58 virt-143 cib[12111]:  notice: Caught 'Terminated' signal
Jun  6 18:23:59 virt-143 abrt-hook-ccpp: Process 12111 (cib) of user 189 killed by SIGBUS - dumping core
Jun  6 18:24:00 virt-143 pacemakerd[12110]:   error: Managed process 12111 (cib) dumped core
Jun  6 18:24:00 virt-143 pacemakerd[12110]:   error: The cib process (12111) terminated with signal 7 (core=1)
Jun  6 18:24:00 virt-143 pacemakerd[12110]:  notice: Shutdown complete
Jun  6 18:24:00 virt-143 systemd: Stopped Pacemaker High Availability Cluster Manager.
Jun  6 18:24:01 virt-143 abrt-server: Duplicate: core backtrace

from gdb:

(gdb) set print pretty on
(gdb) t a a bt full

Thread 1 (Thread 0x7fdba00727c0 (LWP 25718)):
#0  qb_atomic_int_set (atomic=0x7fdb9ff6d00c, newval=newval@entry=1) at unix.c:508
No locals.
#1  0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
        rb_res = 0x55a960a2c220
#2  qb_ipcs_shm_disconnect (c=0x55a9609ab920) at ipc_shm.c:233
        c = 0x55a9609ab920
#3  0x00007fdb9e11ed6c in qb_ipcs_connection_unref (c=c@entry=0x55a9609ab920) at ipcs.c:588
        c = 0x55a9609ab920
        free_it = <optimized out>
#4  0x000055a96025e19b in cib_shutdown (nsig=<optimized out>) at callbacks.c:1636
        last = 0x55a9609ab920
        disconnects = 0
        c = 0x0
        srv_stats = {
          active_connections = 2676196898,
          closed_connections = 32731
        }
        __func__ = "cib_shutdown"
#5  0x00007fdb9f826eee in crm_signal_dispatch (source=0x55a9606c5b20, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:281
        sig = 0x55a9606c5b20
        __func__ = "crm_signal_dispatch"
#6  0x00007fdb9cc524c9 in g_main_dispatch (context=0x55a9606c4490) at gmain.c:3201
        dispatch = 0x7fdb9f826e90 <crm_signal_dispatch>
        prev_source = 0x0
        was_in_call = 0
        user_data = 0x0
        callback = 0x0
        cb_funcs = 0x0
        cb_data = 0x0
        need_destroy = <optimized out>
        source = 0x55a9606c5b20
        current = 0x55a9606c6470
        i = 0
#7  g_main_context_dispatch (context=context@entry=0x55a9606c4490) at gmain.c:3854
No locals.
#8  0x00007fdb9cc52818 in g_main_context_iterate (context=0x55a9606c4490, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3927
        max_priority = 2147483647
        timeout = 500
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 10
        fds = 0x55a960a2dee0
#9  0x00007fdb9cc52aea in g_main_loop_run (loop=0x55a9609a54c0) at gmain.c:4123
        __FUNCTION__ = "g_main_loop_run"
#10 0x000055a96025ea8e in cib_init () at main.c:543
        __func__ = "cib_init"
#11 0x000055a960253775 in main (argc=<optimized out>, argv=0x7fff043edb48) at main.c:246
        flag = <optimized out>
        rc = <optimized out>
        index = 0
        argerr = <optimized out>
        pwentry = <optimized out>
        __func__ = "main"
        __FUNCTION__ = "main"

Comment 2 Miroslav Lisik 2017-06-06 17:16:48 UTC
Created attachment 1285475 [details]
sosreport archive

Comment 4 Red Hat Bugzilla Rules Engine 2017-06-06 18:14:02 UTC
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 6 Jan Pokorný [poki] 2017-06-07 14:27:50 UTC
Looks like I am first on the blame list:
https://github.com/ClusterLabs/libqb/commit/189ca28
(via https://github.com/ClusterLabs/libqb/pull/231)

Hence assigning to myself.

Comment 7 Jan Pokorný [poki] 2017-06-08 13:00:06 UTC
It seems that issue is with the interference between
- qb_ipcs_disconnect
- qb_ipcs_connection_unref
in pacemaker/cib/callbacks.c:

1628│         c = qb_ipcs_connection_first_get(ipcs_ro);
1629│         while (c != NULL) {
1630│             qb_ipcs_connection_t *last = c;
1631│
1632│             c = qb_ipcs_connection_next_get(ipcs_ro, last);
1633│
1634│             crm_debug("Disconnecting r/o client %p...", last);
1635│             qb_ipcs_disconnect(last);
1636├>            qb_ipcs_connection_unref(last);
1637│             disconnects++;
1638│         }

as both may end up (and I think this is the case here) performing
the same final qb_ipcs_connection_unref, hence both may trigger
qb_ipcs_shm_disconnect, and both then call qb_rb_close on the same
chunk of mmap'd memory...


This needs to be investigated in depth.

Note that the pacemaker code in question simply follows the libqb
instructions for qb_ipcs_connection_{first,next}_get, i.e.,

>  call qb_ipcs_connection_unref() after using the connection

so any change in libqb should stay compatible with this (idempotency)
expectation.

Comment 8 Christine Caulfield 2017-06-20 14:59:22 UTC
Due to time constraints I've reverted patch 189ca28 for RHEL 7.4. It still needs fixing upstream, though.

Comment 10 Jan Pokorný [poki] 2017-06-20 16:58:42 UTC
Principally, I've identified two issues:

1. ref counting in lib/ringbuffer.c is quite pointless
   (underused; a design sketch that never saw practical use)

2. non-idempotent disconnect "methods" (at least for ipc_shm)
   combined with a missing "closed" state, meaning the connection
   may end up cycling on SHUTTING_DOWN; it was more idempotent
   prior to the referenced change, but either way there is still
   a conceptual gap

Comment 11 Jan Pokorný [poki] 2017-06-21 13:50:34 UTC
Digging more into the provided coredump:

(gdb) f
> #1  0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
> 125		qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
(gdb) l
> 120		if (rb_res == NULL) {
> 121			return NULL;
> 122		}
> 123		*rb = NULL;
> 124		/* qb_rb_close will get rid of this "last reference" */
> 125		qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
> 126	
> 127		return rb_res;
> 128	}
> 129	
(gdb) p rb_res->shared_hdr
> $1 = (struct qb_ringbuffer_shared_s *) 0x7fdb9ff6b000
(gdb) info proc mappings
> Mapped address spaces:
>           Start Addr           End Addr       Size     Offset objfile
> 
> [...]
>       0x7fdb9ff68000     0x7fdb9ff6b000     0x3000        0x0 /dev/shm/qb-cib_ro-event-25718-25699-11-header

(Note: rb_res->shared_hdr, 0x7fdb9ff6b000, is exactly the End Addr of that
/dev/shm header mapping, i.e. it points just past the mapped region, which
would be consistent with the SIGBUS on the ref_count write.)

Comment 12 Jan Pokorný [poki] 2017-06-21 14:43:47 UTC
> 1. ref counting in lib/ringbuffer.c is quite pointless
>    (underused, like just some design glimpse never turned
>    into practical use)

See also
https://github.com/ClusterLabs/libqb/commit/0207003da5b3b45e2328a6f35dc70b0e77e6a1df

Comment 30 errata-xmlrpc 2017-08-01 18:00:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1896