Bug 1459276 - cib dumped core during cluster stop
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libqb
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Jan Pokorný
QA Contact: cluster-qe@redhat.com
Keywords: Regression, Reopened
Blocks: 1473695
Reported: 2017-06-06 13:14 EDT by Miroslav Lisik
Modified: 2017-08-01 14:00 EDT
CC: 9 users

Fixed In Version: libqb-1.0.1-4.el7
Cloned To: 1473695
Last Closed: 2017-08-01 14:00:14 EDT
Type: Bug


Attachments:
  coredump file (6.95 MB, application/x-core), 2017-06-06 13:14 EDT, Miroslav Lisik
  sosreport archive (8.70 MB, application/x-xz), 2017-06-06 13:16 EDT, Miroslav Lisik

Description Miroslav Lisik 2017-06-06 13:14:07 EDT
Created attachment 1285473 [details]
coredump file

Description of problem:

Cib dumped core during cluster stop.


Version-Release number of selected component (if applicable):
pacemaker-1.1.16-10.el7.x86_64
libqb-1.0.1-3.el7.x86_64
corosync-2.4.0-9.el7.x86_64

How reproducible:
always


Steps to Reproduce:

1. Stop pacemaker cluster by using pcs utility

pcs cluster stop --all


Actual results:

Coredump observed.

Expected results:

No coredump.

Additional info:

snippet from /var/log/messages:

Jun  6 18:23:58 virt-143 cib[12111]:  notice: Caught 'Terminated' signal
Jun  6 18:23:59 virt-143 abrt-hook-ccpp: Process 12111 (cib) of user 189 killed by SIGBUS - dumping core
Jun  6 18:24:00 virt-143 pacemakerd[12110]:   error: Managed process 12111 (cib) dumped core
Jun  6 18:24:00 virt-143 pacemakerd[12110]:   error: The cib process (12111) terminated with signal 7 (core=1)
Jun  6 18:24:00 virt-143 pacemakerd[12110]:  notice: Shutdown complete
Jun  6 18:24:00 virt-143 systemd: Stopped Pacemaker High Availability Cluster Manager.
Jun  6 18:24:01 virt-143 abrt-server: Duplicate: core backtrace

from gdb:

(gdb) set print pretty on
(gdb) t a a bt full

Thread 1 (Thread 0x7fdba00727c0 (LWP 25718)):
#0  qb_atomic_int_set (atomic=0x7fdb9ff6d00c, newval=newval@entry=1) at unix.c:508
No locals.
#1  0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
        rb_res = 0x55a960a2c220
#2  qb_ipcs_shm_disconnect (c=0x55a9609ab920) at ipc_shm.c:233
        c = 0x55a9609ab920
#3  0x00007fdb9e11ed6c in qb_ipcs_connection_unref (c=c@entry=0x55a9609ab920) at ipcs.c:588
        c = 0x55a9609ab920
        free_it = <optimized out>
#4  0x000055a96025e19b in cib_shutdown (nsig=<optimized out>) at callbacks.c:1636
        last = 0x55a9609ab920
        disconnects = 0
        c = 0x0
        srv_stats = {
          active_connections = 2676196898,
          closed_connections = 32731
        }
        __func__ = "cib_shutdown"
#5  0x00007fdb9f826eee in crm_signal_dispatch (source=0x55a9606c5b20, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:281
        sig = 0x55a9606c5b20
        __func__ = "crm_signal_dispatch"
#6  0x00007fdb9cc524c9 in g_main_dispatch (context=0x55a9606c4490) at gmain.c:3201
        dispatch = 0x7fdb9f826e90 <crm_signal_dispatch>
        prev_source = 0x0
        was_in_call = 0
        user_data = 0x0
        callback = 0x0
        cb_funcs = 0x0
        cb_data = 0x0
        need_destroy = <optimized out>
        source = 0x55a9606c5b20
        current = 0x55a9606c6470
        i = 0
#7  g_main_context_dispatch (context=context@entry=0x55a9606c4490) at gmain.c:3854
No locals.
#8  0x00007fdb9cc52818 in g_main_context_iterate (context=0x55a9606c4490, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3927
        max_priority = 2147483647
        timeout = 500
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 10
        fds = 0x55a960a2dee0
#9  0x00007fdb9cc52aea in g_main_loop_run (loop=0x55a9609a54c0) at gmain.c:4123
        __FUNCTION__ = "g_main_loop_run"
#10 0x000055a96025ea8e in cib_init () at main.c:543
        __func__ = "cib_init"
#11 0x000055a960253775 in main (argc=<optimized out>, argv=0x7fff043edb48) at main.c:246
        flag = <optimized out>
        rc = <optimized out>
        index = 0
        argerr = <optimized out>
        pwentry = <optimized out>
        __func__ = "main"
        __FUNCTION__ = "main"
Comment 2 Miroslav Lisik 2017-06-06 13:16 EDT
Created attachment 1285475 [details]
sosreport archive
Comment 4 Red Hat Bugzilla Rules Engine 2017-06-06 14:14:02 EDT
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Comment 6 Jan Pokorný 2017-06-07 10:27:50 EDT
Looks like I am the first one on the blame list:
https://github.com/ClusterLabs/libqb/commit/189ca28
(via https://github.com/ClusterLabs/libqb/pull/231)

Hence assigning to myself.
Comment 7 Jan Pokorný 2017-06-08 09:00:06 EDT
It seems the issue is interference between
- qb_ipcs_disconnect
- qb_ipcs_connection_unref
as used in pacemaker/cib/callbacks.c:

1628│         c = qb_ipcs_connection_first_get(ipcs_ro);
1629│         while (c != NULL) {
1630│             qb_ipcs_connection_t *last = c;
1631│
1632│             c = qb_ipcs_connection_next_get(ipcs_ro, last);
1633│
1634│             crm_debug("Disconnecting r/o client %p...", last);
1635│             qb_ipcs_disconnect(last);
1636├>            qb_ipcs_connection_unref(last);
1637│             disconnects++;
1638│         }

as both may (and I think this is the case here) end up doing the same
qb_ipcs_connection_unref, so both can trigger qb_ipcs_shm_disconnect,
and qb_rb_close then runs twice against the same chunk of mmap'd
memory...
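
To make the suspected flow concrete, here is a condensed sketch
(hypothetical stand-in names, not the actual libqb code); the teardown
just counts its invocations instead of really unmapping anything:

#include <stdio.h>

/* Hypothetical stand-ins for qb_ipcs_connection_t and its shm ring
 * buffers. */
struct conn {
    int refcount;
    int teardowns;
};

static void shm_disconnect(struct conn *c)
{
    /* No "already torn down" guard here, mirroring the suspected
     * gap: in libqb a second call would qb_rb_close() the same
     * mmap'd ring buffers again. */
    c->teardowns++;
}

static void conn_unref(struct conn *c)
{
    if (--c->refcount == 0) {
        shm_disconnect(c);      /* last reference: tear down */
    }
}

static void conn_disconnect(struct conn *c)
{
    shm_disconnect(c);          /* teardown #1, from the disconnect path */
}

int main(void)
{
    struct conn c = { .refcount = 1, .teardowns = 0 };

    conn_disconnect(&c);        /* like qb_ipcs_disconnect(last) */
    conn_unref(&c);             /* like qb_ipcs_connection_unref(last) */
    printf("teardown ran %d time(s)\n", c.teardowns);   /* prints 2 */
    return 0;
}

Either an "already closed" guard in the teardown, or making the
refcount its sole arbiter, would turn the second path into a no-op.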


Need to investigate that in depth.

Note that the pacemaker code in question just follows the libqb
documentation for qb_ipcs_connection_{first,next}_get, i.e.,

>  call qb_ipcs_connection_unref() after using the connection

so any change in libqb should stay compatible with this (idempotency)
expectation.
Comment 8 Christine Caulfield 2017-06-20 10:59:22 EDT
Due to time constraints I've reverted patch 189ca28 for RHEL 7.4. It still needs fixing upstream, though.
Comment 10 Jan Pokorný 2017-06-20 12:58:42 EDT
Principally, I've identified two issues:

1. ref counting in lib/ringbuffer.c is quite pointless
   (underused; it looks like a design sketch that never got
   turned into practical use)

2. non-idempotent disconnect "methods" (at least for ipc_shm)
   vs. a missing "closed" state (meaning the connection may end
   up cycling in SHUTTING_DOWN) -- disconnect was admittedly more
   idempotent prior to the referenced change, but there's still
   a conceptual gap (see the sketch below)
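
To sketch the missing-state idea from item 2 (hypothetical, condensed
code, not a patch against libqb's actual connection handling):

enum conn_state {
    CONN_ACTIVE,
    CONN_SHUTTING_DOWN,
    CONN_CLOSED                 /* the terminal state missing today */
};

struct conn {
    enum conn_state state;
    /* ... ring buffers, fds, refcount ... */
};

static void conn_disconnect(struct conn *c)
{
    if (c->state == CONN_CLOSED) {
        return;                 /* idempotent: repeat callers are no-ops */
    }
    c->state = CONN_CLOSED;     /* can no longer cycle on SHUTTING_DOWN */
    /* qb_rb_close()-style cleanup would happen here, exactly once */
}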
Comment 11 Jan Pokorný 2017-06-21 09:50:34 EDT
Digging more into the provided coredump:

(gdb) f
> #1  0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
> 125		qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
(gdb) l
> 120		if (rb_res == NULL) {
> 121			return NULL;
> 122		}
> 123		*rb = NULL;
> 124		/* qb_rb_close will get rid of this "last reference" */
> 125		qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
> 126	
> 127		return rb_res;
> 128	}
> 129	
(gdb) p rb_res->shared_hdr
> $1 = (struct qb_ringbuffer_shared_s *) 0x7fdb9ff6b000
(gdb) info proc mappings
> Mapped address spaces:
>           Start Addr           End Addr       Size     Offset objfile
> 
> [...]
>       0x7fdb9ff68000     0x7fdb9ff6b000     0x3000        0x0 /dev/shm/qb-cib_ro-event-25718-25699-11-header

So shared_hdr sits right at the end of what remains mapped for that
ring buffer's header file; the SIGBUS on the qb_atomic_int_set() store
is consistent with the first teardown having already released the
ring buffer underneath us.
Comment 12 Jan Pokorný 2017-06-21 10:43:47 EDT
> 1. ref counting in lib/ringbuffer.c is quite pointless
>    (underused, like just some design glimpse never turned
>    into practical use)

See also
https://github.com/ClusterLabs/libqb/commit/0207003da5b3b45e2328a6f35dc70b0e77e6a1df
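
For context, the contrast in condensed, hypothetical form (not the
actual lib/ringbuffer.c code): force-setting the counter, as in the
qb_rb_lastref_and_ret() excerpt quoted in comment 11, discards
whatever other holders exist, whereas practical refcounting lets the
atomic decrement alone decide who tears down:

#include <stdatomic.h>
#include <stdbool.h>

struct rb {
    atomic_int ref_count;
    /* ... shared header, data ... */
};

/* Underused: blindly overwriting the count means it no longer
 * reflects how many holders actually exist. */
static void rb_force_last_ref(struct rb *r)
{
    atomic_store(&r->ref_count, 1);
}

/* Practical: only the holder whose decrement reaches zero frees,
 * so two concurrent release paths cannot both tear down. */
static bool rb_unref(struct rb *r)
{
    return atomic_fetch_sub(&r->ref_count, 1) == 1;
}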
Comment 30 errata-xmlrpc 2017-08-01 14:00:14 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1896
