Comment 7 Jan Pokorný [poki]
2017-06-08 13:00:06 UTC
It seems the issue is the interference between
- qb_ipcs_disconnect
- qb_ipcs_connection_unref
in pacemaker/cib/callbacks.c:
1628│ c = qb_ipcs_connection_first_get(ipcs_ro);
1629│ while (c != NULL) {
1630│     qb_ipcs_connection_t *last = c;
1631│
1632│     c = qb_ipcs_connection_next_get(ipcs_ro, last);
1633│
1634│     crm_debug("Disconnecting r/o client %p...", last);
1635│     qb_ipcs_disconnect(last);
1636├>    qb_ipcs_connection_unref(last);
1637│     disconnects++;
1638│ }
as both may end up (and I think this is the case here) performing the
same qb_ipcs_connection_unref, so both may trigger
qb_ipcs_shm_disconnect, and each of those calls qb_rb_close for the same
chunk of mmap'd memory...
Need to investigate that in depth.
Note that pacemaker code in question just follows the libqb instructions
for qb_ipcs_connection_{first,next}_get, i.e.,
> call qb_ipcs_connection_unref() after using the connection
So any changes in libqb should stay compatible with that (idempotency)
expectation.
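To make the suspected interference concrete, here is a toy model of the hypothesis above. It is a sketch only: every name in it (toy_conn, toy_teardown, toy_disconnect, toy_unref, toy_shutdown_one) is hypothetical, and it assumes, without having confirmed it in libqb, that both entry points reach the same unguarded teardown.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not libqb API. */
struct toy_conn {
    void *ring;        /* stands in for the mmap-backed ring buffer */
    int closes;        /* teardown attempts that reached the "close" step */
    int double_close;  /* set when teardown hits an already released ring */
};

/* Unguarded teardown, the stand-in for the disconnect -> close chain. */
static void toy_teardown(struct toy_conn *c)
{
    if (c->ring == NULL) {
        c->double_close = 1;  /* real code would touch released memory here */
    }
    c->ring = NULL;
    c->closes++;
}

/* ASSUMPTION: both entry points end in the same teardown. */
static void toy_disconnect(struct toy_conn *c) { toy_teardown(c); }
static void toy_unref(struct toy_conn *c)      { toy_teardown(c); }

/* Mirrors the caller's sequence for a single connection. */
static int toy_shutdown_one(struct toy_conn *c)
{
    assert(c != NULL);
    toy_disconnect(c);
    toy_unref(c);
    return c->closes;
}
```

In this model a single pass of the caller's disconnect-then-unref sequence reaches the close step twice, which is the behaviour any libqb-side fix would have to rule out while leaving the caller's pattern valid.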
Comment 8 Christine Caulfield
2017-06-20 14:59:22 UTC
Due to time constraints I've reverted patch 189ca28 for RHEL7.4. It still needs fixing upstream, though.
Comment 10 Jan Pokorný [poki]
2017-06-20 16:58:42 UTC
Principally, I've identified two issues:
1. ref counting in lib/ringbuffer.c is quite pointless
(underused, like a design idea that was sketched out but never
put into practical use)
2. non-idempotent disconnect "methods" (at least for ipc_shm)
vs. a missing "closed" state (meaning the connection may end
up cycling on SHUTTING_DOWN) -- well, it was more idempotent
prior to that referenced change, but anyway, there's still
a conceptual gap
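One possible shape for closing the gap in point 2, sketched under assumptions: an explicit terminal state that turns a repeated disconnect into a no-op. The enum, struct and function names below (toy_state, toy_link, toy_link_disconnect) are made up for illustration and do not correspond to libqb's internal state names or API.

```c
#include <assert.h>

/* Hypothetical states; libqb's real state set may differ. */
enum toy_state { TOY_ACTIVE, TOY_SHUTTING_DOWN, TOY_CLOSED };

struct toy_link {
    enum toy_state state;
    int teardowns;  /* how many times the backing resource was released */
};

/*
 * Idempotent disconnect: the first call performs the teardown and parks
 * the link in TOY_CLOSED; later calls see the terminal state and return
 * without touching the resource again.
 * Returns 1 if this call did the teardown, 0 if it was a no-op.
 */
static int toy_link_disconnect(struct toy_link *l)
{
    assert(l != 0);
    if (l->state == TOY_CLOSED) {
        return 0;
    }
    l->state = TOY_SHUTTING_DOWN;
    l->teardowns++;           /* release the resource exactly once */
    l->state = TOY_CLOSED;    /* terminal: no cycling on SHUTTING_DOWN */
    return 1;
}
```

With a terminal state like this, calling the disconnect from two paths would release the resource only once, regardless of which path gets there first.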
Comment 11 Jan Pokorný [poki]
2017-06-21 13:50:34 UTC
Digging more into the provided coredump:
(gdb) f
> #1 0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
> 125 qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
(gdb) l
> 120 if (rb_res == NULL) {
> 121 return NULL;
> 122 }
> 123 *rb = NULL;
> 124 /* qb_rb_close will get rid of this "last reference" */
> 125 qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
> 126
> 127 return rb_res;
> 128 }
> 129
(gdb) p rb_res->shared_hdr
> $1 = (struct qb_ringbuffer_shared_s *) 0x7fdb9ff6b000
(gdb) info proc mappings
> Mapped address spaces:
> Start Addr End Addr Size Offset objfile
>
> [...]
> 0x7fdb9ff68000 0x7fdb9ff6b000 0x3000 0x0 /dev/shm/qb-cib_ro-event-25718-25699-11-header
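The addresses above can be compared with a bit of arithmetic. The sketch below only uses the values printed by gdb and treats End Addr as exclusive, as in /proc/PID/maps. The helper names (struct mapping, addr_in_mapping) are hypothetical. Since the listing elides other mappings, this shows only that shared_hdr lies outside the one header mapping shown, not that it is unmapped altogether.

```c
#include <assert.h>
#include <stdint.h>

/* One row of the mapping table; end is exclusive. */
struct mapping {
    uint64_t start;
    uint64_t end;
};

/* True when addr falls inside [start, end). */
static int addr_in_mapping(uint64_t addr, struct mapping m)
{
    assert(m.start <= m.end);
    return addr >= m.start && addr < m.end;
}

/* Values copied from the gdb session above. */
static const struct mapping header_map = {
    0x7fdb9ff68000ULL,  /* Start Addr of the ...-header mapping */
    0x7fdb9ff6b000ULL   /* End Addr (first byte past the mapping) */
};
static const uint64_t shared_hdr = 0x7fdb9ff6b000ULL;  /* $1 from gdb */
```

Here addr_in_mapping(shared_hdr, header_map) is false: the pointer equals the End Addr of the 0x3000-byte header mapping, i.e. it points at the first byte past it.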
Comment 12 Jan Pokorný [poki]
2017-06-21 14:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:1896
Created attachment 1285473 [details]
coredump file

Description of problem:
Cib dumped core during cluster stop.

Version-Release number of selected component (if applicable):
pacemaker-1.1.16-10.el7.x86_64
libqb-1.0.1-3.el7.x86_64
corosync-2.4.0-9.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Stop pacemaker cluster by using pcs utility
pcs cluster stop --all

Actual results:
Coredump observed.

Expected results:
No coredump.

Additional info:
snippet from /var/log/messages:
Jun 6 18:23:58 virt-143 cib[12111]: notice: Caught 'Terminated' signal
Jun 6 18:23:59 virt-143 abrt-hook-ccpp: Process 12111 (cib) of user 189 killed by SIGBUS - dumping core
Jun 6 18:24:00 virt-143 pacemakerd[12110]: error: Managed process 12111 (cib) dumped core
Jun 6 18:24:00 virt-143 pacemakerd[12110]: error: The cib process (12111) terminated with signal 7 (core=1)
Jun 6 18:24:00 virt-143 pacemakerd[12110]: notice: Shutdown complete
Jun 6 18:24:00 virt-143 systemd: Stopped Pacemaker High Availability Cluster Manager.
Jun 6 18:24:01 virt-143 abrt-server: Duplicate: core backtrace

from gdb:
(gdb) set print pretty on
(gdb) t a a bt full
Thread 1 (Thread 0x7fdba00727c0 (LWP 25718)):
#0 qb_atomic_int_set (atomic=0x7fdb9ff6d00c, newval=newval@entry=1) at unix.c:508
No locals.
#1 0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
        rb_res = 0x55a960a2c220
#2 qb_ipcs_shm_disconnect (c=0x55a9609ab920) at ipc_shm.c:233
        c = 0x55a9609ab920
#3 0x00007fdb9e11ed6c in qb_ipcs_connection_unref (c=c@entry=0x55a9609ab920) at ipcs.c:588
        c = 0x55a9609ab920
        free_it = <optimized out>
#4 0x000055a96025e19b in cib_shutdown (nsig=<optimized out>) at callbacks.c:1636
        last = 0x55a9609ab920
        disconnects = 0
        c = 0x0
        srv_stats = {
          active_connections = 2676196898,
          closed_connections = 32731
        }
        __func__ = "cib_shutdown"
#5 0x00007fdb9f826eee in crm_signal_dispatch (source=0x55a9606c5b20, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:281
        sig = 0x55a9606c5b20
        __func__ = "crm_signal_dispatch"
#6 0x00007fdb9cc524c9 in g_main_dispatch (context=0x55a9606c4490) at gmain.c:3201
        dispatch = 0x7fdb9f826e90 <crm_signal_dispatch>
        prev_source = 0x0
        was_in_call = 0
        user_data = 0x0
        callback = 0x0
        cb_funcs = 0x0
        cb_data = 0x0
        need_destroy = <optimized out>
        source = 0x55a9606c5b20
        current = 0x55a9606c6470
        i = 0
#7 g_main_context_dispatch (context=context@entry=0x55a9606c4490) at gmain.c:3854
No locals.
#8 0x00007fdb9cc52818 in g_main_context_iterate (context=0x55a9606c4490, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3927
        max_priority = 2147483647
        timeout = 500
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 10
        fds = 0x55a960a2dee0
#9 0x00007fdb9cc52aea in g_main_loop_run (loop=0x55a9609a54c0) at gmain.c:4123
        __FUNCTION__ = "g_main_loop_run"
#10 0x000055a96025ea8e in cib_init () at main.c:543
        __func__ = "cib_init"
#11 0x000055a960253775 in main (argc=<optimized out>, argv=0x7fff043edb48) at main.c:246
        flag = <optimized out>
        rc = <optimized out>
        index = 0
        argerr = <optimized out>
        pwentry = <optimized out>
        __func__ = "main"
        __FUNCTION__ = "main"