Bug 1459276
Summary: cib dumped core during cluster stop
Product: Red Hat Enterprise Linux 7
Component: libqb
Version: 7.4
Reporter: Miroslav Lisik <mlisik>
Assignee: Jan Pokorný [poki] <jpokorny>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: ccaulfie, cfeist, cluster-maint, jfriesse, jpokorny, kgaillot, kwenning, mjuricek, tlavigne
Target Milestone: rc
Keywords: Regression, Reopened
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libqb-1.0.1-4.el7
Doc Type: If docs needed, set a value
Clone(s): 1473695 (view as bug list)
Bug Blocks: 1473695
Last Closed: 2017-08-01 18:00:14 UTC
Type: Bug
Created attachment 1285475 [details]
sosreport archive
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Looks like I am the first on the blame list: https://github.com/ClusterLabs/libqb/commit/189ca28 (via https://github.com/ClusterLabs/libqb/pull/231). Hence assigning to myself.

It seems the issue is interference between
- qb_ipcs_disconnect
- qb_ipcs_connection_unref
in pacemaker/cib/callbacks.c:
1628│ c = qb_ipcs_connection_first_get(ipcs_ro);
1629│ while (c != NULL) {
1630│ qb_ipcs_connection_t *last = c;
1631│
1632│ c = qb_ipcs_connection_next_get(ipcs_ro, last);
1633│
1634│ crm_debug("Disconnecting r/o client %p...", last);
1635│ qb_ipcs_disconnect(last);
1636├> qb_ipcs_connection_unref(last);
1637│ disconnects++;
1638│ }
as both may end up (and I think this is the case here) performing the same
qb_ipcs_connection_unref, so both may trigger
qb_ipcs_shm_disconnect, and both in turn call qb_rb_close on the same
chunk of mmap'd memory...
Need to investigate that in depth.
Note that the pacemaker code in question simply follows the libqb
documentation for qb_ipcs_connection_{first,next}_get, i.e.,
> call qb_ipcs_connection_unref() after using the connection
So any changes in libqb should stay compatible with this (idempotency)
expectation.
Due to time constraints I've reverted patch 189ca28 for RHEL 7.4. It still needs fixing upstream, though.

Principally, I've identified two issues:

1. ref counting in lib/ringbuffer.c is quite pointless (underused, like just some design glimpse never turned into practical use)
2. non-idempotent disconnect "methods" (at least for ipc_shm) vs. a missing "closed" state (meaning the connection may end up cycling on SHUTTING_DOWN) -- well, it was more idempotent prior to that referenced change, but anyway, there's still a conceptual gap

Digging more into the provided coredump:

(gdb) f
> #1 0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
> 125         qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
(gdb) l
> 120         if (rb_res == NULL) {
> 121                 return NULL;
> 122         }
> 123         *rb = NULL;
> 124         /* qb_rb_close will get rid of this "last reference" */
> 125         qb_atomic_int_set(&rb_res->shared_hdr->ref_count, 1);
> 126
> 127         return rb_res;
> 128 }
> 129
(gdb) p rb_res->shared_hdr
> $1 = (struct qb_ringbuffer_shared_s *) 0x7fdb9ff6b000
(gdb) info proc mappings
> Mapped address spaces:
>       Start Addr       End Addr     Size   Offset  objfile
> [...]
>   0x7fdb9ff68000 0x7fdb9ff6b000   0x3000      0x0  /dev/shm/qb-cib_ro-event-25718-25699-11-header

> 1. ref counting in lib/ringbuffer.c is quite pointless
>    (underused, like just some design glimpse never turned
>    into practical use)

See also https://github.com/ClusterLabs/libqb/commit/0207003da5b3b45e2328a6f35dc70b0e77e6a1df

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1896
Created attachment 1285473 [details]
coredump file

Description of problem:
Cib dumped core during cluster stop.

Version-Release number of selected component (if applicable):
pacemaker-1.1.16-10.el7.x86_64
libqb-1.0.1-3.el7.x86_64
corosync-2.4.0-9.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Stop the pacemaker cluster using the pcs utility:
   pcs cluster stop --all

Actual results:
Coredump observed.

Expected results:
No coredump.

Additional info:

snippet from /var/log/messages:

Jun  6 18:23:58 virt-143 cib[12111]: notice: Caught 'Terminated' signal
Jun  6 18:23:59 virt-143 abrt-hook-ccpp: Process 12111 (cib) of user 189 killed by SIGBUS - dumping core
Jun  6 18:24:00 virt-143 pacemakerd[12110]: error: Managed process 12111 (cib) dumped core
Jun  6 18:24:00 virt-143 pacemakerd[12110]: error: The cib process (12111) terminated with signal 7 (core=1)
Jun  6 18:24:00 virt-143 pacemakerd[12110]: notice: Shutdown complete
Jun  6 18:24:00 virt-143 systemd: Stopped Pacemaker High Availability Cluster Manager.
Jun  6 18:24:01 virt-143 abrt-server: Duplicate: core

backtrace from gdb:

(gdb) set print pretty on
(gdb) t a a bt full

Thread 1 (Thread 0x7fdba00727c0 (LWP 25718)):
#0  qb_atomic_int_set (atomic=0x7fdb9ff6d00c, newval=newval@entry=1) at unix.c:508
        No locals.
#1  0x00007fdb9e12066f in qb_rb_lastref_and_ret (rb=0x55a9609abba0) at ringbuffer_int.h:125
        rb_res = 0x55a960a2c220
#2  qb_ipcs_shm_disconnect (c=0x55a9609ab920) at ipc_shm.c:233
        c = 0x55a9609ab920
#3  0x00007fdb9e11ed6c in qb_ipcs_connection_unref (c=c@entry=0x55a9609ab920) at ipcs.c:588
        c = 0x55a9609ab920
        free_it = <optimized out>
#4  0x000055a96025e19b in cib_shutdown (nsig=<optimized out>) at callbacks.c:1636
        last = 0x55a9609ab920
        disconnects = 0
        c = 0x0
        srv_stats = {
          active_connections = 2676196898,
          closed_connections = 32731
        }
        __func__ = "cib_shutdown"
#5  0x00007fdb9f826eee in crm_signal_dispatch (source=0x55a9606c5b20, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:281
        sig = 0x55a9606c5b20
        __func__ = "crm_signal_dispatch"
#6  0x00007fdb9cc524c9 in g_main_dispatch (context=0x55a9606c4490) at gmain.c:3201
        dispatch = 0x7fdb9f826e90 <crm_signal_dispatch>
        prev_source = 0x0
        was_in_call = 0
        user_data = 0x0
        callback = 0x0
        cb_funcs = 0x0
        cb_data = 0x0
        need_destroy = <optimized out>
        source = 0x55a9606c5b20
        current = 0x55a9606c6470
        i = 0
#7  g_main_context_dispatch (context=context@entry=0x55a9606c4490) at gmain.c:3854
        No locals.
#8  0x00007fdb9cc52818 in g_main_context_iterate (context=0x55a9606c4490, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3927
        max_priority = 2147483647
        timeout = 500
        some_ready = 1
        nfds = <optimized out>
        allocated_nfds = 10
        fds = 0x55a960a2dee0
#9  0x00007fdb9cc52aea in g_main_loop_run (loop=0x55a9609a54c0) at gmain.c:4123
        __FUNCTION__ = "g_main_loop_run"
#10 0x000055a96025ea8e in cib_init () at main.c:543
        __func__ = "cib_init"
#11 0x000055a960253775 in main (argc=<optimized out>, argv=0x7fff043edb48) at main.c:246
        flag = <optimized out>
        rc = <optimized out>
        index = 0
        argerr = <optimized out>
        pwentry = <optimized out>
        __func__ = "main"
        __FUNCTION__ = "main"