Hide Forgot
Description of problem: service manager crashed on 2-node cluster when the other node issued clusvcadm -r service Version-Release number of selected component (if applicable): rgmanager-3.0.12-11.el6.x86_64 How reproducible: seen only once so far Steps to Reproduce: 1. clusvcadm -r service 2. 3. Actual results: coredump + reboot Expected results: service relocation Additional info: I haven't been able to reproduce it since then. It happened as a part of SNMP testing.
Core was generated by `rgmanager'. Program terminated with signal 11, Segmentation fault. #0 _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196 196 a0 = ((op_t *) srcp)[0]; (gdb) thread apply all bt full Thread 5 (Thread 0x7f79cb5fe700 (LWP 3127)): #0 0x0000003d8e8dc6c3 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87 __arg2 = 1 _a3 = 500 _a1 = 140161079827968 resultvar = <value optimized out> __arg3 = 500 __arg1 = 140161079827968 _a2 = 1 resultvar = <value optimized out> oldtype = 0 result = <value optimized out> #1 0x0000003d90424310 in socket_do_iteration (transport=0xbea900, flags=6, timeout_milliseconds=<value optimized out>) at dbus-transport-socket.c:1066 socket_transport = 0xbea900 poll_fd = {fd = 10, events = 1, revents = 0} poll_res = <value optimized out> poll_timeout = <value optimized out> #2 0x0000003d904226fd in _dbus_transport_do_iteration (transport=0xbea900, flags=<value optimized out>, timeout_milliseconds=<value optimized out>) at dbus-transport.c:956 No locals. #3 0x0000003d9040ebae in _dbus_connection_do_iteration_unlocked (connection=0xbeaf40, flags=6, timeout_milliseconds=500) at dbus-connection.c:1163 No locals. #4 0x0000003d90410ce0 in _dbus_connection_read_write_dispatch (connection=0xbeaf40, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3491 dstatus = <value optimized out> progress_possible = <value optimized out> #5 0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131 No locals. #6 0x0000003d8ec077e1 in start_thread (arg=0x7f79cb5fe700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7f79cb5fe700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161079830272, 6477719499060015742, 264388022912, 140161079830976, 0, 3, -6406672630050983298, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #7 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 4 (Thread 0x7f79d1dea700 (LWP 2202)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 No locals. #1 0x0000003f6de01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12/common/liblogthread/liblogthread.c:83 str = "State change: z4 UP\n", '\000' <repeats 107 times> e = <value optimized out> logtime = 1302517112 level = <value optimized out> prev_dropped = <value optimized out> #2 0x0000003d8ec077e1 in start_thread (arg=0x7f79d1dea700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7f79d1dea700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161188800256, 6477719499060015742, 264388022912, 140161188800960, 0, 3, -6406695730532584834, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #3 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 3 (Thread 0x7f79d1e7d700 (LWP 2201)): #0 0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82 No locals. #1 0x0000000000409c5f in event_loop (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:695 ret = <value optimized out> tv = {tv_sec = 4, tv_usec = 858871} n = 1 max = 13 rfds = {fds_bits = {12288, 0 <repeats 15 times>}} newctx = <value optimized out> nodeid = <value optimized out> #2 main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:1102 rv = -1 do_init = <value optimized out> foreground = <value optimized out> wd = 1 '\001' me = {cn_nodeid = 2, cn_address = {cna_addrlen = 16, cna_address = "\002\000\000\000\n\017Y\017", '\000' <repeats 19 times>}, cn_name = "z2", '\000' <repeats 253 times>, cn_member = 1, cn_incarnation = 1812, cn_jointime = {tv_sec = 1302517093, tv_usec = 475975}} cluster_ctx = 0xbf5870 local_ctx = 0xbf9710 th = 264375375472 clu = 0xbea370 Thread 2 (Thread 0x7f79d13e9700 (LWP 2965)): #0 0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82 No locals. #1 0x00000000004210de in poll_cluster_messages (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:186 ret = -1 fd = <value optimized out> tv = {tv_sec = 1, tv_usec = 988500} timeout = 2 rfds = {fds_bits = {80, 0 <repeats 15 times>}} lfd = 6 max = <value optimized out> p = 0x7f79d13e8e60 ch = 0xbea370 #2 cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:967 No locals. #3 0x0000003d8ec077e1 in start_thread (arg=0x7f79d13e9700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7f79d13e9700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161178310400, 6477719499060015742, 140733646842416, 140161178311104, 0, 3, -6406693805850365314, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #4 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals. Thread 1 (Thread 0x7f79cbfff700 (LWP 2966)): #0 _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196 a0 = <value optimized out> a1 = <value optimized out> a2 = 0 a3 = 0 sh_1 = 8 sh_2 = 56 #1 0x0000003d8e8826de in memmove (dest=0xbec080, src=<value optimized out>, len=18446744073709551447) at memmove.c:73 dstp = <value optimized out> srcp = <value optimized out> #2 0x0000003d90429579 in delete (real=0xbeaad8, start=<value optimized out>, len=169) at /usr/include/bits/string3.h:59 No locals. #3 0x0000003d9041ac10 in load_message (loader=0xbeaad0) at dbus-message.c:3609 oom = 0 validity = DBUS_VALID type_str = 0xbec898 type_pos = 125 mode = DBUS_VALIDATION_MODE_DATA_IS_UNTRUSTED #4 _dbus_message_loader_queue_messages (loader=0xbeaad0) at dbus-message.c:3683 message = 0xbec890 validity = <value optimized out> body_len = 9 byte_order = 108 fields_array_len = 141 header_len = 160 #5 0x0000003d90422ed2 in _dbus_transport_get_dispatch_status (transport=0xbea900) at dbus-transport.c:1080 No locals. #6 0x0000003d90423011 in _dbus_transport_queue_messages (transport=0xbea900) at dbus-transport.c:1107 status = <value optimized out> #7 0x0000003d9040ee24 in _dbus_connection_get_dispatch_status_unlocked (connection=0xbeaf40) at dbus-connection.c:4028 No locals. #8 0x0000003d9040f451 in _dbus_connection_send_preallocated_and_unlock (connection=0xbeaf40, preallocated=<value optimized out>, message=<value optimized out>, client_serial=<value optimized out>) at dbus-connection.c:2015 status = 56 #9 0x0000003d9040f667 in _dbus_connection_send_and_unlock (connection=0xbeaf40, message=0xbeb410, client_serial=0x0) at dbus-connection.c:2047 preallocated = <value optimized out> #10 0x000000000041a2ad in _rgm_dbus_notify (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, size=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:191 msg = 0xbeb410 ret = -1 #11 rgm_dbus_update (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, size=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:250 flags = "(none)\000\000`\n\000\300y\177\000\000\240t\277\000\000\000\000\000\210\356\377\313y\177\000\000\300\t\000\300y\177\000\000\003\000\000\000\000\000\000\000p\356\377\313y\177\000\000 \000\000\300y\177\000" st = 0x7f79c0000ea0 m = 0x7f79c0000f10 owner = <value optimized out> last = <value optimized out> ret = 0 #12 0x000000000041d87a in vf_resolve_views (key_node=0x7f79c0000e20) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:272 commits = 1 data = 0x7f79c0000ea0 datalen = 96 trans = 4294937616 #13 0x000000000041e47d in vf_process_msg (ctx=<value optimized out>, nodeid=<value optimized out>, msgp=0x7f79c00008c0, nbytes=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:1410 hdrp = <value optimized out> ret = <value optimized out> kn = <value optimized out> #14 0x000000000041eb5a in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:812 n = <value optimized out> hdrp = 0x7f79c00008c0 #15 vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:871 my_node_id = <value optimized out> port = <value optimized out> cur = <value optimized out> trans = <value optimized out> ctx = 0xbf74a0 #16 0x0000003d8ec077e1 in start_thread (arg=0x7f79cbfff700) at pthread_create.c:301 __res = <value optimized out> pd = 0x7f79cbfff700 now = <value optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161090320128, 6477719499060015742, 264388022912, 140161090320832, 0, 3, -6406673455221575042, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <value optimized out> pagesize_m1 = <value optimized out> sp = <value optimized out> freesize = <value optimized out> #17 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 No locals.
Created attachment 492847 [details] core file
Without a definitive reproducer, I don't know how to fix this. One potential cause is that dbus calls are not as thread-safe as we thought, which would mean that simply adding a mutex around operations on the connection would fix things.
I hit something which is likely related to this bug. While bringing up rgmanager for a derringer run, `clustat -x` tripped this assert in rgmanager. Thread 1 (Thread 0x7f98ef5fe700 (LWP 24191)): #0 0x0000003534e32a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x0000003534e34225 in abort () at abort.c:92 #2 0x0000003536a2a955 in _dbus_abort () at dbus-sysdeps.c:88 #3 0x0000003536a26835 in _dbus_warn_check_failed ( format=0x3536a2f368 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283 #4 0x0000003536a10c62 in _dbus_connection_read_write_dispatch (connection=0x15d1f40, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512 #5 0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131 #6 0x00000035352077e1 in start_thread (arg=0x7f98ef5fe700) at pthread_create.c:301 #7 0x0000003534ee68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
On rgmanager 3.0.12.1-2.el6: * I've attempted to reproduce this while running 'clustat -fx' in a loop while relocating a service in a loop for several thousand iterations of both. * I've attempted to reproduce this while running 'clustat -x' in a loop while relocating a service in a loop for several thousand iterations of both. I am currently testing 'clustat -x' in a tight loop while restarting rgmanager every 30 seconds.
One core did come up during all relocation attempts: Aug 1 19:07:18 snap abrt[15525]: saved core dump of pid 11571 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-1312240037-11571.new/coredump (65859584 bytes) Aug 1 19:07:18 snap abrtd: Directory 'ccpp-1312240037-11571' creation detected Aug 1 19:07:18 snap abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1312239531-14406) Aug 1 19:07:18 snap abrtd: Deleting crash ccpp-1312240037-11571 (dup of ccpp-1312239531-14406), sending dbus signal There appears to be a race during shutdown for notifying, but this crash doesn't resemble the previous ones: #3 0x00000036c5c26835 in _dbus_warn_check_failed ( format=0x36c5c339c0 "arguments to %s() were incorrect, assertion \"%s\" failed in file %s line %d.\nThis is normally a bug in some application using the D-Bus library.\n") at dbus-internals.c:283 #4 0x00000036c5c0d973 in dbus_connection_get_is_connected (connection=0x0) at dbus-connection.c:2876
The lock was taken at the wrong place in rgm_dbus_notify; easyfix. It could potentially be related to the other noted issues here, too, but hard to say.
Actually, it occurred 442 times, ~1/3 times. abrtd simply pruned the crash since they were all the same.
I restarted testing. No problems after 92 runs. http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=538becbe4b3779bf8173e6c09164aa7c8d5521ad https://www.redhat.com/archives/cluster-devel/2011-August/msg00010.html
Still no problems after 150 runs.
*** Bug 728620 has been marked as a duplicate of this bug. ***
Created attachment 519664 [details] core dump from buzz-04 rgmanager Looks like I hit this issue again with rgmanager-3.0.12.1-3.el6.x86_64. I hit this while shutting down rgmanager via the init script. #0 0x00000031e8632945 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00000031e8634125 in abort () at abort.c:92 #2 0x00000031ea62a975 in _dbus_abort () at dbus-sysdeps.c:88 #3 0x00000031ea626845 in _dbus_warn_check_failed ( format=0x31ea62f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283 #4 0x000000000041a45b in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:136 #5 0x00000031e8e077e1 in start_thread (arg=0x7fb149080700) at pthread_create.c:301 #6 0x00000031e86e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=7e4856e8b5f98d4b70a0c434b876f80c8aa43c5c I ran this patch on a test build for 175 iterations of stop/start (20 seconds between start/stop) with a relocation in a tight loop (over 10000 attempts) on the other node. No error messages or segfaults occurred, and events were still correctly sent to foghorn.
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=9a3280e078bd9b9b4997b087eab4748ca79d4182 <-- rhel6 patch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1595.html