Bug 697446 - rgmanager crash on service relocation
rgmanager crash on service relocation
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager (Show other bugs)
6.1
Unspecified Unspecified
high Severity medium
: beta
: ---
Assigned To: Lon Hohberger
Cluster QE
:
: 728620 (view as bug list)
Depends On:
Blocks: 743047
  Show dependency treegraph
 
Reported: 2011-04-18 06:11 EDT by Jaroslav Kortus
Modified: 2011-12-06 06:59 EST (History)
4 users (show)

See Also:
Fixed In Version: rgmanager-3.0.12.1-4.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 06:59:37 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
core file (163.05 KB, application/x-gzip)
2011-04-18 06:19 EDT, Jaroslav Kortus
no flags Details
core dump from buzz-04 rgmanager (183.64 KB, application/x-gzip)
2011-08-24 12:06 EDT, Nate Straz
no flags Details

  None (edit)
Description Jaroslav Kortus 2011-04-18 06:11:41 EDT
Description of problem:
service manager crashed on 2-node cluster when the other node issued clusvcadm -r service

Version-Release number of selected component (if applicable):
rgmanager-3.0.12-11.el6.x86_64

How reproducible:
seen only once so far

Steps to Reproduce:
1. clusvcadm -r service
2.
3.
  
Actual results:
coredump + reboot

Expected results:
service relocation

Additional info:
I haven't been able to reproduce it since then. It happened as a part of SNMP testing.
Comment 2 Jaroslav Kortus 2011-04-18 06:16:20 EDT
Core was generated by `rgmanager'.
Program terminated with signal 11, Segmentation fault.
#0  _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
196	      a0 = ((op_t *) srcp)[0];



(gdb) thread apply all bt full


Thread 5 (Thread 0x7f79cb5fe700 (LWP 3127)):
#0  0x0000003d8e8dc6c3 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
        __arg2 = 1
        _a3 = 500
        _a1 = 140161079827968
        resultvar = <value optimized out>
        __arg3 = 500
        __arg1 = 140161079827968
        _a2 = 1
        resultvar = <value optimized out>
        oldtype = 0
        result = <value optimized out>
#1  0x0000003d90424310 in socket_do_iteration (transport=0xbea900, flags=6, timeout_milliseconds=<value optimized out>)
    at dbus-transport-socket.c:1066
        socket_transport = 0xbea900
        poll_fd = {fd = 10, events = 1, revents = 0}
        poll_res = <value optimized out>
        poll_timeout = <value optimized out>
#2  0x0000003d904226fd in _dbus_transport_do_iteration (transport=0xbea900, flags=<value optimized out>, 
    timeout_milliseconds=<value optimized out>) at dbus-transport.c:956
No locals.
#3  0x0000003d9040ebae in _dbus_connection_do_iteration_unlocked (connection=0xbeaf40, flags=6, timeout_milliseconds=500)
    at dbus-connection.c:1163
No locals.
#4  0x0000003d90410ce0 in _dbus_connection_read_write_dispatch (connection=0xbeaf40, timeout_milliseconds=500, dispatch=<value optimized out>)
    at dbus-connection.c:3491
        dstatus = <value optimized out>
        progress_possible = <value optimized out>
#5  0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
No locals.
#6  0x0000003d8ec077e1 in start_thread (arg=0x7f79cb5fe700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79cb5fe700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161079830272, 6477719499060015742, 264388022912, 140161079830976, 0, 3, 
                -6406672630050983298, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#7  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 4 (Thread 0x7f79d1dea700 (LWP 2202)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x0000003f6de01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12/common/liblogthread/liblogthread.c:83
        str = "State change: z4 UP\n", '\000' <repeats 107 times>
        e = <value optimized out>
        logtime = 1302517112
        level = <value optimized out>
        prev_dropped = <value optimized out>
#2  0x0000003d8ec077e1 in start_thread (arg=0x7f79d1dea700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79d1dea700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161188800256, 6477719499060015742, 264388022912, 140161188800960, 0, 3, 
                -6406695730532584834, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#3  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 3 (Thread 0x7f79d1e7d700 (LWP 2201)):
#0  0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x0000000000409c5f in event_loop (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:695
        ret = <value optimized out>
        tv = {tv_sec = 4, tv_usec = 858871}
        n = 1
        max = 13
        rfds = {fds_bits = {12288, 0 <repeats 15 times>}}
        newctx = <value optimized out>
        nodeid = <value optimized out>
#2  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:1102
        rv = -1
        do_init = <value optimized out>
        foreground = <value optimized out>
        wd = 1 '\001'
        me = {cn_nodeid = 2, cn_address = {cna_addrlen = 16, cna_address = "\002\000\000\000\n\017Y\017", '\000' <repeats 19 times>}, 
          cn_name = "z2", '\000' <repeats 253 times>, cn_member = 1, cn_incarnation = 1812, cn_jointime = {tv_sec = 1302517093, 
            tv_usec = 475975}}
        cluster_ctx = 0xbf5870
        local_ctx = 0xbf9710
        th = 264375375472
        clu = 0xbea370

Thread 2 (Thread 0x7f79d13e9700 (LWP 2965)):
#0  0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x00000000004210de in poll_cluster_messages (arg=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:186
        ret = -1
        fd = <value optimized out>
        tv = {tv_sec = 1, tv_usec = 988500}
        timeout = 2
        rfds = {fds_bits = {80, 0 <repeats 15 times>}}
        lfd = 6
        max = <value optimized out>
        p = 0x7f79d13e8e60
        ch = 0xbea370
#2  cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:967
No locals.
#3  0x0000003d8ec077e1 in start_thread (arg=0x7f79d13e9700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79d13e9700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161178310400, 6477719499060015742, 140733646842416, 140161178311104, 0, 3, 
                -6406693805850365314, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#4  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 1 (Thread 0x7f79cbfff700 (LWP 2966)):
#0  _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
        a0 = <value optimized out>
        a1 = <value optimized out>
        a2 = 0
        a3 = 0
        sh_1 = 8
        sh_2 = 56
#1  0x0000003d8e8826de in memmove (dest=0xbec080, src=<value optimized out>, len=18446744073709551447) at memmove.c:73
        dstp = <value optimized out>
        srcp = <value optimized out>
#2  0x0000003d90429579 in delete (real=0xbeaad8, start=<value optimized out>, len=169) at /usr/include/bits/string3.h:59
No locals.
#3  0x0000003d9041ac10 in load_message (loader=0xbeaad0) at dbus-message.c:3609
        oom = 0
        validity = DBUS_VALID
        type_str = 0xbec898
        type_pos = 125
        mode = DBUS_VALIDATION_MODE_DATA_IS_UNTRUSTED
#4  _dbus_message_loader_queue_messages (loader=0xbeaad0) at dbus-message.c:3683
        message = 0xbec890
        validity = <value optimized out>
        body_len = 9
        byte_order = 108
        fields_array_len = 141
        header_len = 160
#5  0x0000003d90422ed2 in _dbus_transport_get_dispatch_status (transport=0xbea900) at dbus-transport.c:1080
No locals.
#6  0x0000003d90423011 in _dbus_transport_queue_messages (transport=0xbea900) at dbus-transport.c:1107
        status = <value optimized out>
#7  0x0000003d9040ee24 in _dbus_connection_get_dispatch_status_unlocked (connection=0xbeaf40) at dbus-connection.c:4028
No locals.
#8  0x0000003d9040f451 in _dbus_connection_send_preallocated_and_unlock (connection=0xbeaf40, preallocated=<value optimized out>, 
    message=<value optimized out>, client_serial=<value optimized out>) at dbus-connection.c:2015
        status = 56
#9  0x0000003d9040f667 in _dbus_connection_send_and_unlock (connection=0xbeaf40, message=0xbeb410, client_serial=0x0) at dbus-connection.c:2047
        preallocated = <value optimized out>
#10 0x000000000041a2ad in _rgm_dbus_notify (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, 
    size=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:191
        msg = 0xbeb410
        ret = -1
#11 rgm_dbus_update (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, size=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:250
        flags = "(none)\000\000`\n\000\300y\177\000\000\240t\277\000\000\000\000\000\210\356\377\313y\177\000\000\300\t\000\300y\177\000\000\003\000\000\000\000\000\000\000p\356\377\313y\177\000\000 \000\000\300y\177\000"
        st = 0x7f79c0000ea0
        m = 0x7f79c0000f10
        owner = <value optimized out>
        last = <value optimized out>
        ret = 0
#12 0x000000000041d87a in vf_resolve_views (key_node=0x7f79c0000e20) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:272
        commits = 1
        data = 0x7f79c0000ea0
        datalen = 96
        trans = 4294937616
#13 0x000000000041e47d in vf_process_msg (ctx=<value optimized out>, nodeid=<value optimized out>, msgp=0x7f79c00008c0, 
    nbytes=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:1410
        hdrp = <value optimized out>
        ret = <value optimized out>
        kn = <value optimized out>
#14 0x000000000041eb5a in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:812
        n = <value optimized out>
        hdrp = 0x7f79c00008c0
#15 vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:871
        my_node_id = <value optimized out>
        port = <value optimized out>
        cur = <value optimized out>
        trans = <value optimized out>
        ctx = 0xbf74a0
#16 0x0000003d8ec077e1 in start_thread (arg=0x7f79cbfff700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79cbfff700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161090320128, 6477719499060015742, 264388022912, 140161090320832, 0, 3, 
                -6406673455221575042, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#17 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Comment 3 Jaroslav Kortus 2011-04-18 06:19:28 EDT
Created attachment 492847 [details]
core file
Comment 5 Lon Hohberger 2011-04-18 09:57:10 EDT
Without a definitive reproducer, I don't know how to fix this.

One potential cause is that dbus calls are not as thread-safe as we thought, which would mean that simply adding a mutex around operations on the connection would fix things.
Comment 9 Nate Straz 2011-05-04 14:37:26 EDT
I hit something which is likely related to this bug.  While bringing up rgmanager for a derringer run, `clustat -x` tripped this assert in rgmanager.

Thread 1 (Thread 0x7f98ef5fe700 (LWP 24191)):
#0  0x0000003534e32a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003534e34225 in abort () at abort.c:92
#2  0x0000003536a2a955 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x0000003536a26835 in _dbus_warn_check_failed (
    format=0x3536a2f368 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x0000003536a10c62 in _dbus_connection_read_write_dispatch (connection=0x15d1f40, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
#6  0x00000035352077e1 in start_thread (arg=0x7f98ef5fe700) at pthread_create.c:301
#7  0x0000003534ee68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Comment 11 Lon Hohberger 2011-08-01 18:42:23 EDT
On rgmanager 3.0.12.1-2.el6:

* I've attempted to reproduce this while running 'clustat -fx' in a loop while relocating a service in a loop for several thousand iterations of both.

* I've attempted to reproduce this while running 'clustat -x' in a loop while relocating a service in a loop for several thousand iterations of both.

I am currently testing 'clustat -x' in a tight loop while restarting rgmanager every 30 seconds.
Comment 12 Lon Hohberger 2011-08-02 09:50:51 EDT
One core did come up during all relocation attempts:

Aug  1 19:07:18 snap abrt[15525]: saved core dump of pid 11571 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-1312240037-11571.new/coredump (65859584 bytes)
Aug  1 19:07:18 snap abrtd: Directory 'ccpp-1312240037-11571' creation detected
Aug  1 19:07:18 snap abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1312239531-14406)
Aug  1 19:07:18 snap abrtd: Deleting crash ccpp-1312240037-11571 (dup of ccpp-1312239531-14406), sending dbus signal


There appears to be a race during shutdown for notifying, but this crash doesn't resemble the previous ones:

#3  0x00000036c5c26835 in _dbus_warn_check_failed (
    format=0x36c5c339c0 "arguments to %s() were incorrect, assertion \"%s\" failed in file %s line %d.\nThis is normally a bug in some application using the D-Bus library.\n")
    at dbus-internals.c:283
#4  0x00000036c5c0d973 in dbus_connection_get_is_connected (connection=0x0)
    at dbus-connection.c:2876
Comment 13 Lon Hohberger 2011-08-02 09:55:51 EDT
The lock was taken at the wrong place in rgm_dbus_notify; easyfix.

It could potentially be related to the other noted issues here, too, but hard to say.
Comment 14 Lon Hohberger 2011-08-02 10:01:59 EDT
Actually, it occurred 442 times, ~1/3 times.  abrtd simply pruned the crash since they were all the same.
Comment 16 Lon Hohberger 2011-08-02 11:28:09 EDT
Still no problems after 150 runs.
Comment 19 Lon Hohberger 2011-08-08 10:32:36 EDT
*** Bug 728620 has been marked as a duplicate of this bug. ***
Comment 20 Nate Straz 2011-08-24 12:06:35 EDT
Created attachment 519664 [details]
core dump from buzz-04 rgmanager

Looks like I hit this issue again with rgmanager-3.0.12.1-3.el6.x86_64.  I hit this while shutting down rgmanager via the init script.

#0  0x00000031e8632945 in raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000031e8634125 in abort () at abort.c:92
#2  0x00000031ea62a975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x00000031ea626845 in _dbus_warn_check_failed (
    format=0x31ea62f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x000000000041a45b in _dbus_auto_flush (arg=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:136
#5  0x00000031e8e077e1 in start_thread (arg=0x7fb149080700) at pthread_create.c:301
#6  0x00000031e86e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Comment 22 Lon Hohberger 2011-09-08 10:14:28 EDT
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=7e4856e8b5f98d4b70a0c434b876f80c8aa43c5c

I ran this patch on a test build for 175 iterations of stop/start (20 seconds between start/stop) with a relocation in a tight loop (over 10000 attempts) on the other node.  No error messages or segfaults occurred, and events were still correctly sent to foghorn.
Comment 26 errata-xmlrpc 2011-12-06 06:59:37 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1595.html

Note You need to log in before you can comment on or make changes to this bug.