Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 697446

Summary:

rgmanager crash on service relocation

Product:

Red Hat Enterprise Linux 6

Reporter:

Jaroslav Kortus <jkortus>

Component:

rgmanager

Assignee:

Lon Hohberger <lhh>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Docs Contact:

Priority:

high

Version:

6.1

CC:

cluster-maint, cmarthal, djansa, syeghiay

Target Milestone:

beta

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

rgmanager-3.0.12.1-4.el6

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-12-06 11:59:37 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

743047

Attachments:

Description	Flags
core file	none
core dump from buzz-04 rgmanager	none

Description Jaroslav Kortus 2011-04-18 10:11:41 UTC

Description of problem:
service manager crashed on 2-node cluster when the other node issued clusvcadm -r service

Version-Release number of selected component (if applicable):
rgmanager-3.0.12-11.el6.x86_64

How reproducible:
seen only once so far

Steps to Reproduce:
1. clusvcadm -r service
2.
3.
  
Actual results:
coredump + reboot

Expected results:
service relocation

Additional info:
I haven't been able to reproduce it since then. It happened as a part of SNMP testing.

Comment 2 Jaroslav Kortus 2011-04-18 10:16:20 UTC

Core was generated by `rgmanager'.
Program terminated with signal 11, Segmentation fault.
#0  _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
196	      a0 = ((op_t *) srcp)[0];



(gdb) thread apply all bt full


Thread 5 (Thread 0x7f79cb5fe700 (LWP 3127)):
#0  0x0000003d8e8dc6c3 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
        __arg2 = 1
        _a3 = 500
        _a1 = 140161079827968
        resultvar = <value optimized out>
        __arg3 = 500
        __arg1 = 140161079827968
        _a2 = 1
        resultvar = <value optimized out>
        oldtype = 0
        result = <value optimized out>
#1  0x0000003d90424310 in socket_do_iteration (transport=0xbea900, flags=6, timeout_milliseconds=<value optimized out>)
    at dbus-transport-socket.c:1066
        socket_transport = 0xbea900
        poll_fd = {fd = 10, events = 1, revents = 0}
        poll_res = <value optimized out>
        poll_timeout = <value optimized out>
#2  0x0000003d904226fd in _dbus_transport_do_iteration (transport=0xbea900, flags=<value optimized out>, 
    timeout_milliseconds=<value optimized out>) at dbus-transport.c:956
No locals.
#3  0x0000003d9040ebae in _dbus_connection_do_iteration_unlocked (connection=0xbeaf40, flags=6, timeout_milliseconds=500)
    at dbus-connection.c:1163
No locals.
#4  0x0000003d90410ce0 in _dbus_connection_read_write_dispatch (connection=0xbeaf40, timeout_milliseconds=500, dispatch=<value optimized out>)
    at dbus-connection.c:3491
        dstatus = <value optimized out>
        progress_possible = <value optimized out>
#5  0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
No locals.
#6  0x0000003d8ec077e1 in start_thread (arg=0x7f79cb5fe700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79cb5fe700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161079830272, 6477719499060015742, 264388022912, 140161079830976, 0, 3, 
                -6406672630050983298, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#7  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 4 (Thread 0x7f79d1dea700 (LWP 2202)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1  0x0000003f6de01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12/common/liblogthread/liblogthread.c:83
        str = "State change: z4 UP\n", '\000' <repeats 107 times>
        e = <value optimized out>
        logtime = 1302517112
        level = <value optimized out>
        prev_dropped = <value optimized out>
#2  0x0000003d8ec077e1 in start_thread (arg=0x7f79d1dea700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79d1dea700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161188800256, 6477719499060015742, 264388022912, 140161188800960, 0, 3, 
                -6406695730532584834, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#3  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 3 (Thread 0x7f79d1e7d700 (LWP 2201)):
#0  0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x0000000000409c5f in event_loop (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:695
        ret = <value optimized out>
        tv = {tv_sec = 4, tv_usec = 858871}
        n = 1
        max = 13
        rfds = {fds_bits = {12288, 0 <repeats 15 times>}}
        newctx = <value optimized out>
        nodeid = <value optimized out>
#2  main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:1102
        rv = -1
        do_init = <value optimized out>
        foreground = <value optimized out>
        wd = 1 '\001'
        me = {cn_nodeid = 2, cn_address = {cna_addrlen = 16, cna_address = "\002\000\000\000\n\017Y\017", '\000' <repeats 19 times>}, 
          cn_name = "z2", '\000' <repeats 253 times>, cn_member = 1, cn_incarnation = 1812, cn_jointime = {tv_sec = 1302517093, 
            tv_usec = 475975}}
        cluster_ctx = 0xbf5870
        local_ctx = 0xbf9710
        th = 264375375472
        clu = 0xbea370

Thread 2 (Thread 0x7f79d13e9700 (LWP 2965)):
#0  0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1  0x00000000004210de in poll_cluster_messages (arg=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:186
        ret = -1
        fd = <value optimized out>
        tv = {tv_sec = 1, tv_usec = 988500}
        timeout = 2
        rfds = {fds_bits = {80, 0 <repeats 15 times>}}
        lfd = 6
        max = <value optimized out>
        p = 0x7f79d13e8e60
        ch = 0xbea370
#2  cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:967
No locals.
#3  0x0000003d8ec077e1 in start_thread (arg=0x7f79d13e9700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79d13e9700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161178310400, 6477719499060015742, 140733646842416, 140161178311104, 0, 3, 
                -6406693805850365314, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#4  0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Thread 1 (Thread 0x7f79cbfff700 (LWP 2966)):
#0  _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
        a0 = <value optimized out>
        a1 = <value optimized out>
        a2 = 0
        a3 = 0
        sh_1 = 8
        sh_2 = 56
#1  0x0000003d8e8826de in memmove (dest=0xbec080, src=<value optimized out>, len=18446744073709551447) at memmove.c:73
        dstp = <value optimized out>
        srcp = <value optimized out>
#2  0x0000003d90429579 in delete (real=0xbeaad8, start=<value optimized out>, len=169) at /usr/include/bits/string3.h:59
No locals.
#3  0x0000003d9041ac10 in load_message (loader=0xbeaad0) at dbus-message.c:3609
        oom = 0
        validity = DBUS_VALID
        type_str = 0xbec898
        type_pos = 125
        mode = DBUS_VALIDATION_MODE_DATA_IS_UNTRUSTED
#4  _dbus_message_loader_queue_messages (loader=0xbeaad0) at dbus-message.c:3683
        message = 0xbec890
        validity = <value optimized out>
        body_len = 9
        byte_order = 108
        fields_array_len = 141
        header_len = 160
#5  0x0000003d90422ed2 in _dbus_transport_get_dispatch_status (transport=0xbea900) at dbus-transport.c:1080
No locals.
#6  0x0000003d90423011 in _dbus_transport_queue_messages (transport=0xbea900) at dbus-transport.c:1107
        status = <value optimized out>
#7  0x0000003d9040ee24 in _dbus_connection_get_dispatch_status_unlocked (connection=0xbeaf40) at dbus-connection.c:4028
No locals.
#8  0x0000003d9040f451 in _dbus_connection_send_preallocated_and_unlock (connection=0xbeaf40, preallocated=<value optimized out>, 
    message=<value optimized out>, client_serial=<value optimized out>) at dbus-connection.c:2015
        status = 56
#9  0x0000003d9040f667 in _dbus_connection_send_and_unlock (connection=0xbeaf40, message=0xbeb410, client_serial=0x0) at dbus-connection.c:2047
        preallocated = <value optimized out>
#10 0x000000000041a2ad in _rgm_dbus_notify (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, 
    size=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:191
        msg = 0xbeb410
        ret = -1
#11 rgm_dbus_update (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, size=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:250
        flags = "(none)\000\000`\n\000\300y\177\000\000\240t\277\000\000\000\000\000\210\356\377\313y\177\000\000\300\t\000\300y\177\000\000\003\000\000\000\000\000\000\000p\356\377\313y\177\000\000 \000\000\300y\177\000"
        st = 0x7f79c0000ea0
        m = 0x7f79c0000f10
        owner = <value optimized out>
        last = <value optimized out>
        ret = 0
#12 0x000000000041d87a in vf_resolve_views (key_node=0x7f79c0000e20) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:272
        commits = 1
        data = 0x7f79c0000ea0
        datalen = 96
        trans = 4294937616
#13 0x000000000041e47d in vf_process_msg (ctx=<value optimized out>, nodeid=<value optimized out>, msgp=0x7f79c00008c0, 
    nbytes=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:1410
        hdrp = <value optimized out>
        ret = <value optimized out>
        kn = <value optimized out>
#14 0x000000000041eb5a in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:812
        n = <value optimized out>
        hdrp = 0x7f79c00008c0
#15 vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:871
        my_node_id = <value optimized out>
        port = <value optimized out>
        cur = <value optimized out>
        trans = <value optimized out>
        ctx = 0xbf74a0
#16 0x0000003d8ec077e1 in start_thread (arg=0x7f79cbfff700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x7f79cbfff700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161090320128, 6477719499060015742, 264388022912, 140161090320832, 0, 3, 
                -6406673455221575042, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#17 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.

Comment 3 Jaroslav Kortus 2011-04-18 10:19:28 UTC

Created attachment 492847 [details]
core file

Comment 5 Lon Hohberger 2011-04-18 13:57:10 UTC

Without a definitive reproducer, I don't know how to fix this.

One potential cause is that dbus calls are not as thread-safe as we thought, which would mean that simply adding a mutex around operations on the connection would fix things.

Comment 9 Nate Straz 2011-05-04 18:37:26 UTC

I hit something which is likely related to this bug.  While bringing up rgmanager for a derringer run, `clustat -x` tripped this assert in rgmanager.

Thread 1 (Thread 0x7f98ef5fe700 (LWP 24191)):
#0  0x0000003534e32a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003534e34225 in abort () at abort.c:92
#2  0x0000003536a2a955 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x0000003536a26835 in _dbus_warn_check_failed (
    format=0x3536a2f368 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x0000003536a10c62 in _dbus_connection_read_write_dispatch (connection=0x15d1f40, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5  0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
#6  0x00000035352077e1 in start_thread (arg=0x7f98ef5fe700) at pthread_create.c:301
#7  0x0000003534ee68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 11 Lon Hohberger 2011-08-01 22:42:23 UTC

On rgmanager 3.0.12.1-2.el6:

* I've attempted to reproduce this while running 'clustat -fx' in a loop while relocating a service in a loop for several thousand iterations of both.

* I've attempted to reproduce this while running 'clustat -x' in a loop while relocating a service in a loop for several thousand iterations of both.

I am currently testing 'clustat -x' in a tight loop while restarting rgmanager every 30 seconds.

Comment 12 Lon Hohberger 2011-08-02 13:50:51 UTC

One core did come up during all relocation attempts:

Aug  1 19:07:18 snap abrt[15525]: saved core dump of pid 11571 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-1312240037-11571.new/coredump (65859584 bytes)
Aug  1 19:07:18 snap abrtd: Directory 'ccpp-1312240037-11571' creation detected
Aug  1 19:07:18 snap abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1312239531-14406)
Aug  1 19:07:18 snap abrtd: Deleting crash ccpp-1312240037-11571 (dup of ccpp-1312239531-14406), sending dbus signal


There appears to be a race during shutdown for notifying, but this crash doesn't resemble the previous ones:

#3  0x00000036c5c26835 in _dbus_warn_check_failed (
    format=0x36c5c339c0 "arguments to %s() were incorrect, assertion \"%s\" failed in file %s line %d.\nThis is normally a bug in some application using the D-Bus library.\n")
    at dbus-internals.c:283
#4  0x00000036c5c0d973 in dbus_connection_get_is_connected (connection=0x0)
    at dbus-connection.c:2876

Comment 13 Lon Hohberger 2011-08-02 13:55:51 UTC

The lock was taken at the wrong place in rgm_dbus_notify; easyfix.

It could potentially be related to the other noted issues here, too, but hard to say.

Comment 14 Lon Hohberger 2011-08-02 14:01:59 UTC

Actually, it occurred 442 times, ~1/3 times.  abrtd simply pruned the crash since they were all the same.

Comment 15 Lon Hohberger 2011-08-02 14:54:10 UTC

I restarted testing.  No problems after 92 runs.

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=538becbe4b3779bf8173e6c09164aa7c8d5521ad

https://www.redhat.com/archives/cluster-devel/2011-August/msg00010.html

Comment 16 Lon Hohberger 2011-08-02 15:28:09 UTC

Still no problems after 150 runs.

Comment 19 Lon Hohberger 2011-08-08 14:32:36 UTC

*** Bug 728620 has been marked as a duplicate of this bug. ***

Comment 20 Nate Straz 2011-08-24 16:06:35 UTC

Created attachment 519664 [details]
core dump from buzz-04 rgmanager

Looks like I hit this issue again with rgmanager-3.0.12.1-3.el6.x86_64.  I hit this while shutting down rgmanager via the init script.

#0  0x00000031e8632945 in raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000031e8634125 in abort () at abort.c:92
#2  0x00000031ea62a975 in _dbus_abort () at dbus-sysdeps.c:88
#3  0x00000031ea626845 in _dbus_warn_check_failed (
    format=0x31ea62f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4  0x000000000041a45b in _dbus_auto_flush (arg=<value optimized out>)
    at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:136
#5  0x00000031e8e077e1 in start_thread (arg=0x7fb149080700) at pthread_create.c:301
#6  0x00000031e86e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 22 Lon Hohberger 2011-09-08 14:14:28 UTC

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=7e4856e8b5f98d4b70a0c434b876f80c8aa43c5c

I ran this patch on a test build for 175 iterations of stop/start (20 seconds between start/stop) with a relocation in a tight loop (over 10000 attempts) on the other node.  No error messages or segfaults occurred, and events were still correctly sent to foghorn.

Comment 23 Lon Hohberger 2011-09-08 14:21:39 UTC

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=9a3280e078bd9b9b4997b087eab4748ca79d4182 <-- rhel6 patch

Comment 26 errata-xmlrpc 2011-12-06 11:59:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1595.html