| Summary: | rgmanager crash on service relocation | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jaroslav Kortus <jkortus> | ||||||
| Component: | rgmanager | Assignee: | Lon Hohberger <lhh> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 6.1 | CC: | cluster-maint, cmarthal, djansa, syeghiay | ||||||
| Target Milestone: | beta | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | rgmanager-3.0.12.1-4.el6 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2011-12-06 11:59:37 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 743047 | ||||||||
| Attachments: |
|
||||||||
|
Description
Jaroslav Kortus
2011-04-18 10:11:41 UTC
Core was generated by `rgmanager'.
Program terminated with signal 11, Segmentation fault.
#0 _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
196 a0 = ((op_t *) srcp)[0];
(gdb) thread apply all bt full
Thread 5 (Thread 0x7f79cb5fe700 (LWP 3127)):
#0 0x0000003d8e8dc6c3 in __poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=<value optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:87
__arg2 = 1
_a3 = 500
_a1 = 140161079827968
resultvar = <value optimized out>
__arg3 = 500
__arg1 = 140161079827968
_a2 = 1
resultvar = <value optimized out>
oldtype = 0
result = <value optimized out>
#1 0x0000003d90424310 in socket_do_iteration (transport=0xbea900, flags=6, timeout_milliseconds=<value optimized out>)
at dbus-transport-socket.c:1066
socket_transport = 0xbea900
poll_fd = {fd = 10, events = 1, revents = 0}
poll_res = <value optimized out>
poll_timeout = <value optimized out>
#2 0x0000003d904226fd in _dbus_transport_do_iteration (transport=0xbea900, flags=<value optimized out>,
timeout_milliseconds=<value optimized out>) at dbus-transport.c:956
No locals.
#3 0x0000003d9040ebae in _dbus_connection_do_iteration_unlocked (connection=0xbeaf40, flags=6, timeout_milliseconds=500)
at dbus-connection.c:1163
No locals.
#4 0x0000003d90410ce0 in _dbus_connection_read_write_dispatch (connection=0xbeaf40, timeout_milliseconds=500, dispatch=<value optimized out>)
at dbus-connection.c:3491
dstatus = <value optimized out>
progress_possible = <value optimized out>
#5 0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
No locals.
#6 0x0000003d8ec077e1 in start_thread (arg=0x7f79cb5fe700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7f79cb5fe700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161079830272, 6477719499060015742, 264388022912, 140161079830976, 0, 3,
-6406672630050983298, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#7 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 4 (Thread 0x7f79d1dea700 (LWP 2202)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
No locals.
#1 0x0000003f6de01653 in thread_fn (arg=<value optimized out>) at /usr/src/debug/cluster-3.0.12/common/liblogthread/liblogthread.c:83
str = "State change: z4 UP\n", '\000' <repeats 107 times>
e = <value optimized out>
logtime = 1302517112
level = <value optimized out>
prev_dropped = <value optimized out>
#2 0x0000003d8ec077e1 in start_thread (arg=0x7f79d1dea700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7f79d1dea700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161188800256, 6477719499060015742, 264388022912, 140161188800960, 0, 3,
-6406695730532584834, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#3 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 3 (Thread 0x7f79d1e7d700 (LWP 2201)):
#0 0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1 0x0000000000409c5f in event_loop (argc=<value optimized out>, argv=<value optimized out>)
at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:695
ret = <value optimized out>
tv = {tv_sec = 4, tv_usec = 858871}
n = 1
max = 13
rfds = {fds_bits = {12288, 0 <repeats 15 times>}}
newctx = <value optimized out>
nodeid = <value optimized out>
#2 main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/main.c:1102
rv = -1
do_init = <value optimized out>
foreground = <value optimized out>
wd = 1 '\001'
me = {cn_nodeid = 2, cn_address = {cna_addrlen = 16, cna_address = "\002\000\000\000\n\017Y\017", '\000' <repeats 19 times>},
cn_name = "z2", '\000' <repeats 253 times>, cn_member = 1, cn_incarnation = 1812, cn_jointime = {tv_sec = 1302517093,
tv_usec = 475975}}
cluster_ctx = 0xbf5870
local_ctx = 0xbf9710
th = 264375375472
clu = 0xbea370
Thread 2 (Thread 0x7f79d13e9700 (LWP 2965)):
#0 0x0000003d8e8de923 in select () at ../sysdeps/unix/syscall-template.S:82
No locals.
#1 0x00000000004210de in poll_cluster_messages (arg=<value optimized out>)
at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:186
ret = -1
fd = <value optimized out>
tv = {tv_sec = 1, tv_usec = 988500}
timeout = 2
rfds = {fds_bits = {80, 0 <repeats 15 times>}}
lfd = 6
max = <value optimized out>
p = 0x7f79d13e8e60
ch = 0xbea370
#2 cluster_comms_thread (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/msg_cluster.c:967
No locals.
#3 0x0000003d8ec077e1 in start_thread (arg=0x7f79d13e9700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7f79d13e9700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161178310400, 6477719499060015742, 140733646842416, 140161178311104, 0, 3,
-6406693805850365314, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#4 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Thread 1 (Thread 0x7f79cbfff700 (LWP 2966)):
#0 _wordcopy_fwd_dest_aligned (dstp=12738376, srcp=12738560, len=2305843009213664272) at wordcopy.c:196
a0 = <value optimized out>
a1 = <value optimized out>
a2 = 0
a3 = 0
sh_1 = 8
sh_2 = 56
#1 0x0000003d8e8826de in memmove (dest=0xbec080, src=<value optimized out>, len=18446744073709551447) at memmove.c:73
dstp = <value optimized out>
srcp = <value optimized out>
#2 0x0000003d90429579 in delete (real=0xbeaad8, start=<value optimized out>, len=169) at /usr/include/bits/string3.h:59
No locals.
#3 0x0000003d9041ac10 in load_message (loader=0xbeaad0) at dbus-message.c:3609
oom = 0
validity = DBUS_VALID
type_str = 0xbec898
type_pos = 125
mode = DBUS_VALIDATION_MODE_DATA_IS_UNTRUSTED
#4 _dbus_message_loader_queue_messages (loader=0xbeaad0) at dbus-message.c:3683
message = 0xbec890
validity = <value optimized out>
body_len = 9
byte_order = 108
fields_array_len = 141
header_len = 160
#5 0x0000003d90422ed2 in _dbus_transport_get_dispatch_status (transport=0xbea900) at dbus-transport.c:1080
No locals.
#6 0x0000003d90423011 in _dbus_transport_queue_messages (transport=0xbea900) at dbus-transport.c:1107
status = <value optimized out>
#7 0x0000003d9040ee24 in _dbus_connection_get_dispatch_status_unlocked (connection=0xbeaf40) at dbus-connection.c:4028
No locals.
#8 0x0000003d9040f451 in _dbus_connection_send_preallocated_and_unlock (connection=0xbeaf40, preallocated=<value optimized out>,
message=<value optimized out>, client_serial=<value optimized out>) at dbus-connection.c:2015
status = 56
#9 0x0000003d9040f667 in _dbus_connection_send_and_unlock (connection=0xbeaf40, message=0xbeb410, client_serial=0x0) at dbus-connection.c:2047
preallocated = <value optimized out>
#10 0x000000000041a2ad in _rgm_dbus_notify (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0,
size=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:191
msg = 0xbeb410
ret = -1
#11 rgm_dbus_update (key=0x7f79c0000e80 "rg=\"service:apache\"", view=<value optimized out>, data=0x7f79c0000ea0, size=<value optimized out>)
at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:250
flags = "(none)\000\000`\n\000\300y\177\000\000\240t\277\000\000\000\000\000\210\356\377\313y\177\000\000\300\t\000\300y\177\000\000\003\000\000\000\000\000\000\000p\356\377\313y\177\000\000 \000\000\300y\177\000"
st = 0x7f79c0000ea0
m = 0x7f79c0000f10
owner = <value optimized out>
last = <value optimized out>
ret = 0
#12 0x000000000041d87a in vf_resolve_views (key_node=0x7f79c0000e20) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:272
commits = 1
data = 0x7f79c0000ea0
datalen = 96
trans = 4294937616
#13 0x000000000041e47d in vf_process_msg (ctx=<value optimized out>, nodeid=<value optimized out>, msgp=0x7f79c00008c0,
nbytes=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:1410
hdrp = <value optimized out>
ret = <value optimized out>
kn = <value optimized out>
#14 0x000000000041eb5a in vf_event_loop (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:812
n = <value optimized out>
hdrp = 0x7f79c00008c0
#15 vf_server (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/clulib/vft.c:871
my_node_id = <value optimized out>
port = <value optimized out>
cur = <value optimized out>
trans = <value optimized out>
ctx = 0xbf74a0
#16 0x0000003d8ec077e1 in start_thread (arg=0x7f79cbfff700) at pthread_create.c:301
__res = <value optimized out>
pd = 0x7f79cbfff700
now = <value optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140161090320128, 6477719499060015742, 264388022912, 140161090320832, 0, 3,
-6406673455221575042, 6457711137005236862}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0, canceltype = 0}}}
not_first_call = <value optimized out>
pagesize_m1 = <value optimized out>
sp = <value optimized out>
freesize = <value optimized out>
#17 0x0000003d8e8e5dcd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Created attachment 492847 [details]
core file
Without a definitive reproducer, I don't know how to fix this. One potential cause is that dbus calls are not as thread-safe as we thought, which would mean that simply adding a mutex around operations on the connection would fix things. I hit something which is likely related to this bug. While bringing up rgmanager for a derringer run, `clustat -x` tripped this assert in rgmanager.
Thread 1 (Thread 0x7f98ef5fe700 (LWP 24191)):
#0 0x0000003534e32a45 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003534e34225 in abort () at abort.c:92
#2 0x0000003536a2a955 in _dbus_abort () at dbus-sysdeps.c:88
#3 0x0000003536a26835 in _dbus_warn_check_failed (
format=0x3536a2f368 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4 0x0000003536a10c62 in _dbus_connection_read_write_dispatch (connection=0x15d1f40, timeout_milliseconds=500, dispatch=<value optimized out>) at dbus-connection.c:3512
#5 0x0000000000419f21 in _dbus_auto_flush (arg=<value optimized out>) at /usr/src/debug/rgmanager-3.0.12/rgmanager/src/daemons/update-dbus.c:131
#6 0x00000035352077e1 in start_thread (arg=0x7f98ef5fe700) at pthread_create.c:301
#7 0x0000003534ee68ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
On rgmanager 3.0.12.1-2.el6: * I've attempted to reproduce this while running 'clustat -fx' in a loop while relocating a service in a loop for several thousand iterations of both. * I've attempted to reproduce this while running 'clustat -x' in a loop while relocating a service in a loop for several thousand iterations of both. I am currently testing 'clustat -x' in a tight loop while restarting rgmanager every 30 seconds. One core did come up during all relocation attempts:
Aug 1 19:07:18 snap abrt[15525]: saved core dump of pid 11571 (/usr/sbin/rgmanager) to /var/spool/abrt/ccpp-1312240037-11571.new/coredump (65859584 bytes)
Aug 1 19:07:18 snap abrtd: Directory 'ccpp-1312240037-11571' creation detected
Aug 1 19:07:18 snap abrtd: Crash is in database already (dup of /var/spool/abrt/ccpp-1312239531-14406)
Aug 1 19:07:18 snap abrtd: Deleting crash ccpp-1312240037-11571 (dup of ccpp-1312239531-14406), sending dbus signal
There appears to be a race during shutdown for notifying, but this crash doesn't resemble the previous ones:
#3 0x00000036c5c26835 in _dbus_warn_check_failed (
format=0x36c5c339c0 "arguments to %s() were incorrect, assertion \"%s\" failed in file %s line %d.\nThis is normally a bug in some application using the D-Bus library.\n")
at dbus-internals.c:283
#4 0x00000036c5c0d973 in dbus_connection_get_is_connected (connection=0x0)
at dbus-connection.c:2876
The lock was taken at the wrong place in rgm_dbus_notify; easyfix. It could potentially be related to the other noted issues here, too, but hard to say. Actually, it occurred 442 times, ~1/3 times. abrtd simply pruned the crash since they were all the same. I restarted testing. No problems after 92 runs. http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=538becbe4b3779bf8173e6c09164aa7c8d5521ad https://www.redhat.com/archives/cluster-devel/2011-August/msg00010.html Still no problems after 150 runs. *** Bug 728620 has been marked as a duplicate of this bug. *** Created attachment 519664 [details]
core dump from buzz-04 rgmanager
Looks like I hit this issue again with rgmanager-3.0.12.1-3.el6.x86_64. I hit this while shutting down rgmanager via the init script.
#0 0x00000031e8632945 in raise (sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00000031e8634125 in abort () at abort.c:92
#2 0x00000031ea62a975 in _dbus_abort () at dbus-sysdeps.c:88
#3 0x00000031ea626845 in _dbus_warn_check_failed (
format=0x31ea62f388 "The last reference on a connection was dropped without closing the connection. This is a bug in an application. See dbus_connection_unref() documentation for details.\n%s") at dbus-internals.c:283
#4 0x000000000041a45b in _dbus_auto_flush (arg=<value optimized out>)
at /usr/src/debug/rgmanager-3.0.12.1/rgmanager/src/daemons/update-dbus.c:136
#5 0x00000031e8e077e1 in start_thread (arg=0x7fb149080700) at pthread_create.c:301
#6 0x00000031e86e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=7e4856e8b5f98d4b70a0c434b876f80c8aa43c5c I ran this patch on a test build for 175 iterations of stop/start (20 seconds between start/stop) with a relocation in a tight loop (over 10000 attempts) on the other node. No error messages or segfaults occurred, and events were still correctly sent to foghorn. http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=9a3280e078bd9b9b4997b087eab4748ca79d4182 <-- rhel6 patch Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1595.html |