Bug 1710744

Summary:	[FUSE] Endpoint is not connected after "Found anomalies" error
Product:	[Community] GlusterFS	Reporter:	Pavel Znamensky <kompastver>
Component:	io-cache	Assignee:	Raghavendra G <rgowdapp>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	5	CC:	bugs, csaba, jahernan, nbalacha, ndevos, rgowdapp
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-05 12:44:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Pavel Znamensky 2019-05-16 08:32:57 UTC

Description of problem:
Fuse client crashed after it found anomalies. Or probably it's just a coincidence.

Version-Release number of selected component (if applicable):
v5.5

How reproducible:
I've observed it only once after two months of uptime on a production cluster

Steps to Reproduce:

Actually, there were usual workload, and nothing special was done

1. setup replicated volume with two bricks
2. mount it via fuse client
3. use it till it crashes
¯\_(ツ)_/¯

Actual results:
Fuse mount crashed

Expected results:
Fuse mount works

Additional info:

Volume info:

Volume Name: st
Type: Replicate
Volume ID: adfd1585-1f5c-42af-a195-af57889d951d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: srv1:/vol3/
Brick2: srv2:/vol3/
Options Reconfigured:
performance.read-ahead: off
cluster.readdir-optimize: on
client.event-threads: 2
server.event-threads: 16
network.ping-timeout: 20
cluster.data-self-heal-algorithm: full
performance.io-thread-count: 16
performance.io-cache: on
performance.cache-size: 1GB
performance.quick-read: off
transport.address-family: inet6
performance.readdir-ahead: off
nfs.disable: on

Fuse client log:
[2019-05-15 01:46:03.521974] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 179361305: MKDIR() /assets_cache/26B/280 => -1 (File exists)
[2019-05-16 05:29:44.172704] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 186813586: MKDIR() /assets_cache/320/0C0 => -1 (File exists)
[2019-05-16 05:29:44.175393] I [MSGID: 109063] [dht-layout.c:659:dht_layout_normalize] 6-st3-dht: Found anomalies in /assets_cache/320/0C0 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
pending frames:
pending frames:
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
frame : type(1) op(UNLINK)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-05-16 05:58:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.5
time of crash: 
2019-05-16 05:58:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.5
/lib64/libglusterfs.so.0(+0x26620)[0x7f4b559a4620]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f4b559aebd4]
/lib64/libc.so.6(+0x36280)[0x7f4b53dd7280]
/lib64/libpthread.so.0(pthread_mutex_lock+0x0)[0x7f4b545d9c30]
/lib64/libglusterfs.so.0(fd_unref+0x37)[0x7f4b559cd1e7]
/usr/lib64/glusterfs/5.5/xlator/protocol/client.so(+0x17038)[0x7f4b47beb038]
/usr/lib64/glusterfs/5.5/xlator/protocol/client.so(+0x721ed)[0x7f4b47c461ed]
/lib64/libgfrpc.so.0(+0xf030)[0x7f4b55771030]
/lib64/libgfrpc.so.0(+0xf403)[0x7f4b55771403]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f4b5576d2f3]
/usr/lib64/glusterfs/5.5/rpc-transport/socket.so(+0xa106)[0x7f4b48d04106]
/lib64/libglusterfs.so.0(+0x8aa89)[0x7f4b55a08a89]
/lib64/libpthread.so.0(+0x7dd5)[0x7f4b545d7dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f4b53e9eead]
---------

Comment 1 Pavel Znamensky 2019-05-16 09:14:11 UTC

I have the coredump file, but due to sensitive information, I can send it directly to the developers only.

Comment 2 Ravishankar N 2019-05-24 15:20:37 UTC

Adding protocol/client MAINTAINERS to the BZ by looking at the backtrace in the bug description. It might not be related to the client xlator but they can bring it to the attention of other component owners if needed.

Comment 3 Pavel Znamensky 2019-06-13 13:46:11 UTC

Have caught it again.
Is there any workaround?

Comment 4 Amar Tumballi 2019-06-18 03:59:38 UTC

2 requests:

1. By upgrading to glusterfs-6.x version you won't even hit the 'dht' (or distribute) code in this scenario. So, the log should go away, and if the reason for such log is the cause of crash, that should get avoided too.

2. please send 'thread apply all bt full' from the coredump. That should help us to see what caused the problem. (You can send the bt to me directly).

Comment 5 Pavel Znamensky 2019-06-18 11:22:46 UTC

Amar, thank you for the reply.
We're not going to upgrade to glusterfs-6.x in the nearest future. But we'll keep in mind.
As for threads backtraces, I've just sent them to you.
Thanks.

Comment 6 Amar Tumballi 2019-07-16 04:08:32 UTC

#0  0x00007f4b203fa313 in ?? () from /lib64/libgcc_s.so.1
No symbol table info available.
#1  0x00007f4b53edd42c in __GI___dl_iterate_phdr (callback=0x7f4b203fa280, data=0x7f4b4579f210) at dl-iteratephdr.c:76
        __clframe = {__cancel_routine = <optimized out>, __cancel_arg = 0x0, __do_it = 1, __cancel_type = <optimized out>}
        nloaded = 43
        ns = <optimized out>
        caller = <optimized out>
        l = 0x7f4b40011b60
        info = {dlpi_addr = 139961295089664, 
          dlpi_name = 0x7f4b40011b10 "/usr/lib64/glusterfs/5.5/xlator/performance/write-behind.so", dlpi_phdr = 0x7f4b4746b040, 
          dlpi_phnum = 7, dlpi_adds = 43, dlpi_subs = 0, dlpi_tls_modid = 0, dlpi_tls_data = 0x0}
        ret = 0
#2  0x00007f4b203fabbf in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
No symbol table info available.
#3  0x00007f4b203f7d2c in ?? () from /lib64/libgcc_s.so.1
No symbol table info available.
#4  0x00007f4b203f8fb9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
No symbol table info available.
#5  0x00007f4b53eb4f16 in __GI___backtrace (array=<optimized out>, size=200) at ../sysdeps/x86_64/backtrace.c:107
        arg = {array = 0x7f4b4579f5f0, cfa = 139961264899952, cnt = 6, size = 200}
        once = 6
#6  0x00007f4b559a4620 in _gf_msg_backtrace_nomem () from /lib64/libglusterfs.so.0
No symbol table info available.
#7  0x00007f4b559aebd4 in gf_print_trace () from /lib64/libglusterfs.so.0
No symbol table info available.
#8  <signal handler called>
No locals.
#9  uuid_unpack (in=0x8 <Address 0x8 out of bounds>, uu=uu@entry=0x7f4b457a0730) at libuuid/src/unpack.c:43
        ptr = 0x9 <Address 0x9 out of bounds>
        tmp = <error reading variable tmp (Cannot access memory at address 0x8)>
#10 0x00007f4b54efc606 in uuid_unparse_x (uu=<optimized out>, out=0x7f4b3c002c20 "dbd277a3-a37f-4b84-aba5-164364b7853f", 
    fmt=0x7f4b54efcba0 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at libuuid/src/unparse.c:55
        uuid = {time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = 0, node = "\000\000\000\000\000"}
#11 0x00007f4b559add4c in uuid_utoa () from /lib64/libglusterfs.so.0
No symbol table info available.
#12 0x00007f4b46e39e15 in ioc_open_cbk () from /usr/lib64/glusterfs/5.5/xlator/performance/io-cache.so
No symbol table info available.
#13 0x00007f4b476fca51 in dht_open_cbk () from /usr/lib64/glusterfs/5.5/xlator/cluster/distribute.so
No symbol table info available.
#14 0x00007f4b4796ecea in afr_open_cbk () from /usr/lib64/glusterfs/5.5/xlator/cluster/replicate.so
No symbol table info available.
#15 0x00007f4b47c461dd in client4_0_open_cbk () from /usr/lib64/glusterfs/5.5/xlator/protocol/client.so
No symbol table info available.
#16 0x00007f4b55771030 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
No symbol table info available.
#17 0x00007f4b55771403 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#18 0x00007f4b5576d2f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#19 0x00007f4b48d04106 in socket_event_handler () from /usr/lib64/glusterfs/5.5/rpc-transport/socket.so
No symbol table info available.
#20 0x00007f4b55a08a89 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
No symbol table info available.
#21 0x00007f4b545d7dd5 in start_thread (arg=0x7f4b457a1700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7f4b457a1700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139961264903936, -2887800249423137699, 0, 8392704, 0, 139961264903936, 
                2988967441253102685, 2988930292483585117}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#22 0x00007f4b53e9eead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Thread 1 (Thread 0x7f4b4866d700 (LWP 10519)):
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
        type = <optimized out>
        id = <optimized out>
#1  0x00007f4b559cd1e7 in fd_unref () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007f4b47beb038 in client_local_wipe () from /usr/lib64/glusterfs/5.5/xlator/protocol/client.so
No symbol table info available.
#3  0x00007f4b47c461ed in client4_0_open_cbk () from /usr/lib64/glusterfs/5.5/xlator/protocol/client.so
No symbol table info available.
#4  0x00007f4b55771030 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
No symbol table info available.
#5  0x00007f4b55771403 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#6  0x00007f4b5576d2f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#7  0x00007f4b48d04106 in socket_event_handler () from /usr/lib64/glusterfs/5.5/rpc-transport/socket.so
No symbol table info available.

--------------------------------------------------


Looks like an extra fd_unref() (or inode_unref()) in io-cache xlator??

Comment 7 Nithya Balachandran 2019-07-29 10:28:23 UTC

Looks similar to the crash reported in https://bugzilla.redhat.com/show_bug.cgi?id=1697971#c4

Comment 8 Nithya Balachandran 2019-11-04 08:01:12 UTC

Can you try turning off open-behind and see if you still see the crash?

Comment 9 Pavel Znamensky 2019-11-05 12:20:06 UTC

Nithya, it's quite strange, but there were no errors like this since July.
The same version, the same properties. I don't know what has been changed.
open-behind is enabled.
Let's close the issue.  In case the error occurs again, I'll reopen it.

Comment 10 Nithya Balachandran 2019-11-05 12:44:53 UTC

(In reply to Pavel Znamensky from comment #9)
> Nithya, it's quite strange, but there were no errors like this since July.
> The same version, the same properties. I don't know what has been changed.
> open-behind is enabled.
> Let's close the issue.  In case the error occurs again, I'll reopen it.

Thanks Pavel. I'm going to close it with WorksForMe. Please reopen if you see it again.