Bug 1384865

Summary: USS: snapd process crashed while doing parallel client operations
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Anil Shah <ashah>
Component: io-threads
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Anil Shah <ashah>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, pgurusid, rcyriac, rhinduja, rhs-bugs, sbhaloth, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 06:10:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1380619    
Bug Blocks: 1351528    

Description Anil Shah 2016-10-14 10:11:34 UTC
Description of problem:

While doing parallel client operations and activating/deactivating snapshots, the snapd process crashed on two nodes in the cluster.
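
A rough sketch of this kind of workload (hypothetical mount point, volume and snapshot names; the actual reproduction steps were not captured in this report):

# Hypothetical sketch only -- the actual steps were not recorded here.
# On each client, browse the USS entry point (.snaps) of the fuse mount in a loop:
while true; do ls -lR /mnt/glusterfs/.snaps > /dev/null 2>&1; done &

# On one of the server nodes, activate/deactivate the volume's snapshots in parallel:
for snap in $(gluster snapshot list vol0); do
    gluster --mode=script snapshot deactivate "$snap"
    gluster --mode=script snapshot activate "$snap"
done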

Version-Release number of selected component (if applicable):

[root@dhcp47-158 /]# rpm -qa | grep glusterfs
glusterfs-events-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-api-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-libs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-fuse-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-server-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-cli-3.8.4-2.26.git0a405a4.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-2.26.git0a405a4.el7rhgs.x86_64

How reproducible:

1/1

Steps to Reproduce:
1.
2.
3.

Actual results:
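
The backtraces below were taken from the generated core files. A minimal sketch of how such a trace is collected (binary and core file paths are placeholders):

gdb /usr/sbin/glusterfsd /path/to/core    # snapd binary and core path are placeholders
(gdb) bt                                  # backtrace of the crashing thread
(gdb) f 27                                # select a frame of interest
(gdb) p *loc                              # print an argument (e.g. the loc_t) in that frame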


Bt from one node:
=======================

(gdb) bt
#0  0x00007f017908d1d7 in raise () from /lib64/libc.so.6
#1  0x00007f017908e8c8 in abort () from /lib64/libc.so.6
#2  0x00007f01790ccf07 in __libc_message () from /lib64/libc.so.6
#3  0x00007f01790d2da4 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f01790d5e41 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f01790d7fbc in malloc () from /lib64/libc.so.6
#6  0x00007f017ac81df1 in _dl_signal_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f017ac81f8e in _dl_signal_cerror () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f017ac7d18d in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#9  0x00007f017918a379 in do_sym () from /lib64/libc.so.6
#10 0x00007f0179a200d4 in dlsym_doit () from /lib64/libdl.so.2
#11 0x00007f017ac81ff4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#12 0x00007f0179a205bd in _dlerror_run () from /lib64/libdl.so.2
#13 0x00007f0179a20128 in dlsym () from /lib64/libdl.so.2
#14 0x00007f017a9a6238 in xlator_dynload (xl=0x7f015091beb0) at xlator.c:220
#15 0x00007f017a9a6e99 in xlator_set_type (xl=<optimized out>, type=type@entry=0x7f0150000c10 "features/read-only")
    at xlator.c:295
#16 0x00007f017aa1f02d in volume_type (type=0x7f0150000c10 "features/read-only") at ./graph.y:207
#17 graphyyparse () at ./graph.y:63
#18 0x00007f017aa1fde4 in glusterfs_graph_construct (fp=fp@entry=0x7f0150924cf0) at ./graph.y:590
#19 0x00007f016cd87019 in glfs_process_volfp (fs=fs@entry=0x7f015641e430, fp=fp@entry=0x7f0150924cf0)
    at glfs-mgmt.c:54
#20 0x00007f016cd872b6 in glfs_mgmt_getspec_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, 
    myframe=0x7f015586306c) at glfs-mgmt.c:604
#21 0x00007f017a773720 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f0154057fc0, pollin=pollin@entry=0x7f0151b36bb0)
    at rpc-clnt.c:791
#22 0x00007f017a7739ff in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f0154057ff0, event=<optimized out>, 
    data=0x7f0151b36bb0) at rpc-clnt.c:962
#23 0x00007f017a76f923 in rpc_transport_notify (this=this@entry=0x7f0154059850, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f0151b36bb0) at rpc-transport.c:541
#24 0x00007f016f261eb4 in socket_event_poll_in (this=this@entry=0x7f0154059850) at socket.c:2267
#25 0x00007f016f264365 in socket_event_handler (fd=<optimized out>, idx=0, data=0x7f0154059850, poll_in=1, 
    poll_out=0, poll_err=0) at socket.c:2397
#26 0x00007f017aa032e0 in event_dispatch_epoll_handler (event=0x7f0115ffae80, event_pool=0x7f015585e930)
    at event-epoll.c:571
#27 event_dispatch_epoll_worker (data=0x7f014c0008e0) at event-epoll.c:674
#28 0x00007f017980adc5 in start_thread () from /lib64/libpthread.so.0
#29 0x00007f017914f73d in clone () from /lib64/libc.so.6


========================================
bt from another node

(gdb) bt
#0  0x00007f5430b3f1d7 in raise () from /lib64/libc.so.6
#1  0x00007f5430b408c8 in abort () from /lib64/libc.so.6
#2  0x00007f5430b7ef07 in __libc_message () from /lib64/libc.so.6
#3  0x00007f5430b84da4 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f5430b87e41 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f5430b8aa14 in calloc () from /lib64/libc.so.6
#6  0x00007f543248154e in __gf_default_calloc (size=312, cnt=1) at mem-pool.h:118
#7  __gf_calloc (nmemb=nmemb@entry=1, size=size@entry=312, type=type@entry=42, typestr=typestr@entry=0x7f5426d1afd3 "gf_common_mt_ioq") at mem-pool.c:110
#8  0x00007f5426d10fe2 in __socket_ioq_new (this=this@entry=0x7f53f1b35b80, msg=msg@entry=0x7f541c13be20) at socket.c:999
#9  0x00007f5426d13af7 in socket_submit_request (this=0x7f53f1b35b80, req=0x7f541c13be20) at socket.c:3398
#10 0x00007f5432225de2 in rpc_clnt_submit (rpc=0x7f53f1adc880, prog=prog@entry=0x7f541c60be20 <clnt3_3_fop_prog>, procnum=procnum@entry=27, cbkfn=cbkfn@entry=0x7f541c3d6370 <client3_3_lookup_cbk>, proghdr=proghdr@entry=0x7f541c13bf60, 
    proghdrcount=1, progpayload=progpayload@entry=0x0, progpayloadcount=progpayloadcount@entry=0, iobref=iobref@entry=0x7f5400000cb0, frame=frame@entry=0x7f540c0491c0, rsphdr=0x0, rsphdr_count=rsphdr_count@entry=0, 
    rsp_payload=rsp_payload@entry=0x0, rsp_payload_count=rsp_payload_count@entry=0, rsp_iobref=rsp_iobref@entry=0x0) at rpc-clnt.c:1633
#11 0x00007f541c3c4ea2 in client_submit_request (this=this@entry=0x7f53f0003cb0, req=req@entry=0x7f541c13c240, frame=frame@entry=0x7f540c0491c0, prog=0x7f541c60be20 <clnt3_3_fop_prog>, procnum=procnum@entry=27, 
    cbkfn=cbkfn@entry=0x7f541c3d6370 <client3_3_lookup_cbk>, iobref=iobref@entry=0x0, rsphdr=rsphdr@entry=0x0, rsphdr_count=rsphdr_count@entry=0, rsp_payload=rsp_payload@entry=0x0, rsp_payload_count=rsp_payload_count@entry=0, 
    rsp_iobref=0x0, xdrproc=0x7f5432007a70 <xdr_gfs3_lookup_req>) at client.c:316
#12 0x00007f541c3e1e66 in client3_3_lookup (frame=0x7f540c0491c0, this=0x7f53f0003cb0, data=<optimized out>) at client-rpc-fops.c:3455
#13 0x00007f541c3bcaa0 in client_lookup (frame=0x7f540c0491c0, this=<optimized out>, loc=<optimized out>, xdata=<optimized out>) at client.c:541
#14 0x00007f541c188e1a in afr_inode_refresh_subvol_with_lookup (frame=frame@entry=0x7f540c04978c, this=this@entry=0x7f53f0007130, i=i@entry=0, inode=<optimized out>, gfid=gfid@entry=0x7f53f1453b30 "", xdata=xdata@entry=0x7f540c8cffe4)
    at afr-common.c:1102
#15 0x00007f541c18cb9b in afr_inode_refresh_do (frame=frame@entry=0x7f540c04978c, this=this@entry=0x7f53f0007130) at afr-common.c:1213
#16 0x00007f541c18cd59 in afr_inode_refresh (frame=frame@entry=0x7f540c04978c, this=this@entry=0x7f53f0007130, inode=0x7f53f1d3934c, gfid=gfid@entry=0x0, refreshfn=refreshfn@entry=0x7f541c18cd70 <afr_discover_do>) at afr-common.c:1249
#17 0x00007f541c18d5dc in afr_discover (frame=frame@entry=0x7f540c04978c, this=this@entry=0x7f53f0007130, loc=loc@entry=0x7f540c2ec15c, xattr_req=xattr_req@entry=0x7f540c8cff38) at afr-common.c:2581
#18 0x00007f541c18df6d in afr_lookup (frame=0x7f540c04978c, this=0x7f53f0007130, loc=0x7f540c2ec15c, xattr_req=0x7f540c8cff38) at afr-common.c:2687
#19 0x00007f54135ca374 in dht_lookup (frame=0x7f540c04a0a8, this=<optimized out>, loc=0x7f540c2ec15c, xattr_req=<optimized out>) at dht-common.c:2499
#20 0x00007f54324d4cab in default_lookup (frame=0x7f540c04a0a8, this=0x7f53f000a3d0, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at defaults.c:2572
#21 0x00007f54324d4cab in default_lookup (frame=0x7f540c04a0a8, this=0x7f53f000b7c0, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at defaults.c:2572
#22 0x00007f54324d4cab in default_lookup (frame=0x7f540c04a0a8, this=0x7f53f000cb00, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at defaults.c:2572
#23 0x00007f54324d4cab in default_lookup (frame=0x7f540c04a0a8, this=0x7f53f000de10, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at defaults.c:2572
#24 0x00007f53ff1dba2b in ioc_lookup (frame=0x7f540c049fd4, this=0x7f53f000f180, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at io-cache.c:285
#25 0x00007f53ffdfa59c in qr_lookup (frame=0x7f540c049368, this=0x7f53f0010440, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at quick-read.c:487
#26 0x00007f54324d4cab in default_lookup (frame=0x7f540c049368, this=0x7f53f0011750, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at defaults.c:2572
#27 0x00007f53fec2017e in mdc_lookup (frame=0x7f540c049510, this=0x7f53f0012a60, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at md-cache.c:1090
#28 0x00007f54324eea44 in default_lookup_resume (frame=0x7f540c049f00, this=0x7f53f0013db0, loc=0x7f540c2ec15c, xdata=0x0) at defaults.c:1872
#29 0x00007f543247e47d in call_resume (stub=0x7f540c2ec10c) at call-stub.c:2508
#30 0x00007f541f9ca743 in iot_worker (data=0x7f53f0918180) at io-threads.c:210
#31 0x00007f54312bcdc5 in start_thread () from /lib64/libpthread.so.0
#32 0x00007f5430c0173d in clone () from /lib64/libc.so.6
(gdb) f 27
#27 0x00007f53fec2017e in mdc_lookup (frame=0x7f540c049510, this=0x7f53f0012a60, loc=0x7f540c2ec15c, xdata=0x7f540c8cff38) at md-cache.c:1090
1090	        STACK_WIND (frame, mdc_lookup_cbk, FIRST_CHILD (this),
(gdb) p *loc
$1 = {path = 0x7f540cbfec10 "/", name = 0x7f540cbfec11 "", inode = 0x7f53f1d3934c, parent = 0x0, gfid = '\000' <repeats 15 times>, "\001", pargfid = '\000' <repeats 15 times>}
(gdb)

Comment 3 surabhi 2016-10-14 10:50:23 UTC
This private build is based on the 3.2 downstream build, which has client-io-threads enabled. One of the backtraces looks like the other client-io-threads issue tracked in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1380619.

There are two more crashes that need to be analysed to determine whether this is the same issue.
Poornima is working on it and will update her findings.
Until then we are disabling client-io-threads and continuing to test.
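
A minimal sketch of that workaround, assuming a hypothetical volume named vol0:

gluster volume set vol0 performance.client-io-threads off    # vol0 is a placeholder volume name
gluster volume get vol0 performance.client-io-threads        # confirm the option took effect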

Comment 4 Atin Mukherjee 2016-10-24 04:02:13 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/87972

Comment 8 Poornima G 2016-10-27 08:33:53 UTC
This probably looks like a client-io-threads related issue. Please verify by disabling client-io-threads.

Comment 9 Atin Mukherjee 2016-10-27 09:03:36 UTC
(In reply to Poornima G from comment #8)
> This probably looks like a client-io-threads related issue. Please verify by
> disabling client-io-threads.

This bug is already moved to ON_QA. There is no point in verifying this issue with client io-threads disabled. Given that we have fixed the crash in glusterfs-3.8.4-3, it can be tested with the current build.

Comment 10 Anil Shah 2016-11-08 10:57:55 UTC
Not seeing the snapd crash after doing parallel operations from multiple clients.
Bug verified on build glusterfs-3.8.4-3.el7rhgs.x86_64

Comment 12 errata-xmlrpc 2017-03-23 06:10:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html