Bug 1689785

Summary: systemic: Brick crashed after an unexpected system reboot
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED CURRENTRELEASE
QA Contact: Rahul Hinduja <rhinduja>
Severity: medium
Docs Contact:
Priority: medium
Version: rhgs-3.4
CC: moagrawa, pasik, rhs-bugs, sheggodu, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-28 09:49:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2019-03-18 06:24:24 UTC
Description of problem:
========================
On my non-functional (system test) setup, one of the server nodes got rebooted (I was unable to find the cause). After the reboot, all bricks came back online except one.
I checked that brick's log and found a backtrace, which is why the brick is not online.
Unfortunately, I did not find any cores.

[2019-03-17 23:55:53.369644] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-17 23:56:06.340888] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-03-17 23:56:06.341009] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2019-03-17 23:56:06.341053] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2019-03-17 23:56:06.341168] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
[2019-03-17 23:56:06.341197] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
[2019-03-17 23:56:06.341268] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
[2019-03-17 23:56:06.341747] I [rpcsvc.c:2582:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2019-03-17 23:56:06.341865] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-rpcx3-server: option 'listen-port' is deprecated, preferred is 'transport.socket.listen-port', continuing with correction
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
[2019-03-17 23:56:06.343189] W [socket.c:3973:reconfigure] 0-rpcx3-quota: disabling non-blocking IO
time of crash: 
2019-03-17 23:56:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
[2019-03-17 23:56:06.343330] I [socket.c:2489:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fccf0cf9b9d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fccf0d04114]
/lib64/libc.so.6(+0x36280)[0x7fccef336280]
/lib64/libpthread.so.0(pthread_mutex_lock+0x0)[0x7fccefb37c30]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/server.so(+0x985d)[0x7fccdb57885d]
/lib64/libgfrpc.so.0(+0x7685)[0x7fccf0a95685]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x65)[0x7fccf0a99985]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccf0a9bae3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0xce77)[0x7fcce58c3e77]
/lib64/libglusterfs.so.0(+0x8a870)[0x7fccf0d58870]
/lib64/libpthread.so.0(+0x7dd5)[0x7fccefb35dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fccef3fdead]
---------
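
The faulting frame is pthread_mutex_lock, called from the protocol/server xlator (+0x985d) via rpcsvc_notify/rpc_transport_notify out of the socket event handler, right around the EPOLLERR disconnect logged above. A plausible failure mode (an assumption on my part, not confirmed against the GlusterFS sources) is that the lock is taken through per-connection state that is already NULL or freed by the time the disconnect notification runs. A minimal, self-contained C sketch of how that shape of bug produces signal 11:

/* segv_sketch.c - illustrates the assumed failure mode only; this is not
 * GlusterFS code. Locking a mutex reached through a NULL (or already freed)
 * connection object dereferences invalid memory and raises SIGSEGV (11).
 * Build: gcc -pthread -o segv_sketch segv_sketch.c */
#include <pthread.h>
#include <stddef.h>

struct conn_state {
    pthread_mutex_t lock;       /* per-connection lock */
    int             refcount;
};

/* Hypothetical disconnect callback: mirrors the shape of a transport-notify
 * handler that trusts 'conn' to still be valid. */
static void on_disconnect(struct conn_state *conn)
{
    pthread_mutex_lock(&conn->lock);    /* crashes here when conn is NULL */
    conn->refcount--;
    pthread_mutex_unlock(&conn->lock);
}

int main(void)
{
    struct conn_state *conn = NULL;     /* connection state already torn down */
    on_disconnect(conn);                /* SIGSEGV, matching the backtrace's
                                           pthread_mutex_lock frame */
    return 0;
}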




Version-Release number of selected component (if applicable):
====================
3.12.2-43


How reproducible:
===============
Hit it once so far, on my system-test setup used for the rpc tests.

Steps to Reproduce:
====================
more details @ https://docs.google.com/spreadsheets/d/17Yf9ZRWnWOpbRyFQ2ZYxAAlp9I_yarzKZdjN8idBJM0/edit#gid=1472913705
1. System tests had been running for about 3 weeks.
2. In the current state, a rebalance is still in progress and has been running for more than 2 weeks
(refer bz#1686425)
3. Apart from the above, I set client and server event threads to 8 as part of https://bugzilla.redhat.com/show_bug.cgi?id=1409568#c31
4. IOs going on from the clients are as below:
 a) 4 clients: each just appending to a file named after its own host name (all different); see the sketch after this list
 b) 1 more client: only on this client, I remounted the volume after setting the event threads. From this client, running IOs as explained in https://bugzilla.redhat.com/show_bug.cgi?id=1409568#c31 and the previous comments
 c) 2 more clients: running the IOs below (detached screen sessions):
    2109.lookup  (Detached) ---> find * | xargs stat from the root of the mount
    1074.top     (Detached) ---> top and free output captured to a file on the mount every minute, in append mode
    1058.rm-rf   (Detached) ---> removal of old untarred linux directories
    801.kernel   (Detached) ---> linux untar into new directories, under the same parent dir as above
 du -sh --> on the root of the volume from only one of the clients; not yet finished even after a week
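
For clarity, below is a minimal sketch of the appender workload described in step 4(a). It is an assumption of what each of the 4 clients runs, not the actual test script; the mount point /mnt/rpcx3, the one-second pacing, and the payload format are all illustrative.

/* appender.c - hedged sketch of the per-client append workload in step 4(a).
 * Assumption: each client appends a small timestamped line to a file on the
 * Gluster mount named after its own hostname.
 * Build: gcc -o appender appender.c */
#include <limits.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char host[HOST_NAME_MAX + 1] = {0};
    char path[PATH_MAX];

    if (gethostname(host, sizeof(host) - 1) != 0) {
        perror("gethostname");
        return 1;
    }
    /* /mnt/rpcx3 is an assumed client mount point for the rpcx3 volume. */
    snprintf(path, sizeof(path), "/mnt/rpcx3/%s", host);

    for (;;) {
        FILE *fp = fopen(path, "a");   /* reopen in append mode each pass */
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }
        fprintf(fp, "%s %ld\n", host, (long)time(NULL));
        fclose(fp);
        sleep(1);                      /* assumed pacing between appends */
    }
}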






Additional info:
===============
[root@rhs-client19 glusterfs]# gluster v status
Status of volume: rpcx3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       10824
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       5232 
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3                        49152     0          Y       10898
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        49153     0          Y       5253 
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        49153     0          Y       10904
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3                        N/A       N/A        N       N/A  
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49154     0          Y       10998
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49153     0          Y       8999 
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3                        49153     0          Y       10826
Brick rhs-client38.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3-newb                   49154     0          Y       8984 
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick2/rpcx3-newb                   49155     0          Y       29805
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick3/rpcx3-newb                   49155     0          Y       30021
Brick rhs-client19.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       29826
Brick rhs-client25.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       30042
Brick rhs-client32.lab.eng.blr.redhat.com:/
gluster/brick1/rpcx3-newb                   49156     0          Y       1636 
Snapshot Daemon on localhost                49154     0          Y       10872
Self-heal Daemon on localhost               N/A       N/A        Y       29849
Quota Daemon on localhost                   N/A       N/A        Y       29860
Snapshot Daemon on rhs-client25.lab.eng.blr
.redhat.com                                 49154     0          Y       9833 
Self-heal Daemon on rhs-client25.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       30065
Quota Daemon on rhs-client25.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       30076
Snapshot Daemon on rhs-client38.lab.eng.blr
.redhat.com                                 49155     0          Y       9214 
Self-heal Daemon on rhs-client38.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       8958 
Quota Daemon on rhs-client38.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       8969 
Snapshot Daemon on rhs-client32.lab.eng.blr
.redhat.com                                 49155     0          Y       11221
Self-heal Daemon on rhs-client32.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       1658 
Quota Daemon on rhs-client32.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       1668 
 
Task Status of Volume rpcx3
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 2cd252ed-3202-4c7f-99bd-6326058c797f
Status               : in progress         
 
[root@rhs-client19 glusterfs]# gluster v info
 
Volume Name: rpcx3
Type: Distributed-Replicate
Volume ID: f7532c65-63d0-4e4a-a5b5-c95238635eff
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick4: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick5: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick6: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick7: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick8: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick9: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick10: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick11: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3-newb
Brick12: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick13: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick14: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick15: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Options Reconfigured:
client.event-threads: 8
server.event-threads: 8
cluster.rebal-throttle: aggressive
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.uss: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on


#########################
sosreports and logs to follow

Comment 9 Yaniv Kaul 2019-10-28 13:16:11 UTC
Has anyone looked at this?

Comment 13 Red Hat Bugzilla 2023-09-14 05:25:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days