Description of problem:
========================
On my non-functional setup, one of the server nodes got rebooted (unable to find the cause), and after that all bricks were online except one. I checked the brick log and found a backtrace, which is why that brick is not online. Unfortunately, I didn't find any cores.

[2019-03-17 23:55:53.369644] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-17 23:56:06.340888] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-03-17 23:56:06.341009] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3
[2019-03-17 23:56:06.341053] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4
[2019-03-17 23:56:06.341168] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 6
[2019-03-17 23:56:06.341197] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 5
[2019-03-17 23:56:06.341268] I [MSGID: 101190] [event-epoll.c:676:event_dispatch_epoll_worker] 0-epoll: Started thread with index 7
[2019-03-17 23:56:06.341747] I [rpcsvc.c:2582:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2019-03-17 23:56:06.341865] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-rpcx3-server: option 'listen-port' is deprecated, preferred is 'transport.socket.listen-port', continuing with correction
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
[2019-03-17 23:56:06.343189] W [socket.c:3973:reconfigure] 0-rpcx3-quota: disabling non-blocking IO
time of crash:
2019-03-17 23:56:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
[2019-03-17 23:56:06.343330] I [socket.c:2489:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fccf0cf9b9d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fccf0d04114]
/lib64/libc.so.6(+0x36280)[0x7fccef336280]
/lib64/libpthread.so.0(pthread_mutex_lock+0x0)[0x7fccefb37c30]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/server.so(+0x985d)[0x7fccdb57885d]
/lib64/libgfrpc.so.0(+0x7685)[0x7fccf0a95685]
/lib64/libgfrpc.so.0(rpcsvc_notify+0x65)[0x7fccf0a99985]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccf0a9bae3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0xce77)[0x7fcce58c3e77]
/lib64/libglusterfs.so.0(+0x8a870)[0x7fccf0d58870]
/lib64/libpthread.so.0(+0x7dd5)[0x7fccefb35dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fccef3fdead]
---------

Version-Release number of selected component (if applicable):
====================
3.12.2-43

How reproducible:
===============
Hit it once on my system setup for rpc tests.

Steps to Reproduce:
====================
More details @ https://docs.google.com/spreadsheets/d/17Yf9ZRWnWOpbRyFQ2ZYxAAlp9I_yarzKZdjN8idBJM0/edit#gid=1472913705
1. Was running system tests for about 3 weeks.
2. In the current state, a rebalance has still been going on for more than 2 weeks (refer bz#1686425).
3. Apart from the above, I set client and server event threads to 8 as part of https://bugzilla.redhat.com/show_bug.cgi?id=1409568#c31.
4. IOs going on from the clients are as below:
   a) 4 clients: just appending to a file whose name is the same as the host name (all different).
   b) Another client: only on this client, I remounted the volume after setting the event threads. From this client, running IOs as explained in https://bugzilla.redhat.com/show_bug.cgi?id=1409568#c31 and previous comments.
   c) From another 2 clients, running the below IOs:
      2109.lookup (Detached) ---> find * | xargs stat from root of mount
      1074.top (Detached)    ---> top and free output captured every minute to a file on the mount, in append mode
      1058.rm-rf (Detached)  ---> removal of old untarred linux directories
      801.kernel (Detached)  ---> linux untar into new directories, under the same parent dir as above
      du -sh                 ---> on root of volume from only one of the clients, not yet over even after a week

Additional info:
===============
[root@rhs-client19 glusterfs]# gluster v status
Status of volume: rpcx3
Gluster process                                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3       49152     0          Y       10824
Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3       49152     0          Y       5232
Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3       49152     0          Y       10898
Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3       49153     0          Y       5253
Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3       49153     0          Y       10904
Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3       N/A       N/A        N       N/A
Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3       49154     0          Y       10998
Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3       49153     0          Y       8999
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3       49153     0          Y       10826
Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb  49154     0          Y       8984
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3-newb  49155     0          Y       29805
Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb  49155     0          Y       30021
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb  49156     0          Y       29826
Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb  49156     0          Y       30042
Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb  49156     0          Y       1636
Snapshot Daemon on localhost                                          49154     0          Y       10872
Self-heal Daemon on localhost                                         N/A       N/A        Y       29849
Quota Daemon on localhost                                             N/A       N/A        Y       29860
Snapshot Daemon on rhs-client25.lab.eng.blr.redhat.com                49154     0          Y       9833
Self-heal Daemon on rhs-client25.lab.eng.blr.redhat.com               N/A       N/A        Y       30065
Quota Daemon on rhs-client25.lab.eng.blr.redhat.com                   N/A       N/A        Y       30076
Snapshot Daemon on rhs-client38.lab.eng.blr.redhat.com                49155     0          Y       9214
Self-heal Daemon on rhs-client38.lab.eng.blr.redhat.com               N/A       N/A        Y       8958
Quota Daemon on rhs-client38.lab.eng.blr.redhat.com                   N/A       N/A        Y       8969
Snapshot Daemon on rhs-client32.lab.eng.blr.redhat.com                49155     0          Y       11221
Self-heal Daemon on rhs-client32.lab.eng.blr.redhat.com               N/A       N/A        Y       1658
Quota Daemon on rhs-client32.lab.eng.blr.redhat.com                   N/A       N/A        Y       1668

Task Status of Volume rpcx3
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 2cd252ed-3202-4c7f-99bd-6326058c797f
Status               : in progress

[root@rhs-client19 glusterfs]# gluster v info

Volume Name: rpcx3
Type: Distributed-Replicate
Volume ID: f7532c65-63d0-4e4a-a5b5-c95238635eff
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick4: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick5: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick6: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick7: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick8: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick9: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick10: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick11: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3-newb
Brick12: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3-newb
Brick13: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick14: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Brick15: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3-newb
Options Reconfigured:
client.event-threads: 8
server.event-threads: 8
cluster.rebal-throttle: aggressive
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.uss: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on

#########################
sosreports and logs to follow
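Note for whoever picks this up: even without a core, the anonymous frame offsets in the backtrace above can be mapped to functions and source lines once the matching debuginfo is on hand. A rough sketch, assuming the glusterfs-debuginfo package matching 3.12.2-43 is installed on the node that logged the crash (the paths and offsets are copied from the trace above; actual function names will only resolve correctly against the exact same build):

# Resolve the protocol/server frame (offset +0x985d) to a function/line:
addr2line -f -C -e /usr/lib64/glusterfs/3.12.2/xlator/protocol/server.so 0x985d

# Same for the unnamed libgfrpc.so.0 frame (offset +0x7685):
addr2line -f -C -e /lib64/libgfrpc.so.0 0x7685

The crashing frame sits under pthread_mutex_lock called from server.so via rpcsvc_notify/rpc_transport_notify, so resolving that one offset should show which server-side notify path dereferenced a bad pointer.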
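Also, since no core was found for this SIGSEGV, it may be worth confirming that core dumps are actually enabled on the brick nodes before the next occurrence. A minimal check, assuming stock RHEL 7 defaults (these are generic commands, not specific to gluster):

# Where the kernel would write a core, or whether it is piped to abrt/systemd-coredump:
cat /proc/sys/kernel/core_pattern

# Core size limit in the current shell; daemons started by glusterd/systemd carry their own limit,
# and a value of 0 means no core file is written:
ulimit -c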
Has anyone looked at this?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days