Bug 1435357 - [GSS]RHGS 3.1.3 glusterfs client crash on io-cache.so(__ioc_page_wakeup+0x44)
Summary: [GSS]RHGS 3.1.3 glusterfs client crash on io-cache.so(__ioc_page_wakeup+0x44)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: io-cache
Version: rhgs-3.1
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Nithya Balachandran
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1456385 1457054 1457058
Blocks: 1417151
 
Reported: 2017-03-23 15:30 UTC by Riyas Abdulrasak
Modified: 2020-12-14 08:23 UTC
CC: 6 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: Bug Fix
Doc Text:
The ioc_inode_wakeup process did not lock the ioc_inode queue. This meant that the ioc_prune process could free a structure that ioc_inode_wakeup later attempted to access, resulting in an unexpected termination of the gluster mount process. The ioc_inode queue is now locked during access so that this issue cannot occur.
Clone Of:
: 1456385 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:35:56 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Riyas Abdulrasak 2017-03-23 15:30:36 UTC
Description of problem:

The GlusterFS FUSE client was crashing with the backtrace below.

~~~~~~~~~
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2017-03-22 22:31:32
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.9
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f45f3c3e1c2]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f45f3c6396d]
/lib64/libc.so.6(+0x35670)[0x7f45f232a670]
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so(__ioc_page_wakeup+0x44)[0x7f45e525e5b4]
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so(ioc_inode_wakeup+0x164)[0x7f45e525ffa4]
/usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so(ioc_cache_validate_cbk+0x31b)[0x7f45e5257b2b]
/usr/lib64/glusterfs/3.7.9/xlator/performance/read-ahead.so(ra_attr_cbk+0x11a)[0x7f45e566edfa]
/lib64/libglusterfs.so.0(default_fstat_cbk+0x11a)[0x7f45f3c47ada]
/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_file_attr_cbk+0x1c5)[0x7f45e5aea505]
/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_fstat_cbk+0x131)[0x7f45e5d27de1]
/usr/lib64/glusterfs/3.7.9/xlator/protocol/client.so(client3_3_fstat_cbk+0x44e)[0x7f45e5fa7f8e]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f45f3a0c990]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1bf)[0x7f45f3a0cc4f]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f45f3a08793]
/usr/lib64/glusterfs/3.7.9/rpc-transport/socket.so(+0x69b4)[0x7f45e86a19b4]
/usr/lib64/glusterfs/3.7.9/rpc-transport/socket.so(+0x95f4)[0x7f45e86a45f4]
/lib64/libglusterfs.so.0(+0x94c0a)[0x7f45f3cacc0a]
/lib64/libpthread.so.0(+0x7dc5)[0x7f45f2aa6dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f45f23ebced]
~~~~~~~~~


* There were a large number of the following messages in the client log just before the crash.

~~~~~~~~~~
[2017-03-23 08:41:29.936098] W [MSGID: 108027] [afr-common.c:2250:afr_discover_done] 4-vCDN-replicate-2: no read subvols for /
The message "W [MSGID: 108027] [afr-common.c:2250:afr_discover_done] 4-vCDN-replicate-2: no read subvols for /" repeated 90 times between [2017-03-23 08:41:29.936098] and [2017-03-23 08:43:28.210919]
~~~~~~~~~~

* These messages were caused by a metadata split-brain on some directories, including the volume root "/".


Version-Release number of selected component (if applicable):
RHGS 3.1.3
glusterfs-3.7.9-12.el7.x86_64


How reproducible:

A couple of times in the customer environment

Actual results:

The glusterfs FUSE client crashed, and the mount point started returning "Transport endpoint is not connected" errors.


Expected results:

The glusterfs FUSE client should not crash.


Additional info:

* An application core dump was collected. It gave the following backtrace:

(gdb) bt
#0  0x00007f45e525e5b4 in __ioc_page_wakeup (page=0x7f43246e1500, page@entry=0x7f45f17d0d64, op_errno=0) at page.c:960
#1  0x00007f45e525ffa4 in ioc_inode_wakeup (frame=0x7f45e00396c8, frame@entry=0x7f45f17d0d64, ioc_inode=ioc_inode@entry=0x7f45e0e62160, stbuf=stbuf@entry=0x7f45e69cca10) at ioc-inode.c:119
#2  0x00007f45e5257b2b in ioc_cache_validate_cbk (frame=0x7f45f17d0d64, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=<optimized out>, stbuf=<optimized out>, xdata=0x0)
    at io-cache.c:402
#3  0x00007f45e566edfa in ra_attr_cbk (frame=0x7f45f17e22e0, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, buf=0x7f45e69cca10, xdata=0x0) at read-ahead.c:721
#4  0x00007f45f3c47ada in default_fstat_cbk (frame=0x7f45f17b7188, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, buf=0x7f45e69cca10, xdata=0x0) at defaults.c:1053
#5  0x00007f45e5aea505 in dht_file_attr_cbk (frame=0x7f45f17ba090, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, stbuf=<optimized out>, 
    xdata=0x0) at dht-inode-read.c:214
#6  0x00007f45e5d27de1 in afr_fstat_cbk (frame=0x7f45f17562d8, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, buf=0x7f45e69cca10, xdata=0x0) at afr-inode-read.c:291
#7  0x00007f45e5fa7f8e in client3_3_fstat_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f45f17e1c28) at client-rpc-fops.c:1574
#8  0x00007f45f3a0c990 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f45e03547c0, pollin=pollin@entry=0x7f45e1033480) at rpc-clnt.c:764
#9  0x00007f45f3a0cc4f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f45e03547f0, event=<optimized out>, data=0x7f45e1033480) at rpc-clnt.c:905
#10 0x00007f45f3a08793 in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at rpc-transport.c:546
#11 0x00007f45e86a19b4 in socket_event_poll_in (this=0x7f45e0364440) at socket.c:2355
#12 0x00007f45e86a45f4 in socket_event_handler (fd=<optimized out>, idx=8, data=0x7f45e0364440, poll_in=1, poll_out=0, poll_err=0) at socket.c:2469
#13 0x00007f45f3cacc0a in event_dispatch_epoll_handler (event=0x7f45e69cce80, event_pool=0x7f45f507c350) at event-epoll.c:570
#14 event_dispatch_epoll_worker (data=0x7f45f50d2ff0) at event-epoll.c:678
#15 0x00007f45f2aa6dc5 in start_thread (arg=0x7f45e69cd700) at pthread_create.c:308
#16 0x00007f45f23ebced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


* It appears that the glusterfs FUSE client was crashing in the following function, __ioc_page_wakeup() in page.c (the line numbers match backtrace frame #0 at page.c:960):

 __ioc_page_wakeup (ioc_page_t *page, int32_t op_errno)
    948 {
    949         ioc_waitq_t  *waitq = NULL, *trav = NULL;
    950         call_frame_t *frame = NULL;
    951         int32_t       ret   = -1;
    952 
    953         GF_VALIDATE_OR_GOTO ("io-cache", page, out);
    954 
    955         waitq = page->waitq;
    956         page->waitq = NULL;
    957 
    958         page->ready = 1;
    959 
    960         gf_msg_trace (page->inode->table->xl->name, 0,
    961                       "page is %p && waitq = %p", page, waitq);
    962 
    963         for (trav = waitq; trav; trav = trav->next) {
    964                 frame = trav->data;
    965                 ret = __ioc_frame_fill (page, frame, trav->pending_offset,
    966                                         trav->pending_size, op_errno);
    967                 if (ret == -1) {
    968                         break;
    969                 }                                        
    970         }
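
The Doc Text above summarizes the underlying race: ioc_prune could free the structure while ioc_inode_wakeup was still walking the wait queue, so the traversal in the loop above dereferenced freed memory (frame #0 faults on a queued page). To make that pattern concrete, here is a minimal, self-contained C sketch. None of the names in it (inode_ctx, waitq, prune(), wakeup()) are glusterfs identifiers; it only illustrates the queue-locking idea, not the actual io-cache change.

~~~~~~~~~~
/* Minimal sketch (NOT glusterfs source) of the race described above:
 * one thread walks a wait queue while another thread prunes (frees) it.
 * All names here are hypothetical stand-ins. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct waitq {
        struct waitq *next;
        int           pending;   /* stand-in for pending_offset/size */
};

struct inode_ctx {
        pthread_mutex_t lock;    /* guards 'waitq' */
        struct waitq   *waitq;
};

/* prune path: detaches the queue under the lock, then frees its
 * private copy (analogous to ioc_prune releasing cached state). */
static void prune(struct inode_ctx *ctx)
{
        pthread_mutex_lock(&ctx->lock);
        struct waitq *trav = ctx->waitq;
        ctx->waitq = NULL;
        pthread_mutex_unlock(&ctx->lock);

        while (trav) {
                struct waitq *next = trav->next;
                free(trav);
                trav = next;
        }
}

/* wakeup path: also detaches the queue under the lock before walking
 * it.  Without the lock (the buggy pattern), this loop can dereference
 * entries the prune path already freed -> SIGSEGV, as in frame #0. */
static void wakeup(struct inode_ctx *ctx)
{
        pthread_mutex_lock(&ctx->lock);
        struct waitq *waitq = ctx->waitq;
        ctx->waitq = NULL;
        pthread_mutex_unlock(&ctx->lock);

        for (struct waitq *trav = waitq; trav; ) {
                struct waitq *next = trav->next;
                printf("serving pending request %d\n", trav->pending);
                free(trav);
                trav = next;
        }
}

int main(void)
{
        struct inode_ctx ctx = { .lock  = PTHREAD_MUTEX_INITIALIZER,
                                 .waitq = NULL };

        /* queue two fake pending requests */
        for (int i = 0; i < 2; i++) {
                struct waitq *w = calloc(1, sizeof(*w));
                w->pending = i;
                w->next    = ctx.waitq;
                ctx.waitq  = w;
        }

        wakeup(&ctx);   /* becomes sole owner of the detached list */
        prune(&ctx);    /* finds an empty queue, nothing left to free */
        return 0;
}
~~~~~~~~~~

Whichever path detaches the queue while holding the lock becomes its sole owner, so the other path can no longer free entries out from under it. Per the Doc Text, the shipped fix achieves the same effect by locking the ioc_inode queue during access; see the upstream patch in comment 12 for the exact change.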

Comment 12 Atin Mukherjee 2017-05-29 10:40:02 UTC
upstream patch : https://review.gluster.org/17410

Comment 17 Nithya Balachandran 2017-08-17 06:09:29 UTC
Doc text looks fine.

Comment 19 errata-xmlrpc 2017-09-21 04:35:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

