Bug 1443980

Summary: [New] - Replacing an arbiter brick while I/O happens causes vm pause
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: arbiter
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-26
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1449610 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:39:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1277939, 1417151, 1449610, 1450933, 1450934, 1450937    

Description RamaKasturi 2017-04-20 11:24:32 UTC
Description of problem:
I see that while an arbiter brick is being replaced with I/O in progress, all the VMs go to a paused state.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-18.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC setup
2. Create VMs.
3. Start running fio on the VMs.
4. While fio is running, replace the arbiter brick in the volume from the UI.

Actual results:
Replacing the arbiter brick causes the VMs to pause.

Expected results:
VMs should not pause when the arbiter brick is replaced.

Additional info:

Comment 2 RamaKasturi 2017-04-20 11:44:59 UTC
sosreports can be found in the link below:
===============================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1443980/

Comment 3 Ravishankar N 2017-05-08 12:35:23 UTC
Notes:
I was able to re-create the issue with glusterfs-3.8.4-18 on the setup loaned from Kasturi, but when I tried a more recent source with the following commits on top of 3.8.4-18, I was not able to hit it any more:
------------------------
* 586b35b - cluster/dht: rm -rf fails if dir has stale linkto files (3 days ago) <N Balachandran>
* 9a8cc33 - glusterd/geo-rep: Fix snapshot create in geo-rep setup (4 days ago) <Kotresh HR>
* 6494b76 - features/shard: Initialize local->fop in readv (4 days ago) <Krutika Dhananjay>
* 4a6bdc7 - features/shard: Fix vm corruption upon fix-layout (4 days ago) <Krutika Dhananjay>
* f25b41d - nfs: make subdir mounting work for Solaris 10 clients (4 days ago) <Bipin Kunal>
* 8606249 - glusterd: Fix snapshot failure in non-root geo-rep setup (4 days ago) <Kotresh HR>
* 37edb3f - rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup (4 weeks ago) <Atin Mukherjee>
* 795a095 - features/shard: Pass the correct iatt for cache invalidation (4 weeks ago) <Krutika Dhananjay>
------------------------
I have requested Kasturi to try the repro on an rhgs-3.3 build, but in the meantime I am continuing to debug it on 3.8.4-18 to find the RCA.


From the fuse mount logs, I see that *after* the graph switch, fuse_readv_cbk is failing with ENOTCONN, which should not happen.

[2017-05-04 12:34:20.865336] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse: switched to graph 15

[2017-05-04 12:34:20.894025] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 913157: READ => -1 gfid=036fe3b4-27b2-4103-825c-508e87635498 fd=0x7f15a53822c8 (Transport endpoint is not connected)


On attaching gdb to fuse_readv_cbk, I see that the mknod FOP has failed with ENOTCONN:
#0  fuse_readv_cbk (frame=0x7fb1944b3208, cookie=<optimized out>, this=0x7fb1989b2dc0, op_ret=-1, op_errno=107, vector=0x0, count=0, stbuf=0x0, iobref=0x0, xdata=0x0) at fuse-bridge.c:2223
#1  0x00007fb18e32102d in io_stats_readv_cbk (frame=0x7fb1944ac9b0, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107, vector=0x0, count=0, buf=0x0, iobref=0x0, xdata=0x0) at io-stats.c:2035
#2  0x00007fb196a40072 in default_readv_cbk (frame=0x7fb1944af244, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107, vector=0x0, count=0, stbuf=0x0, iobref=0x0, xdata=0x0) at defaults.c:1105
#3  0x00007fb196a40072 in default_readv_cbk (frame=0x7fb1944b55a4, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107, vector=0x0, count=0, stbuf=0x0, iobref=0x0, xdata=0x0) at defaults.c:1105
#4  0x00007fb18eb6d0b9 in shard_post_mknod_readv_handler (frame=0x7fb1944a4604, this=<optimized out>) at shard.c:3225
#5  0x00007fb18eb679d8 in shard_common_mknod_cbk (frame=0x7fb1944a4604, cookie=<optimized out>, this=0x7fb188217230, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7fb184037468, preparent=0x7fb1840374d8,
    postparent=0x7fb184037548, xdata=0x0) at shard.c:3270
#6  0x00007fb18eda92a9 in dht_newfile_cbk (frame=0x7fb1944a2e00, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=107, inode=0x7fb17211d418, stbuf=0x7fb184037468, preparent=0x7fb1840374d8, postparent=0x7fb184037548,
    xdata=0x0) at dht-common.c:5305
#7  0x00007fb18f006b17 in afr_mknod_unwind (frame=<optimized out>, this=<optimized out>) at afr-dir-write.c:554
#8  0x00007fb18f028f31 in __afr_txn_write_done (frame=0x7fb1944b3b24, this=0x7fb188215af0) at afr-transaction.c:195
#9  0x00007fb18f02e0ab in afr_unlock_common_cbk (frame=frame@entry=0x7fb1944b3b24, this=this@entry=0x7fb188215af0, xdata=0x0, op_errno=<optimized out>, op_ret=<optimized out>, cookie=<optimized out>) at afr-lk-common.c:633
#10 0x00007fb18f02e9a4 in afr_unlock_entrylk_cbk (frame=0x7fb1944b3b24, cookie=<optimized out>, this=0x7fb188215af0, op_ret=<optimized out>, op_errno=<optimized out>, xdata=<optimized out>) at afr-lk-common.c:829
#11 0x00007fb18f291991 in client3_3_entrylk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7fb1944a3e90) at client-rpc-fops.c:1657
#12 0x00007fb19678d850 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7fb188706c70, pollin=pollin@entry=0x7fb1800a4a00) at rpc-clnt.c:792
#13 0x00007fb19678db3f in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fb188706ca0, event=<optimized out>, data=0x7fb1800a4a00) at rpc-clnt.c:985
#14 0x00007fb1967899f3 in rpc_transport_notify (this=this@entry=0x7fb188760180, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fb1800a4a00) at rpc-transport.c:538
#15 0x00007fb191577314 in socket_event_poll_in (this=this@entry=0x7fb188760180) at socket.c:2272
#16 0x00007fb1915797c5 in socket_event_handler (fd=<optimized out>, idx=5, data=0x7fb188760180, poll_in=1, poll_out=0, poll_err=0) at socket.c:2402
#17 0x00007fb196a1d770 in event_dispatch_epoll_handler (event=0x7fb18ce83e80, event_pool=0x7fb1989a4f00) at event-epoll.c:571
#18 event_dispatch_epoll_worker (data=0x7fb18803a330) at event-epoll.c:674
#19 0x00007fb195824dc5 in start_thread (arg=0x7fb18ce84700) at pthread_create.c:308
#20 0x00007fb19516973d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Continuing to investigate why mknod failed.

Comment 4 Ravishankar N 2017-05-08 13:36:32 UTC
More from the mount logs:
[2017-05-04 12:34:20.893105] D [MSGID: 0] [dht-common.c:5380:dht_mknod_wind_to_avail_subvol] 15-data-dht: creating /.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 on data-replicate-0

[2017-05-04 12:34:20.893533] D [MSGID: 114031] [client-rpc-fops.c:237:client3_3_mknod_cbk] 15-data-client-0: remote operation failed. Path: /.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 [File exists]

[2017-05-04 12:34:20.893575] D [MSGID: 114031] [client-rpc-fops.c:237:client3_3_mknod_cbk] 15-data-client-1: remote operation failed. Path: /.shard/036fe3b4-27b2-4103-825c-508e87635498.2446 [File exists]


It looks like the shard xlator sends a mknod instead of a lookup. Since the shard file is already present on the 1st and 2nd bricks, we get EEXIST from them. On the 3rd brick (the arbiter), which has just been replaced, it succeeds. In arbiter volumes, AFR sends ENOTCONN if the FOP did not meet quorum, so the mknod fails with ENOTCONN.

I think shard is tuned to ignore only EEXIST errors, but AFR sends ENOTCONN:
[2017-05-04 12:34:20.893975] D [MSGID: 0] [shard.c:3260:shard_common_mknod_cbk] 15-data-shard: mknod of shard 2446 failed: Transport endpoint is not connected

So it propagates the error up to FUSE.
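
To make the interplay concrete, here is a minimal stand-alone C sketch of the failure path as I understand it. The names (brick_reply, afr_like_mknod_reply, shard_like_mknod_cbk) are hypothetical illustrations, not the actual shard/AFR code; the sketch only models the behaviour described above:
------------------------
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NUM_BRICKS 3

/* Per-brick mknod result, as reported by the client xlator callbacks. */
struct brick_reply {
    int op_ret;
    int op_errno;
};

/* AFR-like aggregation: if the FOP did not succeed on a quorum of bricks,
 * report ENOTCONN regardless of what the individual bricks said. */
static int afr_like_mknod_reply(const struct brick_reply replies[NUM_BRICKS],
                                int *op_errno)
{
    int successes = 0;

    for (int i = 0; i < NUM_BRICKS; i++)
        if (replies[i].op_ret == 0)
            successes++;

    if (successes < 2) {            /* quorum of 2 out of 3 not met */
        *op_errno = ENOTCONN;       /* blanket error that hides the EEXIST */
        return -1;
    }
    *op_errno = 0;
    return 0;
}

/* Shard-like mknod callback: EEXIST is benign (the shard already exists);
 * any other failure is propagated up, all the way to fuse_readv_cbk. */
static int shard_like_mknod_cbk(int op_ret, int op_errno)
{
    if (op_ret < 0 && op_errno != EEXIST) {
        fprintf(stderr, "mknod of shard failed: %s\n", strerror(op_errno));
        return -1;
    }
    printf("shard present, proceeding with the readv\n");
    return 0;
}

int main(void)
{
    /* Bricks 0 and 1 already have the shard (EEXIST); brick 2 is the
     * just-replaced arbiter, where the mknod succeeds. */
    struct brick_reply replies[NUM_BRICKS] = {
        { -1, EEXIST }, { -1, EEXIST }, { 0, 0 },
    };
    int op_errno = 0;
    int op_ret = afr_like_mknod_reply(replies, &op_errno);

    return shard_like_mknod_cbk(op_ret, op_errno) ? 1 : 0;
}
------------------------
Running this prints "mknod of shard failed: Transport endpoint is not connected", which is the same shape of error seen at fuse_readv_cbk in the mount log.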
I think the reason it succeeded with the patches mentioned in comment #3 is due to commit `4a6bdc7 - features/shard: Fix vm corruption upon fix-layout`, which "changes shard fops to do a LOOKUP before the MKNOD", thereby triggering name heal in AFR.

I will need to confirm this analysis. FWIW, we need to consider a part of the changes in https://review.gluster.org/#/c/14769/ which propagates the appropriate errno instead of ENOTCONN (see commit message).
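
For illustration only, a hedged sketch of that errno-propagation direction, reusing the brick_reply/NUM_BRICKS definitions from the sketch above (again hypothetical, not the actual change in that review): when quorum is not met, return the errno the reachable bricks actually reported instead of a blanket ENOTCONN, so shard sees EEXIST and can treat it as benign:
------------------------
/* Hypothetical variant of afr_like_mknod_reply() from the sketch above. */
static int afr_like_mknod_reply_fixed(const struct brick_reply replies[NUM_BRICKS],
                                      int *op_errno)
{
    int successes = 0;
    int reported_errno = ENOTCONN;      /* fall back only if nothing better */

    for (int i = 0; i < NUM_BRICKS; i++) {
        if (replies[i].op_ret == 0) {
            successes++;
        } else if (replies[i].op_errno != ENOTCONN) {
            /* remember a "real" errno from a reachable brick, e.g. EEXIST */
            reported_errno = replies[i].op_errno;
        }
    }

    if (successes < 2) {
        *op_errno = reported_errno;     /* EEXIST here, which shard tolerates */
        return -1;
    }
    *op_errno = 0;
    return 0;
}
------------------------
With this aggregation plugged into the earlier sketch, shard_like_mknod_cbk() sees EEXIST, ignores it, and the readv proceeds; the error no longer reaches FUSE.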

Comment 5 Ravishankar N 2017-05-10 11:32:06 UTC
So I tried with just the arbiter fix https://review.gluster.org/#/c/17235/ on top of glusterfs-3.8.4-18 and wasn't able to hit the issue any more.

Comment 8 Ravishankar N 2017-05-15 12:38:31 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/106202/

Comment 9 Ravishankar N 2017-05-16 05:53:32 UTC
Patch merged.

Comment 11 RamaKasturi 2017-06-21 09:58:46 UTC
I do not see this issue with the 3.2 async build.

Comment 12 RamaKasturi 2017-06-23 08:54:01 UTC
glusterfs-3.8.4-29.el7rhgs.x86_64

Comment 13 RamaKasturi 2017-06-23 08:57:18 UTC
Verified and works fine with build glusterfs-3.8.4-29.el7rhgs.x86_64.

Replacing a brick while I/O was happening did not cause a VM pause.

Comment 15 errata-xmlrpc 2017-09-21 04:39:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774