Bug 1426512

Summary: Application VM paused after add brick operation and VM didn't come up after power cycle.
Product: [Community] GlusterFS
Reporter: Krutika Dhananjay <kdhananj>
Component: sharding
Assignee: bugs <bugs>
Status: CLOSED EOL
QA Contact: bugs <bugs>
Severity: high
Priority: unspecified
Version: 3.9
CC: amukherj, bsrirama, bugs, csaba, kdhananj, ravishankar, rcyriac, rgowdapp, rhs-bugs, sabose, sasundar, storage-qa-internal
Keywords: Triaged
Hardware: x86_64
OS: Linux
Clone Of: 1420623
Last Closed: 2017-03-08 12:32:17 UTC
Type: Bug
Bug Blocks: 1387878

Description Krutika Dhananjay 2017-02-24 05:42:18 UTC
+++ This bug was initially created as a clone of Bug #1420623 +++

+++ This bug was initially created as a clone of Bug #1412554 +++

Description of problem:
=======================
Application VM paused after an add-brick operation. I had 3 app VMs running on a gluster volume (1x3) and converted it to a 2x3 volume; at the same time I added a new RHGS node to the cluster (cluster expansion). During this time one of the app VMs (vm1) went into the paused state.


We are seeing disconnection messages and "client quorum is not met" error messages.

[2017-01-12 06:05:13.382644] I [socket.c:3439:socket_submit_request] 27-Dis-Rep1-client-9: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382658] W [rpc-clnt.c:1657:rpc_clnt_submit] 27-Dis-Rep1-client-9: failed to submit rpc-request (XID: 0xdba6 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-9)
[2017-01-12 06:05:13.382660] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-11: disconnected from Dis-Rep1-client-11. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382675] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382692] W [MSGID: 108001] [afr-common.c:4768:afr_notify] 27-Dis-Rep1-replicate-0: Client-quorum is not met
[2017-01-12 06:05:13.382712] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382731] I [socket.c:3439:socket_submit_request] 27-Dis-Rep1-client-11: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382738] W [rpc-clnt.c:1657:rpc_clnt_submit] 27-Dis-Rep1-client-11: failed to submit rpc-request (XID: 0xdca1 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-11)
[2017-01-12 06:05:13.382745] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382826] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-9: disconnected from Dis-Rep1-client-9. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382851] E [MSGID: 108006] [afr-common.c:4679:afr_notify] 27-Dis-Rep1-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-01-12 06:05:13.407637] E [MSGID: 133007] [shard.c:3238:shard_common_resume_mknod] 29-Dis-Rep1-shard: Inode path failed on a15a813e-f512-4f96-858f-91b4fad86d4d.33, base file gfid = a15a813e-f512-4f96-858f-91b4fad86d4d
[2017-01-12 06:05:13.407692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 4126899: WRITE => -1 gfid=a15a813e-f512-4f96-858f-91b4fad86d4d fd=0x7fd489699014 (Cannot allocate memory)
[2017-01-12 06:05:15.043730] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote operation failed
[2017-01-12 06:05:15.043791] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote operation failed
[2017-01-12 06:05:15.043813] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-11: remote operation failed
[2017-01-12 06:05:15.043828] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043840] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043853] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043897] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote operation failed
[2017-01-12 06:05:15.043911] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote operation failed


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-11

How reproducible:
=================
One time


Steps to Reproduce:
===================
1. Have an RHV-RHGS setup with 3 RHGS nodes and 2 virt clusters.
2. Create a 2x3 volume and create the data domain using this volume.
3. Create two app VMs (say vm1, vm2) using the data domain created in step 2.
4. Reduce the volume to 1x3.
5. Create one more app VM (say vm3).
6. Add one more RHGS node to the existing cluster (cluster expansion).
7. While step 6 is in progress, expand the volume to 2x3 by adding one more replica pair.

After step 7 completed, one of the app VMs (vm1) went into the paused state.

Actual results:
===============
Application VM paused after the add-brick operation.

Expected results:
=================
The VM should not go into the paused state.


--- Additional comment from Byreddy on 2017-01-12 04:48:34 EST ---

volume status and info:
========================
[root@dhcp42-35 ~]# gluster volume status
Status of volume: Dis-Rep1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server1.com:/rhs/brick3/a1            49153     0          Y       32634
Brick server2.com:/rhs/brick3/a2            49153     0          Y       15405
Brick server3.com:/rhs/brick3/a3            49154     0          Y       30612
Brick server1.com:/rhs/brick1/z1            49152     0          Y       2278 
Brick server2.com:/rhs/brick1/z2            49152     0          Y       7560 
Brick server3.com:/rhs/brick1/z3            49152     0          Y       32519
Self-heal Daemon on localhost               N/A       N/A        Y       2324 
Self-heal Daemon on server1.com             N/A       N/A        Y       13501
Self-heal Daemon on server2.com             N/A       N/A        Y       7675 
Self-heal Daemon on server3.com             N/A       N/A        Y       32562
 
Task Status of Volume Dis-Rep1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@localhost ~]# 
[root@localhost ~]# gluster volume info
 
Volume Name: Dis-Rep1
Type: Distributed-Replicate
Volume ID: 3c0a32d2-d9fd-4647-80cf-76d663fe7e90
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: server1.com:/rhs/brick3/a1
Brick2: server2.com:/rhs/brick3/a2
Brick3: server3.com:/rhs/brick3/a3
Brick4: server1.com:/rhs/brick1/z1
Brick5: server2.com:/rhs/brick1/z2
Brick6: server3.com:/rhs/brick1/z3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.use-compound-fops: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
user.cifs: off
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
network.ping-timeout: 30
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
[root@dhcp42-35 ~]#


--- Additional comment from Ravishankar N on 2017-01-12 05:04:20 EST ---

Looked at the mount log; moving it to rpc after discussing with Rajesh.

1. After add brick was performed, the mount connected to the old replicate (replicate-0):

[2017-01-12 06:05:13.324064] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-9: Connected to Dis-Rep1-client-9, attached to remote volume '/rhs/brick3/a1'

[2017-01-12 06:05:13.329514] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-10: Connected to Dis-Rep1-client-10, attached to remote volume '/rhs/brick3/a2'

[2017-01-12 06:05:13.334979] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 29-Dis-Rep1-client-11: Connected to Dis-Rep1-client-11, attached to remote volume '/rhs/brick3/a3'.


2. We have a graph switch
[2017-01-12 06:05:13.357471] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse: switched to graph 29

3. Old connections are torn down, which implies all pending I/O on old graph is over:

[2017-01-12 06:05:13.373073] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7fd50e69a06c) to newfd (0x7fd489698efc) (inode-gfid:dabc7928-44ce-4dd8-999a-0d5d648da532)(old-subvolume:meta-autoload-27 new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.376778] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7fd5110f1bec) to newfd (0x7fd489698f88) (inode-gfid:13136355-721d-4fe7-bb76-0a1d35fdb8e0)(old-subvolume:meta-autoload-27 new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.380813] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7fd504e57308) to newfd (0x7fd489699014) (inode-gfid:a15a813e-f512-4f96-858f-91b4fad86d4d)(old-subvolume:meta-autoload-27 new-subvolume:meta-autoload-29)
[2017-01-12 06:05:13.382513] I [MSGID: 114021] [client.c:2365:notify] 27-Dis-Rep1-client-9: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382570] I [MSGID: 114021] [client.c:2365:notify] 27-Dis-Rep1-client-10: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382596] I [MSGID: 114021] [client.c:2365:notify] 27-Dis-Rep1-client-11: current graph is no longer active, destroying rpc_client
[2017-01-12 06:05:13.382615] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-10: disconnected from Dis-Rep1-client-10. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382660] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-11: disconnected from Dis-Rep1-client-11. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382826] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-9: disconnected from Dis-Rep1-client-9. Client process will keep trying to connect to glusterd until brick's port is available

[2017-01-12 06:05:13.382851] E [MSGID: 108006] [afr-common.c:4679:afr_notify] 27-Dis-Rep1-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.



4. But despite this, all further fops are failing with ENOTCONN. This should not happen, because all I/O should happen on the new graph, and the connections to the children of replicate-0 are valid as per the logs in step 1.

[2017-01-12 06:05:13.407692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 4126899: WRITE => -1 gfid=a15a813e-f512-4f96-858f-91b4fad86d4d fd=0x7fd489699014 (Cannot allocate memory)
[2017-01-12 06:05:15.043730] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote operation failed
[2017-01-12 06:05:15.043791] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote operation failed
[2017-01-12 06:05:15.043813] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-11: remote operation failed
[2017-01-12 06:05:15.043828] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043840] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043853] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote operation failed [Transport endpoint is not connected]


--- Additional comment from Byreddy on 2017-01-12 06:28:37 EST ---

I power-cycled the paused VM but it didn't come up; I see VM corruption messages in the VM console.

Attached a screenshot of it.

--- Additional comment from SATHEESARAN on 2017-01-13 05:28:37 EST ---

(In reply to Byreddy from comment #6)
> Created attachment 1239855 [details]
> VM corruption messages  in console

The attachment shows that the XFS filesystem has been corrupted, which means the underlying disk image on the gluster volume has been corrupted - which is data loss in the case of a VM.

--- Additional comment from Atin Mukherjee on 2017-01-25 03:55:52 EST ---

Raghavendra - Can you or someone from the RPC team please start looking into this bug? Rajesh had done an initial analysis of the issue and might be the contact point for additional data.

--- Additional comment from Milind Changire on 2017-01-27 03:39:59 EST ---

Byreddy,
Could you test with bricks on an ext4 file-system and verify that this is not a file-system-related side effect?



--- Additional comment from SATHEESARAN on 2017-01-31 05:19:27 EST ---

Snip of errors from fuse mount logs


[2017-01-31 10:05:45.576808] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-0: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575497 (xid=0x5fdf)
[2017-01-31 10:05:45.576840] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-0: remote operation failed
[2017-01-31 10:05:45.576872] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-1: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575520 (xid=0x5c7e)
[2017-01-31 10:05:45.576882] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-0: disconnected from imagestore-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.576945] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-1: remote operation failed
[2017-01-31 10:05:45.576974] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-1: disconnected from imagestore-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.576985] W [MSGID: 108001] [afr-common.c:4770:afr_notify] 0-imagestore-replicate-0: Client-quorum is not met
[2017-01-31 10:05:45.577120] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-2: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575541 (xid=0x5cb7)
[2017-01-31 10:05:45.577141] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-2: remote operation failed
[2017-01-31 10:05:45.577172] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577188] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577219] I [socket.c:3439:socket_submit_request] 0-imagestore-client-2: not connected (priv->connected = -1)
[2017-01-31 10:05:45.577228] W [rpc-clnt.c:1657:rpc_clnt_submit] 0-imagestore-client-2: failed to submit rpc-request (XID: 0x5cb8 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (imagestore-client-2)
[2017-01-31 10:05:45.577240] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-2: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577306] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-2: disconnected from imagestore-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.577352] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-imagestore-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

--- Additional comment from SATHEESARAN on 2017-01-31 05:22:01 EST ---

(In reply to Milind Changire from comment #10)
> Byreddy,
> Could you test with bricks on ext4 file-system and verify if this is not a
> file-system related side effect ?

Milind, 

From all the logs, it looks like a transport issue. Referring to comment 4, I don't think retesting with ext4 is a viable option that we have. It's only the logs that would help in RCA'ing the issue. I also have the system intact for further investigation.

--- Additional comment from Milind Changire on 2017-01-31 05:36:47 EST ---

(In reply to SATHEESARAN from comment #16)
> (In reply to Milind Changire from comment #10)
> > Byreddy,
> > Could you test with bricks on ext4 file-system and verify if this is not a
> > file-system related side effect ?
> 
> Milind, 
> 
> From all the logs, it looks like a transport issue. Referring to comment4 I
> don't think retesting with ext4 is a viable option that we have. With the
> logs, its only the logs that would help in RCA'ing the issue. I also have
> the system intact for further investigation.

Thanks for making time to reproduce the issue, sas.
Reproducibility was the major concern.
I'll definitely need the setup for debugging.

--- Additional comment from SATHEESARAN on 2017-02-01 02:11:13 EST ---

With debugging from Raghavendra, here are the findings:

Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]

[2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-0: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-1: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-2: current graph is no longer active, destroying rpc_client 

Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]

[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

This means there are two possibilities, both of which look like a bug in the graph-switch code:

1. If the fsync is happening on the older graph, then the parent-down event should not have happened.
2. If the fsync happened after the graph switch, it should have happened on the new graph.

Changing the component to fuse, as the bug is not really in RPC.

Please provide the required acks, if changing the BZ component invalidates the older ack.

--- Additional comment from SATHEESARAN on 2017-02-01 02:12:48 EST ---

Assigning the bug to Csaba (as the bug is around glusterfs-fuse), and providing the qa_ack as the BZ already had one.

--- Additional comment from Raghavendra G on 2017-02-01 02:19:57 EST ---

https://bugzilla.redhat.com/show_bug.cgi?id=1387878 might be a similar issue


--- Additional comment from Krutika Dhananjay on 2017-02-02 01:45:18 EST ---

(In reply to SATHEESARAN from comment #18)
> With debugging from Raghavendra, here are the findings:
> 
> Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]
> 
> [2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-0: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-1: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-2: current graph is no longer active, destroying
> rpc_client 
> 
> Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]
> 
> [2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk]
> 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)
> 
> Which means there are 2 possibilities, of which both conditions looks like a
> bug with graph switch code
> 
> 1. If fsync is happening on the older graph, then parent down event should
> not have happened.
> 2. If fsync was happened later after graph switch, it should have happened
> with the new graph
> 
> Changing the component to fuse, as the bug is not really on RPC.
> 
> Please provide required acks, if changing the BZ component invalidates the
> older ack

+1. Contrary to what comment #4 says, it certainly doesn't look like an rpc bug; it resembles the graph-switch issues reported by some of our community users.

What is not clear is the origin of the fop failure. It doesn't look like a failure due to the connections on the old graph being torn down, since in that case we should have seen 'Transport endpoint is not connected' as the cause of the FSYNC() failure in the log, and protocol/client should have logged ENOTCONN from client3_3_fsync_cbk().

Will check the code for possible origins of EIO (prime suspect being AFR) and get back.

-Krutika

--- Additional comment from SATHEESARAN on 2017-02-02 23:56:55 EST ---

Update:

Raghavendra has provided a private build on top of glusterfs-3.8.4-13 (02-Feb-2017) with a few debug logs enabled.

I will be reproducing the issue with this build and provide the fuse mount logs.

--- Additional comment from SATHEESARAN on 2017-02-03 07:56:07 EST ---

(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

Found some useful log messages in the fuse mount logs when the bricks are added to the 1x3 replica volume. I will be attaching the fuse mount logs to this bug for future reference.

--- Additional comment from SATHEESARAN on 2017-02-03 07:59:44 EST ---


Here is a snip of the chat history about the issue:
<raghug> I am seeing fxattrop and finodelk on old graph even though there are no on-going fops from fuse
<raghug> does afr do transactions for read fops?
<raghug> like readdir, readv etc
<raghug> ?
<itisravi> raghug: nope, it only does lookup/fstat if needed for refreshing the inode context
<raghug> what are the fops it does transactions for?
<itisravi> raghug: but reads can also trigger self-heals via mount which can do finodelk and fxattrop
<raghug> specifically finodelk and fxattrop
<itisravi> raghug: all write txns.
<raghug> itisravi, ah! that can explain
<raghug> reads could be triggered by read-ahead
<raghug> and afr could've triggered selfheal during that
<itisravi> raghug: ok, but if mount triggered afr selfheal ,you should see "performing self-heal on gfid" or "completed self heal on gfid" etc.
<raghug> itisravi, let me check that

--- Additional comment from SATHEESARAN on 2017-02-06 01:19:41 EST ---

(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

I'm unable to hit the VM pause issue with the private build, which has logging changes including debug messages for RCA'ing the issue.

Probably the logging changes have eliminated the occurrence of the race.

--- Additional comment from SATHEESARAN on 2017-02-07 04:44:54 EST ---

Snips of logs from fuse mount log:

<snip>

[2017-02-07 09:39:19.451836] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse: switched to graph 2
[2017-02-07 09:39:19.452139] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 2-pluto-client-5: Server lk version = 1
[2017-02-07 09:39:19.551607] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f666806406c) to newfd (0x7f666227e06c) (inode-gfid:d18f89ef-6688-4d33-8d8f-221971639488)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.557778] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f66680640f8) to newfd (0x7f666227e0f8) (inode-gfid:fce8e590-da77-4b11-b43f-31f71f5aa2ca)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.562129] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f6668064184) to newfd (0x7f666227e184) (inode-gfid:a7ae1192-9f13-4d38-8f89-f34855453089)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.564971] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f666806429c) to newfd (0x7f666227e210) (inode-gfid:73378267-5712-4a6f-b5ca-b7f277c7533b)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567667] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f6668064210) to newfd (0x7f666227e29c) (inode-gfid:23a02bc8-b52f-4b72-9907-fbfc8464d2a1)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567820] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-0: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567873] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-1: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567918] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-2: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567910] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-0: disconnected from pluto-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.567989] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-2: disconnected from pluto-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.567967] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-1: disconnected from pluto-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.568060] W [MSGID: 108001] [afr-common.c:4770:afr_notify] 0-pluto-replicate-0: Client-quorum is not met
[2017-02-07 09:39:19.568192] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-pluto-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-07 09:39:19.610218] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 49362: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089 fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.610517] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 49375: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089 fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.701440] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 49368: FSYNC() ERR => -1 (Input/output error)

</snip>



--- Additional comment from SATHEESARAN on 2017-02-08 05:31:04 EST ---

(In reply to Raghavendra G from comment #31)
> Sas,
> 
> I've triggered a new build with less debug messages at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12509975
> 
> This will spew less logs and hence hopefully we should hit the bug. Can you
> please rerun the tests with this new build?
> 
> regards,
> Raghavendra

Thanks Raghu.

I could hit the VM pause issue with this private build; I will upload the full logs soon. Here is the snip from the fuse mount logs:

[2017-02-08 10:25:16.086736] I [fuse-bridge.c:4846:fuse_graph_sync] 0-meta-autoload: switched to new-graph: 2 old-graph: 0. new_subvol=0x7f83bc07d5e0 old_subvol=0x7f83cc0159c0old_subvol->winds = 2 unique = 38239
[2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c) (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8) (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-08 10:25:19.965103] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f69184) to newfd (0x7f83c804b184) (inode-gfid:e1e3195a-1b9a-495e-a583-5bde19532dae)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-08 10:25:20.653075] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f69210) to newfd (0x7f83c804b210) (inode-gfid:093291ca-8df7-4bf2-bd77-3765c3b6de34)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-08 10:25:21.403567] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f6929c) to newfd (0x7f83c804b29c) (inode-gfid:2325ebd0-4152-4405-8e8d-f98c27105990)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-08 10:25:21.403650] I [fuse-bridge.c:4863:fuse_graph_sync] 0-fuse: sending PARENT_DOWN on graph:0 subvol: 0x7f83cc0159c0 unique = 38239
[2017-02-08 10:25:21.403966] I [MSGID: 114021] [client.c:2365:notify] 0-andromeda-client-0: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404061] I [MSGID: 114021] [client.c:2365:notify] 0-andromeda-client-1: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404108] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-andromeda-client-0: disconnected from andromeda-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 10:25:21.404142] I [MSGID: 114021] [client.c:2365:notify] 0-andromeda-client-2: current graph is no longer active, destroying rpc_client 
[2017-02-08 10:25:21.404178] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-andromeda-client-1: disconnected from andromeda-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 10:25:21.404203] W [MSGID: 108001] [afr-common.c:4770:afr_notify] 0-andromeda-replicate-0: Client-quorum is not met
[2017-02-08 10:25:21.404244] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-andromeda-client-2: disconnected from andromeda-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> => -1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)

--- Additional comment from Raghavendra G on 2017-02-08 07:58:37 EST ---

Looks like it's not an issue with the graph switch, in the sense that fops are _not_ sent on a graph which was notified with a PARENT_DOWN event. To corroborate this, see the log:

[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)

If we look at the fd, it was opened on the new graph, as can be seen from the following log:
[2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8) (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)

Similar logs for other fd too:
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)

which was opened on new graph too:
[2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c) (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)

I got confused by the log, which prints the fuse xlator's name with graph id 0 (0-glusterfs-fuse), and associated these logs with graph 0. However, it turns out that this->graph is NULL for the fuse xlator, and _gf_log uses "0" as the id if this->graph is NULL. So the fops might as well have been sent on the new graph, and the failures we see might be due to a different reason. The reason could be that fops were sent on the new graph even before the new graph was up - connections were established to brick. Note that there are log messages indicating that an afr subvol is up in graph 2 between
1. fuse switches the active graph to graph 2.
2. errors are seen on fops in fuse.
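
For illustration, here is a minimal, runnable C sketch of the id-selection rule described above. The struct definitions are hypothetical stand-ins for the real xlator/graph types, not the actual glusterfs source:

#include <stdio.h>

/* Hypothetical, simplified stand-ins for the xlator/graph structs. */
struct graph  { int id; };
struct xlator { const char *name; struct graph *graph; };

/* Sketch of the rule described above: if this->graph is NULL,
 * the log prefix falls back to 0. */
static int log_graph_id(const struct xlator *this)
{
    return this->graph ? this->graph->id : 0;
}

int main(void)
{
    struct graph  g2     = { 2 };
    struct xlator client = { "andromeda-client-0", &g2 };
    struct xlator fuse   = { "glusterfs-fuse", NULL }; /* fuse has no graph */

    printf("%d-%s\n", log_graph_id(&client), client.name); /* 2-andromeda-client-0 */
    printf("%d-%s\n", log_graph_id(&fuse), fuse.name);     /* 0-glusterfs-fuse     */
    return 0;
}

This is why the "0-glusterfs-fuse" prefix alone cannot tell us which graph a fop was wound on.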

To prove that the failed fops were indeed sent on the new graph, I've added a few more logs; the new debug build can be found at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

@Sas,

Can you please test with this new debug build? Thanks for your patience :)

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2017-02-08 08:01:22 EST ---

(In reply to Raghavendra G from comment #34)
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. 

*before connections were established to bricks

> Note that there are log messages indicating that an

*Note that there are _NO_ log messages indicating that...

> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068
> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

--- Additional comment from Raghavendra G on 2017-02-08 10:24:17 EST ---

(In reply to Raghavendra G from comment #34)
> Looks like its not an issue with graph switch in the sense fops are _not_
> sent on a graph which was notified with PARENT_DOWN event. To corroborate
> this see the log:
> 
> [2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38318: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> [2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38320: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> 
> If we look at the fd, it was opened on new graph as can be seen from
> following log:
> [2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8)
> (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> Similar logs for other fd too:
> [2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38312: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38324: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38326: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> 
> which was opened on new graph too:
> [2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c)
> (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. Note that there are log messages indicating that an
> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

I've updated the debug build with more debug messages. Please use,
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12512422

> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

--- Additional comment from Raghavendra G on 2017-02-09 00:59:46 EST ---

Pursuing further the RCA of fuse-bridge not waiting until the new graph is up before directing fops to it: I do see the code where fuse_graph_sync waits on priv->sync_cond after initializing a new graph as active_subvol. However, the "notify" function, which broadcasts a signal on priv->sync_cond whenever it receives a CHILD_DOWN/CHILD_UP, doesn't check on which graph the event was received before setting priv->event_recvd to true. For example, consider the scenario:

* fuse_graph_sync is waiting for a CHILD_UP/CHILD_DOWN on the new graph by doing pthread_cond_wait on priv->sync_cond
* notify receives a CHILD_DOWN on the old graph and signals priv->sync_cond

In the above scenario, fuse_graph_sync wakes up even though no CHILD_UP/CHILD_DOWN was received on the new graph, and starts directing fops to the new graph; these will fail until the new graph is up.

There is some evidence in the logs too. Note that we started seeing failed fops immediately after the CHILD_DOWN event from afr to its parent (fuse-bridge), and there were no errors before that.

[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> => -1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)


I'll wait for the debug logs from sas to confirm the above RCA. If the RCA turns out to be correct, the fix would be to make fuse_graph_sync wait for a CHILD_UP/CHILD_DOWN event on the "new-graph" it just set as active-subvol, instead of waking up on receiving a CHILD_UP/CHILD_DOWN on _any_ graph.
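
To make the suspected race concrete, here is a minimal C sketch of the waiter/notifier interaction described above, together with the proposed fix. The type and function names are simplified, hypothetical stand-ins, not the actual fuse-bridge code:

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the fuse private state. */
struct fuse_priv {
    pthread_mutex_t sync_mutex;
    pthread_cond_t  sync_cond;
    bool            event_recvd;    /* set on ANY child event (the bug)  */
    int             event_graph_id; /* source graph of the event (fix)   */
};

/* notify(): runs when a CHILD_UP/CHILD_DOWN arrives from some graph. */
static void notify_child_event(struct fuse_priv *priv, int graph_id)
{
    pthread_mutex_lock(&priv->sync_mutex);
    priv->event_recvd    = true;     /* the buggy code does not look at  */
    priv->event_graph_id = graph_id; /* the graph; recording it is the fix */
    pthread_cond_broadcast(&priv->sync_cond);
    pthread_mutex_unlock(&priv->sync_mutex);
}

/* fuse_graph_sync(): must not direct fops at the new graph until that
 * graph has reported a child event. */
static void graph_sync_wait(struct fuse_priv *priv, int new_graph_id)
{
    pthread_mutex_lock(&priv->sync_mutex);
    /* Buggy predicate: while (!priv->event_recvd); a CHILD_DOWN on the
     * OLD graph also sets event_recvd, waking this thread prematurely
     * and letting fops flow to a not-yet-connected graph (the EIO above).
     * Fixed predicate: wait for an event on the graph just activated. */
    while (!(priv->event_recvd && priv->event_graph_id == new_graph_id))
        pthread_cond_wait(&priv->sync_cond, &priv->sync_mutex);
    priv->event_recvd = false; /* consume the event */
    pthread_mutex_unlock(&priv->sync_mutex);
}

int main(void)
{
    struct fuse_priv priv = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, -1
    };
    notify_child_event(&priv, 0); /* CHILD_DOWN on old graph 0            */
    notify_child_event(&priv, 2); /* CHILD_UP on new graph 2              */
    graph_sync_wait(&priv, 2);    /* returns only because graph 2 spoke up */
    return 0;
}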


--- Additional comment from Worker Ant on 2017-02-22 04:36:21 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-22 04:37:39 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-22 05:17:28 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-22 05:31:18 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#6) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 01:48:21 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#7) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Worker Ant on 2017-02-23 01:48:27 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Worker Ant on 2017-02-23 01:53:11 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#8) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 01:53:17 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 02:43:32 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#9) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 02:43:37 EST ---

REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#5) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 02:56:42 EST ---

COMMIT: https://review.gluster.org/16709 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 583e6cfc5bc73c2a79be9d42e89b90a71454596f
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops
    
    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based local->fop, which is rather ugly and prone to errors.
    
    Change-Id: Ia46cc59992baa2979516369cb72d8991452c0274
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16709
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
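
As an illustration of the refactor this commit describes, here is a hedged C sketch contrasting the two approaches. The types and function names are hypothetical, not the actual shard xlator code:

#include <stdio.h>

typedef enum { FOP_WRITE, FOP_MKNOD } fop_t;
typedef struct { const char *name; } inode_t;

typedef struct {
    fop_t    fop;
    inode_t *base_inode;
    inode_t *dot_shard_inode;
} shard_local_t;

/* Before: a "common" helper guesses which inode to resolve from
 * local->fop; every new fop needs another case here, which is
 * error-prone. */
static inode_t *resolve_inode_common(shard_local_t *local)
{
    switch (local->fop) {
    case FOP_MKNOD:
        return local->dot_shard_inode;
    default:
        return local->base_inode;
    }
}

/* After: each fop states explicitly which inode it wants resolved. */
static void shard_resolve(inode_t *inode)
{
    printf("resolving %s\n", inode->name);
}

int main(void)
{
    inode_t base = { "base-file" }, dot_shard = { ".shard" };
    shard_local_t local = { FOP_MKNOD, &base, &dot_shard };

    shard_resolve(resolve_inode_common(&local)); /* old style            */
    shard_resolve(&dot_shard);                   /* new style: fop decides */
    return 0;
}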

--- Additional comment from Worker Ant on 2017-02-23 05:50:55 EST ---

REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#10) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-02-23 23:57:47 EST ---

COMMIT: https://review.gluster.org/14419 committed in master by Atin Mukherjee (amukherj) 
------
commit 1e2773cf1586b78c71e5b8adc24c6b65f065357c
Author: Krutika Dhananjay <kdhananj>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick
    
    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.
    
    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.
    
    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.
    
    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.
    
    Change-Id: I9383ec7e3f24b34cd097a1b01ad34e4eeecc621f
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/14419
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
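
To clarify the mechanism this commit message describes, here is a minimal, runnable C sketch of the ".shard lookup completed one network trip" flag. All names are hypothetical stand-ins and the fop-winding machinery is reduced to stubs; this is not the actual shard xlator code:

#include <stdbool.h>
#include <stdio.h>

struct dot_shard_ctx {
    bool lookup_completed; /* a fresh LOOKUP on .shard has finished one
                              full network trip on this graph */
};

/* Stub: DHT heals .shard and initializes its layout during this lookup. */
static void wind_lookup(struct dot_shard_ctx *ctx)
{
    ctx->lookup_completed = true;
    puts("LOOKUP .shard (layout now initialized)");
}

/* Stub standing in for creating a shard block under .shard. */
static void wind_mknod(void) { puts("MKNOD shard block under .shard"); }

/* A write that needs to create a shard block. */
static void shard_write(struct dot_shard_ctx *ctx)
{
    /* inode_find() on .shard may succeed even though the inode was linked
     * early by a parallel write's in-flight lookup; in that case the DHT
     * layout is uninitialized and MKNOD would fail with EIO. */
    if (!ctx->lookup_completed)
        wind_lookup(ctx); /* the fix: force one full lookup first */
    wind_mknod();
}

int main(void)
{
    struct dot_shard_ctx ctx = { .lookup_completed = false };
    shard_write(&ctx); /* first write: lookup, then mknod  */
    shard_write(&ctx); /* later writes: mknod directly     */
    return 0;
}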

Comment 1 Worker Ant 2017-02-24 06:04:11 UTC
REVIEW: https://review.gluster.org/16751 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2017-02-24 06:04:17 UTC
REVIEW: https://review.gluster.org/16752 (features/shard: Fix EIO error on add-brick) posted (#1) for review on release-3.9 by Krutika Dhananjay (kdhananj)

Comment 3 Worker Ant 2017-02-27 05:00:39 UTC
COMMIT: https://review.gluster.org/16751 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu) 
------
commit a10bc7da360c95524cd79b30d364134f2368f348
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops
    
            Backport of: https://review.gluster.org/16709
    
    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based local->fop, which is rather ugly and prone to errors.
    
    Change-Id: I84c5b26160150f2fd87e7f245190c500a4b36bd8
    BUG: 1426512
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16751
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 4 Worker Ant 2017-02-27 05:21:50 UTC
REVIEW: https://review.gluster.org/16752 (features/shard: Fix EIO error on add-brick) posted (#2) for review on release-3.9 by Krutika Dhananjay (kdhananj)

Comment 5 Worker Ant 2017-02-27 05:45:57 UTC
COMMIT: https://review.gluster.org/16752 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu) 
------
commit 860ead4e36e4bf54bc5ed88d5ae3aa44d40358c3
Author: Krutika Dhananjay <kdhananj>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick
    
            Backport of: https://review.gluster.org/14419
    
    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.
    
    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.
    
    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.
    
    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.
    
    Change-Id: Id0d160157ad8f6bcd52801a2173c5869517d0a96
    BUG: 1426512
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16752
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 6 Kaushal 2017-03-08 12:32:17 UTC
This bug is getting closed because GlusterFS-3.9 has reached its end-of-life [1].

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please open a new bug against the newer release.

[1]: https://www.gluster.org/community/release-schedule/