Bug 1412554

Summary: [RHV-RHGS]: Application VM paused after add brick operation and VM didn't come up after power cycle.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Byreddy <bsrirama>
Component: sharding Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2 CC: amukherj, csaba, kdhananj, pkarampu, ravishankar, rcyriac, rgowdapp, rhs-bugs, sabose, sasundar, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-16 Doc Type: Bug Fix
Doc Text:
Premature requests to create shards before the .shard directory had been initialized caused virtual machines to pause when bricks were added because directory layout information was not yet available in memory. Shard creation now includes checks at various points to ensure that virtual machines operate normally.
Story Points: ---
Clone Of:
: 1420623 (view as bug list) Environment:
Last Closed: 2017-03-23 06:02:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1351528, 1351530, 1411323, 1420623    
Attachments:
VM corruption messages in console
Fuse mount l
fuse-mount-logs observed with private build that had debug info in logs
fuse-mount-logs observed with private build that had debug info in logs
fuse-mount-logs observed with private build that had debug info in logs
fuse-mount-logs with client-log-level and brick-log-level set to DEBUG
bricks-logs on node1

Description Byreddy 2017-01-12 09:44:15 UTC
Description of problem:
=======================
Application VM paused after an add-brick operation. I had 3 app VMs running on a gluster volume (1x3) and converted it to a 2x3 volume, and at the same time I added a new RHGS node to the cluster (cluster expansion). During this time one of the app VMs (vm1) went to the paused state.


We are seeing disconnection messages and client-quorum-not-met error messages:

usterd until brick's port is available
[2017-01-12 06:05:13.382644] I [socket.c:3439:socket_submit_request] 27-Dis-Rep1-client-9: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382658] W [rpc-clnt.c:1657:rpc_clnt_submit] 27-Dis-Rep1-client-9: failed to submit rpc-request (XID: 0xdba6 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-9)
[2017-01-12 06:05:13.382660] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-11: disconnected from Dis-Rep1-client-11. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382675] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382692] W [MSGID: 108001] [afr-common.c:4768:afr_notify] 27-Dis-Rep1-replicate-0: Client-quorum is not met
[2017-01-12 06:05:13.382712] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382731] I [socket.c:3439:socket_submit_request] 27-Dis-Rep1-client-11: not connected (priv->connected = -1)
[2017-01-12 06:05:13.382738] W [rpc-clnt.c:1657:rpc_clnt_submit] 27-Dis-Rep1-client-11: failed to submit rpc-request (XID: 0xdca1 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (Dis-Rep1-client-11)
[2017-01-12 06:05:13.382745] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:13.382826] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 27-Dis-Rep1-client-9: disconnected from Dis-Rep1-client-9. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-12 06:05:13.382851] E [MSGID: 108006] [afr-common.c:4679:afr_notify] 27-Dis-Rep1-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-01-12 06:05:13.407637] E [MSGID: 133007] [shard.c:3238:shard_common_resume_mknod] 29-Dis-Rep1-shard: Inode path failed on a15a813e-f512-4f96-858f-91b4fad86d4d.33, base file gfid = a15a813e-f512-4f96-858f-91b4fad86d4d
[2017-01-12 06:05:13.407692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 4126899: WRITE => -1 gfid=a15a813e-f512-4f96-858f-91b4fad86d4d fd=0x7fd489699014 (Cannot allocate memory)
[2017-01-12 06:05:15.043730] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote operation failed
[2017-01-12 06:05:15.043791] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote operation failed
[2017-01-12 06:05:15.043813] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-11: remote operation failed
[2017-01-12 06:05:15.043828] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-9: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043840] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-10: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043853] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 27-Dis-Rep1-client-11: remote operation failed [Transport endpoint is not connected]
[2017-01-12 06:05:15.043897] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-9: remote operation failed
[2017-01-12 06:05:15.043911] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 27-Dis-Rep1-client-10: remote operation failed


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-11

How reproducible:
=================
One time


Steps to Reproduce:
===================
1. Have an RHV-RHGS setup with 3 RHGS nodes and a 2-node virt cluster.
2. Create a 2x3 volume and create the data domain using this volume.
3. Create two app VMs (say vm1, vm2) using the data domain created in step 2.
4. Reduce the volume to 1x3.
5. Create one more app VM (say vm3).
6. Add one more RHGS node to the existing cluster (cluster expansion).
7. While step 6 is in progress, expand the volume to 2x3 by adding one more replica pair.

After step 7 completed, one of the app VMs (vm1) went to the paused state.

Actual results:
===============
Application VM paused after add brick operation.

Expected results:
=================
VMs should not go to the paused state.

Additional info:
=================
I will provide the sos reports.

Comment 6 Byreddy 2017-01-12 11:24:16 UTC
Created attachment 1239855 [details]
VM corruption messages  in console

Comment 7 Byreddy 2017-01-12 11:28:37 UTC
I power cycled the paused VM but it didn't come up; I see VM corruption messages in the VM console.

attached the screenshot of it.

Comment 8 SATHEESARAN 2017-01-13 10:28:37 UTC
(In reply to Byreddy from comment #6)
> Created attachment 1239855 [details]
> VM corruption messages  in console

The attachment shows that the XFS filesystem has become corrupted, which means the underlying disk image on the gluster volume has been corrupted - this amounts to data loss for the VM.

Comment 9 Atin Mukherjee 2017-01-25 08:55:52 UTC
Raghavendra - Can you or someone from the RPC team please start looking into this bug? Rajesh had done an initial analysis of the issue and might be the contact point for additional data.

Comment 11 SATHEESARAN 2017-01-31 10:13:32 UTC
Tested with RHEL 7.3 and RHGS 3.2.0 interim build ( glusterfs-3.8.4-13.el7rhgs ).
The issue is consistently seen; I hit it all three times I tried.

Components involved
-------------------
RHEL 7.3 - kernel - 3.10.0-514.6.1.el7.x86_64
RHGS 3.2.0 - glusterfs-3.8.4-13.el7rhgs
qemu-kvm-common-rhev-2.6.0-28.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
qemu-img-rhev-2.6.0-28.el7_3.3.x86_64

Test case:
----------
1. Create a replica 3 volume and optimize the volume for virt-store
2. Fuse mount the volume on the RHEL 7.3 server machine
3. Create 3 VM images on the fuse mount
4. Start OS installation on these 3 VMs
5. While OS installation is in progress, add 3 more bricks to the volume to make it 2x3

Observation:
------------
Each time, one of the 3 VMs gets paused.

Comment 13 SATHEESARAN 2017-01-31 10:16:53 UTC
Created attachment 1246122 [details]
Fuse mount l

Comment 14 SATHEESARAN 2017-01-31 10:17:29 UTC
Attached the fuse mount logs in comment 13 from the RHEL 7.3 server which is used as the hypervisor.

Comment 15 SATHEESARAN 2017-01-31 10:19:27 UTC
Snip of errors from fuse mount logs


[2017-01-31 10:05:45.576808] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-0: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575497 (xid=0x5fdf)
[2017-01-31 10:05:45.576840] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-0: remote operation failed
[2017-01-31 10:05:45.576872] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-1: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575520 (xid=0x5c7e)
[2017-01-31 10:05:45.576882] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-0: disconnected from imagestore-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.576945] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-1: remote operation failed
[2017-01-31 10:05:45.576974] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-1: disconnected from imagestore-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.576985] W [MSGID: 108001] [afr-common.c:4770:afr_notify] 0-imagestore-replicate-0: Client-quorum is not met
[2017-01-31 10:05:45.577120] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f8763d16602] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8763adc76e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8763adc87e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f8763addfe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f8763ade874] ))))) 0-imagestore-client-2: forced unwinding frame type(GlusterFS 3.3) op(FXATTROP(34)) called at 2017-01-31 10:05:45.575541 (xid=0x5cb7)
[2017-01-31 10:05:45.577141] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-imagestore-client-2: remote operation failed
[2017-01-31 10:05:45.577172] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-0: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577188] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-1: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577219] I [socket.c:3439:socket_submit_request] 0-imagestore-client-2: not connected (priv->connected = -1)
[2017-01-31 10:05:45.577228] W [rpc-clnt.c:1657:rpc_clnt_submit] 0-imagestore-client-2: failed to submit rpc-request (XID: 0x5cb8 Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (imagestore-client-2)
[2017-01-31 10:05:45.577240] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-imagestore-client-2: remote operation failed [Transport endpoint is not connected]
[2017-01-31 10:05:45.577306] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-imagestore-client-2: disconnected from imagestore-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-01-31 10:05:45.577352] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-imagestore-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

Comment 18 SATHEESARAN 2017-02-01 07:11:13 UTC
With debugging from Raghavendra, here are the findings:

Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]

[2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-0: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-1: current graph is no longer active, destroying rpc_client 
[2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify] 0-imagestore-client-2: current graph is no longer active, destroying rpc_client 

Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]

[2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)

This means there are 2 possibilities, both of which look like a bug in the graph switch code:

1. If fsync happened on the older graph, then the parent-down event should not have happened.
2. If fsync happened later, after the graph switch, it should have happened on the new graph.

Changing the component to fuse, as the bug is not really on RPC.

Please provide the required acks if changing the BZ component invalidates the older ack.

Comment 19 SATHEESARAN 2017-02-01 07:12:48 UTC
Assigning the bug to Csaba ( as the bug is around glusterfs-fuse ), and providing the qa_ack as the BZ already had one

Comment 20 Raghavendra G 2017-02-01 07:19:57 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1387878 might be a similar issue

Comment 23 Krutika Dhananjay 2017-02-02 06:45:18 UTC
(In reply to SATHEESARAN from comment #18)
> With debugging from Raghavendra, here are the findings:
> 
> Parent down event happened at the timestamp - [2017-01-31 10:05:45.576350]
> 
> [2017-01-31 10:05:45.576350] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-0: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576430] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-1: current graph is no longer active, destroying
> rpc_client 
> [2017-01-31 10:05:45.576474] I [MSGID: 114021] [client.c:2365:notify]
> 0-imagestore-client-2: current graph is no longer active, destroying
> rpc_client 
> 
> Fsync happens later at the timestamp - [2017-01-31 10:05:45.620808]
> 
> [2017-01-31 10:05:45.620808] W [fuse-bridge.c:1291:fuse_err_cbk]
> 0-glusterfs-fuse: 14861: FSYNC() ERR => -1 (Input/output error)
> 
> This means there are 2 possibilities, both of which look like a bug in the
> graph switch code:
> 
> 1. If fsync happened on the older graph, then the parent-down event should
> not have happened.
> 2. If fsync happened later, after the graph switch, it should have happened
> on the new graph.
> 
> Changing the component to fuse, as the bug is not really on RPC.
> 
> Please provide required acks, if changing the BZ component invalidates the
> older ack

+1. Contrary to what comment #4 says, it certainly doesn't look like an rpc bug and resembles the graph switch issues reported by some of our community users.

What is not clear is the origin of the fop failure. It doesn't look like a failure due to the connections on the old graph being torn down, since then we should have been seeing 'Transport endpoint is not connected' as the cause of the FSYNC() failure in the log, and protocol/client should have logged ENOTCONN from client3_3_fsync_cbk().

Will check the code for possible origins of EIO (prime suspect being AFR), and get back.

-Krutika

Comment 24 SATHEESARAN 2017-02-03 04:56:55 UTC
Update:

Raghavendra has given the private build over glusterfs-3.8.4-13 ( 02-Feb-2017 ) with few debug logs enabled.

I will be reproducing the issue with this build and provide the fuse mount logs.

Comment 25 SATHEESARAN 2017-02-03 12:56:07 UTC
(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

Found some useful log messages in the fuse mount logs when the bricks are added to the 1x3 replica volume. I will be attaching the fuse mount logs to this bug for future reference.

Comment 26 SATHEESARAN 2017-02-03 12:58:45 UTC
Created attachment 1247437 [details]
fuse-mount-logs observed with private build that had debug info in logs

Comment 28 SATHEESARAN 2017-02-06 06:19:41 UTC
(In reply to SATHEESARAN from comment #24)
> Update:
> 
> Raghavendra has given the private build over glusterfs-3.8.4-13 (
> 02-Feb-2017 ) with few debug logs enabled.
> 
> I will be reproducing the issue with this build and provide the fuse mount
> logs.

I'm unable to hit the VM pause issue with the private build, which has logging changes including debug messages for RCA'ing the issue.

Probably the logging changes have eliminated the occurrence of the race.

Comment 29 SATHEESARAN 2017-02-07 09:43:14 UTC
Based on Atin's suggestion (he had a wild guess that client-io-threads could be causing the race), I tried the same test with the latest RHGS 3.2.0 interim build (glusterfs-3.8.4-14.el7rhgs) and I am hitting the VM pause issue.

I will be attaching the logs to the bug

Comment 30 SATHEESARAN 2017-02-07 09:44:54 UTC
Snips of logs from fuse mount log:

<snip>

[2017-02-07 09:39:19.451836] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse: switched to graph 2
[2017-02-07 09:39:19.452139] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 2-pluto-client-5: Server lk version = 1
[2017-02-07 09:39:19.551607] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f666806406c) to newfd (0x7f666227e06c) (inode-gfid:d18f89ef-6688-4d33-8d8f-221971639488)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.557778] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f66680640f8) to newfd (0x7f666227e0f8) (inode-gfid:fce8e590-da77-4b11-b43f-31f71f5aa2ca)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.562129] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f6668064184) to newfd (0x7f666227e184) (inode-gfid:a7ae1192-9f13-4d38-8f89-f34855453089)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.564971] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f666806429c) to newfd (0x7f666227e210) (inode-gfid:73378267-5712-4a6f-b5ca-b7f277c7533b)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567667] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f6668064210) to newfd (0x7f666227e29c) (inode-gfid:23a02bc8-b52f-4b72-9907-fbfc8464d2a1)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)
[2017-02-07 09:39:19.567820] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-0: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567873] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-1: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567918] I [MSGID: 114021] [client.c:2365:notify] 0-pluto-client-2: current graph is no longer active, destroying rpc_client 
[2017-02-07 09:39:19.567910] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-0: disconnected from pluto-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.567989] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-2: disconnected from pluto-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.567967] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-pluto-client-1: disconnected from pluto-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-07 09:39:19.568060] W [MSGID: 108001] [afr-common.c:4770:afr_notify] 0-pluto-replicate-0: Client-quorum is not met
[2017-02-07 09:39:19.568192] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-pluto-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-07 09:39:19.610218] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 49362: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089 fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.610517] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 49375: READ => -1 gfid=a7ae1192-9f13-4d38-8f89-f34855453089 fd=0x7f666227e184 (Input/output error)
[2017-02-07 09:39:19.701440] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 49368: FSYNC() ERR => -1 (Input/output error)

</snip>

Comment 33 SATHEESARAN 2017-02-08 10:35:35 UTC
Created attachment 1248585 [details]
fuse-mount-logs observed with private build that had debug info in logs

Comment 34 Raghavendra G 2017-02-08 12:58:37 UTC
Looks like its not an issue with graph switch in the sense fops are _not_ sent on a graph which was notified with PARENT_DOWN event. To corroborate this see the log:

[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)

If we look at the fd, it was opened on new graph as can be seen from following log:
[2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8) (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)

Similar logs for other fd too:
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)

which was opened on new graph too:
[2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c) (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-autoload-0 new-subvolume:meta-autoload-2)

I got confused from the log which logs fuse xlator's name with graph id 0 (0-glusterfs-fuse) associating the logs with graph 0. But however it turns out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if this->graph is NULL. So, the fops might as well be sent on new graph and failures we see might be due to a different reason. The reason could be fops were sent on new graph even before new-graph is up - connections were established to brick. Note that there are log messages indicating that an afr subvol is up in graph 2 between
1. fuse switches active graph to graph 2.
2. errors are seen on fops in fuse.
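
A minimal illustration of the prefix behaviour described above (the structs and the function here are simplified stand-ins I am assuming for illustration, not the actual libglusterfs definitions):

/* Shows why fuse messages carry a "0-" prefix even when the fop actually ran
 * on graph 2: the fuse xlator has no graph pointer, so the id defaults to 0. */
struct graph  { int id; };
struct xlator { const char *name; struct graph *graph; };

static int log_graph_id(const struct xlator *this)
{
        /* assumption, mirroring the behaviour described above */
        return (this != NULL && this->graph != NULL) ? this->graph->id : 0;
}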

To prove that the failed fops were indeed sent on new graph, I've added few more logs and the new debug build can be found at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

@Sas,

Can you please test with this new debug build? Thanks for your patience :)

regards,
Raghavendra

Comment 35 Raghavendra G 2017-02-08 13:01:22 UTC
(In reply to Raghavendra G from comment #34)
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. 

*before connections were established to bricks

> Note that there are log messages indicating that an

*Note that there are _NO_ log messages indicating that...

> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068
> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

Comment 36 Raghavendra G 2017-02-08 15:24:17 UTC
(In reply to Raghavendra G from comment #34)
> Looks like its not an issue with graph switch in the sense fops are _not_
> sent on a graph which was notified with PARENT_DOWN event. To corroborate
> this see the log:
> 
> [2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38318: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> [2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38320: WRITE => -1
> gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output
> error)
> 
> If we look at the fd, it was opened on new graph as can be seen from
> following log:
> [2017-02-08 10:25:17.779502] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f690f8) to newfd (0x7f83c804b0f8)
> (inode-gfid:d04d7083-bdfe-4424-be50-a8ce01caa8a1)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> Similar logs for other fd too:
> [2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38312: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38324: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> [2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk]
> 0-glusterfs-fuse: 38326: WRITE => -1
> gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output
> error)
> 
> which was opened on new graph too:
> [2017-02-08 10:25:17.205378] I [fuse-bridge.c:4446:fuse_migrate_fd_open]
> 0-glusterfs-fuse: migrated basefd (0x7f83c9f6906c) to newfd (0x7f83c804b06c)
> (inode-gfid:8dad6ee2-a57f-47b8-ac06-648931200375)(old-subvolume:meta-
> autoload-0 new-subvolume:meta-autoload-2)
> 
> I got confused from the log which logs fuse xlator's name with graph id 0
> (0-glusterfs-fuse) associating the logs with graph 0. But however it turns
> out that this->graph is NULL for fuse xlator and _gf_log uses "0" as id if
> this->graph is NULL. So, the fops might as well be sent on new graph and
> failures we see might be due to a different reason. The reason could be fops
> were sent on new graph even before new-graph is up - connections were
> established to brick. Note that there are log messages indicating that an
> afr subvol is up in graph 2 between
> 1. fuse switches active graph to graph 2.
> 2. errors are seen on fops in fuse.
> 
> To prove that the failed fops were indeed sent on new graph, I've added few
> more logs and the new debug build can be found at:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12511068

I've updated the debug build with more debug messages. Please use,
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12512422

> 
> @Sas,
> 
> Can you please test with this new debug build? Thanks for your patience :)
> 
> regards,
> Raghavendra

Comment 37 Raghavendra G 2017-02-09 05:59:46 UTC
Pursuing further the RCA of fuse-bridge not waiting until the new graph is up before directing fops to it, I do see the code where fuse_graph_sync waits on priv->sync_cond after setting a new graph as active_subvol. However, the "notify" function, which broadcasts a signal on priv->sync_cond whenever it receives a CHILD_DOWN/CHILD_UP, doesn't check which graph the event was received on before setting priv->event_recvd to true. For e.g., consider the scenario:

* fuse_graph_sync is waiting for a CHILD_UP/CHILD_DOWN on new graph by doing pthread_cond_wait on priv->sync_cond
* notify receives a CHILD_DOWN on old-graph and signals priv->sync_cond

In the above scenario, fuse_graph_sync wakes up even though no CHILD_UP/CHILD_DOWN was received on the new graph and starts directing fops to the new graph, where they will fail until the new graph is up.

There is some evidence in the logs too: we started seeing failed fops immediately after the CHILD_DOWN event from afr to its parent (fuse-bridge), and there were no errors before that.

[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> => -1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)


I'll wait for debug logs from sas to confirm the above RCA. If it turns out the RCA above is correct, the fix would be to make fuse_graph_sync wait for a CHILD_UP/CHILD_DOWN event on the "new-graph" it just set as active-subvol, instead of waking up on receiving CHILD_UP/CHILD_DOWN on _any_ graph.
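
A minimal sketch of the wait/signal pattern described above and of the graph-aware fix being proposed (the types, fields and function names here are simplified stand-ins I am assuming for illustration, not the actual fuse-bridge code):

#include <pthread.h>
#include <stdbool.h>

/* Simplified stand-ins for the fuse private state; names are illustrative only. */
struct fuse_priv {
        pthread_mutex_t lock;
        pthread_cond_t  sync_cond;
        int             active_graph_id;   /* graph just made active_subvol */
        int             event_graph_id;    /* graph the last CHILD_UP/DOWN came from */
        bool            event_recvd;
};

/* Buggy pattern: any CHILD_UP/CHILD_DOWN wakes the waiter, even if the event
 * belongs to the old graph, so fops can be routed to the new graph before any
 * of its children are up. */
void graph_sync_wait_buggy(struct fuse_priv *priv)
{
        pthread_mutex_lock(&priv->lock);
        while (!priv->event_recvd)
                pthread_cond_wait(&priv->sync_cond, &priv->lock);
        priv->event_recvd = false;
        pthread_mutex_unlock(&priv->lock);
}

/* Graph-aware pattern: only an event received on the newly activated graph
 * ends the wait - the fix suggested above. */
void graph_sync_wait_fixed(struct fuse_priv *priv)
{
        pthread_mutex_lock(&priv->lock);
        while (!(priv->event_recvd &&
                 priv->event_graph_id == priv->active_graph_id))
                pthread_cond_wait(&priv->sync_cond, &priv->lock);
        priv->event_recvd = false;
        pthread_mutex_unlock(&priv->lock);
}

/* Called from notify() on CHILD_UP/CHILD_DOWN; records which graph the event
 * came from before signalling. */
void child_event(struct fuse_priv *priv, int graph_id)
{
        pthread_mutex_lock(&priv->lock);
        priv->event_recvd = true;
        priv->event_graph_id = graph_id;
        pthread_cond_signal(&priv->sync_cond);
        pthread_mutex_unlock(&priv->lock);
}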

Comment 38 SATHEESARAN 2017-02-09 09:31:28 UTC
(In reply to Raghavendra G from comment #36)
> 
> I've updated the debug build with more debug messages. Please use,
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12512422
> 
> > 
> > @Sas,
> > 
> > Can you please test with this new debug build? Thanks for your patience :)
> > 
> > regards,
> > Raghavendra

Raghu,

I have tested with this build and hit the VM pause issue; I will be uploading the fuse mount logs to the bug.

Comment 39 SATHEESARAN 2017-02-09 09:34:10 UTC
Created attachment 1248765 [details]
fuse-mount-logs observed with private build that had debug info in logs

This is the fuse mount logs as captured from the latest private build

Comment 40 Raghavendra G 2017-02-09 14:36:48 UTC
(In reply to SATHEESARAN from comment #39)
> Created attachment 1248765 [details]
> fuse-mount-logs observed with private build that had debug info in logs
> 
> This is the fuse mount logs as captured from the latest private build

Here are the observations:

* failed fops were sent on new graph
* By the time graph switch is complete and fops were directed to new graph, two subvols of replicate-0 (which contains files) were up (I assume that meets Quorum).

I assume EIO was returned by AFR. Not sure why afr fails fops even when a quorum number of subvols is up. Can someone from the replicate team shed light on this?

There are other observations too:
* A CHILD_UP event is sent to fuse-bridge when any one of afr's subvols is up. Shouldn't afr wait until a quorum number of bricks is up? Otherwise fuse starts directing fops to the new graph and they'll eventually fail as quorum won't be met.

* For DHT to notify CHILD_UP its sufficient that any one of its subvol is up. Now if that subvol happens to be the one which doesn't have files being accessed, fops will fail. For eg., in this case replicate-1 could come up before replicate-0 and fops can be directed to new graph even before replicate-0 is up. That too will result in failures of fops.

It's not straightforward to fix 2 in DHT as of now.
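
For observation 1, a tiny illustrative sketch (not the actual AFR code; the names are my assumptions) of what quorum-aware CHILD_UP propagation would look like:

#include <stdbool.h>

/* Simplified stand-in for afr child state. */
struct replica_state {
        int child_count;   /* e.g. 3 for a replica 3 subvolume */
        int up_count;      /* children currently connected */
};

/* Propagate CHILD_UP to the parent (fuse-bridge) only once client quorum is
 * met, so the new graph is not used before it can actually serve fops. */
bool should_notify_child_up(const struct replica_state *r)
{
        return r->up_count > r->child_count / 2;
}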

Comment 43 Raghavendra G 2017-02-11 14:34:49 UTC
(In reply to Raghavendra G from comment #40)
> * For DHT to notify CHILD_UP its sufficient that any one of its subvol is
> up. Now if that subvol happens to be the one which doesn't have files being
> accessed, fops will fail. For eg., in this case replicate-1 could come up
> before replicate-0 and fops can be directed to new graph even before
> replicate-0 is up. That too will result in failures of fops.

This is wrong. DHT waits till it hears from all subvols (either UP/DOWN). So, this problem doesn't exist.

Comment 44 SATHEESARAN 2017-02-16 17:10:32 UTC
Further information to add to this BZ

I have tried reproducing the issue with o-direct and I am not seeing the VM pause issue with glusterfs-3.8.4-14.el7rhgs

Actual steps goes as follow:

1. Created replica 3 volume
2. Mounted the volume on RHEL 7.3 host
3. Disable o-direct on the volume and turned on network.remote-dio
4. Created 5 VMs and started VM installation
5. While the VM installation is going on, performed add-brick to convert 1x3 to 2x3
6. All went smooth, and there are no VM pauses

With earlier testing, atleast 2 of the VMs used to pause.
I am not sure of the technical reason behind why this issue is not seen without o-direct.

Tried the above steps 2 times and couldn't hit VM pause

Comment 48 Pranith Kumar K 2017-02-20 10:08:33 UTC
Sas,
    Raghavendra and I went through the code base and we feel this could be failing because dht is failing mknod(sent from shard) because parent directory is not looked up for some reason. We need the test case to be run with both client-log-level and brick-log-level set to DEBUG.

The reason why we don't think it is from afr is that there is no log indicating afr found a split-brain. We see that dht_mknod() and shard_common_mknod_cbk() are logging mknod failures in DEBUG log level so there is more probability for this being the actual issue. We will be able to confirm once you provide these logs.

On behalf of
Pranith & Du

Comment 49 SATHEESARAN 2017-02-21 03:58:42 UTC
(In reply to Pranith Kumar K from comment #48)
> Sas,
>     Raghavendra and I went through the code base and we feel this could be
> failing because dht is failing mknod(sent from shard) because parent
> directory is not looked up for some reason. We need the test case to be run
> with both client-log-level and brick-log-level set to DEBUG.
> 
> The reason why we don't think it is from afr is that there is no log
> indicating afr found a split-brain. We see that dht_mknod() and
> shard_common_mknod_cbk() are logging mknod failures in DEBUG log level so
> there is more probability for this being the actual issue. We will be able
> to confirm once you provide these logs.
> 
> On behalf of
> Pranith & Du

Pranith,

Thanks. I will get back with the brick logs & fuse mount logs with client-log-level and brick-log-level set to DEBUG.

You have reset the needinfo set by Atin in comment 45; just restoring them.

Comment 52 SATHEESARAN 2017-02-21 07:36:52 UTC
Created attachment 1255993 [details]
fuse-mount-logs with client-log-level and brick-log-level set to DEBUG

Comment 54 SATHEESARAN 2017-02-21 07:40:01 UTC
Created attachment 1255996 [details]
bricks-logs on node1

Comment 56 SATHEESARAN 2017-02-21 07:53:39 UTC
Pranith grep'ed for some information in the fuse mount logs and I am pasting a snip of what he observed:

<snip>
# grep -i dht_mknod /var/log/glusterfs/mnt-repvol.log  | grep -v creating
[2017-02-21 07:13:16.029453] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.029465] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe834977ae4, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.038609] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.038622] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe834944968, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.051987] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408
[2017-02-21 07:13:16.051998] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83495d010, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.062837] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.062848] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83493d5a8, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.073165] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.073183] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83496c608, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.075046] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408
[2017-02-21 07:13:16.075087] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe8349192e8, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.081830] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408
[2017-02-21 07:13:16.081841] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe834963558, mantis-dht returned -1 error: Input/output error [Input/output error]
</snip>

Comment 57 Raghavendra G 2017-02-21 11:03:27 UTC
Observations:

* fuse_fsync failed with EIO

* dht_mknod failed with EIO on graph 2 (the new graph with the newly added bricks). The reason for the failure was that no layout was set on the .shard directory.

[raghu@unused tmp]$ grep -i "no subvolume in layout" .fr-iRPd1F/mnt-repvol.log 
[2017-02-21 07:13:16.029453] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.038609] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.051987] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408
[2017-02-21 07:13:16.062837] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.073165] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407
[2017-02-21 07:13:16.075046] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408
[2017-02-21 07:13:16.081830] D [MSGID: 0] [dht-common.c:5985:dht_mknod] 2-mantis-dht: no subvolume in layout for path=/.shard/790f61bc-0c16-4010-8a75-1b15f5c32b20.1408

* mknod was issued by features/shard to create shards c0dbdc8a-99a3-44e9-8d2f-447e4ec508d0.1407 and 790f61bc-0c16-4010-8a75-1b15f5c32b20.1408 as part of writes on these two different files, and the writes on these files were unwound as failures.

[2017-02-21 07:13:16.029476] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1407 failed: Input/output error
[2017-02-21 07:13:16.029507] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834977ae4, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.038622] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe834944968, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.038633] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1407 failed: Input/output error
[2017-02-21 07:13:16.038663] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834944968, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.051998] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83495d010, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.052009] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1408 failed: Input/output error
[2017-02-21 07:13:16.052018] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83495d010, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.062848] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83493d5a8, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.062858] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1407 failed: Input/output error
[2017-02-21 07:13:16.062867] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83493d5a8, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.073183] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe83496c608, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.073194] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1407 failed: Input/output error
[2017-02-21 07:13:16.073203] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83496c608, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.075087] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe8349192e8, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.075099] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1408 failed: Input/output error
[2017-02-21 07:13:16.075124] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe8349192e8, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.081841] D [MSGID: 0] [dht-common.c:6066:dht_mknod] 2-stack-trace: stack-address: 0x7fe834963558, mantis-dht returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.081852] D [MSGID: 0] [shard.c:3142:shard_common_mknod_cbk] 2-mantis-shard: mknod of shard 1408 failed: Input/output error
[2017-02-21 07:13:16.081861] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834963558, mantis-shard returned -1 error: Input/output error [Input/output error]

* fsync was waiting on flushing of these writes and since writes failed, fsync was unwound with error EIO

[2017-02-21 07:13:16.417951] D [MSGID: 0] [call-stub.c:2332:call_resume_unwind] 2-stack-trace: stack-address: 0x7fe8349530e8, mantis-write-behind returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.417973] D [MSGID: 0] [defaults.c:1509:default_fsync_cbk] 2-stack-trace: stack-address: 0x7fe8349530e8, mantis-open-behind returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.417987] D [MSGID: 0] [io-stats.c:2110:io_stats_fsync_cbk] 2-stack-trace: stack-address: 0x7fe8349530e8, mantis returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.418004] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 20429: FSYNC() ERR => -1 (Input/output error)

The fundamental question here is why mknod was issued when there was no dht layout set on the .shard directory. I think the reason is parallel writes from write-behind on both files involved here. Imagine the following scenario:

* write on file1 is wound to shard. Shard sees that inode for .shard is not linked in itable. So it winds down a mkdir on .shard which fails with EEXIST as it is present on replicate 0. Now shard_mkdir_dot_shard_cbk triggers a lookup on .shard. On receiving lookup, dht will trigger a heal on .shard as it is not present in replicate-1
[2017-02-21 07:13:15.978407] D [MSGID: 0] [afr-common.c:2170:afr_lookup_done] 2-stack-trace: stack-address: 0x7fe83492901c, mantis-replicate-1 returned -1 error: No such file or directory [No such file or directory]

[2017-02-21 07:13:15.978420] D [MSGID: 0] [dht-common.c:655:dht_lookup_dir_cbk] 2-mantis-dht: lookup of /.shard on mantis-replicate-1 returned error [No such file or directory]
[2017-02-21 07:13:15.978432] D [MSGID: 0] [dht-common.c:707:dht_lookup_dir_cbk] 2-mantis-dht: fixing assignment on /.shard

* dht_selfheal_directory links the inode it is healing into the itable even before healing is complete. If features/shard finds an inode corresponding to .shard, it proceeds with the write, and that results in creation of files in the .shard directory by issuing mknod. Note that by the time mknod is issued by features/shard there is no guarantee that the lookup on .shard is complete, so no layout is set on .shard. This results in dht_mknod failing with EIO.

[raghu@unused tmp]$ grep shard_common_inode_write_failure_unwind .fr-iRPd1F/mnt-repvol.log 
[2017-02-21 07:13:16.029507] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834977ae4, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.038663] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834944968, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.052018] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83495d010, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.062867] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83493d5a8, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.073203] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe83496c608, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.075124] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe8349192e8, mantis-shard returned -1 error: Input/output error [Input/output error]
[2017-02-21 07:13:16.081861] D [MSGID: 0] [shard.c:495:shard_common_inode_write_failure_unwind] 2-stack-trace: stack-address: 0x7fe834963558, mantis-shard returned -1 error: Input/output error [Input/output error]

Interestingly, we had a similar issue in fuse, where because of premature linking of the inode into the itable by dht-selfheal there could be fops on a directory even before the lookup is complete. Patch [1] addressed that issue. We need something similar in features/shard too.
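
A rough sketch of the kind of check being suggested for features/shard, similar to [1] (illustrative only; the structure and function names here are my own assumptions, not the actual shard code):

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the state shard would track for .shard. */
struct dot_shard_state {
        bool inode_linked;     /* inode present in the inode table */
        bool lookup_complete;  /* at least one lookup has unwound to shard */
};

enum next_step { WIND_MKDIR_FIRST, WIND_LOOKUP_FIRST, WIND_MKNOD };

/* Decide what to wind before creating a shard file under .shard. A linked
 * inode alone is not enough: dht_selfheal_directory links the inode before
 * the heal (and the layout setting) is complete, so require a completed
 * lookup as well. */
enum next_step shard_pre_mknod_check(const struct dot_shard_state *s)
{
        if (s == NULL || !s->inode_linked)
                return WIND_MKDIR_FIRST;   /* .shard may not exist yet */
        if (!s->lookup_complete)
                return WIND_LOOKUP_FIRST;  /* wait for the layout to be set */
        return WIND_MKNOD;                 /* safe to create the shard */
}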

However, even with features/shard sending lookups on .shard after it finds a linked inode but no shard inode-ctx set in it (indicating that not even one lookup has _completed_ as far as shard is concerned - similar to [1]), parallel lookups will be issued on .shard. These parallel lookups might try to heal .shard, and all but one would fail with EBUSY/EAGAIN. My preliminary code reading indicates that dht fails a lookup with EBUSY/EAGAIN/ESTALE if locking during selfheal fails. Features/shard, on failure of the lookup on .shard, fails the write operations too. So, this might be another area we need to look into.

[1] http://review.gluster.org/13225

Comment 58 Raghavendra G 2017-02-22 04:27:00 UTC
(In reply to SATHEESARAN from comment #44)
> Further information to add to this BZ
> 
> I have tried reproducing the issue with o-direct and I am not seeing the VM
> pause issue with glusterfs-3.8.4-14.el7rhgs
> 
> Actual steps goes as follow:
> 
> 1. Created replica 3 volume
> 2. Mounted the volume on RHEL 7.3 host
> 3. Disable o-direct on the volume and turned on network.remote-dio

We can possibly use this as workaround. But before that we need to identify why disabling o-direct helps. For that we need to capture some information

* Does RHEV run as an user-space process? If yes, Can we capture strace output of the process while the test is going on? Even if it is not run as an user space process, is there any way to capture the syscalls it executes along with parameters passed to these syscalls.

* We can also dump fuse traffic to file while test is going on and attach the traffic to this bz. We can use the following option while starting mount process:
      --dump-fuse=PATH       Dump fuse traffic to PATH

> 4. Created 5 VMs and started VM installation
> 5. While the VM installation is going on, performed add-brick to convert 1x3
> to 2x3
> 6. All went smooth, and there are no VM pauses
> 
> With earlier testing, atleast 2 of the VMs used to pause.
> I am not sure of the technical reason behind why this issue is not seen
> without o-direct.
> 
> Tried the above steps 2 times and couldn't hit VM pause

Comment 59 Raghavendra G 2017-02-22 04:31:14 UTC
(In reply to Raghavendra G from comment #57)

> 
> However, even with features/shard sending lookups on .shard after it finds a
> linked inode but no shard inode-ctx set in it (indicating that not even one
> lookup is _complete_ to shard - similar to [1]), parallel lookups will be
> issued on .shard. These parallel lookups might try to heal .shard and all
> but one fail with EBUSY/EAGAIN. My preliminary code reading indicates that
> dht fails lookup with EBUSY/EAGAIN/ESTALE if locking during selfheal fails.
> Features/shard, on failure of lookup on .shard fails the write operations
> too. So, this might be another area we need to look into.

DHT uses blocking inodelks for synchronization during selfheal. So, parallel lookups on .shard even when it requires healing won't result in any failures.

Comment 60 Raghavendra G 2017-02-22 04:54:51 UTC
(In reply to Raghavendra G from comment #58)
> (In reply to SATHEESARAN from comment #44)
> > Further information to add to this BZ
> > 
> > I have tried reproducing the issue with o-direct and I am not seeing the VM
> > pause issue with glusterfs-3.8.4-14.el7rhgs
> > 
> > Actual steps goes as follow:
> > 
> > 1. Created replica 3 volume
> > 2. Mounted the volume on RHEL 7.3 host
> > 3. Disable o-direct on the volume and turned on network.remote-dio
> 
> We can possibly use this as workaround. But before that we need to identify
> why disabling o-direct helps. For that we need to capture some information
> 
> * Does RHEV run as an user-space process? If yes, Can we capture strace
> output of the process while the test is going on? Even if it is not run as
> an user space process, is there any way to capture the syscalls it executes
> along with parameters passed to these syscalls.
> 
> * We can also dump fuse traffic to file while test is going on and attach
> the traffic to this bz. We can use the following option while starting mount
> process:
>       --dump-fuse=PATH       Dump fuse traffic to PATH

Note that we no longer need logs at DEBUG level; the default log level (I think INFO) is fine. Dumping fuse traffic already produces quite a bit of output, and combined with the DEBUG log level it might result in too much logging, alter the race, and the issue might not be reproduced, as we observed earlier.

> 
> > 4. Created 5 VMs and started VM installation
> > 5. While the VM installation is going on, performed add-brick to convert 1x3
> > to 2x3
> > 6. All went smooth, and there are no VM pauses
> > 
> > With earlier testing, atleast 2 of the VMs used to pause.
> > I am not sure of the technical reason behind why this issue is not seen
> > without o-direct.
> > 
> > Tried the above steps 2 times and couldn't hit VM pause

Comment 61 Raghavendra G 2017-02-22 05:43:08 UTC
There is a patch by Krutika [1]. I think a small change, suggested at [2], will also fix this bug.

[1] https://review.gluster.org/#/c/14419/
[2] https://review.gluster.org/#/c/14419/2/xlators/features/shard/src/shard.c@4919

@Krutika,

Need your opinion on this :).

regards,
Raghavendra

Comment 62 SATHEESARAN 2017-02-22 08:25:19 UTC
(In reply to Raghavendra G from comment #58)
>
> We can possibly use this as workaround. But before that we need to identify
> why disabling o-direct helps. For that we need to capture some information
> 
> * Does RHEV run as an user-space process? If yes, Can we capture strace
> output of the process while the test is going on? Even if it is not run as
> an user space process, is there any way to capture the syscalls it executes
> along with parameters passed to these syscalls.

To make it simpler, I am performing these tests in a qemu-kvm environment with RHEL 7.3. I think it will be good enough to capture strace of the qemu process and the associated syscalls.

> 
> * We can also dump fuse traffic to file while test is going on and attach
> the traffic to this bz. We can use the following option while starting mount
> process:
>       --dump-fuse=PATH       Dump fuse traffic to PATH

Ok sure. I will get that file in the next run

Comment 63 Atin Mukherjee 2017-02-23 04:04:50 UTC
https://review.gluster.org/#/c/14419/ was attempted by Krutika to address BZ 1337405. It looks like some modification to this change might help fix this issue too. Dev is actively working on this patch to make sure these issues can be addressed. Moving this bug to POST now.

Comment 64 SATHEESARAN 2017-02-23 04:58:18 UTC
Raghu,

Here is the strace summary of the syscalls for the qemu process (corresponding to the VM that got paused):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.87    0.556678          91      6129           ppoll
  1.81    0.010501           1     17586      1252 read
  1.26    0.007335           2      4396           ioctl
  0.66    0.003826           1      3114        16 futex
  0.18    0.001065         355         3           open
  0.11    0.000644         215         3           clone
  0.04    0.000226           1       384           rt_sigprocmask
  0.03    0.000154           1       121           write
  0.01    0.000063           1        94           poll
  0.01    0.000048          12         4           tgkill
  0.01    0.000039          20         2           sendmsg
  0.01    0.000035           1        24           brk
  0.01    0.000033           3        12           madvise
  0.00    0.000003           1         4           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.580650                 31876      1268 total

Comment 65 SATHEESARAN 2017-02-23 04:59:47 UTC
Raghu,

I have got the fuse dump which is around 8GB of data.
I have compressed and tar-ed it, still I have 2 GB of data.

I will drop this fuse dump in a private location and provide you the location to access the same.

Comment 67 Krutika Dhananjay 2017-02-23 08:24:05 UTC
Patches posted, gotten +2.

Big thanks to Raghavendra G and Pranith Kumar K for the RCA and the reviews.

Here is the build containing the patches applied to the head of rhgs-3.2.0 branch - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12608381

Comment 68 SATHEESARAN 2017-02-23 13:25:08 UTC
(In reply to Krutika Dhananjay from comment #67)
> Patches posted, gotten +2.
> 
> Big thanks to Raghavendra G and Pranith Kumar K for the RCA and the reviews.
> 
> Here is the build containing the patches applied to the head of rhgs-3.2.0
> branch - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12608381

I have tested with this private build with the following steps:

1. Created a replica 3 sharded volume
2. Optimized the volume for virt store and started the volume
3. Mounted the volume on RHEL 7.3 hypervisor
4. Created 5 VM images ( each of size 9 GB ) on the fuse mounted volume
5. Started OS installation on all the VMs
6. When the packages installation is in progress inside the VM, triggered add-brick to convert 1x3 volume to 2x3
7. Post add-brick VMs health was good and there are no VMs that went in to paused state

But the fuse mount logs has few error messages related to ENOTCONN, is that ok ?

<snip>
[2017-02-23 12:34:39.547656] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-0: remote operation failed
[2017-02-23 12:34:39.547735] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-1: remote operation failed
[2017-02-23 12:34:39.547758] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-2: remote operation failed
[2017-02-23 12:34:39.547804] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-0: remote operation failed [Transport endpoint is not connected]
[2017-02-23 12:34:39.547821] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-1: remote operation failed [Transport endpoint is not connected]
[2017-02-23 12:34:39.547834] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-2: remote operation failed [Transport endpoint is not connected]
</snip>

Comment 70 Krutika Dhananjay 2017-02-24 05:10:26 UTC
(In reply to SATHEESARAN from comment #68)
> (In reply to Krutika Dhananjay from comment #67)
> > Patches posted, gotten +2.
> > 
> > Big thanks to Raghavendra G and Pranith Kumar K for the RCA and the reviews.
> > 
> > Here is the build containing the patches applied to the head of rhgs-3.2.0
> > branch - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12608381
> 
> I have tested with this private build with the following steps:
> 
> 1. Created a replica 3 sharded volume
> 2. Optimized the volume for virt store and started the volume
> 3. Mounted the volume on RHEL 7.3 hypervisor
> 4. Created 5 VM images ( each of size 9 GB ) on the fuse mounted volume
> 5. Started OS installation on all the VMs
> 6. When the packages installation is in progress inside the VM, triggered
> add-brick to convert 1x3 volume to 2x3
> 7. Post add-brick VMs health was good and there are no VMs that went in to
> paused state
> 
> But the fuse mount logs has few error messages related to ENOTCONN, is that
> ok ?
> 
> <snip>
> [2017-02-23 12:34:39.547656] W [MSGID: 114031]
> [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-0: remote
> operation failed
> [2017-02-23 12:34:39.547735] W [MSGID: 114031]
> [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-1: remote
> operation failed
> [2017-02-23 12:34:39.547758] W [MSGID: 114031]
> [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-vmstore-client-2: remote
> operation failed
> [2017-02-23 12:34:39.547804] E [MSGID: 114031]
> [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-0: remote
> operation failed [Transport endpoint is not connected]
> [2017-02-23 12:34:39.547821] E [MSGID: 114031]
> [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-1: remote
> operation failed [Transport endpoint is not connected]
> [2017-02-23 12:34:39.547834] E [MSGID: 114031]
> [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-vmstore-client-2: remote
> operation failed [Transport endpoint is not connected]
> </snip>

That doesn't seem like a problem to me, as it's on the old graph (graph "0") and is caused by AFR's delayed post-op and unlock.

-Krutika

Comment 73 SATHEESARAN 2017-03-07 10:09:25 UTC
Tested with glusterfs-3.8.4-17.el7rhgs with the following steps :

1. Create a replica 3 sharded volume
2. Optimize the volume for virt-store.
3. Fuse mount the volume on the hypervisor
4. Create 5 App VMs with their disk on this volume
5. While VM installation is in progress, add more bricks to the volume ( 1x3 -> 2x3 )

Observed that all the VMs were healthy after the add-brick operation.

Comment 76 errata-xmlrpc 2017-03-23 06:02:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html