1406224 – VM pauses due to storage I/O error, when one of the data brick is down with arbiter/replica volume

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	replicate
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Ravishankar N
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1404982
Blocks:	1408171

Reported:	2016-12-20 01:28 UTC by Ravishankar N
Modified:	2017-03-06 17:40 UTC
CC List:	1 user
Fixed In Version:	glusterfs-3.10.0
Clone Of:	1404982
Clones:	1408171
Environment:
Last Closed:	2017-03-06 17:40:20 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Ravishankar N 2016-12-20 01:28:56 UTC

+++ This bug was initially created as a clone of Bug #1404982 +++

Description of problem:
In an arbiter volume, when one of the data bricks is killed while writes are in progress, the VM goes to the paused state and the following is seen in the mount logs:

[2016-12-15 09:47:16.357700] E [MSGID: 108008] [afr-transaction.c:2557:afr_write_txn_refresh_done] 0-data-replicate-0: Failing FXATTROP on gfid 883a5c0a-e16e-4937-83b5-5d90df1ec956: split-brain observed.
[2016-12-15 09:47:16.357724] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-data-shard: Update to file size xattr failed on 883a5c0a-e16e-4937-83b5-5d90df1ec956 [Input/output error]
[2016-12-15 09:47:16.357998] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 15170: WRITE => -1 gfid=883a5c0a-e16e-4937-83b5-5d90df1ec956 fd=0x7fd0f000f0f8 (Input/output error)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC with three nodes.
2. Inside the VM, mount the disk and start writing I/O.
3. While I/O is going on, kill one of the bricks.

Actual results:
The VM goes to the paused state with an Input/output error.

Expected results:
The VM should not go to the paused state, as only one of the data bricks is down.

Additional info:

--- Additional comment from Ravishankar N on 2016-12-19 11:25:20 EST ---

Thanks Kasturi for providing the setup for testing and thanks Satheesaran for providing virsh based commands for re-creating the issue.

The issue is due to a race between inode_refresh_done() and __afr_set_in_flight_sb_status() that occurs when I/O is in progress and a brick is brought down or up. When the brick goes down or comes up, an inode refresh is triggered in the write transaction, and inode_refresh_done() sets the correct data/metadata readable and event_generation. But before the transaction can proceed to the write FOP, __afr_set_in_flight_sb_status() from another writev cbk resets the event_generation. When the first write (the one that follows the inode refresh) reads the event_gen in afr_inode_get_readable(), it gets zero, and so the write fails with EIO.

While ignoring event_generation seems to fix the issue
-----------------------------------------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 60bae18..2f32e44 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1089,7 +1089,7 @@ afr_txn_refresh_done (call_frame_t *frame, xlator_t *this, int err)
                                       &event_generation,
                                       local->transaction.type);

-        if (ret == -EIO || !event_generation) {
+        if (ret == -EIO){
                 /* No readable subvolume even after refresh ==> splitbrain.*/
                 if (!priv->fav_child_policy) {
                         err = -EIO;
-----------------------------------------------------------
I need to convince myself that ignoring event_gen in afr_txn_refresh_done() for reads does not have any repercussions (there is no problem in ignoring it for writes).

Comment 1 Ravishankar N 2016-12-20 01:34:53 UTC

Before http://review.gluster.org/#/c/15673/, after inode refresh, we failed read txns in case of EIO or event_generation being zero. For write transactions, the check was only for EIO. 15673 re-factored the code to fail both read and write when event_generation=0. This seems to have caused a regression as explained in the description above.

While we could restore the above behaviour, it seems we do not need to check the event_gen value for read transactions either, because the event_gen could very well be set to zero after we checked it to be non-zero (post inode refresh) but just before we did a stack wind for that read txn.

Sending a patch to see if this breaks any upstream regression tests.

Comment 2 Worker Ant 2016-12-20 01:36:14 UTC

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks post inode refresh) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 3 Worker Ant 2016-12-21 14:01:46 UTC

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks post inode refresh) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 4 Worker Ant 2016-12-22 06:03:12 UTC

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks post inode refresh) posted (#3) for review on master by Ravishankar N (ravishankar)

Comment 5 Worker Ant 2016-12-22 06:13:43 UTC

REVIEW: http://review.gluster.org/16205 (afr: Ignore event_generation checks post inode refresh for write txns) posted (#4) for review on master by Ravishankar N (ravishankar)

Comment 6 Worker Ant 2016-12-22 11:06:35 UTC

COMMIT: http://review.gluster.org/16205 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 7ee998b9041d594d93a4e2ef369892c185e80def
Author: Ravishankar N <ravishankar>
Date:   Tue Dec 20 07:05:02 2016 +0530

    afr: Ignore event_generation checks post inode refresh for write txns
    
    Before http://review.gluster.org/#/c/15673/, after inode refresh, we
    failed read txns in case of EIO or event_generation being zero. For
    write transactions, the check was only for EIO. 15673 re-factored the
    code to fail both read and write when event_generation=0. This seems to
    have caused a regression as explained in the BZ.
    
    This patch restores that behaviour in afr_txn_refresh_done().
    
    Change-Id: Ib8e116506badce6f58b55827dbe403d95069d744
    BUG: 1406224
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/16205
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>

Comment 7 Shyamsundar 2017-03-06 17:40:20 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/
