Description of problem:
In an arbiter volume, when one of the data bricks is killed while I/O is being written, the vm goes to paused state and the following is seen in the mount logs:

[2016-12-15 09:47:16.357700] E [MSGID: 108008] [afr-transaction.c:2557:afr_write_txn_refresh_done] 0-data-replicate-0: Failing FXATTROP on gfid 883a5c0a-e16e-4937-83b5-5d90df1ec956: split-brain observed.
[2016-12-15 09:47:16.357724] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-data-shard: Update to file size xattr failed on 883a5c0a-e16e-4937-83b5-5d90df1ec956 [Input/output error]
[2016-12-15 09:47:16.357998] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 15170: WRITE => -1 gfid=883a5c0a-e16e-4937-83b5-5d90df1ec956 fd=0x7fd0f000f0f8 (Input/output error)

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC with three nodes.
2. Inside the vm, mount the disk and start writing I/O.
3. While I/O is going on, kill one of the bricks.

Actual results:
The vm goes to paused state with Input/output error.

Expected results:
The vm should not go to paused state, as only one of the data bricks is down.

Additional info:
See the mount log excerpt in the description above.
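Step 3 of the reproduction (killing a brick mid-I/O) can be sketched as below. This is only an illustrative sketch: on a real node the PID would be that of the glusterfsd process serving the brick, visible in `gluster volume status <volname>`; here a `sleep` process stands in for the brick daemon so the snippet is self-contained.

```shell
# Hypothetical sketch of killing a brick process mid-I/O.
# On a real setup, take the brick's glusterfsd PID from
# `gluster volume status <volname>`; a sleep process stands in here.
sleep 300 &
BRICK_PID=$!
kill -KILL "$BRICK_PID"        # brick goes down abruptly, as in the reproduction
wait "$BRICK_PID" 2>/dev/null  # reap it; exit status 137 == killed by SIGKILL
echo "brick placeholder exited with status $?"
```

With a replica 3 or arbiter volume, losing a single brick this way should not interrupt client I/O, which is the expected result stated above.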
sosreports can be found in the link below:
================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1404982/
Thanks Kasturi for providing the setup for testing, and thanks Satheesaran for providing virsh-based commands for re-creating the issue.

The issue is due to a race between inode_refresh_done() and __afr_set_in_flight_sb_status() that occurs when I/O is going on and a brick is brought down or up. When the brick goes down or comes back up, an inode refresh is triggered in the write transaction, which sets the correct data/metadata readable and event_generation in inode_refresh_done(). But before it can proceed to the write FOP, __afr_set_in_flight_sb_status() from another writev cbk resets the event_generation. When the first write (the one that follows the inode refresh) reads the event_gen in afr_inode_get_readable(), it gets zero, because of which it fails the write with EIO.

Ignoring event_generation seems to fix the issue:
-----------------------------------------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 60bae18..2f32e44 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1089,7 +1089,7 @@ afr_txn_refresh_done (call_frame_t *frame, xlator_t *this, int err)
                                    &event_generation,
                                    local->transaction.type);
 
-        if (ret == -EIO || !event_generation) {
+        if (ret == -EIO){
                 /* No readable subvolume even after refresh ==> splitbrain.*/
                 if (!priv->fav_child_policy) {
                         err = -EIO;
-----------------------------------------------------------
I need to convince myself that ignoring event gen in afr_txn_refresh_done() for reads (there is no problem in ignoring it for writes) does not have any repercussions.
Upstream mainline patch http://review.gluster.org/16205 posted for review.
I have seen the same issue with a replica 3 volume as well, and have updated the bug summary accordingly.
Will verify this bug once bug https://bugzilla.redhat.com/show_bug.cgi?id=1400057 is fixed. Without that fix, I see that some entries still remain in the heal info and do not go away.
Verified and works fine with build glusterfs-3.8.4-11.el7rhgs.x86_64.

With arbiter volume:
=========================
1) Deployed HC stack on arbiter volumes.
2) Created a vm and attached a disk from the data arbiter volume.
3) Mounted the disk at /mnt/testdata.
4) Started writing I/O.
5) Once the I/O started, brought down the first data brick in the volume.

I/O did not stop and the vm did not go to paused state.

volume info from arbiter volume:
===================================
[root@rhsqa-grafton4 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: b37f7c59-c9e3-4b04-97fe-39b4d462d5c1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick2/data
Brick2: 10.70.36.83:/rhgs/brick2/data
Brick3: 10.70.36.84:/rhgs/brick2/data (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

With replicate volume:
=========================
1) Deployed HC stack on replicate volumes.
2) Created a vm and attached a disk from the data volume.
3) Mounted the disk at /mnt/testdata.
4) Started writing I/O.
5) Once the I/O started, brought down the first data brick in the volume.

I/O did not stop and the vm did not go to paused state.
volume info for replicate volume:
=====================================
[root@rhsqa-grafton1 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 29d01e0f-bec3-4e68-bbef-4011d95fea4a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick2/data
Brick2: 10.70.36.80:/rhgs/brick2/data
Brick3: 10.70.36.81:/rhgs/brick2/data
Options Reconfigured:
auth.ssl-allow: 10.70.36.80,10.70.36.79,10.70.36.81
server.ssl: on
client.ssl: on
cluster.use-compound-fops: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html