Description of problem:
In an arbiter volume, when one of the data bricks is killed while I/O is being written, the vm goes to paused state and the following is seen in the mount logs:

[2016-12-15 09:47:16.357700] E [MSGID: 108008] [afr-transaction.c:2557:afr_write_txn_refresh_done] 0-data-replicate-0: Failing FXATTROP on gfid 883a5c0a-e16e-4937-83b5-5d90df1ec956: split-brain observed.
[2016-12-15 09:47:16.357724] E [MSGID: 133016] [shard.c:631:shard_update_file_size_cbk] 0-data-shard: Update to file size xattr failed on 883a5c0a-e16e-4937-83b5-5d90df1ec956 [Input/output error]
[2016-12-15 09:47:16.357998] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 15170: WRITE => -1 gfid=883a5c0a-e16e-4937-83b5-5d90df1ec956 fd=0x7fd0f000f0f8 (Input/output error)

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC with three nodes.
2. Inside the vm, mount the disk and start writing I/O.
3. While I/O is going on, kill one of the bricks.

Actual results:
The vm goes to paused state with Input/output error.

Expected results:
The vm should not go to paused state, as only one of the data bricks is down.

Additional info:
See the mount log excerpt in the description above.
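Step 3 of the reproduction (killing a brick mid-I/O) can be sketched as below. This is only an illustrative sketch: on a real node the PID would be that of the glusterfsd process serving the brick, visible in `gluster volume status <volname>`; here a `sleep` process stands in for the brick daemon so the snippet is self-contained.

```shell
# Hypothetical sketch of killing a brick process mid-I/O.
# On a real setup, take the brick's glusterfsd PID from
# `gluster volume status <volname>`; a sleep process stands in here.
sleep 300 &
BRICK_PID=$!
kill -KILL "$BRICK_PID"        # brick goes down abruptly, as in the reproduction
wait "$BRICK_PID" 2>/dev/null  # reap it; exit status 137 == killed by SIGKILL
echo "brick placeholder exited with status $?"
```

With a replica 3 or arbiter volume, losing a single brick this way should not interrupt client I/O, which is the expected result stated above.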
sosreports can be found in the link below:
================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1404982/
Thanks Kasturi for providing the setup for testing, and thanks Satheesaran for providing virsh-based commands for re-creating the issue.

The issue is due to a race between inode_refresh_done() and __afr_set_in_flight_sb_status() that occurs when I/O is going on and a brick is brought down or up. When the brick goes down or comes back up, an inode refresh is triggered in the write transaction, which sets the correct data/metadata readable and event_generation in inode_refresh_done(). But before it can proceed to the write FOP, __afr_set_in_flight_sb_status() from another writev cbk resets the event_generation. When the first write (the one that follows the inode refresh) reads the event_gen in afr_inode_get_readable(), it gets zero, because of which it fails the write with EIO.

Ignoring event_generation seems to fix the issue:
-----------------------------------------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 60bae18..2f32e44 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -1089,7 +1089,7 @@ afr_txn_refresh_done (call_frame_t *frame, xlator_t *this, int err)
                                    &event_generation,
                                    local->transaction.type);
 
-        if (ret == -EIO || !event_generation) {
+        if (ret == -EIO){
                 /* No readable subvolume even after refresh ==> splitbrain.*/
                 if (!priv->fav_child_policy) {
                         err = -EIO;
-----------------------------------------------------------
I need to convince myself that ignoring event gen in afr_txn_refresh_done() for reads (there is no problem in ignoring it for writes) does not have any repercussions.
Upstream mainline patch http://review.gluster.org/16205 posted for review.
I have seen the same issue with a replica 3 volume as well, and have updated the bug summary accordingly.
Will verify this bug once bug https://bugzilla.redhat.com/show_bug.cgi?id=1400057 is fixed. Without that fix, I see that some entries still remain in the heal info and do not go away.
Verified and works fine with build glusterfs-3.8.4-11.el7rhgs.x86_64.

With arbiter volume:
=========================
1) Deployed HC stack on arbiter volumes.
2) Created a vm and attached a disk from the data arbiter volume.
3) Mounted the disk at /mnt/testdata.
4) Started writing I/O.
5) Once the I/O started, brought down the first data brick in the volume.

I/O did not stop and the vm did not go to paused state.

volume info from arbiter volume:
===================================
[root@rhsqa-grafton4 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: b37f7c59-c9e3-4b04-97fe-39b4d462d5c1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick2/data
Brick2: 10.70.36.83:/rhgs/brick2/data
Brick3: 10.70.36.84:/rhgs/brick2/data (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

With replicate volume:
=========================
1) Deployed HC stack on replicate volumes.
2) Created a vm and attached a disk from the data volume.
3) Mounted the disk at /mnt/testdata.
4) Started writing I/O.
5) Once the I/O started, brought down the first data brick in the volume.

I/O did not stop and the vm did not go to paused state.
volume info for replicate volume:
=====================================
[root@rhsqa-grafton1 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 29d01e0f-bec3-4e68-bbef-4011d95fea4a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick2/data
Brick2: 10.70.36.80:/rhgs/brick2/data
Brick3: 10.70.36.81:/rhgs/brick2/data
Options Reconfigured:
auth.ssl-allow: 10.70.36.80,10.70.36.79,10.70.36.81
server.ssl: on
client.ssl: on
cluster.use-compound-fops: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html