Bug 1329503

Summary:	[tiering]: during detach tier operation, Input/output error is seen with new file writes on NFS mount
Product:	[Community] GlusterFS	Reporter:	Mohammed Rafi KC <rkavunga>
Component:	tiering	Assignee:	Mohammed Rafi KC <rkavunga>
Status:	CLOSED CURRENTRELEASE	QA Contact:	bugs <bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	bugs, kramdoss, nchilaka, rkavunga
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.8rc2	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1326248
Clones:	1329505 1330428		Environment:
Last Closed:	2016-06-16 14:04:01 UTC	Type:	Bug
Bug Depends On:	1326248
Bug Blocks:	1329505, 1330428

Comment 1 Mohammed Rafi KC 2016-04-22 06:31:39 UTC

Copying the description and RCA here for public visibility.


Description of problem:
On an NFS mount, if a detach tier operation is started while large files are being written, the writes fail with an Input/output error.

[root@dhcp46-9 mnt]# while true; do for i in {1..5};do dd if=/dev/urandom of=file$i bs=1024 count=700000;echo $?;done; echo 'end of cycle'; done
700000+0 records in
700000+0 records out
716800000 bytes (717 MB) copied, 73.3324 s, 9.8 MB/s
0
700000+0 records in
700000+0 records out
716800000 bytes (717 MB) copied, 71.0725 s, 10.1 MB/s
0
dd: error writing ‘file3’: Input/output error
600027+0 records in
600026+0 records out
614426624 bytes (614 MB) copied, 70.7233 s, 8.7 MB/s
1
700000+0 records in
700000+0 records out
716800000 bytes (717 MB) copied, 75.3172 s, 9.5 MB/s
0
700000+0 records in
700000+0 records out
716800000 bytes (717 MB) copied, 73.2562 s, 9.8 MB/s
0
end of cycle

[2016-04-12 01:43:39.423991] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-testvol-replicate-4: Failing WRITE on gfid 250d586b-3591-470b-a3ce-99fe52bb453d: split-brain observed. [Input/output error]
[2016-04-12 01:43:39.424838] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-testvol-replicate-4: Failing WRITE on gfid 250d586b-3591-470b-a3ce-99fe52bb453d: split-brain observed. [Input/output error]
[2016-04-12 01:43:39.425705] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-testvol-replicate-4: Failing WRITE on gfid 250d586b-3591-470b-a3ce-99fe52bb453d: split-brain observed. [Input/output error]
[2016-04-12 01:43:39.429049] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-testvol-replicate-4: Failing WRITE on gfid 250d586b-3591-470b-a3ce-99fe52bb453d: split-brain observed. [Input/output error]
[2016-04-12 01:43:39.430226] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-testvol-replicate-4: Failing WRITE on gfid 250d586b-3591-470b-a3ce-99fe52bb453d: split-brain observed. [Input/output error]

[root@dhcp47-105 ~]# gluster v info
 
Volume Name: testvol
Type: Tier
Volume ID: 02427025-adcf-48a2-ac58-ae494839e9f8
Status: Started
Number of Bricks: 12
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.94:/bricks/brick3/leg1
Brick2: 10.70.47.9:/bricks/brick3/leg1
Brick3: 10.70.47.105:/bricks/brick3/leg1
Brick4: 10.70.47.90:/bricks/brick3/leg1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick5: 10.70.47.90:/bricks/brick0/ct
Brick6: 10.70.47.105:/bricks/brick0/ct
Brick7: 10.70.47.9:/bricks/brick0/ct
Brick8: 10.70.46.94:/bricks/brick0/ct
Brick9: 10.70.47.90:/bricks/brick1/ct
Brick10: 10.70.47.105:/bricks/brick1/ct
Brick11: 10.70.47.9:/bricks/brick1/ct
Brick12: 10.70.46.94:/bricks/brick1/ct
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-1.el7rhgs.x86_64

How reproducible:
2/3 

Steps to Reproduce:
1) Create a distributed-replicate volume, start it, and enable quota.
2) NFS-mount the volume and use dd to create, say, 5 files of at least 700 MB each: "for i in {1..5};do dd if=/dev/urandom of=file$i bs=1024 count=700000;echo $?;done"
3) While dd is in progress, perform an attach tier operation.
4) After attach tier succeeds, run detach tier start. This is when dd throws the I/O error.


Actual results:
IO error is seen

Expected results:
No I/O error should be seen during the detach tier operation.

Additional info:

--- Additional comment from Mohammed Rafi KC on 2016-04-21 10:40:23 EDT ---

RCA:

NFS uses an anonymous fd when writing to a file. If the file has been migrated away from its cached subvolume, a write or lock from afr fails with ENOENT. When the write fails, the dht layer first runs its migration-complete check, which does a lookup on the previous source subvolume. Because the file is no longer there, the lookup fails, and afr sets the readable flag to 0 for all of its subvolumes. At this point the tier layer still has the old source as cached_subvol, so any subsequent request is sent to the same subvolume again, and afr returns EIO.

The tier layer updates cached_subvol only after it completes its own migration-complete check, so the race window lies between the migration-complete check in the dht layer and the one in the tier layer.
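
The sequence in this RCA can be illustrated with a toy model. This is a minimal sketch in Python, not GlusterFS code: the names `Subvol`, `ToyTier`, `readable`, `run_race`, and all method names are hypothetical stand-ins for the afr, dht, and tier behaviour described above.

```python
import errno


class Subvol:
    """Hypothetical stand-in for one replicated (afr) subvolume."""

    def __init__(self, name):
        self.name = name
        self.files = set()      # files currently stored here
        self.readable = True    # toy version of afr's readable flag

    def lookup(self, path):
        if path not in self.files:
            # Per the RCA, the failed lookup leaves afr with no readable copy.
            self.readable = False
            return errno.ENOENT
        return 0

    def write(self, path):
        if not self.readable:
            return errno.EIO    # afr reports this as split-brain / EIO
        if path not in self.files:
            return errno.ENOENT
        return 0


class ToyTier:
    """Hypothetical tier layer that remembers where a file is cached."""

    def __init__(self, hot, cold, path):
        self.hot, self.cold, self.path = hot, cold, path
        self.cached_subvol = hot

    def migrate(self):
        # Detach tier moves the file from the hot tier to the cold tier.
        self.hot.files.discard(self.path)
        self.cold.files.add(self.path)

    def dht_migration_check(self):
        # dht looks the file up on the previous source subvolume.
        return self.cached_subvol.lookup(self.path)

    def tier_update_cached_subvol(self):
        # tier switches cached_subvol only after its own check completes.
        self.cached_subvol = self.cold

    def write(self):
        return self.cached_subvol.write(self.path)


def run_race():
    hot, cold = Subvol("hot"), Subvol("cold")
    hot.files.add("file3")
    tier = ToyTier(hot, cold, "file3")

    tier.migrate()
    first = tier.write()              # anonymous-fd write hits the old subvol
    tier.dht_migration_check()        # lookup fails, readable flag cleared
    in_window = tier.write()          # still routed to the old subvol
    tier.tier_update_cached_subvol()
    after = tier.write()              # now routed to the cold tier
    return first, in_window, after
```

In this model the first write fails with ENOENT, the write issued inside the window fails with EIO, and a write issued after cached_subvol is updated succeeds.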

Comment 2 Vijay Bellur 2016-04-22 06:41:19 UTC

REVIEW: http://review.gluster.org/14049 (tier/dht: check for rebalance completion for EIO error) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

Comment 3 Vijay Bellur 2016-04-25 14:55:29 UTC

COMMIT: http://review.gluster.org/14049 committed in master by Jeff Darcy (jdarcy) 
------
commit a9ccd0c8ea6989c72073028b296f73a6fcb6b896
Author: Mohammed Rafi KC <rkavunga>
Date:   Fri Apr 22 12:07:31 2016 +0530

    tier/dht: check for rebalance completion for EIO error
    
    When an ongoing rebalance completion check task been
    triggered by dht, there is a possibility of a race
    between afr setting subvol as non-readable and dht updates
    the cached subvol. In this window a write can fail with EIO.
    
    Change-Id: I42638e6d4104c0dbe893d1bc73e1366188458c5d
    BUG: 1329503
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: http://review.gluster.org/14049
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: N Balachandran <nbalacha>
    Reviewed-by: Jeff Darcy <jdarcy>
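
The idea named in the commit title can be sketched with a toy model. This is only an illustration in Python, not the actual patch: the state dictionary, `write_to`, `rebalance_completion_check`, and `write_with_check` are hypothetical names, and the single retry after the check is an assumption made for the example.

```python
import errno


def new_state():
    """Hypothetical state inside the race window: the file already lives on
    the cold tier, tier still points at the hot tier, and afr has marked the
    hot subvolume unreadable."""
    return {"location": "cold", "cached_subvol": "hot", "hot_readable": False}


def write_to(state, subvol):
    """Toy write: EIO if the stale subvolume was marked unreadable."""
    if subvol == "hot" and not state["hot_readable"]:
        return errno.EIO
    return 0 if state["location"] == subvol else errno.ENOENT


def rebalance_completion_check(state):
    """Toy check: point cached_subvol at wherever the file now lives."""
    state["cached_subvol"] = state["location"]


def write_with_check(state, retry_on):
    """Write once; on a listed errno, run the completion check and retry."""
    err = write_to(state, state["cached_subvol"])
    if err in retry_on:
        rebalance_completion_check(state)
        err = write_to(state, state["cached_subvol"])
    return err


# Only ENOENT triggers the check: the EIO reaches the application.
before = write_with_check(new_state(), retry_on=(errno.ENOENT,))
# EIO also triggers the check: the retried write succeeds.
after = write_with_check(new_state(), retry_on=(errno.ENOENT, errno.EIO))
```

In this toy model `before` is EIO and `after` is 0, which matches the intent stated in the commit message: a write that fails with EIO inside the window is followed by a rebalance completion check instead of being returned to the client.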

Comment 4 Niels de Vos 2016-06-16 14:04:01 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user