Bug 1179560

Summary: sanlock directio test file (__DIRECT_IO_TEST__) is triggering self heal
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Paul Cuzner <pcuzner>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: medium
Priority: low
Version: rhgs-3.0
CC: amukherj, knarra, nlevinki, olim, pcuzner, pkarampu, ravishankar, rhs-bugs, sabose, sasundar, scotth, storage-qa-internal, vbellur
Keywords: ZStream
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-08-28 15:21:28 UTC
Bug Blocks: 1177771

Description Paul Cuzner 2015-01-07 05:49:14 UTC
Description of problem:
The __DIRECT_IO_TEST__ file used by sanlock is triggering self-heal activity on DC start-up, on VM start/shutdown cycles, and whenever probes are made to the glusterfs volume. The volume option "network.remote-dio: on" is set.
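For context, the file comes from a direct-I/O write probe against the storage domain mount. A minimal sketch of that kind of probe is below; the mount path and block size are hypothetical, and the hypervisor's exact command may differ:

# hypothetical storage-domain mount path; adjust to your environment
MNT=/rhev/data-center/mnt/glusterSD/server1:_vmstore
# write a single 4 KiB block with O_DIRECT, roughly what the periodic
# storage probe does when it touches __DIRECT_IO_TEST__
dd if=/dev/zero of=$MNT/__DIRECT_IO_TEST__ oflag=direct bs=4096 count=1

Each such write goes through the replicate (AFR) translator on the client, which is where the pending-heal markers on this file are being recorded.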


Version-Release number of selected component (if applicable):
rhss 3.0.x, glusterfs 3.6.0.30 el7 builds

How reproducible:
I see this entry in the self-heal info output all the time.

Steps to Reproduce:
1. Build an environment based on RHEL 7.1 beta, glusterfs 3.6 el7 and RHEV-M 3.5 beta
2. Activate the RHEV cluster
3. Check the "gluster volume heal <VOLNAME> info" output on the VM storage domain volume (see the example below)
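
For reference, step 3 amounts to running the following on any server in the trusted storage pool (the volume name is a placeholder):

# list files with pending self-heal entries
gluster volume heal <VOLNAME> info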

Actual results:
__DIRECT_IO_TEST__ keeps appearing in the self heal output

Expected results:
This file should not trigger any self-heal/recovery activity when remote-dio is enabled.
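
To confirm the option on a given volume (volume name is a placeholder), the reconfigured options can be inspected and, if needed, the option set:

# list reconfigured options and check for remote-dio
gluster volume info <VOLNAME> | grep remote-dio
# set it, matching the configuration described above
gluster volume set <VOLNAME> network.remote-dio on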

Additional info:

Comment 4 Sahina Bose 2016-03-09 06:17:05 UTC
Sas, do you see this with replica 3 volume and with sharding turned on?

Comment 5 Scott Harvanek 2016-07-25 16:02:41 UTC
I can confirm I see the same thing on a distributed-replicated volume:

gluster vol info gv0
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 08773fa0-d57d-4b0a-a517-eaba19e7d58c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 172.16.17.1:/gluster/brick1/gv0
Brick2: 172.16.17.2:/gluster/brick1/gv0
Brick3: 172.16.17.3:/gluster/brick1/gv0
Brick4: 172.16.17.4:/gluster/brick1/gv0
Options Reconfigured:
performance.read-ahead: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on

gluster volume heal gv0 info
Brick 172.16.17.1:/gluster/brick1/gv0
/__DIRECT_IO_TEST__ 
Number of entries: 1

Brick 172.16.17.2:/gluster/brick1/gv0
/__DIRECT_IO_TEST__ 
Number of entries: 1

Brick 172.16.17.3:/gluster/brick1/gv0
Number of entries: 0

Brick 172.16.17.4:/gluster/brick1/gv0
Number of entries: 0
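
A suggested diagnostic (not part of the original report) is to read the AFR changelog extended attributes of the file directly on the bricks listed above; non-zero trusted.afr.* values indicate genuinely pending heals on that brick:

# run on each server, against the brick path shown in the vol info output
getfattr -d -m . -e hex /gluster/brick1/gv0/__DIRECT_IO_TEST__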

Comment 6 Scott Harvanek 2016-07-25 16:07:23 UTC
Slightly newer versions on my side, however:

RHEV-H - 7.1 (20150603.0.el7ev); 7.2 sees it and wants to mark the array as inoperable. I'm not sure it's actually hitting the issue, though, as I have the 7.1 hosts in service and all the 7.2 hosts in MTX due to this.

Gluster server version 3.7.12

RHEV-M - 3.5.8-0.1

Comment 8 Pranith Kumar K 2017-02-10 07:13:54 UTC
Sas,
   Is this issue still reproducible? We are doing the planning for 3.3.0. Let us know your inputs.

Comment 9 SATHEESARAN 2017-02-16 16:35:12 UTC
(In reply to Pranith Kumar K from comment #8)
> Sas,
>    Is this issue still reproducible? We are doing the planning for 3.3.0.
> Let us know your inputs.

I am not seeing this issue with Gluster 3.8.4 and oVirt 4.1.

But I remember Kasturi reporting such an issue with an arbiter volume.
Let me redo the test with an arbiter volume and raise a bug accordingly if the issue is seen.

Comment 10 SATHEESARAN 2017-02-16 16:36:42 UTC
(In reply to Scott Harvanek from comment #6)
> Slightly more updated versions however-
> 
> RHEV-H - 7.1 - 20150603.0.el7ev , 7.2 sees it and wants to mark the array as
> inoperable, I'm not sure it's actually having the issue tho as I have 7.1
> hosts in service and all the 7.2 hosts in MTX due to this.
> 
> Gluster server version 3.7.12
> 
> RHEV-M - 3.5.8-0.1

Scott,

Kindly check whether you are seeing this issue with oVirt 4.1 and Gluster 3.8.

Comment 11 Scott Harvanek 2017-02-16 16:45:34 UTC
My issue was related to my Distributed-Replicate volume being arranged in an unsupported layout; I moved away from 2x2 and haven't had an issue since.

Comment 12 SATHEESARAN 2017-02-16 17:51:27 UTC
(In reply to Scott Harvanek from comment #11)
> My issue was related to my Distributed-Replicate volume being arranged in
> an unsupported layout; I moved away from 2x2 and haven't had an issue since.

Thanks Scott, good to hear that. It's not that distributed-replicate is unsupported; rather, replica 2 is prone to split-brain issues. It's the replica 3 flavor that gives you better consistency and availability (to a certain extent).
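
For reference, a 2x2 layout can be grown to replica 3 by adding a third brick to each replica set; the hosts and brick paths below are placeholders, not from this report:

# add one brick per replica set to convert replica 2 to replica 3
gluster volume add-brick gv0 replica 3 server5:/gluster/brick1/gv0 server6:/gluster/brick1/gv0
# trigger a full heal so the new bricks are populated
gluster volume heal gv0 full

Alternatively, arbiter bricks (replica 3 arbiter 1) give the same quorum protection without storing full data copies.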

Comment 14 Sahina Bose 2017-08-28 10:55:41 UTC
I think we can close this based on Comment 11 and comment 9?

Comment 15 SATHEESARAN 2017-08-28 15:21:28 UTC
(In reply to Sahina Bose from comment #14)
> I think we can close this based on Comment 11 and comment 9?

Yes, that makes sense. 

I am closing this bug as CLOSED CURRENTRELEASE, as this issue was not reproducible with RHGS 3.2.0.