Bug 1179560

Summary: sanlock directio test file (__DIRECT_IO_TEST__) is triggering self heal
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Paul Cuzner <pcuzner>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: medium
Priority: low
Version: rhgs-3.0
CC: amukherj, knarra, nlevinki, olim, pcuzner, pkarampu, ravishankar, rhs-bugs, sabose, sasundar, scotth, storage-qa-internal, vbellur
Keywords: ZStream
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-08-28 15:21:28 UTC
Bug Blocks: 1177771

Description Paul Cuzner 2015-01-07 05:49:14 UTC
Description of problem:
The __DIRECT_IO_TEST__ file used by sanlock is triggering self-heal activity on DC start-up, on VM start/shutdown cycles, and whenever probes are made to the glusterfs volume. The volume option "network.remote-dio: on" is set.
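For context, the file comes from a direct-I/O write probe against the storage domain mount. A minimal sketch of that kind of probe is below; the mount path and block size are hypothetical, and the hypervisor's exact command may differ:

# hypothetical storage-domain mount path; adjust to your environment
MNT=/rhev/data-center/mnt/glusterSD/server1:_vmstore
# write a single 4 KiB block with O_DIRECT, roughly what the periodic
# storage probe does when it touches __DIRECT_IO_TEST__
dd if=/dev/zero of=$MNT/__DIRECT_IO_TEST__ oflag=direct bs=4096 count=1

Each such write goes through the replicate (AFR) translator on the client, which is where the pending-heal markers on this file are being recorded.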


Version-Release number of selected component (if applicable):
rhss 3.0.x, glusterfs 3.6.0.30 el7 builds

How reproducible:
I see this entry in the self-heal info output all the time.

Steps to Reproduce:
1. Build an environment based on RHEL 7.1 beta, glusterfs 3.6 el7 and RHEV-M 3.5 beta
2. Activate the RHEV cluster
3. Check the "gluster volume heal <VOLNAME> info" output on the VM storage domain volume (see the example below)
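
For reference, step 3 amounts to running the following on any server in the trusted storage pool (the volume name is a placeholder):

# list files with pending self-heal entries
gluster volume heal <VOLNAME> info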

Actual results:
__DIRECT_IO_TEST__ keeps appearing in the self heal output

Expected results:
This file should not trigger any self-heal/recovery activity when remote-dio is enabled.
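
To confirm the option on a given volume (volume name is a placeholder), the reconfigured options can be inspected and, if needed, the option set:

# list reconfigured options and check for remote-dio
gluster volume info <VOLNAME> | grep remote-dio
# set it, matching the configuration described above
gluster volume set <VOLNAME> network.remote-dio on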

Additional info:

Comment 4 Sahina Bose 2016-03-09 06:17:05 UTC
Sas, do you see this with replica 3 volume and with sharding turned on?

Comment 5 Scott Harvanek 2016-07-25 16:02:41 UTC
I can confirm I see the same thing on a distributed-replicated volume:

gluster vol info gv0
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 08773fa0-d57d-4b0a-a517-eaba19e7d58c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 172.16.17.1:/gluster/brick1/gv0
Brick2: 172.16.17.2:/gluster/brick1/gv0
Brick3: 172.16.17.3:/gluster/brick1/gv0
Brick4: 172.16.17.4:/gluster/brick1/gv0
Options Reconfigured:
performance.read-ahead: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on

gluster volume heal gv0 info
Brick 172.16.17.1:/gluster/brick1/gv0
/__DIRECT_IO_TEST__ 
Number of entries: 1

Brick 172.16.17.2:/gluster/brick1/gv0
/__DIRECT_IO_TEST__ 
Number of entries: 1

Brick 172.16.17.3:/gluster/brick1/gv0
Number of entries: 0

Brick 172.16.17.4:/gluster/brick1/gv0
Number of entries: 0
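
A suggested diagnostic (not part of the original report) is to read the AFR changelog extended attributes of the file directly on the bricks listed above; non-zero trusted.afr.* values indicate genuinely pending heals on that brick:

# run on each server, against the brick path shown in the vol info output
getfattr -d -m . -e hex /gluster/brick1/gv0/__DIRECT_IO_TEST__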

Comment 6 Scott Harvanek 2016-07-25 16:07:23 UTC
Slightly newer versions on my side, however:

RHEV-H - 7.1 (20150603.0.el7ev); 7.2 sees it and wants to mark the array as inoperable. I'm not sure it's actually hitting the issue, though, as I have the 7.1 hosts in service and all the 7.2 hosts in MTX due to this.

Gluster server version 3.7.12

RHEV-M - 3.5.8-0.1

Comment 8 Pranith Kumar K 2017-02-10 07:13:54 UTC
Sas,
   Is this issue still reproducible? We are doing the planning for 3.3.0. Let us know your inputs.

Comment 9 SATHEESARAN 2017-02-16 16:35:12 UTC
(In reply to Pranith Kumar K from comment #8)
> Sas,
>    Is this issue still reproducible? We are doing the planning for 3.3.0.
> Let us know your inputs.

I am not seeing this issue with Gluster 3.8.4 and oVirt 4.1.

But I remember Kasturi reporting such an issue with an arbiter volume.
Let me redo the test with an arbiter volume and raise a bug accordingly if the issue is seen.

Comment 10 SATHEESARAN 2017-02-16 16:36:42 UTC
(In reply to Scott Harvanek from comment #6)
> Slightly more updated versions however-
> 
> RHEV-H - 7.1 - 20150603.0.el7ev , 7.2 sees it and wants to mark the array as
> inoperable, I'm not sure it's actually having the issue tho as I have 7.1
> hosts in service and all the 7.2 hosts in MTX due to this.
> 
> Gluster server version 3.7.12
> 
> RHEV-M - 3.5.8-0.1

Scott,

Kindly check whether you are seeing this issue with oVirt 4.1 and Gluster 3.8.

Comment 11 Scott Harvanek 2017-02-16 16:45:34 UTC
My issue was related to my Distributed-Replicate volume being arranged in an unsupported layout; I moved away from 2x2 and haven't had an issue since.

Comment 12 SATHEESARAN 2017-02-16 17:51:27 UTC
(In reply to Scott Harvanek from comment #11)
> My issue was related to my Distributed-Replicate volume being arranged in
> an unsupported layout; I moved away from 2x2 and haven't had an issue since.

Thanks Scott, good to hear that. It's not that distributed-replicate is unsupported; rather, replica 2 is prone to split-brain issues. It's the replica 3 flavor that gives you better consistency and availability (to a certain extent).
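
For reference, a 2x2 layout can be grown to replica 3 by adding a third brick to each replica set; the hosts and brick paths below are placeholders, not from this report:

# add one brick per replica set to convert replica 2 to replica 3
gluster volume add-brick gv0 replica 3 server5:/gluster/brick1/gv0 server6:/gluster/brick1/gv0
# trigger a full heal so the new bricks are populated
gluster volume heal gv0 full

Alternatively, arbiter bricks (replica 3 arbiter 1) give the same quorum protection without storing full data copies.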

Comment 14 Sahina Bose 2017-08-28 10:55:41 UTC
I think we can close this based on Comment 11 and comment 9?

Comment 15 SATHEESARAN 2017-08-28 15:21:28 UTC
(In reply to Sahina Bose from comment #14)
> I think we can close this based on Comment 11 and comment 9?

Yes, that makes sense. 

I am closing this bug as CLOSED CURRENTRELEASE, as this issue was not reproducible with RHGS 3.2.0.