Bug 1439657

Summary: Arbiter brick becomes a source for data heal
Product: [Community] GlusterFS
Component: arbiter
Version: 3.8
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: high
Reporter: Denis Chaplygin <dchaplyg>
Assignee: Ravishankar N <ravishankar>
CC: bugs, ravishankar, sasundar
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Last Closed: 2017-11-07 10:42:53 UTC
Type: Bug
Bug Blocks: 1401969, 1411323, 1413845

Description Denis Chaplygin 2017-04-06 11:34:56 UTC
Description of problem: Given three hosts with one brick each, combined into a replica 3 volume with arbiter, the arbiter brick can occasionally become a source for data heal. This should never happen, since the arbiter brick stores only metadata and no file data.


How reproducible: intermittent; happens from time to time, with no reliable reproducer.


Steps to Reproduce:
1. Create a replica 3 volume with arbiter, keeping bricks on three different hosts.
2. Start updating some file frequently from a client.
3. Start rebooting the nodes in random order (breaking network connectivity works as well); several of the reboots should take down two nodes in varying order. A command sketch follows below.
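A minimal sketch of these steps in shell form. Hostnames and brick paths are taken from this report; the write loop, the mount point and the reboot cadence are illustrative assumptions, not the reporter's exact workload:

# Step 1: replica 3 volume with one arbiter brick across three hosts.
gluster volume create data replica 3 arbiter 1 \
    hc-lion:/rhgs/data hc-tiger:/rhgs/data hc-panther:/rhgs/data
gluster volume start data
mount -t glusterfs hc-lion:/data /mnt/data

# Step 2: keep rewriting one file from the client mount.
while true; do
    dd if=/dev/urandom of=/mnt/data/testfile bs=1M count=1 conv=notrunc
    sleep 1
done &

# Step 3: reboot the server nodes in random order from a control host;
# over several iterations two different nodes should be affected.
for host in hc-tiger hc-lion hc-panther hc-tiger; do
    ssh "$host" reboot
    sleep 120   # give the node time to come back and heal to start
done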

Actual results:
Some files will not be healed. 

[root@hc-lion ~]# gluster volume heal data full
Launching heal operation to perform full self heal on volume data has been successful 
Use heal info commands to check status
[root@hc-lion ~]# gluster volume heal data info  
Brick hc-lion:/rhgs/data
/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids 
Status: Connected
Number of entries: 1

Brick hc-tiger:/rhgs/data
/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids 
Status: Connected
Number of entries: 1

Brick hc-panther:/rhgs/data
/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids 
Status: Connected
Number of entries: 1
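As a triage aside: since the same entry is pending on all three bricks, it may be worth confirming whether AFR actually classifies the file as split-brain; the heal CLI in this release line supports that query:

# Ask AFR whether the stuck entry is flagged as split-brain.
gluster volume heal data info split-brain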

[root@hc-lion dom_md]# getfattr -d -m . -e hex ids
# file: ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data-client-1=0x0000000e0000000000000000
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x080000000000000058e6028e000829f0
trusted.gfid=0x405ab9b11adb4ced927294ef36272b44
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000
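For readers unfamiliar with the AFR changelog format: each trusted.afr.<volume>-client-<N> value is 12 bytes, read as three big-endian 32-bit counters of pending data, metadata and entry operations that this brick holds against brick N (client-0 = hc-lion, client-1 = hc-tiger, client-2 = the arbiter). A small hypothetical bash helper to split the hex dump:

# Split a trusted.afr.* value into its three big-endian 32-bit counters:
# pending data / metadata / entry operations.
decode_afr() {
    local v=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
}

decode_afr 0x0000000e0000000000000000   # -> data=14 metadata=0 entry=0

So the dump above says hc-lion holds 14 pending data operations against hc-tiger (data-client-1) and none against the arbiter.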


Expected results:
All files should be healed.


Additional info:

I do not have a reliable way to reproduce this bug, but I hope the logs from my nodes will be helpful. The bug was observed during the first half of the day on the 6th of April.

Comment 1 Ravishankar N 2017-04-06 11:56:27 UTC
Notes to self while Denis uploads the logs:


[root@hc-lion ~]# gluster v info data

Volume Name: data
Type: Replicate
Volume ID: 7070474d-14be-4cf3-96fa-3efb72a5458c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: hc-lion:/rhgs/data
Brick2: hc-tiger:/rhgs/data
Brick3: hc-panther:/rhgs/data (arbiter)
Options Reconfigured:
cluster.self-heal-daemon: enable
user.cifs: off
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 512MB
network.ping-timeout: 30
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

Xattr info ('g' in the transcripts below is apparently a shell alias for 'getfattr -d -m . -e hex'):
1st Node:

[root@hc-lion ~]# g /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data-client-1=0x0000000e0000000000000000
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x080000000000000058e6028e000829f0
trusted.gfid=0x405ab9b11adb4ced927294ef36272b44
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@hc-lion ~]# stat /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
  File: ‘/rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids’
  Size: 1048576   	Blocks: 2048       IO Block: 4096   regular file
Device: fd07h/64775d	Inode: 67108931    Links: 2
Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
Context: system_u:object_r:unlabeled_t:s0
Access: 2017-04-06 12:30:08.330377133 +0300
Modify: 2017-04-06 12:05:06.547723917 +0300
Change: 2017-04-06 12:05:08.570709032 +0300

2nd Node:
[root@hc-tiger ~]# g /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data-client-0=0x000000050000000000000000
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x060000000000000058e5f9100009aaa1
trusted.gfid=0x405ab9b11adb4ced927294ef36272b44
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@hc-tiger ~]# stat /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
  File: ‘/rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids’
  Size: 1048576   	Blocks: 2048       IO Block: 4096   regular file
Device: fd09h/64777d	Inode: 67108931    Links: 2
Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
Context: system_u:object_r:unlabeled_t:s0
Access: 2017-04-06 14:03:20.028466007 +0300
Modify: 2017-04-06 11:59:28.291178965 +0300
Change: 2017-04-06 11:59:28.291178965 +0300
 Birth: -

3rd Node:
[root@hc-panther ~]# g /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
getfattr: Removing leading '/' from absolute path names
# file: rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.data-client-0=0x000000000000000000000000
trusted.afr.data-client-1=0x0000000e0000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x040000000000000058e5f8f7000d5127
trusted.gfid=0x405ab9b11adb4ced927294ef36272b44
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000000100000000000000000000000000000000008000000000000000000

[root@hc-panther ~]# stat /rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids
  File: ‘/rhgs/data/555425cf-e3e4-4665-ae82-6152896d8190/dom_md/ids’
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: fd07h/64775d	Inode: 67108931    Links: 2
Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
Context: system_u:object_r:unlabeled_t:s0
Access: 2017-04-06 14:03:20.006835579 +0300
Modify: 2017-04-06 11:15:00.430926000 +0300
Change: 2017-04-06 12:05:08.572428152 +0300
 Birth: -
[root@hc-panther ~]# 


md5sums on the 3 nodes are, respectively, c6e665a63b15c4c2c6d66beff671834e, f84fd35dd9f09215e7710b7bed347a8a and d41d8cd98f00b204e9800998ecf8427e. The two data bricks disagree, and the third value is the md5sum of an empty file, consistent with the zero-byte file seen on the arbiter brick, which by design carries no data.
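A hedged side-by-side reading of the three changelogs above, taking the first 32-bit counter of each trusted.afr.* value as the pending data-operation count the row's brick holds against the column's brick:

              vs client-0   vs client-1   vs client-2   dirty
              (hc-lion)     (hc-tiger)    (arbiter)
hc-lion       (self)        14            0             0
hc-tiger      5             (self)        0             1
hc-panther    0             14            (self)        0

hc-lion and the arbiter both blame hc-tiger, hc-tiger blames hc-lion, and nobody blames the arbiter. If source selection then admits the arbiter into the set of healthy data sources alongside hc-lion, the empty arbiter copy can end up healing the others, which would be consistent with the summary of this bug.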

Comment 2 Niels de Vos 2017-11-07 10:42:53 UTC
This bug is being closed because the 3.8 version is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue on a more current release, please open a new bug against a version that still receives bugfixes.