Bug 1638883

Summary: gluster heal problem
Product: [Community] GlusterFS
Reporter: jszep
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED WORKSFORME
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Docs Contact:
Priority: high
Version: mainline
CC: bugs, jszep, sankarshan, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-14 04:13:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description jszep 2018-10-12 17:22:36 UTC
Description of problem:

gluster heal procedure does not complete

Version-Release number of selected component (if applicable):

glusterfs-3.8.15-2.el7.x86_64

How reproducible:


Steps to Reproduce:
1. reboot system, heal commences
2.
3.

Actual results:

heal does not complete, one system shows a brick(?) not syncing

Expected results:

heal completes; gluster volume heal <volname> info should show no entries pending heal

Additional info:
We have a 3 node gluster file system cluster which we use as back-end
storage for our VM infrastructure.  Recently, we got an operating system
update so I was going to successively reboot the three file servers.

After rebooting the first system, the normal heal process commenced but
never really completed.  At this point, we see:

[root@cs-fs2 ~]# gluster volume heal vm info
Brick cs-fs1:/mnt/data/vm/brick
Status: Connected
Number of entries: 0

Brick cs-fs2:/mnt/data/vm/brick
/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup 
Status: Connected
Number of entries: 1

Brick cs-fs3:/mnt/data/vm/brick
Status: Connected
Number of entries: 0


And also:


[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# pwd
/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup

[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# ls -l
total 4
-rw-r--r--. 1 36 36 279 Oct  2 13:21 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0

[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# cat dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0 
function = create_image_rollback
moduleName = sd
params = /rhev/data-center/mnt/glusterSD/cs-fs1.bu.edu:_vm/1f48f887-dd49-4363-9e5c-603c007a9baf/images/34fc74a4-2665-44e8-b66d-455da248e209
name = create image rollback: 34fc74a4-2665-44e8-b66d-455da248e209
object = StorageDomain


We are trying to understand what this thing is and why it is still hanging around.
Is this an issue with RHEV or is it a glusterfs problem?

We have not had this sort of problem previously.


[root@cs-fs2 ~]# rpm -q glusterfs
glusterfs-3.8.15-2.el7.x86_64

On the management node:

[root@cs-rhvm ~]# rpm -q rhevm
rhevm-4.1.11.2-0.1.el7.noarch

Comment 2 Ravishankar N 2018-10-15 10:02:08 UTC
1. Can you provide the getfattr output of dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup from all 3 bricks?

getfattr -d -m. -e hex /path-to-brick-mount/path-to-dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup

2. Are the files inside dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup identical in all bricks of the replica?
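
For reference, a minimal sketch of one way to collect both pieces of information on each node (cs-fs1, cs-fs2, cs-fs3), assuming the brick-side path reported above; the md5sum loop is only a suggested way to compare file contents across bricks:

# brick-side path taken from the report above
DIR=/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup

# 1. extended attributes of the directory itself
getfattr -d -m. -e hex "$DIR"

# 2. listing and checksums of its contents, for comparison across bricks
ls -l "$DIR"
for f in "$DIR"/*; do md5sum "$f"; done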

Comment 3 Ravishankar N 2018-10-17 08:18:46 UTC
jszep, I am changing the 'Product' to glusterfs. I'm assuming you are using the upstream gluster version. If you are a RHGS customer, please reach out to the Red Hat support team to assist you.

Comment 4 jszep 2018-10-17 17:53:08 UTC
> 1. Can you provide the getfattr output of dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup from all 3 bricks?
>
> getfattr -d -m. -e hex /path-to-brick-mount/path-to-dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup


We have 3 gluster servers in the cluster: cs-fs1, cs-fs2, and cs-fs3.
The directory dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists
on cs-fs2. (Note: cs-fs1 is the system whose reboot started all this.)

[root@cs-fs2 tasks]# getfattr -d -m. -e hex /mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
getfattr: Removing leading '/' from absolute path names
# file: mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.vm-client-0=0x000000000000000100000001
trusted.gfid=0x1a550e7627b3448cad5818e13fbb8671
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

In addition, there is a directory:
/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33/
(no .backup) that DOES exist on all three servers with identical contents:

[root@cs-fs2 tasks]# ls /mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33/
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.job.0      dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.result
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0  dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.task


> 2. Are the files inside dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup identical in all bricks of the replica?

No. dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists on cs-fs2. Its contents are:

[root@cs-fs2 tasks]# ls dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0

Comment 5 jszep 2018-10-17 17:55:31 UTC
The contents of the file dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0 are:

cs-fs2: cat dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
function = create_image_rollback
moduleName = sd
params = /rhev/data-center/mnt/glusterSD/cs-fs1.bu.edu:_vm/1f48f887-dd49-4363-9e5c-603c007a9baf/images/34fc74a4-2665-44e8-b66d-455da248e209
name = create image rollback: 34fc74a4-2665-44e8-b66d-455da248e209
object = StorageDomain

Comment 6 Shyamsundar 2018-10-23 14:54:20 UTC
Release 3.12 has been EOLed and this bug was still in the NEW state, so the version is being moved to mainline for triage and appropriate action.

Comment 7 Ravishankar N 2018-10-26 04:26:12 UTC
Hi jszep,

> No. dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists on cs-fs2.

1. Is the setup still in the same state now? Can you also provide the getfattr output of the parent directory (/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks) on all 3 bricks?

If you explicitly stat the path (/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup) from the fuse mount point, the directory should get created on the other 2 bricks as well (a sketch of this is below).

2. Could you provide the gluster volume info output? 

If you have some sort of a reproducer, that would help in identifying the issue.
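
A minimal sketch of that explicit lookup, assuming the volume is FUSE-mounted at the RHEV path shown in the recover file earlier in this report (adjust the mount point if yours differs):

# assumed FUSE mount point of the 'vm' volume, taken from the recover file above
MNT=/rhev/data-center/mnt/glusterSD/cs-fs1.bu.edu:_vm

# an explicit lookup should recreate the missing directory on the other two bricks
stat "$MNT/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup"

# then re-check pending heals
gluster volume heal vm info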

Comment 8 jszep 2018-11-13 15:45:07 UTC
Huh - I answered this request over a week ago but it did not show up here.  Anyway,  the problem is solved.  I have updated and rebooted the other file servers and everything is running as expected.

Thank you for your assistance. You can close this case.

Comment 9 Ravishankar N 2018-11-14 04:13:16 UTC
Closing the bug based on comment #8.