Description of problem:
gluster heal procedure does not complete

Version-Release number of selected component (if applicable):
glusterfs-3.8.15-2.el7.x86_64

How reproducible:

Steps to Reproduce:
1. reboot system, heal commences
2.
3.

Actual results:
heal does not complete, one system shows a brick(?) not syncing

Expected results:
heal completes, gluster volume heal < > info should show no entries pending heal

Additional info:
We have a 3-node gluster file system cluster which we use as back-end storage for our VM infrastructure. Recently, we got an operating system update, so I was going to reboot the three file servers in succession. After rebooting the first system, the normal heal process commenced but never really completed. At this point, we see:

[root@cs-fs2 ~]# gluster volume heal vm info
Brick cs-fs1:/mnt/data/vm/brick
Status: Connected
Number of entries: 0

Brick cs-fs2:/mnt/data/vm/brick
/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
Status: Connected
Number of entries: 1

Brick cs-fs3:/mnt/data/vm/brick
Status: Connected
Number of entries: 0

And also:

[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# pwd
/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# ls -l
total 4
-rw-r--r--. 1 36 36 279 Oct 2 13:21 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
[root@cs-fs2 dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup]# cat dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
function = create_image_rollback
moduleName = sd
params = /rhev/data-center/mnt/glusterSD/cs-fs1.bu.edu:_vm/1f48f887-dd49-4363-9e5c-603c007a9baf/images/34fc74a4-2665-44e8-b66d-455da248e209
name = create image rollback: 34fc74a4-2665-44e8-b66d-455da248e209
object = StorageDomain

We are trying to understand what this thing is and why it is still hanging around. Is this an issue with RHEV or is it a glusterfs problem?
We have not had this sort of problem previously.

[root@cs-fs2 ~]# rpm -q glusterfs
glusterfs-3.8.15-2.el7.x86_64

On the management node:

[root@cs-rhvm ~]# rpm -q rhevm
rhevm-4.1.11.2-0.1.el7.noarch
1. Can you provide the getfattr output of dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup from all 3 bricks?

getfattr -d -m. -e hex /path-to-brick-mount/path-to-dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup

2. Are the files inside dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup identical in all bricks of the replica?
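Point 2 can be checked mechanically by checksumming the directory's files on each brick and comparing the results. A minimal sketch (the helper name dir_checksums is ours, not part of any gluster tooling):

```python
# Checksum every regular file under a brick-side directory so the
# per-brick copies of a replica can be compared for equality.
import hashlib
import os

def dir_checksums(path: str) -> dict:
    """Map each regular file under path (by relative name) to its
    sha256 hex digest."""
    sums = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, path)
            with open(full, "rb") as f:
                sums[rel] = hashlib.sha256(f.read()).hexdigest()
    return sums

# Run this on each server against the brick path, e.g.
# dir_checksums("/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf"
#               "/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup")
# and compare the returned dicts across cs-fs1, cs-fs2, and cs-fs3.
```

Identical dicts from all three bricks mean the entry contents match; any differing or missing key pinpoints the file that still needs heal.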
jszep, I am changing the 'Product' to glusterfs. I'm assuming you are using the upstream gluster version. If you are a RHGS customer, please reach out to the Red Hat support team to assist you.
> 1. Can you provide the getfattr output of dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup from all 3 bricks?
> getfattr -d -m. -e hex /path-to-brick-mount/path-to-dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup

We have 3 gluster servers in the cluster: cs-fs1, cs-fs2, and cs-fs3. The directory dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists on cs-fs2. (Note: cs-fs1 was the system whose reboot started all this.)

[root@cs-fs2 tasks]# getfattr -d -m. -e hex /mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
getfattr: Removing leading '/' from absolute path names
# file: mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.vm-client-0=0x000000000000000100000001
trusted.gfid=0x1a550e7627b3448cad5818e13fbb8671
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.dht.mds=0x00000000

In addition, there is a directory /mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33/ (no .backup) that DOES exist on all three servers with identical contents:

[root@cs-fs2 tasks]# ls /mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33/
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.job.0
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.result
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.task

> 2. Are the files inside dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup identical in all bricks of the replica?

No. dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists on cs-fs2. Its contents are:

[root@cs-fs2 tasks]# ls dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup
dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
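For reference, the trusted.afr.vm-client-0 value above encodes three big-endian 32-bit pending-operation counters: data, metadata, and entry, in that order. A minimal decoding sketch (the helper name decode_afr_xattr is ours, not part of any gluster tooling):

```python
# Decode a gluster AFR pending-changelog xattr value such as
# trusted.afr.vm-client-0=0x000000000000000100000001.
# The 12-byte value holds three big-endian 32-bit counters:
# pending data, metadata, and entry operations, in that order.

def decode_afr_xattr(hex_value: str) -> dict:
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    data, metadata, entry = (
        int.from_bytes(raw[i:i + 4], "big") for i in range(0, 12, 4)
    )
    return {"data": data, "metadata": metadata, "entry": entry}

# The value seen on cs-fs2's brick: it records metadata and entry
# operations still pending against vm-client-0 (the first brick, cs-fs1),
# which is consistent with the .backup directory missing there.
print(decode_afr_xattr("0x000000000000000100000001"))
```

A decoded value of {"data": 0, "metadata": 1, "entry": 1} on a directory means entry and metadata heal are outstanding toward that brick, matching the single pending entry in heal info.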
The contents of the file dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0 are:

cs-fs2: cat dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.recover.0
function = create_image_rollback
moduleName = sd
params = /rhev/data-center/mnt/glusterSD/cs-fs1.bu.edu:_vm/1f48f887-dd49-4363-9e5c-603c007a9baf/images/34fc74a4-2665-44e8-b66d-455da248e209
name = create image rollback: 34fc74a4-2665-44e8-b66d-455da248e209
object = StorageDomain
Release 3.12 has been EOL'd and this bug was still found to be in the NEW state; hence, moving the version to mainline so it can be triaged and appropriate action taken.
Hi jszep,

> No. dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup only exists on cs-fs2.

1. Is the setup still in the same state now? Can you also provide the getfattr output of the parent directory (/mnt/data/vm/brick/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks) on all 3 bricks? If you explicitly do a stat from the fuse mount point on the path (/1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup), the directory should get created on the other 2 bricks as well.

2. Could you provide the gluster volume info output?

If you have some sort of a reproducer, that would help in identifying the issue.
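The lookup suggested in point 1 can be triggered from any client that has the volume fuse-mounted; a plain stat on the path forces a named lookup, after which AFR can create the missing directory entry on the other bricks. A sketch under the assumption that the vm volume is mounted at a hypothetical /mnt/vm (the helper name trigger_lookup is ours):

```python
# Force a named lookup on a path through a fuse mount; on a gluster
# client this is what lets AFR notice and repair a missing entry.
import os

def trigger_lookup(mount: str, rel_path: str) -> os.stat_result:
    """Stat rel_path under the given mount point, returning the result."""
    return os.stat(os.path.join(mount, rel_path))

# On a real client this would be something like (mount point hypothetical):
#   trigger_lookup(
#       "/mnt/vm",
#       "1f48f887-dd49-4363-9e5c-603c007a9baf/master/tasks/"
#       "dc8b1e1e-f7d3-4199-aa84-2e809cc78a33.backup")
```

The same effect can be had with a shell `stat` on the mounted path; the point is only that the access must go through the fuse mount, not the brick directory.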
Huh - I answered this request over a week ago but it did not show up here. Anyway, the problem is solved. I have updated and rebooted the other file servers and everything is running as expected. Thank you for your assistance. You can close this case.
Closing the bug based on comment#8.