Bug 1640148
| Summary: | Healing is not completed on Distributed-Replicated | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijay Avuthu <vavuthu> |
| Component: | arbiter | Assignee: | Karthik U S <ksubrahm> |
| Status: | CLOSED ERRATA | QA Contact: | Prasanth <pprakash> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.4 | CC: | asakthiv, jahernan, ksubrahm, nravinas, pprakash, puebele, ravishankar, rhs-bugs, rkothiya, sheggodu, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Automation, ZStream |
| Target Release: | RHGS 3.5.z Batch Update 4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-6.0-51 | Doc Type: | Bug Fix |
| Doc Text: | Previously, during entry heal of a renamed directory, the directory could be created in its new location before being deleted from the old location, which could leave two directories with the same gfid and a few entries stuck in a pending heal state. With this update, a new volume option, cluster.use-anonymous-inode, is introduced; it is ON by default for newly created volumes with an op-version of GD_OP_VERSION_9_0 or higher. With this option enabled, if the old location is healed first and the entry is not present on the source brick, it is renamed into a hidden directory inside the sink brick, so that when heal is triggered on the new location the self-heal daemon renames it from this hidden directory to the new location. If heal of the new location is triggered first and it detects that the directory already exists on the brick, healing of that directory is skipped until the entry appears in the hidden directory. This option is OFF for older volumes created with an op-version lower than GD_OP_VERSION_9_0. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-04-29 07:20:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1786553 | | |
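For reference, the cluster.use-anonymous-inode option described in the Doc Text above is managed through the regular volume-option CLI. A minimal sketch is shown below; the volume name testvol is a placeholder, not taken from this bug.

```bash
# Check the current value of the option ("testvol" is a placeholder volume name).
gluster volume get testvol cluster.use-anonymous-inode

# Older volumes (created with an op-version lower than GD_OP_VERSION_9_0) have it
# OFF by default; it can be enabled explicitly once the cluster op-version allows it:
gluster volume set testvol cluster.use-anonymous-inode on
```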
Description  Vijay Avuthu  2018-10-17 12:33:34 UTC
Healing is not able to complete because the gfid handle (symlink) for a directory is missing. Steps to reproduce in a single-node setup:

0. "pip install python-docx" if you don't have it.
1. Create and fuse mount a 2x (2+1) volume on /mnt/fuse_mnt.
2. Fill data on the mount:
   python file_dir_ops.py create_deep_dirs_with_files --dir-length 2 --dir-depth 2 --max-num-of-dirs 2 --num-of-files 20 /mnt/fuse_mnt/files
   python file_dir_ops.py create_files -f 20 /mnt/fuse_mnt/files
3. Kill the 1st data brick of each replica.
4. Rename files using: python file_dir_ops.py mv /mnt/fuse_mnt/files
5. gluster volume start volname force
6. gluster volume heal volname
7. You will still see a directory and the files under it not getting healed. If you look at the bricks you killed in step 3, they won't have the symlink for the directory.

-------------------------------------------------------------------------------

Pranith initially had a dirty fix which solves the problem, but he found some more races between the janitor thread unlinking the gfid handle and posix_lookup and posix_mkdir. He and Xavi are discussing solutions to handle this cleanly for both AFR and EC volumes. See BZ 1636902, comments 14 and 15, for some proposed solutions.

```diff
diff --git a/xlators/storage/posix/src/posix.c b/xlators/storage/posix/src/posix.c
index 7bfe780bb..82f44a012 100644
--- a/xlators/storage/posix/src/posix.c
+++ b/xlators/storage/posix/src/posix.c
@@ -1732,7 +1732,11 @@ posix_mkdir (call_frame_t *frame, xlator_t *this,
                          * posix_gfid_set to set the symlink to the
                          * new dir.*/
                         posix_handle_unset (this, stbuf.ia_gfid, NULL);
+                } else if (op_ret < 0) {
+                        MAKE_HANDLE_GFID_PATH (gfid_path, this, uuid_req, NULL);
+                        sys_unlink(gfid_path);
                 }
+
         } else if (!uuid_req && frame->root->pid != GF_SERVER_PID_TRASH) {
                 op_ret = -1;
                 op_errno = EPERM;
```

-------------------------------------------------------------------------------

@Pranith, feel free to assign the bug back to me if you won't be working on this. Leaving a need-info on you for this.
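As background for the missing gfid handle described in the steps above: on a brick, the handle for a directory is a symlink under the brick's .glusterfs directory. The following is a minimal sketch of how one could check for it; the brick path, directory path, and gfid value are hypothetical examples, not taken from this bug.

```bash
# Hypothetical brick path and a directory created by the repro above.
BRICK=/bricks/brick1
DIR="$BRICK/files/user1/dir1"

# Read the directory's gfid from its backend xattr (run directly on the brick,
# not through the fuse mount):
getfattr -n trusted.gfid -e hex "$DIR"
# e.g. trusted.gfid=0xd0b9b6cc0f044bb99ddbd34c1fdcb2cd

# For directories, the gfid handle is a symlink located at
# <brick>/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid in uuid form>
GFID=d0b9b6cc-0f04-4bb9-9ddb-d34c1fdcb2cd
ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# On the bricks killed in step 3 this symlink is what ends up missing,
# which is why the directory and the files under it never finish healing.
```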
Hit this issue manually too while upgrading from 3.4.0-async to 3.4.1; for more details refer to BZ#1643919 and https://bugzilla.redhat.com/show_bug.cgi?id=1643919#c10

*** Bug 1658870 has been marked as a duplicate of this bug. ***

*** Bug 1712225 has been marked as a duplicate of this bug. ***

Work on this bug will start once https://review.gluster.org/c/glusterfs/+/23937 and a subsequent patch where shd will remove stale entries are merged, so moving to the next BU for now.

The design for https://review.gluster.org/c/glusterfs/+/23937 changed, and the new patch we need for this work is https://review.gluster.org/c/glusterfs/+/24284

(In reply to Ravishankar N from comment #2)
> Healing is not able to complete because the gfid handle (symlink) for a
> directory is missing. Steps to reproduce in a single node setup:

Ravi, could you attach the file_dir_ops.py script to the bz?
Pranith

Upasana gave the link on chat: https://github.com/gluster/glusto-tests/blob/master/glustolibs-io/shared_files/scripts/file_dir_ops.py

Tested this by running the case in comment 2 ten times and printing the number of pending heals. Before the fix it was failing once in every two runs:

```
[root@localhost-live ~]# bash testcase.sh
pending-heals: 26
pending-heals: 0
pending-heals: 4
pending-heals: 0
pending-heals: 52
pending-heals: 0
pending-heals: 18
pending-heals: 0
pending-heals: 108
pending-heals: 12
```

With the fix it doesn't fail:

```
[root@localhost-live ~]# bash testcase.sh
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
pending-heals: 0
```

*** Bug 1901154 has been marked as a duplicate of this bug. ***

Hi Karthik, could you please set the "Doc Type" field and fill out the "Doc Text" template with the relevant information. Thanks, Amrita

Thanks Karthik, LGTM.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1462
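The testcase.sh used for the pending-heal counts in the verification comment above was not attached to this bug. A minimal sketch of such a loop is given below, assuming a volume named testvol and the comment 2 reproduction wrapped in a hypothetical repro.sh; both names are placeholders.

```bash
#!/bin/bash
# Sketch of a pending-heal counting loop: run the repro, trigger heal,
# wait for the self-heal daemon, then print how many entries are still pending.
# "testvol" and "repro.sh" are placeholders, not from the bug report.
VOL=testvol

for i in $(seq 1 10); do
    bash repro.sh "$VOL"          # steps from comment 2
    gluster volume heal "$VOL"    # trigger index heal
    sleep 120                     # give shd time to finish
    # Sum the "Number of entries:" counts reported per brick by heal info.
    count=$(gluster volume heal "$VOL" info \
            | awk '/^Number of entries:/ {sum += $4} END {print sum + 0}')
    echo "pending-heals: $count"
done
```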