Bug 1727256 - Directory pending heal in heal info output
Status: CLOSED NEXTRELEASE
Product: GlusterFS
Classification: Community
Component: arbiter
Version: mainline
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
Depends On: 1662206
 
Reported: 2019-07-05 08:54 UTC by Mohammed Rafi KC
Modified: 2019-08-05 06:49 UTC
CC List: 8 users

Clone Of: 1662206
Last Closed: 2019-08-05 06:49:19 UTC


Links:
Gluster.org Gerrit 23005 (Merged): graph/shd: attach volfile even if ctx->active is NULL (last updated 2019-08-05 06:49:17 UTC)

Description Mohammed Rafi KC 2019-07-05 08:54:18 UTC
+++ This bug was initially created as a clone of Bug #1662206 +++

Description of problem:

While verifying bz#1645480, we hit an issue where a directory is left pending heal.

Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-34.el7rhgs.x86_64
glusterfs-server-3.12.2-34.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-34.el7rhgs.x86_64
glusterfs-fuse-3.12.2-34.el7rhgs.x86_64
glusterfs-events-3.12.2-34.el7rhgs.x86_64


How reproducible:

1/1

Steps to Reproduce:
1. Create a 2 x (2 + 1) arbiter volume using brick{1..6}.
2. Start the volume and fuse mount it.
3. Create a few files (around 200) inside a few directories; also create hardlinks and softlinks.
4. Bring brick 2 and brick 5 down.
5. Run continuous metadata operations such as rename, chgrp and chown.
6. Occasionally bring brick 1 and brick 4 down and bring the bricks taken down in step 4 back up.
7. A few "transport endpoint is not connected" errors are seen, since at times the good brick is down; this is expected.
8. Bring brick 2 and brick 5 down again, and bring brick 1 and brick 4 back up.
9. All of the above steps are performed while the I/O from step 5 is ongoing.

Actual results:
Initial heal info output:
# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.67:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7> 
Status: Connected
Number of entries: 1

Brick 10.70.46.169:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7> 
Status: Connected
Number of entries: 1

Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0

After a few minutes, the heal info output started showing the directory /level00 as pending heal:

~]# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.67:/bricks/brick1/testing
/level00 
Status: Connected
Number of entries: 1

Brick 10.70.46.169:/bricks/brick1/testing
/level00 
Status: Connected
Number of entries: 1

Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0


Expected results:
No files or directories should be pending heal.

Additional info:
Volume Info
~]# gluster v info test
 
Volume Name: test
Type: Distributed-Replicate
Volume ID: 11ee3f35-f99d-49ce-95c7-bbee829bc6f1
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.46.55:/bricks/brick1/testing
Brick2: 10.70.47.184:/bricks/brick1/testing
Brick3: 10.70.46.193:/bricks/brick1/testing (arbiter)
Brick4: 10.70.47.67:/bricks/brick1/testing
Brick5: 10.70.46.169:/bricks/brick1/testing
Brick6: 10.70.47.122:/bricks/brick1/testing (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off


Also, the getfattr output for the directory level00 on the second sub-volume, where client-3 blames client-4 and vice versa (a brief decoding sketch follows these dumps):

-55 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


184 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


193 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


67 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000192
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff


169 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x0000000000000000000001a2
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff


-122 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff
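
For illustration (not part of the original report): each trusted.afr.<volname>-client-N value above is three 32-bit big-endian counters for pending data, metadata and entry operations; e.g. 0x000000000000000000000192 on brick 10.70.47.67 decodes to 0 data, 0 metadata and 0x192 = 402 pending entry operations blamed on test-client-4. A small decoding sketch in C:

#include <stdint.h>
#include <stdio.h>

/* Decode a 12-byte trusted.afr.<vol>-client-N changelog value into its
 * pending data / metadata / entry counters (each 32-bit, big-endian). */
static void decode_afr_changelog(const uint8_t val[12])
{
    uint32_t c[3];
    for (int i = 0; i < 3; i++)
        c[i] = ((uint32_t)val[4 * i] << 24) | ((uint32_t)val[4 * i + 1] << 16) |
               ((uint32_t)val[4 * i + 2] << 8) | (uint32_t)val[4 * i + 3];
    printf("data=%u metadata=%u entry=%u\n", c[0], c[1], c[2]);
}

int main(void)
{
    /* trusted.afr.test-client-4 as seen on 10.70.47.67 above:
     * 0x000000000000000000000192 -> 402 pending entry operations. */
    const uint8_t blames_client4[12] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0x92 };
    decode_afr_changelog(blames_client4);
    return 0;
}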



--- Additional comment from Anees Patel on 2018-12-27 04:07:00 UTC ---

sosreports are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/anees/bz1662206/

--- Additional comment from Ravishankar N on 2018-12-27 04:53:01 UTC ---

Karthik, could you take a look at this please? From the state of afr xattrs, a conservative merge should have occurred during self-heal of the directory.

--- Additional comment from Karthik U S on 2018-12-28 06:59:04 UTC ---

Hi Anees,

Can you please provide the exact operations that you ran in steps 3 and 5 of the reproducer, along with the client logs?
Did you try running a volume heal or a lookup on the directory level00 from the mount after this, and if so, was there any change in the heal info output?

-Karthik

--- Additional comment from Karthik U S on 2018-12-28 14:34:39 UTC ---

Current state:
When Anees restarted glusterd on all the nodes today, the directory got healed by itself.
From the shd log:
[2018-12-28 07:46:05.439143] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-test-replicate-1: performing entry selfheal on 725ab1d7-bed9-4a25-b1a0-95e2c15605b7
[2018-12-28 07:46:05.439205] I [MSGID: 114047] [client-handshake.c:1161:client_setvolume_cbk] 0-testvol_distributed-replicated-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2018-12-28 07:46:05.439397] I [MSGID: 108005] [afr-common.c:5037:__afr_handle_child_up_event] 0-testvol_distributed-replicated-replicate-1: Subvolume 'testvol_distributed-replicated-client-4' came back up; going online.
[2018-12-28 07:46:05.440653] I [MSGID: 114035] [client-handshake.c:121:client_set_lk_version_cbk] 0-testvol_distributed-replicated-client-4: Server lk version = 1
[2018-12-28 07:46:05.446328] I [MSGID: 114057] [client-handshake.c:1397:select_server_supported_programs] 0-testvol_distributed-replicated-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-12-28 07:46:05.449398] I [MSGID: 114046] [client-handshake.c:1150:client_setvolume_cbk] 0-testvol_distributed-replicated-client-5: Connected to testvol_distributed-replicated-client-5, attached to remote volume '/bricks/brick3/testvol_distributed-replicated_brick5'.
[2018-12-28 07:46:05.449474] I [MSGID: 114047] [client-handshake.c:1161:client_setvolume_cbk] 0-testvol_distributed-replicated-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2018-12-28 07:46:05.449720] I [MSGID: 108002] [afr-common.c:5334:afr_notify] 0-testvol_distributed-replicated-replicate-1: Client-quorum is met
[2018-12-28 07:46:05.450589] I [MSGID: 114035] [client-handshake.c:121:client_set_lk_version_cbk] 0-testvol_distributed-replicated-client-5: Server lk version = 1
[2018-12-28 07:46:05.472093] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-arbitervol-replicate-1: performing entry selfheal on e82cba68-b0ec-4f0a-a6c3-113df7afcccf
[2018-12-28 07:46:05.505071] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-test-replicate-1: Completed entry selfheal on 725ab1d7-bed9-4a25-b1a0-95e2c15605b7. sources= sinks=0 1 

[root@dhcp47-67 ~]# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.67:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.169:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0

@Anees, can you try to reproduce this bug, please? It looks like some connection issue between the bricks and the SHD.

--- Additional comment from Anees Patel on 2019-07-02 05:54:22 UTC ---

Hi Karthik,

Executed all the steps provided in comment #0: did I/O with deep directories and intensive metadata operations, and also did an add-brick and rebalance on the latest build (glusterfs-fuse-6.0-7.el7rhgs.x86_64).
# gluster v info arb1
 
Volume Name: arb1
Type: Distributed-Replicate
Volume ID: c772ca13-2767-40dc-b6d8-d76e4dd644e0
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/newt
Brick2: 10.70.46.216:/bricks/brick1/newt
Brick3: 10.70.46.132:/bricks/brick1/newt (arbiter)
Brick4: 10.70.35.50:/bricks/brick2/newt
Brick5: 10.70.46.216:/bricks/brick2/newt
Brick6: 10.70.46.132:/bricks/brick2/newt (arbiter)
Brick7: 10.70.35.50:/bricks/brick3/newt
Brick8: 10.70.46.216:/bricks/brick3/newt
Brick9: 10.70.46.132:/bricks/brick3/newt (arbiter)
Brick10: 10.70.35.50:/bricks/brick5/newt
Brick11: 10.70.46.132:/bricks/brick5/newt
Brick12: 10.70.46.216:/bricks/brick5/newt (arbiter)
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.shd-max-threads: 30
==========================================================================

Rebalance was completed 

# gluster v rebal arb1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                            10.70.46.216            43281       125.5MB        366134             0          3278            completed        1:25:22
                            10.70.46.132            43160       131.8MB        366275             0          3213            completed        1:25:22
                               localhost            43209       109.3MB        365106             0          3135            completed        1:25:22
volume rebalance: arb1: success
==========================================================================

No Heals were pending.

--- Additional comment from Karthik U S on 2019-07-02 06:01:19 UTC ---

Anees, thanks for trying to reproduce this. Since we do not see the issue on the latest build, can we close this?

--- Additional comment from Anees Patel on 2019-07-02 06:29:53 UTC ---

Hi Karthik,

Many fixes have gone in between the build on which this issue was raised (3.12.2-34.el7rhgs.x86_64) and the latest build.
One such fix went into the latest build for BZ#1715438; I was unable to hit this issue while validating that bug.
Also, during all of the 3.5 testing to date, I haven't seen anything like this.
We are good to close this.
I will re-open it if the issue is seen in the future.

Thanks,

--- Additional comment from Karthik U S on 2019-07-02 07:07:54 UTC ---

The fix for BZ#1715438 handles the metadata-pending cases, whereas the issue we saw here was entry-pending heals that were not getting healed.
As mentioned in comment #5, the issue here looked like some connection problem with the bricks.

If you hit the issue again, please feel free to re-open this bug. Closing as WORKSFORME since it is not reproducible.

--- Additional comment from Bala Konda Reddy M on 2019-07-02 10:00:16 UTC ---

Hey Karthik,

I hit this issue: there is one file/directory heal pending while doing an in-service upgrade
from RHGS 3.4.4 (glusterfs-3.12.2-47.2.el7rhgs.x86_64) to glusterfs-6.0-7.el7rhgs.x86_64.

Scenario I tried:
1. On a three-node cluster, created 1000 volumes and started them (1x3 replicate).
2. Only one volume is mounted, with I/O running (Linux untar).
3. Stopped glusterd on node n1, then pkill glusterfsd and pkill gluster.
4. Performed yum update and rebooted.
5. Started glusterd on node n1.
6. Checked 'gluster volume heal testvol_-999 info' on n1.

#gluster vol heal testvol_-999 info
Brick rhs-gp-srv13.lab.eng.blr.redhat.com:/bricks/brick9/testvol_-999
<gfid:69c6bf1e-ab9e-4c74-88f2-241199a8ae56> 
Status: Connected
Number of entries: 1

Brick rhs-gp-srv14.lab.eng.blr.redhat.com:/bricks/brick9/testvol_-999
Status: Connected
Number of entries: 0

Brick rhs-gp-srv16.lab.eng.blr.redhat.com:/bricks/brick9/testvol_-999
Status: Connected
Number of entries: 0

I tried triggering heal again, but it is still in a pending state. I waited for more than 10 minutes for the file to get healed.

Thanks,
Bala

--- Additional comment from Karthik U S on 2019-07-03 06:20:35 UTC ---

Initial analysis:

Only one file needs heal in the volume "testvol_-999", and the dirty marking for that file is present only on the upgraded node.
From the discussion with Bala, there were no pending heals on the volume before starting the upgrade.
The dirty marking could have happened because a write operation on the file succeeded on all the bricks, but before the post-op could clear the marking the node was taken down for the upgrade, leaving the dirty marker behind on that brick. The .glusterfs/indices/dirty directory has two base entries, and this file's gfid entry in that directory is hard-linked to the old base entry, which confirms the above scenario. The md5sum of the file on all the bricks also matches.
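
For illustration only (not from the original report), a minimal self-contained C sketch of the write-transaction ordering described above; mark_dirty, do_write and clear_dirty are hypothetical stand-ins, not GlusterFS APIs:

#include <stdio.h>

/* Hypothetical stand-ins for the pre-op / write / post-op phases. */
static void mark_dirty(void)  { printf("pre-op: set trusted.afr.dirty\n"); }
static void do_write(void)    { printf("write succeeds on all bricks\n"); }
static void clear_dirty(void) { printf("post-op: clear trusted.afr.dirty\n"); }

int main(void)
{
    mark_dirty();
    do_write();
    /* If the node is taken down for the upgrade at this point, the post-op
     * never runs, so trusted.afr.dirty (and the gfid link under
     * .glusterfs/indices/dirty) is left behind on that brick. */
    clear_dirty();
    return 0;
}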

There were other entries needing heal after the upgrade, and they were healed successfully by the self-heal daemons on the other two nodes. Only this entry is not getting cleared from the heal info list, since the dirty marking is present only on the upgraded node, and whenever we try to run the index heal we see the following error message in the shd log on the upgraded node:

[2019-07-02 15:16:32.126379] E [glusterfsd-mgmt.c:857:glusterfs_handle_translator_op] 0-glusterfs: xlator testvol_-999-replicate-0 is not loaded

We suspect this is due to the recent shd-mux change, with the upgraded node having trouble loading the shd.
Assigning this to Rafi for further analysis.

--- Additional comment from Mohammed Rafi KC on 2019-07-05 08:52:45 UTC ---

RCA:

The attach request for that volume failed in the shd process because the process had not finished initializing yet.

-    if (ctx->active) {
-        ret = mgmt_process_volfile(xlator_req.input.input_val,
-                                   xlator_req.input.input_len, xlator_req.name,
-                                   dict);
-    } else {
-        gf_msg(this->name, GF_LOG_WARNING, EINVAL, glusterfsd_msg_42,
-               "got attach for %s but no active graph", xlator_req.name);
-    }

This can happen when a process has established a management connection but has not yet finished initialization; in that state ctx->active is still NULL. I will remove the check, since the shd process is capable of handling this situation.
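
For illustration only, a self-contained sketch of the control-flow change made by the merged patch (https://review.gluster.org/23005); everything here except the names quoted from the diff above is a hypothetical stand-in, not the actual glusterfsd code:

#include <stdio.h>

/* Hypothetical stand-ins for the glusterfsd structures and calls. */
struct ctx_sketch { void *active; };

static int process_volfile(const char *volname)
{
    printf("attaching volfile for %s\n", volname);
    return 0;
}

/* Old behaviour (the block removed by the fix): the attach request is
 * dropped when ctx->active is still NULL. */
static int handle_attach_old(struct ctx_sketch *ctx, const char *volname)
{
    if (ctx->active)
        return process_volfile(volname);
    fprintf(stderr, "got attach for %s but no active graph\n", volname);
    return -1; /* the volume never gets attached to the shd graph */
}

/* New behaviour: attach the volfile unconditionally; the graph code is
 * expected to cope with a context that has not finished initializing. */
static int handle_attach_new(struct ctx_sketch *ctx, const char *volname)
{
    (void)ctx;
    return process_volfile(volname);
}

int main(void)
{
    struct ctx_sketch not_ready = { .active = NULL };
    handle_attach_old(&not_ready, "testvol_-999");  /* attach dropped: heal stays pending */
    handle_attach_new(&not_ready, "testvol_-999");  /* attach proceeds after the fix */
    return 0;
}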

Comment 1 Worker Ant 2019-07-05 15:09:11 UTC
REVIEW: https://review.gluster.org/23005 (graph/shd: attach volfile even if ctx->active is NULL) posted (#1) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2019-08-05 06:49:19 UTC
REVIEW: https://review.gluster.org/23005 (graph/shd: attach volfile even if ctx->active is NULL) merged (#10) on master by Amar Tumballi

