Bug 1722541 - stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On: 1721802
Blocks: 1732668
 
Reported: 2019-06-20 15:17 UTC by Mohammed Rafi KC
Modified: 2020-02-10 17:45 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1721802
Clones: 1732668
Environment:
Last Closed: 2020-02-10 17:45:16 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
Gluster.org Gerrit 22909, Merged: shd/mux: Fix race between mux_proc unlink and stop (last updated 2019-06-24 05:02:20 UTC)
Gluster.org Gerrit 22935, Open: glusterd/svc: update pid of mux volumes from the shd process (last updated 2019-07-09 12:19:36 UTC)

Description Mohammed Rafi KC 2019-06-20 15:17:17 UTC
+++ This bug was initially created as a clone of Bug #1721802 +++

Description of problem:
=======================
I have a 3-node, brick-mux-enabled cluster.
3 volumes exist, as below:
a 12x(6+2) EC volume named cvlt-ecv
two 1x3 AFR volumes, namely testvol and logvol

I/O is being done on the cvlt-ecv volume (just dd runs and appends).

Two of the nodes had been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including the kernel), I rebooted the node.
After that, the bricks were not coming up due to some bad entries in fstab; on resolving those, I also noticed that the cluster had gone into a peer-rejected state.
When checking the cksums of the cvlt-ecv volume, I noticed a difference in the cksum value on n3 (the node being upgraded) compared to n1 and n2.
Hence, to fix that, we deleted the entire cvlt-ecv directory under /var/lib/glusterd so that glusterd would heal it.
A restart of glusterd fixed the peer-rejected issue.

However, we noticed that the shd was not showing as online for the 2 AFR volumes.

We tried restarting glusterd (including killing the glusterfsd, shd, and fs processes).

But the shd still did not come up for the 2 AFR volumes.

Based on the logs, we noticed that stale pid files for testvol and logvol still existed under /var/run/gluster, blocking the shd start on these volumes.


I went ahead and deleted the old stale pid files, and the shd came up on all the volumes.
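
For reference, a "stale" pid file here just means a file under /var/run/gluster whose recorded pid no longer maps to a live process. A minimal, self-contained C sketch of that check (illustrative only, not the actual glusterd code) could look like this:

    /* Illustrative staleness check: read the pid stored in a pidfile and
     * probe it with kill(pid, 0); ESRCH means no such process exists, so
     * the pidfile is stale and safe to clean up. */
    #include <errno.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/types.h>

    static bool
    pidfile_is_stale (const char *path)
    {
        FILE *fp = fopen (path, "r");
        int   pid = -1;

        if (!fp)
            return false;                 /* no pidfile at all */

        if (fscanf (fp, "%d", &pid) != 1 || pid <= 0) {
            fclose (fp);
            return true;                  /* unreadable or garbage pidfile */
        }
        fclose (fp);

        /* Signal 0 does error checking only; ESRCH => the process is gone. */
        if (kill ((pid_t) pid, 0) == -1 && errno == ESRCH)
            return true;

        return false;
    }

    int
    main (int argc, char *argv[])
    {
        if (argc > 1)
            printf ("%s: %s\n", argv[1],
                    pidfile_is_stale (argv[1]) ? "stale" : "in use or absent");
        return 0;
    }

Run against the testvol and logvol pidfiles under /var/run/gluster, a check like this would have flagged them as stale in the state described above.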

While I thought it was a one-off, I now see the same behavior on another node too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (as stale pid files exist in /var/run/gluster)

Comment 1 Worker Ant 2019-06-20 15:19:42 UTC
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc unlink and stop) posted (#2) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2019-06-24 05:02:21 UTC
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc unlink and stop) merged (#4) on master by Atin Mukherjee
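
The patch title above suggests the root cause is a race in which stopping a muxed shd and unlinking its pidfile can interleave with a new start, which is exactly how a stale or leftover pidfile ends up blocking shd on some volumes. A minimal sketch of the general serialization idea (hypothetical helper names, not the actual glusterd change):

    /* Sketch: take one lock around "stop the muxed shd and unlink its
     * pidfile" and around "start a new shd and write its pidfile", so a
     * start can never observe a half-torn-down state. */
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pthread_mutex_t shd_proc_lock = PTHREAD_MUTEX_INITIALIZER;

    static void
    shd_stop_and_unlink (pid_t pid, const char *pidfile)
    {
        pthread_mutex_lock (&shd_proc_lock);
        kill (pid, SIGTERM);   /* stop the muxed shd process */
        unlink (pidfile);      /* remove its pidfile under the same lock */
        pthread_mutex_unlock (&shd_proc_lock);
    }

    static pid_t
    shd_start (const char *pidfile)
    {
        pid_t pid;

        pthread_mutex_lock (&shd_proc_lock);
        pid = fork ();
        if (pid == 0) {
            /* child: stand-in for exec'ing the real shd binary */
            pause ();
            _exit (0);
        }
        if (pid > 0) {
            FILE *fp = fopen (pidfile, "w");
            if (fp) {
                fprintf (fp, "%d\n", (int) pid);
                fclose (fp);
            }
        }
        pthread_mutex_unlock (&shd_proc_lock);
        return pid;
    }

    int
    main (void)
    {
        const char *pidfile = "/tmp/demo-shd.pid";
        pid_t       pid     = shd_start (pidfile);

        if (pid > 0)
            shd_stop_and_unlink (pid, pidfile);
        return 0;
    }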

Comment 3 Worker Ant 2019-06-24 15:19:45 UTC
REVIEW: https://review.gluster.org/22935 (glusterd/svc: Fix race between shd start and volume stop) posted (#1) for review on master by mohammed rafi  kc

Comment 4 Worker Ant 2019-07-09 12:19:37 UTC
REVIEW: https://review.gluster.org/22935 (glusterd/svc: update pid of mux volumes from the shd process) merged (#17) on master by Atin Mukherjee
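
Judging only from the patch title (a hedged reading, not a description of the real change), the second fix has the shd process itself record its pid for each multiplexed volume, so the per-volume pidfiles under /var/run/gluster reflect the live process instead of going stale. A toy sketch of that idea, with illustrative paths:

    /* Hypothetical sketch: one multiplexed shd serving several volumes
     * writes its own pid into each per-volume pidfile. Paths and names
     * are illustrative, not the real glusterd layout. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int
    write_pidfile (const char *path, pid_t pid)
    {
        FILE *fp = fopen (path, "w");

        if (!fp)
            return -1;
        fprintf (fp, "%d\n", (int) pid);
        fclose (fp);
        return 0;
    }

    int
    main (void)
    {
        /* Stand-ins for the volumes attached to this muxed shd. */
        const char *pidfiles[] = {
            "/tmp/demo-shd-testvol.pid",
            "/tmp/demo-shd-logvol.pid",
        };
        pid_t  self = getpid ();
        size_t i;

        for (i = 0; i < sizeof (pidfiles) / sizeof (pidfiles[0]); i++)
            write_pidfile (pidfiles[i], self);

        return 0;
    }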

Comment 5 Sunny Kumar 2020-02-10 17:45:16 UTC
Both of the above patches are merged; closing this bug now.

