Bug 1722541

Summary: stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
Product: [Community] GlusterFS
Reporter: Mohammed Rafi KC <rkavunga>
Component: replicate
Assignee: Mohammed Rafi KC <rkavunga>
Status: POST --- QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: mainline
CC: amukherj, bugs, nchilaka, pasik, rhs-bugs, rkavunga, storage-qa-internal
Target Milestone: ---
Keywords: Regression, Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1721802
: 1732668 (view as bug list) Environment:
Last Closed: 2019-06-24 05:02:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1721802    
Bug Blocks: 1732668    

Description Mohammed Rafi KC 2019-06-20 15:17:17 UTC
+++ This bug was initially created as a clone of Bug #1721802 +++

Description of problem:
======================
I have a 3-node cluster with brick multiplexing (brickmux) enabled.
Three volumes exist, as below:
one 12x(6+2) EC volume named cvlt-ecv
two 1x3 AFR volumes, namely testvol and logvol

I/O is running on the cvlt-ecv volume (just dd writes and appends).

Two of the nodes were upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including the kernel), I rebooted the node.
After the reboot the bricks did not come up because of some bad entries in fstab. On resolving them, I also noticed that the node had gone into the peer rejected state.
When I checked the cksums of the cvlt-ecv volume, I noticed that the cksum value on n3 (the node being upgraded) differed from the value on n1 and n2.
To fix that, we deleted the whole cvlt-ecv directory under /var/lib/glusterd so that glusterd would heal it.
After a restart of glusterd, the peer rejected issue was fixed.

However, we noticed that the shd was not showing as online for the 2 AFR volumes.

I tried restarting glusterd (including killing the glusterfsd, shd and fs procs).

But the shd still did not come up for the 2 AFR volumes.

Based on the logs, we noticed that stale pid entries still exist under /var/run/gluster for testvol and logvol, and these block the shd from starting on these volumes.


I went ahead and deleted the old stale pid files and shd came up on all the volumes.
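
The manual check described above (finding pid files whose process is gone) can be sketched in Python. This is only an illustrative sketch, not the fix that went into glusterd: the helper names pid_is_alive and find_stale_pidfiles are hypothetical, and the assumption that the pid files under /var/run/gluster end in ".pid" is mine, not something this report states. It also only catches the case where the recorded PID no longer exists; a PID that has been reused by an unrelated process would still look alive.

```python
import os
from pathlib import Path
from typing import List


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence/permissions
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but is owned by another user
    return True


def find_stale_pidfiles(root: str) -> List[Path]:
    """List *.pid files under root whose recorded PID is not running.

    The "*.pid" naming is an assumption made for this sketch.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    stale = []
    for pidfile in sorted(base.rglob("*.pid")):
        try:
            pid = int(pidfile.read_text().split()[0])
        except (OSError, ValueError, IndexError):
            stale.append(pidfile)  # unreadable, empty or malformed file
            continue
        if not pid_is_alive(pid):
            stale.append(pidfile)
    return stale
```

Calling find_stale_pidfiles("/var/run/gluster") on an affected node would list candidate files to inspect. The sketch deliberately only reports them and deletes nothing, since removing a pid file that belongs to a running shd could cause further trouble.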

I thought it was a one-off, but I now see the same behavior on another node too. This is quite concerning, as we see the problems below:
1) the manual index heal command times out
2) the heal daemon is not running on the other volumes (as a stale pidfile exists in /var/run/gluster)

Comment 1 Worker Ant 2019-06-20 15:19:42 UTC
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc unlink and stop) posted (#2) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2019-06-24 05:02:21 UTC
REVIEW: https://review.gluster.org/22909 (shd/mux: Fix race between mux_proc unlink and stop) merged (#4) on master by Atin Mukherjee

Comment 3 Worker Ant 2019-06-24 15:19:45 UTC
REVIEW: https://review.gluster.org/22935 (glusterd/svc: Fix race between shd start and volume stop) posted (#1) for review on master by mohammed rafi  kc

Comment 4 Worker Ant 2019-07-09 12:19:37 UTC
REVIEW: https://review.gluster.org/22935 (glusterd/svc: update pid of mux volumes from the shd process) merged (#17) on master by Atin Mukherjee