Bug 1721802

Summary:	stale shd process files leading to heal timing out and heal daemon not coming up for all volumes
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Nag Pavan Chilakam <nchilaka>
Component:	replicate	Assignee:	Mohammed Rafi KC <rkavunga>
Status:	CLOSED DEFERRED	QA Contact:	Nag Pavan Chilakam <nchilaka>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.5	CC:	amukherj, rhs-bugs, rkavunga, sheggodu, storage-qa-internal, vdas
Target Milestone:	---	Keywords:	Regression
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	shd-multiplexing
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1722541 (view as bug list)		Environment:
Last Closed:	2020-01-20 08:02:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1722541, 1732668

Description Nag Pavan Chilakam 2019-06-19 05:34:12 UTC

Description of problem:
======================
I have a 3-node cluster with brickmux enabled.
Three volumes exist:
a 12x(6+2) ecvol named cvlt-ecv
two 1x3 afr volumes, namely testvol and logvol

I/O is being done on the cvlt-ecv volume (just dd writes and appends).

Two of the nodes have been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including the kernel), I rebooted the node.
After the reboot the bricks did not come up because of some bad entries in fstab. After resolving them, I also noticed that the cluster had gone into the peer rejected state.
When I checked the cksums of the cvlt-ecv volume, I found that the cksum value on n3 (the node being upgraded) differed from the value on n1 and n2.
To fix that, we deleted the whole cvlt-ecv directory under /var/lib/glusterd so that glusterd would heal it.
Restarting glusterd then fixed the peer rejected issue.

However, we noticed that the shd was not showing as online for the 2 afr volumes.

I tried restarting glusterd (including killing the glusterfsd, shd, and fs processes),

but the shd still does not come up for the 2 afr volumes.

Based on the logs, we noticed that /var/run/gluster/testvol and logvol still have stale pid entries, which block the shd from starting on these volumes.


I went ahead and deleted the old stale pid files and shd came up on all the volumes.
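
The manual check described above can be approximated with a short script. This is only a rough sketch under stated assumptions, not what glusterd itself does: it assumes the pid files end in .pid and hold a single numeric PID (neither detail is confirmed in this report), and find_stale_pidfiles and pid_is_alive are hypothetical helper names. It only lists candidates and deletes nothing.

```python
import os
from pathlib import Path
from typing import List


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_pidfiles(run_dir: str) -> List[Path]:
    """List *.pid files under run_dir whose recorded PID is not running.

    Assumption: each pid file holds one numeric PID. Files with any
    other content are reported as stale too.
    """
    stale = []
    for pidfile in sorted(Path(run_dir).rglob("*.pid")):
        text = pidfile.read_text().strip()
        if not text.isdigit() or not pid_is_alive(int(text)):
            stale.append(pidfile)
    return stale
```

Calling find_stale_pidfiles("/var/run/gluster") on an affected node would print nothing by itself; the caller decides what to do with the returned paths. A PID that is alive may still belong to an unrelated process if the PID was reused, so a clean result does not prove the shd is healthy.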

I thought it was a one-off, but I now see the same behavior on another node too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (as a stale pidfile exists in /var/run/gluster)


Version-Release number of selected component (if applicable):
===================
6.0.5

How reproducible:
============
consistent on my cluster

Steps to Reproduce:
================
Explained in the description; more details on the cluster are available at https://docs.google.com/spreadsheets/d/1_jmnDAcs1TqXbWjw-r4iCYo4zGKheSAzP1lfMSxVS6w/edit#gid=0