1721802 – stale shd process files leading to heal timing out and heal daemon not coming up for all volumes

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Mohammed Rafi KC
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:	shd-multiplexing
Depends On:
Blocks:	1722541 1732668

Reported:	2019-06-19 05:34 UTC by Nag Pavan Chilakam
Modified:	2020-01-20 08:02 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1722541
Environment:
Last Closed:	2020-01-20 08:02:53 UTC
Embargoed:
Dependent Products:

Description Nag Pavan Chilakam 2019-06-19 05:34:12 UTC

Description of problem:
======================
I have a 3-node cluster with brick multiplexing (brickmux) enabled.
Three volumes exist, as below:
one 12x(6+2) EC volume named cvlt-ecv
two 1x3 AFR volumes, namely testvol and logvol

I/O is being run on the cvlt-ecv volume (just dd writes and appends).

Two of the nodes have been upgraded over the past few days.
As part of upgrading the last node of the 3-node cluster to 6.0.5 (including the kernel), I rebooted the node.
After that, the bricks did not come up because of some bad entries in fstab. On resolving them, I also noticed that the cluster had gone into the peer rejected state.
When I checked the cksums of the cvlt-ecv volume, I noticed that the cksum value on n3 (the node being upgraded) differed from the value on n1 and n2.
Hence, to fix that, we deleted the entire cvlt-ecv directory under /var/lib/glusterd so that glusterd would heal it.
Restarting glusterd then fixed the peer rejected issue.
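
For reference, the cksum comparison described above can be sketched as follows. This is only an illustration, not what was actually run on the cluster. It assumes the contents of each node's cksum file for the volume have already been collected by hand and that the file holds key=value lines such as "info=<number>". That format, the function names, and the sample values are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def parse_cksum(text: str) -> Dict[str, str]:
    """Parse key=value lines (assumed cksum file format) into a dict."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '='
            result[key.strip()] = value.strip()
    return result


def nodes_out_of_sync(per_node: Dict[str, str], key: str = "info") -> List[str]:
    """Return the nodes whose value for `key` differs from the most common one."""
    values = {node: parse_cksum(text).get(key) for node, text in per_node.items()}
    majority, _ = Counter(values.values()).most_common(1)[0]
    return sorted(node for node, value in values.items() if value != majority)


# Hypothetical values, chosen only to mirror the n1/n2/n3 situation above.
collected = {"n1": "info=111\n", "n2": "info=111\n", "n3": "info=222\n"}
print(nodes_out_of_sync(collected))  # ['n3']
```

With only three nodes, the "majority" value is just the one that two nodes agree on, which matches how n3 was singled out here.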

However, we noticed that the shd was not showing as online for the 2 AFR volumes.

I tried restarting glusterd (including killing the glusterfsd, shd, and glusterfs processes).

But the shd still did not come up for the 2 AFR volumes.

Based on the logs, we noticed that stale pid entries still exist under /var/run/gluster for testvol and logvol, and these are blocking the shd from starting on these volumes.


I went ahead and deleted the old stale pid files, and the shd came up on all the volumes.
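
A minimal sketch of how stale pid files like these could be spotted before deleting anything. It assumes the pid files end in ".pid" and contain a single integer PID, which is an assumption about the layout under /var/run/gluster and not something confirmed in this report. The function names are made up for the example, and the check is Linux/POSIX only.

```python
import os
from pathlib import Path
from typing import List


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def find_stale_pidfiles(run_dir: str) -> List[Path]:
    """List *.pid files under run_dir whose recorded PID is not running."""
    stale = []
    for pidfile in sorted(Path(run_dir).rglob("*.pid")):
        try:
            pid = int(pidfile.read_text().strip())
        except ValueError:
            stale.append(pidfile)  # empty or non-numeric content
            continue
        except OSError:
            continue  # unreadable file: skip it instead of guessing
        if not pid_is_alive(pid):
            stale.append(pidfile)
    return stale


if __name__ == "__main__":
    # The directory is the one named in this report; adjust as needed.
    for path in find_stale_pidfiles("/var/run/gluster"):
        print(path)
```

Note that this only reports candidates. A PID can be reused by an unrelated process, so a pid file that looks live here may still be stale, and it is worth checking the process name before trusting the result.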

I thought it was a one-off, but I now see the same behavior on another node too, which is quite concerning, as we see the problems below:
1) the manual index heal command is timing out
2) the heal daemon is not running on the other volumes (as a stale pidfile exists in /var/run/gluster)


Version-Release number of selected component (if applicable):
===================
6.0.5

How reproducible:
============
consistent on my cluster

Steps to Reproduce:
================
Explained in the description; more details on the cluster are available at https://docs.google.com/spreadsheets/d/1_jmnDAcs1TqXbWjw-r4iCYo4zGKheSAzP1lfMSxVS6w/edit#gid=0
