Bug 1593865

Summary: shd crash on startup
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: John Strunk <jstrunk>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.3
CC: amukherj, jstrunk, rallan, ravishankar, rhs-bugs, sankarshan, sheggodu, srmukher, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.12.2-14
Doc Type: Bug Fix
Doc Text:
glusterd can send heal-related requests to the self-heal daemon before the latter's graph is fully initialized. In that case, the self-heal daemon used to crash when trying to access certain data structures. With the fix, if the self-heal daemon receives a request before its graph is initialized, it ignores the request.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:49:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1596513, 1597229, 1597230    
Bug Blocks: 1503137, 1582526, 1597663, 1598340    

Description John Strunk 2018-06-21 17:27:20 UTC
Description of problem:
When gluster starts up after a reboot, the self-heal daemon sometimes crashes. As a result, volumes do not heal until shd is manually restarted.


Version-Release number of selected component (if applicable):
rhgs 3.3.1

$ rpm -aq | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-54.10.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.x86_64
python-gluster-3.8.4-54.10.el7rhgs.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
pcp-pmda-gluster-4.1.0-0.201805281909.git68ab4b18.el7.x86_64
glusterfs-libs-3.8.4-54.10.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
glusterfs-3.8.4-54.10.el7rhgs.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.x86_64
glusterfs-rdma-3.8.4-54.10.el7rhgs.x86_64



How reproducible:
Happens approximately 10% of the time on reboot


Steps to Reproduce:
1. Stop glusterd, bricks, and mounts as per admin guide
2. shutdown -r now
3. check gluster vol status post reboot
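
The steps above boil down to something like the following sketch; the mount point is a placeholder and the stop order is an assumption based on the usual admin-guide sequence, so adjust to the environment:

# 1. Stop gluster mounts, services, and processes before rebooting (order per admin guide).
umount /mnt/glustervol        # placeholder path for any gluster client mounts
systemctl stop glusterd       # management daemon
pkill glusterfs               # shd / nfs / fuse client processes
pkill glusterfsd              # brick processes

# 2. Reboot.
shutdown -r now

# 3. After the node comes back up, check volume status.
gluster vol status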

Actual results:
Approximately 10% of the time, the self-heal daemon will not be running, and its pid will show as N/A in `gluster vol status`.
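
A quick way to spot the bad state, and one common form of the manual intervention mentioned in the description, looks roughly like this; the volume name is a placeholder:

# A crashed shd shows Online "N" and Pid "N/A" on its line in volume status.
gluster vol status | grep -i "self-heal daemon"

# Stop-gap recovery until the fix is applied: force-start the volume,
# which respawns any missing shd/brick processes.
gluster volume start <volname> force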


Expected results:
shd should start up and run properly after reboot


Additional info:

Comment 5 Atin Mukherjee 2018-07-02 04:01:28 UTC
upstream patch : https://review.gluster.org/20422

Comment 11 Nag Pavan Chilakam 2018-07-25 15:00:15 UTC
Test version: 3.12.2-14

TC#1 Polarion RHG3-13523 --> PASS
1. Create a replica 3 volume and start it.
2. Run `while true; do gluster volume heal <volname>; sleep 0.5; done` in one terminal.
3. In another terminal, keep running `service glusterd restart`.
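
Roughly the same load can be driven from one script instead of two terminals; the volume name and loop count below are placeholders:

VOL=testvol                                  # placeholder volume name

# Terminal 1 equivalent: keep issuing heal requests.
while true; do gluster volume heal "$VOL"; sleep 0.5; done &
HEAL_PID=$!

# Terminal 2 equivalent: keep restarting glusterd, which re-sends heal
# requests to shd, sometimes before shd's graph is initialized.
for i in $(seq 1 500); do
    service glusterd restart
    sleep 5
done

kill "$HEAL_PID"

# Confirm shd is still up after the run.
gluster volume status "$VOL"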


I was seeing the crash frequently before the fix, but with the fix I did not see the problem after running the test for an hour.

Hence, moving to verified.


However, note that I hit other issues, for which bugs have been reported:
BZ#1608352 - brick (glusterfsd) crashed at in pl_trace_flush
BZ#1607888 - backtrace seen in glusterd log when triggering glusterd restart on issuing of index heal (TC#RHG3-13523)


I also retried the steps in the description and did not hit the shd crash.

Comment 14 Srijita Mukherjee 2018-09-03 13:34:06 UTC
Doc text looks good to me.

Comment 15 errata-xmlrpc 2018-09-04 06:49:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Comment 16 Ravishankar N 2018-10-22 06:04:48 UTC
*** Bug 1519105 has been marked as a duplicate of this bug. ***