Bug 1641872

Summary:	Spurious failures in bug-1637802-arbiter-stale-data-heal-lock.t
Product:	[Community] GlusterFS	Reporter:	Ravishankar N <ravishankar>
Component:	tests	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	5	CC:	bugs
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-5.1	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1641344	Environment:
Last Closed:	2018-11-29 15:20:34 UTC	Type:	Bug
Bug Depends On:	1641344
Bug Blocks:	1641761, 1641762

Description Ravishankar N 2018-10-23 04:22:00 UTC

+++ This bug was initially created as a clone of Bug #1641344 +++

Problem:
    https://review.gluster.org/#/c/glusterfs/+/21427/ seems to be failing
    this .t spuriously. On checking one of the failure logs, I see:

    22:05:44 Launching heal operation to perform index self heal on volume patchy has been unsuccessful:
    22:05:44 Self-heal daemon is not running. Check self-heal daemon log file.
    22:05:44 not ok 20 , LINENUM:38

    In glusterd log:
    [2018-10-18 22:05:44.298832] E [MSGID: 106301] [glusterd-syncop.c:1352:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Self-heal daemon is not running. Check self-heal daemon log file

    But the tests that precede this one check, via a statedump, whether the shd
    is connected to the bricks. They succeeded, and the shd had even started
    healing. From glustershd.log:

    [2018-10-18 22:05:40.975268] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-patchy-replicate-0: Completed data selfheal on 3b83d2dd-4cf2-4ea3-a33e-4275be40f440. sources=[0] 1  sinks=2

    So the only reason I can see for the heal launch via the CLI failing is a
    race: shd has been spawned, but glusterd has not yet updated its in-memory
    state to show that it is up, and so it fails the CLI command.

    Fix:
    Check for shd up status before launching heal via CLI
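
The fix described above can be sketched as a poll-until-ready guard in front of the heal command. This is only an illustration of the idea, not the committed patch: `wait_for_value`, `daemon_up_status`, `launch_heal` and `STATE_FILE` are hypothetical names invented for this sketch, and the state file merely simulates glusterd recording the shd as up a little after it was spawned. In the GlusterFS test framework the corresponding idiom would presumably be `EXPECT_WITHIN` with an shd status helper, but the exact line added by the patch is not shown in this report.

```shell
#!/bin/bash
# Sketch: wait until a status probe reports "Y" before running a dependent command.
# All names below are hypothetical and exist only for this illustration.

# Poll "$@" once per second until its output equals $expected or $timeout seconds pass.
wait_for_value() {
    local timeout=$1 expected=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    while :; do
        if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}

# Hypothetical probe: prints "Y" once the manager's view says the daemon is up.
# A real test would ask glusterd for its view rather than look at a file.
daemon_up_status() {
    if [ -e "$STATE_FILE" ]; then echo "Y"; else echo "N"; fi
}

# Hypothetical stand-in for launching the heal via the CLI.
launch_heal() {
    echo "heal launched"
}

STATE_FILE=$(mktemp -u)
# Simulate the race: the daemon is already running, but the manager
# only records it as up two seconds later.
( sleep 2; touch "$STATE_FILE" ) &

if wait_for_value 10 "Y" daemon_up_status; then
    launch_heal
else
    echo "daemon not reported up within timeout" >&2
fi

wait
rm -f "$STATE_FILE"
```

Without the guard, `launch_heal` would run inside the two-second window in which the probe still prints "N", which corresponds to the spurious failure seen in the logs.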

--- Additional comment from Worker Ant on 2018-10-21 08:17:59 EDT ---

REVIEW: https://review.gluster.org/21451 (tests: check for shd up status in bug-1637802-arbiter-stale-data-heal-lock.t) posted (#1) for review on master by Ravishankar N

--- Additional comment from Worker Ant on 2018-10-22 09:49:30 EDT ---

COMMIT: https://review.gluster.org/21451 committed in master by "Pranith Kumar Karampuri" <pkarampu> with a commit message- tests: check for shd up status in bug-1637802-arbiter-stale-data-heal-lock.t

Problem:
https://review.gluster.org/#/c/glusterfs/+/21427/ seems to be failing
this .t spuriously. On checking one of the failure logs, I see:

22:05:44 Launching heal operation to perform index self heal on volume patchy has been unsuccessful:
22:05:44 Self-heal daemon is not running. Check self-heal daemon log file.
22:05:44 not ok 20 , LINENUM:38

In glusterd log:
[2018-10-18 22:05:44.298832] E [MSGID: 106301] [glusterd-syncop.c:1352:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Self-heal daemon is not running. Check self-heal daemon log file

But the tests that precede this one check, via a statedump, whether the shd
is connected to the bricks. They succeeded, and the shd had even started
healing. From glustershd.log:

[2018-10-18 22:05:40.975268] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-patchy-replicate-0: Completed data selfheal on 3b83d2dd-4cf2-4ea3-a33e-4275be40f440. sources=[0] 1  sinks=2

So the only reason I can see for the heal launch via the CLI failing is a
race: shd has been spawned, but glusterd has not yet updated its in-memory
state to show that it is up, and so it fails the CLI command.

Fix:
Check for shd up status before launching heal via CLI

Change-Id: Ic88abf14ad3d51c89cb438db601fae4df179e8f4
fixes: bz#1641344
Signed-off-by: Ravishankar N <ravishankar>

Comment 1 Worker Ant 2018-10-23 04:23:48 UTC

REVIEW: https://review.gluster.org/21462 (tests: check for shd up status in bug-1637802-arbiter-stale-data-heal-lock.t) posted (#1) for review on release-5 by Ravishankar N

Comment 2 Worker Ant 2018-10-25 13:12:57 UTC

COMMIT: https://review.gluster.org/21462 committed in release-5 by "Shyamsundar Ranganathan" <srangana> with a commit message- tests: check for shd up status in bug-1637802-arbiter-stale-data-heal-lock.t

Problem:
https://review.gluster.org/#/c/glusterfs/+/21427/ seems to be failing
this .t spuriously. On checking one of the failure logs, I see:

22:05:44 Launching heal operation to perform index self heal on volume patchy has been unsuccessful:
22:05:44 Self-heal daemon is not running. Check self-heal daemon log file.
22:05:44 not ok 20 , LINENUM:38

In glusterd log:
[2018-10-18 22:05:44.298832] E [MSGID: 106301] [glusterd-syncop.c:1352:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Self-heal daemon is not running. Check self-heal daemon log file

But the tests that precede this one check, via a statedump, whether the shd
is connected to the bricks. They succeeded, and the shd had even started
healing. From glustershd.log:

[2018-10-18 22:05:40.975268] I [MSGID: 108026] [afr-self-heal-common.c:1732:afr_log_selfheal] 0-patchy-replicate-0: Completed data selfheal on 3b83d2dd-4cf2-4ea3-a33e-4275be40f440. sources=[0] 1  sinks=2

So the only reason I can see for the heal launch via the CLI failing is a
race: shd has been spawned, but glusterd has not yet updated its in-memory
state to show that it is up, and so it fails the CLI command.

Fix:
Check for shd up status before launching heal via CLI

Change-Id: Ic88abf14ad3d51c89cb438db601fae4df179e8f4
fixes: bz#1641872
Signed-off-by: Ravishankar N <ravishankar>
(cherry picked from commit 3dea105556130abd4da0fd3f8f2c523ac52398d1)

Comment 3 Shyamsundar 2018-11-29 15:20:34 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.1, please open a new bug report.

glusterfs-5.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-November/000116.html
[2] https://www.gluster.org/pipermail/gluster-users/