1630145 – Geo-rep: Few workers fails to start with out any failure

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kotresh HR
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1614799
Blocks:	1623749

Reported:	2018-09-18 05:46 UTC by Kotresh HR
Modified:	2018-09-26 14:02 UTC
Fixed In Version:	glusterfs-4.1.5
Clone Of:	1614799
Environment:
Last Closed:	2018-09-26 14:02:57 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Kotresh HR 2018-09-18 05:46:27 UTC

+++ This bug was initially created as a clone of Bug #1614799 +++

Description of problem:
A few workers fail to start without reporting any failure.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Seen only while running the upstream regression test case:
prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Steps to Reproduce:
1. Get the upstream gluster source code.
2. Build and install gluster from source.
3. Run: prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Actual results:
One of the workers fails to start and nothing is logged.

Expected results:
No worker should fail to start

Additional info:

--- Additional comment from Worker Ant on 2018-08-10 08:41:14 EDT ---

REVIEW: https://review.gluster.org/20704 (geo-rep: Fix deadlock during worker start) posted (#1) for review on master by Kotresh HR

--- Additional comment from Worker Ant on 2018-08-12 23:52:34 EDT ---

COMMIT: https://review.gluster.org/20704 committed in master by "Amar Tumballi" <amarts> with the commit message: geo-rep: Fix deadlock during worker start

Analysis:
The monitor process spawns monitor threads (one per brick).
Each monitor thread forks worker and agent processes.
Each monitor thread, while initializing, updates the
monitor status file. The update is synchronized using flock.
The race is that one thread can fork a worker while
another thread has the status file open, so the worker
process inherits a reference to that fd.

Cause:
A flock is released either by explicitly unlocking it
or by closing all duplicate fds referring to the open file.
The code was relying on fd close, hence a reference
inherited by the worker/agent process through fork could
keep the lock held and cause the deadlock.
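
The behaviour described under "Cause" can be reproduced outside gluster. The sketch below is not the gsyncd code; the function names and the temporary file are hypothetical stand-ins for the monitor and its status file. It takes a flock, forks a child that does nothing but keep the inherited fd, and then closes the fd in the parent without unlocking.

```python
import fcntl
import os
import tempfile


def lock_is_free(path):
    """Return True if an exclusive flock on path can be taken right now."""
    fd = os.open(path, os.O_RDWR)  # a fresh open, not a duplicate
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        return False  # somebody still holds the lock
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def demo_inherited_lock():
    """Close a locked fd in the parent while a forked child still has it."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()
    path = tmp.name

    fd = os.open(path, os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)

    release_r, release_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child (stand-in for a worker): it inherited fd and simply
        # waits until the parent closes the pipe, then exits.
        os.close(release_w)
        os.read(release_r, 1)
        os._exit(0)

    os.close(release_r)
    os.close(fd)  # parent relies on close() to drop the lock
    free_after_close = lock_is_free(path)

    os.close(release_w)  # let the child exit
    os.waitpid(pid, 0)
    free_after_child_exit = lock_is_free(path)

    os.unlink(path)
    return free_after_close, free_after_child_exit
```

On Linux the first value is expected to be False and the second True: the lock stays held for as long as the child keeps its inherited copy of the fd, which is the condition that blocked the other monitor threads.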

Fix:
1. The flock is now unlocked explicitly.
2. The status file is now updated only in appropriate places, so that
the reference is not leaked to the worker/agent process.

With this fix, both the deadlock and the possible fd
leaks are resolved.
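
A minimal sketch of the first point of the fix, assuming a helper shaped like a context manager. The names locked_status_file and update_status are hypothetical and are not taken from the patch; the actual change is in the review linked above.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def locked_status_file(path):
    """Open path, hold an exclusive flock, and unlock explicitly on exit."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield fd
        finally:
            # Explicit unlock: does not depend on every copy of fd
            # (including copies inherited through fork) being closed.
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def update_status(path, text):
    """Replace the contents of the status file while holding the lock."""
    with locked_status_file(path) as fd:
        os.ftruncate(fd, 0)
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, text.encode())
```

Because the lock is dropped with LOCK_UN before the fd is closed, a copy of the fd that leaked into a forked process can no longer keep the lock held. Keeping the open/close window this short also reduces the chance of a fork happening while the file is open, which is the concern behind the second point.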

fixes: bz#1614799
Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
Signed-off-by: Kotresh HR <khiremat>

Comment 1 Worker Ant 2018-09-18 07:14:32 UTC

REVIEW: https://review.gluster.org/21201 (geo-rep: Fix deadlock during worker start) posted (#1) for review on release-4.1 by Kotresh HR

Comment 2 Worker Ant 2018-09-21 13:26:37 UTC

COMMIT: https://review.gluster.org/21201 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with the commit message: geo-rep: Fix deadlock during worker start

Analysis:
The monitor process spawns monitor threads (one per brick).
Each monitor thread forks worker and agent processes.
Each monitor thread, while initializing, updates the
monitor status file. The update is synchronized using flock.
The race is that one thread can fork a worker while
another thread has the status file open, so the worker
process inherits a reference to that fd.

Cause:
A flock is released either by explicitly unlocking it
or by closing all duplicate fds referring to the open file.
The code was relying on fd close, hence a reference
inherited by the worker/agent process through fork could
keep the lock held and cause the deadlock.

Fix:
1. The flock is now unlocked explicitly.
2. The status file is now updated only in appropriate places, so that
the reference is not leaked to the worker/agent process.

With this fix, both the deadlock and the possible fd
leaks are resolved.

Backport of:
 > Patch: https://review.gluster.org/20704
 > BUG: bz#1614799
 > Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
 > Signed-off-by: Kotresh HR <khiremat>

fixes: bz#1630145
Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
Signed-off-by: Kotresh HR <khiremat>

Comment 3 Shyamsundar 2018-09-26 14:02:57 UTC

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-4.1.5, please open a new bug report.

glusterfs-4.1.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-September/000113.html
[2] https://www.gluster.org/pipermail/gluster-users/