Bug 1623749 - Geo-rep: Few workers fails to start with out any failure
Summary: Geo-rep: Few workers fails to start with out any failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 1
Assignee: Sunny Kumar
QA Contact: Rochelle
URL:
Whiteboard:
Depends On: 1614799 1630145
Blocks:
Reported: 2018-08-30 06:07 UTC by Amar Tumballi
Modified: 2022-07-09 10:10 UTC
CC List: 14 users

Fixed In Version: glusterfs-3.12.2-21
Doc Type: Bug Fix
Doc Text:
Previously, some workers failed during startup because of a deadlock while waiting for a flock. When a monitor starts the workers, they update the status file, using flock to synchronize. While worker one had the status file open for an update, worker two could be forked, which left worker two holding a reference to the same file descriptor. Because the code relied on closing the file descriptor to release the lock, worker one could not release it while the reference still existed in worker two, and worker two deadlocked while coming up. With this fix, the flock is unlocked explicitly and the status file is updated so that the reference is not leaked to any worker or agent process. As a result, all workers come up without fail.
Clone Of: 1614799
Environment:
Last Closed: 2018-10-31 08:46:14 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3432 0 None None None 2018-10-31 08:47:58 UTC

Description Amar Tumballi 2018-08-30 06:07:01 UTC
+++ This bug was initially created as a clone of Bug #1614799 +++

Description of problem:
A few workers fail to start without reporting any failure.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Seen only while running upstream regression test case
prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Steps to Reproduce:
1. Get the upstream gluster source code
2. Build and install gluster from source
3. Run: prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Actual results:
One of the workers fails to start, and nothing is logged.

Expected results:
No worker should fail to start

Additional info:

--- Additional comment from Worker Ant on 2018-08-10 08:41:14 EDT ---

REVIEW: https://review.gluster.org/20704 (geo-rep: Fix deadlock during worker start) posted (#1) for review on master by Kotresh HR

--- Additional comment from Worker Ant on 2018-08-12 23:52:34 EDT ---

COMMIT: https://review.gluster.org/20704 committed in master by "Amar Tumballi" <amarts> with a commit message- geo-rep: Fix deadlock during worker start

Analysis:
The monitor process spawns monitor threads (one per brick).
Each monitor thread forks worker and agent processes.
Each monitor thread, while initializing, updates the
monitor status file. The update is synchronized using flock.
The race is that one thread can fork a worker while
another thread has the status file open, so the worker
process ends up holding a reference to the fd.

Cause:
A flock is released either by explicitly unlocking it
or by closing all duplicate fds referring to the file.
The code was relying on fd close, hence a reference
held in a worker/agent process after fork could cause the deadlock.
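
The cause described above can be reproduced outside geo-rep. The following is a minimal sketch, not the gsyncd code: the function names, the temporary file and the pipe used to keep the child alive are all illustrative assumptions. It takes a flock, forks a child that inherits the fd, "releases" the lock in the parent by close() alone, and then probes whether another opener can take the lock.

```python
import fcntl
import os
import tempfile


def lock_is_free(path):
    """Return True if an exclusive flock on path can be taken right now."""
    fd = os.open(path, os.O_RDWR)  # new open file description
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # The lock is held through some other open file description.
        return False
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def demo_close_only():
    """Hypothetical demo: release a flock by close() after a fork."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)

    fd = os.open(path, os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)

    # Pipe used only to keep the child alive until the parent is done.
    release_r, release_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child (stands in for a worker): it inherited fd and never closes it.
        os.close(release_w)
        os.read(release_r, 1)  # returns at EOF, when the parent closes its end
        os._exit(0)

    os.close(release_r)
    os.close(fd)  # parent relies on close() to drop the lock
    free_while_child_alive = lock_is_free(path)

    os.close(release_w)  # let the child exit, which closes its copy of fd
    os.waitpid(pid, 0)
    free_after_child_exit = lock_is_free(path)

    os.unlink(path)
    return free_while_child_alive, free_after_child_exit


if __name__ == "__main__":
    print(demo_close_only())
```

On Linux the first probe is expected to fail and the second to succeed: the lock stays held until the child's inherited copy of the fd is closed as well.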

Fix:
1. flock is unlocked explicitly.
2. Also made sure to update the status file in appropriate places so that
the reference is not leaked to the worker/agent process.

With this fix, both the deadlock and the possible fd
leaks are solved.
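
As an illustration of the first point of the fix, here is a minimal sketch of a status-file update that unlocks explicitly instead of relying on close(). It is not the patch itself: `locked` and `update_status` are hypothetical names, and the file layout is assumed. An explicit LOCK_UN releases the lock for the whole open file description, so a duplicate fd inherited by a forked process no longer keeps the lock held.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def locked(path):
    """Hypothetical helper: hold an exclusive flock on path for the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield fd
        finally:
            # Explicit unlock: does not depend on every duplicate fd
            # (for example one inherited across fork) being closed.
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def update_status(path, text):
    """Hypothetical helper: replace the status file contents under the lock."""
    with locked(path) as fd:
        os.ftruncate(fd, 0)
        os.write(fd, text.encode())
```

Keeping the lock scope this small also narrows the window in which another thread's fork can inherit the open fd, which relates to the second point of the fix.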

fixes: bz#1614799
Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
Signed-off-by: Kotresh HR <khiremat>

Comment 1 Amar Tumballi 2018-08-30 06:18:26 UTC
This bug is marked for 3.4.1 mainly because with the patch, the upstream tests which were failing consistently in geo-rep are now passing successfully. Hence it makes sense to get it into the product, IMO.

Comment 15 errata-xmlrpc 2018-10-31 08:46:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432

