+++ This bug was initially created as a clone of Bug #1614799 +++

Description of problem:
A few workers fail to start without reporting any failure.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Seen only while running the upstream regression test case
prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Steps to Reproduce:
1. Get upstream gluster source code
2. source install gluster
3. prove -v tests/00-geo-rep/georep-basic-dr-rsync.t

Actual results:
One of the workers fails to start without logging anything.

Expected results:
No worker should fail to start.

Additional info:

--- Additional comment from Worker Ant on 2018-08-10 08:41:14 EDT ---

REVIEW: https://review.gluster.org/20704 (geo-rep: Fix deadlock during worker start) posted (#1) for review on master by Kotresh HR

--- Additional comment from Worker Ant on 2018-08-12 23:52:34 EDT ---

COMMIT: https://review.gluster.org/20704 committed in master by "Amar Tumballi" <amarts> with a commit message-

geo-rep: Fix deadlock during worker start

Analysis:
The monitor process spawns monitor threads (one per brick). Each monitor thread forks worker and agent processes. While initializing, each monitor thread updates the monitor status file, and these updates are synchronized using flock. The race is that one thread can fork a worker while another thread has the status file open, so the worker process ends up holding a reference to that fd.

Cause:
An flock is released either by explicitly unlocking it or by closing all duplicate fds referring to the file. The code was relying on fd close, so a reference inherited by the worker/agent process through fork could cause the deadlock.

Fix:
1. The flock is now unlocked explicitly.
2. The status file is now updated only in appropriate places, so that the reference is not leaked to the worker/agent process.

With this fix, both the deadlock and the possible fd leaks are resolved.

fixes: bz#1614799
Change-Id: I0d1ce93072dab07d0dbcc7e779287368cd9f093d
Signed-off-by: Kotresh HR <khiremat>
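
The flock behaviour described in the commit message above can be reproduced outside gluster. What follows is a minimal sketch, not the geo-rep code: the names `run_demo` and `lock_is_free` are hypothetical, the status file is a temporary file, and a single forked child stands in for a worker that inherited the fd. It assumes a POSIX system where `fcntl.flock` and `os.fork` are available. The parent takes the lock, forks, and then either only closes its fd (the old behaviour) or unlocks explicitly before closing (the approach named in the fix).

```python
import fcntl
import os
import tempfile


def lock_is_free(path):
    """Return True if a fresh open of `path` can take the flock without blocking."""
    with open(path, "r+") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        fcntl.flock(probe, fcntl.LOCK_UN)
        return True


def run_demo(explicit_unlock):
    """Lock a file, fork a child that inherits the fd, then release in the parent.

    Returns True if the lock is free after the parent has released its side.
    """
    fd, path = tempfile.mkstemp()  # stand-in for the monitor status file
    os.close(fd)

    status = open(path, "r+")
    fcntl.flock(status, fcntl.LOCK_EX)

    # The pipe only keeps the child alive until the parent has probed the lock.
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child ("worker"): it inherited a duplicate of the locked fd and
        # never touches it, which is exactly the leaked reference.
        os.close(write_end)
        os.read(read_end, 1)  # returns at EOF once the parent closes its end
        os._exit(0)

    os.close(read_end)
    if explicit_unlock:
        fcntl.flock(status, fcntl.LOCK_UN)  # releases the lock for every duplicate fd
    status.close()  # on its own, this leaves the child's duplicate holding the lock

    free = lock_is_free(path)

    os.close(write_end)  # let the child exit
    os.waitpid(pid, 0)
    os.unlink(path)
    return free


if __name__ == "__main__":
    print("close only:      lock free =", run_demo(explicit_unlock=False))
    print("explicit unlock: lock free =", run_demo(explicit_unlock=True))
```

On Linux, the close-only run is expected to report the lock as still held, because the child's inherited fd refers to the same open file description. The explicit-unlock run is expected to report it as free. In the sketch the probe uses `LOCK_NB` so that it returns instead of hanging; a blocking `flock` call at the same point would wait for as long as the child lives, which matches the deadlock described in the analysis.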
This bug is marked for 3.4.1 mainly because, with the patch, the upstream geo-rep tests that were failing consistently now pass. Hence it makes sense to get it into the product, IMO.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432