Bug 1541038 - A down brick is incorrectly considered to be online and causes the volume to be started without any brick available
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
Depends On:
Blocks: 1541928 1541929 1541930 1541932
Reported: 2018-02-01 15:04 UTC by Xavi Hernandez
Modified: 2018-06-20 17:58 UTC

Fixed In Version: glusterfs-v4.1.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1541928 1541929 1541930 1541932
Last Closed: 2018-06-20 17:58:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:


Description Xavi Hernandez 2018-02-01 15:04:38 UTC
Description of problem:

In a replica 2 volume, if one of the bricks is down and reports its state before the online one does, AFR tries to find another online brick in find_best_down_child(). Because the priv->child_up array is initialized to -1 and this function only checks whether an entry is 0, it treats the brick that has not reported yet as alive and sends a CHILD_UP notification.

At this point the other xlators start sending requests, which fail with ENOTCONN when they reach AFR. This can cause several unexpected errors.

Version-Release number of selected component (if applicable): mainline

How reproducible:

It happens randomly, depending on the order in which bricks are started.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Worker Ant 2018-02-01 15:12:08 UTC
REVIEW: https://review.gluster.org/19440 (cluster/afr: remove unnecessary child_up initialization) posted (#1) for review on master by Xavier Hernandez

Comment 2 Worker Ant 2018-02-03 09:56:22 UTC
COMMIT: https://review.gluster.org/19440 committed in master by "Pranith Kumar Karampuri" <pkarampu@redhat.com> with a commit message- cluster/afr: remove unnecessary child_up initialization

The child_up array was initialized with all elements being -1 to
allow afr_notify() to differentiate down bricks from bricks that
haven't reported yet. With current implementation this is not needed
anymore and it was causing unexpected results when other parts of
the code considered that if child_up[i] != 0, it meant that it was up.

Change-Id: I2a9d712ee64c512f24bd5cd3a48dcb37e3139472
BUG: 1541038
Signed-off-by: Xavier Hernandez <jahernan@redhat.com>

Comment 3 Shyamsundar 2018-06-20 17:58:44 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
