Bug 1005526 - All null pending matrix
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: locks
Version: 3.4.0
Hardware: All Unspecified
Priority: unspecified Severity: low
Assigned To: Pranith Kumar K
Reported: 2013-09-08 00:09 EDT by manu@netbsd.org
Modified: 2014-04-17 09:14 EDT
Fixed In Version: glusterfs-3.4.3
Doc Type: Bug Fix
Last Closed: 2014-04-17 09:14:29 EDT
Type: Bug
Attachments: None
Description manu@netbsd.org 2013-09-08 00:09:56 EDT
Description of problem:

On a 2x2 Distributed-Replicate volume with heterogeneous nodes (i386 and x86-64), the filesystem reports split-brain files for no apparent reason. Debug logging shows an all-null pending matrix.

The problem vanishes if
1) eager locks are disabled
OR
2) the x86-64 node is replaced by an i386 node.
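
For workaround 1, a minimal sketch of turning eager locks off on one volume. It assumes the option is exposed as cluster.eager-lock on this release; disable_eager_lock is a hypothetical helper name, not part of the gluster CLI.

```shell
# Hypothetical helper: turn eager locking off for a single volume.
# Assumption: the volume option is named "cluster.eager-lock".
disable_eager_lock() {
    gluster volume set "$1" cluster.eager-lock off
}

# Example (not run here): disable_eager_lock gfs340
```

The setting applies to the whole volume, so every replica pair runs without eager locks, not only the pair that contains the x86-64 brick.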

This bug has been observed on NetBSD 6.0, but it probably also exists on Linux. We are not sure whether the problem is caused by the heterogeneous cluster or whether it is LP64-specific.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0


How reproducible:

Always happens after a few hours of activity building the NetBSD source tree (my usual stress test for glusterfs).

Steps to Reproduce:
1. set up a 2x2 Distributed-Replicate volume with 3 i386 bricks and an x86-64 one. Here is my gluster info output (silo and hangar are i386, debacle is x86-64):
Volume Name: gfs340
Type: Distributed-Replicate
Volume ID: d2745193-58ff-4406-8f1e-d65bebdda017
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: silo:/export/wd2a
Brick2: hangar:/export/wd1a
Brick3: hangar:/export/wd3a
Brick4: debacle:/export/wd1a

2. on the glusterfs volume, fetch and unpack the NetBSD source tarballs from ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-6.0/source/sets/

3. build NetBSD: cd usr/src/ && ./build.sh -Uum i386 release


Actual results:
A split brain occurs. Logs with debug enabled (see attachment) report an all-NULL matrix:
[afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gfs34-replicate-1: pending_matrix: [ 0 0 ]
[afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gfs34-replicate-1: pending_matrix: [ 0 0 ]
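
To check how often this shows up in a client log, the following is a rough sketch that counts pending_matrix rows containing only zeros. It assumes the row format shown in the two lines above; count_null_rows is a hypothetical helper name, and the log file path is left to the caller.

```shell
# Count log lines whose pending_matrix row contains only zeros.
# Assumption: rows are printed as "pending_matrix: [ 0 0 ]", as above.
count_null_rows() {
    grep -c -E 'pending_matrix: \[( 0)+ \]' "$1"
}

# Example (not run here): count_null_rows <path to client log>
```

Each matching line is one row of the matrix, so a 2x2 all-null matrix like the one above contributes two lines to the count.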


Expected results:
The build should complete without a hitch. This is what happens on a homogeneous cluster, or with eager locks disabled.


Additional info:

The attachment contains the complete client log with debug enabled.
Comment 1 Anand Avati 2013-10-02 00:09:20 EDT
REVIEW: http://review.gluster.org/6020 (Disable eager-locks on NetBSD for 3.4 branch) posted (#1) for review on release-3.4 by Emmanuel Dreyfus (manu@netbsd.org)
Comment 2 Anand Avati 2013-10-24 08:21:33 EDT
COMMIT: http://review.gluster.org/6020 committed in release-3.4 by Vijay Bellur (vbellur@redhat.com) 
------
commit 02ede06cbb00aef2ad1fbceb8c818c5d649ab512
Author: Emmanuel Dreyfus <manu@netbsd.org>
Date:   Wed Oct 2 06:07:23 2013 +0200

    Disable eager-locks on NetBSD for 3.4 branch
    
    As described in https://bugzilla.redhat.com/show_bug.cgi?id=1005526
    eager-locks are broken on release-3.4, at least for NetBSD. This
    change disable them by default, leaving the admin the possibility
    to explicitely enable the feature if needed.
    
    BUG: 1005526
    Change-Id: I6f1b393865b103ec56ad5eb5143f59bb8672f19c
    Signed-off-by: Emmanuel Dreyfus <manu@netbsd.org>
    Reviewed-on: http://review.gluster.org/6020
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Comment 3 Pranith Kumar K 2014-01-16 04:06:11 EST
Emmanuel,
    According to the recent mail on gluster-devel, this issue is no longer seen on 3.5. Could you close this bug if that is the case?

Pranith
Comment 4 Niels de Vos 2014-04-17 09:14:29 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]. Packages for several distributions should already be available or become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.4.3. Along the same lines, the recent release glusterfs-3.5.0 [3] is likely to have the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
