Description of problem:
On a 2x2 Distributed-Replicate volume with heterogeneous nodes (i386 and x86-64), the filesystem exhibits split-brain files for no reason. Debug output shows an all-null pending matrix. The problem vanishes if 1) eager locks are disabled, OR 2) the x86-64 node is replaced by an i386 node. This bug has been observed on NetBSD 6.0, but it probably also exists on Linux. We are not sure whether the problem is caused by the heterogeneous cluster or whether it is LP64-specific.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0

How reproducible:
Always happens after a few hours of activity building the NetBSD source tree (my usual stress test for glusterfs).

Steps to Reproduce:
1. Set up a 2x2 Distributed-Replicate volume with three i386 bricks and an x86-64 one. Here is my gluster info output (silo and hangar are i386, debacle is x86-64):

Volume Name: gfs340
Type: Distributed-Replicate
Volume ID: d2745193-58ff-4406-8f1e-d65bebdda017
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: silo:/export/wd2a
Brick2: hangar:/export/wd1a
Brick3: hangar:/export/wd3a
Brick4: debacle:/export/wd1a

2. On the glusterfs volume, fetch and unpack the NetBSD source tarballs from ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-6.0/source/sets/

3. Build NetBSD: cd usr/src/ && ./build.sh -Uum i386 release

Actual results:
A split brain occurs. Logs with debug enabled (see attachment) report an all-NULL pending matrix:

[afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gfs34-replicate-1: pending_matrix: [ 0 0 ]
[afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gfs34-replicate-1: pending_matrix: [ 0 0 ]

Expected results:
The build should complete without a hitch. This is what happens on a homogeneous cluster, or with eager-locks disabled.

Additional info:
The attachment contains the complete client log with debug enabled.
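For anyone trying to confirm the same symptom, a minimal sketch of how to inspect the split-brain state and apply the eager-lock workaround described above. This assumes the volume name gfs340 from the report and uses the standard gluster CLI; it is illustrative, not part of the original report:

```shell
# List files that AFR currently flags as split-brain on this volume
gluster volume heal gfs340 info split-brain

# Workaround reported above: disable eager locking on the volume
gluster volume set gfs340 cluster.eager-lock off

# Confirm the reconfigured option is now shown in the volume info output
gluster volume info gfs340
```

After setting cluster.eager-lock off, rerunning the NetBSD build stress test should no longer produce spurious split-brain files, per the report.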
REVIEW: http://review.gluster.org/6020 (Disable eager-locks on NetBSD for 3.4 branch) posted (#1) for review on release-3.4 by Emmanuel Dreyfus (manu)
COMMIT: http://review.gluster.org/6020 committed in release-3.4 by Vijay Bellur (vbellur)
------
commit 02ede06cbb00aef2ad1fbceb8c818c5d649ab512
Author: Emmanuel Dreyfus <manu>
Date: Wed Oct 2 06:07:23 2013 +0200

    Disable eager-locks on NetBSD for 3.4 branch

    As described in https://bugzilla.redhat.com/show_bug.cgi?id=1005526,
    eager-locks are broken on release-3.4, at least for NetBSD. This change
    disables them by default, leaving the admin the possibility to
    explicitly enable the feature if needed.

    BUG: 1005526
    Change-Id: I6f1b393865b103ec56ad5eb5143f59bb8672f19c
    Signed-off-by: Emmanuel Dreyfus <manu>
    Reviewed-on: http://review.gluster.org/6020
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
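Since this commit turns eager-locks off by default on NetBSD while leaving the option admin-controllable, an administrator who wants the feature back can enable it explicitly per volume. A minimal sketch, again assuming the volume name gfs340 from this report:

```shell
# Explicitly re-enable eager locking on a volume where it was defaulted off
gluster volume set gfs340 cluster.eager-lock on

# Check the current value of the option in the volume info output
gluster volume info gfs340 | grep eager-lock
```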
Emmanuel,

According to the recent mail on gluster-devel, this issue is not seen anymore on 3.5. Could you close this bug if that is the case?

Pranith
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailinglist [1]; packages for several distributions should already be, or will soon become, available. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.4.3. Along the same lines, the recent release glusterfs-3.5.0 [3] is likely to contain the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137