Bug 1356076 - DHT doesn't evenly balance files on FreeBSD with ZFS
Summary: DHT doesn't evenly balance files on FreeBSD with ZFS
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1411898 1411899 1411901 1422777
Reported: 2016-07-13 11:20 UTC by Xavi Hernandez
Modified: 2017-05-30 18:34 UTC (History)

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-06 17:20:39 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Xavi Hernandez 2016-07-13 11:20:55 UTC
Description of problem:

On a pure distributed volume with two bricks, one on a FreeBSD node using ZFS and the other on a Linux node, DHT places ten times more data on the FreeBSD node (30 TB vs 3 TB).

Version-Release number of selected component (if applicable): mainline


How reproducible:

Not sure

Steps to Reproduce:
1. Create a distributed volume with two bricks: one on a FreeBSD/ZFS node and the other on a CentOS node
2. Start copying files

Actual results:

Almost all files are placed on the FreeBSD node.

Expected results:

Roughly 50% of the files should be placed on each node.

Additional info:

A "gluster volume status detail" command reports a size for the FreeBSD filesystem that is much bigger than it really is (~256 times bigger). It also fails to detect the filesystem type and some other information:

    File System          : N/A
    Device               : N/A
    Mount Options        : N/A
    Inode Size           : N/A
    Disk Space Free      : 2.6PB
    Total Disk Space     : 12.6PB

The real brick size is 45 TB.

A statvfs() call on FreeBSD returns this:

    f_frsize: 512
    f_bsize:  131072

From statvfs() man page on FreeBSD:

    "The statvfs() and fstatvfs() functions fill the structure pointed to by
     buf with garbage.  This garbage will occasionally bear resemblance to
     file system statistics, but portable applications must not depend on
     this.  Applications must pass a pathname or file descriptor which refers
     to a file on the file system in which they are interested."

    "f_frsize   The size in bytes of the minimum unit of allocation on
                this file system.  (This corresponds to the f_bsize mem-
                ber of struct statfs.)"

    "f_bsize    The preferred length of I/O requests for files on this
                file system.  (Corresponds to the f_iosize member of
                struct statfs.)"

Gluster probably uses f_bsize as the block size, but on FreeBSD that field holds the optimal I/O size, not the block size. The block size is in f_frsize.
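
To illustrate the arithmetic, here is a minimal sketch (not Gluster code; both function names are made up for this example). POSIX defines f_blocks in units of f_frsize, so multiplying it by f_bsize inflates the result by f_bsize / f_frsize, which is 131072 / 512 = 256 with the values shown above:

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Capacity in bytes using the allocation unit. POSIX counts f_blocks
 * in units of f_frsize, so this is the portable calculation. */
static uint64_t
capacity_from_frsize(const struct statvfs *st)
{
    return (uint64_t)st->f_frsize * (uint64_t)st->f_blocks;
}

/* Capacity in bytes when f_bsize is mistaken for the block size.
 * On FreeBSD f_bsize is the preferred I/O size, so the result is
 * too large by a factor of f_bsize / f_frsize. */
static uint64_t
capacity_from_bsize(const struct statvfs *st)
{
    return (uint64_t)st->f_bsize * (uint64_t)st->f_blocks;
}
```

With f_frsize = 512 and f_bsize = 131072, the second function returns 256 times the value of the first, which matches the ~256x discrepancy seen in "gluster volume status detail".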

As a workaround, disabling 'weighted-rebalance' distributes files evenly between bricks.
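
A rough sketch of why an inflated size skews placement when bricks are weighted by size. This is only an illustration, not the DHT layout code: weighted_share() is a hypothetical helper, and the two bricks are assumed to have the same real size because the report does not give the size of the Linux brick.

```c
#include <stdint.h>

/* Fraction of new files a brick would receive if files were assigned
 * in proportion to the size each brick reports. Illustration only. */
static double
weighted_share(uint64_t brick_size, uint64_t total_size)
{
    if (total_size == 0)
        return 0.0;
    return (double)brick_size / (double)total_size;
}
```

For two bricks that really are the same size, each share is 0.5. If one of them reports 256 times its real size, its share becomes 256 / 257, so nearly every file lands on it.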

Comment 1 Jeff Darcy 2016-07-13 15:54:25 UTC
You're probably right, Xavier.  Unfortunately, Linux and FreeBSD seem to have some fundamental disagreements about what these fields mean, so we'll probably have to add some platform-conditional code in some of the several places that use them.  I also doubt that this is the last problem we'll find in OS-heterogeneous clusters.  :(

Comment 2 Xavi Hernandez 2016-07-14 06:17:25 UTC
I agree.

We are using wrapped system calls in many places right now (syscall.h). Maybe we should enforce the usage of these wrappers and place the specific OS code in syscall.c.

For this particular case we could solve the problem simply by setting f_bsize = f_frsize on FreeBSD.
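
Something along these lines, as a minimal sketch. The names normalize_statvfs() and portable_statvfs() are placeholders for this comment, not the actual wrapper names in syscall.c:

```c
#include <sys/statvfs.h>

/* Make f_bsize carry the allocation unit on every platform.
 * On FreeBSD f_bsize is the preferred I/O size, so overwrite it
 * with f_frsize. Other platforms are left untouched. */
static void
normalize_statvfs(struct statvfs *buf)
{
#if defined(__FreeBSD__)
    buf->f_bsize = buf->f_frsize;
#else
    (void)buf;
#endif
}

/* Hypothetical wrapper: call statvfs() and normalize on success. */
static int
portable_statvfs(const char *path, struct statvfs *buf)
{
    int ret = statvfs(path, buf);

    if (ret == 0)
        normalize_statvfs(buf);

    return ret;
}
```

Callers that multiply f_bsize by f_blocks would then get the same result on FreeBSD as on Linux, without platform checks at each call site.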

Comment 3 Worker Ant 2017-01-09 12:14:24 UTC
REVIEW: http://review.gluster.org/16361 (libglusterfs: fix statvfs in FreeBSD) posted (#1) for review on master by Xavier Hernandez (xhernandez@datalab.es)

Comment 4 Worker Ant 2017-01-09 12:18:34 UTC
REVIEW: http://review.gluster.org/16361 (libglusterfs: fix statvfs in FreeBSD) posted (#2) for review on master by Xavier Hernandez (xhernandez@datalab.es)

Comment 5 Worker Ant 2017-01-10 07:48:09 UTC
REVIEW: http://review.gluster.org/16361 (libglusterfs: fix statvfs in FreeBSD) posted (#3) for review on master by Xavier Hernandez (xhernandez@datalab.es)

Comment 6 Worker Ant 2017-01-10 17:08:05 UTC
COMMIT: http://review.gluster.org/16361 committed in master by Jeff Darcy (jdarcy@redhat.com) 
------
commit d6bc8da62f1b0d454fa5187687fdbf894403c7ce
Author: Xavier Hernandez <xhernandez@datalab.es>
Date:   Mon Jan 9 13:10:19 2017 +0100

    libglusterfs: fix statvfs in FreeBSD
    
    FreeBSD interprets statvfs' f_bsize field in a different way than Linux.
    
    This fix modifies the value returned by statvfs() on FreeBSD to match
    the expected value by Gluster.
    
    Change-Id: I930dab6e895671157238146d333e95874ea28a08
    BUG: 1356076
    Signed-off-by: Xavier Hernandez <xhernandez@datalab.es>
    Reviewed-on: http://review.gluster.org/16361
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>

Comment 7 Worker Ant 2017-02-01 09:06:54 UTC
REVIEW: https://review.gluster.org/16498 (extras/rebalance.py: Fix statvfs for FreeBSD in python) posted (#1) for review on master by Xavier Hernandez (xhernandez@datalab.es)

Comment 8 Worker Ant 2017-02-01 09:13:42 UTC
REVIEW: https://review.gluster.org/16498 (extras/rebalance.py: Fix statvfs for FreeBSD in python) posted (#2) for review on master by Xavier Hernandez (xhernandez@datalab.es)

Comment 9 Worker Ant 2017-02-07 12:51:10 UTC
COMMIT: https://review.gluster.org/16498 committed in master by Jeff Darcy (jdarcy@redhat.com) 
------
commit cafdab5e13d74130abab6dca4267778d22d7d7f4
Author: Xavier Hernandez <xhernandez@datalab.es>
Date:   Wed Feb 1 10:01:26 2017 +0100

    extras/rebalance.py: Fix statvfs for FreeBSD in python
    
    FreeBSD doesn't return the block size in f_bsize as linux does. It
    returns the optimal I/O size, so we need to consider this to avoid
    invalid results. On FreeBSD we take f_frsize as the block size.
    
    Change-Id: I72083d8ae183548439de874c77f1d60d9c2d14a7
    BUG: 1356076
    Signed-off-by: Xavier Hernandez <xhernandez@datalab.es>
    Reviewed-on: https://review.gluster.org/16498
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>

Comment 10 Shyamsundar 2017-03-06 17:20:39 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 11 Shyamsundar 2017-05-30 18:34:38 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

