Bug 1281946
Summary: | Large system file distribution is broken | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Harold Miller <hamiller> | |
Component: | distribute | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> | |
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | rhgs-3.1 | CC: | annair, asrivast, bmohanra, bugzilla.redhat, byarlaga, jgeraert, mhergaar, rgowdapp, sabansal, sankarshan, spalai, srangana | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.2 | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | triaged, fixed-in-upstream | |||
Fixed In Version: | glusterfs-3.7.5-14 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, the total size of the cluster was deduced and stored in an unsigned 32-bit variable. For large clusters this value could overflow, leading to incorrect computations; in some cases the layout itself overflowed and was not set correctly. With this fix, unsigned 64-bit variables are used to handle large values and files are distributed properly.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1282751 | Environment: | ||
Last Closed: | 2016-03-01 05:54:54 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1282751, 1294969 |
Description
Harold Miller
2015-11-13 21:39:08 UTC
The problem stems from the total volume size exceeding 4PB. Each brick contributes about 32TB of capacity, so the 144 replica pairs contribute about 4.5PB of space. DHT layout computation uses a count of 1MB chunks to denote the size of a single brick. When these chunks are totalled up, the int32 value overflows and causes incorrect chunk computation, giving rise to an overflowing layout every few bricks (the above layout sort order would also look slightly incorrect to DHT developer eyes, as a fresh layout should be sorted by subvolume name).

The function where this overflow occurs is dht_selfheal_layout_new_directory:
- total_size overflows when adding the chunks from each brick pair
- hence chunk becomes a larger value
- as a result we do not end up with disjoint layout ranges

To fix the issue, this computation needs to be fixed to handle total chunks beyond a 32-bit integer. Looking at possible solutions here.

To reproduce the customer situation, created a 20-brick setup, but changed posix_statfs to return synthesized brick sizes that would exceed 4PB. If this is done and the volume is mounted, DHT always detects that root has overlaps, attempts to correct them, and ends up hitting the same bug as described.

(In reply to Shyamsundar from comment #2)
>
> To fix the issue, this computation needs to be fixed to handle total chunks
> beyond a 32-bit integer. Looking at possible solutions here.

Won't using an unsigned 64-bit type for the variables total_size, chunks (and the relevant related variables) fix the issue? With 64 bits we can handle around 17179869184.0 PB, which should be sufficient.

Agree with Du. Won't a long data type replacement do the job?

The customer says "As this currently blocks our implementation of GlusterFS I raised this case to a SEV 2", so the BZ severity has also been raised to "high". If this fix is just modifying a variable type, can we expect a quick patch? It is a large Gluster system.

(In reply to Raghavendra G from comment #3)
> (In reply to Shyamsundar from comment #2)
> >
> > To fix the issue, this computation needs to be fixed to handle total chunks
> > beyond a 32-bit integer. Looking at possible solutions here.
>
> Won't using an unsigned 64-bit type for the variables total_size, chunks (and
> the relevant related variables) fix the issue? With 64 bits we can handle
> around 17179869184.0 PB, which should be sufficient.

Currently the max size is 0xffffffff. With the increase in the total size, would we need to increase the max size as well?

upstream fix: http://review.gluster.org/#/c/12597/
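To make the overflow concrete, here is a minimal standalone sketch. It is not the actual dht_selfheal_layout_new_directory() code; the constants SUBVOLS, CHUNKS_PER_SUB, HASH_SPACE and the proportional span calculation are illustrative assumptions based only on the numbers quoted in this report. With 144 replica pairs of ~32TB counted in 1MB chunks, a 32-bit total wraps past 4PB, so each subvolume is handed a hash-range span about 9x too large and successive ranges wrap around 0xffffffff and overlap, while a 64-bit total yields spans that fit the 32-bit hash space.

```c
/* Hypothetical illustration of the overflow described above; NOT the
 * actual GlusterFS code. 144 replica pairs of ~32TB each, with capacity
 * counted in 1MB chunks as described in the analysis. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SUBVOLS        144                      /* replica pairs              */
#define CHUNKS_PER_SUB (32ULL * 1024 * 1024)    /* 32TB expressed in 1MB chunks */
#define HASH_SPACE     (1ULL << 32)             /* DHT hash range 0..0xffffffff */

int main(void)
{
    uint32_t total32 = 0;   /* pre-fix: 32-bit accumulator, wraps past 4PB */
    uint64_t total64 = 0;   /* post-fix: 64-bit accumulator                */

    for (int i = 0; i < SUBVOLS; i++) {
        total32 += (uint32_t)CHUNKS_PER_SUB;    /* unsigned wrap at 2^32 */
        total64 += CHUNKS_PER_SUB;
    }

    /* Hash-range span one subvolume would get, proportional to its share
     * of the (possibly wrapped) total chunk count. */
    uint64_t span32 = (CHUNKS_PER_SUB * HASH_SPACE) / total32;
    uint64_t span64 = (CHUNKS_PER_SUB * HASH_SPACE) / total64;

    printf("32-bit total: %-12" PRIu32 " span/subvol: 0x%08" PRIx64
           "  all subvols cover: 0x%" PRIx64 "\n",
           total32, span32, span32 * SUBVOLS);
    printf("64-bit total: %-12" PRIu64 " span/subvol: 0x%08" PRIx64
           "  all subvols cover: 0x%" PRIx64 "\n",
           total64, span64, span64 * SUBVOLS);

    /* With the wrapped 32-bit total the per-subvolume spans sum to 9x the
     * hash space, so ranges wrap around 0xffffffff and overlap; with the
     * 64-bit total they fit within it (a real implementation would also
     * distribute the small rounding remainder). */
    return 0;
}
```

The 17179869184 PB figure quoted above is simply 2^64 1MB chunks expressed in petabytes (2^64 / 2^30). Note that in this illustration the hash space itself stays 32-bit (max 0xffffffff); only the accumulated chunk count needs the wider type.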
Verified and works fine with build glusterfs-3.7.5-15.el7rhgs.x86_64.

Created a distributed-replicate volume with the following values; the volume had a total size of 5.3P.

gluster volume info output:
==========================
Volume Name: replvol01
Type: Distributed-Replicate
Volume ID: 12d377b2-60b0-44a0-bc9d-263245194e47
Status: Started
Number of Bricks: 136 x 2 = 272
Transport-type: tcp

The complete volume info can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/vol_info

I created 100000 files on the mounted GlusterFS volume (just a simple touch) with incremental names (i.e. tmpfile.000001, tmpfile.000002, etc.). After the creation had finished (verified with 'ls | wc -l' returning 100000):

Output from client:
==================
[root@dhcp37-75 ~]# df -TH
Filesystem                                       Type            Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp37--75-root                 xfs              48G  1.8G   46G   4% /
devtmpfs                                         devtmpfs        4.1G     0  4.1G   0% /dev
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /dev/shm
tmpfs                                            tmpfs           4.2G   77M  4.1G   2% /run
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /sys/fs/cgroup
/dev/vda1                                        xfs             521M  216M  305M  42% /boot
tmpfs                                            tmpfs           821M     0  821M   0% /run/user/0
rhs-client2.lab.eng.blr.redhat.com:/vol_new      fuse.glusterfs  641G   43G  598G   7% /mnt/vol_new
rhs-arch-srv1.lab.eng.blr.redhat.com:/replvol01  fuse.glusterfs  5.3P   11G  5.3P   1% /mnt/replvol01
[root@dhcp37-75 ~]# cd /mnt/replvol01/
[root@dhcp37-75 replvol01]# ls | wc -l
100000

Saw that the files were getting distributed to all the bricks in the volume; none of the bricks in the volume are empty. Providing the output of 'ls -l /rhs/brick1/*' from all the nodes in the cluster. The output of the ls command and the logs are present in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/

Parsed the DHT range from the log files; the file can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/hashrange_node1

Hi Sakshi,

The doc text is modified slightly. Do take a look and share your review comments, if any. If it looks ok then sign off on the same.

Looks good.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
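As a side note on the hash-range parsing mentioned in the verification above, here is a hypothetical standalone sketch (not from the bug report or the GlusterFS sources; the type and function names are illustrative) of the property the parsed per-subvolume ranges are expected to satisfy: sorted by start, they should be disjoint and cover the full 32-bit hash space with no holes.

```c
/* Hypothetical check of DHT-style layout ranges: every [start, stop] pair
 * must be disjoint from the others and together they must tile the full
 * 32-bit hash space 0x00000000..0xffffffff. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t start;
    uint32_t stop;      /* inclusive upper bound of the range */
} hash_range;

static int cmp_range(const void *a, const void *b)
{
    const hash_range *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Returns 1 if the ranges tile the hash space exactly, 0 otherwise. */
static int layout_is_complete(hash_range *r, size_t n)
{
    qsort(r, n, sizeof(*r), cmp_range);

    uint64_t expected_start = 0;
    for (size_t i = 0; i < n; i++) {
        /* A start below expected_start means an overlap; above it, a hole. */
        if (r[i].start != expected_start || r[i].stop < r[i].start)
            return 0;
        expected_start = (uint64_t)r[i].stop + 1;
    }
    return expected_start == 0x100000000ULL;   /* last stop was 0xffffffff */
}

int main(void)
{
    /* Two made-up ranges splitting the hash space in half. */
    hash_range demo[] = {
        { 0x00000000, 0x7fffffff },
        { 0x80000000, 0xffffffff },
    };
    printf("layout complete: %s\n",
           layout_is_complete(demo, 2) ? "yes" : "no");
    return 0;
}
```

The same property can be eyeballed from hashrange_node1-style data by sorting the ranges by start and checking that each start equals the previous stop plus one, with the final stop at 0xffffffff.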