Description of problem:
On a 96-node, 288-brick distributed volume with 100K files, we end up with 256 bricks holding 'zero' files and 32 bricks holding approx. 6000 files each.

Version-Release number of selected component (if applicable): glusterfs-3.7.1-16.el7rhgs.x86_64

How reproducible: customer reproduces at will.

A brief sketch of the environment:
- 100+ Red Hat Gluster Storage 3.1 nodes
- Each node has 3 bricks
- Volume 'replvol01' has been created:
    Type: Distributed-Replicate
    Nodes: 96 nodes are part of this volume
    Number of Bricks: 144 x 2 = 288
    Options reconfigured:
      performance.readdir-ahead: on
      server.allow-insecure: on
- The volume is started and has all bricks online

The problem:
I created 100000 files on the mounted GlusterFS volume (just a simple touch) with incremental names (i.e. tmpfile.000001, tmpfile.000002, etc.). After the creation finished (verified with 'ls | wc -l' returning 100000), I counted the number of files created on _each_ brick.

Results:
- bricks with 'zero' files: 256 bricks
- bricks with approx. 6000 files: 32 bricks

I would expect the files to be distributed over _ALL_ bricks (taking the brick replica-pairs into account). I don't expect the distribution to be exact; some bricks will have more files than others. But with the current configuration, 256 bricks are simply unused! Instead of 32 bricks having about 6000 files each, I would expect 288 bricks to have between 300 and 700 files each.

As soon as I mount the GlusterFS volume, the GlusterFS FUSE client logs some details. It shows the 'layout' of '/' (or of any other directory that you create):

[2015-11-13 14:18:33.516120] I [MSGID: 109036] [dht-common.c:7804:dht_log_new_layout_for_dir_selfheal] 0-replvol01-dht: Setting layout of / with ... dump of the layout array/hash ...
From what I understand from this output, it shows the following for each subvolume (replication-pair):
- Name: relates to the entry in the .vol file
- Start: starting position on the distribution 'ring'
- Stop: ending position on the distribution 'ring'
- Hash: no idea
- Err: error state? (-1 everywhere)

Since the start and stop positions fall between 0 and 4,294,967,295, I assume this is the 32-bit integer space which the DHT uses to distribute data over nodes. Each 'brick' (or replication-pair) should then get an equally sized, disjoint range of that 32-bit space.

When I parse the layout array/hash which has been logged, I see _a lot_ of overlap. I will attach the whole list to the case, but for now I'll just show the first entries. The list is sorted on 'start' position:

Subvol_Name                   Start        End
replvol01-replicate-92            0  268419087
replvol01-replicate-79    266323984  534743071
replvol01-replicate-64    266585872  535004959
replvol01-replicate-5     266847760  535266847
replvol01-replicate-35    267109648  535528735
replvol01-replicate-20    267371536  535790623
replvol01-replicate-135   267633424  536052511
replvol01-replicate-120   267895312  536314399
replvol01-replicate-106   268157200  536576287
replvol01-replicate-93    268419088  536838175
replvol01-replicate-8     534743072  803162159
replvol01-replicate-65    535004960  803424047
replvol01-replicate-50    535266848  803685935
replvol01-replicate-36    535528736  803947823
replvol01-replicate-21    535790624  804209711
replvol01-replicate-136   536052512  804471599
replvol01-replicate-121   536314400  804733487
replvol01-replicate-107   536576288  804995375
replvol01-replicate-94    536838176  805257263

Each time, about 11 bricks share the same range on the ring. I verified 22 bricks; all contain 0 files, except for one! The delta between the start and end positions is always 268419087, which is roughly 1/16th of the 32-bit range. That might indicate that 16 bricks/replication-pairs per volume is the maximum for optimal distribution.
The problem stems from the total volume size exceeding 4PB. Each brick contributes about 32TB of capacity, so the 144 replica pairs contribute about 4.5PB of space. The DHT layout computation uses a count of 1MB chunks to denote the size of a single brick. When these chunks are totaled, the 32-bit integer value overflows, causing an incorrect chunk computation and giving rise to overlapping layout ranges every few bricks. (The sort order of the layout above would look slightly incorrect to DHT developers: since it is a fresh layout, it should be sorted by subvolume name.)

The function where this overflow occurs is dht_selfheal_layout_new_directory: total_size overflows while adding the chunks from each brick pair, so the computed chunk size becomes a much larger value, and as a result we do not end up with disjoint layout ranges.

To fix the issue, this computation needs to handle total chunk counts beyond a 32-bit integer. Looking at possible solutions here.

To reproduce the customer's situation, I created a 20-brick setup but changed posix_statfs to return synthesized brick sizes that would exceed 4PB in total. With that in place, once the volume is mounted, DHT always detects that root has overlaps, attempts to correct them, and runs into the same bug as described.
(In reply to Shyamsundar from comment #2)
> To fix the issue, this computation needs to be fixed to handle total chunks
> beyond 32 bit integer. Looking at possible solutions here.

Won't using an unsigned 64-bit type for the variables total_size and chunks (and related variables) fix the issue? With 64 bits we can handle around 17,179,869,184 PB, which should be sufficient.
Agree with Du. Won't replacing the type with a 64-bit one do the job?
The customer says "As this currently blocks our implementation of GlusterFS, I raised this case to a SEV 2", so the BZ severity has also been raised to "high". If this fix is just modifying a variable type, can we expect a quick patch? It is a large Gluster system.
(In reply to Raghavendra G from comment #3)
> (In reply to Shyamsundar from comment #2)
> >
> > To fix the issue, this computation needs to be fixed to handle total chunks
> > beyond 32 bit integer. Looking at possible solutions here.
>
> Won't using an unsigned 64 bit type for variables total_size, chunks (and
> relevant variables) fix the issue? With 64 bit, we can handle around
> 17179869184.0 PB, which should be sufficient.

Currently the max size is 0xffffffff. With the increase in the total size, would we need to increase the max size as well?
upstream fix: http://review.gluster.org/#/c/12597/
Verified and works fine with build glusterfs-3.7.5-15.el7rhgs.x86_64.

Created a distributed-replicate volume with the following values; the volume had a total size of 5.3P.

gluster volume info output:
==========================
Volume Name: replvol01
Type: Distributed-Replicate
Volume ID: 12d377b2-60b0-44a0-bc9d-263245194e47
Status: Started
Number of Bricks: 136 x 2 = 272
Transport-type: tcp

The complete volume info can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/vol_info

I created 100000 files on the mounted GlusterFS volume (just a simple touch) with incremental names (i.e. tmpfile.000001, tmpfile.000002, etc.). The creation finished successfully (verified with 'ls | wc -l' returning 100000).

Output from client:
==================
[root@dhcp37-75 ~]# df -TH
Filesystem                                       Type            Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp37--75-root                 xfs              48G  1.8G   46G   4% /
devtmpfs                                         devtmpfs        4.1G     0  4.1G   0% /dev
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /dev/shm
tmpfs                                            tmpfs           4.2G   77M  4.1G   2% /run
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /sys/fs/cgroup
/dev/vda1                                        xfs             521M  216M  305M  42% /boot
tmpfs                                            tmpfs           821M     0  821M   0% /run/user/0
rhs-client2.lab.eng.blr.redhat.com:/vol_new      fuse.glusterfs  641G   43G  598G   7% /mnt/vol_new
rhs-arch-srv1.lab.eng.blr.redhat.com:/replvol01  fuse.glusterfs  5.3P   11G  5.3P   1% /mnt/replvol01
[root@dhcp37-75 ~]# cd /mnt/replvol01/
[root@dhcp37-75 replvol01]# ls | wc -l
100000

Saw that the files were distributed to all the bricks in the volume; none of the bricks in the volume are empty. Providing the output of 'ls -l /rhs/brick1/*' from all the nodes in the cluster. The output of the ls command and the logs are present in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/

Parsed the DHT ranges from the log files; the resulting file can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/hashrange_node1
Hi Sakshi, the doc text has been modified slightly. Please take a look and share your review comments, if any. If it looks OK, then sign off on the same.
Looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html