Bug 1281946
Summary: | Large system file distribution is broken | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Harold Miller <hamiller> | |
Component: | distribute | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> | |
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | rhgs-3.1 | CC: | annair, asrivast, bmohanra, bugzilla.redhat, byarlaga, jgeraert, mhergaar, rgowdapp, sabansal, sankarshan, spalai, srangana | |
Target Milestone: | --- | Keywords: | ZStream | |
Target Release: | RHGS 3.1.2 | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | triaged, fixed-in-upstream | |||
Fixed In Version: | glusterfs-3.7.5-14 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, the total size of the cluster was deduced and stored in an unsigned 32-bit variable. For large clusters this value could overflow, leading to incorrect computations; in some cases the layout itself overflowed and was not set correctly. With this fix, unsigned 64-bit variables are used to handle large values and files are distributed properly.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1282751 | Environment: | ||
Last Closed: | 2016-03-01 05:54:54 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1282751, 1294969 |
Description
Harold Miller
2015-11-13 21:39:08 UTC
The problem stems from the total volume size exceeding 4PB. Each brick contributes about 32TB of capacity, so the 144 replica pairs contribute about 4.5PB of space. DHT layout computation uses a count of 1MB chunks to denote the size of a single brick. When these chunks are totalled up, the int32 value overflows and causes incorrect chunk computation, giving rise to an overflowing layout every few bricks (the above layout sort order would also look slightly incorrect to DHT developer eyes, as a fresh layout should be sorted by subvolume name).

The function where this overflow occurs is dht_selfheal_layout_new_directory:
- total_size overflows when adding the chunks from each brick pair
- hence chunk becomes a larger value
- as a result we do not end up with disjoint layout ranges

To fix the issue, this computation needs to be fixed to handle total chunks beyond a 32-bit integer. Looking at possible solutions here.

To reproduce the customer situation, created a 20-brick setup, but changed posix_statfs to return synthesized brick sizes that would exceed 4PB. If this is done and the volume is mounted, DHT always detects that root has overlaps, attempts to correct them, and ends up hitting the same bug as described.

(In reply to Shyamsundar from comment #2)
>
> To fix the issue, this computation needs to be fixed to handle total chunks
> beyond a 32-bit integer. Looking at possible solutions here.

Won't using an unsigned 64-bit type for the variables total_size, chunks (and the relevant related variables) fix the issue? With 64 bits we can handle around 17179869184.0 PB, which should be sufficient.

Agree with Du. Won't a long data type replacement do the job?

The customer says "As this currently blocks our implementation of GlusterFS I raised this case to a SEV 2", so the BZ severity has also been raised to "high". If this fix is just modifying a variable type, can we expect a quick patch? It is a large Gluster system.

(In reply to Raghavendra G from comment #3)
> (In reply to Shyamsundar from comment #2)
> >
> > To fix the issue, this computation needs to be fixed to handle total chunks
> > beyond a 32-bit integer. Looking at possible solutions here.
>
> Won't using an unsigned 64-bit type for the variables total_size, chunks (and
> the relevant related variables) fix the issue? With 64 bits we can handle
> around 17179869184.0 PB, which should be sufficient.

Currently the max size is 0xffffffff. With the increase in the total size, would we need to increase the max size as well?

upstream fix: http://review.gluster.org/#/c/12597/
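To make the overflow concrete, here is a minimal standalone sketch. It is not the actual dht_selfheal_layout_new_directory() code; the constants SUBVOLS, CHUNKS_PER_SUB, HASH_SPACE and the proportional span calculation are illustrative assumptions based only on the numbers quoted in this report. With 144 replica pairs of ~32TB counted in 1MB chunks, a 32-bit total wraps past 4PB, so each subvolume is handed a hash-range span about 9x too large and successive ranges wrap around 0xffffffff and overlap, while a 64-bit total yields spans that fit the 32-bit hash space.

```c
/* Hypothetical illustration of the overflow described above; NOT the
 * actual GlusterFS code. 144 replica pairs of ~32TB each, with capacity
 * counted in 1MB chunks as described in the analysis. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define SUBVOLS        144                      /* replica pairs              */
#define CHUNKS_PER_SUB (32ULL * 1024 * 1024)    /* 32TB expressed in 1MB chunks */
#define HASH_SPACE     (1ULL << 32)             /* DHT hash range 0..0xffffffff */

int main(void)
{
    uint32_t total32 = 0;   /* pre-fix: 32-bit accumulator, wraps past 4PB */
    uint64_t total64 = 0;   /* post-fix: 64-bit accumulator                */

    for (int i = 0; i < SUBVOLS; i++) {
        total32 += (uint32_t)CHUNKS_PER_SUB;    /* unsigned wrap at 2^32 */
        total64 += CHUNKS_PER_SUB;
    }

    /* Hash-range span one subvolume would get, proportional to its share
     * of the (possibly wrapped) total chunk count. */
    uint64_t span32 = (CHUNKS_PER_SUB * HASH_SPACE) / total32;
    uint64_t span64 = (CHUNKS_PER_SUB * HASH_SPACE) / total64;

    printf("32-bit total: %-12" PRIu32 " span/subvol: 0x%08" PRIx64
           "  all subvols cover: 0x%" PRIx64 "\n",
           total32, span32, span32 * SUBVOLS);
    printf("64-bit total: %-12" PRIu64 " span/subvol: 0x%08" PRIx64
           "  all subvols cover: 0x%" PRIx64 "\n",
           total64, span64, span64 * SUBVOLS);

    /* With the wrapped 32-bit total the per-subvolume spans sum to 9x the
     * hash space, so ranges wrap around 0xffffffff and overlap; with the
     * 64-bit total they fit within it (a real implementation would also
     * distribute the small rounding remainder). */
    return 0;
}
```

The 17179869184 PB figure quoted above is simply 2^64 1MB chunks expressed in petabytes (2^64 / 2^30). Note that in this illustration the hash space itself stays 32-bit (max 0xffffffff); only the accumulated chunk count needs the wider type.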
Verified and works fine with build glusterfs-3.7.5-15.el7rhgs.x86_64.

Created a distributed-replicate volume with the following values; the volume had a total size of 5.3P.

gluster volume info output:
==========================
Volume Name: replvol01
Type: Distributed-Replicate
Volume ID: 12d377b2-60b0-44a0-bc9d-263245194e47
Status: Started
Number of Bricks: 136 x 2 = 272
Transport-type: tcp

The complete volume info can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/vol_info

I created 100000 files on the mounted GlusterFS volume (just a simple touch) with incremental names (i.e. tmpfile.000001, tmpfile.000002, etc.). After the creation had finished (verified with 'ls | wc -l' returning 100000):

Output from client:
==================
[root@dhcp37-75 ~]# df -TH
Filesystem                                       Type            Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp37--75-root                 xfs              48G  1.8G   46G   4% /
devtmpfs                                         devtmpfs        4.1G     0  4.1G   0% /dev
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /dev/shm
tmpfs                                            tmpfs           4.2G   77M  4.1G   2% /run
tmpfs                                            tmpfs           4.2G     0  4.2G   0% /sys/fs/cgroup
/dev/vda1                                        xfs             521M  216M  305M  42% /boot
tmpfs                                            tmpfs           821M     0  821M   0% /run/user/0
rhs-client2.lab.eng.blr.redhat.com:/vol_new      fuse.glusterfs  641G   43G  598G   7% /mnt/vol_new
rhs-arch-srv1.lab.eng.blr.redhat.com:/replvol01  fuse.glusterfs  5.3P   11G  5.3P   1% /mnt/replvol01
[root@dhcp37-75 ~]# cd /mnt/replvol01/
[root@dhcp37-75 replvol01]# ls | wc -l
100000

Saw that the files were getting distributed to all the bricks in the volume; none of the bricks in the volume are empty. Providing the output of 'ls -l /rhs/brick1/*' from all the nodes in the cluster. The output of the ls command and the logs are present in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/

Parsed the DHT range from the log files; the file can be found in the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1281946/hashrange_node1

Hi Sakshi,

The doc text is modified slightly. Do take a look and share your review comments, if any. If it looks ok then sign off on the same.

Looks good.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html
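As a side note on the hash-range parsing mentioned in the verification above, here is a hypothetical standalone sketch (not from the bug report or the GlusterFS sources; the type and function names are illustrative) of the property the parsed per-subvolume ranges are expected to satisfy: sorted by start, they should be disjoint and cover the full 32-bit hash space with no holes.

```c
/* Hypothetical check of DHT-style layout ranges: every [start, stop] pair
 * must be disjoint from the others and together they must tile the full
 * 32-bit hash space 0x00000000..0xffffffff. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t start;
    uint32_t stop;      /* inclusive upper bound of the range */
} hash_range;

static int cmp_range(const void *a, const void *b)
{
    const hash_range *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Returns 1 if the ranges tile the hash space exactly, 0 otherwise. */
static int layout_is_complete(hash_range *r, size_t n)
{
    qsort(r, n, sizeof(*r), cmp_range);

    uint64_t expected_start = 0;
    for (size_t i = 0; i < n; i++) {
        /* A start below expected_start means an overlap; above it, a hole. */
        if (r[i].start != expected_start || r[i].stop < r[i].start)
            return 0;
        expected_start = (uint64_t)r[i].stop + 1;
    }
    return expected_start == 0x100000000ULL;   /* last stop was 0xffffffff */
}

int main(void)
{
    /* Two made-up ranges splitting the hash space in half. */
    hash_range demo[] = {
        { 0x00000000, 0x7fffffff },
        { 0x80000000, 0xffffffff },
    };
    printf("layout complete: %s\n",
           layout_is_complete(demo, 2) ? "yes" : "no");
    return 0;
}
```

The same property can be eyeballed from hashrange_node1-style data by sorting the ranges by start and checking that each start equals the previous stop plus one, with the final stop at 0xffffffff.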