Description of problem:

If the bricks in a distributed volume (and likely the volume as a whole) are more than half full, migrating data with "remove-brick start" will fail, even if the remaining bricks in the volume as a whole have enough space to absorb the migration. Issuing the "remove-brick start" command multiple times (waiting for completion each time) appears to be a workaround. Once each migration is complete (with or without failures), the used disk space on each brick drops down to the expected level.

Steps to Reproduce:
1. Start with a 4x2 distribute-replicate volume that is about 60% full.
2. Initiate a remove-brick start command for the last brick pair.
3. Note that eventually all remaining bricks will fill up and failures will be logged to $volume-rebalance. There will be many failures in the 'status' output, because the bricks will fill up long before all the files are migrated.
4. When the operation completes, initiate it again. Another cycle may be required to finally complete without error.
5. Note that Bug 862332 has probably also occurred.
6. Issue the commit operation to eliminate the effects of Bug 862332.

Expected results:

The removal migration should do one of two things:
1) Use less disk space to begin with -- on a 4x2 -> 3x2 removal migration, each remaining brick should only see additional usage approximately equivalent to one third of the used space of the removed brick.
2) Do automatic disk space reclamation when the system notices that all brick space is used. This reclamation already happens when the migration completes; can it be done in the middle of the operation too?
Hi Shawn,

Migration of files is based on the hash of their names and not on their size. So, it is possible that one or more of the distribute subvolumes gets full as a result of remove-brick migration. Can you please attach the remove-brick log (<volname>-rebalance.log) from when the failures are detected?
I will set all this up again and attach the log. In the meantime, some additional info: I used photos, text, and very small files containing their metadata. The largest files were only a few megabytes. File size was not an issue. It's not 'one or more' of the remaining bricks that gets full during a remove-brick migration on a 60% full volume. It's every single one of them.

Another sample test you can do with a volume that's less than half full, to fully illustrate the underlying problem here:
* Set up a 4x2 volume where each brick is 1GB.
* Fill it 25% full so that all the bricks have about 250MB on them.
* Initiate a migration that removes the last brick pair.

In this instance, the six remaining bricks will all more than double in size during the migration, rather than increase to the approximately 333 MB that I would have expected. When the migration is complete (before the commit), space will be reclaimed and all the bricks will drop in size to somewhere close to the expected 333 MB.

It's worth noting that when I first did the "less than half full" test I have outlined here, I had a positive failure count in the 'status' screen, even though I did not run out of disk space during that test. At that time, I did not attempt to look into the failures at all. The extreme disk space discrepancy led me to repeat the test with a 60% full volume to find out if I would have reason to file this bug. That's when I really looked into the failures.
Another note: so far this is all testbed. We are very close to buying production hardware and rolling out though.
In order to save time for a repeat of my 60% full test, I have re-added my fourth brick pair to my volume, and I am in the process of rebalancing to put data back on the fourth brick pair. The remainder of this comment concerns this rebalance, not the original problem.

I have noticed a very similar disk space problem while doing the initial rebalance. It uses considerably more disk space than it should, then when the rebalance is complete, disk usage drops. In order to get the rebalance to work fully, I had to run it several times. Also, there are TONS of failed migrations in the log for all of the rebalance runs. There does not seem to be any data loss, though.

[dht-rebalance.c:1194:gf_defrag_migrate_data] 0-qasmb-dht: migrate-data failed for /CNW/012/484.thm
Information just before starting the remove-brick:

[root@testb1 ~]# gluster volume info

Volume Name: qasmb
Type: Distributed-Replicate
Volume ID: 86938037-e1bb-466f-b353-8a1bd939a345
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/qasmb
Brick2: testb2:/bricks/b1/qasmb
Brick3: testb1:/bricks/b2/qasmb
Brick4: testb2:/bricks/b2/qasmb
Brick5: testb1:/bricks/b3/qasmb
Brick6: testb2:/bricks/b3/qasmb
Brick7: testb1:/bricks/b4/qasmb
Brick8: testb2:/bricks/b4/qasmb

[root@testb1 ~]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3500440  43521024   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  673036    365300  65% /bricks/b1
/dev/sdb3                      1038336  673916    364420  65% /bricks/b2
/dev/sdc3                      1038336  669692    368644  65% /bricks/b3
/dev/sdd3                      1038336  670464    367872  65% /bricks/b4
testb:qasmb                    4153344 2687488   1465856  65% /shared/qasmb

[root@testb1 ~]# uname -a
Linux testb1 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Command used to initiate removal:

gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb start
df after only three minutes. Note that space usage on brick 4 has only dropped by 200MB, but disk usage on brick 2 is up by nearly 300MB and the others are not far behind:

[root@testb1 ~]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3350020  43671444   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  787812    250524  76% /bricks/b1
/dev/sdb3                      1038336  965928     72408  94% /bricks/b2
/dev/sdc3                      1038336  922104    116232  89% /bricks/b3
/dev/sdd3                      1038336  470708    567628  46% /bricks/b4
testb:qasmb                    4153344 3148288   1005056  76% /shared/qasmb
Shortly after the previous comment, failures began showing up in the log, and df looked like this. Now brick 2 is completely full:

[root@testb1 ~]# df -k
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3351872  43669592   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  957344     80992  93% /bricks/b1
/dev/sdb3                      1038336 1038328         8 100% /bricks/b2
/dev/sdc3                      1038336  994952     43384  96% /bricks/b3
/dev/sdd3                      1038336  330092    708244  32% /bricks/b4
testb:qasmb                    4153344 3320832    832512  80% /shared/qasmb
After the first run completed, the status and df looked like this. There were 4266 migration failures, and brick 4 is still 25% full. Will attach the rebalance log. Note that the used space on both the volume as a whole as well as the bricks has dropped significantly since the last df showing brick 2 completely full.

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
     Node  Rebalanced-files        size   scanned  failures       status
---------  ----------------  ----------  --------  --------  -----------
localhost              8452   641210309     37789      4266    completed
   testb4                 0           0     36527         0  not started
   testb3                 0           0     36526         0  not started
   testb2                 0           0     36501         0    completed

[root@testb1 ~]# df -k
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3358100  43663364   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  838728    199608  81% /bricks/b1
/dev/sdb3                      1038336  811140    227196  79% /bricks/b2
/dev/sdc3                      1038336  784172    254164  76% /bricks/b3
/dev/sdd3                      1038336  255828    782508  25% /bricks/b4
testb:qasmb                    4153344 2690560   1462784  65% /shared/qasmb
Moving the log before attaching it here:

[root@testb1 ~]# mv /var/log/glusterfs/qasmb-rebalance.log ~/first-run-qasmb-rebalance.log
Created attachment 621071 [details]
The rebalance log after the first remove-brick run.

This is the full rebalance log after the first remove-brick run completed. I did not run into Bug 862332 at this point. I will watch for that bug on subsequent runs.
Partway through the process, the bricks once again began to fill up:

[root@testb1 glusterfs]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3361468  43659996   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  838756    199580  81% /bricks/b1
/dev/sdb3                      1038336 1038336         0 100% /bricks/b2
/dev/sdc3                      1038336  993868     44468  96% /bricks/b3
/dev/sdd3                      1038336  189388    848948  19% /bricks/b4
testb:qasmb                    4153344 3060352   1092992  74% /shared/qasmb
When the second run was complete, there were still failures and brick 4 is still not empty. Once again, disk usage dropped dramatically.

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
     Node  Rebalanced-files        size   scanned  failures       status
---------  ----------------  ----------  --------  --------  -----------
localhost              6039   450055294     37910      1522    completed
   testb4                 0           0     36527         0  not started
   testb3                 0           0     36526         0  not started
   testb2                 0           0     36580         0    completed

[root@testb1 glusterfs]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3365056  43656408   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  920084    118252  89% /bricks/b1
/dev/sdb3                      1038336  862764    175572  84% /bricks/b2
/dev/sdc3                      1038336  829144    209192  80% /bricks/b3
/dev/sdd3                      1038336   79420    958916   8% /bricks/b4
testb:qasmb                    4153344 2692096   1461248  65% /shared/qasmb

[root@testb1 glusterfs]# find /bricks/b4 -type f | wc -l
3336
Created attachment 621085 [details]
rebalance log for second remove-brick run

[root@testb1 ~]# mv /var/log/glusterfs/qasmb-rebalance.log ~/second-run-qasmb-rebalance.log
On the third run, there were no failures, but it did not remove all of the files from the fourth brick.

[root@testb1 glusterfs]# find /bricks/b4 -type f | wc -l
2208

[root@testb1 glusterfs]# df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3366508  43654956   8% /
tmpfs                          1914332       0   1914332   0% /dev/shm
/dev/md1                       1032076  127836    851812  14% /boot
/dev/sda3                      1038336  875540    162796  85% /bricks/b1
/dev/sdb3                      1038336  898612    139724  87% /bricks/b2
/dev/sdc3                      1038336  878524    159812  85% /bricks/b3
/dev/sdd3                      1038336   38796    999540   4% /bricks/b4
testb:qasmb                    4153344 2692224   1461120  65% /shared/qasmb

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
     Node  Rebalanced-files        size   scanned  failures       status
---------  ----------------  ----------  --------  --------  -----------
localhost              1522   104671121     37344         0    completed
   testb3                 0           0     36526         0  not started
   testb4                 0           0     36527         0  not started
   testb2                 0           0     36728         0    completed
Created attachment 621088 [details] The log from the third run.
Running the remove-brick a fourth time resulted in no files getting moved, yet there are still over 2000 files on the bricks that are being removed. It looks like I'm going to have to start completely over.
Created attachment 621089 [details] The rest of the gluster logs, covering all four remove-brick runs
You'll notice testb3 and testb4 in the status output and possibly elsewhere. These are Fedora 17 machines that are part of the cluster. The purpose of these machines is to provide NFS, Samba, and UFO access to the volume. They are Fedora so that we can have newer versions of software for the centralized access, while using a more stable distro (CentOS 6) for the gluster volume itself.
After committing the remove-brick, I checked into the files remaining on brick 4. Despite the fact that they were not deleted from brick 4, they are still accessible from the client mount, and it appears that all data is intact as well, even after completely unmounting those bricks.
The failures are because you have run out of space in a few of the subvolumes. Rebalance/remove-brick fails the migration in such cases.

[2012-10-03 12:47:09.330674] E [dht-rebalance.c:367:__dht_check_free_space] 0-qasmb-dht: data movement attempted from node (qasmb-replicate-3) with to node (qasmb-replicate-1) which does not have required free space for /CNW/004/660
[2012-10-03 12:47:09.330882] E [dht-rebalance.c:1194:gf_defrag_migrate_data] 0-qasmb-dht: migrate-data failed for /CNW/004/660

As mentioned before, rebalance/remove-brick does not depend on file sizes. All it depends on is the hash of the file names. Every directory's hash range is spread across the subvolumes, and each file's hash value determines which brick/subvolume's directory holds it. Even if some subvolume has free space, if it is not the hashed target, files do not get migrated to it. Additionally, remove-brick exacerbates the problem here, as space is being taken away from the cluster.

To migrate data more safely, please do an add-brick (with empty bricks) followed by a remove-brick (if decommissioning is needed) or a rebalance.
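The placement rule described above can be sketched as a toy Python model. This is only an illustration of the idea, not GlusterFS's actual DHT implementation: the md5 hash, the modulo split, and the subvolume names are all assumptions made for the sketch. The point it demonstrates is that the target subvolume is a pure function of the file name, so size and free space never enter the decision:

```python
import hashlib

# Hypothetical subvolume list; real DHT assigns contiguous hash ranges
# per directory rather than a simple modulo.
SUBVOLUMES = ["replicate-0", "replicate-1", "replicate-2"]

def hashed_subvolume(filename):
    """Pick a subvolume purely from the file name's hash -- file size
    and free space play no part in the choice."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return SUBVOLUMES[h % len(SUBVOLUMES)]

# Files migrating off a removed brick land on whichever subvolume owns
# their hash, even if that subvolume is nearly full and others are empty.
placements = {name: hashed_subvolume(name)
              for name in ["660", "484.thm", "photo1.jpg", "photo2.jpg"]}
```

Under this model, a rename is the only thing that would change a file's target, which is consistent with the behavior reported in this bug: migration keeps pushing files at a full subvolume because that subvolume is still the hashed owner.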
New round of testing. Here's information showing the empty volume. This volume is a LOT bigger than my previous one. It will be a while before I post anything more for this round, because it's going to take a while to fill this up.

[root@testb1 ~]# df
Filesystem                   1K-blocks    Used  Available Use% Mounted on
/dev/mapper/vg_main-lv_root   49537840 3371060   43650404   8% /
tmpfs                          1914332       0    1914332   0% /dev/shm
/dev/md1                       1032076  127836     851812  14% /boot
/dev/sda3                    922833364   33168  922800196   1% /bricks/b1
/dev/sdb3                    922833364   33168  922800196   1% /bricks/b2
/dev/sdc3                    922833364   33168  922800196   1% /bricks/b3
/dev/sdd3                    922833364   33168  922800196   1% /bricks/b4
testb:testvol               3691333376  132608 3691200768   1% /shared/test

[root@testb1 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 182df850-96f3-4d69-95b9-18e9ea409dfb
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/testvol
Brick2: testb2:/bricks/b1/testvol
Brick3: testb1:/bricks/b2/testvol
Brick4: testb2:/bricks/b2/testvol
Brick5: testb1:/bricks/b3/testvol
Brick6: testb2:/bricks/b3/testvol
Brick7: testb1:/bricks/b4/testvol
Brick8: testb2:/bricks/b4/testvol

[root@testb1 ~]# find /shared/test -type f | wc -l
0
I looked at comment 21, posted at 2012-10-04 03:17:52 EDT. The parts that I can fully understand seem irrelevant to the discussion. How about we take a step back and I will explain the problem I am seeing with less verbosity. At the beginning of my previous test (see comment 5), I had four bricks, each with 670MB of space used. I asked it to remove one brick (and its replica). The entire volume had 1.4GB of space available. If you ignore the free space showing on the last brick (which is the one being removed), it actually had a little over 1GB of space available. It was my expectation that the brick removal would complete without error in one pass, because it only had to fit 670MB into 1GB, and none of the files were larger than about 8 MB. In reality, it took three passes, because the bricks ran out of space on the first two passes. Is the behavior that I saw what you would expect? Is there any way to have it work according to my expectation? When the time comes for me to actually perform this operation, running out of space on the bricks could prove fatal, because the filesystem will be online and there will be data constantly being written to it.
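The expectation stated above can be checked with simple arithmetic. The numbers are taken from the df output earlier in the thread; the per-brick split assumes a perfectly even redistribution, which (as the engineers point out) hash-based DHT placement does not guarantee:

```python
# ~670 MB was used on the removed brick pair (from the earlier df).
removed_used_mb = 670
remaining_subvols = 3  # subvolumes left after a 4x2 -> 3x2 remove-brick

# If redistribution were even, each remaining subvolume would absorb
# about a third of the removed data.
expected_gain_mb = removed_used_mb / remaining_subvols  # ~223 MB

# Each brick is 1038336 KB (~1014 MB); an even split would end near
# 670 + 223 = ~893 MB used, comfortably under capacity, so a single
# clean pass "should" fit without running out of space.
brick_size_mb = 1038336 / 1024
expected_final_mb = removed_used_mb + expected_gain_mb
fits_in_one_pass = expected_final_mb < brick_size_mb
```

This is exactly the reporter's point: by capacity alone the removal should finish in one pass, so the multi-pass, fill-everything behavior must come from the placement algorithm rather than from a genuine shortage of space.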
Hi Shawn,

Though you had ~1GB free in the cluster, when remove-brick starts migrating data, each file goes to the brick pair that owns the required hash range (based on the file name). Perhaps due to your file name pattern, the files rehash to a particular brick pair. That is why you seem to be running out of space on those brick pairs, even though you have enough space in the cluster. Files will only get migrated to a brick pair with free space if their file names hash into the range assigned to that brick pair. The behavior you are experiencing is in sync with the current design of dht/remove-brick. So, a better approach to doing a remove-brick is to do an add-brick (empty bricks) first and then do the remove-brick.
What you are saying makes sense, but still doesn't explain what I am seeing. Take a look at comment 7 as it compares to comment 5. About 200 MB of data has been moved off of brick 4, but well over 500 MB of additional space has been consumed on the other three bricks. That's nearly triple the disk space. Later in the process all three bricks did fill up completely, but the migration wasn't even half done.

There are thousands and thousands of files on this system. Most of them are only a few kilobytes. About a fifth of them are jpg photos of typical digital camera quality. I don't have exact numbers, but if any of them is larger than 8 MB I would be very surprised. There are no very large files that could fill up one brick too fast.
Sorry about the delay in responding with updates. We are trying to reproduce the setup and observe the behavior ourselves; will update you soon about the findings.
I completed another test with failures while trying to reproduce bug 862332. Logs and details can be found on that ticket.
I am wondering if a workaround in the meantime might be to start a remove-brick, abort it before the other bricks fill up, then repeat until it's done.
On the IRC channel, they are talking about the cluster.min-free-disk option in relation to some other problem. I don't have time to look into it right now, but is it possible that setting this option might keep a rebalance or remove-brick from filling up all the brick space? http://gluster.org/community/documentation/index.php/Gluster_3.2:_Setting_Volume_Options#cluster.min-free-disk
I've seen the same problem (in my case, removing a replica pair of bricks from a 7x2 volume). Running 3.3.1 with the min-free-disk patch from #874554.

> Migration of files is based on their hash names and not on their size. So, it is possible that one or more of the distribute subvolume gets full as a result of remove-brick migration.

Seems like the rebalance (I presume) that happens during remove-brick ignores the cluster.min-free-disk option, and fills up complete subvolumes regardless of the amount of space free across the cluster. Which would be okay if failures were queued to retry later in the process (e.g. moving files from A > B; B fills up, failures happen; then B > C rebalancing happens; then A > B retries happen, since now there is space).

The docs (or anything else) don't make it clear what 'failures' mean as shown by 'remove-brick status' either - should you *not* commit and keep running 'remove-brick start' until it completes without errors?
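The retry idea floated above (queue failed moves and reattempt them once later moves free up space) can be sketched as a toy simulation. This is a proposal illustration only, not how GlusterFS actually behaves; the brick names, capacities, and move list are invented for the example:

```python
from collections import deque

def rebalance_with_retries(moves, capacity, used):
    """Toy model of the retry proposal: each completed move frees space
    on its source brick, so queued failures are retried until no further
    move fits. `moves` is a list of (size, source, target) tuples;
    returns the moves that never found room."""
    pending = deque(moves)
    progress = True
    while pending and progress:
        progress = False
        for _ in range(len(pending)):
            size, src, dst = pending.popleft()
            if used[dst] + size <= capacity[dst]:
                used[dst] += size
                used[src] -= size
                progress = True
            else:
                # Requeue instead of failing outright; a later move may
                # free space on the destination.
                pending.append((size, src, dst))
    return list(pending)

# Example mirroring the A > B, B > C scenario: B starts full, but doing
# B -> C first makes room for the A -> B retry to succeed.
capacity = {"A": 100, "B": 100, "C": 100}
used = {"A": 60, "B": 100, "C": 10}
moves = [(40, "A", "B"), (50, "B", "C")]
leftover = rebalance_with_retries(moves, capacity, used)
```

With a single-pass scheme the A -> B move in this example would simply be counted as a failure, which matches the behavior this bug describes; with the retry queue, both moves complete.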
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify if this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatically closed.
I haven't seen anything to indicate that the problem has been fixed in a newer version, so I would expect that it is still a problem. Because of the extreme length of time it would take to complete a rebalance on my production cluster (now running 3.4.2), and the potential for fallout if/when it fails, I cannot try it. I included instructions for reproducing the problem on a testbed when I opened the bug, so it is easy to verify. Note that I did not use enormous files when I filled my testbed volume to 60%. I used jpeg images no larger than a few megabytes. Gluster has proven to be unstable for us, so we have purchased a commercial scale-out storage solution and are migrating off Gluster as quickly as we can. We expect to be entirely done with it in the first few months of 2015 ... but I'd like to make sure that the problems I've encountered get fixed for other users.
Thinking about this after I commented... After we got our production volume up and running with 3.3.1, we ran into a data-loss situation when doing a rebalance after adding bricks. We found the likely bugzilla ID that represented the problem we encountered, so I set up a new testbed to verify it. I duplicated the problem we ran into by pausing a brick process with OS signals, which caused a 42-second timeout and data loss. Then I verified that by upgrading the cluster to 3.4.2, the problem went away. The point that's relevant to this bug: When I did that rebalance on 3.4.2, it also exhibited the same disk usage expansion described by this bug.