Bug 862347 - Migration with "remove-brick start" fails if bricks are more than half full
Summary: Migration with "remove-brick start" fails if bricks are more than half full
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-10-02 17:06 UTC by Shawn Heisey
Modified: 2014-12-14 19:40 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
The rebalance log after the first remove-brick run. (5.27 MB, application/octet-stream)
2012-10-03 18:58 UTC, Shawn Heisey
rebalance log for second remove-brick run (2.74 MB, application/octet-stream)
2012-10-03 19:14 UTC, Shawn Heisey
The log from the third run. (525.86 KB, application/octet-stream)
2012-10-03 19:21 UTC, Shawn Heisey
The rest of the gluster logs, covering all four remove-brick runs (1.00 MB, application/gzip)
2012-10-03 19:26 UTC, Shawn Heisey

Description Shawn Heisey 2012-10-02 17:06:47 UTC
Description of problem:
If the bricks in a distributed volume (and likely the volume as a whole) are more than half full, migrating data with "remove-brick start" will fail, even if the remaining bricks in the volume as a whole have enough space to absorb the migration.  Issuing the "remove-brick start" command multiple times (waiting for completion each time) appears to be a workaround.  Once each migration is complete (with or without failures), the used disk space on each brick drops down to the expected level.

Steps to Reproduce:
1. Start with a 4x2 distribute-replicate volume that is about 60% full.
2. Initiate a remove-brick start command for the last brick pair.
3. Note that eventually all remaining bricks will fill up and failures will be logged to the $volume-rebalance.log file.  There will be many failures in the 'status' output, because the bricks will fill up long before all the files are migrated.
4. When the operation completes, initiate it again.  Another cycle may be required to finally complete without error.
5. Note that Bug 862332 has probably also occurred.
6. Issue the commit operation to eliminate the effects of Bug 862332.
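
A minimal command-line sketch of the steps above (brick paths and volume name are taken from the volume-info/df output later in this report; the fill step is left abstract, and the commands are a hedged illustration rather than a verified script):

# create and start the 4x2 distributed-replicate volume, then mount it
gluster volume create qasmb replica 2 \
    testb1:/bricks/b1/qasmb testb2:/bricks/b1/qasmb \
    testb1:/bricks/b2/qasmb testb2:/bricks/b2/qasmb \
    testb1:/bricks/b3/qasmb testb2:/bricks/b3/qasmb \
    testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb
gluster volume start qasmb
mount -t glusterfs testb:qasmb /shared/qasmb

# fill the volume to roughly 60% with small files, then start removing the last pair
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb start

# watch progress and brick usage until the run completes; repeat 'start' if needed
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
df -k /bricks/b1 /bricks/b2 /bricks/b3 /bricks/b4

# once satisfied, finalize the removal
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb commit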
  
Expected results:
The removal migration should do one of two things:
1) Use less disk space to begin with -- On a 4x2 -> 3x2 removal migration, each remaining brick should only see additional usage approximately equivalent to one third of the used space of the removed brick.
2) Do automatic disk space reclamation when the system notices that all brick space is used.  This reclamation already happens when the migration completes; can it be done in the middle of the operation too?

Comment 1 shishir gowda 2012-10-03 05:41:20 UTC
Hi Shawn,

Migration of files is based on their hash names and not on their size. So, it is possible that one or more of the distribute subvolumes gets full as a result of remove-brick migration.

Can you please attach the remove-brick log (<volname>-rebalance.log) from when the failures are detected?

Comment 2 Shawn Heisey 2012-10-03 13:50:59 UTC
I will set all this up again and attach the log.  In the meantime, some additional info:

I used photos, text, and very small files containing their metadata.  The largest files were only a few megabytes.  File size was not an issue.

It's not 'one or more' of the remaining bricks that gets full during a remove-brick migration on a 60% full volume.  It's every single one of them.  Here's another sample test you can do with a volume that's less than half full, to fully illustrate the underlying problem:

* Set up a 4x2 volume where each brick is 1GB.
* Fill it 25% full so that all the bricks have about 250MB on them.
* Initiate a migration that removes the last brick pair.

In this instance, the six remaining bricks will all more than double in size during the migration, rather than increase to the approximately 333 MB that I would have expected.  When the migration is complete (before the commit), space will be reclaimed and all the bricks will drop in size somewhere close to the expected 333 MB.
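
A simple way to watch this inflation while the migration runs (a sketch using the brick mount points shown in the df output below; the 30-second interval is arbitrary):

# sample brick usage every 30 seconds during the remove-brick migration
while true; do
    date
    df -k /bricks/b1 /bricks/b2 /bricks/b3 /bricks/b4
    sleep 30
done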

It's worth noting that when I first did the "less than half full" test I have outlined here, I had a positive failure count in the 'status' screen, even though I did not run out of disk space during that test.  At that time, I did not attempt to look into the failures at all.  The extreme disk space discrepancy led me to repeat the test with a 60% full volume to find out if I would have reason to file this bug.  That's when I really looked into the failures.

Comment 3 Shawn Heisey 2012-10-03 14:13:23 UTC
Another note: so far this is all testbed. We are very close to buying production hardware and rolling out though.

Comment 4 Shawn Heisey 2012-10-03 18:32:53 UTC
In order to save time for a repeat of my 60% full test, I have re-added my fourth brick pair to my volume, and I am in the process of rebalancing to put data back on the fourth brick pair.  The remainder of this comment concerns this rebalance, not the original problem.

I have noticed a very similar disk space problem while doing the initial rebalance.  It uses considerably more disk space than it should, then when the rebalance is complete, disk usage drops.  In order to get the rebalance to work fully, I had to run it several times.

Also, there are TONS of failed migrations in the log for all of the rebalance runs.  There does not seem to be any data loss, though.
[dht-rebalance.c:1194:gf_defrag_migrate_data] 0-qasmb-dht: migrate-data failed for /CNW/012/484.thm

Comment 5 Shawn Heisey 2012-10-03 18:40:41 UTC
Information just before starting the remove-brick:


[root@testb1 ~]# gluster volume info

Volume Name: qasmb
Type: Distributed-Replicate
Volume ID: 86938037-e1bb-466f-b353-8a1bd939a345
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/qasmb
Brick2: testb2:/bricks/b1/qasmb
Brick3: testb1:/bricks/b2/qasmb
Brick4: testb2:/bricks/b2/qasmb
Brick5: testb1:/bricks/b3/qasmb
Brick6: testb2:/bricks/b3/qasmb
Brick7: testb1:/bricks/b4/qasmb
Brick8: testb2:/bricks/b4/qasmb

[root@testb1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3500440  43521024   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    673036    365300  65% /bricks/b1
/dev/sdb3              1038336    673916    364420  65% /bricks/b2
/dev/sdc3              1038336    669692    368644  65% /bricks/b3
/dev/sdd3              1038336    670464    367872  65% /bricks/b4
testb:qasmb            4153344   2687488   1465856  65% /shared/qasmb

[root@testb1 ~]# uname -a
Linux testb1 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Comment 6 Shawn Heisey 2012-10-03 18:41:00 UTC
Command used to initiate removal:

gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb start

Comment 7 Shawn Heisey 2012-10-03 18:49:54 UTC
df after only three minutes.  Note that space usage on brick 4 has only dropped by 200MB, but disk usage on brick 2 is up by nearly 300MB and the others are not far behind:

[root@testb1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3350020  43671444   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    787812    250524  76% /bricks/b1
/dev/sdb3              1038336    965928     72408  94% /bricks/b2
/dev/sdc3              1038336    922104    116232  89% /bricks/b3
/dev/sdd3              1038336    470708    567628  46% /bricks/b4
testb:qasmb            4153344   3148288   1005056  76% /shared/qasmb

Comment 8 Shawn Heisey 2012-10-03 18:50:30 UTC
Shortly after the previous comment, failures began showing up in the log, and df looked like this.  Now brick 2 is completely full:

[root@testb1 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3351872  43669592   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    957344     80992  93% /bricks/b1
/dev/sdb3              1038336   1038328         8 100% /bricks/b2
/dev/sdc3              1038336    994952     43384  96% /bricks/b3
/dev/sdd3              1038336    330092    708244  32% /bricks/b4
testb:qasmb            4153344   3320832    832512  80% /shared/qasmb

Comment 9 Shawn Heisey 2012-10-03 18:54:09 UTC
After the first run completed, the status and df looked like this.  There were 4266 migration failures, and brick 4 is still 25% full.  Will attach the rebalance log.  Note that the used space on both the volume as a whole as well as the bricks has dropped significantly since the last df showing brick 2 completely full.

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost             8452    641210309        37789         4266      completed
                                  testb4                0            0        36527            0    not started
                                  testb3                0            0        36526            0    not started
                                  testb2                0            0        36501            0      completed

[root@testb1 ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3358100  43663364   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    838728    199608  81% /bricks/b1
/dev/sdb3              1038336    811140    227196  79% /bricks/b2
/dev/sdc3              1038336    784172    254164  76% /bricks/b3
/dev/sdd3              1038336    255828    782508  25% /bricks/b4
testb:qasmb            4153344   2690560   1462784  65% /shared/qasmb

Comment 10 Shawn Heisey 2012-10-03 18:55:17 UTC
Moving the log before attaching it here:
[root@testb1 ~]# mv /var/log/glusterfs/qasmb-rebalance.log ~/first-run-qasmb-rebalance.log

Comment 11 Shawn Heisey 2012-10-03 18:58:08 UTC
Created attachment 621071 [details]
The rebalance log after the first remove-brick run.

This is the full rebalance log after the first remove-brick run completed.  I did not run into Bug 862332 at this point.  I will watch for that bug on subsequent runs.

Comment 12 Shawn Heisey 2012-10-03 19:07:35 UTC
Partway through the process, the bricks once again began to fill up:

[root@testb1 glusterfs]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3361468  43659996   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    838756    199580  81% /bricks/b1
/dev/sdb3              1038336   1038336         0 100% /bricks/b2
/dev/sdc3              1038336    993868     44468  96% /bricks/b3
/dev/sdd3              1038336    189388    848948  19% /bricks/b4
testb:qasmb            4153344   3060352   1092992  74% /shared/qasmb

Comment 13 Shawn Heisey 2012-10-03 19:12:22 UTC
When the second run was complete, there were still failures and brick 4 was still not empty.  Once again, disk usage dropped dramatically.

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost             6039    450055294        37910         1522      completed
                                  testb4                0            0        36527            0    not started
                                  testb3                0            0        36526            0    not started
                                  testb2                0            0        36580            0      completed

[root@testb1 glusterfs]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3365056  43656408   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    920084    118252  89% /bricks/b1
/dev/sdb3              1038336    862764    175572  84% /bricks/b2
/dev/sdc3              1038336    829144    209192  80% /bricks/b3
/dev/sdd3              1038336     79420    958916   8% /bricks/b4
testb:qasmb            4153344   2692096   1461248  65% /shared/qasmb

[root@testb1 glusterfs]# find /bricks/b4 -type f | wc -l
3336

Comment 14 Shawn Heisey 2012-10-03 19:14:18 UTC
Created attachment 621085 [details]
rebalance log for second remove-brick run

[root@testb1 ~]# mv /var/log/glusterfs/qasmb-rebalance.log ~/second-run-qasmb-rebalance.log

Comment 15 Shawn Heisey 2012-10-03 19:20:46 UTC
On the third run, there were no failures, but it did not remove all of the files from the fourth brick.

[root@testb1 glusterfs]# find /bricks/b4 -type f | wc -l
2208

[root@testb1 glusterfs]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3366508  43654956   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3              1038336    875540    162796  85% /bricks/b1
/dev/sdb3              1038336    898612    139724  87% /bricks/b2
/dev/sdc3              1038336    878524    159812  85% /bricks/b3
/dev/sdd3              1038336     38796    999540   4% /bricks/b4
testb:qasmb            4153344   2692224   1461120  65% /shared/qasmb

[root@testb1 qasmb]# gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost             1522    104671121        37344            0      completed
                                  testb3                0            0        36526            0    not started
                                  testb4                0            0        36527            0    not started
                                  testb2                0            0        36728            0      completed

Comment 16 Shawn Heisey 2012-10-03 19:21:35 UTC
Created attachment 621088 [details]
The log from the third run.

Comment 17 Shawn Heisey 2012-10-03 19:25:22 UTC
Running the remove-brick a fourth time resulted in no files getting moved, yet there are still over 2000 files on the bricks that are being removed.  It looks like I'm going to have to start completely over.

Comment 18 Shawn Heisey 2012-10-03 19:26:32 UTC
Created attachment 621089 [details]
The rest of the gluster logs, covering all four remove-brick runs

Comment 19 Shawn Heisey 2012-10-03 19:30:03 UTC
You'll notice testb3 and testb4 in the status output and possibly elsewhere.  These are Fedora 17 machines that are part of the cluster.  The purpose of these machines is to provide NFS, Samba, and UFO access to the volume.  They are Fedora so that we can have newer versions of software for the centralized access, while using a more stable distro (CentOS 6) for the gluster volume itself.

Comment 20 Shawn Heisey 2012-10-03 19:37:58 UTC
After committing the remove-brick, I checked into the files remaining on brick 4.  Despite the fact that they were not deleted from brick 4, they are still accessible from the client mount, and it appears that all data is intact as well, even after completely unmounting those bricks.

Comment 21 shishir gowda 2012-10-04 07:17:52 UTC
The failures are because you have run out of space in a few of the subvolumes. Rebalance/remove-brick fails the migration in such cases.

[2012-10-03 12:47:09.330674] E [dht-rebalance.c:367:__dht_check_free_space] 0-qasmb-dht: data movement attempted from node (qasmb-replicate-3) with to node (qasmb-replicate-1) which does not have required free space for /CNW/004/660
[2012-10-03 12:47:09.330882] E [dht-rebalance.c:1194:gf_defrag_migrate_data] 0-qasmb-dht: migrate-data failed for /CNW/004/660
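
A quick way to count these specific failures (a sketch; it assumes the default rebalance log path shown in comment 10):

grep -c "does not have required free space" /var/log/glusterfs/qasmb-rebalance.log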

As mentioned before, rebalance/remove-brick does not depend on file sizes; it depends only on the hash of the file names.
Every directory's hash range is spread across the subvolumes, and each file's hash value determines which brick pair/subvolume holds it.
Even if some subvolume has free space, files do not get migrated to it unless it is the hashed target.
Additionally, remove-brick aggravates the problem here, since space is being taken away from the cluster.

To mitigate this, please do an add-brick (with empty bricks) followed by a remove-brick (if decommissioning is needed) or a rebalance.
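
A hedged sketch of that sequence (the new brick paths under /bricks/b5 are hypothetical; everything else reuses names from this report):

# 1. add an empty replica pair first, so the cluster gains headroom
gluster volume add-brick qasmb testb1:/bricks/b5/qasmb testb2:/bricks/b5/qasmb

# 2. then decommission the old pair; migrated files can now land on the new
#    bricks whenever their hash range points there
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb start
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb status
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb commit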

Comment 22 Shawn Heisey 2012-10-04 18:35:23 UTC
New round of testing.  Here's information showing the empty volume.  This volume is a LOT bigger than my previous one.  It will be a while before I post anything more for this round, because it's going to take a while to fill this up.

[root@testb1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_main-lv_root
                      49537840   3371060  43650404   8% /
tmpfs                  1914332         0   1914332   0% /dev/shm
/dev/md1               1032076    127836    851812  14% /boot
/dev/sda3            922833364     33168 922800196   1% /bricks/b1
/dev/sdb3            922833364     33168 922800196   1% /bricks/b2
/dev/sdc3            922833364     33168 922800196   1% /bricks/b3
/dev/sdd3            922833364     33168 922800196   1% /bricks/b4
testb:testvol        3691333376    132608 3691200768   1% /shared/test


[root@testb1 ~]# gluster volume info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 182df850-96f3-4d69-95b9-18e9ea409dfb
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: testb1:/bricks/b1/testvol
Brick2: testb2:/bricks/b1/testvol
Brick3: testb1:/bricks/b2/testvol
Brick4: testb2:/bricks/b2/testvol
Brick5: testb1:/bricks/b3/testvol
Brick6: testb2:/bricks/b3/testvol
Brick7: testb1:/bricks/b4/testvol
Brick8: testb2:/bricks/b4/testvol

[root@testb1 ~]# find /shared/test -type f | wc -l
0

Comment 23 Shawn Heisey 2012-10-04 21:45:15 UTC
I looked at comment 21, posted at 2012-10-04 03:17:52 EDT. The parts that I can fully understand seem irrelevant to the discussion.  How about we take a step back and I will explain the problem I am seeing with less verbosity.

At the beginning of my previous test (see comment 5), I had four bricks, each with 670MB of space used.  I asked it to remove one brick (and its replica).  The entire volume had 1.4GB of space available.  If you ignore the free space showing on the last brick (which is the one being removed), it actually had a little over 1GB of space available.

It was my expectation that the brick removal would complete without error in one pass, because it only had to fit 670MB into 1GB, and none of the files were larger than about 8 MB.  In reality, it took three passes, because the bricks ran out of space on the first two passes.
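
For concreteness, the arithmetic behind that expectation (numbers from the df in comment 5: ~1 GB bricks at 65% used, roughly 365 MB free each):

  670 MB leaving the removed pair / 3 remaining pairs ≈ 223 MB extra per remaining brick
  670 MB + 223 MB ≈ 893 MB used per brick, about 86% of each 1 GB brick

which is roughly where the bricks ended up after the third pass (comment 15), but only after intermediate passes that drove them to 100%.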

Is the behavior that I saw what you would expect?  Is there any way to have it work according to my expectation?

When the time comes for me to actually perform this operation, running out of space on the bricks could prove fatal, because the filesystem will be online and there will be data constantly being written to it.

Comment 24 shishir gowda 2012-10-05 04:16:53 UTC
Hi Shawn,

Though you had ~1 GB free in the cluster, when remove-brick starts migrating data, it moves each file to the brick pair that owns the required hash range (based on the file name). Perhaps due to your file name pattern, the files rehash to a particular brick pair. That is why you seem to be running out of space on those brick pairs, even though you have enough space in the cluster. A file only gets migrated to a brick pair with free space if its file name hashes into the range assigned to that pair.

The behavior you are experiencing is in sync with the current design of dht/remove-brick.

So, a better approach is to do an add-brick (with empty bricks) and then do the remove-brick.

Comment 25 Shawn Heisey 2012-10-05 14:22:07 UTC
What you are saying makes sense, but still doesn't explain what I am seeing. Take a look at comment 7 as it compares to comment 5.  About 200 MB of data has been moved off of brick 4, but well over 500 MB of additional space has been consumed on the other three bricks. That's nearly triple the disk space. Later in the process all three bricks did fill up completely, but the migration wasn't even half done.

There are thousands and thousands of files on this system. Most of them are only a few kilobytes.  About a fifth of them are jpg photos of typical digital camera quality. I don't have exact numbers, but if any of them is larger than 8 MB I would be very surprised. There are no very large files that could fill up one brick too fast.

Comment 26 Amar Tumballi 2012-10-23 14:27:25 UTC
Sorry about the delay in responding with updates. We are trying to reproduce the setup and see the behavior ourselves, and will update you soon about the findings.

Comment 27 Shawn Heisey 2012-10-23 20:15:38 UTC
I completed another test with failures while trying to reproduce bug 862332.  Logs and details can be found on that ticket.

Comment 28 Shawn Heisey 2012-10-31 16:56:40 UTC
I am wondering if a workaround in the meantime might be to start a remove-brick, abort it before the other bricks fill up, then repeat until it's done.
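
A rough sketch of that idea (hedged: the 90% threshold is arbitrary, and whether this GlusterFS version supports stopping a remove-brick mid-flight, and the exact verb for it, would need checking):

# start the migration, then watch remaining-brick usage and bail out before any brick fills
gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb start
while true; do
    worst=$(df -kP /bricks/b1 /bricks/b2 /bricks/b3 | awk 'NR>1 {gsub("%","",$5); print $5}' | sort -n | tail -1)
    if [ "$worst" -ge 90 ]; then
        # 'stop' is the verb in newer releases; older CLIs may differ
        gluster volume remove-brick qasmb testb1:/bricks/b4/qasmb testb2:/bricks/b4/qasmb stop
        break
    fi
    sleep 30
done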

Comment 29 Shawn Heisey 2012-11-28 21:33:50 UTC
On the IRC channel, they are talking about the cluster.min-free-disk option in relation to some other problem.  I don't have time to look into it right now, but is it possible that setting this option might keep a rebalance or remove-brick from filling up all the brick space?

http://gluster.org/community/documentation/index.php/Gluster_3.2:_Setting_Volume_Options#cluster.min-free-disk
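
For reference, the option is set like this (a sketch; whether rebalance/remove-brick actually honours it is exactly the open question here):

gluster volume set qasmb cluster.min-free-disk 10%
gluster volume info qasmb    # the option appears under 'Options Reconfigured'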

Comment 30 Robert Coup 2013-06-06 01:41:07 UTC
I've seen the same problem (in my case, removing a replica pair of bricks from a 7x2 volume). Running 3.3.1 with the min-free-disk patch from #874554. 

> Migration of files is based on their hash names and not on their size. So, it is possible that one or more of the distribute subvolume gets full as a result of remove-brick migration.

Seems like the rebalance (I presume) that happens during remove-brick ignores the cluster.min-free-disk option, and fills up complete subvolumes regardless of the amount of space free across the cluster.

That would be okay if failures were queued to retry later in the process (e.g. files move from A to B; B fills up and failures happen; then B-to-C rebalancing happens; then the A-to-B retries happen, since there is now space).

The docs (or anything else) don't make it clear what the 'failures' shown by 'remove-brick status' mean either - should you *not* commit, and keep running 'remove-brick start' until it completes without errors?

Comment 32 Niels de Vos 2014-11-27 14:53:56 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

Comment 33 Shawn Heisey 2014-11-27 16:30:32 UTC
I haven't seen anything to indicate that the problem has been fixed in a newer version, so I would expect that it is still a problem.

Because of the extreme length of time it would take to complete a rebalance on my production cluster (now running 3.4.2), and the potential for fallout if/when it fails, I cannot try it.

I included instructions for reproducing the problem on a testbed when I opened the bug, so it is easy to verify.  Note that I did not use enormous files when I filled my testbed volume to 60%.  I used jpeg images no larger than a few megabytes.

Gluster has proven to be unstable for us, so we have purchased a commercial scale-out storage solution and are migrating off Gluster as quickly as we can.  We expect to be entirely done with it in the first few months of 2015 ... but I'd like to make sure that the problems I've encountered get fixed for other users.

Comment 34 Shawn Heisey 2014-11-27 16:44:18 UTC
Thinking about this after I commented...

After we got our production volume up and running with 3.3.1, we ran into a data-loss situation when doing a rebalance after adding bricks.

We found the likely bugzilla ID that represented the problem we encountered, so I set up a new testbed to verify it.  I duplicated the problem we ran into by pausing a brick process with OS signals, which caused a 42-second timeout and data loss.  Then I verified that by upgrading the cluster to 3.4.2, the problem went away.

The point that's relevant to this bug:  When I did that rebalance on 3.4.2, it also exhibited the same disk usage expansion described by this bug.

