Bug 808906 - Data removed from bricks when continuing rebalance after crash
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: core
Version: 3.2.5
Hardware: x86_64 Linux
Priority: high, Severity: high
Assigned To: shishir gowda
Reported: 2012-04-01 11:41 EDT by Jonathan Dieter
Modified: 2013-12-08 20:30 EST (History)
Doc Type: Bug Fix
Last Closed: 2012-07-10 23:58:09 EDT

Attachments
Configuration file (2.54 KB, text/plain)
2012-04-01 11:44 EDT, Jonathan Dieter
Description Jonathan Dieter 2012-04-01 11:41:50 EDT
Description of problem:
We have three servers, ds01, ds02 and ds03.  Ds01 and ds03 have two 2TB drives in them, while ds02 has four 2TB drives.

I originally set up GlusterFS on two of the four drives on ds02 and both of the drives on ds03, using afr with a replica count of 2.  The first drive on ds02 was replicated with the first drive on ds03 and the same for the second drive.

After testing GlusterFS for a week, we went ahead and added the remaining drives, with the third drive on ds02 being replicated with the first on ds01 and the fourth drive on ds02 being replicated with the second on ds01.

I then ran a rebalance on ds01.  It succeeded in fixing the layout and started to migrate the data.  After a couple of days, ds02 crashed. I have no idea what exactly happened, but I ended up having to reboot it a few hours after the crash.  During this time, ds01 continued to try to do the rebalance, but was obviously running into problems.  I manually stopped the rebalance while ds02 was down, and then restarted it after I'd brought ds02 back up.

After eight hours or so, I realized that files were being removed from the original four bricks without being migrated to the new bricks.  After an emergency meeting with management, we decided to ditch GlusterFS, recover what data we could, and migrate back to NFS over ext4 over DRBD.

I used rsync to copy the files from each of the underlying filesystems, but, in the end, we still lost approximately 20% of the files (which were recoverable from our backups).

I realize this is all rather ugly, and all the more so as we have no way of testing whether the bug is fixed.  I don't expect anyone to spend loads of time on this, but I wrote about the experience in a post and it was politely pointed out to me that a bug report would go much further in helping others who might run into the same problem.  I will attach our configuration files and logs.

Version-Release number of selected component (if applicable):
glusterfs-3.2.6-1.fc16.x86_64
glusterfs-fuse-3.2.6-1.fc16.x86_64
Comment 1 Jonathan Dieter 2012-04-01 11:44:15 EDT
Created attachment 574346
Configuration file
Comment 2 Jonathan Dieter 2012-04-01 11:52:34 EDT
Created attachment 574349
Compressed rebalance log
Comment 3 Amar Tumballi 2012-04-01 13:11:28 EDT
Looking into the logs to narrow down the issue.
Comment 4 Jonathan Dieter 2012-04-01 13:18:00 EDT
Not sure if it's relevant, but the filesystem was being accessed by other clients while it was doing the rebalance.  If there's nothing in that log that indicates why the files have gone missing, I could see what logs are available for the other clients.

All clients were using glusterfs-3.2.6 on either Fedora 16 or CentOS 6.2.
Comment 5 Amar Tumballi 2012-04-01 13:35:51 EDT
Thanks for these details. Technically, even if other clients are accessing the volume while a rebalance is happening, it should not result in this behavior.
Comment 6 Jeff Darcy 2012-04-02 16:09:41 EDT
(Made log attachment private for security reasons)

The logs are quite . . . interesting, to say the least.  They represent five separate incarnations of the client, with fix-layout activity in the first three.  There also seems to be a pattern of increasingly frequent disconnections:

* At 17:53 (near the end of the third session), clients 0/2/5/7 disconnect in fairly rapid succession.  This represents one subvolume for each replica pair.

* At 21:05 (fourth session now), the same four clients disconnect again, then a second time almost immediately.

* At 21:08, the same four clients disconnect as a group for the fourth time . . . only this time they're quickly joined by client 3 so replica-1 is totally down.

* At 21:26, there's another round of disconnections . . . 6, 4, 5 (replica-2 goes offline), 2, 0, 7 (replica-3), 3 (replica-1), 1 (replica-0).  At this point we're totally down.

* At 00:17 (fifth session), clients 0/2/5 disconnect.  In the next few minutes there are further disconnections from 2/5/7.  At 01:23 all four of the original culprits disconnect.  At 07:42 clients 2/5/7 drop yet again.
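
The bookkeeping in the list above can be sketched in a few lines of Python. This is only an illustration: it assumes that client N belongs to replica pair N // 2 (so replica-0 is clients 0/1, replica-1 is clients 2/3, and so on), which is inferred from the groupings quoted from the log rather than read from the volume configuration. The function name is made up for this example.

```python
def replica_status(down_clients, replica_count=2, num_clients=8):
    """Classify each replica set as 'up', 'degraded' or 'down'.

    Assumes client N belongs to replica set N // replica_count
    (an inference from the log, not read from the volfile).
    """
    down = set(down_clients)
    status = {}
    for first in range(0, num_clients, replica_count):
        members = range(first, first + replica_count)
        missing = sum(1 for c in members if c in down)
        name = f"replica-{first // replica_count}"
        if missing == 0:
            status[name] = "up"
        elif missing < replica_count:
            status[name] = "degraded"
        else:
            status[name] = "down"
    return status


# 17:53: clients 0/2/5/7 drop, one member of every pair.
print(replica_status([0, 2, 5, 7]))
# 21:26: every client has dropped.
print(replica_status(range(8)))
```

A "degraded" pair can still serve data from its surviving member; only a "down" pair makes part of the namespace unreachable.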

It seems fairly likely that the network was basically melting down, causing intermittent connectivity throughout the system.  This is further backed up by the fact that there are 12226 messages about split brain (indicating that different clients were able to make updates on different replicas) and another 1444 about holes in layouts.

None of this explains why files were being deleted without having previously been copied to their new/correct locations.  As I pointed out on the user's blog, this shouldn't be possible because the relocation is done as a copy plus rename.  I see 40 messages about failed renames, but that's not nearly enough to account for the reported massive loss of data.  I would still like to know whether those files were actually present on other bricks besides the one that had gone down (which would indicate that there wasn't actually any data loss at all).  It would also be worth looking into why migrate-data didn't simply give up and terminate when subvolumes became unavailable at 21:26.  It seems like it should.
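
To make the copy-plus-rename argument concrete, here is a minimal Python sketch of the general pattern on a local filesystem. It is not the GlusterFS rebalance code; the function name and the temporary-file suffix are invented for illustration. The point is that the source is unlinked only after the copy has been flushed and renamed into place, so a failure at any earlier step leaves the original file untouched.

```python
import os
import shutil


def migrate_file(src, dst):
    """Move src to dst via copy + rename; src is removed last.

    Illustrative only: this is the generic pattern, not the
    GlusterFS implementation.
    """
    tmp = dst + ".migrate-tmp"  # hypothetical temporary name
    try:
        with open(src, "rb") as fin, open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)   # copy the data
            fout.flush()
            os.fsync(fout.fileno())         # make sure the copy is on disk
        shutil.copystat(src, tmp)           # carry over mode and timestamps
        os.replace(tmp, dst)                # rename into place
    except OSError:
        # Any failure before the rename completes: clean up the
        # partial copy and leave the source alone.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    os.remove(src)                          # only now drop the original
```

With this ordering, a failed rename surfaces as an error and the source survives, which is why failed renames alone would not be expected to cause missing files.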
Comment 7 Jonathan Dieter 2012-04-03 01:11:26 EDT
(In reply to comment #6)
> (Made log attachment private for security reasons)

Thank you.

> The logs are quite . . . interesting, to say the least.  They represent five
> separate incarnations of the client, with fix-layout activity in the first
> three.

IIRC, I ran migrate-data first, read the docs, stopped migrate-data, started fix-layout, read some more docs, stopped fix-layout and then just ran rebalance, expecting it to do both fix-layout and migrate-data (which it did, I think).

> There also seems to be a pattern of increasingly frequent
> disconnections:
> 
> * At 17:53 (near the end of the third session), clients 0/2/5/7 disconnect in
> fairly rapid succession.  This represents one subvolume for each replica pair.

Not sure what happened here, but clients 0/2/5/7 are all on the same server, ds02.  I didn't think that ds02 had crashed yet at this point.

> * At 21:05 (fourth session now), the same four clients disconnect again, then a
> second time almost immediately.
> 
> * At 21:08, the same four clients disconnect as a group for the fourth time . .
> . only this time they're quickly joined by client 3 so replica-1 is totally
> down.

As far as I know, this is where ds02 went completely down.

> * At 21:26, there's another round of disconnections . . . 6, 4, 5 (replica-2
> goes offline), 2, 0, 7 (replica-3), 3 (replica-1), 1 (replica-0).  At this
> point we're totally down.

I think this is where I realized that ds02 was down and rebooted it.  I may have run gluster volume stop before rebooting ds02; I don't really remember.

> * At 00:17 (fifth session), clients 0/2/5 disconnect.  In the next few minutes
> there are further disconnections from 2/5/7.  At 01:23 all four of the original
> culprits disconnect.  At 07:42 clients 2/5/7 drop yet again.

Not sure what was going on during the night, but 7:42 was probably the point where we realized there was missing data and decided to give up and switch back to ext4 over DRBD.

> It seems fairly likely that the network was basically melting down, causing
> intermittent connectivity throughout the system.  This is further backed up by
> the fact that there are 12226 messages about split brain (indicating that
> different clients were able to make updates on different replicas) and another
> 1444 about holes in layouts.
> 
> None of this explains why files were being deleted without having previously
> been copied to their new/correct locations.  As I pointed out on the user's
> blog, this shouldn't be possible because the relocation is done as a copy plus
> rename.  I see 40 messages about failed renames, but that's not nearly enough
> to account for the reported massive loss of data.  I would still like to know
> whether those files were actually present on other bricks besides the one that
> had gone down (which would indicate that there wasn't actually any data loss at
> all).  It would also be worth looking into why migrate-data didn't simply give
> up and terminate when subvolumes became unavailable at 21:26.  It seems like it
> should.

The reason I was convinced that the data was completely missing was that a df on client 1 showed a massive decrease in used disk space, from 1.2TB to roughly 600GB.  Client 0 didn't have the same reduction in disk space, but it was down to 1.0TB.

If clients 4/5 and 6/7 had shown a combined increase in used disk space, I wouldn't have worried, but their space hadn't increased even 200G, much less 600G.

In the end, I rsync'd straight from the brick filesystems to new drives, using the -u switch so I'd get the latest files from each filesystem.  After all that, I checked and realized that some known files were missing, so I rsync'd from our backup, also using the -u switch to fill in the holes.

That final rsync updated roughly one in five files, which is where I got the 20% data loss number from.
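
The recovery procedure described above amounts to a "newest copy wins" merge across several directory trees. A rough Python sketch of that idea follows. It only approximates what rsync -u does (skip a file when the copy already at the destination is newer), and the function name is a placeholder, not something that was actually used.

```python
import os
import shutil


def merge_newest(sources, dest):
    """Copy files from each source tree into dest, keeping the newest.

    A file is copied only if it is missing from dest or the source
    copy has a later modification time. Returns the number of files
    copied.
    """
    copied = 0
    for root in sources:
        for dirpath, _dirnames, filenames in os.walk(root):
            rel = os.path.relpath(dirpath, root)
            target_dir = os.path.join(dest, rel)
            os.makedirs(target_dir, exist_ok=True)
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = os.path.join(target_dir, name)
                if (not os.path.exists(dst)
                        or os.path.getmtime(src) > os.path.getmtime(dst)):
                    shutil.copy2(src, dst)  # copy2 keeps the mtime
                    copied += 1
    return copied
```

Running it first over the brick directories and then over the backup, and comparing the second call's return value with the total number of files, would give the same kind of estimate as the one-in-five figure.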

Thanks much for looking at this.  My apologies that I wasn't able to leave the filesystem alone for further examination.
Comment 8 Amar Tumballi 2012-04-11 14:31:34 EDT
We have added multiple test cases to evaluate this scenario. Testing will be done mainly on the 3.3.0 release branch (currently still master), as it already has multiple significant rebalance improvements.
Comment 9 Amar Tumballi 2012-04-17 14:14:44 EDT
Jonathan,

We have made a 3.3.0beta3 release, and one of its main features is 'rebalance improvements'. If you have the bandwidth to test, please use this release stream from now on.
Comment 10 shishir gowda 2012-07-10 23:58:09 EDT
This is fixed in the current release (3.3.0). Please plan an upgrade, and reopen the bug if you still encounter these issues.

Rebalance feature improvements are complex in nature, hence back-porting is not planned currently.
