Bug 808906
| Summary: | Data removed from bricks when continuing rebalance after crash |
|---|---|
| Product: | [Community] GlusterFS |
| Component: | core |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Version: | 3.2.5 |
| Hardware: | x86_64 |
| OS: | Linux |
| Reporter: | Jonathan Dieter <jdieter> |
| Assignee: | shishir gowda <sgowda> |
| CC: | gluster-bugs, jdarcy, nsathyan, samuel-rhbugs |
| Doc Type: | Bug Fix |
| Last Closed: | 2012-07-11 03:58:09 UTC |
Description
Jonathan Dieter, 2012-04-01 15:41:50 UTC
Created attachment 574346 [details]
Configuration file
Created attachment 574349 [details]
Compressed rebalance log
Looking into the logs to corner the issue.

---

Not sure if it's relevant, but the filesystem was being accessed by other clients while it was doing the rebalance. If there's nothing in that log that indicates why the files have gone missing, I could see what logs are available for the other clients. All clients were using glusterfs-3.2.6 on either Fedora 16 or CentOS 6.2.

---

Thanks for these details. Technically, even if other clients are accessing the volume while rebalance is happening, it should not result in this behavior.

---

(Made log attachment private for security reasons)

The logs are quite ... interesting, to say the least. They represent five separate incarnations of the client, with fix-layout activity in the first three. There also seems to be a pattern of increasingly frequent disconnections:

* At 17:53 (near the end of the third session), clients 0/2/5/7 disconnect in fairly rapid succession. This represents one subvolume for each replica pair.
* At 21:05 (fourth session now), the same four clients disconnect again, then a second time almost immediately.
* At 21:08, the same four clients disconnect as a group for the fourth time ... only this time they're quickly joined by client 2 so replica-1 is totally down.
* At 21:26, there's another round of disconnections: 6, 4, 5 (replica-2 goes offline), 2, 0, 7 (replica-3), 3 (replica-1), 1 (replica-0). At this point we're totally down.
* At 00:17 (fifth session), clients 0/2/5 disconnect. In the next few minutes there are further disconnections from 2/5/7. At 01:23 all four of the original culprits disconnect. At 07:42 clients 2/5/7 drop yet again.

It seems fairly likely that the network was basically melting down, causing intermittent connectivity throughout the system. This is further backed up by the fact that there are 12226 messages about split brain (indicating that different clients were able to make updates on different replicas) and another 1444 about holes in layouts.
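The kind of log triage described above can be approximated with a short script. The message substrings below ("disconnected from", "split-brain", "hole") are assumptions about what a 3.2.x glusterfs client log contains, not verified message formats:

```python
import re
from collections import Counter

# Substrings to tally. These patterns are guesses at the relevant
# glusterfs client log messages, not verified formats.
PATTERNS = {
    "disconnect": re.compile(r"disconnected from"),
    "split-brain": re.compile(r"split.brain", re.IGNORECASE),
    "layout-hole": re.compile(r"hole", re.IGNORECASE),
}

def tally(lines):
    """Count how many log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                counts[name] += 1
    return counts

# Usage sketch: tally(open("/var/log/glusterfs/rebalance.log"))
```

Run over the real client log, this would reproduce counts like the 12226 split-brain messages cited above, though the exact totals depend on the actual message text.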
None of this explains why files were being deleted without having previously been copied to their new/correct locations. As I pointed out on the user's blog, this shouldn't be possible because the relocation is done as a copy plus rename. I see 40 messages about failed renames, but that's not nearly enough to account for the reported massive loss of data. I would still like to know whether those files were actually present on other bricks besides the one that had gone down (which would indicate that there wasn't actually any data loss at all). It would also be worth looking into why migrate-data didn't simply give up and terminate when subvolumes became unavailable at 21:26. It seems like it should.

---

(In reply to comment #6)

> (Made log attachment private for security reasons)

Thank you.

> The logs are quite ... interesting, to say the least. They represent five separate incarnations of the client, with fix-layout activity in the first three.

IIRC, I ran migrate-data first, read the docs, stopped migrate-data, started fix-layout, read some more docs, stopped fix-layout and then just ran rebalance, expecting it to do both fix-layout and migrate-data (which it did, I think).

> There also seems to be a pattern of increasingly frequent disconnections:
>
> * At 17:53 (near the end of the third session), clients 0/2/5/7 disconnect in fairly rapid succession. This represents one subvolume for each replica pair.

Not sure what happened here, but clients 0/2/5/7 are all on the same server, ds02. I didn't think that ds02 had crashed yet at this point.

> * At 21:05 (fourth session now), the same four clients disconnect again, then a second time almost immediately.
>
> * At 21:08, the same four clients disconnect as a group for the fourth time ... only this time they're quickly joined by client 2 so replica-1 is totally down.

As far as I know, this is where ds02 went completely down.

> * At 21:26, there's another round of disconnections ...
> 6, 4, 5 (replica-2 goes offline), 2, 0, 7 (replica-3), 3 (replica-1), 1 (replica-0). At this point we're totally down.

I think this is where I realized that ds02 was down and rebooted it. I may have run gluster volume stop before rebooting ds02; I don't really remember.

> * At 00:17 (fifth session), clients 0/2/5 disconnect. In the next few minutes there are further disconnections from 2/5/7. At 01:23 all four of the original culprits disconnect. At 07:42 clients 2/5/7 drop yet again.

Not sure what was going on during the night, but 7:42 was probably the point where we realized there was missing data and decided to give up and switch back to ext4 over DRBD.

> It seems fairly likely that the network was basically melting down, causing intermittent connectivity throughout the system. This is further backed up by the fact that there are 12226 messages about split brain (indicating that different clients were able to make updates on different replicas) and another 1444 about holes in layouts.
>
> None of this explains why files were being deleted without having previously been copied to their new/correct locations. As I pointed out on the user's blog, this shouldn't be possible because the relocation is done as a copy plus rename. I see 40 messages about failed renames, but that's not nearly enough to account for the reported massive loss of data. I would still like to know whether those files were actually present on other bricks besides the one that had gone down (which would indicate that there wasn't actually any data loss at all). It would also be worth looking into why migrate-data didn't simply give up and terminate when subvolumes became unavailable at 21:26. It seems like it should.

The reason I was convinced that the data was completely missing was that a df on client 1 showed a massive decrease in used disk space, from 1.2TB to roughly 600GB. Client 0 didn't have the same reduction in disk space, but it was down to 1.0TB.
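The question raised above, whether the "missing" files still existed on some brick, can be checked directly against the brick filesystems. A minimal sketch, where the brick mount points are placeholders, not the reporter's actual layout:

```python
import os

def files_under(root):
    """Set of all regular-file paths under a brick root, relative to it."""
    found = set()
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found

def missing_everywhere(expected, brick_roots):
    """Return the expected relative paths that exist on none of the bricks."""
    present = set()
    for root in brick_roots:
        present |= files_under(root)
    return set(expected) - present

# Hypothetical brick mount points; substitute the real ones:
# missing_everywhere(known_files, ["/bricks/b0", "/bricks/b1"])
```

An empty result would support the theory that the data was still on disk somewhere and only the client's view of it was broken.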
If clients 4/5 and 6/7 had shown a combined increase in used disk space, I wouldn't have worried, but their space hadn't increased even 200G, much less 600G. In the end, I rsync'd straight from the brick filesystems to new drives, using the -u switch so I'd get the latest files from each filesystem. After all that, I checked and realized that some known files were missing, so I rsync'd from our backup, also using the -u switch to fill in the holes. That final rsync updated roughly one in five files, which is where I got the 20% data loss number from.

Thanks much for looking at this. My apologies that I wasn't able to leave the filesystem alone for further examination.

---

We have currently added multiple test cases to evaluate the scenario. The testing will be done mainly on the 3.3.0 release branch (currently still master), as it already has multiple significant rebalance improvements.

---

Jonathan, we have made a 3.3.0beta3 release, and one of the main features of that is 'rebalance improvements'. If there is bandwidth to test, please use this release stream from now onwards.

---

This is fixed in the current release (3.3.0). Please plan an upgrade, and reopen the bug if you still encounter these issues. Rebalance feature improvements are complex in nature, hence back-porting is not planned currently.
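The recovery described earlier, rsync'ing each brick into one tree with -u so the newest copy of each file wins, amounts to an update-only copy. A sketch of that semantic (rsync's --update: skip files whose destination copy is newer by mtime), with placeholder paths rather than the reporter's real bricks:

```python
import os
import shutil

def copy_if_newer(src_root, dst_root):
    """Mimic `rsync -u`: copy each file only if the destination copy is
    absent or has an older modification time than the source copy."""
    for dirpath, _dirs, names in os.walk(src_root):
        for name in names:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            if (not os.path.exists(dst)
                    or os.path.getmtime(dst) < os.path.getmtime(src)):
                shutil.copy2(src, dst)  # copy2 preserves the mtime

# Hypothetical recovery: merge every brick (newest mtime wins), then the backup.
# for source in ["/bricks/b0", "/bricks/b1", "/mnt/backup"]:
#     copy_if_newer(source, "/mnt/recovered")
```

Because copy2 preserves mtimes, the merge order of the bricks doesn't matter: whichever brick holds the newest version of a file ends up providing it.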