Bug 765266 (GLUSTER-3534) - Massive amount of missing files after brick power outage
Summary: Massive amount of missing files after brick power outage
Keywords:
Status: CLOSED DEFERRED
Alias: GLUSTER-3534
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.2.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-09-09 23:19 UTC by kyle.sabine
Modified: 2014-12-14 19:40 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-14 19:40:32 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Client Log File (89.81 KB, application/octet-stream)
2011-09-10 20:02 UTC, kyle.sabine
Brick log file (880.41 KB, application/octet-stream)
2011-09-12 12:26 UTC, kyle.sabine

Description Pranith Kumar K 2011-09-09 22:48:42 UTC
Hi Kyle,
      Could you post the "volume info" output and tell us which brick went down and came back up?
Could you confirm if the following is true:
On the backend, the brick that went down and came back up has zero-size files, while the brick that has been running fine has all the data.

Pranith
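
For reference, a minimal sketch of the kind of per-brick check being asked for here, assuming direct shell access to both brick backends; the brick paths are hypothetical placeholders to be replaced with the export directories from "volume info":

import os, stat

BRICKS = ["/export/brick0", "/export/brick1"]   # hypothetical brick paths

for brick in BRICKS:
    zero = total = 0
    for root, _dirs, files in os.walk(brick):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            if not stat.S_ISREG(st.st_mode):    # skip symlinks, sockets, etc.
                continue
            total += 1
            if st.st_size == 0:
                zero += 1
    print("%s: %d zero-byte regular files out of %d" % (brick, zero, total))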

Comment 1 kyle.sabine 2011-09-09 23:19:52 UTC
We have a 9TB volume replicated across two bricks.  One night we had a power outage affecting one of the two bricks.  When the brick that lost power came back online, the two bricks replicated, but 3TB of the 4.1TB of data went missing!  The folder structure remained intact, but the data was gone from the volume and from both individual bricks.

When we run a scan of the volume with testdisk, we can see all the files, but they all show up as deleted 0-byte files.  We now have 3TB of critical data that appears to have gone up in smoke.  No self-heal efforts have returned any of the data...

Comment 2 Vijay Bellur 2011-09-10 01:49:33 UTC
Can you also please provide the client and brick log files?

Comment 3 Alex Aster 2011-09-10 08:31:58 UTC
Hello Kyle, do you use XFS on your backend servers?
I've seen this happen before. XFS does this, which is why we now use only ext4.

Comment 4 kyle.sabine 2011-09-10 20:02:06 UTC
Created attachment 654 (Client Log File)

Comment 5 kyle.sabine 2011-09-12 12:12:18 UTC
Pranith:

Brick 1 is the one that went down while brick 0 remained operational.  On the surface, neither brick has all the files, nor are there any 0-byte files visible on the bricks or on the mounted gluster volume.  Both bricks are missing 3TB of files.  It is only when I run a deleted-file scan with testdisk that I see the 0-byte files.  All of the missing files appear in the deleted-file scan, but they show up as 0-byte files and there is no data in them when they are recovered.

Vijay:

I provided the client log file.  The brick log file is over 500MB so I will have to find a way to host it and send a link.

Alex:

We are using ext4 on both bricks.

Comment 6 kyle.sabine 2011-09-12 12:26:15 UTC
Created attachment 655 (Brick log file)

Comment 7 kyle.sabine 2011-09-12 12:28:23 UTC
I attached a very abridged version of the brick log file.

6/26 is the date that the bricks reconnected and 6/27 is the date of the file loss.

I will still try and find a way to host the full 500MB log file and send a link.

Comment 8 kyle.sabine 2011-09-12 12:57:52 UTC
You can find the full brick log from brick 0 at https://docs.google.com/leaf?id=0B1M-2wIiAsYeNTY2NjUwNDctNDZjNi00ZWU0LWE0M2YtOGVmOWEwZTAwOWQ0&hl=en_US

Comment 9 Joe Julian 2011-09-16 23:33:09 UTC
Looks like you had underlying filesystem problems here: a read-only state, and extundelete being run on the backend, for starters.

You had occasions where you were running glusterfsd twice on the same server for the same brick, which really mixes up the logs, so they're pretty hard to follow, and I'm pretty sure that one of your clients was resolving a server incorrectly and creating linkfiles over the top of your existing files. I bet all your 0-byte files are mode 1000. If they are, your directory entries are lost.
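
To test that hypothesis, here is a minimal sketch of the check, assuming root shell access to a brick backend (the brick path is a hypothetical placeholder); GlusterFS linkfiles are zero-byte files whose permission bits are exactly the sticky bit (mode 1000):

import os, stat

BRICK = "/export/brick0"   # hypothetical backend path

for root, _dirs, files in os.walk(BRICK):
    for name in files:
        path = os.path.join(root, name)
        st = os.lstat(path)
        bits = stat.S_IMODE(st.st_mode)         # permission + sticky bits
        if stat.S_ISREG(st.st_mode) and st.st_size == 0 and bits == 0o1000:
            print(path, oct(bits))

Any paths printed here would be candidates for the linkfile-overwrite scenario described above.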

It's a crappy thing to say, considering you've already said you think all the data's lost, but I'd be remiss if I didn't suggest recovering from a backup if you have one.

Otherwise, I think you're going to have to try to find one of those utilities that scan the drive and look for magic numbers to identify files, and recover what you can.

Comment 10 kyle.sabine 2011-09-19 15:18:55 UTC
We assumed something similar.  Out of curiosity, which dates were you looking at in the logs?  This incident happened in the 6/27/2011 time frame.  Anything after that is recovery attempts.

Comment 11 Joe Julian 2011-09-19 21:05:41 UTC
We've been chatting on IRC and here are some of my observations.

The files that were touched while the server was down seem to be okay. This implies that the pending xattrs were set and self-healed correctly when the server (10.6.154.99) came back up.
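
For anyone wanting to verify that, a minimal sketch of how the pending xattrs could be inspected on a brick backend, run as root (the brick path is a hypothetical placeholder, and the exact trusted.afr.* names depend on the volume's client translator names):

import os

BRICK = "/export/brick0"   # hypothetical backend path

for root, dirs, files in os.walk(BRICK):
    for name in dirs + files:
        path = os.path.join(root, name)
        try:
            attrs = os.listxattr(path, follow_symlinks=False)
        except OSError:
            continue
        for attr in attrs:
            if attr.startswith("trusted.afr."):
                value = os.getxattr(path, attr, follow_symlinks=False)
                print(path, attr, value.hex())

This reports roughly what "getfattr -d -m trusted.afr -e hex <path>" shows; non-zero values indicate pending operations that self-heal still has to replay.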

My idea that sticky links overwrote files is, I believe, invalid as DHT was not in the graph.

There were 3 clients connected at the time of the failure. The logs from one of them suggest that there were pending xattrs on the filesystem that had been corrupted by the power outage.

The filesystem that had been corrupted by the power outage had come back read-only because a manual fsck was needed. That read-only filesystem had some missing directory entries.

There are version mismatches. One server (10.6.154.99) was running 3.2.0, one client (10.6.154.100) was 3.2.0 and the other server/client (10.6.154.98) was running 3.1.3. (Kyle had to leave chat before he could get me the client log for 10.6.154.99)

10.6.154.99  brick  => http://goo.gl/mL6D2
10.6.154.100 client => http://goo.gl/B6H5l

Comment 12 Niels de Vos 2014-11-27 14:54:42 UTC
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will get automatically closed.

