Hi Kyle, could you post the "volume info" output and tell us which brick went down and came back up? Could you also confirm whether the following is true: on the backend, the brick that went down and came back up has zero-size files, while the brick that has been running fine has all the data. Pranith
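A sketch of how that could be checked on each server, assuming the brick is exported at a path like /export/brick0 (a placeholder; adjust to your actual layout):

```shell
# Show the volume definition, including the brick list (requires the
# gluster CLI, so this line is illustrative only):
#   gluster volume info

# On the brick's backend directory, list any zero-length regular files.
# BRICK is a placeholder; point it at the brick's export path.
BRICK="${BRICK:-.}"
find "$BRICK" -type f -size 0 -print
```

Comparing the output from the two bricks would show whether only the brick that went down ended up with zero-size copies.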
We have a 9TB volume replicated across two bricks. One night we had a power outage affecting one of the two bricks. When the brick that lost power came back online, the two bricks replicated, but 3TB of the 4.1TB of data went missing! The folder structure remained intact, but the data was gone from the mounted volume and from both individual bricks. When we ran a scan of the volume with testdisk we could see all the files, but they all show up as deleted 0-byte files. We now have 3TB of critical data that appears to have gone up in smoke. No self-heal efforts have returned any of the data...
Can you also please provide the client and brick log files?
Hello Kyle, do you use XFS on your backend servers? I've seen this kind of thing happen with XFS, which is why we now use only ext4.
Created attachment 654
Pranith: Brick 1 is the one that went down, while brick 0 remained operational. On the surface, neither brick has all the files, nor does either brick have 0-byte files visible on the bricks or the mounted gluster volume. Both bricks are missing 3TB of files. It is only when I run a deleted-file scan via testdisk that I see the 0-byte files. All of the missing files appear in the deleted-file scan, but they show up as 0-byte files and there is no data in them when they are recovered. Vijay: I provided the client log file. The brick log file is over 500MB, so I will have to find a way to host it and send a link. Alex: We are using ext4 on both bricks.
Created attachment 655
I attached a very abridged version of the brick log file. 6/26 is the date the bricks reconnected and 6/27 is the date of the file loss. I will still try to find a way to host the full 500MB log file and send a link.
You can find the full brick log from brick 0 at https://docs.google.com/leaf?id=0B1M-2wIiAsYeNTY2NjUwNDctNDZjNi00ZWU0LWE0M2YtOGVmOWEwZTAwOWQ0&hl=en_US
Looks like you had filesystem problems underlying this: a read-only filesystem, and running extundelete on the backend, for starters. There were occasions where glusterfsd was running twice on the same server for the same brick, which really mixes up the logs and makes them hard to follow, and I'm fairly sure that one of your clients was resolving a server incorrectly and creating linkfiles over the top of your existing files. I bet all your 0-byte files are mode 1000. If they are, your directory entries are lost. It's a crappy thing to say, considering you've already said you think all the data's lost, but I'd be remiss if I didn't suggest recovering from a backup if you have one. Otherwise, I think you're going to have to try one of those utilities that scan the drive looking for magic numbers to identify files and recover what you can.
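To check the mode-1000 theory, something like this on each brick would do (BRICK is a placeholder path; substitute your brick's export directory):

```shell
# Linkfiles are zero-length files whose mode is exactly 1000 (sticky bit
# only, shown by ls as ---------T). List them on the brick backend:
BRICK="${BRICK:-.}"   # placeholder: set to the brick's export path
find "$BRICK" -type f -perm 1000 -size 0 -print
```

If the "missing" files show up here, the bricks hold only pointer entries and the real directory entries are gone.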
We assumed something similar. Out of curiosity, what dates were you looking at in the logs? This incident happened in the 6/27/2011 time frame; anything after that is recovery attempts.
We've been chatting on IRC and here are some of my observations. The files that were touched while the server was down seem to be okay, which implies that the pending xattrs were set and self-healed correctly when the server (10.6.154.99) came back up. My idea that sticky links overwrote files is, I believe, invalid, as DHT was not in the graph. There were 3 clients connected at the time of the failure, and one of their logs suggests there were pending xattrs on the filesystem that was corrupted by the power outage. That corrupted filesystem had come back read-only because a manual fsck was needed, and it had some missing directory entries. There are also version mismatches: one server (10.6.154.99) was running 3.2.0, one client (10.6.154.100) was running 3.2.0, and the other server/client (10.6.154.98) was running 3.1.3. (Kyle had to leave chat before he could get me the client log for 10.6.154.99.)
10.6.154.99 brick => http://goo.gl/mL6D2
10.6.154.100 client => http://goo.gl/B6H5l
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatically closed.