Description of problem:

Summary: The (non-hardware) failure of an encrypted RAID1 mirrored partition was not reported to the user (outside of /var/log/messages and /proc/mdstat). A subsequent unknown change that led to the "failed" partition being revived and the more up-to-date partition being dropped was not reported either. This potentially resulted in significant loss of data, which could have been avoided if it had been reported as hardware disk errors are.

For a blow-by-blow account, see http://forums.fedoraforum.org/showthread.php?t=281211. Otherwise:

(1) Firstly the set-up:
/boot: ext4 partition sda2;
/: md-raid RAID0 (striped) ext4 of sda3 & sdb1;
/var & /tmp: encrypted ext4, sda5 & sdb3 respectively;
/home: encrypted md-raid RAID1 (mirrored) ext4 of sda6 & sdb4;
(Other partitions for BIOS boot and 2 x swap).

(2) At some point, possibly due to a crash (I don't know), sdb4 became regarded as out-of-sync with its mirror sda6. Only sda6 was used and sdb4 was left to drift further out-of-sync:

Jun 10 08:26:55 gareth-desktop kernel: [ 21.679114] md: bind<sda6>
Jun 10 08:26:55 gareth-desktop kernel: [ 21.679768] md: kicking non-fresh sdb4 from array!

Aside from these lines in /var/log/messages on every boot, this was not reported to the user, so I was completely unaware of it.

(3) At a later point, again for reasons wholly unknown (definitely not a crash), the system decided to use sdb4 instead of sda6, silently swapping the file-system being mounted on /home and losing recent files as a result. At this point, becoming aware of the problem, I could re-sync the disks, but because some new files were on one file-system and some on the other, a manual merge process was needed first.

(4) The hardware is fine: both SMART and the RAID0 root file-system across both disks report no problems. The file-systems are also both fine, but diverged.
There are two aspects to this bug: firstly, that nothing was reported, and the only visible effect was the sudden apparently inexplicable disappearance of recent files; and secondly, the apparently random switch between which file-system was actually used.

Steps to Reproduce: I'm not sure how to simulate this situation artificially, as from my perspective it just happened.
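Since the only trace of the degradation was the "kicking non-fresh" kernel message quoted in (2), one way to surface it is to scan the log for that message. The following is a minimal sketch, not an existing tool: the function name `find_kicked_members` is hypothetical, and the pattern is taken only from the single message quoted in this report, so other kernel versions may word it differently.

```python
import re

# Pattern copied from the kernel message quoted in this report:
#   "md: kicking non-fresh sdb4 from array!"
KICK_RE = re.compile(r"md: kicking non-fresh (\S+) from array!")


def find_kicked_members(log_lines):
    """Return device names md kicked as non-fresh, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = KICK_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# The two log lines quoted in the report.
sample = [
    "Jun 10 08:26:55 gareth-desktop kernel: [ 21.679114] md: bind<sda6>",
    "Jun 10 08:26:55 gareth-desktop kernel: [ 21.679768] md: kicking non-fresh sdb4 from array!",
]
print(find_kicked_members(sample))  # ['sdb4']
```

In practice the input would be the lines of /var/log/messages rather than a hard-coded list.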
Please attach /proc/mdstat output, info from /var/log/messages, your /etc/mdadm.conf, and partition information. Please also make sure you have the latest updated mdadm - currently mdadm-3.2.5 is sitting in testing-updates. The fact that the disks get kicked off the raid like this repeatedly sounds like you are having a hardware problem. If the disks are sound, this really shouldn't happen. Jes
I'm away from home at the moment so I won't be able to get at the logs or config etc. until next week. I remember that after I noticed the problem (when the missing/working partitions had already swapped, and sdb4 was now active), /proc/mdstat looked normal, except for the absence of sda6 and "_" instead of "U" in the corresponding status. I didn't see mdstat when sdb4 first went offline before the swap. After re-adding sda6, and successfully re-syncing the array overnight, the problem recurred after a reboot - sdb4 was dropped. This time there was no message about it being kicked in /var/log/messages, but mdstat showed only sda6 as present and "_" for sdb4's status, even though it had been used as the source mirror when re-adding sda6 just the night before. At no point through any of this did I change /etc/mdadm.conf, it was as Anaconda created it. SMART reported both disks as perfectly healthy, and / (RAID0 across the same disks) and all other partitions on both disks are fine. Neither of the file-systems on the mirrored devices was broken either, at least not beyond ext4's journalling abilities. (I mounted the lost sda6 outside of RAID to retrieve the missing files to sdb4 before re-syncing it to the array.) I'll get logs etc. next week.
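The "_" instead of "U" marker described above is what a script could look for to spot a degraded array. Below is a rough sketch under stated assumptions: the function name `degraded_arrays` is hypothetical, array names are assumed to look like `md0`, `md1`, and the sample text is an invented illustration of the usual /proc/mdstat layout, not output from the machine in this report.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+) : ")                  # e.g. "md1 : active raid1 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")  # e.g. "[2/1] [_U]"


def degraded_arrays(mdstat_text):
    """Return {array_name: status_string} for arrays showing a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded


# Invented sample for illustration only (block counts are placeholders).
SAMPLE = """\
Personalities : [raid0] [raid1]
md1 : active raid1 sdb4[1]
      1000000 blocks super 1.1 [2/1] [_U]

md0 : active raid0 sda3[0] sdb1[1]
      2000000 blocks super 1.2 512k chunks

unused devices: <none>
"""
print(degraded_arrays(SAMPLE))  # {'md1': '_U'}
```

On a real system the text would come from reading /proc/mdstat; RAID0 arrays have no such status field and are simply skipped.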
Ok, I am curious to see how your mdadm.conf file looks. The normal way for mdadm to report failures is via an email sent to the email address specified in /etc/mdadm.conf using the MAILADDR variable. If Anaconda didn't set one, then I don't think mdadm will mail out warnings in case of error. Looking briefly through the code, that is what it looks like at least. If there is a MAILADDR entry and no mail was sent out when the failures were detected, that would be a real issue. If there is no MAILADDR entry in the config file, then I would say this is an Anaconda bug that should be addressed there. Cheers, Jes
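Whether a given mdadm.conf carries a MAILADDR entry can be checked mechanically. This is only a sketch: the helper name `mail_addresses` is hypothetical, the sample configuration is invented, and the parsing is deliberately simplified (it treats "#" as a comment start and looks only at the first word after the keyword).

```python
def mail_addresses(conf_text):
    """Return the addresses named on MAILADDR lines, skipping comments and blanks."""
    addresses = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        words = line.split()
        if words[0].upper() == "MAILADDR" and len(words) > 1:
            addresses.append(words[1])
    return addresses


# Invented sample, not the reporter's actual file.
sample_conf = """\
# hypothetical mdadm.conf
MAILADDR root
"""
print(mail_addresses(sample_conf))  # ['root']
```

An empty result would mean no address is configured in the file that was checked.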
Unfortunately, due to the recurrence of this and me needing this machine to work, I gave up on md-raid and switched to Btrfs/RAID instead, which so far is working fine. I no longer have /etc/mdadm, but from what I remember, the email line contained "root" as the address, without any "@localhost" or similar. Please take this with a pinch of salt, as it's from my memory, and it might be what is intended anyway. I did save the logs though. I would suggest that a local email is not a particularly good way to report RAID problems on a desktop in any case. To check the hardware before reinstalling I ran a "badblocks" pass on both drives and rechecked the SMART data, and both drives are perfectly healthy. Btrfs RAID is working perfectly fine. I'm just going through the log files now.
Created attachment 595098
Logs relating to RAID

Generated with:
cat messages* | grep -i '![ae]md\|mdadm\|md0\|md1\|raid\|sda\|sdb' > messages.txt

Notes:
Line 964: Last complete RAID1 array.
Line 1019: First degraded array (sda6 only), sdb4 not mentioned.
Line 1073: Kicking stale sdb4.
Line 2189: Switch from sda6 to sdb4, no mention of sda6, missing files.
Line 2462: Around here I rebuilt the array using sdb4 as source (after separately mounting sda6 and copying missing files).
Line 2559: sda6 only again, no mention of sdb4.
Gareth, It's puzzling the drives get kicked off like that. One question, are they both connected to the same SATA controller? The logs you posted didn't include info about the probing of the drives. Thanks, Jes
I think they are on the same controller – I'm using an ASUS P6T Deluxe motherboard, which has three controllers, but only one of them is plain SATA (6 ports), the others being 2xSAS/SATA and PATA+eSATA. I'll attach a log of a complete boot in a moment, I didn't realize I'd filtered the probing out, sorry!
Created attachment 595761
Complete /var/log/messages of first boot, up to the "firstboot" set-up screen.
An update on F17 and raid error reporting. I did a fresh install on a test system here and created a raid device during the installation. I verified that /etc/mdadm.conf does indeed get the correct MAILADDR line added. I then tried to fail a drive on the array and as expected the error message shows up in root's mail folder. We can certainly discuss whether just defaulting to root is the right thing to do. However if Anaconda should be made to ask for an email address, then that really should be filed as an RFE against Anaconda. I am still curious why your drives will get kicked out of the array though. Jes
Me too. If there's any other information I can provide just ask. As for the error reporting, while email makes sense for a server, it doesn't seem right for a desktop, where the local email system isn't really connected to anything anyway. A direct on-screen notification would make more sense, but I'm not sure how practical that is to implement.
Gareth, Thanks for the log. I looked at it, and there is about a 1 second delay between the probing of the two SATA drives, with the DVD drive showing up in the middle. This really shouldn't make a difference (I have seen issues where some of the drives are on a separate controller and the probe delay is > 10 seconds). It could be worth trying to move the DVD drive to a different port so it is found after the hard drives, but it really shouldn't matter. That said, everything here points to mdadm having reported the errors as expected, but they weren't noticed since you weren't monitoring the root mail address (like most users, as you rightfully point out). Now the issue is how/where to address it. Writing an mdadm-specific tool that pops up a warning would be rather silly. What really needs to be implemented is some daemon-level thing that can monitor all the different types of storage and report errors that way. To be honest, I don't know what is currently happening for other things, like SMART, dm-raid, fail-over etc., so I'm not sure where we should file this RFE. Cheers, Jes
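To make the "daemon-level" idea a little more concrete, here is one possible shape for it, purely as an illustration: every storage type registers a check, and all problems go through a single notification channel. All names here (`run_checks`, `fake_md_check`, `fake_smart_check`) are hypothetical, and the checks are stand-ins that return canned results rather than querying real hardware.

```python
from typing import Callable, Dict, List

# A check returns a list of human-readable problem descriptions (empty if healthy).
Check = Callable[[], List[str]]


def run_checks(checks: Dict[str, Check], notify: Callable[[str], None]) -> int:
    """Run every registered check once, pass each problem to notify(), return the count."""
    count = 0
    for source, check in checks.items():
        for problem in check():
            notify(f"[{source}] {problem}")
            count += 1
    return count


# Hypothetical stand-in checks; real ones would inspect md, SMART, dm-raid, etc.
def fake_md_check() -> List[str]:
    return ["md1 is degraded: [_U]"]


def fake_smart_check() -> List[str]:
    return []


messages: List[str] = []
total = run_checks({"md-raid": fake_md_check, "smart": fake_smart_check}, messages.append)
print(total, messages)  # 1 ['[md-raid] md1 is degraded: [_U]']
```

Keeping `notify` as a parameter means the same checks could feed local mail on a server or an on-screen notification on a desktop without changing the checks themselves.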
Gareth, I have created a new bug to handle the issue about what email address to send the error messages to. I think we should start there with the issue, I will also try and start a discussion on how to handle this on a broader scale. I am going to close this bug since the problem itself doesn't seem to be in mdadm. Cheers, Jes