Bug 838957

Summary: Anaconda should explicitly ask for an email address to which to report RAID events during installation
Product: Fedora
Reporter: Jes Sorensen <Jes.Sorensen>
Component: distribution
Assignee: Radek Vokál <rvokal>
Status: CLOSED EOL
QA Contact: Radek Vokál <rvokal>
Severity: unspecified
Priority: unspecified
Version: 23
CC: agk, amulhern, bmr, dcantrell, dennis, dledford, gareth.k.jones, g.kaviyarasu, Jes.Sorensen, jmoran, jonathan, vanmeeuwen+fedora
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Clone Of: 832616
Last Closed: 2016-12-20 12:14:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 832616
Bug Blocks: 1061711

Description Jes Sorensen 2012-07-10 13:54:49 UTC
Currently, when Anaconda creates an MD RAID array during installation, it adds
a MAILADDR root line to /etc/mdadm.conf.
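
For reference, the relevant piece of an Anaconda-written config looks roughly
like the following sketch (the ARRAY line and UUID are illustrative, not taken
from this report; mdadm.conf uses a space rather than "=" between keyword and
value):

    # /etc/mdadm.conf - mail target for mdadm --monitor events
    MAILADDR root
    # illustrative array line; the UUID here is made up
    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=3600a8f4:c71d5c91:0fdd9586:12ab4c5e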

While this is technically correct, in reality a lot of users do not monitor
root's mailbox, especially if they rely on an external IMAP service of some
sort. It would therefore make more sense for Anaconda to ask during
installation for an email address to which mdadm events (and probably other
error/notification events, which currently also default to root) should be
reported.

Please see the bug report below for the previous discussion.

Jes


+++ This bug was initially created as a clone of Bug #832616 +++

Description of problem:

Summary:
The (non-hardware) failure of an encrypted RAID1 mirrored partition was not reported to the user (outside of /var/log/messages and /proc/mdstat).  A subsequent unknown change that led to the "failed" partition being revived and the more up-to-date partition being dropped was not reported either.  This potentially resulted in significant loss of data, which could have been avoided had it been reported the way hardware disk errors are.

For a blow-by-blow account, see http://forums.fedoraforum.org/showthread.php?t=281211.  Otherwise:

(1) Firstly the set-up:
/boot: ext4 partition sda2;
/: md-raid RAID0 (striped) ext4 of sda3 & sdb1;
/var & /tmp: encrypted ext4, sda5 & sdb3 respectively;
/home: encrypted md-raid RAID1 (mirrored) ext4 of sda6 & sdb4;
(Other partitions for BIOS boot and 2 x swap).

(2) At some point, possibly due to a crash (I don't know), sdb4 became regarded as out-of-sync with its mirror sda6.  Only sda6 was used and sdb4 was left to drift further out-of-sync:
    Jun 10 08:26:55 gareth-desktop kernel: [   21.679114] md: bind<sda6>
    Jun 10 08:26:55 gareth-desktop kernel: [   21.679768] md: kicking non-fresh sdb4 from array!
Aside from these lines in /var/log/messages on every boot, this was not reported to the user, so I was completely unaware of it.
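
For anyone hitting the same symptom, the degraded state can be confirmed by hand; a minimal sketch, assuming /dev/md1 is the array holding /home (the device name is a stand-in, not taken from the logs):

    # overall state; a degraded two-disk RAID1 shows "[U_]" or "[_U]"
    cat /proc/mdstat
    # per-array detail, including which member is missing or faulty
    mdadm --detail /dev/md1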

(3) At a later point, again for reasons wholly unknown (definitely not a crash), the system decided to use sdb4 instead of sda6, silently swapping the file-system mounted on /home and losing recent files as a result.  At that point, having become aware of the problem, I could re-sync the disks, but because some new files were on one file-system and some on the other, a manual merge was needed first.
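
The re-sync described above amounts to something like the following, again assuming /dev/md1 for the /home array (sdb4 is the member named in this report):

    # re-add the kicked member; mdadm re-syncs it from the active mirror
    mdadm /dev/md1 --re-add /dev/sdb4
    # if --re-add is refused, --add triggers a full rebuild instead
    mdadm /dev/md1 --add /dev/sdb4
    # watch the rebuild progress
    cat /proc/mdstat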

(4) The hardware is fine: SMART reports no problems, and the RAID0 root file-system across both disks is fine.  The file-systems on the mirror are also both fine, but diverged.

There are two aspects to this bug: firstly, that nothing was reported, and the only visible effect was the sudden, apparently inexplicable disappearance of recent files; and secondly, the apparently random switch in which file-system was actually used.


Steps to Reproduce:
I'm not sure how to simulate this situation artificially, as from my perspective it just happened.

--- Additional comment from Jes.Sorensen on 2012-06-18 11:58:47 EDT ---

Please attach /proc/mdstat output, info from /var/log/messages, your
/etc/mdadm.conf, and partition information.
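
Something like the following collects all of the requested information (the output file name is arbitrary; run as root):

    # gather RAID diagnostics into a single file
    cat /proc/mdstat  > raid-info.txt
    cat /etc/mdadm.conf >> raid-info.txt
    grep -i 'md\|raid' /var/log/messages >> raid-info.txt
    fdisk -l >> raid-info.txt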

Please also make sure you have the latest mdadm update - currently
mdadm-3.2.5 is sitting in updates-testing.

The fact that the disks get kicked out of the RAID like this repeatedly
sounds like you are having a hardware problem. If the disks are sound,
this really shouldn't happen.

Jes

--- Additional comment from gareth.k.jones on 2012-06-18 13:07:41 EDT ---

I'm away from home at the moment so I won't be able to get at the logs or config etc. until next week.

I remember that after I noticed the problem (when the missing/working partitions had already swapped, and sdb4 was now active), /proc/mdstat looked normal, except for the absence of sda6 and "_" instead of "U" in the corresponding status.  I didn't see mdstat when sdb4 first went offline before the swap.  After re-adding sda6 and successfully re-syncing the array overnight, the problem recurred after a reboot: sdb4 was dropped.  This time there was no message in /var/log/messages about it being kicked, but mdstat showed only sda6 as present and "_" for sdb4's status, even though sdb4 had been used as the source mirror when re-adding sda6 just the night before.  At no point throughout any of this did I change /etc/mdadm.conf; it was as Anaconda created it.

SMART reported both disks as perfectly healthy, and / (RAID0 across the same disks) and all other partitions on both disks are fine.  Neither of the file-systems on the mirrored devices was broken either, at least not beyond ext4's journalling abilities.  (I mounted the lost sda6 outside of RAID to retrieve the missing files to sdb4 before re-syncing it to the array.)

I'll get logs etc. next week.

--- Additional comment from Jes.Sorensen on 2012-06-20 04:15:27 EDT ---

Ok, I am curious to see how your mdadm.conf file looks.

The normal way for mdadm to report failures is via an email sent to the
email address specified in /etc/mdadm.conf using the MAILADDR variable.
If Anaconda didn't set one, then I don't think mdadm will mail out warnings
in case of error. Looking briefly through the code, that is what it looks
like at least.

If there is a MAILADDR entry and no mail was sent out when the failures were
detected, that would be a real issue.

If there is no MAILADDR entry in the config file, then I would say this is
an Anaconda bug that should be addressed there.
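
As an aside, the mail path can be verified end-to-end with mdadm's test alert, assuming a MAILADDR entry is present and a local MTA is running:

    # send a TestMessage alert for every array in mdadm.conf, then exit;
    # the mail should land in the MAILADDR mailbox (root by default)
    mdadm --monitor --scan --test --oneshot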

Cheers,
Jes

--- Additional comment from gareth.k.jones on 2012-06-28 13:28:41 EDT ---

Unfortunately, due to the recurrence of this and me needing this machine to work, I gave up on md-raid and switched to Btrfs RAID instead, which so far is working fine.  I no longer have /etc/mdadm.conf, but from what I remember, the email line contained "root" as the address, without any "@localhost" or similar.  Please take this with a pinch of salt, as it's from memory, and it might be what is intended anyway.  I did save the logs though.  I would suggest that local email is not a particularly good way to report RAID problems on a desktop in any case.

To check the hardware before reinstalling, I ran a "badblocks" pass on both drives and rechecked the SMART data; both drives are perfectly healthy.  Btrfs RAID is working perfectly fine.
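
The hardware checks described amount to roughly the following (smartctl comes from the smartmontools package; badblocks is read-only by default but slow on large drives):

    # non-destructive read-only surface scan of each drive
    badblocks -sv /dev/sda
    badblocks -sv /dev/sdb
    # overall SMART health verdict per drive
    smartctl -H /dev/sda
    smartctl -H /dev/sdb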

I'm just going through the log files now.

--- Additional comment from gareth.k.jones on 2012-06-28 13:54:21 EDT ---

Created attachment 595098 [details]
Logs relating to RAID

Generated with: cat messages* | grep -i '![ae]md\|mdadm\|md0\|md1\|raid\|sda\|sdb' > messages.txt

Notes:
Line 964: Last complete RAID1 array.
Line 1019: First degraded array (sda6 only), sdb4 not mentioned.
Line 1073: Kicking stale sdb4.
Line 2189: Switch from sda6 to sdb4, no mention of sda6, missing files.
Line 2462: Around here I rebuilt the array using sdb4 as source (after separately mounting sda6 and copying missing files).
Line 2559: sda6 only again, no mention of sdb4.

--- Additional comment from Jes.Sorensen on 2012-07-02 10:16:48 EDT ---

Gareth,

It's puzzling that the drives get kicked out like that. One question: are they
both connected to the same SATA controller?

The logs you posted didn't include info about the probing of the drives.

Thanks,
Jes

--- Additional comment from gareth.k.jones on 2012-07-02 11:24:19 EDT ---

I think they are on the same controller – I'm using an ASUS P6T Deluxe motherboard, which has three controllers, but only one of them is plain SATA (6 ports), the others being 2xSAS/SATA and PATA+eSATA.  I'll attach a log of a complete boot in a moment, I didn't realize I'd filtered the probing out, sorry!

--- Additional comment from gareth.k.jones on 2012-07-02 11:25:43 EDT ---

Created attachment 595761 [details]
Complete /var/log/messages of first boot, up to the "firstboot" set-up screen.

--- Additional comment from Jes.Sorensen on 2012-07-02 16:20:30 EDT ---

An update on F17 and raid error reporting.

I did a fresh install on a test system here and created a raid device
during the installation. I verified that /etc/mdadm.conf does indeed get
the correct MAILADDR line added.

I then tried to fail a drive on the array, and as expected the error message
showed up in root's mail folder.
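
For anyone wanting to repeat this test, manually failing a member triggers the same alert path; a sketch, with /dev/md0 and /dev/sdb1 as placeholder names (this degrades the array, so only do it on a test system):

    # mark a member as failed to trigger mdadm's Fail event (and the mail)
    mdadm /dev/md0 --fail /dev/sdb1
    # afterwards, remove and re-add it to rebuild the mirror
    mdadm /dev/md0 --remove /dev/sdb1
    mdadm /dev/md0 --add /dev/sdb1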

We can certainly discuss whether just defaulting to root is the right thing
to do. However, if Anaconda should be made to ask for an email address, then
that really should be filed as an RFE against Anaconda.

I am still curious why your drives keep getting kicked out of the array, though.

Jes

--- Additional comment from gareth.k.jones on 2012-07-02 17:10:06 EDT ---

Me too.  If there's any other information I can provide just ask.

As for the error reporting, while email makes sense for a server, it doesn't seem right for a desktop, where the local email system isn't really connected to anything anyway.  A direct on-screen notification would make more sense, but I'm not sure how practical that is to implement.

--- Additional comment from Jes.Sorensen on 2012-07-09 08:21:49 EDT ---

Gareth,

Thanks for the log - I looked at it, and there is about a 1 second delay
between the probing of the two SATA drives, with the DVD drive showing
up in the middle. This really shouldn't make a difference (I have seen
issues where some of the drives are on a separate controller and the
probe delay is > 10 seconds). You could try moving the DVD drive
to a different port so it is found after the hard drives, but it really
shouldn't matter.

That said, everything here points to mdadm having reported the
errors as expected, but they went unnoticed since you weren't monitoring
the root mail address (like most users, as you rightfully point out).

Now the issue is how/where to address it. Writing an mdadm-specific tool
that pops up a warning would be rather silly. What really needs to be
implemented is some daemon-level service that can monitor all the
different types of storage and report errors that way.

To be honest, I don't know what currently happens for other things,
like SMART, dm-raid, fail-over, etc., so I'm not sure where we should file
this RFE.

Cheers,
Jes

Comment 1 Doug Ledford 2012-07-10 14:08:52 UTC
Note: it would probably be best not to do this as a change to the MAILADDR in the mdadm.conf file, but by using the aliases feature of the mailer program instead, to redirect all mail for root to some entered email address.  It does, however, require knowing which email backend is being installed and how to configure it.
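
Concretely, that would mean leaving MAILADDR alone and redirecting root's mail via /etc/aliases (the address below is a placeholder; this works with sendmail-compatible MTAs such as Postfix):

    # append to /etc/aliases: redirect all local mail for root to a real mailbox
    echo 'root: someone@example.com' >> /etc/aliases
    # rebuild the aliases database
    newaliases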

Comment 2 Jesse Keating 2012-07-11 22:06:58 UTC
This really isn't something anaconda should be doing.  A post-install setup program should take care of this.

Comment 3 Jes Sorensen 2012-07-12 06:14:13 UTC
Rather than closing it, reassign it appropriately.

Only Anaconda knows what has been done during the setup of the system, so
it needs to be involved somehow.

Comment 4 Chris Lumens 2012-07-12 13:48:49 UTC
This sort of notification is a system-wide problem, and it needs a system-wide solution.  It's come up in other areas before.  I don't know what the solution will look like, but anaconda's really not going to be involved besides just writing out a value somewhere.

Comment 5 Fedora End Of Life 2013-07-04 06:22:05 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 reaches end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 6 Fedora End Of Life 2013-12-21 15:04:16 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 reaches end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Jan Kurik 2015-07-15 15:06:28 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could also affect pre-Fedora 23 development-cycle
bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 10 Fedora End Of Life 2016-11-24 10:41:16 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Fedora End Of Life 2016-12-20 12:14:43 UTC
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.