Bug 710447 - Anaconda fails with "Could not commit to disk /dev/sdb" with software raid
Summary: Anaconda fails with "Could not commit to disk /dev/sdb" with software raid
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: anaconda
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Anaconda Maintenance Team
QA Contact: Release Test Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-06-03 12:32 UTC by Jonathan Underwood
Modified: 2018-11-28 20:45 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-03 15:29:37 UTC
Target Upstream Version:


Attachments (Terms of Use)
Anaconda log (7.19 KB, text/plain) - 2011-06-03 12:33 UTC, Jonathan Underwood
Kickstart (12.42 KB, text/plain) - 2011-06-03 12:34 UTC, Jonathan Underwood
ks-pre log (15.63 KB, text/plain) - 2011-06-03 12:34 UTC, Jonathan Underwood
storage.log (152.40 KB, text/plain) - 2011-06-03 12:35 UTC, Jonathan Underwood
syslog (48.48 KB, text/plain) - 2011-06-03 12:36 UTC, Jonathan Underwood
traceback (252.58 KB, text/plain) - 2011-06-03 14:42 UTC, Jonathan Underwood
logs, kickstart etc demonstrating failure mode (63.11 KB, application/x-gzip) - 2011-06-03 17:07 UTC, Jonathan Underwood

Description Jonathan Underwood 2011-06-03 12:32:51 UTC
Description of problem:
I have a machine with two identical disks onto which I am installing with mdraid RAID 1 via a network kickstart install, like this:

zerombr
clearpart --all --initlabel 
ignoredisk --only-use=sda,sdb
bootloader  --location=mbr --driveorder=sda,sdb
# /boot
part raid.01 --asprimary --size=1024 --ondisk=sda
part raid.02 --asprimary --size=1024 --ondisk=sdb
# /
# Note that we add --grow here. We'd need to remove this if the two disks weren't the same size!
part raid.11 --asprimary --size=61440 --ondisk=sda --grow 
part raid.12 --asprimary --size=61440 --ondisk=sdb --grow
# <swap>
part raid.21 --asprimary --size=4096 --ondisk=sda
part raid.22 --asprimary --size=4096 --ondisk=sdb
# Format /boot and /.
raid /boot --fstype=ext4 --level=1 --device=md0 raid.01 raid.02
raid /     --fstype=ext4 --level=1 --device=md1 raid.11 raid.12
raid swap  --fstype=swap --level=1 --device=md2 raid.21 raid.22

Consistently, the first time I try to reinstall this machine, anaconda bails out with the error message "Could not commit to disk /dev/sdb". If I then reboot and restart the installation, it installs fine. At first I thought this was some issue with a persistent RAID superblock being found on the disks and confusing things, so I've gone to some lengths in the kickstart to try to obliterate the superblock (sketched below); since that hasn't helped, I don't think it is the root cause, and I am now scratching my head.
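Conceptually, the cleanup I'm attempting in %pre is something like this (an illustrative, untested sketch with example device names, not the literal %pre; the exact commands are in the attached kickstart and ks-pre log):

  # Illustrative only: stop any leftover md arrays and wipe the member
  # superblocks so a previous RAID install does not linger on the disks.
  for md in /dev/md[0-9]*; do
      [ -e "$md" ] || continue    # skip if the glob matched nothing
      mdadm --stop "$md"
  done
  mdadm --zero-superblock /dev/sda1 /dev/sda2 /dev/sda3 \
                          /dev/sdb1 /dev/sdb2 /dev/sdb3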





Comment 1 Jonathan Underwood 2011-06-03 12:33:36 UTC
Created attachment 502798 [details]
Anaconda log

Comment 2 Jonathan Underwood 2011-06-03 12:34:23 UTC
Created attachment 502799 [details]
Kickstart

Comment 3 Jonathan Underwood 2011-06-03 12:34:59 UTC
Created attachment 502800 [details]
ks-pre log

Comment 4 Jonathan Underwood 2011-06-03 12:35:29 UTC
Created attachment 502801 [details]
storage.log

Comment 5 Jonathan Underwood 2011-06-03 12:36:01 UTC
Created attachment 502802 [details]
syslog

Comment 6 Jonathan Underwood 2011-06-03 12:37:57 UTC
Seems that this has been seen elsewhere too:

https://www.redhat.com/archives/rhelv6-beta-list/2010-May/msg00177.html

Comment 8 Jonathan Underwood 2011-06-03 14:42:36 UTC
Created attachment 502817 [details]
traceback

Comment 9 David Lehman 2011-06-03 15:27:12 UTC
From your ks-pre log:

find /dev -name md[0-9]* -exec umount '{}' \;
+ find /dev -name md0.dmfhJB md1.vGDHcZ md2.KU252P -exec umount '{}' ';'
+ sleep 10
find: paths must precede expression: md1.vGDHcZ
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]
find /dev -name md[0-9]* -exec mdadm -S '{}' \;
+ find /dev -name md0.dmfhJB md1.vGDHcZ md2.KU252P -exec mdadm -S '{}' ';'
find: paths must precede expression: md1.vGDHcZ
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]


You should escape the '*' in your find expression -- otherwise the shell will expand it as a glob against the current directory, e.g.:

If the current directory contains fooX and fooY and you run this command:

  find /somedir -name foo* -exec <whatever>

the command you are actually running is this:

  find /somedir -name fooX fooY -exec <whatever>

regardless of what is in /somedir. This is probably not what you want. If /somedir contains fooA and fooZ, your find will fail because of the shell expansion. To fix it, just use this instead:

  find /somedir -name foo\* -exec <whatever>
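Applied to the commands in your ks-pre log, that just means quoting (or escaping) the pattern, roughly:

  find /dev -name 'md[0-9]*' -exec umount '{}' \;
  find /dev -name 'md[0-9]*' -exec mdadm -S '{}' \;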

Comment 10 David Lehman 2011-06-03 15:29:37 UTC
The end result of this is that you are never deactivating the RAID arrays and thus are putting the system into an inconsistent state before running anaconda. Parted thinks the disks have no partitions but the kernel thinks otherwise because those preexisting partitions have been held open by md.
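One way to see the mismatch for yourself from a shell during the install is to compare what the kernel and md report with what parted reports, along these lines (illustrative commands, not anything anaconda runs for you):

  cat /proc/mdstat          # arrays md is still holding open
  cat /proc/partitions      # partitions the kernel still knows about
  parted -s /dev/sdb print  # the empty layout parted sees on the disk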

Comment 11 Jonathan Underwood 2011-06-03 15:42:56 UTC
Hello David,

Yes, you're right - fixing that fixed the problem. It seems I was on the right track in suspecting that the presence of the old superblock was the cause; all that stuff in %pre was really just an attempt to work around it. Shouldn't anaconda take care of properly removing RAID superblocks etc. by virtue of "clearpart --all"? This is probably more of an RFE than a bug report, but still worth considering, I think.

Thanks again for the pointer.

Comment 12 David Lehman 2011-06-03 15:57:35 UTC
clearpart should remove the raid superblocks. If you'd like to attach logs showing what happens when you omit the %pre I would be happy to take a look at them to see what's going on.

Comment 13 Jonathan Underwood 2011-06-03 16:55:56 UTC
OK, looking a bit closer here is what I find:

clearpart does successfully remove the superblock if nothing in %pre has caused md to activate the RAID array. Good.

Previously, I was activating the RAID array(s) in %pre in order to save the old ssh host keys prior to installation and then, importantly, not shutting the array(s) back down afterwards. That causes anaconda to fall over in a rather odd way: it gets quite a long way towards building the new RAID array before failing (in the manner I originally reported).
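For the record, the safe version of that %pre would presumably look something like this (an untested sketch; the md and partition names are illustrative, not the exact ones from my kickstart):

  # Assemble the old root array read-only, rescue the ssh host keys,
  # then tear everything back down so md is not holding the partitions
  # open when anaconda repartitions the disks.
  mdadm --assemble /dev/md1 /dev/sda2 /dev/sdb2
  mkdir -p /mnt/oldroot
  mount -o ro /dev/md1 /mnt/oldroot
  cp -a /mnt/oldroot/etc/ssh/ssh_host_* /tmp/
  umount /mnt/oldroot
  mdadm --stop /dev/md1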

One could of course argue that I'm an idiot for not deactivating the raid arrays at the end of %pre, and that anaconda can't protect against %pre lunacy. That said, I could imagine that either of these anaconda behaviours would be better in such situations:

1) clearpart deactivates any raid arrays that are active but are part of the set of devices about to be (re)partitioned; or

2) clearpart checks first to see if any of the devices (raid or otherwise) that are about to be partitioned are activated/mounted before proceeding and errors out at that point before writing anything to the disks.

Option 2 is probably the safer bet, as option 1 sounds like a good bullet to shoot yourself in the foot with :).

I'll attach new logs in a second.

Comment 14 Jonathan Underwood 2011-06-03 17:07:25 UTC
Created attachment 502865 [details]
logs, kickstart etc demonstrating failure mode

Comment 15 David Lehman 2011-06-03 17:27:37 UTC
If you remove --initlabel from your clearpart command you might get the behavior you were hoping for. I'm not sure, though, since there isn't any valid case for testing already-active mdraid or lvm in RHEL.
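That is, something along the lines of:

  clearpart --all

rather than

  clearpart --all --initlabel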

