Bug 510772

Summary: F11 Anaconda crashing on system w/ SW RAID
Product: [Fedora] Fedora
Reporter: Joe Christy <joe.christy>
Component: anaconda
Assignee: Radek Vykydal <rvykydal>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 12
CC: bobgus, hdegoede, jones, mikolaj, pjones, rmaximo, vamsee, vanmeeuwen+fedora
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: anaconda_trace_hash:401a1160798dce90218a71edb067d4440c6529cc9ba67ec0cb62401f5d1c69be
Doc Type: Bug Fix
Last Closed: 2010-08-13 09:46:58 EDT
Attachments (flags: none on all):
  Attached traceback automatically from anaconda.
  the traceback from anaconda after passing nodmraid to the install disc kernel
  output from dmraid -ay -t during install
  result of grep'ing lspci output for SATA controller
  automatic dump on peter jones' machine
  dmesg output from Peter Jones' machine, running F10
  lspci -v output from Peter Jones' machine, running F10
  fdisk -l output from Peter Jones' machine, running F10
  Anaconda logs when it fails with not enough space

Description Joe Christy 2009-07-10 13:40:20 EDT
The following was filed automatically by anaconda:
anaconda 11.5.0.59 exception report
Traceback (most recent call first):
  File "/usr/lib64/python2.6/site-packages/block/__init__.py", line 35, in dm_log
    raise Exception, message
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 49, in removeDeviceMap
    map.remove()
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 46, in removeDeviceMap
    removeDeviceMap(m)
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 843, in deactivate
    removeDeviceMap(self._RaidSet__map)
  File "/usr/lib/anaconda/storage/devices.py", line 2664, in teardown
    self._raidSet.deactivate()
  File "/usr/lib/anaconda/storage/devices.py", line 281, in teardownParents
    parent.teardown(recursive=recursive)
  File "/usr/lib/anaconda/storage/devices.py", line 556, in teardown
    self.teardownParents(recursive=recursive)
  File "/usr/lib/anaconda/storage/devicetree.py", line 1707, in teardownAll
    device.teardown(recursive=True)
  File "/usr/lib/anaconda/storage/devicetree.py", line 1697, in populate
    self.teardownAll()
  File "/usr/lib/anaconda/storage/__init__.py", line 302, in reset
    self.devicetree.populate()
  File "/usr/lib/anaconda/storage/__init__.py", line 102, in storageInitialize
    storage.reset()
  File "/usr/lib/anaconda/dispatch.py", line 205, in moveStep
    rc = stepFunc(self.anaconda)
  File "/usr/lib/anaconda/dispatch.py", line 128, in gotoNext
    self.moveStep()
  File "/usr/lib/anaconda/gui.py", line 1339, in nextClicked
    self.anaconda.dispatch.gotoNext()
Exception: device-mapper: remove ioctl failed: Device or resource busy
Comment 1 Joe Christy 2009-07-10 13:40:28 EDT
Created attachment 351284 [details]
Attached traceback automatically from anaconda.
Comment 2 Joe Christy 2009-07-10 13:48:03 EDT
This occurred during an install of F11 from an x86_64 installation DVD (media check passed) onto a dual-drive Thinkpad W700 previously running F10 with everything but /boot on a SW RAID partition.

I had checked the install medium (which I had previously used successfully on a different x86_64 system), chosen my language and keyboard, then hit "Next" when ker-blooey!
Comment 3 Joe Christy 2009-07-10 15:52:09 EDT
I should add that this machine came from Lenovo with "RAID1" supplied via Intel Matrix Storage Manager, which F10 didn't recognize, hence the F10 MD RAID1 install on the two disks. Nonetheless F11 seems to be detecting OEMRAID1; could this be the root of my problem?
Comment 4 Peter Jones 2009-07-10 16:14:14 EDT
That seems very likely to be the cause of the problem, yes.
Comment 5 Joe Christy 2009-07-10 17:12:49 EDT
Aha!

The BIOS is set for AHCI mode rather than RAID mode for the SATA Controller, FWIW.
Comment 6 Joe Christy 2009-07-10 18:15:22 EDT
A few more data points:
Passing nodmraid as a bootloader arg in anaconda had no practical effect; see 2nd attached anacdump.txt

Also, I captured some info on dmraid's view of my laptop, attached as well.

Sigh - back to F10, which somehow installed in April.
Comment 7 Joe Christy 2009-07-10 18:17:08 EDT
Created attachment 351308 [details]
the traceback from anaconda after passing nodmraid to the install disc kernel
Comment 8 Joe Christy 2009-07-10 18:18:11 EDT
Created attachment 351309 [details]
output from dmraid -ay -t during install
Comment 9 Joe Christy 2009-07-10 18:19:23 EDT
Created attachment 351310 [details]
result of grep'ing lspci output for SATA controller
Comment 10 Bob Gustafson 2009-07-12 14:07:28 EDT
Interesting...

I have two systems on which I successfully installed F11 w/RAID1 (full wipe install - 3 partitions: /boot, swap, /), but both were software RAID before (F10) and had no problems. (Both are 32-bit systems, one IDE, the other SCSI.)

My third system is x86_64 and has ICH9 (?) motherboard RAID. It is stuck at F9 because of Anaconda RAID problems in F10. I was thinking of switching off the motherboard RAID and going with software-only RAID, but your problem gives me pause..
Comment 11 Joe Christy 2009-07-12 17:42:31 EDT
Bob - I got around the RAID issues w/ F10 by switching the controller from RAID to AHCI in the BIOS, which, I fear, is the root of my current problem. For me, switching RAID back on in the BIOS would clobber my existing data, so no rollback w/out much pain.

OTOH, it sounds like you already have RAID on in the BIOS, so from my limited experience, an install of F11, if it were going to fail, would fail before it actually touched the discs, so what's to lose? It's an experiment that would have much less downside for you, if you're willing.

My install fails while anaconda is trying to figure out what to do with the discs and, I conjecture, gets confused by the seeming co-existence of mdraid and dmraid (which was undetected by F10), long before it actually does anything to them :(.

Am I to take it that F9 installed over the ICH9 RAID? If so, maybe there was a regression in F10 that has now been corrected in F11.
Comment 12 Bob Gustafson 2009-07-12 18:22:53 EDT
Joe: I installed F9 over F8 on the ICH9 RAID without any problems.

I have contributed to bug reports since then on Anaconda's failings when it comes to RAID (search on my name in bugzilla.. - all bugs - even 'closed')

My ICH9 x86_64 system is also my main gateway/mail/dns/nas system, so it can't be down for any real length of time without affecting my wife's computer and our phone system (asterisk..), so I need to configure one of the other systems to take over those duties while I flail.

Rather than do an 'update', I think doing a full wipe and letting Anaconda do its thing with /boot ext3 and / ext4 is more reasonable. I also have more RAM and so the swap file needs to be bigger.

It will take awhile for me to make the move. I was thinking of getting another system, but I already have too many keyboards on my desk..

If having hardware RAID, even if not used - gives problems -- this is not good news.

Can anyone else confirm if they have F11 w/ Software Raid running on an ICH9 x86_64 system?
Comment 13 Hans de Goede 2009-07-14 04:57:13 EDT
Joe,

The problem is that anaconda is still seeing the Intel BIOS-RAID metadata
on your disks (see the dmraid -ay -t output), and it is also seeing the mdraid software RAID metadata. This combination is confusing it (granted, it should not crash).

If I read your comment in bug 489148 correctly, you are willing to do a full install. In that case I can advise either of the following two scenarios:

1) Remove the BIOS-RAID metadata from your disks:
   Enable RAID in your BIOS, enter the OROM setup (ctrl-I) and reset the disks
   to non raid status (this is something which you should have done in the
   past before disabling the RAID in the BIOS, so that the disks would not
   be seen as BIOS-RAID by Linux now).
   Then disable BIOS-RAID again.

2) Switch to using BIOS-RAID (so enable it again in your BIOS):
   Remove the mdraid metadata using mdadm --zero /dev/sda# /dev/sdb#
   where # is the partition number of the partitions which make up your
   software raidset, you can do this from the installer on tty2
   (ctrl + alt + F2) before pressing next on the welcome screen (so before the
   initial storage scan). I'm not sure if this will work.

I'm leaving this bug open to track the backtrace, because as said that should not happen.
Comment 14 Hans de Goede 2009-07-14 05:01:26 EDT
rvykydal,

I've analysed the attached log file, here is what is happening:
1) We correctly identify the BIOS-RAID set and bring it online using
   dmraid
2) Thus we now only see one of the 2 partitions which were used to make the
   mdraid set in F-10 (where we did not identify the BIOS-RAID set).
3) We thus have an incomplete mdraid set
4) When tearing down everything at the end of the initial storage scan,
   the mdraid set is not stopped (I guess because it is incomplete it
   returns False as status, causing it to not be stopped)
5) When we get to tearing down the dmraid array, the mdraid partition is
   still in use by mdraid (as the set was not stopped) -> boom

So I think we need to fix 4 (which is a larger issue than this bug alone) and
make sure we also stop mdraid sets which are incomplete when tearing down
storage.
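
The sequence in steps 1-5 can be sketched with a toy model. This is not anaconda code: the Device class, the holders bookkeeping, the require_status switch and the device names are all hypothetical, chosen only to show why skipping an incomplete md set leaves the layer below it busy.

```python
class BusyError(Exception):
    """Stands in for 'remove ioctl failed: Device or resource busy'."""


class Device:
    """Hypothetical stand-in for a storage device in a teardown stack."""

    def __init__(self, name, parents=(), complete=True):
        self.name = name
        self.parents = list(parents)
        self.complete = complete  # False for an md set missing a member
        self.active = True
        self.holders = set()      # active devices stacked on top of this one
        for parent in self.parents:
            parent.holders.add(self)

    @property
    def status(self):
        # Assumption mirroring step 4: an incomplete set reports False
        # even though it still holds its member device open.
        return self.active and self.complete

    def teardown(self, require_status=True):
        if require_status and not self.status:
            return  # skipped because the device looks "not running"
        if self.holders:
            names = ", ".join(sorted(h.name for h in self.holders))
            raise BusyError("%s is busy (held by %s)" % (self.name, names))
        self.active = False
        for parent in self.parents:
            parent.holders.discard(self)


def build_stack():
    """BIOS-RAID set -> one visible partition -> incomplete md set."""
    raidset = Device("biosraid_set")                    # hypothetical name
    part = Device("biosraid_set_p2", parents=[raidset])
    md0 = Device("md0", parents=[part], complete=False)
    return [md0, part, raidset]                         # leaf-first order


def teardown_all(leaf_first, require_status=True):
    for device in leaf_first:
        device.teardown(require_status=require_status)


if __name__ == "__main__":
    try:
        teardown_all(build_stack())  # status-gated teardown
    except BusyError as exc:
        print("status-gated teardown failed:", exc)
    teardown_all(build_stack(), require_status=False)
    print("unconditional teardown succeeded")
```

In this model the status-gated pass skips md0, so the partition underneath it still has a holder when its turn comes. Tearing down regardless of status releases the partition first, which is the behaviour proposed for step 4.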
Comment 15 Bob Gustafson 2009-07-14 08:05:36 EDT
(In reply to comment #14)

> 2) Thus we now only see one of the 2 partitions which were used to make the
>    mdraid set in F-10 (where we did not identify the BIOS-RAID set).
> 3) We thus have an incomplete mdraid set

Perhaps this is an opportunity to address another problem - the ability/option of continuing to install Fnn on an 'incomplete' RAID set.

There has been a desire over the years for this feature (see bugs: Bug #105598, Bug #129306, Bug #151652, Bug #152158, Bug #177894, Bug #188314, Bug #195812, Bug #247119, Bug #310241, Bug #452441)
Comment 16 Joe Christy 2009-07-14 20:31:41 EDT
Hans

Thanks for the pointer. Being bold/stupid/credulous in the belief that BIOS-RAID would give better performance, I tried 2) - switching RAID back on in the BIOS, etc. and it worked like a charm.

Now I'm happy again.
Comment 17 Peter H. Jones 2009-09-10 10:15:48 EDT
Created attachment 360505 [details]
automatic dump on peter jones' machine

When I, Peter Jones, saw this problem on my machine, I decided to send some information on my occurrence. The crash appears to be in the same place (anaconda finding storage devices), but the dump seems to be in a different place.

This file is the dump that was produced automatically.

I plan to also include dmesg, lspci and fdisk output from this machine,
using the currently-running F10.
Comment 18 Peter H. Jones 2009-09-10 10:18:05 EDT
Created attachment 360506 [details]
dmesg output from Peter Jones' machine, running F10
Comment 19 Peter H. Jones 2009-09-10 10:19:13 EDT
Created attachment 360507 [details]
lspci -v output from Peter Jones' machine, running F10
Comment 20 Peter H. Jones 2009-09-10 10:20:23 EDT
Created attachment 360508 [details]
fdisk -l output from Peter Jones' machine, running F10

Last posting for now. Hope this information helps.
Comment 21 Bug Zapper 2009-11-16 05:46:21 EST
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 22 vamsee 2009-11-25 18:42:08 EST
I tried to upgrade from Fedora 10 to Fedora 12 on my Dell Precision 690 workstation today and ran into the "installation root not found" error when I rebooted after running pre-upgrade (shortly after it tried to find storage media). If I add 'upgradeany' to the boot command, it gets past the error, shows a couple of GUI screens about dependency checking etc., starts the install but then fails with a "not enough space on /mnt/sysimage" error. At least at this point it lets me write the anaconda log to a remote machine via scp. It is attached.

This info is probably in the logs but I will summarize. I have two disks. /dev/sda1 has a Windows partition of about 20GB. It has a second partition which is Linux. The second disk /dev/sdb1 is dedicated to Fedora and has a small /boot partition, a 2GB swap and the rest is for Fedora (logical volume?). Fedora 10 is installed on it and works fine.

I didn't really understand some of the comments above about turning on/off BIOS-RAID. The machine uses a DELL SAS Host Bus Adapter 6.06 00.02 (2006.04.05). Bios is A01 (6/40/06). In the BIOS settings screen (F2) it shows Drives 0, 1, 2 mapped to SATA-0, 1, 2 but they are all set to 'off' (you may need to see this screen to understand it). Drive 3 is mapped to PATA-0 and is on and the drive id is that of the DVD/RW drive. Drive 4 mapped to PATA-1 is off. If I try to turn on any of the 'off' items it gives an error during boot. Finally there is a SATA controller setting of AHCI or ATA. I tried both settings but I'm not able to finish the install either way. The attached logs are with the ATA setting in the BIOS.

dmraid -ay -t says there are no raid disks. 

lspci | grep SATA says: Intel 631xESBx632xESB/3100 Chipset SATA IDE Controller (Rev 09) 

I tried ctrl+alt+f2 before it looks for storage and tried mdadm --zero /dev/sda1. It says unrecognised md component device. If I do a df at the prompt it shows /dev/sda1, /dev/sda2, /dev/sdb1 etc. But for some reason Anaconda is not able to find them or it is trying to install everything into the very small /boot partition on /dev/sdb1

I would like to upgrade this instead of doing a new install due to other configured software.
Comment 23 vamsee 2009-11-25 18:45:00 EST
Created attachment 373880 [details]
Anaconda logs when it fails with not enough space

There is about 30GB  of free space on /dev/sda1 and around 50GB of free space on /dev/sdb1
Comment 24 Hans de Goede 2009-11-26 04:39:52 EST
Vamsee, it seems that anaconda is not recognizing your LVM setup, which is completely unrelated to this bug. Please file a new bug for this.
Comment 25 Radek Vykydal 2010-08-06 03:32:56 EDT
Hans, I think you've fixed biosraid and mdraid a lot since the report,
and I've lost track of what has been happening in this area, do you think we can close this one as CURRENTRELEASE (i.e. is 4) from comment #14 fixed too)?
Comment 26 Bob Gustafson 2010-08-06 07:44:00 EDT
I am running F13 w software RAID1, two partitions (/boot and /) with lvm (incl generous swap) on the / partition - on two systems. I have been having failing disks, so the concept of RAID1 has gotten a workout. So far no loss of data..

I still have my F9 ICHR10 Bios Raid system which is my central firewall/mail server, etc. Hard to shut down. I will probably configure one of the F13 systems to take over that job and then wipe the disks and start over with F13 software RAID1 on that system.

Software RAID1 requires you to write grub to MBR on both disks. This is not done automatically. If you get a failure on one disk and reboot, not having grub on the good disk will give obvious problems. This could be automated in Anaconda.

I also noticed that the BIOS allows a selection of which hard disk to boot from in the boot sequence - floppy, cdrom, hard disk. In a failure situation, if the failed disk happens to be the selected disk, this gives obvious boot problems. This BIOS disk selection may have BIOS bugs. Switching cables is more reliable in my experience.
Comment 27 Radek Vykydal 2010-08-06 08:05:20 EDT
(In reply to comment #26)
 
> Software RAID1 requires you to write grub to MBR on both disks. This is not
> done automatically. If you get a failure on one disk and reboot, not having
> grub on the good disk will give obvious problems. This could be automated in
> Anaconda.

Isn't this fixed with http://git.fedorahosted.org/git/?p=anaconda.git;a=commit;h=d625c76082493ffbc4a258c1eb1604d1f0e2edaa?
Comment 28 Bob Gustafson 2010-08-06 09:58:31 EDT
(In reply to comment #27)
> (In reply to comment #26)
>
> Isn't this fixed with
> http://git.fedorahosted.org/git/?p=anaconda.git;a=commit;h=d625c76082493ffbc4a258c1eb1604d1f0e2edaa?    

As I recall, when I finished with F13 Anaconda, I did a

  dd if=/dev/sdx bs=512 count=1 | od -c | more

on both disks (sda, sdb) and found they were different..

When I manually re-wrote the MBR on both, then the above test showed them to be the same.
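
A scripted version of the same check might look like the sketch below. It rests on assumptions not stated in this thread: it uses the conventional MBR layout (boot code in bytes 0-439, disk signature and partition table after that, 0x55 0xAA in the last two bytes), and the function names are made up for illustration. Two disks can differ in their signature or partition table bytes while carrying the same boot code, so the sketch reports the two comparisons separately. Some boot loaders also embed per-disk data inside the boot code area, in which case a difference there need not mean that grub is missing.

```python
import sys

SECTOR = 512     # size of the first sector, as read by dd bs=512 count=1
BOOT_CODE = 440  # assumed layout: bytes 0-439 hold the boot code


def read_mbr(path):
    """Return the first 512 bytes of a disk or disk image."""
    with open(path, "rb") as f:
        data = f.read(SECTOR)
    if len(data) != SECTOR:
        raise ValueError("%s: expected %d bytes, got %d"
                         % (path, SECTOR, len(data)))
    return data


def compare_mbrs(path_a, path_b):
    """Compare the first sectors of two disks without modifying them."""
    a, b = read_mbr(path_a), read_mbr(path_b)
    return {
        "boot_code_equal": a[:BOOT_CODE] == b[:BOOT_CODE],
        "sector_equal": a == b,
        "boot_signature_ok": a[510:] == b"\x55\xaa" and b[510:] == b"\x55\xaa",
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python compare_mbr.py /dev/sda /dev/sdb  (needs read access)
    for key, value in sorted(compare_mbrs(sys.argv[1], sys.argv[2]).items()):
        print("%s: %s" % (key, value))
```

The script only reads from the given paths, so it can be run against live disks or against image files saved with dd.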
Comment 29 Hans de Goede 2010-08-13 09:46:58 EDT
(In reply to comment #25)
> Hans, I think you've fixed biosraid and mdraid a lot since the report,
> and I've lost track of what has been happening in this area, do you think we
> can close this one as CURRENTRELEASE (i.e. is 4) from comment #14 fixed too)?    

The teardown code for an MDRaidArrayDevice now reads:

        # We don't really care what the array's state is. If the device
        # file exists, we want to deactivate it. mdraid has too many   
        # states.
        if self.exists and os.path.exists(self.path):
            mdraid.mddeactivate(self.path)

So yes I believe that 4) from comment #14 is fixed now and this can be closed.