Red Hat Bugzilla – Full Text Bug Listing
|Summary:||F11 Anaconda crashing on system w/ SW RAID|
|Product:||[Fedora] Fedora||Reporter:||Joe Christy <joe.christy>|
|Component:||anaconda||Assignee:||Radek Vykydal <rvykydal>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||12||CC:||bobgus, hdegoede, jones, mikolaj, pjones, rmaximo, vamsee, vanmeeuwen+fedora|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2010-08-13 09:46:58 EDT||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Joe Christy 2009-07-10 13:40:20 EDT
The following was filed automatically by anaconda:

anaconda 188.8.131.52 exception report
Traceback (most recent call first):
  File "/usr/lib64/python2.6/site-packages/block/__init__.py", line 35, in dm_log
    raise Exception, message
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 49, in removeDeviceMap
    map.remove()
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 46, in removeDeviceMap
    removeDeviceMap(m)
  File "/usr/lib64/python2.6/site-packages/block/device.py", line 843, in deactivate
    removeDeviceMap(self._RaidSet__map)
  File "/usr/lib/anaconda/storage/devices.py", line 2664, in teardown
    self._raidSet.deactivate()
  File "/usr/lib/anaconda/storage/devices.py", line 281, in teardownParents
    parent.teardown(recursive=recursive)
  File "/usr/lib/anaconda/storage/devices.py", line 556, in teardown
    self.teardownParents(recursive=recursive)
  File "/usr/lib/anaconda/storage/devicetree.py", line 1707, in teardownAll
    device.teardown(recursive=True)
  File "/usr/lib/anaconda/storage/devicetree.py", line 1697, in populate
    self.teardownAll()
  File "/usr/lib/anaconda/storage/__init__.py", line 302, in reset
    self.devicetree.populate()
  File "/usr/lib/anaconda/storage/__init__.py", line 102, in storageInitialize
    storage.reset()
  File "/usr/lib/anaconda/dispatch.py", line 205, in moveStep
    rc = stepFunc(self.anaconda)
  File "/usr/lib/anaconda/dispatch.py", line 128, in gotoNext
    self.moveStep()
  File "/usr/lib/anaconda/gui.py", line 1339, in nextClicked
    self.anaconda.dispatch.gotoNext()
Exception: device-mapper: remove ioctl failed: Device or resource busy
Comment 1 Joe Christy 2009-07-10 13:40:28 EDT
Created attachment 351284 [details] Attached traceback automatically from anaconda.
Comment 2 Joe Christy 2009-07-10 13:48:03 EDT
This occurred during an install of F11 from an x86_64 installation DVD (media check OK) onto a dual-drive Thinkpad W700 previously running F10 with everything but /boot on a SW RAID partition. I had checked the install medium (which I had previously used successfully on a different x86_64 system), chosen my language and keyboard, then hit "next", when ker-blooey!
Comment 3 Joe Christy 2009-07-10 15:52:09 EDT
I should add that this machine came from Lenovo with "RAID1" supplied via Intel Matrix Storage Manager, which F10 didn't recognize, hence the F10 MD RAID1 install on the two disks. Nonetheless, F11 seems to be detecting the OEM RAID1; could this be the root of my problem?
Comment 4 Peter Jones 2009-07-10 16:14:14 EDT
That seems very likely to be the cause of the problem, yes.
Comment 5 Joe Christy 2009-07-10 17:12:49 EDT
Aha! The BIOS is set for AHCI mode rather than RAID mode for the SATA Controller, FWIW.
Comment 6 Joe Christy 2009-07-10 18:15:22 EDT
A few more data points: Passing nodmraid as a bootloader arg in anaconda had no practical effect; see the 2nd attached anacdump.txt. I also captured some info on dmraid's view of my laptop, attached as well. Sigh - back to F10, which somehow installed in April.
Comment 7 Joe Christy 2009-07-10 18:17:08 EDT
Created attachment 351308 [details] the traceback from anaconda after passing nodmraid to the install disc kernel
Comment 8 Joe Christy 2009-07-10 18:18:11 EDT
Created attachment 351309 [details] output from dmraid -ay -t during install
Comment 9 Joe Christy 2009-07-10 18:19:23 EDT
Created attachment 351310 [details] result of grep'ing lspci output for the SATA controller
Comment 10 Bob Gustafson 2009-07-12 14:07:28 EDT
Interesting... I have two systems on which I successfully installed F11 w/ RAID1 (full wipe install - 3 partitions: /boot, swap, /), but both were software raid before (F10) and had no problems (both 32-bit systems, one IDE, the other SCSI). My third system is x86_64 and has ICH9(?) motherboard RAID. It is stuck at F9 because of Anaconda RAID problems in F10. I was thinking of switching off the motherboard raid and going with software-only raid, but your problem gives me pause.
Comment 11 Joe Christy 2009-07-12 17:42:31 EDT
Bob - I got around the RAID issues w/ F10 by switching the controller from RAID to AHCI in the BIOS, which, I fear, is the root of my current problem. For me, switching RAID back on in the BIOS would clobber my existing data, so no rollback w/out much pain. OTOH, it sounds like you already have RAID on in the BIOS, so from my limited experience, an install of F11, if it were going to fail, would fail before it actually touched the discs, so what's to lose? It's an experiment that would have much less downside for you, if you're willing. My install fails while anaconda is trying to figure out what to do with the discs and, I conjecture, gets confused by the seeming co-existence of mdraid and dmraid (which was undetected by F10), long before it actually does anything to them :(. Am I to take it that F9 installed over the ICH9 RAID? If so, maybe there was a regression in F10 that has now been corrected in F11.
Comment 12 Bob Gustafson 2009-07-12 18:22:53 EDT
Joe: I installed F9 over F8 on the ICH9 RAID without any problems. I have contributed to bug reports since then on Anaconda's failings when it comes to RAID (search on my name in bugzilla - all bugs, even 'closed'). My ICH9 x86_64 system is also my main gateway/mail/dns/nas system, so it can't be down for any real length of time without affecting my wife's computer and our phone system (asterisk), so I need to configure one of the other systems to take over those duties while I flail. Rather than do an 'update', I think doing a full wipe and letting Anaconda do its thing with /boot ext3 and / ext4 is more reasonable. I also have more RAM, so the swap file needs to be bigger. It will take a while for me to make the move. I was thinking of getting another system, but I already have too many keyboards on my desk. If having hardware RAID, even if not used, gives problems -- this is not good news. Can anyone else confirm whether they have F11 w/ software RAID running on an ICH9 x86_64 system?
Comment 13 Hans de Goede 2009-07-14 04:57:13 EDT
Joe, The problem is that anaconda is still seeing the Intel BIOS-RAID metadata on your disks (see the dmraid -ay -t output), and it is also seeing the mdraid software raid metadata, and this combination is confusing it (granted, it should not crash). If I read your comment in bug 489148 correctly you are willing to do a full install; in that case I can advise either of the following two scenarios:

1) Remove the BIOS-RAID metadata from your disks: enable RAID in your BIOS, enter the OROM setup (ctrl-I) and reset the disks to non-raid status (this is something you should have done in the past before disabling the RAID in the BIOS, so that the disks would not be seen as BIOS-RAID by Linux now). Then disable BIOS-RAID again.

2) Switch to using BIOS-RAID (so enable it again in your BIOS): remove the mdraid metadata using mdadm --zero-superblock /dev/sda# /dev/sdb#, where # is the partition number of the partitions which make up your software raidset. You can do this from the installer on tty2 (ctrl + alt + F2) before pressing next on the welcome screen (so before the initial storage scan). I'm not sure if this will work.

I'm leaving this bug open to track the backtrace, because as said that should not happen.
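[Editor's note] For scenario 2, the exact mdadm invocation matters; here is a minimal sketch of building that command for a given partition number. The helper name and the default disk list are illustrative assumptions, not anaconda code, and actually running the command requires root on the installer's tty2.

```python
def mdadm_zero_cmd(partnum, disks=("/dev/sda", "/dev/sdb")):
    # --zero-superblock erases the md (software RAID) metadata from
    # each member partition, e.g. /dev/sda2 and /dev/sdb2 for partnum=2.
    return ["mdadm", "--zero-superblock"] + [
        "%s%d" % (disk, partnum) for disk in disks
    ]

# The command Hans describes, for software-raid members on partition 2:
#   mdadm --zero-superblock /dev/sda2 /dev/sdb2
```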
Comment 14 Hans de Goede 2009-07-14 05:01:26 EDT
rvykydal, I've analysed the attached log file; here is what is happening:

1) We correctly identify the BIOS-RAID set and bring it online using dmraid.
2) Thus we now only see one of the 2 partitions which were used to make the mdraid set in F-10 (where we did not identify the BIOS-RAID set).
3) We thus have an incomplete mdraid set.
4) When tearing down everything at the end of the initial storage scan, the mdraid set is not stopped (I guess because it is incomplete it returns False as status, causing it to not be stopped).
5) When we get to tearing down the dmraid array, the mdraid partition is still in use by mdraid (as the set was not stopped) -> boom.

So I think we need to fix 4) (which is a larger issue than this bug alone) and make sure we also stop mdraid sets which are incomplete when tearing down storage.
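[Editor's note] The failure mode in step 4 and the shape of the eventual fix (quoted in full in comment 29) can be sketched with a stand-in class; FakeMDArray and its methods are simplifications of anaconda's real MDRaidArrayDevice, not its actual API.

```python
class FakeMDArray(object):
    """Stand-in for anaconda's MDRaidArrayDevice (simplified sketch)."""

    def __init__(self, path, complete):
        self.path = path
        self.exists = True
        self.complete = complete   # False when a member disk is hidden
        self.active = True

    @property
    def status(self):
        # The F11-era problem: an incomplete set reports a False status,
        # so the old teardown path skipped it and left its members busy.
        return self.active and self.complete

    def teardown_old(self):
        if self.status:            # incomplete set: never stopped -> boom later
            self.deactivate()

    def teardown_fixed(self):
        # Fixed logic per comment 29: if the device exists, deactivate it
        # regardless of how complete mdraid thinks the set is.
        if self.exists:
            self.deactivate()

    def deactivate(self):
        self.active = False        # real code calls mdraid.mddeactivate()


incomplete = FakeMDArray("/dev/md0", complete=False)
incomplete.teardown_old()
print(incomplete.active)    # True: old code left the incomplete set running
incomplete.teardown_fixed()
print(incomplete.active)    # False: fixed code stops it
```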
Comment 15 Bob Gustafson 2009-07-14 08:05:36 EDT
(In reply to comment #14)
> 2) Thus we now only see one of the 2 partitions which were used to make the
> mdraid set in F-10 (where we did not identify the BIOS-RAID set).
> 3) We thus have an incomplete mdraid set

Perhaps this is an opportunity to address another problem - the ability/option of continuing to install Fnn on an 'incomplete' RAID set. There has been a desire over the years for this feature (see bugs: Bug #105598, Bug #129306, Bug #151652, Bug #152158, Bug #177894, Bug #188314, Bug #195812, Bug #247119, Bug #310241, Bug #452441).
Comment 16 Joe Christy 2009-07-14 20:31:41 EDT
Hans Thanks for the pointer. Being bold/stupid/credulous in the belief that BIOS-RAID would give better performance, I tried 2) - switching RAID back on in the BIOS, etc. and it worked like a charm. Now I'm happy again.
Comment 17 Peter H. Jones 2009-09-10 10:15:48 EDT
Created attachment 360505 [details] automatic dump on Peter Jones' machine. When I, Peter Jones, saw this problem on my machine, I decided to send some information on my occurrence. The crash appears to be in the same place (anaconda finding storage devices), but the dump seems to be in a different place. This file is the dump that was produced automatically. I plan to also include dmesg, lspci and fdisk output from this machine, using the currently-running F10.
Comment 18 Peter H. Jones 2009-09-10 10:18:05 EDT
Created attachment 360506 [details] dmesg output from Peter Jones' machine, running F10
Comment 19 Peter H. Jones 2009-09-10 10:19:13 EDT
Created attachment 360507 [details] lspci -v output from Peter Jones' machine, running F10
Comment 20 Peter H. Jones 2009-09-10 10:20:23 EDT
Created attachment 360508 [details] fdisk -l output from Peter Jones' machine, running F10 Last posting for now. Hope this information helps.
Comment 21 Bug Zapper 2009-11-16 05:46:21 EST
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle. Changing version to '12'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 22 vamsee 2009-11-25 18:42:08 EST
I tried to upgrade from Fedora 10 to Fedora 12 on my Dell Precision 690 workstation today and ran into the "installation root not found" error when I rebooted after running pre-upgrade (shortly after trying to find storage media). If I add 'upgradeany' to the boot command, it gets past the error, shows a couple of GUI screens about dependency checking etc., starts the install, but then fails with a "not enough space on /mnt/sysimage" error. At least at this point it lets me write the anaconda log to a remote machine via scp. It is attached.

This info is probably in the logs but I will summarize. I have two disks. /dev/sda1 has a Windows partition of about 20GB. It has a second partition which is Linux. The second disk /dev/sdb1 is dedicated to Fedora and has a small /boot partition, a 2GB swap and the rest is for Fedora (logical volume?). Fedora 10 is installed on it and works fine.

I didn't really understand some of the comments above about turning on/off BIOS-RAID. The machine uses a DELL SAS Host Bus Adapter 6.06 00.02 (2006.04.05). BIOS is A01 (6/40/06). In the BIOS settings screen (F2) it shows Drives 0, 1, 2 mapped to SATA-0,1,2 but they are all set to 'off' (you may need to see this screen to understand it). Drive 3 is mapped to PATA-0 and is on, and the drive id is that of the DVD/RW drive. Drive 4, mapped to PATA-1, is off. If I try to turn on any of the 'off' items it gives an error during boot. Finally there is a SATA controller setting of AHCI or ATA. I tried both settings but I'm not able to finish the install either way. The attached logs are with the ATA setting in the BIOS.

dmraid -ay -t says there are no raid disks. lspci | grep SATA says: Intel 631xESB/632xESB/3100 Chipset SATA IDE Controller (Rev 09). I tried ctrl+alt+f2 before it looks for storage and tried mdadm --zero-superblock /dev/sda1. It says unrecognised md component device. If I do a df at the prompt it shows /dev/sda1, /dev/sda2, /dev/sdb1 etc.
But for some reason Anaconda is not able to find them, or it is trying to install everything into the very small /boot partition on /dev/sdb1. I would like to upgrade this instead of doing a new install due to other configured software.
Comment 23 vamsee 2009-11-25 18:45:00 EST
Created attachment 373880 [details] Anaconda logs when it fails with not enough space. There is about 30GB of free space on /dev/sda1 and around 50GB of free space on /dev/sdb1.
Comment 24 Hans de Goede 2009-11-26 04:39:52 EST
Vamsee, it seems that anaconda is not recognizing your lvm setup, which is completely unrelated to this bug, please file a new bug for this.
Comment 25 Radek Vykydal 2010-08-06 03:32:56 EDT
Hans, I think you've fixed biosraid and mdraid a lot since the report, and I've lost track of what has been happening in this area. Do you think we can close this one as CURRENTRELEASE (i.e., is 4) from comment #14 fixed too?)
Comment 26 Bob Gustafson 2010-08-06 07:44:00 EDT
I am running F13 w/ software RAID1, two partitions (/boot and /) with lvm (incl. generous swap) on the / partition, on two systems. I have been having failing disks, so the concept of RAID1 has gotten a workout. So far no loss of data.

I still have my F9 ICHR10 BIOS RAID system which is my central firewall/mail server, etc. Hard to shut down. I will probably configure one of the F13 systems to take over that job and then wipe the disks and start over with F13 software RAID1 on that system.

Software RAID1 requires you to write grub to the MBR on both disks. This is not done automatically. If you get a failure on one disk and reboot, not having grub on the good disk will give obvious problems. This could be automated in Anaconda.

I also noticed that the BIOS allows a selection of which hard disk to boot from in the boot sequence - floppy, cdrom, hard disk. In a failure situation, if the failed disk happens to be the selected disk, there are obvious boot problems. This BIOS disk selection may have bugs. Switching cables is more reliable in my experience.
Comment 27 Radek Vykydal 2010-08-06 08:05:20 EDT
(In reply to comment #26) > Software RAID1 requires you to write grub to MBR on both disks. This is not > done automatically. If you get a failure on one disk and reboot, not having > grub on the good disk will give obvious problems. This could be automated in > Anaconda. Isn't this fixed with http://git.fedorahosted.org/git/?p=anaconda.git;a=commit;h=d625c76082493ffbc4a258c1eb1604d1f0e2edaa?
Comment 28 Bob Gustafson 2010-08-06 09:58:31 EDT
(In reply to comment #27)
> (In reply to comment #26)
>
> Isn't this fixed with
> http://git.fedorahosted.org/git/?p=anaconda.git;a=commit;h=d625c76082493ffbc4a258c1eb1604d1f0e2edaa?

As I recall, when I finished with F13 Anaconda, I did a dd if=/dev/sdx bs=512 count=1 | od -c | more on both disks (sda, sdb) and found they were different. When I manually re-wrote the MBR on both, the above test showed them to be the same.
Comment 29 Hans de Goede 2010-08-13 09:46:58 EDT
(In reply to comment #25)
> Hans, I think you've fixed biosraid and mdraid a lot since the report,
> and I've lost track of what has been happening in this area, do you think we
> can close this one as CURRENTRELEASE (i.e. is 4) from comment #14 fixed too)?

The teardown code for an MDRaidArrayDevice now reads:

    # We don't really care what the array's state is. If the device
    # file exists, we want to deactivate it. mdraid has too many
    # states.
    if self.exists and os.path.exists(self.path):
        mdraid.mddeactivate(self.path)

So yes, I believe that 4) from comment #14 is fixed now and this can be closed.