Red Hat Bugzilla – Bug 495078
left with an uninstallable system
Last modified: 2010-01-07 07:51:50 EST
Anaconda can't seem to figure out how to install on a system I am trying the FC11 beta on.
I started out with an FC9 system; when I tried to preupgrade it, I was left unbootable. Basically grub didn't have any kernels in its menu to load. Since I was going to do a new install anyway, I just ignored this.
Then I tried a net install off of the isos.
I was trying to set up /boot on sda, swap on sda (sda being larger than sdb), and then a primary / on /dev/md0.
This time I ended up with a crash that was traced to an exception dump. I reported it, and the report was marked a duplicate of a bug already closed as fixed for the next release.
I was able to manually create the md0 configuration without using the RAID button, but the installer then told me it could not mount /boot and exited.
Now the device scan can't even find anything to install on.
Not a good path.
We like to deal with one problem per bug report. Is it possible for you to elaborate on one of your bugs in this report, including error messages and the exception dump if at all possible? The most useful one to deal with would probably be the earliest bug, since the fallout from that one could very well be the cause of your later problems.
Your comment about fallout is why I have not made separate bug reports. This may all be the result of the first incident, with the rest not being separate bugs, but I doubt it. Of course, having a working system that it can't install on would seem to be a bug in and of itself. It is possible these are different manifestations of the same bug, or they could be different bugs; I would bet on some connection to option 2. However, each step here is definitely in error. I figured that until we determined which was the case, one report was the best bet.
Bug manifestation 1 was that I had an FC9 system. I ran yum upgrade on it, rebooted, and checked that everything was working. I then ran preupgrade on it. The system configuration is a 300 MB /boot on /dev/sda and a RAID1 /dev/md0 as /. I then rebooted and the upgrade ran, or at least it said it ran successfully. I rebooted and got a grub prompt that sat there with no menu options, asking me what to do.
As I said, this is an out-of-service unit I am using for testing, and frankly my next step after the upgrade, once I had made sure it was working, was going to be to reinstall from scratch anyway.
I downloaded the net install ISO and burned a CD from it. I believe the ISO was dated March 25th or 26th; if need be I could check that.
I booted to anaconda from that. It actually had my old system still listed, but with the new version. I decided it was pointless to try an upgrade, especially since a scratch install was my goal anyway, so I selected a new install: use all drives, modify.
I tried to recreate the old configuration with /boot and md0. I deleted the "old" LVM layout as automatically created by the partitioning system. Starting with empty drives I created my partitions. I have an 80 GB sda and a 40 GB sdb. My goal was /boot and swap on sda, and a 40 GB RAID1 /dev/md0 as /. When I clicked on the RAID button, things crashed with a dump. It asked me to report the crash, which I did. The resulting bug report was labeled a dupe of one already existing that was fixed in the next version (fixed in rawhide); there is a note about the next anaconda in that bug report.
I couldn't find a next version of anaconda. I also noticed that just clicking on the button caused the problem, so I decided to give a manual setup a try. I used the create-partition dialog to create two 40 GB software RAID partitions. I then set up / on /dev/md0 as RAID1 from these, again using the manual options and avoiding the RAID button. All seemed to work.
I said to save the partition information and run the install. It did this without giving an error message. However, a reboot gave me a system that again would not boot, with that "what do I do now" grub error.
I decided to give it one more try. Now when I boot anaconda it says that it is scanning hardware, or something like that, and that there are no hard drives in the system to install on.
Does this explain things better?
I am working on this a little more. Chris, I need some guidance on one thing.
I also found that the system will not boot to rescue mode on this machine; it hangs right at the start of going into that mode. Should I fill out a separate bug report on this, or would this probably be another manifestation of the same bug?
OK, we have a new error on the same system.
I don't know what changed, but to some extent this makes more sense. However, it is also obviously a bug.
Now I get to a point where it is scanning devices. The previous error told me there were no devices when in fact there were. It also more or less hung the system, with an inability to Ctrl-Alt-Fx to the other consoles: as soon as you do that, it switches to a blank screen and appears to hard-hang. No amount of additional Ctrl-Alt-Fx-ing causes any change, and Return and Ctrl-Alt-Delete do nothing. BUT...
We are now in a different mode. I would suspect it is related to the same bug. But now I get that there is a file system error and it can't continue. I can now switch screens; when I do, I get all kinds of errors on /dev/sda1. It apparently really, really "loves" the journal. I can either hit Ctrl-Alt-Delete to reboot or Return on the X screen.
Now, at this point I am trying for a scratch install, so any attempt to read the drives that finds them to exist but to have data it doesn't understand is an error in and of itself. As long as it can find the drives, you should be able to repartition and do a scratch load.
I hope this helps.
OK, I found something else interesting. I booted into rescue mode with an FC9 CD. When I look at the partition tables, /dev/sdb is OK, BUT /dev/sda has two anomalies:
#1 There is an error that the first partition doesn't end on a cylinder boundary.
#2 There is an extra partition I didn't create: partition 5, a Linux partition.
Now the $60,000 question: do you guys need me to leave this machine in this state so we can get more information, or do I need to simply go in with the FC9 disk, delete all of the partitions, and see if I can scratch install that way?
OK, the preview release will install the system properly, so half of this is in a way fixed. But the older versions (alpha and beta) leave a system where you have to go into rescue mode and use fdisk to remove everything before installing.
You should always be able to do a fresh install regardless of the state of the drives, provided that they are at least functional.
Actually, I tried a new install with the FC11 preview and was left with the same problem, except this time there was no way I could find to do a rescue and even get it booting.
What I am finding is that apparently /dev/md0 is still not working in many cases if you are using RAID1.
I tried to do a normal install, but found that the LVM setup leaves a system with a very low level of reliability. It appears that the software RAID setup now defaults to RAID1 instead of the past RAID0, but LVM still seems to be more interested in storage capacity than reliability. I can't even find a place to tell LVM that I want to use the drives redundantly.
If I wanted 500 GB of storage, I would have bought a 500 GB drive, not two 250s.
I realize that this probably should be an RFE, but I can't even figure out what to file it against.
When I get a chance tonight I will create a second bug report for this and relate it back to this one. This is a different problem, but it presents the same way. In the meantime, for my notes and yours: for some reason on the latest machine to experience this problem, named amos, sda is hd1 during boot and sdb is hd0.
In addition to being strange and confusing, this also has the install program confused, as it sets up grub.conf the way you would think it should be.
I can't follow this bug report, so I have no idea what's going wrong. Since there aren't any actual error messages, we don't know where things are failing or even really how to know if we've been able to reproduce the problem. If you could please attempt to reproduce this problem, outlining exactly which steps you took and exactly which error messages you are receiving, it would be most helpful in us being able to fix your bug. Without specific error messages, we don't really have much to go on.
OK, this is two problems. They have the same presentation, but so far different solutions. I also started this bug on one machine, and that machine is now acting like one of the others. I have one other machine which has the same original presentation, but a very different fix that causes it to start working again. I will try to specify the exact symptoms in greater detail and will give machine names to help with this.
So at one point I had three machines which would do exactly the same thing, and fail in exactly the same way. That is why I merged them into one bug. That may have been a mistake. Let's talk about three issues. I will also try to be quite a bit more specific.
The original problem reported on this bug was on a machine named lilly. Lilly started out as an FC9 machine. I did a preupgrade-based upgrade. It told me this was successful. BUT when I rebooted, it went through the BIOS and everything, and then at the point where the menu listing the loadable kernels and the countdown to the load would normally show up, I instead get the word "grub" at the top of the screen and that is it. The system halts.
Now I tried to do a fresh install on the same machine, so I figured it must be the same bug when it didn't work. I did this install from the FC11 alpha "net install" ISO. On this version the installer crashes with an exception dump. The dump says to start a new bug report; if you do, the report gets connected to a second report that says it is closed and fixed. I then tried the install without using the "raid" button, and got a different crash. Same machine attempting the same thing, with slightly different results; I figured this was most likely the same bug. Then I retried the install and got a third result, again trying the exact same thing on the same machine. It seemed logical to consider this the same underlying bug.
That is why I didn't start a new bug report. Frankly, this problem was indeed fixed in the preview version of the release, so the section from "Then I tried a net install off of the isos." on down in the description should be considered fixed and ignorable.
Now, we are still working on lilly. I was able to run a complete install under the beta "net install" ISO. BUT!!!!! now I was back to it saying I had successfully completed the install, yet when I rebooted I got the word "grub" at the top left of the screen and nothing else. (No error messages whatsoever, either during the install or during the boot process.) Since I was back to exactly the same point I was at when I originally reported the bug, on the exact same machine, I figured this was still one bug report. At this point this bug report should be considered a description of: "Ran disk- and network-based installs. Install said it was successful, but when I reboot, the machine hangs with the word grub. Repeated installs continue to give the same result."
To get the install to take on lilly, I had to first boot an FC9 disk to rescue mode and use that fdisk to remove all partitions. FC10 and FC11 (alpha, beta, and preview) all refuse to boot at all or operate with the partitions left by the failed alpha install; this is even if you don't try to load the sysimage. You never even get to that stage. But once the partitions were removed, the preview install "worked" as specified above.
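For the record, that rescue-mode cleanup amounts to roughly the following (a sketch, not a transcript; the device names match the two-drive layout described above, and these commands are destructive):

```shell
# Boot "linux rescue" from the FC9 CD and skip mounting the installed
# system, then delete every partition on each drive interactively:
fdisk /dev/sda    # "p" to list partitions, "d" to delete each one, "w" to write
fdisk /dev/sdb    # same on the second drive
```

After this, rebooting into the installer gives it genuinely blank disks to work with.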
(Let's put lilly away for a few minutes here.)
Now I tried to install on a second machine, called ranchhand. Ranchhand got a scratch install on empty disks, with the same hardware configuration as lilly (using the preview disks). I used the same process, and the install said that it completed perfectly. But a reboot only gives that same "grub" message in the upper left corner. Same hardware, slightly different path (new install vs. upgrade, ISO install vs. preupgrade), but the exact same end result. Sounds like the same underlying bug; it seemed illogical to start a new bug report.
I did find a way to get both of these machines working. On both lilly and ranchhand, if you rescue boot and then run grub-install after chrooting, they become bootable. But the actual install process leaves them unbootable.
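The fix that worked on lilly and ranchhand amounts to something like this (a sketch; it assumes rescue mode mounted the installed system at /mnt/sysimage, which is where the Fedora rescue environment normally puts it):

```shell
# From rescue mode, after the installed system has been found and mounted:
chroot /mnt/sysimage      # switch into the installed system
grub-install /dev/sda     # rewrite the boot loader into the MBR of the boot drive
exit                      # leave the chroot, then reboot normally
```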
(OK, let's put lilly and ranchhand aside.)
New machine: amos. Amos is exclusively discussed in comments #7 and #8. Amos has different hardware than the other two, but I am using the same volume structure. I ran a fresh install on amos using the FC11 preview net install ISO. The install says it completes successfully. Once again, when I reboot I get the message "grub" in the upper left corner. I hope this is sounding a bit repetitive at this point; I sure thought so, but then I tried the same fix and it didn't work. Yet here were three machines with the same volume structure and the same result from very similar actions. That would seem to be the same bug with a different presentation.
In the case of amos, as I said, grub-install didn't work. However, with much experimenting I found that on this machine, for reasons I have no idea of, the disk that grub refers to as hd0 is the second drive, which Linux refers to as sda, and the disk that grub refers to as hd1 is the first drive, which Linux refers to as sdb. If I take this into account and do a custom grub command-line install of the grub loader, amos is now bootable. But as I said, the same install ran perfectly yet would not boot, and instead gave the same message. Not an error message, but the same message. So this appears to be the same or a highly related bug. That is why I have filed this as one bug for all three machines: lilly, ranchhand, and amos.
I can make all three machines currently do this with fc11 preview.
(I will now deal with the second part of comment #7 as a separate comment.)
Comment elaborating on the second part of comment #7.
The point here is two things.
#1 In the past I have filed quite a few bug reports on the use of software RAID1 /dev/md0 volumes. Generally the response from your side, about two or three comments in, is to say that I should be using LVM and accepting the defaults, and then to close the bug as "will not fix". So I am trying to explain why I am using the structure I am, and why the LVM structure will not work for me and many others.
I run systems which need to be highly reliable, so we run with multiple redundant hardware. One of these types of hardware is the drives. In all three examples here we have two 250 GB drives. We actually need to store about 30 GB or less; the point of the two drives is redundancy. But the default configuration is to use LVM. No problem in and of itself, but LVM insists on those two drives being seen as one 500 GB drive and on striping the data across both. At times this might be a logical solution, but very seldom. It might add some efficiency in a few situations, BUT then if you lose either drive you lose all of your data. That is not acceptable.
The last time I played with LVM enough to set up what it claimed was a redundant structure, it turned out that the data was still not redundant with regards to hardware: to read any data you had to have both drives active. Interestingly, a disk editor told me why. The first stripe is placed on the first drive twice, and then the second stripe is placed on the second drive twice. Not exactly what one wants with redundant data.
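For what it's worth, command-line LVM can be told to mirror rather than stripe; whether the installers of that era exposed this is another question. A sketch, with assumed device names and sizes:

```shell
# Mirrored, not striped: one full copy of the data on each drive.
pvcreate /dev/sda2 /dev/sdb1
vgcreate vg0 /dev/sda2 /dev/sdb1
# -m 1 requests one mirror copy; --mirrorlog core keeps the mirror log
# in memory so a third device is not needed to hold it.
lvcreate -m 1 --mirrorlog core -L 30G -n root vg0
```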
So I need the software RAID, RAID1 setup. It used to be that when you installed, even manually, and set up the RAID system, it defaulted to RAID0 and you then had to go in and change this. I commented above that this default has indeed changed: now when you manually set up RAID on two identical drives, it defaults to RAID1. This was a good start and I was trying to give you credit for it.
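Outside the installer, the RAID1 structure I keep describing corresponds to something like this with mdadm (a sketch with assumed partition names; destructive on real drives):

```shell
# Mirror two "Linux raid autodetect" (type fd) partitions into /dev/md0:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb1
mkfs.ext3 /dev/md0        # make the filesystem for /
mdadm --detail /dev/md0   # should report "Raid Level : raid1"
```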
#2 I would rather just use the default structure, and I would like to see LVM address this major reliability deficiency. So I assume that I need to file a request for enhancement, i.e. an RFE, but I am not sure exactly which component to file it against; that is the information I was requesting. In the past, if I filed an RFE against a logical but wrong "component", once again, instead of telling me which component to file against, it basically got closed as "won't fix: not something that is part of our component". So I am trying to find out which group to address this issue with in this case.
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.
More information and reason for this action is here:
OK, this is even getting me confused at points; I did one paragraph wrong.
The paragraph should read
In the case of amos, as I said, grub-install didn't work. However, with much experimenting I found that on this machine, for reasons I have no idea of, the disk that grub refers to as hd0 is the second drive, which Linux refers to as sdb, and the disk that grub refers to as hd1 is the first drive, which Linux refers to as sda. If I take this into account and do a custom grub command-line install of the grub loader, amos is now bootable. But as I said, the same install ran perfectly yet would not boot, and instead gave the same message. Not an error message, but the same message. So this appears to be the same or a highly related bug. That is why I have filed this as one bug for all three machines: lilly, ranchhand, and amos.
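The custom grub command-line install on amos looked roughly like this (a sketch, not a transcript, using the corrected ordering above: grub's hd0 is the drive Linux calls sdb, and hd1 is sda):

```
grub> find /grub/stage1    # reports which (hdN,M) actually holds /boot/grub
grub> root (hd1,0)         # /boot is on the first partition of Linux's sda
grub> setup (hd0)          # install to the drive the BIOS really boots first
grub> quit
```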
This problem is still regularly occurring in FC12.
I have now also had this problem in a fresh install with blank disks at the start.
(In reply to comment #14)
> Thi problem is still regularly occurring in fc12.
> I have now also had this problem in a fresh install with blank disks at the
Can you please define exactly what "this" problem is?
You run an upgrade, or now a fresh install, and when rebooted, at the point where it should load Fedora, the system instead hangs. It has ended up with either a fully black screen, a black screen with the cursor in the upper left corner, or the grub command-line menu. Mostly it is the cursor in the upper left corner.
The problem appears to be that the boot loader is either not getting loaded properly or not at all. It appears that the install is placing the boot loader somewhere it should not, or is loading it with parameters telling it to immediately go to places it should not. This appears to be because the system is getting lost with regards to the order of the hard drives.
I have had grub or grub-install insist that the disks are in the reverse of their actual order, or that there are disks which don't actually exist and/or are first in boot order.
(In reply to comment #16)
> You run an upgrade or now a fresh install and the system when rebooted and
> where it should load fedora instead hangs. It either ended up with a fully
> black screen, a black screen with the cursor in the upper left corn or the grub
> command line menu. Mostly it is the cursor in the upper left corner.
> The problem appears to be that the boot loader is either not getting loaded
> properly or at all. It loads appears that the problem is that the install is
> placing the boot loader somewhere it should not or in loading the boot loader
> is loading it with parameters saying it should immediately go to places that it
> should not. This appears to be because the system is getting lost with
> regards to the order of the hard drives.
> I have had grub or grub-install insist that the disks are in the reverse order
> that they are, or that there are disks which don't actually exist and or the
> first disks in boot order.
So the BIOS has a different drive ordering than the Linux kernel. This happens from time to time and there is nothing we can do about it; note that you can indicate which drive to boot from in the partitioning method selection screen.
Normally this should be enough, but we had a bug where, with a mirrored RAID, we would write a grub which also depended on the second disk (which it shouldn't, as that removes the redundancy). This means that the BIOS drive order must be known for both disks, or you get this boot problem in F-12.
You can configure the entire BIOS drive order by choosing advanced bootloader options and then, in the advanced screen, clicking the bootloader drive order button. If you configure this correctly, things should work, although you will still have the non-redundant-boot bug, as grub uses both disks.
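For reference, the BIOS drive order the installer is given ends up in grub's device.map; on a machine like amos, where the BIOS order is inverted relative to Linux's, the corrected file would read roughly like this (illustrative, not copied from the actual machine):

```
# /boot/grub/device.map with the order corrected:
(hd0)   /dev/sdb    # the drive the BIOS actually boots first
(hd1)   /dev/sda
```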
This whole thing (grub on a software RAID mirror) has been fixed (almost to the point of being rewritten) for Fedora 13. Now we write a grub which only depends on one disk, and correctly write a second grub to the second disk of the mirror, which depends on the same BIOS disk number (since, once the first disk fails, the second disk takes its number in the BIOS drive ordering).
Here is the complete set of commits fixing this:
Radek, the author of these patches, has done extensive testing of the new code.
Still, feedback from people actually using this feature would be greatly appreciated, so it would be great if you could test this with the Fedora 13 alpha once it is released, and report back to us with any issues you find.
Given that this is fixed for F-13, I'm going to close this bug as such.
I will point out that this same problem has been around since Fedora Core 10. So was this discovered in F-12, or was it introduced in F-12? If it is the former, you might have something; if it was introduced in 12, we may still need to look.
It was discovered in F-12; the problem has indeed been around for much longer.