Created attachment 322325 [details]
Full console log

Description of problem:
It seems that attempting an autopart install on a drive which was originally part of a multi-device VG fails. With all of the force options I can think of, we're still getting:

There are 3 physical volumes missing.
visited LogVol00
visited LogVol01
A volume group called 'VolGroup00' already exists.

Version-Release number of selected component (if applicable):
RHEL5.3-Server-20081020.1
anaconda-11.1.2.145-1
(This probably happens in 5.2 and earlier as well. I haven't checked yet.)

How reproducible:
100% (I think)

Steps to Reproduce:
1. autopartition across multiple drives
2. Remove all but the first drive
3. Attempt to autopartition

Actual results:
vgcreate failed for VolGroup00

Expected results:
Successful install using only the first drive.

Additional info:
The kickstart in question already uses:

zerombr
clearpart --all --initlabel
autopart

If there are more "wipe everything and reinstall" type options I'm willing to try them. Details from a failed install can be found at:
https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4868722

I'll attach the interesting files.
Created attachment 322326 [details] Anaconda log
Created attachment 322327 [details] Kickstart which triggers bug
Created attachment 322329 [details] syslog
Created attachment 322330 [details] LVM log which shows the exact error reported
It is probably a duplicate of bug #468431, but thanks for a reproducer that we have been missing. I am trying to reproduce it and test the updates file that should fix it (https://bugzilla.redhat.com/show_bug.cgi?id=468431#c6).
Matt, I can't reproduce the bug (using kvm with iso install), is it possible for you to try if http://rvykydal.fedorapeople.org/updates.wipelvm.149.img fixes the problem?
Changing priority to "Urgent" given impact on testing described in my comment above.
Setting dev-ack and adding blocker = "?" so the bot doesn't pm-nak this. Matt, Josh, we really need to know if the fix in comment #6 took care of this.
Note: I've reproduced the bug in 5.2 and 5.0. At least this isn't a regression.

(In reply to comment #6)
> Matt, I can't reproduce the bug (using kvm with iso install), is
> it possible for you to try if
> http://rvykydal.fedorapeople.org/updates.wipelvm.149.img
> fixes the problem?

I haven't had any luck using that image. No matter what I try, the installer hangs at one of the (many) DHCP requests:

Sending request for IP information for eth0...
Determining host name and domain...
Sending request for IP information for eth0...

That's after loading stage2, not spawning a shell, and loading modules. I'm not sure *why* it's doing another DHCP request at that point. Without the updates image it progresses to the partitioning failure.
Matt, from what I've seen with the latest 5.3 builds, I think anaconda will try to make 2 DHCP requests: the first to configure the network (if you're using the http method, for example) and the second to configure networking again so it can download updates.img after it has downloaded stage2.img. This sounds like a bug in the network code.
Matt, do you have any logs (rhts link) for comment #12? Maybe it could be worked around for now by putting the updates file into (probably a copy of) the installation tree, as the file images/updates.img. Thanks
Matt, the anaconda.log from comment #1 seems incomplete. Could you try to get me one more anaconda.log (ideally with all the other logs), or point me to some in rhts? I can reproduce (and fix with the updates file) the bug only with different ks partitioning (https://bugzilla.redhat.com/show_bug.cgi?id=468431#c5), and the log would add to my confidence that the bug I reproduce is the same. Thanks
Matt, Alexander: There's a reason for the two DHCP requests. The first is to get the kickstart file, the second is applying the network config in the downloaded kickstart file. That's a feature, not a bug.
Fixed with commit b5a48bfc44a8084b4c751e192c70eb1837e44e19, to be included in 11.1.2.156 (Snapshot #3)
Created attachment 323455 [details]
console log for dell-pe6800-01.rhts.bos.redhat.com with RHEL5.3-Server-20081112.0 distro

I just queued up a job to the box with the RHEL5.3-Server-20081112.0 distro. It failed. The attached log makes note of anaconda 11.1.2.156:

---
Running anaconda, the Red Hat Enterprise Linux Server system installer - please wait...
Probing for video card: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
Running pre-install scripts
Retrieving installation information... In progress... Completed
Completed
Retrieving installation information... In progress... Completed
Completed
Retrieving installation information... In progress... Completed
Completed
Retrieving installation information... In progress... Completed
Completed
Checking dependencies in packages selected for installation... In progress...
Can't have a question in command line mode!
LVM operation failed
vgcreate failed for VolGroup00
The installer will now exit...
custom ['_Reboot']
---

Barry
The patch didn't work. Interesting parts of the logs (from the failed install: https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5132172):

anaconda.log tail:

14:24:59 INFO : moving (1) to step enablefilesystems
14:24:59 DEBUG : starting mpaths
14:24:59 DEBUG : self.driveList(): ['sda']
14:24:59 DEBUG : DiskSet.skippedDisks: []
14:24:59 DEBUG : DiskSet.skippedDisks: []
14:24:59 DEBUG : done starting mpaths. Drivelist: ['sda']
14:24:59 DEBUG : removing drive sda from disk lists
14:24:59 DEBUG : starting mpaths
14:24:59 DEBUG : self.driveList(): ['sda']
14:24:59 DEBUG : DiskSet.skippedDisks: []
14:24:59 DEBUG : DiskSet.skippedDisks: []
14:24:59 DEBUG : done starting mpaths. Drivelist: ['sda']
14:24:59 DEBUG : adding drive sda to disk list
14:24:59 INFO : lv is VolGroup00/LogVol00, size of 582688
14:24:59 INFO : lv is VolGroup00/LogVol01, size of 1984
14:24:59 INFO : removing obsolete LV VolGroup00/LogVol00
14:24:59 INFO : removing obsolete LV VolGroup00/LogVol01
14:24:59 INFO : vg VolGroup00, size is 584800, pesize is 32768
14:24:59 INFO : removing obsolete VG VolGroup00
14:24:59 INFO : vgremove VolGroup00
14:24:59 ERROR : createLogicalVolumes failed with vgcreate failed for VolGroup00

lvmout.log tail:

Wiping cache of LVM-capable devices
Couldn't find device with uuid 'r038e1-2wPU-KXOT-nvth-JMqA-HVe0-uULQ2q'.
Couldn't find device with uuid 'qIUaL1-YkEW-0JOO-PlYq-dTx2-G7da-g03qVX'.
Couldn't find device with uuid 'ml5vI6-Khfr-gbxq-slhy-SWvs-hC2l-mwLiCh'.
There are 3 physical volumes missing.
A volume group called 'VolGroup00' already exists.

sys.log tail:

<6>md: raid6 personality registered for level 6
<6>md: raid5 personality registered for level 5
<6>md: raid4 personality registered for level 4
<4>GFS2 (built Nov 10 2008 18:30:46) installed
<6>Lock_Nolock (built Nov 10 2008 18:30:57) installed
<6>device-mapper: uevent: version 1.0.3
<6>device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
<6>device-mapper: multipath: version 1.0.5 loaded
<6>device-mapper: multipath round-robin: version 1.0.0 loaded
<6>device-mapper: multipath emc: version 0.0.3 loaded
<7>eth0: no IPv6 routers present
<6>lvm[1984]: segfault at 0000000000000024 rip 0000000000426bc9 rsp 00007fffc88d9ce0 error 4

So I'm asking myself:
- why did the segfault in sys.log happen?
- when we are removing the obsolete VG, are we doing it in a way that is enough to wipe the old metadata? The last line of lvmout.log suggests not.

There were cases with the same lvm error (the last line of lvmout.log) fixed with the patch, but it seems that we have only part of the fix. I think https://bugzilla.redhat.com/show_bug.cgi?id=468431#c1 is the answer - we must not only remove the obsolete volume group, but also actually wipe the metadata from the PVs of the obsolete VG (it is happening, though for another reason, in Fedora). I am going to prepare the updates.img with a patch doing this.

---------------

As for comparison with the logs from the Description of the bug, they seem incomplete (anaconda.log), and sys.log doesn't contain the segfault.

anaconda.log tail:

20:23:07 INFO : moving (1) to step postselection
20:23:07 INFO : kernel-xen package selected for kernel
20:23:07 DEBUG : selecting kernel-xen-devel

lvmout.log tail:

Wiping cache of LVM-capable devices
Couldn't find device with uuid 'r038e1-2wPU-KXOT-nvth-JMqA-HVe0-uULQ2q'.
Couldn't find device with uuid 'qIUaL1-YkEW-0JOO-PlYq-dTx2-G7da-g03qVX'.
Couldn't find device with uuid 'ml5vI6-Khfr-gbxq-slhy-SWvs-hC2l-mwLiCh'.
There are 3 physical volumes missing.
visited LogVol00
visited LogVol01
A volume group called 'VolGroup00' already exists.

sys.log tail:

<6>md: raid4 personality registered for level 4
<4>GFS2 (built Oct 17 2008 18:01:43) installed
<6>Lock_Nolock (built Oct 17 2008 18:01:51) installed
<6>device-mapper: uevent: version 1.0.3
<6>device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
<6>device-mapper: multipath: version 1.0.5 loaded
<6>device-mapper: multipath round-robin: version 1.0.0 loaded
<6>device-mapper: multipath emc: version 0.0.3 loaded
(In reply to comment #22)
> I think the
> https://bugzilla.redhat.com/show_bug.cgi?id=468431#c1
> is the answer - we must not only remove the obsolete volume group,
> but also actually wipe metadata from PVs of the obsolete VG.
> (it is happening, though for another reason in Fedora).
> I am going to prepare the updates.img with a patch doing this.

No, this is not it; I was looking at RHEL 4 by mistake. It is happening (wiping the metadata from the PV in our vgremove function) in RHEL 5 too.
I bet that

14:24:59 INFO : vgremove VolGroup00

from anaconda.log failed (this is the vgremove which wipes the PV metadata by calling pvremove and pvcreate). Unfortunately the patch silently catches (without even logging) the exceptions thrown by the vgremove function, which is bad. So the updates file I am preparing will contain a hack to get more logs (all output from lvm) and will not catch lvm errors from the vgcreate function.
Well, after running out of time and needing to move on, I succeeded in getting the system working again. Here is what was done and its impact:

1. The first thing I tried was doing an install with the HBA to the shared storage disconnected. This did not work with autopart. In fact it complained in the exact same way as with the storage connected. So the real issue has nothing to do with the shared storage and everything to do with the state of the local drive from a previous build.

2. It took going to the Megaraid interface in the BIOS and initializing the LUN from there to clear the problem.

3. Once the system installed, I plugged in the HBA, rescanned the HBA-attached devices, and initialized them with dd. Then I rebooted the machine to make sure everything would come up properly. Which it did.

4. I returned the system to RHTS and requeued the same type of job again, to make sure it would work totally automated. Which it did.

So there is still a real problem, but I'm no longer convinced I know how to recreate it. This will have to be attempted when we get out of this QA cycle.

Barry
Well, to help avoid any Megaraid volume confusion, we should clear and recreate all volumes at each install. PERC3/4/5 cards have an RPM-installable utility that will perform this (among other) functions. Should I create a ticket for this to happen, with this bug as a dependency?
I think we really need install/support tools that work even harder at clearing and checking what is actually out there.

Also, bpeck and jburke have run into something very similar on ibm-js20-5.test.redhat.com:

https://bugzilla.redhat.com/show_bug.cgi?id=468431

The PERC proposal sounds interesting but only solves Dell-based systems. If jburke is seeing this on all sorts of hardware, how will we handle that if it's related? This other bug may be more related to what we are running into.

Barry
(In reply to comment #29)
> I think we really need install/support tools that work even harder and clearing
> and checking what is actually out there.

Agreed wholeheartedly.

> Also, bpeck and jburke have run into something very similar on
> ibm-js20-5.test.redhat.com

Interesting...

> https://bugzilla.redhat.com/show_bug.cgi?id=468431
>
> The PERC proposal sounds interesting but only solves Dell based systems. If

Not really. The tools should be available for any Adaptec/LSI hardware.

> jburke is seeing this on all sorts of hardware, how will we handle that if its
> related ? This other bug may be more related to what we are running into.

My suggestion was an exemplary stopgap pending comprehensive install and support tools. Additionally, the volume/container creation possibilities have great test design potential... this of course is the bonus of implementing this sorta fix.

> Barry
I think I've run into this problem today while trying to convert my root filesystem from being on a logical volume to being a physical partition outside of LVM in my ks.cfg file, then rebooting and trying to kickstart. The kickstart process is seeing the old volume group and refusing to wipe it and create the new (shrunken) one. I understand that LVM is trying to keep me safe from data loss, but I need some way to poke it to destroy the VG from kickstart, and its refusal to wipe out data is really annoying me. I could use dd in the %pre section to wipe it out, but I don't know how much of the disk needs to get wiped to erase the VG metadata...

(And the double-DHCP in kickstart issue is really annoying when you're not using DHCP and need to enter static IP information twice, but I believe that was addressed and fixed in a different bug going into 5.3.)
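For concreteness, this is roughly the kind of %pre hack I have in mind — a sketch only, not something I've verified: /dev/sda2 is a placeholder for whatever partition holds the old PV, and the 1 MiB wipe size is my guess at how much covers the metadata:

```shell
%pre
# Sketch: zero the start of the old PV partition so the installer no
# longer sees the stale VG metadata. /dev/sda2 is a placeholder --
# substitute the real PV partition. The 1 MiB size is an assumption.
dd if=/dev/zero of=/dev/sda2 bs=1M count=1
%end
```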
Can you please describe the error message you are seeing, attach the kickstart file used, and an anaconda dump if available (or the anaconda log files if not: /tmp/anaconda.log, /tmp/lvmout, and /tmp/syslog)? This would help us see what exactly the problem is in your case.

Please note that this bug concerns damaged VGs, for example a VG from which a drive containing one of its PVs has been removed. The issue might have been resolved in Snapshot #5 of 5.3 with the fix for another (most probably duplicate) bug.

Erasing the first MB of the physical volume with dd should wipe out the lvm metadata.
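To illustrate the mechanics of that dd wipe without touching a real disk, here is a small demonstration on a scratch file (the file name is arbitrary and the 1 MiB figure comes from the advice above; on a real PV you would point dd at the partition device instead):

```shell
# Create a 2 MiB scratch file full of non-zero bytes to stand in for a PV.
img=$(mktemp)
dd if=/dev/urandom of="$img" bs=1M count=2 2>/dev/null

# Zero only the first 1 MiB in place (conv=notrunc leaves the rest alone).
# On a real PV this region holds the LVM label and first metadata area.
dd if=/dev/zero of="$img" bs=1M count=1 conv=notrunc 2>/dev/null

# Verify the first MiB is now all zeros.
cmp -s <(head -c 1048576 "$img") <(head -c 1048576 /dev/zero) && echo "first MiB zeroed"
rm -f "$img"
```

The conv=notrunc flag matters: without it, dd would truncate the file (harmless on a block device, but misleading in a file-based test).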
I'm not dealing with a damaged VG, but I am taking a box which was built with:

part /boot --fstype ext3 --size=300 --ondisk=sda
part pv.01 --size=100 --grow --ondisk=sda
volgroup VolGroup00 vg.01
logvol / --fstype ext3 --name=root --vgname=VolGroup00 --size=8192
logvol swap --fstype swap --name=swap --vgname=VolGroup00 --size=512

and trying to rebuild it as:

part /boot --fstype ext3 --size=300 --ondisk=sda
part pv.01 --size=100 --grow --ondisk=sda
volgroup VolGroup00 vg.01
part / --fstype ext3 --ondisk=sda --size=8192
part swap --fstype swap --ondisk=sda --size=512

So, I move root and swap out of the volgroup onto physical partitions, and in so doing shrink the size of the volgroup. Since the old VG isn't actually damaged, it may be a different case (this isn't a VG that you would need to use --partial on or anything). I can get you the rest of the information you requested if you need it, but you should be able to replicate it with that.
The issue described in comment #35 is fixed in bug #468431; the fix was included in Snapshot #3 of RHEL 5.3. (It is case 1) from https://bugzilla.redhat.com/show_bug.cgi?id=468431#c7.)
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The bug should be resolved with the fix for bug #468431 from 5.3. QA, are you able to reproduce it? Is it still a problem?
I'm going to put this on MODIFIED so it gets into the workflow. According to comment #36 there were no changes, but the issue seems fixed.
My issue technically never got resolved when I needed it, so I took matters into my own hands to clear the damaged VG data on the shared disks. Theoretically, the easiest way I know to test this is to:

1. Reserve in RHTS either (dell-pe6800-01.rhts... or veritas2.rhts...)
2. Blow away the install and reinstall manually without a kickstart file, and let anaconda try to grab the shared storage and build a mega system LUN.
3. Return the box to RHTS (cancel the job).
4. Reserve the box again. No matter what is out on the shared storage, anaconda is told through the kickstart file to ignore it.
5. If anaconda cannot deal with the LVM bits on the shared storage at reboot (after install) time, then there is still an issue.

The only way around the issue, if it still exists, is to boot the system without the attached storage, then connect it and zero the disks' LVM metadata.

Barry
Barry, both dell-pe6800-01 and veritas2 systems require access keys which RTT is not aware of. Please schedule a regular install job and post the links here. Thanks.
With RHEL5.3 GA and steps to reproduce from #43:

Setting hostname veritas2.rhts.bos.redhat.com:  [  OK  ]
Setting up Logical Volume Management:
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
Refusing activation of partial LV LogVol00. Use --partial to override.
Couldn't find device with uuid 'IyXpSS-4Sze-I60A-oUby-pMM0-dSw7-NSZZ9F'.
1 logical volume(s) in volume group "VolGroup00" now active
2 logical volume(s) in volume group "VolRHTS01" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolRHTS01/root
/dev/VolRHTS01/root: clean, 66773/35258368 files, 1501770/35241984 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/sda1
/boot: clean, 33/26104 files, 15743/104388 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
Enabling /etc/fstab swaps:  [  OK  ]
INIT: Entering runlevel: 3

Now trying with 5.4
Same result with RHEL 5.4 snap #5 as in comment #45
Original report from comment #0 is not reproducible with xen/pv guest on RHEL5.3.

ks.cfg:

zerombr yes
clearpart --all --initlabel
autopart

1) Install the system with 3 disks
2) remove all but xvda
3) reinstall on xvda with autopart
4) /tmp/lvmout in stage2 doesn't contain anything suspicious

lvmout from the link in comment #0 says:

Wiping cache of LVM-capable devices
Couldn't find device with uuid 'r038e1-2wPU-KXOT-nvth-JMqA-HVe0-uULQ2q'.
Couldn't find device with uuid 'qIUaL1-YkEW-0JOO-PlYq-dTx2-G7da-g03qVX'.
Couldn't find device with uuid 'ml5vI6-Khfr-gbxq-slhy-SWvs-hC2l-mwLiCh'.
There are 3 physical volumes missing.
visited LogVol00
visited LogVol01
A volume group called 'VolGroup00' already exists.
Also see:

https://bugzilla.redhat.com/show_bug.cgi?id=476582#c8
https://bugzilla.redhat.com/show_bug.cgi?id=476582#c9

It describes the steps in comment #47 with physical machines and verifies that the behavior is not present in RHEL5.4 snap #5.
Retested on PPC and i386; it seems not to be fixed.

Steps to reproduce:
1. Start installation on a system with more than one drive (tested with 2 and 4 drives).
2. Proceed to the partitioning dialog.
3. Select "Remove all partitions and create default layout" and select all drives.
4. Click next and finish the installation.
5. After reboot, restart the installation.
6. Proceed to the partitioning dialog.
7. Select "Remove linux partitions and create default layout" and select just the first drive.
8. Click next.

Result: Anaconda throws an exception.

Test environment:
RHEL5.4 RC1
anaconda-11.1.2.195-1

Tested on:
ibm-p5-03.rhts.englab.brq.redhat.com (PPC machine with 4 drives)
kvm with 2 drives (15G and 8G) with an i386 image

As it seems it's not fixed, moving back to the ASSIGNED state.
Created attachment 357161 [details] exception file related to comment 49
(In reply to comment #49)
This bug concerns damaged VGs, that is, VGs where a disk was physically removed or is missing (as in the bug Description). In the comment #49 case, though it looks like a bug too, a disk is just deselected in the UI, so I think it is not a reproducer of this bug.
(In reply to comment #49) What you describe is probably a duplicate of bug #532536.
*** This bug has been marked as a duplicate of bug 532536 ***