| Summary: | Kickstart --useexisting LVM Fails on HP CCISS Hardware on RHEL 5.7 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | anaconda | Assignee: | Martin Kolman <mkolman> |
| Status: | CLOSED ERRATA | QA Contact: | Release Test Team <release-test-team> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 5.7 | CC: | atodorov, mkolman, mkovarik, stbenjam |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | anaconda-11.1.2.263-1 | Doc Type: | Bug Fix |
| Doc Text: |
Cause: The lvm tool on cciss devices returns pv paths delimited by ! but Anaconda uses / delimited paths.
Consequence: Installation failed in some cases on cciss devices.
Fix: Replace ! by / in lvm tool output.
Result: Installation now works fine in the cases where it previously failed.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-10-01 00:30:07 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| Bug Depends On: | |||
| Bug Blocks: | 921048, 928844 | ||
| Attachments: | |||
|
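The fix summarized in the Doc Text (replacing `!` with `/` in the lvm tool output) can be illustrated with a small shell sketch. This is not Anaconda's actual patch: the helper name `normalize_pv_path` and the sample input are assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical helper (not Anaconda's code): turn a '!'-delimited device
# name, as the Doc Text says lvm reports for cciss devices, into the
# '/'-delimited form that Anaconda uses.
normalize_pv_path() {
    # tr maps every '!' to '/'; all other characters pass through unchanged.
    printf '%s\n' "$1" | tr '!' '/'
}

# Illustrative input only; the exact string lvm prints is an assumption.
normalize_pv_path 'cciss!c0d0p2'
```

With that sample input the helper prints `cciss/c0d0p2`, which matches the `cciss/c0d0p1`-style names used in the kickstart discussed in the comments.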
Description
Stephen Benjamin
2011-11-04 13:37:33 UTC
Created attachment 531768 [details]
Kickstart file (sensitive data in %post removed)
We ended up resolving this by handling /appdata entirely in %pre and removing it from anaconda's control. Not an ideal solution, and this is still probably a bug in anaconda in RHEL 5.7.

Created attachment 748900 [details]
An instrumented updates image for getting more data about the bug (BZ 751351)

Looks like I need some more information, so I've created an instrumented updates image (cciss_updates_v1.img) that logs quite a lot of additional data, which should help me fix the bug. Could you run it on the affected hardware and send me the anaconda log file? Thanks in advance!

Hi Martin, I was a consultant on-site at the customer, and I haven't been to this customer in a while. I'll ask them if they still have the old RHEL 5 build infrastructure in place to try this, but I think they've completely moved to RHEL 6 for the HP blades (and are not even using this /appdata kickstart code anymore).

Created attachment 749258 [details]
anaconda.log with update image cciss_updates_v1.img
Hi Martin,
I am able to provide the information you requested.
From closer examination, it seems that the sequence of commands in the attached kickstart should not work at all, because:

First pass:
* creates 2 partitions on c0d0
  - /boot
  - pv.01
* creates 1 partition on c1d0
  - pv.02

Second pass:
* destroys pv.02 with "part /boot --onpart=cciss/c1d0p1" (using just --onpart will still format the partition with the default filesystem; to skip formatting, both --onpart and --noformat need to be used at the same time according to the docs[1])
  - "pv.02" is the first and only partition on c1d0, so c1d0p1 in the part command can only refer to it
* as a result, "logvol /appdata --noformat --name=lv_nfs --vgname=vg_cluster" will fail, as "vg_cluster" was on "pv.02", that just got overwritten by /boot

So unless there is some additional smart-array related magic in action, the kickstart should not work in the first place, and if it ever did, it might have been due to some now-fixed regression or to partitioning changes done outside of Anaconda.

[1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Installation_Guide/s1-kickstart2-options.html

First pass:
--driveorder=cciss/c1d0,cciss/c0d0

Second pass:
--driveorder=cciss/c0d0,cciss/c1d0

"* as a result, "logvol /appdata --noformat --name=lv_nfs --vgname=vg_cluster" will fail, as "vg_cluster" was on "pv.02", that just got overwritten by /boot"

That's an incorrect understanding of what was happening. For some reason the drive order was reversed in the second pass, so pv.02 is on c0d0 in the second pass. I don't really remember why this was done; maybe in a later step after initial provisioning we were doing something about the order of the storage controllers. I'll have to check with the customer.

But this snippet of kickstart worked for years up to and including RHEL 5.6 and stopped working in 5.7.
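The --onpart versus --noformat distinction raised in the analysis above can be sketched in shell, in the same echo-a-kickstart-line style the %pre script uses. The helper name `reuse_part_line` and its "keep" argument are hypothetical; only the `part`, `--onpart` and `--noformat` keywords come from the thread and the cited documentation.

```shell
#!/bin/bash
# Hypothetical helper: build a kickstart "part" line that reuses an
# existing partition.
#   $1 = mount point
#   $2 = existing partition
#   $3 = "keep" to preserve the partition's current contents (optional)
reuse_part_line() {
    local line="part $1 --onpart=$2"
    if [ "${3:-}" = "keep" ]; then
        # Per the docs cited above, --noformat has to accompany --onpart
        # if the partition should not be reformatted.
        line="$line --noformat"
    fi
    echo "$line"
}

# Without "keep", the emitted line would let Anaconda format the partition.
reuse_part_line /boot cciss/c1d0p1
# With "keep", the emitted line asks Anaconda to leave the filesystem alone.
reuse_part_line /boot cciss/c1d0p1 keep
```

In a real %pre section the output would be appended to /tmp/part-include; the sketch only prints the lines so it can be tried outside the installer.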
(In reply to Stephen Benjamin from comment #10)
> First pass:
> --driveorder=cciss/c1d0,cciss/c0d0
>
> Second pass:
> --driveorder=cciss/c0d0,cciss/c1d0
>
> "* as a result, "logvol /appdata --noformat --name=lv_nfs
> --vgname=vg_cluster" will fail, as "vg_cluster" was on "pv.02", that just
> got overwritten by /boot"
>
> That's an incorrect understanding of what was happening. For some reason in
> the second pass the drive order was reversed, pv.02 is on c0d0 in the second
> pass. I don't really remember why this was done, maybe in a later step
> after initial provisioning we were doing something about order of the
> storage controllers. I'll have to check with the customer.

Oh, so in the second pass c0d0 = c1d0 and c1d0 = c0d0? Yeah, that would explain your log.

BTW, even like this, it could still fail because of partition ordering. Even though:
"part /boot --fstype=ext3 --size=100 --ondisk=cciss/c0d0 --asprimary"
is run before
"part pv.01 --size=135000 --grow --asprimary --ondisk=cciss/c0d0"
it does not mean Anaconda will create the partitions in this order.

Citing the Anaconda kickstart wiki page[1]:
"Anaconda may create partitions in any particular order, so it is safer to use labels than absolute partition names."

While this note does not appear in the RHEL 5 kickstart documentation, the same behavior on RHEL 5 is mentioned in bug 410011 comment 21.

So Anaconda could theoretically create pv.01 as the first partition and /boot as the second, meaning that pv.01 would be obliterated by "part /boot --onpart=cciss/c1d0p1" in the second pass, together with the vg_sys that hosts / and swap.

> But this snippet of kickstart worked for years up to and including RHEL 5.6
> and stopped working in 5.7.

Thanks! This information considerably narrows down the period in which this bug was introduced.

[1] http://fedoraproject.org/wiki/Anaconda/Kickstart#part_or_partition

Created attachment 753674 [details]
instrumented debug image V2
Created attachment 753675 [details]
instrumented debug image V3
(In reply to Michal Kovarik from comment #8)
> Created attachment 749258 [details]
> anaconda.log with update image cciss_updates_v1.img
>
> Hi Martin,
> I am able to provide you desired informations.

Thanks a lot for the logs! Based on them, I made further instrumented images (V2 and V3). Could you run them with the same kickstart and hardware configuration and post the logs? Thanks in advance!

BTW, the difference between the two images:
V2 - basically V1 with more debugging messages
V3 - V2 with partition name parsing reverted to 5.6 behavior

I'll just add that the V2 and V3 images are configured to wait indefinitely once Anaconda exits (for easier debugging), so if you are seeing that, it is not a bug.

Created attachment 753917 [details]
anaconda.log V2
Created attachment 753918 [details]
anaconda.log V3
Now testing with RHEL5.10-Server-20130530.0, anaconda-11.1.2.263-2
+-----------------+ Error +-----------------+
| |
| Error mounting device vg_cluster/lv_nfs |
| as /appdata: Invalid argument |
| |
| This most likely means this partition |
| has not been formatted. |
| |
| Press OK to reboot your system. |
| |
| +----+ |
| | OK | |
| +----+ |
| |
| |
+-------------------------------------------+
Martin, what does this mean?
(In reply to Alexander Todorov from comment #20)
> Now testing with RHEL5.10-Server-20130530.0, anaconda-11.1.2.263-2
>
> +-----------------+ Error +-----------------+
> | |
> | Error mounting device vg_cluster/lv_nfs |
> | as /appdata: Invalid argument |
> | |
> | This most likely means this partition |
> | has not been formatted. |
> | |
> | Press OK to reboot your system. |
> | |
> | +----+ |
> | | OK | |
> | +----+ |
> | |
> | |
> +-------------------------------------------+
>
> Martin, what does this mean?

Have you changed the order of the disks for the second pass as mentioned by Stephen in comment 10?

Yes, see kickstart config in comment #19

Indeed, the kickstart you are using changes the order of the disks in the second pass, but that expects that the physical order of the disks also changed, as mentioned by Stephen in comment 10. That's probably what is causing the error you are seeing. (c0d1p1, which holds the pv.02 partition, will be overwritten by the boot partition, and therefore vg_cluster and all its LVs are gone and can't be reused.)

I see two possible solutions:
1) swap the physical location of the disks (c0d0 -> c0d1, c0d1 -> c0d0) and run the unchanged kickstart
2) don't swap the order of the disks in the kickstart for the second pass

The second pass would then look like this:

echo "bootloader --location mbr --driveorder=cciss/c0d0,cciss/c0d1" > /tmp/part-include
echo "part /boot/efi --onpart=cciss/c0d0p1" >> /tmp/part-include
echo "volgroup vg_sys --useexisting" >> /tmp/part-include
echo "logvol / --fstype=ext3 --useexisting --name=lv_root --vgname=vg_sys" >> /tmp/part-include
echo "logvol swap --fstype=swap --useexisting --name=lv_swap --vgname=vg_sys" >> /tmp/part-include
echo "volgroup vg_cluster --noformat" >> /tmp/part-include
echo "logvol /appdata --noformat --name=lv_nfs --vgname=vg_cluster" >> /tmp/part-include

Hi Stephen, did you physically change the order of disks in the system?
Hi Martin,
I don't think I need to physically swap the disks because:

1) I didn't do it and I was able to reproduce the exact same traceback with 5.7. How is it possible to reproduce if I had to swap the disks physically and I didn't?

2) In the given kickstart snippet:

First pass:
bootloader --location mbr --driveorder=cciss/c0d0,cciss/c0d1
clearpart --drives=cciss/c0d0,cciss/c0d1 --all --initlabel
part /boot/efi --fstype=vfat --size=100 --ondisk=cciss/c0d0 --asprimary

/boot/efi lands on the 1st partition of the first disk (c0d0p1).

Second pass:
bootloader --location mbr --driveorder=cciss/c0d1,cciss/c0d0
part /boot/efi --onpart=cciss/c0d1p1

Now c0d0 becomes c0d1 because the drive order in kickstart is reversed. So c0d1p1 is the same partition where /boot/efi originally was in the first place.

Hi all,

There was nothing that needed to be done on the hardware side in terms of swapping the disks before the second pass. I don't even remember why the --driveorder is reversed, although I guess I had it like that for some reason.

What happens if you use --driveorder=cciss/c0d0,cciss/c0d1 on both passes?

Update: testing with the latest available tree: RHEL5.10-Server-20130701.4/anaconda-11.1.2.263-2.ia64

Tested with steps from comment #19, the result is as in comment #20. Now going to test comment #26.

(In reply to Stephen Benjamin from comment #26)
> Hi all,
>
> There was nothing that needed to be done on the hardware side in terms of
> swapping the disks before the second pass. I don't even remember why the
> --driveorder is reversed, although I guess I had it like that for some
> reason.
>
> What happens if you use --driveorder=cciss/c0d0,cciss/c0d1 on both passes?

See comment #20. That happens. Moving back to ASSIGNED.

I'll just add that --driveorder does not change the order of the disks in any way; it just sets the boot order (it's a parameter for the bootloader kickstart command after all).
Citing the documentation[1]:
"--driveorder - Specify which drive is first in the BIOS boot order. For example:
bootloader --driveorder=sda,hda"

So --driveorder should be completely irrelevant to this issue.

[1] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Installation_Guide/s1-kickstart2-options.html

Hi Martin,
I need to know if what I see is expected after the fix or another bug? When you patched this what was the supposed behavior?

Created attachment 777003 [details]
Kickstart from atodorov modified to work with virtio disks
(In reply to Alexander Todorov from comment #30)
> Hi Martin,
> I need to know if what I see is expected after the fix or another bug?

I think it's a different bug or an issue with the kickstart itself. I've modified your kickstart for use with virtio drives, only changing:
cciss/c0d0 -> vda
cciss/c0d1 -> vdb
and using /boot in place of /boot/efi.

The kickstart successfully completes the first pass, but fails on the second pass with:
"An error occurred trying to format vdb1. This problem is serious, and the install cannot continue."

While this is not exactly the same error you are getting, if the error were smartarray/cciss related, the kickstart should work fine when using virtual drives. That's why I think this is unrelated to the original cciss bug.

> When you patched this what was the supposed behavior ?

The second pass should run to the end and you should get a bootable install. Before submitting the patch, I tested it on a single-disk cciss machine and it worked fine.

Created attachment 777049 [details]
Kickstart from atodorov modified to work with virtio disks - reversed disk order in second pass
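The cciss-to-virtio substitutions described in the preceding comment could be applied mechanically with something like the sketch below. It is an assumption about how such a conversion might be scripted, not the method actually used for the attachments; the function name `to_virtio` is hypothetical, and the second mapping is assumed to be cciss/c0d1 -> vdb. Other options, such as --fstype, are left untouched.

```shell
#!/bin/bash
# Hypothetical filter: rewrite cciss device names in kickstart lines to
# virtio names and use /boot in place of /boot/efi.
to_virtio() {
    # Partition names (c0d0p1) are rewritten before whole-disk names (c0d0)
    # so that the 'p' separator used by cciss is dropped: c0d0p1 -> vda1.
    sed -e 's#cciss/c0d0p\([0-9][0-9]*\)#vda\1#g' \
        -e 's#cciss/c0d1p\([0-9][0-9]*\)#vdb\1#g' \
        -e 's#cciss/c0d0#vda#g' \
        -e 's#cciss/c0d1#vdb#g' \
        -e 's#/boot/efi#/boot#g'
}

# Example: a second-pass line from the kickstart quoted in this thread.
echo "part /boot/efi --onpart=cciss/c0d1p1" | to_virtio
```

For the example line the filter prints `part /boot --onpart=vdb1`.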
Created attachment 777050 [details]
LVM layout after second pass with reversed disk order
For the record, I've changed the order of the disks in the second pass (kickstart is attached). This causes the second pass to finish successfully and create the desired partition layout (see screenshot). I think this conclusively shows that the disk order for the second pass is wrong in the original kickstart.

(In reply to Martin Kolman from comment #35)
> For the record, I've changed the order of the disks in the second pass
> (kickstart is attached). This causes the second pass to finish successfully
> and create the desired partition layout (see screenshot).
>

Note: this comment refers to the line:
echo "part /boot --onpart=vdb1" >> /tmp/part-include
in comment #31 vs. comment #34

Testing again on hp-rx2660-01.rhts.eng.brq.redhat.com with disk order reversed, as suggested by Martin:

text
key --skip
install
reboot
auth --useshadow --enablemd5
firewall --disable
firstboot --disable
selinux --disabled
keyboard us
lang en_US
timezone --utc Etc/UTC
rootpw ......
skipx

%include /tmp/part-include

%packages
@core
@base

%pre
(
#!/bin/bash
#disk1=""
#disk2=""

# let's see if this appliance was bootstrapped before
lvm lvs | grep -q lv_nfs
declare -i ret=$?

# BZ 652417 (?): lvm vgchange -an

if [ "$ret" -eq 0 ]; then
    echo "bootloader --location mbr --driveorder=cciss/c0d1,cciss/c0d0" > /tmp/part-include
    echo "part /boot/efi --onpart=cciss/c0d0p1" >> /tmp/part-include
    echo "volgroup vg_sys --useexisting" >> /tmp/part-include
    echo "logvol / --fstype=ext3 --useexisting --name=lv_root --vgname=vg_sys" >> /tmp/part-include
    echo "logvol swap --fstype=swap --useexisting --name=lv_swap --vgname=vg_sys" >> /tmp/part-include
    echo "volgroup vg_cluster --noformat" >> /tmp/part-include
    echo "logvol /appdata --noformat --name=lv_nfs --vgname=vg_cluster" >> /tmp/part-include
else
    echo "bootloader --location mbr --driveorder=cciss/c0d0,cciss/c0d1" > /tmp/part-include
    echo "clearpart --drives=cciss/c0d0,cciss/c0d1 --all --initlabel" >> /tmp/part-include
    echo "part /boot/efi --fstype=vfat --size=100 --ondisk=cciss/c0d0 --asprimary" >> /tmp/part-include
    echo "part pv.01 --size=1350 --grow --asprimary --ondisk=cciss/c0d0" >> /tmp/part-include
    echo "volgroup vg_sys --pesize=32768 pv.01" >> /tmp/part-include
    echo "logvol / --fstype ext3 --name=lv_root --vgname=vg_sys --size=3000 --maxsize=20000 --grow" >> /tmp/part-include
    echo "logvol swap --fstype swap --name=lv_swap --vgname=vg_sys --size=1024 --maxsize=4096" >> /tmp/part-include
    echo "part pv.02 --size=8000 --grow --ondisk=cciss/c0d1 --asprimary" >> /tmp/part-include
    echo "volgroup vg_cluster --pesize=32768 pv.02" >> /tmp/part-include
    echo "logvol /appdata --fstype ext3 --name=lv_nfs --vgname=vg_cluster --size=7500" >> /tmp/part-include
fi
) > /tmp/ks-pre.log 2>&1

Installation works without issues and the previously created /appdata directory is present and content is intact. Moving to VERIFIED. Please re-open if seen again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1354.html |