Bug 1389130
Summary: Existing RAID or LVM metadata can cause various types of install failures.

Product: Fedora
Component: python-blivet
Version: 25
Status: CLOSED ERRATA
Reporter: Jason Tibbitts <j>
Assignee: Vratislav Podzimek <vpodzime>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
CC: anaconda-maint-list, blivet-maint-list, bugzilla, g.kaviyarasu, gmarr, jonathan, mkolman, robatino, sbueno, vanmeeuwen+fedora, vpodzime, vponcova
Hardware: Unspecified
OS: Unspecified
Whiteboard: AcceptedBlocker
Fixed In Version: python-blivet-2.1.6-3.fc25
Last Closed: 2016-11-15 13:32:37 UTC
Type: Bug
Bug Blocks: 1277289
Created attachment 1214452 [details]
program.log
Created attachment 1214453 [details]
anaconda.log
Created attachment 1214455 [details]
backtrace for install after wiping partition tables
Doing sgdisk -Z /dev/sda; sgdisk -Z /dev/sdb and rebooting into the installer gets me to the point where the installer starts creating filesystems, but then dies:
gi.overrides.BlockDev.LVMError: Failed to call the 'PvCreate' method on the '/com/redhat/lvmdbus1/Manager' object: GDBus.Error:org.freedesktop.DBus.Python.dbus.exceptions.DBusException: ('com.redhat.lvmdbus1.Manager', 'PV Already exists!')
I will include the backtrace, but I have to edit it because it includes my entire kickstart file and I really can't show that to you.
At this point, three arrays were activated. I don't know what happened to md2:
md3 : active raid1 sdb4[1] sda4[0]
215948288 blocks super 1.2 [2/2] [UU]
[============>........] resync = 63.4% (136970240/215948288) finish=6.3min speed=206308K/sec
bitmap: 1/2 pages [4KB], 65536KB chunk
md1 : active raid1 sdb1[1] sda1[0]
1047552 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md0 : active (auto-read-only) raid1 sdb2[1]
524224 blocks super 1.0 [2/1] [_U]
bitmap: 0/1 pages [0KB], 65536KB chunk
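As the later comments explain, the "PV Already exists!" failure comes from stale LVM metadata that the manual zeroing missed. For illustration only, a less error-prone approach than zeroing offsets by hand would be wipefs(8), which scans a device for known metadata signatures (mdraid, LVM2 PV, filesystem superblocks) and erases them. This is not something the installer does here; the `wipe_signatures` helper, the device list, and the `DRYRUN` guard (which makes the commands print instead of run) are all invented for this sketch:

```shell
# Sketch: clear known on-disk signatures before reinstalling.
# DRYRUN defaults to "echo" so the commands only print; set DRYRUN= to
# actually run them (destructive!).
wipe_signatures() {
    DRYRUN="${DRYRUN:-echo}"
    for dev in "$@"; do
        # wipefs -a erases every signature it recognizes, which covers
        # the stale md superblocks and LVM PV labels described in this bug.
        $DRYRUN wipefs -a "$dev"
    done
}

# Example devices taken from this report.
wipe_signatures /dev/md0 /dev/md1 /dev/md3
```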
Created attachment 1214457 [details]
backtrace for install after zeroing the beginning of the arrays
From that point I wrote zeros to the beginning of /dev/md0, 1, and 3, then ran sgdisk -Z on each drive. Then I rebooted into the installer yet again (with, as always, the same kickstart file) and received a different backtrace:
gi.overrides.BlockDev.MDRaidError: Process reported exit code 256: mdadm: super1.x cannot open /dev/sdb2: Device or resource busy
mdadm: /dev/sdb2 is not suitable for this array.
mdadm: create aborted
At this point all four arrays had been created:
md2 : active raid1 sdb3[1] sda3[0]
16760832 blocks super 1.2 [2/2] [UU]
resync=DELAYED
md3 : active raid1 sdb4[1] sda4[0]
215948288 blocks super 1.2 [2/2] [UU]
[==>..................] resync = 12.1% (26183488/215948288) finish=16.2min speed=194822K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk
md1 : active raid1 sdb1[1] sda1[0]
1047552 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md0 : active raid1 sdb2[1]
524224 blocks super 1.0 [2/1] [_U]
bitmap: 1/1 pages [4KB], 65536KB chunk
Note that md0 has only one disk, and sdb2 is already in there.
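This failure mode (the stale superblock on sdb2 auto-assembling a broken one-disk md0, which then keeps sdb2 busy when mdadm tries to create the new array) suggests the manual recovery is to stop the stale array and zero its members' superblocks before retrying. A hedged sketch, not anything anaconda runs; the `clear_stale_array` name and the print-only `DRYRUN` guard are invented for illustration:

```shell
# Sketch: stop a stale auto-assembled array and erase its members' md
# superblocks. DRYRUN defaults to "echo" so the commands only print.
clear_stale_array() {
    DRYRUN="${DRYRUN:-echo}"
    array="$1"; shift
    # Stop the broken array so its member devices are no longer busy...
    $DRYRUN mdadm --stop "$array"
    # ...then zero the old superblocks so udev/mdadm cannot re-assemble
    # the array on the next scan.
    for member in "$@"; do
        $DRYRUN mdadm --zero-superblock "$member"
    done
}

# The devices from this failure:
clear_stale_array /dev/md0 /dev/sdb2
```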
Created attachment 1214459 [details]
backtrace for install after zeroing the beginning and end of each partition
At this point, I dd'd zeroes onto the beginning and end of each partition on sda and sdb, did sgdisk -Z, and rebooted into the installer. Again, same kickstart file.
This gives the "PV Already exists!" error again:
gi.overrides.BlockDev.LVMError: Failed to call the 'PvCreate' method on the '/com/redhat/lvmdbus1/Manager' object: GDBus.Error:org.freedesktop.DBus.Python.dbus.exceptions.DBusException: ('com.redhat.lvmdbus1.Manager', 'PV Already exists!')
At this point, though, only two arrays have been activated:
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
md3 : active raid1 sdb4[1] sda4[0]
215948288 blocks super 1.2 [2/2] [UU]
[=>...................] resync = 9.2% (20005760/215948288) finish=15.8min speed=206312K/sec
bitmap: 2/2 pages [8KB], 65536KB chunk
md1 : active raid1 sdb1[1] sda1[0]
1047552 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk
Even though it's the same error, I'll still attach the traceback file in case there's something extra in there.
At this point I either run this:

    for i in /dev/md/*; do
        dd if=/dev/zero of=$i count=1024
    done
    for i in sda sdb; do
        for j in /dev/$i*; do
            dd if=/dev/zero of=$j bs=512 seek=$(( $(blockdev --getsz $j) - 1024 )) count=1024
        done
        sgdisk -Z /dev/$i
    done
    sync; sync

or this (and wait for the SSDs to clear completely):

    for i in /dev/sd[ab]; do
        hdparm --user-master u --security-set-pass a $i
        hdparm --user-master u --security-erase a $i
    done

and reboot. Then the machine installs. So I keep the former on hand at all times and just run it whenever I'm going to reinstall something. I should probably just stick it in %pre.

But basically, in order to do a clean install over drives when there are existing RAID arrays, you must:

- erase the beginning of each array
- erase the beginning and end of each partition on each disk
- erase the partition table of each disk

Doing just one or two of these will turn up various failures. I didn't try doing those without also wiping the partition table, but I have to reinstall yet again for unrelated reasons, so I'll try that too.

Created attachment 1214464 [details]
Backtrace after clearing just the beginning of the one active array.
So, as we know from the original comment, booting into the installer from a freshly installed system stops with "Kickstart insufficient". If at that point I did nothing other than zero the beginning of the one array which had activated, then reboot, the boot hung; eventually complaints about the dracut timeout handlers being run ended up on the console, so I reset the machine and booted back into the installer.
Now the install proceeds until I get yet a different message:
gi.overrides.BlockDev.MDRaidError: Process reported exit code 256: mdadm: error opening /dev/md/3: No such file or directory
There are no active arrays at this point.
That's about all of the combinations I can think of testing, and I think that covers all of the errors and backtraces I've seen while trying to get this one poor system reinstalled.
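The three-step wipe from the earlier comment (array starts, partition starts and ends, partition tables) could be packaged up for a kickstart %pre section, as the reporter suggests. A sketch under those assumptions; the `full_wipe` name and the print-only `DRYRUN` guard are invented here, and the real thing is destructive:

```shell
# Sketch: the full wipe sequence as one reusable function.
# DRYRUN defaults to "echo" so the commands only print; set DRYRUN= to
# actually wipe (destructive!).
full_wipe() {
    DRYRUN="${DRYRUN:-echo}"
    # 1. Zero the start of every assembled array so old array contents
    #    (including any LVM PV label) are gone.
    for md in /dev/md/*; do
        if [ -e "$md" ]; then
            $DRYRUN dd if=/dev/zero of="$md" count=1024
        fi
    done
    for disk in "$@"; do
        # 2. Zero the start and end of each partition: v1.2 md superblocks
        #    sit near the start, v1.0 superblocks near the end.
        for part in /dev/${disk}[0-9]*; do
            [ -e "$part" ] || continue
            sz=$(blockdev --getsz "$part" 2>/dev/null || echo 2048)
            $DRYRUN dd if=/dev/zero of="$part" count=1024
            $DRYRUN dd if=/dev/zero of="$part" bs=512 seek=$(( sz - 1024 )) count=1024
        done
        # 3. Finally wipe the partition table itself.
        $DRYRUN sgdisk -Z "/dev/$disk"
    done
    sync
}

full_wipe sda sdb
```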
(In reply to Jason Tibbitts from comment #3)
> Created attachment 1214455 [details]
> backtrace for install after wiping partition tables
>
> Doing sgdisk -Z /dev/sda; sgdisk -Z /dev/sdb and rebooting into the
> installer gets me to the point where the installer starts creating
> filesystems, but then dies:
>
> gi.overrides.BlockDev.LVMError: Failed to call the 'PvCreate' method on the
> '/com/redhat/lvmdbus1/Manager' object:
> GDBus.Error:org.freedesktop.DBus.Python.dbus.exceptions.DBusException:
> ('com.redhat.lvmdbus1.Manager', 'PV Already exists!')

The array is being created anew on the same portion of the disks as the previous run. Since you did not remove the old LVM metadata, you hit this error. I don't know what to tell you except that you have to try harder to remove the old metadata than that.

(In reply to Jason Tibbitts from comment #4)
> Created attachment 1214457 [details]
> backtrace for install after zeroing the beginning of the arrays
>
> From that point I wrote zeros to the beginning of /dev/md0, 1, and 3, then
> ran sgdisk -Z on each drive. Then I rebooted into the installer yet again
> (with, as always, the same kickstart file) and received a different
> backtrace:
>
> gi.overrides.BlockDev.MDRaidError: Process reported exit code 256: mdadm:
> super1.x cannot open /dev/sdb2: Device or resource busy
> mdadm: /dev/sdb2 is not suitable for this array.
> mdadm: create aborted
>
> At this point all four arrays had been created:
>
> md2 : active raid1 sdb3[1] sda3[0]
>       16760832 blocks super 1.2 [2/2] [UU]
>       resync=DELAYED
>
> md3 : active raid1 sdb4[1] sda4[0]
>       215948288 blocks super 1.2 [2/2] [UU]
>       [==>..................] resync = 12.1% (26183488/215948288)
>       finish=16.2min speed=194822K/sec
>       bitmap: 2/2 pages [8KB], 65536KB chunk
>
> md1 : active raid1 sdb1[1] sda1[0]
>       1047552 blocks super 1.2 [2/2] [UU]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
>
> md0 : active raid1 sdb2[1]
>       524224 blocks super 1.0 [2/1] [_U]
>       bitmap: 1/1 pages [4KB], 65536KB chunk
>
> Note that md0 has only one disk, and sdb2 is already in there.

When we create sdb2 it still has the old md signature (it's in the same place on the disk, and you didn't remove it), so udev/mdadm activate the broken mirror. Then we fail to create a new array b/c that old/broken array is active.

Vratislav, the pv info key is "/dev/md126" instead of the symbolic name ('/dev/md/lvm' in my test). I think this has been fixed at least once before. At the moment it looks like the installer cannot remove an lvm-on-md layout it created.

(In reply to David Lehman from comment #8)
> I don't know what to tell you except that you have to try harder to remove
> the old metadata than that.

Yes, of course. I did try to make it clear that I knew that a workaround was to figure out what needed zeroing and to make sure it was wiped before installing. But I was asked to open a ticket with my various attempts to install while incrementally figuring out what needed to be wiped, so... this ticket is the result.

Of course, it's actually not possible to know what to zero when you don't have a partition table on the device. You don't even know how anaconda will lay out the partitions. You must wait until anaconda actually writes the partition table it will use and then fails to install. Only at that point can you start wiping things. And even then I've still been in a situation where I just couldn't figure out what to wipe (besides the entire drive).

(In reply to David Lehman from comment #10)
> Vratislav, the pv info key is "/dev/md126" instead of the symbolic name
> ('/dev/md/lvm' in my test). I think this has been fixed at least once
> before.

Yes, but I believe that's correct because that's what LVM says. A proper fix here would be to add get_all_device_symlinks() and resolve_device_symlink() functions to libblockdev and use them for cases like this. A quick and a bit dirty fix is available at https://github.com/rhinstaller/blivet/pull/521

(In reply to Vratislav Podzimek from comment #13)
> Yes, but I believe that's correct because that's what LVM says. A proper
> fix here would be to add get_all_device_symlinks() and
> resolve_device_symlink() functions to libblockdev and use them for cases
> like this.

Blivet's lvm config says to prefer /dev/md/* over /dev/md*. Have we lost the ability to control this now?

(In reply to David Lehman from comment #14)
> Blivet's lvm config says to prefer /dev/md/* over /dev/md*. Have we lost
> the ability to control this now?

Yes, because with the LVM DBus API, getting the list of PVs is just an enumeration of DBus objects and their properties. And since blivet's configuration of preferred names/paths is not a global one, the LVM DBus daemon doesn't use it.
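The proposed get_all_device_symlinks() does not exist in libblockdev at this point; purely as an illustration of the idea, here is a rough Python sketch that maps a kernel device node (like /dev/md126) back to its /dev/md/* symlinks (like /dev/md/lvm). The function name follows the proposal, and the directory is parameterized only so the sketch can run against a throwaway directory instead of a live /dev:

```python
import os
import tempfile

def get_all_device_symlinks(md_dir, kernel_path):
    """Return symlinks under md_dir that resolve to kernel_path.

    Hypothetical helper sketching the libblockdev function proposed in
    the comment above; on a real system md_dir would be /dev/md.
    """
    links = []
    for name in sorted(os.listdir(md_dir)):
        link = os.path.join(md_dir, name)
        # A /dev/md/<name> entry is a symlink to the kernel node; compare
        # fully-resolved paths so indirect links also match.
        if os.path.islink(link) and os.path.realpath(link) == os.path.realpath(kernel_path):
            links.append(link)
    return links

# Demo on a throwaway directory standing in for /dev:
tmp = tempfile.mkdtemp()
kernel_node = os.path.join(tmp, "md126")
open(kernel_node, "w").close()
md_dir = os.path.join(tmp, "md")
os.mkdir(md_dir)
os.symlink(kernel_node, os.path.join(md_dir, "lvm"))  # like /dev/md/lvm -> /dev/md126
print(get_all_device_symlinks(md_dir, kernel_node))
```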
python-blivet-2.1.6-3.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-91844f4982

Discussed during the 2016-11-07 blocker review meeting. [1] The decision to classify this bug as an AcceptedBlocker was made because it violates the following Beta blocker criterion:

"When using the custom partitioning flow, the installer must be able to: Correctly interpret, and modify as described below, any disk with a valid ms-dos or gpt disk label and partition table containing ext4 partitions, LVM and/or btrfs volumes, and/or software RAID arrays at RAID levels 0, 1 and 5 containing ext4 partitions"

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2016-11-07/f25-blocker-review.2016-11-07-17.01.txt

anaconda-25.20.8-1.fc25, python-blivet-2.1.6-3.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-91844f4982

anaconda-25.20.8-1.fc25, python-blivet-2.1.6-3.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.
Created attachment 1214451 [details]
storage.log

Take a system which was installed to blank drives. Install using a basic kickstart file which includes this for the disk section:

    zerombr
    clearpart --drives=sda,sdb --all --initlabel
    part raid.01 --asprimary --ondisk=sda --size=512
    part raid.11 --asprimary --ondisk=sda --size=1024
    part raid.21 --ondisk=sda --size=16384
    part raid.31 --ondisk=sda --grow
    part raid.02 --asprimary --ondisk=sdb --size=512
    part raid.12 --asprimary --ondisk=sdb --size=1024
    part raid.22 --ondisk=sdb --size=16384
    part raid.32 --ondisk=sdb --grow
    raid /boot/efi --fstype efi --level=1 --device=md0 raid.01 raid.02
    raid /boot --level=1 --device=md1 raid.11 raid.12
    raid swap --level=1 --device=md2 raid.21 raid.22
    raid pv.0 --level=1 --device=md3 raid.31 raid.32
    volgroup os pv.0
    logvol /scratch --fstype xfs --name=scratch --vgname=os --size=1000
    logvol / --fstype xfs --name=root --vgname=os --size=10000
    logvol /var --fstype xfs --name=var --vgname=os --size=8192

(That's actually the output of my %pre script, not the kickstart file itself.)

Then reboot to the installer and feed it exactly the same kickstart file. For me the install stops, repeatably, with "Kickstart insufficient". I will attach logs.

I believe the problem is that one (but just one) of the arrays from the previous install somehow activated:

    Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear]
    md3 : active raid1 sdb4[1] sda4[0]
          215948288 blocks super 1.2 [2/2] [UU]
          bitmap: 0/2 pages [0KB], 65536KB chunk

Note that this is one failure in a sequence, which I can achieve by erasing various pieces of metadata. I've managed to uncover several different failures, though besides this one all have caused backtraces instead of the dreaded "Kickstart insufficient". At least the backtraces tell you what went wrong. Those will follow as I try to categorize which piece of extraneous metadata causes which anaconda failure.
Of course I know how to erase enough metadata so that this doesn't happen, but I was asked to file a ticket for it anyway. I have some recollection of open tickets relating to this kind of thing, but I had trouble finding one.