Description of problem:
Booting into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso does not recognize LVM partitions, but works with normal partitions. It works with Fedora-Server-dvd-x86_64-31-1.9.iso and on Fedora 32 installations.

Version-Release number of selected component (if applicable):
Fedora-Server-dvd-x86_64-32-1.3.iso, and an installation to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso with the default disk layout suggested by the installer.

How reproducible:
Install to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso using the default disk layout suggested by the installer, then boot into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso.

Steps to Reproduce:
1. Install to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso and use the default disk layout suggested by the installer.
2. Boot into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso.
3. Choose "1" to "Continue".
4. The message "You don't have any Linux partitions. Rebooting. Please press ENTER to get a shell:" appears.
5. Booting into rescue mode from Fedora-Server-dvd-x86_64-31-1.9.iso works as expected.

Actual results:
Rescue mode does not recognize Linux LVM partitions, but it can see normal partitions.

Expected results:
Rescue mode should recognize Linux LVM partitions.

Additional info:
Proposing as a blocker per https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Rescue_mode , but we do have a test for this in openQA, so it seems surprising if it's broken. Investigating...
so the openQA test uses an install from the Server DVD with automatic partitioning (so, should be using LVM) but also with encryption (because we want to test that rescue mode can handle data decryption). That test definitely passed on the RC. Will try a few scenarios manually.
Please try to boot from Fedora-Server-dvd-x86_64-32-1.3.iso, choose "Troubleshooting" in the menu, then "Rescue a Fedora system", then "1" to continue. After that, the message "You don't have any Linux partitions. Rebooting. Please press ENTER to get a shell:" appears.
I have tested exactly according to the given reproducer, using:
* the Workstation Live to create a Fedora system with the default hard drive partitioning,
* the Server DVD to rescue the system.
I booted the system as read/write using Option 1, and the disks were mounted normally on /mnt/sysroot as expected. I tested in a virtual machine; seeing comment 2 above, maybe this only affects bare metal? I will try that, too.
I can confirm this is an issue. Installed F32 Workstation on bare metal (everything default, LVM), rescue mode from Fedora Server DVD ends with "You don't have any Linux partitions."
I tested an install from Server DVD and rescue from the same Server DVD and it was fine. Testing install from a Workstation live USB now, will try rescuing from the Server DVD. #c4 vs. #c5 seems odd.
I can confirm this issue on Lenovo T460s. Same approach as in comment 4, same results as in comment 5.
I wonder if this is not about install source, but some property of the storage device used. Can folks post more details on exactly what storage they're using? What size disk etc.? fdisk, lvdisplay etc. outputs may be useful. My test box has a 500GB disk with 512 byte sectors, it looks like, according to fdisk... here's lvdisplay output...

  --- Logical volume ---
  LV Path                /dev/fedora/swap00
  LV Name                swap00
  VG Name                fedora
  LV UUID                cuxddF-J2Js-SY88-Ngyq-HCla-kfM5-rZg24K
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:11 +0000
  LV Status              available
  # open                 2
  LV Size                6.85 GiB
  Current LE             1754
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4

  --- Logical volume ---
  LV Path                /dev/fedora/home00
  LV Name                home00
  VG Name                fedora
  LV UUID                Mev5P9-efGS-Ja0m-YPSI-fnl1-46tF-TarvXj
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:11 +0000
  LV Status              available
  # open                 1
  LV Size                <387.91 GiB
  Current LE             99304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

  --- Logical volume ---
  LV Path                /dev/fedora/root00
  LV Name                root00
  VG Name                fedora
  LV UUID                P5td6w-XSDN-RuMX-PmXY-3Kc7-ZPkh-0814ur
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:22 +0000
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

pvdisplay:

  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               fedora
  PV Size               464.76 GiB / not usable 3.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              118978
  Free PE               0
  Allocated PE          118978
  PV UUID               76tPZb-YkPz-G14E-12Rn-NDRp-jZa1-6wy5ed

vgdisplay:

  --- Volume group ---
  VG Name               fedora
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <464.76 GiB
  PE Size               4.00 MiB
  Total PE              118978
  Alloc PE / Size       118978 / <464.76 GiB
  Free PE / Size        0 / 0
  VG UUID               hfzLuK-Rjdm-xhLj-OVvA-o6p0-HUaV-T1WifW
fdisk:

Disk /dev/sda: 465.78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: ST3500418AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xa6981153

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sda1  *       2048   2099199   2097152     1G 83 Linux
/dev/sda2       2099200 976773119 974673920 464.8G 8e Linux LVM
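For anyone gathering the disk details requested above, here is a small Python sketch of my own (not part of any tooling mentioned in this bug) that reads the disk model and logical/physical sector sizes straight from /sys/block on a Linux system, as an alternative to pasting fdisk output:

```python
import os

def disk_info(sys_block="/sys/block"):
    """Collect model and sector sizes for each block device under /sys/block."""
    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return None  # attribute absent for this device (e.g. loop devices)

    devices = sorted(os.listdir(sys_block)) if os.path.isdir(sys_block) else []
    info = {}
    for dev in devices:
        base = os.path.join(sys_block, dev)
        info[dev] = {
            "model": read(os.path.join(base, "device", "model")),
            "logical_sector": read(os.path.join(base, "queue", "logical_block_size")),
            "physical_sector": read(os.path.join(base, "queue", "physical_block_size")),
        }
    return info

if __name__ == "__main__":
    for dev, attrs in disk_info().items():
        print(dev, attrs)
```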
oh, and did you guys do UEFI or BIOS installs?
Same result with Option 2 - no Linux partitions found.
I will post the logs from the affected machine shortly; for the time being, I was doing BIOS-based installations both on the VM and on bare metal.
lvdisplay:

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/swap
  LV Name                swap
  VG Name                fedora_localhost-live
  LV UUID                Kk94TD-DnnW-anOu-lE6y-LziR-E85h-BQKrSw
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:00 -0400
  LV Status              available
  # open                 2
  LV Size                9.55 GiB
  Current LE             2445
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/home
  LV Name                home
  VG Name                fedora_localhost-live
  LV UUID                faQGA2-XQjG-2okD-WSAc-0IIS-6mQi-3JG7zs
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:00 -0400
  LV Status              available
  # open                 1
  LV Size                157.92 GiB
  Current LE             40428
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/root
  LV Name                root
  VG Name                fedora_localhost-live
  LV UUID                BYa2Ka-ivG2-UZdu-znBT-VuTk-DrwO-HuP41f
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:06 -0400
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
lsblk:

NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                               8:0    0 238.5G  0 disk
├─sda1                            8:1    0     1G  0 part /boot
└─sda2                            8:2    0 237.5G  0 part
  ├─fedora_localhost--live-root 253:0    0    70G  0 lvm  /
  ├─fedora_localhost--live-swap 253:1    0   9.6G  0 lvm  [SWAP]
  └─fedora_localhost--live-home 253:2    0 157.9G  0 lvm  /home
Fedora 32 was installed on the first hdd (/dev/sda).

fdisk:

Disk /dev/sda: 111.81 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SV300S3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: D94C5922-5790-4B1A-9FB2-DAC78D88D400

Device       Start       End   Sectors   Size Type
/dev/sda1     2048   1230847   1228800   600M EFI System
/dev/sda2  1230848   3327999   2097152     1G Linux filesystem
/dev/sda3  3328000 234440703 231112704 110.2G Linux LVM

Disk /dev/sdb: 111.81 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SHFS37A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes

Disk /dev/sdc: 465.78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: TOSHIBA DT01ACA0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/mapper/fedora_localhost--live-root00: 70 GiB, 75161927680 bytes, 146800640 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/mapper/fedora_localhost--live-swap00: 5.87 GiB, 6283067392 bytes, 12271616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/mapper/fedora_localhost--live-home: 34.36 GiB, 36880515072 bytes, 72032256 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

vgdisplay:

  --- Volume group ---
  VG Name               fedora_localhost-live
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <110.20 GiB
  PE Size               4.00 MiB
  Total PE              28211
  Alloc PE / Size       28211 / <110.20 GiB
  Free PE / Size        0 / 0
  VG UUID               LjHHhW-LxDh-IuEr-5s7T-k52J-jQKG-PbJa8m
Aaand lvdisplay:

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/swap00
  LV Name                swap00
  VG Name                fedora_localhost-live
  LV UUID                OPIPcs-ccVO-T4sO-qjDz-xAGH-WRUi-KITD0Y
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:17 +0200
  LV Status              available
  # open                 2
  LV Size                5.85 GiB
  Current LE             1498
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/home
  LV Name                home
  VG Name                fedora_localhost-live
  LV UUID                s3UIkm-0sl9-luSy-3d1Q-y83J-Tpkv-7u8s3U
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:17 +0200
  LV Status              available
  # open                 1
  LV Size                <34.35 GiB
  Current LE             8793
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/root00
  LV Name                root00
  VG Name                fedora_localhost-live
  LV UUID                I35mZJ-Lln5-nh1U-tO0k-B5dF-p2dV-SLmvHn
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:26 +0200
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
Created attachment 1679475 [details] logs from broken rescue mode I can easily reproduce this in a VM when using a SATA or SCSI disk. When I use a VIRTIO disk, the issue is not present. I can see this with a default Fedora Workstation Live installation, or even Everything netinst default minimal installation. The problem happens with F32 RC1.3 Server DVD, F32 Beta Server DVD, but it doesn't happen with F31 Server DVD. Logs from F32 RC1.3 Server DVD rescue mode (which can't find partitions) are attached here.
In today's Go/No-Go meeting, we agreed to accept this as a Fedora 32 Final blocker per the following two criteria:

1. The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations.
2. Custom partitioning (https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Custom_partitioning)

https://meetbot.fedoraproject.org/fedora-meeting-1/2020-04-16/f32-final-go_no_go-meeting.2020-04-16-16.59.log.html#l-460
Created attachment 1679476 [details] broken custom partitioning This bug also affects custom partitioning on affected devices. Existing LVM partitions can't be used, see screenshot.
So I've been digging into this with the help of kparal's logs at https://kparal.fedorapeople.org/tmp/rescue-bug/ , my own logs from a success case, and the blivet source. My best theory so far is that this is a race between LVM activation and blivet device scanning. We can see this in kparal's syslog. These are interesting snippets:

17:41:35,255 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:DeviceTree.populate: ignored_disks is [] ; exclusive_disks is []
17:41:35,437 INFO lvm: pvscan[1924] PV /dev/sda2 online, VG fedora_localhost-live is complete.
17:41:35,437 INFO lvm: pvscan[1924] VG fedora_localhost-live run autoactivation.
17:41:35,482 INFO lvm: 2 logical volume(s) in volume group "fedora_localhost-live" now active
17:41:35,498 INFO systemd:Finished LVM event activation on device 8:2.
17:41:35,753 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning sda2 (/sys/devices/pci0000:00/0000:00:07.0/virtio3/host8/target8:0:0/8:0:0:0/block/sda/sda2)...
17:41:35,817 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet: DeviceTree.get_device_by_uuid returned existing 14 GiB lvmvg fedora_localhost-live (100)
17:41:35,817 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:no LVs listed for VG fedora_localhost-live
17:41:35,818 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning fedora_localhost--live-swap00 (/sys/devices/virtual/block/dm-2)...

If we compare with my syslog from the success case:

17:34:42,902 INFO lvm: pvscan[1449] PV /dev/sda2 online, VG fedora_localhost-live is complete.
17:34:42,902 INFO lvm: pvscan[1449] VG fedora_localhost-live run autoactivation.
17:34:44,121 INFO lvm: 3 logical volume(s) in volume group "fedora_localhost-live" now active
17:34:44,133 INFO systemd:Finished LVM event activation on device 8:2.
17:35:31,695 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:DeviceTree.populate: ignored_disks is [] ; exclusive_disks is []
17:35:32,188 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning sda2 (/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sda/sda2)...
17:35:32,305 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet: DeviceTree.get_device_by_uuid returned existing 464.76 GiB lvmvg fedora_localhost-live (33)
17:35:32,338 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:updated fedora_localhost-live-home size to 387.91 GiB (387.91 GiB)
17:35:32,339 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:Adding fedora_localhost-live-home/387.91 GiB to fedora_localhost-live
17:35:32,339 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:added lvmlv fedora_localhost-live-home (id 37) to device tree
(yadda yadda)
17:35:32,357 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning fedora_localhost--live-home (/sys/devices/virtual/block/dm-3)...

You can see that in the failure case, VG autoactivation is happening *during the same second* blivet is busy scanning devices. It does happen very slightly before blivet gets to scanning sda2, but it's very close - and it happens 0.2 seconds *after* that "ignored_disks is [] ; exclusive_disks is []" log. Why is that significant? Because there's a call to `lvs_info.drop_cache()` at just about exactly the same time as that log call (in blivet.populator.PopulatorMixin._populate()). That is the cache from which blivet.populator.helpers.lvm.LVMFormatPopulator._update_lvs() reads lv_info, which if not populated, results in an early return with that "no LVs listed for VG" log message - which we *do* see in kparal's log, but *don't* see in mine.
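To make the race window concrete, here is a small, illustrative Python sketch (my own, not part of the bug's tooling) that parses the syslog-style timestamps quoted above and computes how close the cache drop is to VG autoactivation in each case:

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse a syslog-style 'HH:MM:SS,mmm' timestamp (date ignored)."""
    return datetime.strptime(ts, "%H:%M:%S,%f")

def window_ms(earlier: str, later: str) -> float:
    """Milliseconds between two same-day timestamps."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds() * 1000

# Failure case: blivet's cache drop ("ignored_disks is []..." log) is at
# 17:41:35,255 but VG autoactivation only finishes at 17:41:35,498 - so the
# cache can be repopulated before activation completes (~243 ms window).
print(window_ms("17:41:35,255", "17:41:35,498"))

# Success case: activation finished at 17:34:44,133, long before the
# populate pass started at 17:35:31,695 (~47.6 seconds earlier).
print(window_ms("17:34:44,133", "17:35:31,695"))
```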
So it seems at least plausible that that cache is getting hit and re-populated in those next 0.2 seconds, before the activation of the VG is actually complete. At the very least, we can't rule it out. I did an updates.img that adds a call to `lvs_info.drop_cache()` right before the population of `lv_info`, and that fixes the bug in a VM for me and kparal. Of course, simply *the fact of using an updates.img* might affect a race, so I'm not 100% sure that proves my point, but it does help a bit.
The updates.img for anyone else who wants to try it on a system that hits the bug is https://www.happyassassin.net/updates/1824418.1.img .
OK, so https://www.happyassassin.net/updates/1824418.2.img is an updates.img that adds the cache drop *and* adds a log statement whenever the cache is re-populated. https://www.happyassassin.net/updates/1824418.logonly.img adds the log statement but not the extra cache drop. Our VM-with-SCSI reproducer case seems to behave very much in line with my theory, with these images. With .2.img it works, and logs one population of the cache *just before* VG autoactivation (which is happening in upstream code somewhere) and one a little *after* (which is my added one). With .logonly.img it only logs the one population of the cache, just before VG autoactivation, and fails. That all adds up nicely. kparal's bare metal reproducer, though, isn't lining up quite as nicely. When we boot it with an updates image - even just logonly.img - we see autoactivation ~20 seconds before device scanning, and it works. When we boot it without an updates image, we see autoactivation right around the same time as device scanning, and it fails. This isn't inconsistent with my theory, but it doesn't prove it quite as nicely. It seems like perhaps on bare metal, just the sheer act of loading an updates.img either changes the timing or actually changes how the VG autoactivation is triggered somehow. I'm currently looking into how this autoactivation stuff is actually set up, it comes from the lvm2 package. There's a systemd service `lvm2-pvscan@.service` which seems to be the thing that gets run; I think it's triggered via udev rules, but I'm not 100% sure yet.
for clarity, here are the git diffs of the updates images:

=== .2.img:

diff --git a/blivet/populator/helpers/lvm.py b/blivet/populator/helpers/lvm.py
index 4b674fac..eb5a09ea 100644
--- a/blivet/populator/helpers/lvm.py
+++ b/blivet/populator/helpers/lvm.py
@@ -120,6 +120,7 @@ class LVMFormatPopulator(FormatPopulator):
             return
 
         vg_name = vg_device.name
+        lvs_info.drop_cache()
         lv_info = dict((k, v) for (k, v) in iter(lvs_info.cache.items())
                        if v.vg_name == vg_name)
 
diff --git a/blivet/static_data/lvm_info.py b/blivet/static_data/lvm_info.py
index 42261f9b..b45cb0a9 100644
--- a/blivet/static_data/lvm_info.py
+++ b/blivet/static_data/lvm_info.py
@@ -40,6 +40,7 @@ class LVsInfo(object):
     @property
     def cache(self):
         if self._lvs_cache is None:
+            log.info("XXX Populating LVsInfo cache!")
             try:
                 lvs = blockdev.lvm.lvs()
             except NotImplementedError:

=== .logonly.img:

diff --git a/blivet/static_data/lvm_info.py b/blivet/static_data/lvm_info.py
index 42261f9b..b45cb0a9 100644
--- a/blivet/static_data/lvm_info.py
+++ b/blivet/static_data/lvm_info.py
@@ -40,6 +40,7 @@ class LVsInfo(object):
     @property
     def cache(self):
         if self._lvs_cache is None:
+            log.info("XXX Populating LVsInfo cache!")
             try:
                 lvs = blockdev.lvm.lvs()
             except NotImplementedError:
===
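The pattern being patched above can be reduced to a few lines. This is a simplified sketch of my own (hypothetical names, not the actual blivet code): a lazily-populated cache property will happily serve a snapshot taken before VG autoactivation finished, and an explicit drop_cache() right before the consumer reads it forces a fresh scan:

```python
class LVsCache:
    """Minimal sketch of a lazily-populated cache (hypothetical, not blivet's code)."""

    def __init__(self, scan):
        self._scan = scan      # callable that lists currently active LVs
        self._cache = None

    @property
    def cache(self):
        if self._cache is None:            # populated only on first access
            self._cache = self._scan()
        return self._cache

    def drop_cache(self):
        self._cache = None                 # next access re-scans


active_lvs = []                            # system state: no LVs active yet
cache = LVsCache(lambda: list(active_lvs))

_ = cache.cache                            # early access: snapshot is empty
active_lvs.extend(["root", "swap", "home"])  # VG autoactivation happens *now*

stale = cache.cache                        # still the pre-activation snapshot
cache.drop_cache()                         # the fix: force a re-scan...
fresh = cache.cache                        # ...which now sees the LVs

print(stale)   # []
print(fresh)   # ['root', 'swap', 'home']
```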
So, here's what I've figured out so far regarding the autoactivation. The actual systemd service involved is lvm2-pvscan@.service. This is part of the lvm2 package. There is also a relevant udev rule, because we rely on event-based activation here: the udev rule is 69-dm-lvm-metad.rules, and is also part of the lvm2 package. Neither the udev rule nor the service has changed in a relevant way since F31.

The way things basically work here is that the udev rule 'tells' systemd that the service should be activated for a device if that device is a PV with LVM metadata. The rule does all the scan-y stuff then does this:

ENV{SYSTEMD_ALIAS}="/dev/block/$major:$minor"
ENV{SYSTEMD_WANTS}+="lvm2-pvscan@$major:$minor.service"

and systemd picks that up. If we look at logs with systemd.log_level=debug, it seems that systemd then ultimately actually triggers the service activation when it sees the device go to status 'plugged', like this:

Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2did-scsi\x2d0QEMU_QEMU_HARDDISK_drive\x2dscsi0\x2d0\x2d0\x2d0\x2dpart2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:00:07.0\x2dscsi\x2d0:0:0:0\x2dpart2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2dpartuuid-b4ba01b2\x2d02.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2did-lvm\x2dpv\x2duuid\x2dqiccbW\x2dG2Q4\x2dBzZJ\x2dTcRU\x2dcS5M\x2drFS2\x2dZd9TDR.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-sda2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: sys-devices-pci0000:00-0000:00:07.0-virtio3-host2-target2:0:0-2:0:0:0-block-sda-sda2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:58 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.
Apr 16 21:16:58 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@8:2.service/start
Apr 16 21:16:58 localhost systemd[1]: Added job system-lvm2\x2dpvscan.slice/start to transaction.

So, it seems like when the bug happens, that "dead -> plugged" state change is happening *late* for some reason. It actually seems like the blivet device scan *itself* triggers that state change and thus causes the device to be activated. When the bug doesn't happen, this "dead -> plugged" state change happens much earlier and so the device is activated much earlier. It seems like a whole block of interesting stuff happens in F31, but not F32. Here's F31:

[adamw@adam system]$ grep dev-block-8:2.device /share/data/31systemddebug.txt
Apr 16 21:22:05 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:22:17 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Keeping job dev-block-8:2.device/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Installed new job dev-block-8:2.device/start as 303
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged

After that, lvm2-pvscan@8:2.service fires and activates the LV. Here's F32:

[adamw@adam system]$ grep dev-block-8:2.device /share/data/32systemddebug.txt
Apr 16 21:16:29 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:16:58 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.
note how in F31 we pull in dev-block-8:2.device/start *before* it goes from dead -> plugged, and it's only 12 seconds after the 'Collecting' message (which happens very early in the boot process), long before blivet is scanning devices...whereas in F32 the device changes to plugged and *then* we pull in a 'start' job for it, and this happens *29* seconds after the "Collecting" message (during blivet device scan). I'm attaching the full journal dumps with systemd debug logging enabled from F31 (success) and F32 (failure) cases.
Created attachment 1679496 [details] journal output from F31 (success) case with systemd.log_level=debug
Created attachment 1679498 [details] journal output from F32 (failure) case with systemd.log_level=debug
The F32 virtio case - which is also a success case - looks much like the F31 case. Here are better-filtered logs from each case, with both important services shown:

F31 VM SCSI (success)
=====================
Apr 16 21:22:01 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:22:01 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:22:02 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:22:05 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:22:05 localhost systemd[1]: lvm2-pvscan@8:2.service: Collecting.
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:22:17 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Pulling in shutdown.target/stop from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Keeping job dev-block-8:2.device/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Keeping job system-lvm2\x2dpvscan.slice/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Installed new job dev-block-8:2.device/start as 303
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Installed new job lvm2-pvscan@8:2.service/start as 298
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Enqueued job lvm2-pvscan@8:2.service/start as 298
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged

F32 VM SCSI (failure)
=====================
Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:25 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:29 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:16:29 localhost systemd[1]: lvm2-pvscan@8:2.service: Collecting.
Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:58 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.
F32 VM virtio (success)
=======================
Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Trying to enqueue job lvm2-pvscan@252:2.service/start/fail
Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:48 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:51 localhost systemd[1]: dev-block-252:2.device: Collecting.
Apr 16 21:54:51 localhost systemd[1]: lvm2-pvscan@252:2.service: Collecting.
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Trying to enqueue job lvm2-pvscan@252:2.service/start/fail
Apr 16 21:55:03 localhost systemd[1]: Added job lvm2-pvscan@252:2.service/start to transaction.
Apr 16 21:55:03 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Pulling in dev-block-252:2.device/start from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Added job dev-block-252:2.device/start to transaction.
Apr 16 21:55:03 localhost systemd[1]: Pulling in shutdown.target/stop from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Keeping job dev-block-252:2.device/start because of lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Keeping job system-lvm2\x2dpvscan.slice/start because of lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Installed new job lvm2-pvscan@252:2.service/start as 298
Apr 16 21:55:03 localhost systemd[1]: dev-block-252:2.device: Installed new job dev-block-252:2.device/start as 303
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Enqueued job lvm2-pvscan@252:2.service/start as 298
Apr 16 21:55:03 localhost systemd[1]: dev-block-252:2.device: Changed dead -> plugged

I have not yet figured out what bit of systemd, or whatever, we're in that is *sometimes* doing this earlier enqueueing of the job and sometimes not, or why. Still working on that. Oh, note the very early failures to load the service (that show in all cases) are, I think, from the dracut environment. The service is not present in the dracut environment, hence the failure.
Adding zbyszek for a systemd angle on this. Zbigniew, any idea what's going on from a systemd perspective here? To reproduce this, you just need a qemu VM and a disk image with a stock Fedora install on it (we've tested with 32, but it probably affects others too). The test procedure is to attach the disk image to the VM, boot from a Fedora installer image, and at the boot menu, go to 'Troubleshooting' then 'Rescue a Fedora system'. Wait for the prompt and hit 1. If it tells you it mounted the installed system to /mnt/sysroot, the bug didn't happen. If it tells you you don't have any existing Fedora partitions, the bug did happen. The bug seems to relate to the timing of LV activation; see more detail in the investigation comments above. So far, we've found the following combinations seem to be repeatable:

      | Disk attached as SCSI | Disk attached as VirtIO
------------------------------------------------------
F31   | PASS                  | PASS
F32   | FAIL                  | PASS

We also have some bare metal systems which we think are running into the same issue. Here are installer images to test with:

F32 - https://dl.fedoraproject.org/pub/alt/stage/32_RC-1.3/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-32-1.3.iso
F31 - https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-31-1.9.iso
also CCing mcsontos for LVM perspective.
I find myself intrigued by this bit of systemd:

https://github.com/systemd/systemd/blob/master/src/core/device.c#L422

because the precise log message we're getting here in the success case - "Trying to enqueue job lvm2-pvscan@8:2.service/start/fail" - indicates that manager_add_job is being called with JobType start and JobMode fail. And that's exactly what that code does (via manager_add_job_by_name, which just gets from a unit name to a Unit instance):

r = manager_add_job_by_name(u->manager, JOB_START, *i, JOB_FAIL, NULL, &error, NULL);

and the comment on this block sounds a lot like something we might be hitting here:

/* So here's a special hack, to compensate for the fact that the udev database's reload cycles are not
 * synchronized with our own reload cycles: when we detect that the SYSTEMD_WANTS property of a device
 * changes while the device unit is already up, let's manually trigger any new units listed in it not
 * seen before. This typically happens during the boot-time switch root transition, as udev devices
 * will generally already be up in the initrd, but SYSTEMD_WANTS properties get then added through udev
 * rules only available on the host system, and thus only when the initial udev coldplug trigger runs.
 *
 * We do this only if the device has been up already when we parse this, as otherwise the usual
 * dependency logic that is run from the dead → plugged transition will trigger these deps. */

There's only one other place I can see offhand where this happens, and it's less of an obvious candidate for being where we're landing on the success path here - it's in `retroactively_start_dependencies` in unit.c.
This happens only in rescue mode, and the installed system boots just normally, right? Perhaps something is missing in the initrd... I am not sure I would be able to dive deeper into this today.
(In reply to Marian Csontos from comment #31) > Does this happen only in rescue mode and the installed system boots just > normally, right? Perhaps something missing in the initrd... I am not sure I > would be able to dive deeper into this today Yes, this only happens in the rescue mode.
(In reply to Lukas Ruzicka from comment #32) > Yes, this only happens in the rescue mode. To be more precise, in anaconda netinst/dvd, for both rescue mode and custom/blivet partitioning during installation. The installed system always boots fine, yes.
Comparing with "the installed system" isn't really apples-to-apples at all, though. They're very different scenarios. The VG in question contains the actual root filesystem of "the installed system", after all. So it'll be listed in fstab, and we may well build something into the initramfs to activate and mount it (in fact we kinda have to). This is all very different from the installer environment which uses a generic initramfs and doesn't actually need this VG to be activated in order to boot. So I'm not sure that comparison is telling us anything as the scenarios are just too different. I guess a slightly more accurate comparison might be if you did a non-LVM install then attached the disk image with an LVM install on it as a secondary disk and booted and saw whether they got autoactivated or not?
The "installed system" has LVM hinting for dracut as boot parameters (in the grub.cfg):

rd.lvm.lv=fedora/swap rd.lvm.lv=fedora/root

The installer media's initramfs detects that it's an installer boot and uses:

rd.lvm=0 rd.luks=0

so the detection and activation of LVM must happen later.

Not working (f32 netinstaller):

[ 44.197693] localhost-live lvm[1817]: 3 logical volume(s) in volume group "fedora_localhost-live" now active

## Ostensibly the LVs are active at some point. But I can't tell if the bug is that blivet just doesn't recognize them for some reason, or if they vanish.

Non-working case from f32 netinstaller:

[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.handle_device: name: fedora_localhost--live-root ; info: {'DEVLINKS': '/dev/fedora_localhost-live/root '
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: '/dev/disk/by-uuid/74d31899-63b2-45de-a124-b2ed6a6ba479 '
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: '/dev/disk/by-id/dm-name-fedora_localhost--live-root '
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: '/dev/disk/by-id/dm-uuid-LVM-BHwM8lvqSrbQY99BFHk6hZwjayvzoACdaxzcjZhrmtUBUCM6m0tmtUBr5xWWe1hh '
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: '/dev/mapper/fedora_localhost--live-root',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DEVNAME': '/dev/dm-4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DEVPATH': '/devices/virtual/block/dm-4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DEVTYPE': 'disk',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_ACTIVATION': '1',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_LV_NAME': 'root',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_NAME': 'fedora_localhost--live-root',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_SUSPENDED': '0',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_UDEV_DISABLE_LIBRARY_FALLBACK_FLAG': '1',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_UDEV_PRIMARY_SOURCE_FLAG': '1',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_UDEV_RULES_VSN': '2',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_UUID': 'LVM-BHwM8lvqSrbQY99BFHk6hZwjayvzoACdaxzcjZhrmtUBUCM6m0tmtUBr5xWWe1hh',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'DM_VG_NAME': 'fedora_localhost-live',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'ID_FS_TYPE': 'ext4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'ID_FS_USAGE': 'filesystem',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'ID_FS_UUID': '74d31899-63b2-45de-a124-b2ed6a6ba479',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'ID_FS_UUID_ENC': '74d31899-63b2-45de-a124-b2ed6a6ba479',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'ID_FS_VERSION': '1.0',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'MAJOR': '253',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'MINOR': '4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'SUBSYSTEM': 'block',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'SYS_NAME': 'dm-4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'SYS_PATH': '/sys/devices/virtual/block/dm-4',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'TAGS': ':systemd:',
[ 44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: 'USEC_INITIALIZED': '44139414'} ;
[ 44.724075] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:blivet:scanning fedora_localhost--live-root (/sys/devices/virtual/block/dm-4)...
[ 44.724734] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name: name: fedora_localhost--live-root ; incomplete: False ; hidden: False ;
[ 44.725590] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name returned None
[ 44.725801] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [32] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[ 44.730403] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[32]: LVM
[ 44.730462] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[32]:
[ 44.730525] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [32] (exit code: 0)
[ 44.730667] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [33] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[ 44.734597] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[33]: LVM
[ 44.734684] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[33]:
[ 44.734738] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [33] (exit code: 0)
[ 44.734895] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [34] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[ 44.738863] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[34]: LVM
[ 44.738945] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[34]:
[ 44.738994] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [34] (exit code: 0)
[ 44.740626] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: LVMDevicePopulator.run: name: fedora_localhost--live-root ;
[ 44.742015] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name: name: fedora_localhost-live ; incomplete: False ; hidden: True ;
[ 44.744598] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name returned existing 248.41 GiB lvmvg fedora_localhost-live (108)
[ 44.747855] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name: name: sda3 ; incomplete: False ; hidden: False ;
[ 44.749122] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name returned existing 248.41 GiB partition sda3 (99) with existing lvmpv
[ 44.750228] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name: name: fedora_localhost-live ; incomplete: False ; hidden: False ;
[ 44.751675] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name returned existing 248.41 GiB lvmvg fedora_localhost-live (108)
[ 44.753096] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name: name: fedora_localhost--live-root ; incomplete: False ; hidden: False ;
[ 44.754094] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet: DeviceTree.get_device_by_name returned None
[ 44.754148] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:no device obtained for fedora_localhost--live-root

Questions:
Why is 'dmsetup info' run three times? That doesn't happen with the f31 netinstaller.
I can't tell if the LV is already inactive at this point; this dmsetup invocation's output omits the state. The command tells us the LV is definitely seen, but is it active? *shrug* Not sure.
The last line, "no device obtained", is also not present with the f31 netinstaller.
Chris: I have already investigated all that stuff (except 'dmsetup info'). It is all covered in my notes. I explained in detail the timing problem with the activation when the bug happens. When the bug happens, the activation is triggered during *blivet's device scan itself*, and happens *just* (~0.2s) after blivet caches LV info retrieved via libblockdev - so it caches that there are no active LVs, *then* activates them, then when it reaches a partition that's part of a VG it starts scanning the VG but then gives up because of the bad cached data. When the bug does *not* happen, LV activation is triggered during system startup, 10-15 seconds *before* the blivet device scan runs. So the LVs are active when blivet caches the LV info, and everything goes fine.
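The stale-cache ordering described above can be sketched as a toy model (hypothetical names, not blivet's actual API): a scanner snapshots the set of active LVs once, the pvscan-triggered activation lands ~0.2s later, and the subsequent VG handling consults the stale snapshot.

```python
class LvmCache:
    """Caches the set of active LVs once, like blivet caching libblockdev data."""
    def __init__(self, probe):
        self._probe = probe          # callable returning currently active LVs
        self._snapshot = None

    def active_lvs(self):
        if self._snapshot is None:   # populated once, never refreshed
            self._snapshot = set(self._probe())
        return self._snapshot

    def invalidate(self):
        self._snapshot = None

active = set()                       # stands in for kernel state: active LVs

cache = LvmCache(lambda: active)
cache.active_lvs()                   # scan starts: cache filled -> empty set
active.add("fedora/root")            # ~0.2s later: lvm2-pvscan@ activates the VG

stale = "fedora/root" in cache.active_lvs()  # scan misses the LV (stale cache)
cache.invalidate()
fresh = "fedora/root" in cache.active_lvs()  # found after a refreshed lookup
```

In the failing case there is no invalidate step, which is why the VG scan gives up.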
f32 workstation live:

[ 9.722496] localhost systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
[ 9.751451] localhost lvm[1028]: pvscan[1028] PV /dev/sda3 online, VG fedora_localhost-live is complete.
[ 9.751451] localhost lvm[1028]: pvscan[1028] VG fedora_localhost-live run autoactivation.
[ 9.940521] localhost lvm[1028]: 3 logical volume(s) in volume group "fedora_localhost-live" now active

f32 netinstall:

[ 30.233570] localhost-live systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
[ 30.260823] localhost-live lvm[1881]: pvscan[1881] PV /dev/sda3 online, VG fedora_localhost-live is complete.
[ 30.261107] localhost-live lvm[1881]: pvscan[1881] VG fedora_localhost-live run autoactivation.
[ 30.325830] localhost-live lvm[1881]: 3 logical volume(s) in volume group "fedora_localhost-live" now active

Why is pvscan so late on the netinstall?
that's *still* the stuff I already wrote about. look at comment #27. you are seeing the lvm2-pvscan@ service I've been writing about for two days.
How do I infer that pvscan is late in one case and not another when c27 lacks monotonic times? From c34: >I guess a slightly more accurate comparison might be if you did a non-LVM install then attached the disk image with an LVM install on it as a secondary disk and booted and saw whether they got autoactivated or not? That's functionally achieved, and in a more standard way, without changing the controller interface type, by using workstation live to compare to the netinstall, which is why I did it. Still unanswered is why it's happening so late with netinstall.
"How do I infer that pvscan is late in one case and not another when c27 lacks monotonic times?" well, from the fact that I wrote about a dozen lines about how it's late. That was the entire point of the comment. (And also the fact that it's 12 seconds after 'Collecting' in the early case, but 29 seconds after 'Collecting' in the late case.)
(In reply to Kamil Páral from comment #19) > Created attachment 1679476 [details] > broken custom partitioning > > This bug also affects custom partitioning on affected devices. Existing LVM > partitions can't be used, see screenshot. If you press the "Reload" button on this screen and then retry the custom partitioning, it works as expected.
If you append the kernel command line parameter rd.lvm=1 to the entry "Rescue a Fedora system" (I used Fedora-Server-dvd-x86_64-32-1.3.iso), then it works on the first try. The same works when appending rd.lvm=1 to the "install ..." entry.
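For anyone reproducing this: that amounts to editing the rescue entry at the boot menu (Tab on BIOS, `e` on UEFI) and appending the parameter to the end of the kernel line. The existing arguments the ISO ships are elided as "..." here:

```
linux ... rd.lvm=1
```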
(In reply to Markus Teuber from comment #42) > If you append the kernel commandline paramter rd.lvm=1 to the entry "Rescue > a Fedora system" (i used Fedora-Server-dvd-x86_64-32-1.3.iso) then it works > at the first time. This works for me in my VM, but doesn't work on my bare metal laptop (3 attempts, none successful). There's still some race condition or a problem somewhere, even with that argument.
BTW, another question that occurred to me over the weekend is: should blivet actually be relying on the activation being triggered by systemd/udev/lvm2 anyway? After all, I thought the idea was kinda that it wanted to *inhibit* the rest of the OS from activating storage devices and stuff, and do all that work itself - that's why it sets `rd.lvm=0` (and other args) in the first place. So I suppose even if *currently* in the cases where this works it seems to work because of the earlier activation, making that trigger more often might not be the "right" fix here. We might be better off just assuming that the LVs ought to be activated by blivet itself, and looking for ordering bugs there - my debugging seems to suggest that at least in the VM case, blivet caches the LV info just barely before it somehow triggers activation of the VGs, so if we presume that blivet should be able to successfully activate and then scan VGs *itself*, it seems it may have an internal ordering issue.
It looks like the issue is with partitions not being recognized during the initial blivet scan. This is taken from the journal in the "logs from broken rescue mode" attachment, but I've seen the same issue on my VM.

Our first scan:

INFO:blivet:devices to scan: ['sda', 'sr0', 'loop0', 'loop1', 'loop2', 'live-rw', 'live-base']

During that we use parted to gather information about the partition table, which causes the kernel to read it, and we get uevents for the partitions:

Apr 16 17:41:35 localhost-live.default kernel: sda: sda1 sda2

After that the pvscan service is autostarted:

Apr 16 17:41:35 localhost-live.default systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Apr 16 17:41:35 localhost-live.default systemd[1]: Starting LVM event activation on device 8:2...

But it's already too late, because we store information about LVs and PVs during the initial scan. Invalidating the cache and re-scanning the system later might be a solution, but there is still a lot of space for another race condition.
FEDORA-2020-8f552bb1ee has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-8f552bb1ee
So to hopefully fix this well enough for F32 release, we bodged in a drop of the LVM info caches and a 2 second sleep (urgh) to each iteration of the device discovery loop. This works for at least the VM SCSI reproducer, and Kamil's hardware reproducer. Once the new RC is built I'll ask here for other folks who could reproduce this to try with it. This is not really a proper fix, though. There's a few avenues we're exploring in terms of a cleaner solution.
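The shape of that stopgap is roughly the following (a sketch with hypothetical names; the real change lives inside blivet's device discovery, and `scan_once`/`drop_lvm_caches` stand in for its internals):

```python
import time

def discover_devices(scan_once, drop_lvm_caches, settle_seconds=2, max_passes=5):
    """Stopgap sketch: between discovery passes, drop the cached LVM info and
    sleep, so activations triggered by the scan itself (uevents ->
    lvm2-pvscan@) can land before the next pass looks again.
    scan_once() returns the set of device names currently visible."""
    seen = set()
    for _ in range(max_passes):
        found = scan_once()
        if found <= seen:            # nothing new appeared; discovery settled
            return seen | found
        seen |= found
        drop_lvm_caches()            # discard any stale "no active LVs" data
        time.sleep(settle_seconds)   # crude: give udev/lvm time to activate

    return seen

# Demo: the LV only becomes visible on the second pass, after activation.
passes = [{"sda", "sda1", "sda2"},
          {"sda", "sda1", "sda2", "fedora-root"}]
calls = iter(passes + [passes[-1]])
devices = discover_devices(lambda: next(calls), lambda: None, settle_seconds=0)
```

The fixed sleep is what makes this a bodge rather than a fix: it papers over the race instead of synchronizing with the activation events.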
FEDORA-2020-8f552bb1ee fixes both Rescue Mode and Advanced (blivet-gui) partitioning on bare metal (tested via RC 1.4). I'll go ahead and mark it as verified.
With Fedora-Server-dvd-x86_64-32-1.4.iso, on both bare metal and VM, I can now successfully rescue a system or see the LVM partitions in the installer advanced partitioning.
Works for me too. Thanks a lot.
FEDORA-2020-8f552bb1ee has been pushed to the Fedora 32 testing repository. In short time you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-8f552bb1ee` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-8f552bb1ee See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
Works fine also with RC 1.5, bare metal and VM.
Created attachment 1680834 [details] journal without partitions
Created attachment 1680835 [details] storage.log without partitions
Bad news. We have another failed installation with the fix included, but this time there are no partitions discovered during the blivet scan, and installation fails when trying to wipe the disks. For some reason, this time our scan did not trigger the uevent, and the kernel found the partitions only after I manually ran parted and fdisk (timestamp 11:04:15,567).
(In reply to Vojtech Trefny from comment #55) > Bad news. We have another failed installation with the fix included but this > time there are no partitions discovered during the blivet scan and > installation fails when trying to wipe the disks. For some reason this time > our scan did not trigger the uevent and kernel found the partitions only > after I manually ran parted and fdisk (timestamp 11:04:15,567). I've tried this 10 more times on the same VM with RC 1.5 and I wasn't able to reproduce this behaviour. The logs are from Rawhide and I can reproduce it there, but I am 99% sure this happened also on F32. Unfortunately I can't reproduce it now, so it's possible it never happened. Sorry.
If anyone has spare cycles, please test this.
I tested this again with RC 1.5 and it worked fine for me.
RC 1.5 worked just fine on my two desktops (which were broken with 1.3), both Rescue mode and blivet-gui partitioning.
I can totally believe there are still occasional transient problems in this area, but I think the reproducible bugs here are clearly fixed according to feedback, setting back to VERIFIED.
FEDORA-2020-8f552bb1ee has been pushed to the Fedora 32 stable repository. If problem still persists, please make note of it in this bug report.
I found this thread because I recently ran across the same issue with Rocky Linux 9 (a Fedora-family distro). It also happened on CentOS and RHEL 9. I build hundreds of custom Fedora ISOs a month. I discovered that the likely reason this happens is a bad ISO that is missing some hidden files on the root of the disc/ISO. There are two hidden files specifically that cause this, which I have confirmed under extensive testing and across many Fedora variants: .discinfo and .treeinfo on the root of the ISO. It's easy to leave those out during ISO creation if one doesn't add the right flag to include hidden files. My guess is that the ISO mentioned in this thread, 'Fedora-Server-dvd-x86_64-32-1.3.iso', was faulty along with other ISOs made around the same time period. Strangely, that ISO has been scrubbed from the internet and there is no trace of it today.
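A quick way to sanity-check a built ISO for those markers (a small sketch; `missing_markers` is a hypothetical helper, and `iso_root` is wherever you loop-mounted or extracted the image):

```python
import os
import tempfile

MARKERS = (".discinfo", ".treeinfo")

def missing_markers(iso_root):
    """Return the marker files absent from the root of a mounted/extracted ISO."""
    return [m for m in MARKERS if not os.path.exists(os.path.join(iso_root, m))]

# Demo on a throwaway directory standing in for a mounted ISO root:
with tempfile.TemporaryDirectory() as iso_root:
    open(os.path.join(iso_root, ".discinfo"), "w").close()  # .treeinfo omitted
    gaps = missing_markers(iso_root)
```

An empty list from `missing_markers` means both hidden marker files made it into the image.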