Bug 1824418 - Rescue Mode doesn't recognize LVM partitions
Summary: Rescue Mode doesn't recognize LVM partitions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: python-blivet
Version: 32
Hardware: x86_64
OS: Unspecified
Target Milestone: ---
Assignee: Blivet Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Depends On:
Blocks: F32FinalBlocker
 
Reported: 2020-04-16 07:55 UTC by Markus Teuber
Modified: 2020-04-23 18:09 UTC
CC: 26 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-23 18:09:44 UTC
Type: Bug


Attachments
logs from broken rescue mode (131.73 KB, application/zip)
2020-04-16 19:14 UTC, Kamil Páral
broken custom partitioning (98.47 KB, image/png)
2020-04-16 19:25 UTC, Kamil Páral
journal output from F31 (success) case with systemd.log_level=debug (7.00 MB, text/plain)
2020-04-16 21:43 UTC, Adam Williamson
journal output from F32 (failure) case with systemd.log_level=debug (3.36 MB, text/plain)
2020-04-16 21:44 UTC, Adam Williamson

Description Markus Teuber 2020-04-16 07:55:13 UTC
Description of problem:
Booting into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso doesn't recognize LVM partitions, but it works with normal partitions.
It works with Fedora-Server-dvd-x86_64-31-1.9.iso and with Fedora 32 installations.

Version-Release number of selected component (if applicable):
Fedora-Server-dvd-x86_64-32-1.3.iso, with an installation to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso using the default disk layout suggested by the installer.

How reproducible:
Install to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso using the default disk layout suggested by the installer.
Boot into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso.

Steps to Reproduce:
1. Install to hard disk from Fedora-Workstation-Live-x86_64-32-1.3.iso using the default disk layout suggested by the installer.
2. Boot into rescue mode from Fedora-Server-dvd-x86_64-32-1.3.iso.
3. Choose "1" to "Continue".
4. The message "You don't have any Linux partitions. Rebooting.
   Please press ENTER to get a shell:" appears.
5. Booting into rescue mode from Fedora-Server-dvd-x86_64-31-1.9.iso works as expected.

Actual results:
Rescue mode doesn't recognize Linux LVM partitions, but it can see normal partitions.

Expected results:
Rescue mode should recognize Linux LVM partitions.

Additional info:

Comment 1 Adam Williamson 2020-04-16 15:38:41 UTC
Proposing as a blocker per https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Rescue_mode , but we do have a test for this in openQA, so it seems surprising if it's broken. Investigating...

Comment 2 Adam Williamson 2020-04-16 15:41:50 UTC
so the openQA test uses an install from the Server DVD with automatic partitioning (so, should be using LVM) but also with encryption (because we want to test that rescue mode can handle data decryption). That test definitely passed on the RC. Will try a few scenarios manually.

Comment 3 Markus Teuber 2020-04-16 16:05:37 UTC
Please try booting from Fedora-Server-dvd-x86_64-32-1.3.iso: choose "Troubleshooting" in the menu, then "Rescue a Fedora system", then "1" to continue.
After that, the message "You don't have any Linux partitions. Rebooting.
   Please press ENTER to get a shell:" appears.

Comment 4 Lukas Ruzicka 2020-04-16 16:30:46 UTC
I have tested exactly according to the given reproducer, using:

* the Workstation Live to create a Fedora system with the default hard drive partitioning,
* the Server DVD to rescue the system.

I booted the system as read/write using option 1, and the disks were mounted normally on /mnt/sysroot as expected.

I tested in a virtual machine; seeing comment 2 above, maybe this only affects bare metal?
I will try that, too.

Comment 5 František Zatloukal 2020-04-16 16:37:10 UTC
I can confirm this is an issue. Installed F32 Workstation on bare metal (everything default, LVM), rescue mode from Fedora Server DVD ends with "You don't have any Linux partitions."

Comment 6 Adam Williamson 2020-04-16 16:47:46 UTC
I tested an install from Server DVD and rescue from the same Server DVD and it was fine.

Testing install from a Workstation live USB now, will try rescuing from the Server DVD.

#c4 vs. #c5 seems odd.

Comment 7 Lukas Ruzicka 2020-04-16 16:56:30 UTC
I can confirm this issue on Lenovo T460s. Same approach as in comment 4, same results as in comment 5.

Comment 8 Adam Williamson 2020-04-16 16:56:39 UTC
I wonder if this is not about install source, but some property of the storage device used. Can folks post more details on exactly what storage they're using? What size disk, etc.? fdisk, lvdisplay etc. outputs may be useful. My test box has a 500 GB disk with 512-byte sectors, it looks like, according to fdisk. Here's the lvdisplay output:

  --- Logical volume ---
  LV Path                /dev/fedora/swap00
  LV Name                swap00
  VG Name                fedora
  LV UUID                cuxddF-J2Js-SY88-Ngyq-HCla-kfM5-rZg24K
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:11 +0000
  LV Status              available
  # open                 2
  LV Size                6.85 GiB
  Current LE             1754
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
   
  --- Logical volume ---
  LV Path                /dev/fedora/home00
  LV Name                home00
  VG Name                fedora
  LV UUID                Mev5P9-efGS-Ja0m-YPSI-fnl1-46tF-TarvXj
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:11 +0000
  LV Status              available
  # open                 1
  LV Size                <387.91 GiB
  Current LE             99304
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
   
  --- Logical volume ---
  LV Path                /dev/fedora/root00
  LV Name                root00
  VG Name                fedora
  LV UUID                P5td6w-XSDN-RuMX-PmXY-3Kc7-ZPkh-0814ur
  LV Write Access        read/write
  LV Creation host, time localhost, 2020-04-15 20:57:22 +0000
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
pvdisplay:

  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               fedora
  PV Size               464.76 GiB / not usable 3.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              118978
  Free PE               0
  Allocated PE          118978
  PV UUID               76tPZb-YkPz-G14E-12Rn-NDRp-jZa1-6wy5ed
   
vgdisplay:

  --- Volume group ---
  VG Name               fedora
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <464.76 GiB
  PE Size               4.00 MiB
  Total PE              118978
  Alloc PE / Size       118978 / <464.76 GiB
  Free  PE / Size       0 / 0   
  VG UUID               hfzLuK-Rjdm-xhLj-OVvA-o6p0-HUaV-T1WifW

Comment 9 Adam Williamson 2020-04-16 16:58:16 UTC
fdisk:

Disk /dev/sda: 465.78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: ST3500418AS     
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xa6981153

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sda1  *       2048   2099199   2097152     1G 83 Linux
/dev/sda2       2099200 976773119 974673920 464.8G 8e Linux LVM

Comment 10 Adam Williamson 2020-04-16 16:58:38 UTC
oh, and did you guys do UEFI or BIOS installs?

Comment 11 Lukas Ruzicka 2020-04-16 16:59:37 UTC
Same result with Option 2 - no Linux partitions found.

Comment 12 Lukas Ruzicka 2020-04-16 17:00:20 UTC
I will post the logs from the affected machine shortly. For the time being: I was doing BIOS-based installations both in the VM and on bare metal.

Comment 13 Lukas Ruzicka 2020-04-16 17:04:07 UTC
lvdisplay:

 --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/swap
  LV Name                swap
  VG Name                fedora_localhost-live
  LV UUID                Kk94TD-DnnW-anOu-lE6y-LziR-E85h-BQKrSw
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:00 -0400
  LV Status              available
  # open                 2
  LV Size                9.55 GiB
  Current LE             2445
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/home
  LV Name                home
  VG Name                fedora_localhost-live
  LV UUID                faQGA2-XQjG-2okD-WSAc-0IIS-6mQi-3JG7zs
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:00 -0400
  LV Status              available
  # open                 1
  LV Size                157.92 GiB
  Current LE             40428
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/root
  LV Name                root
  VG Name                fedora_localhost-live
  LV UUID                BYa2Ka-ivG2-UZdu-znBT-VuTk-DrwO-HuP41f
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 16:45:06 -0400
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

Comment 14 Lukas Ruzicka 2020-04-16 17:05:52 UTC
lsblk:
NAME                            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                               8:0    0 238.5G  0 disk 
├─sda1                            8:1    0     1G  0 part /boot
└─sda2                            8:2    0 237.5G  0 part 
  ├─fedora_localhost--live-root 253:0    0    70G  0 lvm  /
  ├─fedora_localhost--live-swap 253:1    0   9.6G  0 lvm  [SWAP]
  └─fedora_localhost--live-home 253:2    0 157.9G  0 lvm  /home

Comment 15 František Zatloukal 2020-04-16 17:08:21 UTC
Fedora 32 was installed on the first hdd (/dev/sda)

fdisk:
Disk /dev/sda: 111.81 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SV300S3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: D94C5922-5790-4B1A-9FB2-DAC78D88D400

Device       Start       End   Sectors   Size Type
/dev/sda1     2048   1230847   1228800   600M EFI System
/dev/sda2  1230848   3327999   2097152     1G Linux filesystem
/dev/sda3  3328000 234440703 231112704 110.2G Linux LVM


Disk /dev/sdb: 111.81 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SHFS37A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdc: 465.78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: TOSHIBA DT01ACA0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes




Disk /dev/mapper/fedora_localhost--live-root00: 70 GiB, 75161927680 bytes, 146800640 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/fedora_localhost--live-swap00: 5.87 GiB, 6283067392 bytes, 12271616 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/fedora_localhost--live-home: 34.36 GiB, 36880515072 bytes, 72032256 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


vgdisplay:
  --- Volume group ---
  VG Name               fedora_localhost-live
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <110.20 GiB
  PE Size               4.00 MiB
  Total PE              28211
  Alloc PE / Size       28211 / <110.20 GiB
  Free  PE / Size       0 / 0   
  VG UUID               LjHHhW-LxDh-IuEr-5s7T-k52J-jQKG-PbJa8m

Comment 16 František Zatloukal 2020-04-16 17:09:42 UTC
Aaand lvdisplay:

  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/swap00
  LV Name                swap00
  VG Name                fedora_localhost-live
  LV UUID                OPIPcs-ccVO-T4sO-qjDz-xAGH-WRUi-KITD0Y
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:17 +0200
  LV Status              available
  # open                 2
  LV Size                5.85 GiB
  Current LE             1498
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/home
  LV Name                home
  VG Name                fedora_localhost-live
  LV UUID                s3UIkm-0sl9-luSy-3d1Q-y83J-Tpkv-7u8s3U
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:17 +0200
  LV Status              available
  # open                 1
  LV Size                <34.35 GiB
  Current LE             8793
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
  --- Logical volume ---
  LV Path                /dev/fedora_localhost-live/root00
  LV Name                root00
  VG Name                fedora_localhost-live
  LV UUID                I35mZJ-Lln5-nh1U-tO0k-B5dF-p2dV-SLmvHn
  LV Write Access        read/write
  LV Creation host, time localhost-live, 2020-04-16 18:20:26 +0200
  LV Status              available
  # open                 1
  LV Size                70.00 GiB
  Current LE             17920
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

Comment 17 Kamil Páral 2020-04-16 19:14:05 UTC
Created attachment 1679475 [details]
logs from broken rescue mode

I can easily reproduce this in a VM when using a SATA or SCSI disk. When I use a VIRTIO disk, the issue is not present. I can see this with a default Fedora Workstation Live installation, or even Everything netinst default minimal installation. The problem happens with F32 RC1.3 Server DVD, F32 Beta Server DVD, but it doesn't happen with F31 Server DVD.

Logs from F32 RC1.3 Server DVD rescue mode (which can't find partitions) are attached here.

Comment 18 Ben Cotton 2020-04-16 19:24:41 UTC
In today's Go/No-Go meeting, we agreed to accept this as a Fedora 32 Final blocker per the following two criteria:

1. The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations. 

2. Custom partitioning (https://fedoraproject.org/wiki/Fedora_32_Beta_Release_Criteria#Custom_partitioning)

https://meetbot.fedoraproject.org/fedora-meeting-1/2020-04-16/f32-final-go_no_go-meeting.2020-04-16-16.59.log.html#l-460

Comment 19 Kamil Páral 2020-04-16 19:25:23 UTC
Created attachment 1679476 [details]
broken custom partitioning

This bug also affects custom partitioning on affected devices. Existing LVM partitions can't be used, see screenshot.

Comment 20 Adam Williamson 2020-04-16 19:41:51 UTC
So I've been digging into this with the help of kparal's logs at https://kparal.fedorapeople.org/tmp/rescue-bug/ , my own logs from a success case, and the blivet source.

My best theory so far is that this is a race between LVM activation and blivet device scanning. We can see this in kparal's syslog. These are the interesting snippets:

17:41:35,255 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:DeviceTree.populate: ignored_disks is [] ; exclusive_disks is []
17:41:35,437 INFO lvm:  pvscan[1924] PV /dev/sda2 online, VG fedora_localhost-live is complete.
17:41:35,437 INFO lvm:  pvscan[1924] VG fedora_localhost-live run autoactivation.
17:41:35,482 INFO lvm:  2 logical volume(s) in volume group "fedora_localhost-live" now active
17:41:35,498 INFO systemd:Finished LVM event activation on device 8:2.
17:41:35,753 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning sda2 (/sys/devices/pci0000:00/0000:00:07.0/virtio3/host8/target8:0:0/8:0:0:0/block/sda/sda2)...
17:41:35,817 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:                        DeviceTree.get_device_by_uuid returned existing 14 GiB lvmvg fedora_localhost-live (100)
17:41:35,817 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:no LVs listed for VG fedora_localhost-live
17:41:35,818 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning fedora_localhost--live-swap00 (/sys/devices/virtual/block/dm-2)...

if we compare with my syslog from the success case:

17:34:42,902 INFO lvm:  pvscan[1449] PV /dev/sda2 online, VG fedora_localhost-live is complete.
17:34:42,902 INFO lvm:  pvscan[1449] VG fedora_localhost-live run autoactivation.
17:34:44,121 INFO lvm:  3 logical volume(s) in volume group "fedora_localhost-live" now active
17:34:44,133 INFO systemd:Finished LVM event activation on device 8:2.
17:35:31,695 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:DeviceTree.populate: ignored_disks is [] ; exclusive_disks is []
17:35:32,188 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning sda2 (/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sda/sda2)...
17:35:32,305 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:                        DeviceTree.get_device_by_uuid returned existing 464.76 GiB lvmvg fedora_localhost-live (33)
17:35:32,338 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:updated fedora_localhost-live-home size to 387.91 GiB (387.91 GiB)
17:35:32,339 WARNING org.fedoraproject.Anaconda.Modules.Storage:DEBUG:blivet:Adding fedora_localhost-live-home/387.91 GiB to fedora_localhost-live
17:35:32,339 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:added lvmlv fedora_localhost-live-home (id 37) to device tree
(yadda yadda)
17:35:32,357 WARNING org.fedoraproject.Anaconda.Modules.Storage:INFO:blivet:scanning fedora_localhost--live-home (/sys/devices/virtual/block/dm-3)...

You can see that in the failure case, VG autoactivation is happening *during the same second* blivet is busy scanning devices. It does happen very slightly before blivet gets to scanning sda2, but it's very close, and it happens 0.2 seconds *after* that "ignored_disks is [] ; exclusive_disks is []" log.

Why is that significant? Because there's a call to `lvs_info.drop_cache()` at just about exactly the same time as that log call (in blivet.populator.PopulatorMixin._populate()). That is the cache from which blivet.populator.helpers.lvm.LVMFormatPopulator._update_lvs() reads lv_info; if the cache holds no entry for the VG, the result is an early return with that "no LVs listed for VG" log message, which we *do* see in kparal's log but *don't* see in mine.

So it seems at least plausible that the cache is getting hit and re-populated in those next 0.2 seconds, before the activation of the VG is actually complete. At the very least, we can't rule it out.

I built an updates.img that adds a call to `lvs_info.drop_cache()` right before the population of `lv_info`, and it fixes the bug in a VM for me and kparal. Of course, simply *the fact of using an updates.img* might affect a race, so I'm not 100% sure that proves my point, but it does help a bit.
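For anyone following along, the lazy-cache pattern at issue can be sketched like this. This is a minimal stand-in, not blivet's actual code: `scan_fn` plays the role of `blockdev.lvm.lvs()`, and the LV names are made up. The point is that a cache populated before VG autoactivation finishes pins an empty result until something calls `drop_cache()`.

```python
class LVsInfo:
    """Hypothetical stand-in for blivet's LVs-info cache: populates
    itself on first read and never refreshes on its own."""

    def __init__(self, scan_fn):
        self._scan_fn = scan_fn   # stand-in for blockdev.lvm.lvs()
        self._lvs_cache = None

    @property
    def cache(self):
        # Lazily populate once; an empty scan result is still "populated".
        if self._lvs_cache is None:
            self._lvs_cache = self._scan_fn()
        return self._lvs_cache

    def drop_cache(self):
        # The fix under test: force the next read to re-scan.
        self._lvs_cache = None


# Simulate the race: the first scan runs before VG autoactivation
# completes, so it sees no LVs, and the cache pins that empty result.
activated = []
lvs_info = LVsInfo(lambda: list(activated))

assert lvs_info.cache == []                # scanned too early: no LVs seen
activated.extend(["root00", "home00"])     # autoactivation completes later
assert lvs_info.cache == []                # stale cache still says "no LVs"

lvs_info.drop_cache()                      # re-scan right before use
assert lvs_info.cache == ["root00", "home00"]
```

This matches the observed symptom: `_update_lvs()` reading a cache that was filled a fraction of a second too early sees no LVs for the VG, even though they exist by the time the read happens.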

Comment 21 Adam Williamson 2020-04-16 19:42:55 UTC
The updates.img for anyone else who wants to try it on a system that hits the bug is https://www.happyassassin.net/updates/1824418.1.img .

Comment 22 Adam Williamson 2020-04-16 20:12:01 UTC
OK, so https://www.happyassassin.net/updates/1824418.2.img is an updates.img that adds the cache drop *and* adds a log statement whenever the cache is re-populated. https://www.happyassassin.net/updates/1824418.logonly.img adds the log statement but not the extra cache drop.

Our VM-with-SCSI reproducer case seems to behave very much in line with my theory, with these images. With .2.img it works, and logs one population of the cache *just before* VG autoactivation (which is happening in upstream code somewhere) and one a little *after* (which is my added one). With .logonly.img it only logs the one population of the cache, just before VG autoactivation, and fails. That all adds up nicely.

kparal's bare metal reproducer, though, isn't lining up quite as nicely. When we boot it with an updates image - even just logonly.img - we see autoactivation ~20 seconds before device scanning, and it works. When we boot it without an updates image, we see autoactivation right around the same time as device scanning, and it fails.

This isn't inconsistent with my theory, but it doesn't prove it quite as nicely. It seems like perhaps on bare metal, just the sheer act of loading an updates.img either changes the timing or actually changes how the VG autoactivation is triggered somehow.

I'm currently looking into how this autoactivation stuff is actually set up; it comes from the lvm2 package. There's a systemd service, `lvm2-pvscan@.service`, which seems to be the thing that gets run; I think it's triggered via udev rules, but I'm not 100% sure yet.

Comment 23 Adam Williamson 2020-04-16 20:13:40 UTC
for clarity, here are the git diffs of the updates images:

===

.2.img:

diff --git a/blivet/populator/helpers/lvm.py b/blivet/populator/helpers/lvm.py
index 4b674fac..eb5a09ea 100644
--- a/blivet/populator/helpers/lvm.py
+++ b/blivet/populator/helpers/lvm.py
@@ -120,6 +120,7 @@ class LVMFormatPopulator(FormatPopulator):
             return
 
         vg_name = vg_device.name
+        lvs_info.drop_cache()
         lv_info = dict((k, v) for (k, v) in iter(lvs_info.cache.items())
                        if v.vg_name == vg_name)
 
diff --git a/blivet/static_data/lvm_info.py b/blivet/static_data/lvm_info.py
index 42261f9b..b45cb0a9 100644
--- a/blivet/static_data/lvm_info.py
+++ b/blivet/static_data/lvm_info.py
@@ -40,6 +40,7 @@ class LVsInfo(object):
     @property
     def cache(self):
         if self._lvs_cache is None:
+            log.info("XXX Populating LVsInfo cache!")
             try:
                 lvs = blockdev.lvm.lvs()
             except NotImplementedError:


===

.logonly.img:

diff --git a/blivet/static_data/lvm_info.py b/blivet/static_data/lvm_info.py
index 42261f9b..b45cb0a9 100644
--- a/blivet/static_data/lvm_info.py
+++ b/blivet/static_data/lvm_info.py
@@ -40,6 +40,7 @@ class LVsInfo(object):
     @property
     def cache(self):
         if self._lvs_cache is None:
+            log.info("XXX Populating LVsInfo cache!")
             try:
                 lvs = blockdev.lvm.lvs()
             except NotImplementedError:

===

Comment 24 Adam Williamson 2020-04-16 21:42:53 UTC
So, here's what I've figured out so far regarding the autoactivation.

The actual systemd service involved is lvm2-pvscan@.service. This is part of the lvm2 package. There is also a relevant udev rule, because we rely on event-based activation here: the udev rule is 69-dm-lvm-metad.rules, and it is also part of the lvm2 package. Neither the udev rule nor the service has changed in a relevant way since F31.

The way things basically work here is that the udev rule 'tells' systemd that the service should be activated for a device if that device is a PV with LVM metadata. The rule does all the scan-y stuff, then does this:

ENV{SYSTEMD_ALIAS}="/dev/block/$major:$minor"
ENV{SYSTEMD_WANTS}+="lvm2-pvscan@$major:$minor.service"

and systemd picks that up. If we look at logs with systemd.log_level=debug, it seems that systemd then actually triggers the service activation when it sees the device go to status 'plugged', like this:

Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2did-scsi\x2d0QEMU_QEMU_HARDDISK_drive\x2dscsi0\x2d0\x2d0\x2d0\x2dpart2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:00:07.0\x2dscsi\x2d0:0:0:0\x2dpart2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2dpartuuid-b4ba01b2\x2d02.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-disk-by\x2did-lvm\x2dpv\x2duuid\x2dqiccbW\x2dG2Q4\x2dBzZJ\x2dTcRU\x2dcS5M\x2drFS2\x2dZd9TDR.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: dev-sda2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: sys-devices-pci0000:00-0000:00:07.0-virtio3-host2-target2:0:0-2:0:0:0-block-sda-sda2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:58 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.
Apr 16 21:16:58 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@8:2.service/start
Apr 16 21:16:58 localhost systemd[1]: Added job system-lvm2\x2dpvscan.slice/start to transaction.

So, it seems like when the bug happens, that "dead -> plugged" state change is happening *late* for some reason. It actually seems like the blivet device scan *itself* triggers that state change and thus causes the device to be activated. When the bug doesn't happen, this "dead -> plugged" state change happens much earlier, and so the device is activated much earlier. It seems like a whole block of interesting stuff happens in F31 but not in F32. Here's F31:

[adamw@adam system]$ grep dev-block-8:2.device /share/data/31systemddebug.txt
Apr 16 21:22:05 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:22:17 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Keeping job dev-block-8:2.device/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Installed new job dev-block-8:2.device/start as 303
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged

after that, lvm2-pvscan@8:2.service fires and activates the LV.

Here's F32:

[adamw@adam system]$ grep dev-block-8:2.device /share/data/32systemddebug.txt 
Apr 16 21:16:29 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:16:58 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.

note how in F31 we pull in dev-block-8:2.device/start *before* it goes from dead -> plugged, and it's only 12 seconds after the 'Collecting' message (which happens very early in the boot process), long before blivet is scanning devices...whereas in F32 the device changes to plugged and *then* we pull in a 'start' job for it, and this happens *29* seconds after the "Collecting" message (during blivet device scan).

I'm attaching the full journal dumps with systemd debug logging enabled from F31 (success) and F32 (failure) cases.
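As a side note for reading the unit names in these logs: the `$major:$minor` in the udev rule expands to the block device's device numbers, which is why the PV `/dev/sda2` (device number 8:2) shows up as `lvm2-pvscan@8:2.service` and `dev-block-8:2.device`. A tiny hypothetical helper (not part of lvm2) showing the mapping:

```python
import os

def pvscan_unit(major, minor):
    """Build the template instance name the way the udev rule does:
    ENV{SYSTEMD_WANTS}+="lvm2-pvscan@$major:$minor.service"."""
    return f"lvm2-pvscan@{major}:{minor}.service"

def pvscan_unit_for(dev_path):
    """Look up a block device's major:minor via stat(2), then build the name."""
    st = os.stat(dev_path)
    return pvscan_unit(os.major(st.st_rdev), os.minor(st.st_rdev))

# /dev/sda2 conventionally has major 8, minor 2:
print(pvscan_unit(8, 2))  # lvm2-pvscan@8:2.service
```

So the 8:2 instances throughout these journals are all the same `/dev/sda2` PV, and the virtio case uses 252:2 because virtio-blk devices get a different major number.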

Comment 25 Adam Williamson 2020-04-16 21:43:49 UTC
Created attachment 1679496 [details]
journal output from F31 (success) case with systemd.log_level=debug

Comment 26 Adam Williamson 2020-04-16 21:44:18 UTC
Created attachment 1679498 [details]
journal output from F32 (failure) case with systemd.log_level=debug

Comment 27 Adam Williamson 2020-04-16 22:10:11 UTC
The F32 virtio case - which is also a success case - looks much like the F31 case. Here's better filtered logs from each case, with both important services shown:

F31 VM SCSI (success)
=====================

Apr 16 21:22:01 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:22:01 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:22:02 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:22:05 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:22:05 localhost systemd[1]: lvm2-pvscan@8:2.service: Collecting.
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:22:17 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Pulling in dev-block-8:2.device/start from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Added job dev-block-8:2.device/start to transaction.
Apr 16 21:22:17 localhost systemd[1]: Pulling in shutdown.target/stop from lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Keeping job dev-block-8:2.device/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: Keeping job system-lvm2\x2dpvscan.slice/start because of lvm2-pvscan@8:2.service/start
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Installed new job dev-block-8:2.device/start as 303
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Installed new job lvm2-pvscan@8:2.service/start as 298
Apr 16 21:22:17 localhost systemd[1]: lvm2-pvscan@8:2.service: Enqueued job lvm2-pvscan@8:2.service/start as 298
Apr 16 21:22:17 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged

F32 VM SCSI (failure)
=====================

Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:24 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:25 localhost systemd[1]: lvm2-pvscan@8:2.service: Failed to load configuration: No such file or directory
Apr 16 21:16:29 localhost systemd[1]: dev-block-8:2.device: Collecting.
Apr 16 21:16:29 localhost systemd[1]: lvm2-pvscan@8:2.service: Collecting.
Apr 16 21:16:58 localhost systemd[1]: dev-block-8:2.device: Changed dead -> plugged
Apr 16 21:16:58 localhost systemd[1]: lvm2-pvscan@8:2.service: Trying to enqueue job lvm2-pvscan@8:2.service/start/fail
Apr 16 21:16:58 localhost systemd[1]: Added job lvm2-pvscan@8:2.service/start to transaction.

F32 VM virtio (success)
=======================

Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Trying to enqueue job lvm2-pvscan@252:2.service/start/fail
Apr 16 21:54:47 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:48 localhost systemd[1]: lvm2-pvscan@252:2.service: Failed to load configuration: No such file or directory
Apr 16 21:54:51 localhost systemd[1]: dev-block-252:2.device: Collecting.
Apr 16 21:54:51 localhost systemd[1]: lvm2-pvscan@252:2.service: Collecting.
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Trying to enqueue job lvm2-pvscan@252:2.service/start/fail
Apr 16 21:55:03 localhost systemd[1]: Added job lvm2-pvscan@252:2.service/start to transaction.
Apr 16 21:55:03 localhost systemd[1]: Pulling in system-lvm2\x2dpvscan.slice/start from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Pulling in dev-block-252:2.device/start from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Added job dev-block-252:2.device/start to transaction.
Apr 16 21:55:03 localhost systemd[1]: Pulling in shutdown.target/stop from lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Keeping job dev-block-252:2.device/start because of lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: Keeping job system-lvm2\x2dpvscan.slice/start because of lvm2-pvscan@252:2.service/start
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Installed new job lvm2-pvscan@252:2.service/start as 298
Apr 16 21:55:03 localhost systemd[1]: dev-block-252:2.device: Installed new job dev-block-252:2.device/start as 303
Apr 16 21:55:03 localhost systemd[1]: lvm2-pvscan@252:2.service: Enqueued job lvm2-pvscan@252:2.service/start as 298
Apr 16 21:55:03 localhost systemd[1]: dev-block-252:2.device: Changed dead -> plugged

I have not yet figured out which part of systemd (or whatever else) is *sometimes* enqueueing the job this early and sometimes not, or why. Still working on that.

Oh, note the very early failures to load the service (that show in all cases) are, I think, from the dracut environment. The service is not present in the dracut environment, hence the failure.

Comment 28 Adam Williamson 2020-04-16 22:41:32 UTC
Adding zbyszek for a systemd angle on this. Zbigniew, any idea what's going on from a systemd perspective here?

To reproduce this, you just need a qemu VM and a disk image with a stock Fedora install on it (we've tested with 32, but it probably affects others too). The test procedure is to attach the disk image to the VM, boot from a Fedora installer image, and at the boot menu go to 'Troubleshooting', then 'Rescue a Fedora system'. Wait for the prompt and press 1. If it tells you it mounted the installed system at /mnt/sysroot, the bug didn't happen. If it tells you that you don't have any existing Fedora partitions, the bug did happen. The bug seems to relate to the timing of LV activation; see the investigation in the comments above for more detail.

So far, we've found the following combinations seem to be repeatable:

    | Disk attached as SCSI | Disk attached as VirtIO
----+-----------------------+-------------------------
F31 |         PASS          |          PASS
F32 |         FAIL          |          PASS

We also have some bare metal systems that we think are running into the same issue. Here are installer images to test with:

F32 - https://dl.fedoraproject.org/pub/alt/stage/32_RC-1.3/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-32-1.3.iso
F31 - https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/x86_64/iso/Fedora-Everything-netinst-x86_64-31-1.9.iso

Comment 29 Adam Williamson 2020-04-16 22:42:14 UTC
also CCing mcsontos for LVM perspective.

Comment 30 Adam Williamson 2020-04-16 23:09:06 UTC
I find myself intrigued by this bit of systemd:

https://github.com/systemd/systemd/blob/master/src/core/device.c#L422

because the precise log message we're getting here in the success case - "Trying to enqueue job lvm2-pvscan@8:2.service/start/fail" - indicates that manager_add_job is being called with JobType start and JobMode fail. And that's exactly what that code does (via manager_add_job_by_name, which just gets from a unit name to a Unit instance):

                        r = manager_add_job_by_name(u->manager, JOB_START, *i, JOB_FAIL, NULL, &error, NULL);

and the comment on this block sounds a lot like something we might be hitting here:

/* So here's a special hack, to compensate for the fact that the udev database's reload cycles are not
 * synchronized with our own reload cycles: when we detect that the SYSTEMD_WANTS property of a device
 * changes while the device unit is already up, let's manually trigger any new units listed in it not
 * seen before. This typically happens during the boot-time switch root transition, as udev devices
 * will generally already be up in the initrd, but SYSTEMD_WANTS properties get then added through udev
 * rules only available on the host system, and thus only when the initial udev coldplug trigger runs.
 *
 * We do this only if the device has been up already when we parse this, as otherwise the usual
 * dependency logic that is run from the dead → plugged transition will trigger these deps. */

There's only one other place I can see offhand where this happens, and it's a less obvious candidate for being where we land on the success path here: it's in `retroactively_start_dependencies` in unit.c.

Comment 31 Marian Csontos 2020-04-17 09:38:34 UTC
Does this happen only in rescue mode, while the installed system boots normally? Perhaps something is missing in the initrd... I am not sure I will be able to dive deeper into this today.

Comment 32 Lukas Ruzicka 2020-04-17 10:24:06 UTC
(In reply to Marian Csontos from comment #31)
> Does this happen only in rescue mode, while the installed system boots
> normally? Perhaps something is missing in the initrd... I am not sure I
> will be able to dive deeper into this today.

Yes, this only happens in the rescue mode.

Comment 33 Kamil Páral 2020-04-17 12:02:42 UTC
(In reply to Lukas Ruzicka from comment #32)
> Yes, this only happens in the rescue mode.

To be more precise, in anaconda netinst/dvd, for both rescue mode and custom/blivet partitioning during installation. The installed system always boots fine, yes.

Comment 34 Adam Williamson 2020-04-17 18:51:41 UTC
Comparing with "the installed system" isn't really apples-to-apples at all, though. They're very different scenarios.

The VG in question contains the actual root filesystem of "the installed system", after all. So it'll be listed in fstab, and we may well build something into the initramfs to activate and mount it (in fact we kinda have to). This is all very different from the installer environment which uses a generic initramfs and doesn't actually need this VG to be activated in order to boot.

So I'm not sure that comparison is telling us anything as the scenarios are just too different. I guess a slightly more accurate comparison might be if you did a non-LVM install then attached the disk image with an LVM install on it as a secondary disk and booted and saw whether they got autoactivated or not?

Comment 35 Chris Murphy 2020-04-17 21:57:09 UTC
The "installed system" has LVM hinting for dracut as a boot parameter (in the grub.cfg): rd.lvm.lv=fedora/swap rd.lvm.lv=fedora/root

The installer media's initramfs detects that it's an installer boot, and uses rd.lvm=0 rd.luks=0 so the detection and activation of LVM must happen later.

Not working (f32 netinstaller)
[   44.197693] localhost-live lvm[1817]:   3 logical volume(s) in volume group "fedora_localhost-live" now active

## Ostensibly the LVs are active at some point.

But I can't tell whether the bug is that blivet just doesn't recognize them for some reason, or whether they vanish.


Non-working case from the f32 netinstaller:

[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                 DeviceTree.handle_device: name: fedora_localhost--live-root ; info: {'DEVLINKS': '/dev/fedora_localhost-live/root '
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:              '/dev/disk/by-uuid/74d31899-63b2-45de-a124-b2ed6a6ba479 '
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:              '/dev/disk/by-id/dm-name-fedora_localhost--live-root '
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:              '/dev/disk/by-id/dm-uuid-LVM-BHwM8lvqSrbQY99BFHk6hZwjayvzoACdaxzcjZhrmtUBUCM6m0tmtUBr5xWWe1hh '
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:              '/dev/mapper/fedora_localhost--live-root',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DEVNAME': '/dev/dm-4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DEVPATH': '/devices/virtual/block/dm-4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DEVTYPE': 'disk',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_ACTIVATION': '1',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_LV_NAME': 'root',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_NAME': 'fedora_localhost--live-root',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_SUSPENDED': '0',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_UDEV_DISABLE_LIBRARY_FALLBACK_FLAG': '1',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_UDEV_PRIMARY_SOURCE_FLAG': '1',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_UDEV_RULES_VSN': '2',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_UUID': 'LVM-BHwM8lvqSrbQY99BFHk6hZwjayvzoACdaxzcjZhrmtUBUCM6m0tmtUBr5xWWe1hh',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'DM_VG_NAME': 'fedora_localhost-live',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'ID_FS_TYPE': 'ext4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'ID_FS_USAGE': 'filesystem',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'ID_FS_UUID': '74d31899-63b2-45de-a124-b2ed6a6ba479',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'ID_FS_UUID_ENC': '74d31899-63b2-45de-a124-b2ed6a6ba479',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'ID_FS_VERSION': '1.0',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'MAJOR': '253',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'MINOR': '4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'SUBSYSTEM': 'block',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'SYS_NAME': 'dm-4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'SYS_PATH': '/sys/devices/virtual/block/dm-4',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'TAGS': ':systemd:',
[   44.723497] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]:  'USEC_INITIALIZED': '44139414'} ;
[   44.724075] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:blivet:scanning fedora_localhost--live-root (/sys/devices/virtual/block/dm-4)...
[   44.724734] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                   DeviceTree.get_device_by_name: name: fedora_localhost--live-root ; incomplete: False ; hidden: False ;
[   44.725590] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                   DeviceTree.get_device_by_name returned None
[   44.725801] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [32] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[   44.730403] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[32]: LVM
[   44.730462] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[32]:
[   44.730525] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [32] (exit code: 0)
[   44.730667] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [33] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[   44.734597] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[33]: LVM
[   44.734684] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[33]:
[   44.734738] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [33] (exit code: 0)
[   44.734895] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:Running [34] dmsetup info -co subsystem --noheadings fedora_localhost--live-root ...
[   44.738863] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stdout[34]: LVM
[   44.738945] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:stderr[34]:
[   44.738994] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: INFO:program:...done [34] (exit code: 0)
[   44.740626] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                  LVMDevicePopulator.run: name: fedora_localhost--live-root ;
[   44.742015] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name: name: fedora_localhost-live ; incomplete: False ; hidden: True ;
[   44.744598] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name returned existing 248.41 GiB lvmvg fedora_localhost-live (108)
[   44.747855] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                      DeviceTree.get_device_by_name: name: sda3 ; incomplete: False ; hidden: False ;
[   44.749122] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                      DeviceTree.get_device_by_name returned existing 248.41 GiB partition sda3 (99) with existing lvmpv
[   44.750228] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name: name: fedora_localhost-live ; incomplete: False ; hidden: False ;
[   44.751675] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name returned existing 248.41 GiB lvmvg fedora_localhost-live (108)
[   44.753096] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name: name: fedora_localhost--live-root ; incomplete: False ; hidden: False ;
[   44.754094] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:                    DeviceTree.get_device_by_name returned None
[   44.754148] localhost-live org.fedoraproject.Anaconda.Modules.Storage[1660]: DEBUG:blivet:no device obtained for fedora_localhost--live-root


Questions:
Why is 'dmsetup info' run three times? That doesn't happen with the f31 netinstaller. I can't tell whether the LV is already inactive at this point; this dmsetup info invocation only prints the subsystem column, so the state is filtered out. The command tells us the LV is definitely seen, but is it active? *shrug* Not sure.

The last line "no device obtained" is also not present with the f31 netinstaller.

Comment 36 Adam Williamson 2020-04-17 22:53:46 UTC
Chris: I have already investigated all that stuff (except 'dmsetup info'). It is all covered in my notes. I explained in detail the timing problem with the activation when the bug happens. When the bug happens, the activation is triggered during *blivet's device scan itself*, and happens *just* (~0.2s) after blivet caches LV info retrieved via libblockdev - so it caches that there are no active LVs, *then* activates them, then when it reaches a partition that's part of a VG it starts scanning the VG but then gives up because of the bad cached data.

When the bug does *not* happen, LV activation is triggered during system startup, 10-15 seconds *before* the blivet device scan runs. So the LVs are active when blivet caches the LV info, and everything goes fine.
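The ordering race described in the last two comments can be shown with a small, self-contained Python sketch. This is purely illustrative (it is not blivet code, and the LV name is just an example): a scanner that snapshots LV state once, up front, is blind to any activation that fires after the snapshot.

```python
def scan_sees_lv(activation_happens_before_scan):
    """Simulate whether a one-shot LV-info snapshot sees the LV as active."""
    active_lvs = set()  # stand-in for live kernel/LVM state
    if activation_happens_before_scan:
        # pvscan autoactivation ran during early boot (F31 / virtio case)
        active_lvs.add("fedora_localhost-live/root")

    cache = set(active_lvs)  # blivet caches LV info once, at scan start

    # pvscan fires ~0.2s later, mid-scan (F32 SCSI case); the live state
    # changes, but the snapshot the scan consults does not:
    active_lvs.add("fedora_localhost-live/root")
    return "fedora_localhost-live/root" in cache

print(scan_sees_lv(True))   # True  (activation before scan: everything fine)
print(scan_sees_lv(False))  # False (activation mid-scan: scan "gives up")
```

The only difference between the two calls is which side of the snapshot the activation lands on, which matches the ~10-15 second vs. ~0.2 second timing difference observed above.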

Comment 37 Chris Murphy 2020-04-18 21:32:57 UTC
f32 workstation live
[    9.722496] localhost systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
[    9.751451] localhost lvm[1028]:   pvscan[1028] PV /dev/sda3 online, VG fedora_localhost-live is complete.
[    9.751451] localhost lvm[1028]:   pvscan[1028] VG fedora_localhost-live run autoactivation.
[    9.940521] localhost lvm[1028]:   3 logical volume(s) in volume group "fedora_localhost-live" now active

f32 netinstall

[   30.233570] localhost-live systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
[   30.260823] localhost-live lvm[1881]:   pvscan[1881] PV /dev/sda3 online, VG fedora_localhost-live is complete.
[   30.261107] localhost-live lvm[1881]:   pvscan[1881] VG fedora_localhost-live run autoactivation.
[   30.325830] localhost-live lvm[1881]:   3 logical volume(s) in volume group "fedora_localhost-live" now active

Why is pvscan so late on the netinstall?

Comment 38 Adam Williamson 2020-04-18 22:08:18 UTC
That's *still* the stuff I already wrote about; look at comment #27. You are seeing the lvm2-pvscan@ service I've been writing about for two days.

Comment 39 Chris Murphy 2020-04-18 22:23:45 UTC
How do I infer that pvscan is late in one case and not another when c27 lacks monotonic times?

From c34: >I guess a slightly more accurate comparison might be if you did a non-LVM install then attached the disk image with an LVM install on it as a secondary disk and booted and saw whether they got autoactivated or not?

That's achieved functionally, and in a more standard way that doesn't change the controller interface type, by using the Workstation live image as the comparison against the netinstall, which is why I did it. Still unanswered is why pvscan happens so late with the netinstall.

Comment 40 Adam Williamson 2020-04-18 22:41:06 UTC
"How do I infer that pvscan is late in one case and not another when c27 lacks monotonic times?"

Well, from the fact that I wrote about a dozen lines about how late it is; that was the entire point of the comment. (And also from the fact that it's 12 seconds after 'Collecting' in the early case, but 29 seconds after 'Collecting' in the late case.)

Comment 41 Markus Teuber 2020-04-18 23:05:05 UTC
(In reply to Kamil Páral from comment #19)
> Created attachment 1679476 [details]
> broken custom partitioning
> 
> This bug also affects custom partitioning on affected devices. Existing LVM
> partitions can't be used, see screenshot.

If you press the "reload" button on this screen and then retry custom partitioning, it works as expected.

Comment 42 Markus Teuber 2020-04-18 23:31:06 UTC
If you append the kernel command line parameter rd.lvm=1 to the entry "Rescue a Fedora system" (I used Fedora-Server-dvd-x86_64-32-1.3.iso), then it works on the first try.
The same works when appending rd.lvm=1 to the "install ..." entry.

Comment 43 Kamil Páral 2020-04-20 10:46:47 UTC
(In reply to Markus Teuber from comment #42)
> If you append the kernel command line parameter rd.lvm=1 to the entry
> "Rescue a Fedora system" (I used Fedora-Server-dvd-x86_64-32-1.3.iso),
> then it works on the first try.

This works for me in my VM, but doesn't work on my bare metal laptop (3 attempts, none successful). There's still some race condition or a problem somewhere, even with that argument.

Comment 44 Adam Williamson 2020-04-20 15:21:15 UTC
BTW, another question that occurred to me over the weekend is: should blivet actually be relying on the activation being triggered by systemd/udev/lvm2 anyway? After all, I thought the idea was kinda that it wanted to *inhibit* the rest of the OS from activating storage devices and stuff, and do all that work itself - that's why it sets `rd.lvm=0` (and other args) in the first place.

So I suppose even if *currently* in the cases where this works it seems to work because of the earlier activation, making that trigger more often might not be the "right" fix here. We might be better off just assuming that the LVs ought to be activated by blivet itself, and looking for ordering bugs there - my debugging seems to suggest that at least in the VM case, blivet caches the LV info just barely before it somehow triggers activation of the VGs, so if we presume that blivet should be able to successfully activate and then scan VGs *itself*, it seems it may have an internal ordering issue.

Comment 45 Vojtech Trefny 2020-04-20 17:43:54 UTC
It looks like the issue is with partitions not being recognized during the initial blivet scan.

This is taken from journal from the "logs from broken rescue mode" attachment, but I've seen the same issue on my VM.

Our first scan:
INFO:blivet:devices to scan: ['sda', 'sr0', 'loop0', 'loop1', 'loop2', 'live-rw', 'live-base']

During that scan we use parted to gather information about the partition table, which causes the kernel to re-read it, and we get uevents for the partitions:
Apr 16 17:41:35 localhost-live.default kernel:  sda: sda1 sda2

After that the pvscan service is autostarted:
Apr 16 17:41:35 localhost-live.default systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Apr 16 17:41:35 localhost-live.default systemd[1]: Starting LVM event activation on device 8:2...

But it's already too late because we store information about LVs and PVs during the initial scan. Invalidating the cache and re-scanning the system later might be a solution, but there is still a lot of space for another race condition.

Comment 46 Fedora Update System 2020-04-20 22:15:45 UTC
FEDORA-2020-8f552bb1ee has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-8f552bb1ee

Comment 47 Adam Williamson 2020-04-20 22:24:53 UTC
So to hopefully fix this well enough for the F32 release, we bodged in dropping the LVM info caches and adding a 2-second sleep (urgh) to each iteration of the device discovery loop. This works for at least the VM SCSI reproducer and Kamil's hardware reproducer. Once the new RC is built I'll ask here for other folks who could reproduce this to try with it.

This is not really a proper fix, though. There are a few avenues we're exploring in terms of a cleaner solution.
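The shape of that workaround can be sketched as follows. This is not the actual blivet patch: `scan_devices` and `drop_lvm_cache` are hypothetical stand-ins, and the real bodge adds a fixed 2-second sleep per iteration rather than retrying until devices appear.

```python
import time

def discover(scan_devices, drop_lvm_cache, max_passes=5, delay=0.01):
    """Retry the device scan, invalidating LVM caches between passes so
    that an LV autoactivated mid-scan is picked up on a later pass."""
    devices = []
    for _ in range(max_passes):
        drop_lvm_cache()          # throw away possibly-stale LV/PV info
        time.sleep(delay)         # give late pvscan/udev events time to land
        devices = scan_devices()  # re-scan with fresh state
        if devices:
            break
    return devices

# Fake scanner: the LV only becomes visible on the third pass,
# mimicking an autoactivation that fires while scanning is underway.
calls = {"n": 0}
def fake_scan():
    calls["n"] += 1
    return ["fedora/root"] if calls["n"] >= 3 else []

print(discover(fake_scan, lambda: None))  # -> ['fedora/root']
```

As Vojtech notes in comment 45, this invalidate-and-rescan approach still leaves room for further race conditions, which is why it's a stopgap rather than a proper fix.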

Comment 48 František Zatloukal 2020-04-21 09:09:57 UTC
FEDORA-2020-8f552bb1ee fixes both Rescue Mode and Advanced (Blivet-gui) partitioning on bare metal (tested via RC 1.4). I'll go ahead and mark it as verified.

Comment 49 Kamil Páral 2020-04-21 11:12:30 UTC
With Fedora-Server-dvd-x86_64-32-1.4.iso, on both bare metal and VM, I can now successfully rescue a system or see the LVM partitions in the installer advanced partitioning.

Comment 50 Markus Teuber 2020-04-21 17:00:11 UTC
works for me too. Thanks a lot.

Comment 51 Fedora Update System 2020-04-21 18:43:19 UTC
FEDORA-2020-8f552bb1ee has been pushed to the Fedora 32 testing repository.
In short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-8f552bb1ee`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-8f552bb1ee

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 52 Kamil Páral 2020-04-22 08:49:04 UTC
Works fine also with RC 1.5, bare metal and VM.

Comment 53 Vojtech Trefny 2020-04-22 11:26:22 UTC
Created attachment 1680834 [details]
journal without partitions

Comment 54 Vojtech Trefny 2020-04-22 11:27:34 UTC
Created attachment 1680835 [details]
storage.log without partitions

Comment 55 Vojtech Trefny 2020-04-22 11:36:53 UTC
Bad news. We have another failed installation with the fix included but this time there are no partitions discovered during the blivet scan and installation fails when trying to wipe the disks. For some reason this time our scan did not trigger the uevent and kernel found the partitions only after I manually ran parted and fdisk (timestamp 11:04:15,567).

Comment 56 Vojtech Trefny 2020-04-22 13:07:25 UTC
(In reply to Vojtech Trefny from comment #55)
> Bad news. We have another failed installation with the fix included but this
> time there are no partitions discovered during the blivet scan and
> installation fails when trying to wipe the disks. For some reason this time
> our scan did not trigger the uevent and kernel found the partitions only
> after I manually ran parted and fdisk (timestamp 11:04:15,567).

I've tried this 10 more times on the same VM with RC 1.5 and I wasn't able to reproduce this behaviour. The logs are from rawhide and I can reproduce this there but I am 99 % sure this happened also on F32. Unfortunately I can't reproduce it now so it's possible it never happened. Sorry.

Comment 57 Kamil Páral 2020-04-22 13:48:45 UTC
If anyone has spare cycles, please test this.

Comment 58 Markus Teuber 2020-04-22 14:55:18 UTC
I tested this again with RC 1.5 and it worked fine for me.

Comment 59 František Zatloukal 2020-04-22 14:58:39 UTC
RC 1.5 worked just fine on my two desktops (which were broken with 1.3), both Rescue mode and blivet-gui partitioning.

Comment 60 Adam Williamson 2020-04-23 17:06:00 UTC
I can totally believe there are still occasional transient problems in this area, but I think the reproducible bugs here are clearly fixed according to feedback, setting back to VERIFIED.

Comment 61 Fedora Update System 2020-04-23 18:02:40 UTC
FEDORA-2020-8f552bb1ee has been pushed to the Fedora 32 stable repository.
If problem still persists, please make note of it in this bug report.

