Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1701600

Summary: Timed out waiting for Device/LV during boot
Product: Red Hat Enterprise Linux 7
Reporter: jhouston
Component: lvm2
Assignee: Peter Rajnoha <prajnoha>
lvm2 sub component: Activating existing Logical Volumes
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED WORKSFORME
Docs Contact:
Severity: high    
Priority: urgent
CC: agk, bhull, bubrown, cmarthal, heinzm, jbrassow, jpittman, kwalker, loberman, lvm-team, msekleta, msnitzer, nweddle, pdwyer, prajnoha, revers, rhandlin, rleese, rmadhuso, rmetrich, teigland, tonay, zkabelac
Version: 7.6   
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1703202 (view as bug list)
Environment:
Last Closed: 2019-09-13 12:46:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1703202, 1711360    
Attachments:
Description Flags
Sos report with plymouth.enable=0 none

Description jhouston 2019-04-19 20:00:09 UTC
Description of problem:
After upgrading to RHEL 7.6, the system sees device timeouts for several OS LVs during boot and drops into emergency mode.

Version-Release number of selected component (if applicable):
3.10.0-957.el7.x86_64
lvm2-2.02.180-8.el7.x86_64                                

How reproducible:
Every Boot

Steps to Reproduce:
N/A

Actual results:
System drops into emergency mode because of timeouts waiting for /dev/mapper/* device units.

Expected results:
The system should activate all LVs and have the filesystems available in time for the fstab mounts, without dropping into emergency mode.

Additional info:
1. Strict filter is set:

  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
  PV         VG  Fmt  Attr PSize   PFree    DevSize PV UUID                                PMdaFree  PMdaSize  #PMda #PMdaUse 1st PE 
  /dev/sda1           ---       0        0  200.00m                                               0         0      0        0      0 
  /dev/sda2           ---       0        0    1.00g                                               0         0      0        0      0 
  /dev/sda3  vg1 lvm2 a--  837.13g <485.55g 837.13g X0iKFy-pLa0-hAcH-1TK0-rfqJ-18bc-84AJlj   506.00k  1020.00k     1        1   1.00m

$ grep filter etc/lvm/lvm.conf

	filter = ["a|/dev/sda|","r|.*|"]

lvmetad is also disabled:
$ grep lvmetad etc/lvm/lvm.conf
	use_lvmetad = 0

* lvm2-lvmetad.socket
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)

2. All partition types/filesystem types appear to be expected/Normal:

/dev/sda1: SEC_TYPE="msdos" UUID="DACC-3A20" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="62a6a371-4d51-4464-b451-8bbc6e238591" 
/dev/sda2: UUID="35ced8a0-608d-487e-ae15-3d219e82563f" TYPE="xfs" PARTUUID="76a39479-c23b-491c-a75e-80e66ba98d73" 
/dev/sda3: UUID="X0iKFy-pLa0-hAcH-1TK0-rfqJ-18bc-84AJlj" TYPE="LVM2_member" PARTUUID="a137f84e-3dbc-440f-b3e3-38c6fdbbcff4" 

Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 1758109232s
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start     End          Size         File system  Name                  Flags
 1      2048s     411647s      409600s      fat16        EFI System Partition  boot
 2      411648s   2508799s     2097152s     xfs
 3      2508800s  1758107647s  1755598848s                                     lvm

3. Customer attempted appending rd.lvm.vg=<vgname> to the kernel parameters as a workaround, but the timeouts persist.

4. Attempted x-systemd.device-timeout=300 in fstab for all problem filesystems; the timeouts persist:
<snippet>
/dev/mapper/vg1-lv_root /                       xfs     defaults,x-systemd.device-timeout=300        0 0
UUID=35ced8a0-608d-487e-ae15-3d219e82563f /boot                   xfs     defaults,x-systemd.device-timeout=300,nodev,nosuid        0 0
UUID=DACC-3A20          /boot/efi               vfat    defaults,nodev,nosuid,uid=0,gid=0,umask=0077,shortname=winnt 0 0
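systemd applies x-systemd.device-timeout= as a job timeout on the generated .device unit, so entries without the option still fall back to the default. As an illustrative sketch (the option strings below are copied from the fstab snippet; the helper function is mine, not part of the sosreport), the mount-option field can be split to confirm which entries actually carry a timeout:

```shell
# Extract the x-systemd.device-timeout value from an fstab option string (sketch).
get_timeout() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^x-systemd\.device-timeout=//p'
}

root_opts='defaults,x-systemd.device-timeout=300'
efi_opts='defaults,nodev,nosuid,uid=0,gid=0,umask=0077,shortname=winnt'

root_timeout=$(get_timeout "$root_opts")   # "300"
efi_timeout=$(get_timeout "$efi_opts")     # empty -> systemd default applies
echo "root: ${root_timeout:-default}, efi: ${efi_timeout:-default}"
```

On a live system, `systemctl show <unit> -p JobTimeoutUSec` on the escaped .device unit would confirm whether systemd picked the option up.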

5. A single local device is used as a PV. The customer can either wait a few moments after dropping into emergency mode and press Ctrl+D, after which the system boots without issue, or manually isolate to multi-user after entering the root password. All LVs/VGs activate and mount without issue as soon as emergency mode is entered.

<snippet of timeouts>
-- Subject: Unit dev-disk-by\x2duuid-DACC\x2d3A20.device has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit dev-disk-by\x2duuid-DACC\x2d3A20.device has failed.
-- 
-- The result is timeout.
Apr 18 15:53:46 rhel7-3-ssg systemd[1]: Dependency failed for /boot/efi.
-- Subject: Unit boot-efi.mount has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit boot-efi.mount has failed.
-- Unit dev-mapper-vg1\x2dlv_swap.device has failed.
-- 
-- The result is timeout.
Apr 18 15:53:46 rhel7-3-ssg systemd[1]: Dependency failed for /dev/mapper/vg1-lv_swap.
-- Subject: Unit dev-mapper-vg1\x2dlv_swap.swap has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit dev-mapper-vg1\x2dlv_swap.swap has failed.
-- 
-- The result is dependency.
Apr 18 15:53:46 rhel7-3-ssg systemd[1]: Dependency failed for Swap.
-- Subject: Unit swap.target has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit swap.target has failed.

Comment 8 jhouston 2019-05-17 16:42:43 UTC
Created attachment 1570173 [details]
Sos report with plymouth.enable=0

Comment 10 David Teigland 2019-05-22 21:04:33 UTC
There are a couple of things in /var/log/messages I've not seen before which I'd like to understand:

Why does audit complain about the lvm2-pvscan service?

May 15 10:44:34 rhel7-3-ssg kernel: type=1130 audit(1557931474.699:61): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=lvm2-pvscan@8:3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success

The timing of messages indicates that iTCO could be stalling the boot by about 40 seconds.  I suggest disabling iTCO somehow to see if that fixes the problem.

May 15 10:45:05 rhel7-3-ssg kernel: iTCO_vendor_support: vendor-support=0
May 15 10:45:06 rhel7-3-ssg kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
May 15 10:45:06 rhel7-3-ssg kernel: iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
May 15 10:45:45 rhel7-3-ssg systemd: Job dev-mapper-vg1\x2dlv_swap.device/start timed out.
May 15 10:45:45 rhel7-3-ssg systemd: Timed out waiting for device dev-mapper-vg1\x2dlv_swap.device.
May 15 10:45:46 rhel7-3-ssg systemd: Dependency failed for /dev/mapper/vg1-lv_swap.
May 15 10:45:46 rhel7-3-ssg systemd: Dependency failed for Swap.
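One common way to rule the watchdog out (my suggestion; the comment above only says to disable iTCO "somehow") is to blacklist the modules via a modprobe.d snippet, then rebuild the initramfs with `dracut -f` and reboot:

```shell
# Hypothetical /etc/modprobe.d/disable-itco.conf -- filename is illustrative
blacklist iTCO_wdt
blacklist iTCO_vendor_support
```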

It's not just lvm that's failing and timing out, it seems to be everything:

May 15 10:47:00 rhel7-3-ssg systemd: systemd-udev-settle.service: main process exited, code=exited, status=1/FAILURE
May 15 10:47:00 rhel7-3-ssg systemd: Failed to start udev Wait for Complete Device Initialization.
May 15 10:47:02 rhel7-3-ssg systemd: systemd-udev-settle.service failed.

My impression here is that lvm is a victim of whatever is wrong with this machine.

Comment 11 David Teigland 2019-06-05 14:38:43 UTC
Is there any more info to lead us to believe there is an lvm problem here?  Has the iTCO watchdog been disabled or fixed?

Comment 23 Peter Rajnoha 2019-07-08 10:45:11 UTC
Looking a bit more at this, specifically at the reason for the:

   Jun 18 15:49:03 rhel7-3-ssg systemd[1]: Job dev-mapper-vg1\x2dlv_swap.device/start timed out

The LV with swap is activated correctly in initramfs and the /dev/mapper/vg1-lv_swap symlink is created correctly there, from the udev log (in initramfs):

  Jun 18 15:47:30 rhel7-3-ssg systemd-udevd[1006]: creating symlink '/dev/mapper/vg1-lv_swap' to '../dm-1'

Then the udev database is cleaned up and the system switches to the root fs...

Then udevadm trigger is called to reinitialize the udev db content, including already activated devices (like /dev/mapper/vg1-lv_swap). Here, I'm finding:

  Jun 18 15:49:35 rhel7-3-ssg systemd[1]: systemd-udev-settle.service failed.


So, looking into why this one failed: the systemd-udev-trigger.service status (sos_commands/systemd/systemctl_status_--all) looks OK:


  * systemd-udev-trigger.service - udev Coldplug all Devices                
     Loaded: loaded (/usr/lib/systemd/system/systemd-udev-trigger.service; static; vendor preset: disabled)
     Active: active (exited) since Tue 2019-06-18 15:47:34 EDT; 16h ago     
       Docs: man:udev(7)                                                    
             man:systemd-udevd.service(8)                                   
    Process: 22506 ExecStart=/usr/bin/udevadm trigger --type=devices --action=add (code=exited, status=0/SUCCESS)
    Process: 22416 ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add (code=exited, status=0/SUCCESS)
   Main PID: 22506 (code=exited, status=0/SUCCESS)                          
      Tasks: 0                                                              
     CGroup: /system.slice/systemd-udev-trigger.service                     
                                                                          
  Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Starting udev Coldplug all Devices...
  Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Started udev Coldplug all Devices.    

Then the settle for the udev trigger (not OK, but we don't know why):

  * systemd-udev-settle.service - udev Wait for Complete Device Initialization
     Loaded: loaded (/usr/lib/systemd/system/systemd-udev-settle.service; static; vendor preset: disabled)
     Active: failed (Result: exit-code) since Tue 2019-06-18 15:49:35 EDT; 16h ago
       Docs: man:udev(7)  
             man:systemd-udevd.service(8)
    Process: 22795 ExecStart=/usr/bin/udevadm settle (code=exited, status=1/FAILURE)
   Main PID: 22795 (code=exited, status=1/FAILURE)
                        
  Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Starting udev Wait for Complete Device Initialization...
  Jun 18 15:49:35 rhel7-3-ssg systemd[1]: Failed to start udev Wait for Complete Device Initialization.
  Jun 18 15:49:35 rhel7-3-ssg systemd[1]: Unit systemd-udev-settle.service entered failed state.


So it looks like something is stalling while the uevents from the trigger are being processed. The critical window is between 15:47:34 and 15:49:35, i.e. 121 seconds, which matches the 120 s default timeout of `udevadm settle` (udevadm itself exited with status=1, so it was the settle, not the 90 s systemd unit default, that expired).
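As a quick arithmetic check (illustrative, not from the sosreport), the stall window between the "Starting" and "Failed to start" journal timestamps works out to 121 seconds, close to the 120 s default timeout of `udevadm settle`:

```shell
# Compute the stall window between the two journal timestamps (GNU date).
start=$(date -u -d '2019-06-18 15:47:34' +%s)
end=$(date -u -d '2019-06-18 15:49:35' +%s)
echo "stall window: $(( end - start )) s"   # 121 s
```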


I see lots of these mostly in the udev log:

  Jun 18 15:47:35 rhel7-3-ssg systemd-udevd[22912]: starting '/bin/systemd-detect-virt'
  ...
  Jun 18 15:48:02 rhel7-3-ssg systemd-udevd[25174]: starting '/bin/systemd-detect-virt'
  ...
  Jun 18 15:48:02 rhel7-3-ssg systemd-udevd[22534]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:11
  ...
  Jun 18 15:51:33 rhel7-3-ssg systemd-udevd[45655]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:11
  Jun 18 15:51:33 rhel7-3-ssg systemd-udevd[45665]: starting '/bin/systemd-detect-virt'



The "systemd-detect-virt" calls are slowing down the boot a lot here - I think they are what causes the udev settle timeout in the end! Michal, any clue what may be wrong with systemd-detect-virt?

Comment 24 Peter Rajnoha 2019-07-08 11:03:48 UTC
Checked with Michal on IRC: there was a known problem with exactly this, and it should be resolved in z-stream release systemd-219-62.el7_6.7 and higher. If possible, please try that version.
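A hedged sketch for checking whether an installed build is at least the fixed z-stream version (the installed string below is hypothetical; on a real RHEL system it would come from `rpm -q --qf '%{VERSION}-%{RELEASE}' systemd`):

```shell
# Hypothetical installed version-release string; replace with the real rpm query output.
installed="219-62.el7_6.5"
fixed="219-62.el7_6.7"

# sort -V orders these z-stream version-release strings well enough for a simple compare.
newest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | tail -n1)
if [ "$newest" = "$fixed" ] && [ "$installed" != "$fixed" ]; then
  echo "update needed: systemd-$fixed or later"
else
  echo "installed systemd is at or above the fixed build"
fi
```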

Comment 29 Corey Marthaler 2019-07-22 17:05:43 UTC
Adding QA ack for 7.8. 

It doesn't appear that a definitive reproducer for this exists. 
This bug will be marked verified (SanityOnly) once final 7.8 regression testing has been completed.

Comment 36 Renaud Métrich 2019-09-04 14:22:37 UTC
Hello,

Just to follow up on Comment #23: I recently filed BZ #1748051 (Rules in 40-redhat.rules file for SUBSYSTEM=="memory" are suboptimal and may lead to timing issues), because I found that on some hardware tons of redundant "uname -p", "systemd-detect-virt", and similar processes were being spawned, which could explain systemd-udev-trigger timing out.

Renaud.
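To gauge how many such helper processes a coldplug trigger could spawn on a given box, one rough sketch (my own, not from the BZ) is to count the memory-section devices, since each one receives its own uevent on `udevadm trigger` and re-evaluates the rules:

```shell
# Each /sys/devices/system/memory/memoryNN section is a separate device whose
# uevent re-runs the PROGRAM rules in 40-redhat.rules (count may be 0 on
# systems without memory hotplug sections, e.g. containers).
n=$(ls -d /sys/devices/system/memory/memory* 2>/dev/null | wc -l)
echo "memory section devices: $n"
```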