Bug 1933038

Summary: systemd: Timed out waiting for Device/LV during boot on rhel-8.3
Product: Red Hat Enterprise Linux 8
Component: dracut
Version: 8.3
Hardware: Unspecified
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: urgent
Target Milestone: rc
Reporter: Jon Magrini <jmagrini>
Assignee: Lukáš Nykrýn <lnykryn>
QA Contact: qe-baseos-daemons
CC: agk, akgaikwa, alex.wang, bmarzins, bubrown, dracut-maint-list, dtardon, fsilva, hannsj_uhl, heinzm, jbrassow, jhouston, mgandhi, msnitzer, nweddle, pmahale, prajnoha, rmadhuso, teigland, ysudarev, zkabelac
Last Closed: 2021-05-25 20:31:38 UTC
Type: Bug
Bug Blocks: 1951002    
Attachments:
Service to collect udev monitor log

Description Jon Magrini 2021-02-25 14:43:51 UTC
Description of problem:
VG/LV fails to activate during boot 

Version-Release number of selected component (if applicable):
kernel: 4.18.0-240.15.1.el8_3
systemd-239-41.el8_3.1
lvm2-2.03.09-5.el8_3.2

How reproducible:
sporadic

Steps to Reproduce:
1. configure multipath
2. create pv/vg/lv from mpath devices
3. mount via fstab
4. reboot
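
A rough reproduction sketch of the steps above (device, VG, and LV names below are hypothetical examples; the actual setup in this case uses SAN mpath devices):

# 1. configure multipath
mpathconf --enable --with_multipathd y
systemctl start multipathd

# 2. create pv/vg/lv from an mpath device
pvcreate /dev/mapper/mpatha
vgcreate testvg /dev/mapper/mpatha
lvcreate -n testlv -L 10G testvg
mkfs.xfs /dev/testvg/testlv

# 3. mount via fstab
mkdir -p /mnt/test
echo '/dev/mapper/testvg-testlv /mnt/test xfs defaults 0 0' >> /etc/fstab
mount -a

# 4. reboot
reboot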

Actual results:
systemd[1]: dev-mapper-vg\lv.device: Job dev-mapper-vg\lv.device/start timed out.
systemd[1]: Timed out waiting for device dev-mapper-vg\lv.device.

Expected results:
normal boot sequence

Additional info:

Comment 1 Jon Magrini 2021-02-25 14:45:57 UTC
The multipath device table is created prior to switch root:
---
Feb 24 12:30:25 hostname multipathd[847]: mpathcb: load table [0 419430400 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:80 1 67:112 1 service-time 0 2 1 8:64 1 66:96 1]
[..]
Feb 24 12:30:25 hostname multipathd[847]: exit (signal)
Feb 24 12:30:25 hostname multipathd[847]: --------shut down-------
Feb 24 12:30:25 hostname systemd[1]: multipathd.service: Succeeded.

Feb 24 12:30:25 hostname systemd[1]: Starting Switch Root...
Feb 24 12:30:25 hostname systemd[1]: Switching root.
[..]

It seems LVM is attempting to activate prior to multipathd starting: 
---
Feb 24 12:30:28 hostname systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:10...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:7...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:1...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:4...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:5...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:13...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:9...
Feb 24 12:30:28 hostname systemd[1]: Starting LVM event activation on device 253:12...
Feb 24 12:30:28 hostname lvm[2584]:   pvscan[2584] PV /dev/mapper/mpatha3 online, VG rootvg is complete.
Feb 24 12:30:28 hostname lvm[2584]:   pvscan[2584] VG rootvg run autoactivation.
Feb 24 12:30:28 hostname kernel: ch 3:0:0:1: [ch0] dt 0x2: 
Feb 24 12:30:28 hostname kernel: ch 3:0:0:1: [ch0] ID/LUN unknown
Feb 24 12:30:28 hostname lvm[2581]:   pvscan[2581] PV /dev/mapper/mpathcl online, VG vg_hanadata incomplete (need 3).
Feb 24 12:30:28 hostname lvm[2583]:   pvscan[2583] PV /dev/mapper/mpathb online, VG vg_paging is complete.
Feb 24 12:30:28 hostname lvm[2583]:   pvscan[2583] VG vg_paging run autoactivation.
Feb 24 12:30:28 hostname lvm[2585]:   pvscan[2585] PV /dev/mapper/mpathcg online, VG vg_hanadata incomplete (need 2).
Feb 24 12:30:28 hostname lvm[2587]:   pvscan[2587] PV /dev/mapper/mpathcj online, VG vg_hanadata incomplete (need 1).
Feb 24 12:30:28 hostname lvm[2582]:   pvscan[2582] PV /dev/mapper/mpathci online, VG vg_hanadata is complete.
Feb 24 12:30:28 hostname lvm[2582]:   pvscan[2582] VG vg_hanadata run autoactivation.
Feb 24 12:30:28 hostname lvm[2586]:   pvscan[2586] PV /dev/mapper/mpathcm online, VG vg_hanalog incomplete (need 2).
Feb 24 12:30:28 hostname lvm[2588]:   pvscan[2588] PV /dev/mapper/mpathck online, VG vg_hanalog incomplete (need 1).
[..]
Feb 24 12:30:32 hostname systemd[1]: Started udev Wait for Complete Device Initialization.
Feb 24 12:30:32 hostname systemd[1]: Starting Device-Mapper Multipath Device Controller...
Feb 24 12:30:32 hostname multipathd[2675]: --------start up--------
Feb 24 12:30:32 hostname multipathd[2675]: read /etc/multipath.conf
Feb 24 12:30:32 hostname multipathd[2675]: path checkers start up
[..]
Feb 24 12:30:35 hostname multipathd[2675]: mpathcb: load table [0 419430400 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:80 1 67:112 1 service-time 0 2 1 8:64 1 66:96 1]
[..]
Feb 24 12:30:35 hostname systemd[1]: Started Device-Mapper Multipath Device Controller.


Feb 24 12:35:26 hostname systemd[1]: dev-mapper-vg_hana\x2dlv_hana02.device: Job dev-mapper-vg_hana\x2dlv_hana02.device/start timed out.
Feb 24 12:35:26 hostname systemd[1]: Timed out waiting for device dev-mapper-vg_hana\x2dlv_hana02.device.

Comment 2 Jon Magrini 2021-02-25 14:46:53 UTC
Feb 24 12:35:26 hostname systemd[1]: dev-mapper-vg_hana\x2dlv_hana02.device: Job dev-mapper-vg_hana\x2dlv_hana02.device/start timed out.
Feb 24 12:35:26 hostname systemd[1]: Timed out waiting for device dev-mapper-vg_hana\x2dlv_hana02.device.


  LV            VG          Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert LV Tags Devices
  lv_hana03     vg_hana     -wi-------  300.00g                                                             /dev/mapper/mpathcb(0)                                                                           -1   auto    1      0 
  lv_hana03     vg_hana     -wi-------  300.00g                                                             /dev/mapper/mpathcc(0)                                                                           -1   auto    1      0 



./etc/lvm/lvm.conf
[..]
filter = [ "a|/dev/mapper/mpath.*|" "r|.*|" ]


mpathcb (3600507638081053c1800000000000729) dm-17 IBM,2145
size=200G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:4  sdv  65:80  active ready running
| `- 2:0:1:4  sdbd 67:112 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:4  sde  8:64   active ready running
  `- 2:0:0:4  sdam 66:96  active ready running

udevadm_info_.dev.sdv
[..]
E: DM_MULTIPATH_DEVICE_PATH=1
E: ID_FS_TYPE=mpath_member

Comment 3 pratapsingh 2021-02-25 16:57:50 UTC
Hello Team,

I have a similar case where the LV fails to activate during boot.

Environment
-----------
Red Hat Enterprise Linux release 8.1 (Ootpa)

kernel:  4.18.0-147.5.1.el8_1.ppc64le
lvm2-2.03.05-5.el8.ppc64le                                  Thu Jan  9 15:59:57 2020
device-mapper-multipath-0.8.0-5.el8.ppc64le                 Thu Jan  9 15:59:55 2020

logs:
----

Feb 24 16:28:46 hostname systemd[1]: dev-mapper-datavg\x2dimpervalv.device: Job dev-mapper-datavg\x2dimpervalv.device/start timed out.
Feb 24 16:28:46 hostname systemd[1]: Timed out waiting for device dev-mapper-datavg\x2dimpervalv.device.
-- Subject: Unit dev-mapper-datavg\x2dimpervalv.device has failed
-- Defined-By: systemd


Feb 24 16:28:46 hostname systemd[1]: dev-mapper-datavg\x2dsapmntlv.device: Job dev-mapper-datavg\x2dsapmntlv.device/start timed out.
Feb 24 16:28:46 hostname systemd[1]: Timed out waiting for device dev-mapper-datavg\x2dsapmntlv.device.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: dev-mapper-datavg\x2dimpervalv.device: Job dev-mapper-datavg\x2dimpervalv.device/start timed out.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: Timed out waiting for device dev-mapper-datavg\x2dimpervalv.device.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: dev-mapper-sapbackupvg\x2dsapbackuplv.device: Job dev-mapper-sapbackupvg\x2dsapbackuplv.device/start timed out.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: Timed out waiting for device dev-mapper-sapbackupvg\x2dsapbackuplv.device.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: dev-mapper-datavg\x2dsaplv.device: Job dev-mapper-datavg\x2dsaplv.device/start timed out.
Feb 24 16:28:46 hostname lxhn013 systemd[1]: Timed out waiting for device dev-mapper-datavg\x2dsaplv.device.

Regards,
Pratapsingh

Comment 5 David Teigland 2021-02-25 18:22:38 UTC
There are only 8 or 9 lvm2-pvscan services, but there are about 17 PVs, so it seems that lvm2-pvscan services are not being generated for around half of the mpath devices.  A VG will not be activated unless there's an lvm2-pvscan service generated for each PV in the VG.
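
On a live system, a quick way to compare those two counts might look like the following (a hedged sketch; comment 6 below performs the equivalent check against the collected sosreport files):

# count the lvm2-pvscan services systemd knows about
systemctl list-units 'lvm2-pvscan@*' --all --no-legend | wc -l

# count the PVs that LVM sees
pvs --noheadings | wc -l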

Comment 6 Jon Magrini 2021-02-25 19:13:48 UTC
(In reply to David Teigland from comment #5)
> There are only 8 or 9 lvm2-pvscan services, but there are about 17 PVs, so
> it seems that lvm2-pvscan services are not being generated for around half
> of the mpath devices.  A VG will not be activated unless there's an
> lvm2-pvscan service generated for each PV in the VG.

That assessment seems accurate here. For the case described in the bz opening, only 8 units are present: 

$ grep lvm2-pvscan sos_commands/systemd/systemctl_list-units
  lvm2-pvscan@253:1.service                                           loaded active exited    LVM event activation on device 253:1                                         
  lvm2-pvscan@253:10.service                                          loaded active exited    LVM event activation on device 253:10                                        
  lvm2-pvscan@253:12.service                                          loaded active exited    LVM event activation on device 253:12                                        
  lvm2-pvscan@253:13.service                                          loaded active exited    LVM event activation on device 253:13                                        
  lvm2-pvscan@253:4.service                                           loaded active exited    LVM event activation on device 253:4                                         
  lvm2-pvscan@253:5.service                                           loaded active exited    LVM event activation on device 253:5                                         
  lvm2-pvscan@253:7.service                                           loaded active exited    LVM event activation on device 253:7                                         
  lvm2-pvscan@253:9.service                                           loaded active exited    LVM event activation on device 253:9     

There are 20 mpath devices total, though only 17 are used by LVM: 
 
$ less sos_commands/devicemapper/dmsetup_info_-c | sed -n '1p;/mpath/p'
Name                      Maj Min Stat Open Targ Event  UUID                                                                
mpathch                   253  22 L--w    0    1      0 mpath-3600507638081053c180000000000072e                             
mpathcg                   253   5 L--w    1    1      0 mpath-3600507638081053c180000000000072f                             
mpathcf                   253  21 L--w    0    1      0 mpath-3600507638081053c180000000000072d                             
mpathbz                   253  15 L--w    0    1      0 mpath-3600507638081053c1800000000000727                             
mpathce                   253  20 L--w    0    1      0 mpath-3600507638081053c180000000000072c                             
mpathb                    253   1 L--w    1    1      0 mpath-3600507638081053c18000000000006db                             
mpathcd                   253  19 L--w    0    1      0 mpath-3600507638081053c180000000000072b                             
mpatha                    253   0 L--w    3    1      0 mpath-3600507638081053c18000000000006da                             
mpatha3                   253   4 L--w    7    1      0 part3-mpath-3600507638081053c18000000000006da                       
mpathcc                   253  18 L--w    0    1      0 mpath-3600507638081053c180000000000072a                             
mpatha2                   253   3 L--w    1    1      0 part2-mpath-3600507638081053c18000000000006da                       
mpathcb                   253  17 L--w    0    1      0 mpath-3600507638081053c1800000000000729                             
mpatha1                   253   2 L--w    0    1      0 part1-mpath-3600507638081053c18000000000006da                       
mpathcn                   253  14 L--w    0    1      0 mpath-3600507638081053c1800000000000735                             
mpathca                   253  16 L--w    0    1      0 mpath-3600507638081053c1800000000000728                             
mpathcm                   253  13 L--w    0    1      0 mpath-3600507638081053c1800000000000734                             
mpathcl                   253  10 L--w    1    1      0 mpath-3600507638081053c1800000000000732                             
mpathck                   253  12 L--w    0    1      0 mpath-3600507638081053c1800000000000733                             
mpathcj                   253   9 L--w    1    1      0 mpath-3600507638081053c1800000000000731                             
mpathci                   253   7 L--w    1    1      0 mpath-3600507638081053c1800000000000730   

$ less sos_commands/block/blkid_-c_.dev.null |grep mpath |grep -c LVM2
17

The system appears to be configured for event-based activation: 

$ grep event_activation etc/lvm/lvm.conf
	# Configuration option global/event_activation.
	# When event_activation is disabled, the system will generally run
	event_activation = 1

Shouldn't one of the lvm2-activation*.service units generate the missing units? Or is the fix simply to re-run the lvm2-activation-generator?

Comment 7 David Teigland 2021-02-25 20:14:30 UTC
The lvm2-activation-generator looks at lvm.conf event_activation, and if it's 0 it will create the lvm2-activation* services that will call vgchange -aay directly.  When event_activation=1, the generator does not create the lvm2-activation services, and the system depends entirely on lvm2-pvscan services to activate everything.
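
To illustrate, a hedged sketch of checking which mode is in effect and whether the generator emitted the static units (the generator output directory and unit names are assumptions based on the RHEL 8 lvm2 packaging):

# show the effective setting (1 = event-based activation via lvm2-pvscan@ units)
lvmconfig global/event_activation

# with event_activation=0 the generator writes lvm2-activation*.service units into
# the systemd generator directory at boot/daemon-reload; with event_activation=1
# they should be absent
ls /run/systemd/generator/lvm2-activation*.service

# generators are re-run on a daemon-reload, so a changed lvm.conf is picked up with:
systemctl daemon-reload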

Comment 9 Peter Rajnoha 2021-03-01 09:12:36 UTC
Created attachment 1759904 [details]
Service to collect udev monitor log

(In reply to Peter Rajnoha from comment #8)
> I've had a set of helper rules/systemd units which could collect the whole
> uevent sequence which could help us in debugging this, I'll try to find them
> and attach it here...

The steps to take are these (7, 8, and 11 are extra, but the more context we have, the better):

1) place systemd-udev-monitor.service file in /etc/systemd/system
2) call "systemctl daemon-reload"
3) call "systemctl enable systemd-udev-monitor.service"
4) reboot
5) append "systemd.log_level=debug systemd.log_target=kmsg udev.log-priority=debug log_buf_len=8M" to kernel cmd line at boot
6) make sure systemd-udev-monitor.service is running by checking "systemctl status systemd-udev-monitor.service"
7) after booting, grab the journal with "journalctl -b"
8) generate lvmdump by calling "lvmdump -u -l -s" and take the generated dump file
9) stop the systemd-udev-monitor.service with "systemctl stop systemd-udev-monitor.service"
10) take the /run/udev_monitor.log file that got generated
11) do the sosreport, just in case we'd need some extra info not already gathered by commands above
12) call "systemctl disable systemd-udev-monitor.service" and "rm /etc/systemd/system/systemd-udev-monitor.service" for cleanup.

(Copied from here: https://bugzilla.redhat.com/show_bug.cgi?id=1324028#c10)
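
For reference, a minimal sketch of what such a systemd-udev-monitor.service unit might look like, assuming it simply runs udevadm monitor early in boot and writes to the /run/udev_monitor.log path mentioned in step 10 (the actual attachment may differ):

[Unit]
Description=Collect udev monitor log for debugging
DefaultDependencies=no

[Service]
Type=simple
# record both kernel uevents and udev events, including their properties
ExecStart=/bin/sh -c 'exec udevadm monitor --kernel --udev --property > /run/udev_monitor.log 2>&1'

[Install]
WantedBy=sysinit.target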

Comment 39 Jon Magrini 2021-05-25 14:27:12 UTC
Bz 1949076 addresses the issue initially reported in this BZ (comments 1 & 2). There are four other cases linked as well; for three of them, the additions to multipathd.service have resolved the issue. The fourth case is chasing an additional issue not related to this BZ, so we should be good to close this BZ and defer to 1949076.

Comment 40 Peter Rajnoha 2021-05-25 20:31:38 UTC
OK, thanks for confirmation. I'm closing this one then.

*** This bug has been marked as a duplicate of bug 1949076 ***