Bug 1787013
Summary: lvm2-monitor.service hangs boot if no PVs on disks
Product: Red Hat Enterprise Linux 8
Component: lvm2
lvm2 sub component: Storage
Version: 8.1
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: high
Target Milestone: rc
Target Release: 8.0
Reporter: Jukka Lehtonen <jukka.lehtonen>
Assignee: Peter Rajnoha <prajnoha>
QA Contact: cluster-qe <cluster-qe>
CC: agk, heinzm, jbrassow, msnitzer, pasik, prajnoha, zkabelac
Type: Bug
Last Closed: 2020-12-09 12:39:03 UTC
Description (Jukka Lehtonen, 2019-12-30 11:08:19 UTC)

Peter Rajnoha:

What happens if you run "vgchange --monitor y -vvvv" on the command line directly? Do you hit any segfault there? Also, if possible, please run "lvmdump -u -l -s" and attach the created archive here in this report. The archive contains collected device-mapper/lvm2-related debug info.

(In reply to Peter Rajnoha from comment #2)
> What happens if you run "vgchange --monitor y -vvvv" on command line
> directly? Can you hit any segfault there?

Also, please send the output from the vgchange command. Thanks.

Jukka Lehtonen:

Created attachment 1650605 [details]
lvmdump

Created attachment 1650606 [details]
vgchange

No, vgchange does not segfault on a running system. (I ran lvmdump before vgchange.)

# systemctl start lvm2-monitor

completes without delay (the logs show both "Starting ..." and "Started ...").

Peter Rajnoha:

Created attachment 1650620 [details]
lvm2-monitor.service debug configuration
OK, thanks for the info. Could you please try deleting "/run/lvm/pvs_online", "/run/lvm/vgs_online" and "/run/lvm/hints" and then run "vgchange --monitor y -vvvv" to see if you can hit the issue. If yes, please send the output.
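The cleanup-and-retry step just described could be scripted roughly as follows (the cache paths and the vgchange flags are the ones named in the comment; the debug log path and the function name are arbitrary choices for this sketch):

```shell
#!/bin/sh
# Remove LVM's cached scan state so the next command starts from a clean
# slate. The directory defaults to /run/lvm; -f keeps this safe when the
# files are already absent.
reset_lvm_scan_cache() {
    dir=${1:-/run/lvm}
    rm -f "$dir/pvs_online" "$dir/vgs_online" "$dir/hints"
}

# Re-run the monitoring command with maximum verbosity, capturing the
# debug output (which vgchange writes to stderr) in a file.
retry_monitor() {
    if command -v vgchange >/dev/null 2>&1; then
        vgchange --monitor y -vvvv 2>/tmp/vgchange-debug.log
    else
        echo "vgchange not found; is the lvm2 package installed?" >&2
        return 1
    fi
}

reset_lvm_scan_cache
retry_monitor || true
```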
If not, then please install the attached lvm2-monitor.service.d directory (with the debug.conf file) into the /etc/systemd/system directory, enable lvm2-monitor.service and reboot. On the next boot, if the issue happens, wait for the lvm2-monitor.service start timeout (the debug.conf file sets the service timeout to 10s, but systemd tries several times before giving up completely; it should all be within 1 minute). Then please collect "lvmdump -u -l -s". Thanks.
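The attached debug.conf is not reproduced in this export. A hypothetical reconstruction matching the behavior described above (debug verbosity plus a 10-second start timeout) might look like this; the packaged unit's exact command line may differ:

```ini
# /etc/systemd/system/lvm2-monitor.service.d/debug.conf
# Hypothetical reconstruction; the real attachment may differ.
[Service]
# Clear the packaged ExecStart, then re-run it with full debug verbosity.
ExecStart=
ExecStart=/usr/sbin/lvm vgchange --monitor y -vvvv
# Make the hang visible quickly instead of waiting for the default timeout.
TimeoutStartSec=10
```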
Jukka Lehtonen:

No such files after reboot: /run/lvm/pvs_online, /run/lvm/vgs_online, /run/lvm/hints

The previous vgchange log had:
---
filters/filter-persistent.c:134  filter caching bad /dev/md127
label/hints.c:613  Reading hint file
config/config.c:1481  devices/global_filter not found in config: defaulting to global_filter = [ "a|.*|" ]
config/config.c:1481  devices/filter not found in config: defaulting to filter = [ "a|.*|" ]
label/hints.c:1273  get_hints: no entries
label/label.c:1036  Will scan 16 devices skip 0
---
Now, the first lvm-related operation I spot is:
---
filters/filter-persistent.c:134  filter caching bad /dev/md127
label/hints.c:1219  get_hints: no file
label/label.c:1036  Will scan 16 devices skip 0
---
No segfault from vgchange.

Created attachment 1650800 [details]
lvdump after debug.conf
Tried perhaps three times, well under a minute. I did not see "lvm segfault" on the screen.
(In reply to Jukka Lehtonen from comment #9)
> Created attachment 1650800 [details]
> lvdump after debug.conf
>
> Tried perhaps three times, well under a minute. I did not see "lvm segfault"
> on the screen.

There are a few interesting parts in the log though (lvmdump's sysreport/systemd_lvm2_services_status, the part related to lvm2-monitor.service):

===
Jan 08 19:42:04.214865 localhost.localdomain lvm[1382]: 19:42:04.124916 vgchange[1382] label/label.c:941 Finding devices to scan
Jan 08 19:42:04.214865 localhost.localdomain lvm[1382]: 19:42:04.124919 vgchange[1382] device/dev-cache.c:1173 Creating list of system devices.
...
Jan 08 19:42:04.215153 localhost.localdomain lvm[1382]: 19:42:04.126756 vgchange[1382] device/dev-cache.c:712 Found dev 8:16 /dev/sdb - new.
Jan 08 19:42:04.215153 localhost.localdomain lvm[1382]: 19:42:04.126804 vgchange[1382] device/dev-cache.c:712 Found dev 8:17 /dev/sdb1 - new.
Jan 08 19:42:04.215153 localhost.localdomain lvm[1382]: 19:42:04.126854 vgchange[1382] device/dev-cache.c:712 Found dev 8:18 /dev/sdb2 - new.
...
Jan 08 19:42:04.432240 localhost.localdomain lvm[1382]: 19:42:04.432235 vgchange[1382] device/dev-io.c:482 Opened /dev/sdb1 RO O_DIRECT
Jan 08 19:42:04.432262 localhost.localdomain lvm[1382]: 19:42:04.432241 vgchange[1382] device/dev-io.c:185 /dev/sdb1: size is 1677721600 sectors
Jan 08 19:42:04.432262 localhost.localdomain lvm[1382]: 19:42:04.432246 vgchange[1382] device/dev-io.c:521 Closed /dev/sdb1
Jan 08 19:42:04.432262 localhost.localdomain lvm[1382]: 19:42:04.432252 vgchange[1382] filters/filter-mpath.c:207 /dev/sdb1: Device is a partition, using primary device sdb for m
Jan 08 19:42:04.432321 localhost.localdomain lvm[1382]: 19:42:04.432267 vgchange[1382] filters/filter-partitioned.c:31 filter partitioned deferred /dev/sdb1
Jan 08 19:42:04.432321 localhost.localdomain lvm[1382]: 19:42:04.432270 vgchange[1382] filters/filter-signature.c:32 filter signature deferred /dev/sdb1
Jan 08 19:42:04.432359 localhost.localdomain lvm[1382]: 19:42:04.432331 vgchange[1382] device/dev-type.c:1119 Device /dev/sdb1 not initialized in udev database (1/100, 0 microsec
Jan 08 19:42:04.532558 localhost.localdomain lvm[1382]: 19:42:04.532531 vgchange[1382] device/dev-type.c:1119 Device /dev/sdb1 not initialized in udev database (2/100, 100000 mic
Jan 08 19:42:04.634567 localhost.localdomain lvm[1382]: 19:42:04.634404 vgchange[1382] device/dev-type.c:1119 Device /dev/sdb1 not initialized in udev database (3/100, 200000 mic
Jan 08 19:42:04.734640 localhost.localdomain lvm[1382]: 19:42:04.734612 vgchange[1382] device/dev-type.c:1119 Device /dev/sdb1 not initialized in udev database (4/100, 300000 mic
Jan 08 19:42:04.835115 localhost.localdomain lvm[1382]: 19:42:04.834889 vgchange[1382] device/dev-type.c:1119 Device /dev/sdb1 not initialized in udev database (5/100, 400000 mic
Jan 08 19:42:04.935242 localhost.localdomain lvm[1382]: 19:42:04.935229 vgchange[1382] device/dev-type.c:1105 WARNING: Failed to get udev device handler for device /dev/sdb1.
...
Jan 08 19:42:04.935747 localhost.localdomain lvm[1382]: 19:42:04.935741 vgchange[1382] label/label.c:541 Device open /dev/sdb1 8:17 failed errno 2
Jan 08 19:42:04.935771 localhost.localdomain lvm[1382]: 19:42:04.935749 vgchange[1382] label/label.c:545 Device open /dev/sdb1 8:17 stat failed errno 2
Jan 08 19:42:04.940901 localhost.localdomain lvm[1382]: 19:42:04.940860 vgchange[1382] label/label.c:560 Device open /dev/sdb1 retry
Jan 08 19:42:04.940944 localhost.localdomain lvm[1382]: 19:42:04.940924 vgchange[1382] label/label.c:541 Device open /dev/sdb1 8:17 failed errno 2
Jan 08 19:42:04.940944 localhost.localdomain lvm[1382]: 19:42:04.940929 vgchange[1382] label/label.c:545 Device open /dev/sdb1 8:17 stat failed errno 2
Jan 08 19:42:04.940944 localhost.localdomain lvm[1382]: 19:42:04.940933 vgchange[1382] label/label.c:677 Scan failed to open /dev/sdb1.
===

And similar problems with sda1:

===
Jan 08 19:42:04.214865 localhost.localdomain lvm[1382]: 19:42:04.126659 vgchange[1382] device/dev-cache.c:712 Found dev 8:1 /dev/sda1 - new.
Jan 08 19:42:04.429794 localhost.localdomain lvm[1382]: 19:42:04.429782 vgchange[1382] device/dev-cache.c:1391 /dev/sda1: stat failed: No such file or directory
Jan 08 19:42:04.429833 localhost.localdomain lvm[1382]: 19:42:04.429796 vgchange[1382] device/dev-cache.c:1400 Path /dev/sda1 no longer valid for device(8,1)
Jan 08 19:42:04.429833 localhost.localdomain lvm[1382]: 19:42:04.429802 vgchange[1382] device/dev-cache.c:1417 Aborting - please provide new pathname for what used to be /dev/sda1
Jan 08 19:42:04.429833 localhost.localdomain lvm[1382]: 19:42:04.429816 vgchange[1382] filters/filter-usable.c:38 /dev/sda1: Skipping: dev_get_size failed
Jan 08 19:42:04.429833 localhost.localdomain lvm[1382]: 19:42:04.429821 vgchange[1382] filters/filter-persistent.c:134 filter caching bad /dev/sda1
Jan 08 19:42:04.954008 localhost.localdomain lvm[1382]: 19:42:04.954004 vgchange[1382] device/dev-cache.c:1391 /dev/sda1: stat failed: No such file or directory
Jan 08 19:42:04.954036 localhost.localdomain lvm[1382]: 19:42:04.954008 vgchange[1382] device/dev-cache.c:1400 Path /dev/sda1 no longer valid for device(8,1)
Jan 08 19:42:04.954036 localhost.localdomain lvm[1382]: 19:42:04.954012 vgchange[1382] device/dev-cache.c:1417 Aborting - please provide new pathname for what used to be /dev/sda1
===

Now, looking at the lsblk output, we have sda and sdb under MD control:

===
md126p1   259:11   0    800G  0 md
└─md126     9:126  0  931.5G  0 raid1
  ├─sda     8:0    0  931.5G  0 disk
  └─sdb     8:16   0  931.5G  0 disk
md126p2   259:12   0  131.5G  0 md
└─md126     9:126  0  931.5G  0 raid1
  ├─sda     8:0    0  931.5G  0 disk
  └─sdb     8:16   0  931.5G  0 disk
===

From the above, it seems that when vgchange --monitor y is executed as part of lvm2-monitor.service early at boot, sda and sdb do not yet have the MD device set up on top, and we also see sda1 and sdb1.
But as vgchange --monitor y executes further, the situation changes underneath in parallel: the partitions disappear and instead we end up with the MD device properly set up, with the partitions then mapped on top of this MD device. The md126, at least based on udev info, is an IMSM RAID. It seems this mixture can cause some error path to be taken by lvm2 which may not be handled correctly, and it may even cause the segfault. I'll check further details...

Based on the logs, sda and sdb are part of an Intel firmware RAID. This one stores its metadata at the end of the device, so if there's a partition table at the beginning of the device, we get the usual sda1, sda2 and sdb1, sdb2 partitions set up. A while later at boot, this is corrected right when the MD device representing the firmware RAID is created on top of sda and sdb (hence sda1/2 and sdb1/2 are removed and md126p1/2 are created instead).

Based on the verbose logs, vgchange stops here:

Jan 08 19:42:04.954760 localhost.localdomain lvm[1382]: 19:42:04.954756 vgchange[1382] device/dev-io.c:162 /dev/sdb: using cached size 1953525168 sectors

Then vgchange is terminated by systemd because it does not return even after 10s (which we set in the debug.conf):

Jan 08 19:42:14.233989 localhost.localdomain systemd[1]: lvm2-monitor.service: Start operation timed out. Terminating.

Unfortunately, that last vgchange log line doesn't tell me why it stops there. I'll try to simulate this situation in my testing environment. In the meantime, could you please try setting these in lvm.conf and see if the issue improves:

external_device_info_source = "udev"
fw_raid_component_detection = 1

Jukka Lehtonen:

Created attachment 1651392 [details]
lvdump with udev/detect

After setting:
> external_device_info_source = "udev"
> fw_raid_component_detection = 1

there was no pause during boot.

Peter Rajnoha:

Does this problem still appear with a recent lvm2 version if you change the settings back to the defaults (external_device_info_source = "none" and fw_raid_component_detection = 0)?

Jukka Lehtonen:

I have CentOS 8, so it is still derived from RHEL 8.2 and thus has lvm2-2.03.08-3.el8.x86_64 at the moment. The problem still appears with this version with the default config: a segfault and delay, but the timeout is now at 10(?) sec. Will check again once CentOS 8.3-2011 drops.

New test with lvm2-2.03.09-5.el8.x86_64, default config: no wait, no segfault. The problem appears to be solved.

Peter Rajnoha:

OK, thanks for trying it out. I'm closing this report then.
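For reference, the two lvm.conf settings exercised in this thread live in the devices section of /etc/lvm/lvm.conf. The workaround values suggested above would look like this in context (only these two lines were discussed; the rest of the file is untouched):

```ini
# /etc/lvm/lvm.conf -- workaround values tried in this thread
# (the defaults on the affected version were "none" and 0)
devices {
    # Ask udev, not lvm2's own scanning, about device type/layering.
    external_device_info_source = "udev"
    # Detect and skip firmware RAID component devices (e.g. IMSM members).
    fw_raid_component_detection = 1
}
```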