Bug 1701600
| Field | Value |
|---|---|
| Summary: | Timed out waiting for Device/LV during boot |
| Product: | Red Hat Enterprise Linux 7 |
| Component: | lvm2 |
| lvm2 sub component: | Activating existing Logical Volumes |
| Status: | CLOSED WORKSFORME |
| Severity: | high |
| Priority: | urgent |
| Version: | 7.6 |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | All |
| OS: | Linux |
| Reporter: | jhouston |
| Assignee: | Peter Rajnoha <prajnoha> |
| QA Contact: | cluster-qe <cluster-qe> |
| CC: | agk, bhull, bubrown, cmarthal, heinzm, jbrassow, jpittman, kwalker, loberman, lvm-team, msekleta, msnitzer, nweddle, pdwyer, prajnoha, revers, rhandlin, rleese, rmadhuso, rmetrich, teigland, tonay, zkabelac |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| : | 1703202 (view as bug list) |
| Last Closed: | 2019-09-13 12:46:11 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1703202, 1711360 |
Description
jhouston, 2019-04-19 20:00:09 UTC
Created attachment 1570173 [details]
Sos report with plymouth.enable=0
There are a couple of things I've not seen before in /var/log/messages which I'd like to understand. First, why does audit complain about the lvm2-pvscan service?

May 15 10:44:34 rhel7-3-ssg kernel: type=1130 audit(1557931474.699:61): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=lvm2-pvscan@8:3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

Second, the timing of the messages indicates that iTCO could be stalling the boot by about 40 seconds. I suggest disabling iTCO somehow to see if that fixes the problem.

May 15 10:45:05 rhel7-3-ssg kernel: iTCO_vendor_support: vendor-support=0
May 15 10:45:06 rhel7-3-ssg kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
May 15 10:45:06 rhel7-3-ssg kernel: iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
May 15 10:45:45 rhel7-3-ssg systemd: Job dev-mapper-vg1\x2dlv_swap.device/start timed out.
May 15 10:45:45 rhel7-3-ssg systemd: Timed out waiting for device dev-mapper-vg1\x2dlv_swap.device.
May 15 10:45:46 rhel7-3-ssg systemd: Dependency failed for /dev/mapper/vg1-lv_swap.
May 15 10:45:46 rhel7-3-ssg systemd: Dependency failed for Swap.

It's not just lvm that's failing and timing out; it seems to be everything:

May 15 10:47:00 rhel7-3-ssg systemd: systemd-udev-settle.service: main process exited, code=exited, status=1/FAILURE
May 15 10:47:00 rhel7-3-ssg systemd: Failed to start udev Wait for Complete Device Initialization.
May 15 10:47:02 rhel7-3-ssg systemd: systemd-udev-settle.service failed.

My impression here is that lvm is a victim of whatever is wrong with this machine. Is there any more info to lead us to believe there is an lvm problem here? Has the iTCO watchdog been disabled or fixed?

Looking a bit more at this, and at the reason why there's the:
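As an aside, the unit name in the timeout message (dev-mapper-vg1\x2dlv_swap.device) is just systemd's escaped form of the device path /dev/mapper/vg1-lv_swap. Here is a minimal sketch of that escaping, covering only the characters that occur in this example; on a real system, use `systemd-escape --path` instead:

```shell
#!/bin/sh
# Simplified sketch of systemd's device-path escaping. It only handles the
# characters present in this example ('-' and '/'); the real tool,
# `systemd-escape --path`, also escapes other non-alphanumeric characters.
escape_path() {
    p="${1#/}"                                   # drop the leading '/'
    # escape literal '-' as \x2d first, then turn '/' separators into '-'
    printf '%s\n' "$p" | sed -e 's/-/\\x2d/g' -e 's,/,-,g'
}

# Appending ".device" to the result gives the unit name from the log
escape_path /dev/mapper/vg1-lv_swap    # prints: dev-mapper-vg1\x2dlv_swap
```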
Jun 18 15:49:03 rhel7-3-ssg systemd[1]: Job dev-mapper-vg1\x2dlv_swap.device/start timed out
The LV with swap is activated correctly in initramfs and the /dev/mapper/vg1-lv_swap symlink is created correctly there, from the udev log (in initramfs):
Jun 18 15:47:30 rhel7-3-ssg systemd-udevd[1006]: creating symlink '/dev/mapper/vg1-lv_swap' to '../dm-1'
Then the udev database is cleaned up and we switch to the root fs.
Then udevadm trigger is called to reinitialize the udev db content, including already activated devices (like /dev/mapper/vg1-lv_swap). Here, I'm finding:
Jun 18 15:49:35 rhel7-3-ssg systemd[1]: systemd-udev-settle.service failed.
So, looking at why this one failed: the systemd-udev-trigger.service status (sos_commands/systemd/systemctl_status_--all) looks OK:
* systemd-udev-trigger.service - udev Coldplug all Devices
Loaded: loaded (/usr/lib/systemd/system/systemd-udev-trigger.service; static; vendor preset: disabled)
Active: active (exited) since Tue 2019-06-18 15:47:34 EDT; 16h ago
Docs: man:udev(7)
man:systemd-udevd.service(8)
Process: 22506 ExecStart=/usr/bin/udevadm trigger --type=devices --action=add (code=exited, status=0/SUCCESS)
Process: 22416 ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add (code=exited, status=0/SUCCESS)
Main PID: 22506 (code=exited, status=0/SUCCESS)
Tasks: 0
CGroup: /system.slice/systemd-udev-trigger.service
Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Starting udev Coldplug all Devices...
Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Started udev Coldplug all Devices.
Then the settle for the udev trigger (not OK, but we don't know why):
* systemd-udev-settle.service - udev Wait for Complete Device Initialization
Loaded: loaded (/usr/lib/systemd/system/systemd-udev-settle.service; static; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-06-18 15:49:35 EDT; 16h ago
Docs: man:udev(7)
man:systemd-udevd.service(8)
Process: 22795 ExecStart=/usr/bin/udevadm settle (code=exited, status=1/FAILURE)
Main PID: 22795 (code=exited, status=1/FAILURE)
Jun 18 15:47:34 rhel7-3-ssg systemd[1]: Starting udev Wait for Complete Device Initialization...
Jun 18 15:49:35 rhel7-3-ssg systemd[1]: Failed to start udev Wait for Complete Device Initialization.
Jun 18 15:49:35 rhel7-3-ssg systemd[1]: Unit systemd-udev-settle.service entered failed state.
So it looks like that while processing the uevents from the trigger, something is stalling. The critical window is between 15:47:34 and 15:49:35, i.e. about 121 seconds, which matches the default 120 s timeout of udevadm settle (the service failed with exit code 1 rather than being killed by a systemd unit timeout).
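For reference, the gap between those two timestamps can be computed with a little awk (a sketch; the timestamps are the ones from the log above):

```shell
#!/bin/sh
# Seconds elapsed between two HH:MM:SS timestamps on the same day
elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        print (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    }'
}

# "Starting udev Wait..." at 15:47:34, "Failed to start..." at 15:49:35
elapsed 15:47:34 15:49:35    # prints: 121
```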
I see lots of these mostly in the udev log:
Jun 18 15:47:35 rhel7-3-ssg systemd-udevd[22912]: starting '/bin/systemd-detect-virt'
...
Jun 18 15:48:02 rhel7-3-ssg systemd-udevd[25174]: starting '/bin/systemd-detect-virt'
...
Jun 18 15:48:02 rhel7-3-ssg systemd-udevd[22534]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:11
...
Jun 18 15:51:33 rhel7-3-ssg systemd-udevd[45655]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:11
Jun 18 15:51:33 rhel7-3-ssg systemd-udevd[45665]: starting '/bin/systemd-detect-virt'
The repeated "systemd-detect-virt" invocations are slowing down the boot a lot here - I think that is what is causing the udev settle timeout in the end! Michal, any clue what may be wrong with systemd-detect-virt?
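One quick way to gauge how often udevd is spawning the helper is to count the "starting" lines in the udev debug log. A sketch using a tiny sample file built from the lines quoted above (on the affected machine you would grep the real udev log instead):

```shell
#!/bin/sh
# Tiny sample built from the udev log lines quoted above (illustration only)
cat > /tmp/udev-sample.log <<'EOF'
Jun 18 15:47:35 rhel7-3-ssg systemd-udevd[22912]: starting '/bin/systemd-detect-virt'
Jun 18 15:48:02 rhel7-3-ssg systemd-udevd[25174]: starting '/bin/systemd-detect-virt'
Jun 18 15:51:33 rhel7-3-ssg systemd-udevd[45665]: starting '/bin/systemd-detect-virt'
EOF

# Count how many times udevd spawned the helper
grep -c "starting '/bin/systemd-detect-virt'" /tmp/udev-sample.log    # prints: 3
```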
Checked with Michal on IRC; there was a problem exactly with this, and it should be resolved in z-stream release systemd-219-62.el7_6.7 and higher. Please try this version if possible.

Adding QA ack for 7.8. It doesn't appear that a definitive reproducer for this exists. This bug will be marked verified (SanityOnly) once final 7.8 regression testing has been completed.

Hello, just to react on Comment #23: I recently filed BZ #1748051 (Rules in 40-redhat.rules file for SUBSYSTEM=="memory" are suboptimal and may lead to timing issues) because I found that on some hardware there were tons of dummy "uname -p", "systemd-detect-virt" and similar processes being spawned, which could explain systemd-udev-trigger timing out.

Renaud.
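The pattern Renaud describes comes from udev rules that run an external PROGRAM for every matching uevent. Illustratively, such a rule has roughly this shape (this is a paraphrase of the problem pattern, not a verbatim copy of RHEL 7's 40-redhat.rules):

```
# Illustrative only: a rule of this shape forks systemd-detect-virt once per
# "add" uevent for every memory block, which on machines with many memory
# blocks means thousands of short-lived helper processes during coldplug.
SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/systemd-detect-virt", RESULT!="none", ATTR{state}=="offline", ATTR{state}="online"
```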