Bug 1700451 - Booting with a large number of multipath devices drops into emergency shell
Summary: Booting with a large number of multipath devices drops into emergency shell
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: device-mapper-multipath
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Depends On:
Blocks: 1723746
 
Reported: 2019-04-16 15:20 UTC by Ben Marzinski
Modified: 2021-09-06 15:20 UTC
CC List: 13 users

Fixed In Version: device-mapper-multipath-0.8.0-2.el8
Doc Type: Bug Fix
Doc Text:
Cause: When multipath is determining whether it should claim a block device as a path device in udev, it checks whether multipathd is running by opening a socket connection to it. If multipathd hasn't started up yet and there are a large number of block devices, this can hang, causing udev to hang as well.
Consequence: udev processing for block devices can be delayed on bootup, possibly causing the bootup to fail and drop to the emergency shell.
Fix: multipath now tries to connect to the multipathd socket in a nonblocking manner. If that fails, it looks at the error to determine whether multipathd will be starting up.
Result: multipath no longer causes udev processing of block devices to hang in setups with a large number of block devices.
Clone Of:
Clones: 1723746
Environment:
Last Closed: 2019-11-05 22:18:16 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2019:3578 (last updated 2019-11-05 22:18:26 UTC)

Description Ben Marzinski 2019-04-16 15:20:04 UTC
Description of problem:
If there are a large number of devices on a system and device-mapper-multipath is configured, even if it is configured to blacklist all of the devices, the system can time out trying to initialize devices needed for filesystem mounts on bootup and drop into the emergency shell.


Version-Release number of selected component (if applicable):
device-mapper-multipath-0.7.8-7.el8

How reproducible:
Always on affected setups

Steps to Reproduce:
1. Install RHEL8 on a node with a large number of devices
2. Set up multipathing
3. Update the node. I'm not sure why this step appears necessary to reproduce the issue

Actual results:
On boot, the node drops into the emergency shell.  The devices will be initialized automatically, so no action needs to be taken except exiting
the emergency shell once the filesystems have mounted.

Expected results:
The node will not drop into the emergency shell on boot.

Additional info:
This issue appears to be caused by multipath attempting to determine whether it should claim devices (set ENV{DM_MULTIPATH_DEVICE_PATH}="1") in 62-multipath.rules. As part of doing this, it checks if multipathd is running. In rhel-8, this triggers a socket activation of multipathd through the /lib/systemd/system/multipathd.socket file. It appears that this can sometimes take a while, and while it is happening, the multipath command is stalled.
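For context, that is how systemd socket activation behaves: systemd itself listens on the unit's socket and starts the service at the first client connection, so the connecting client (here, the udev-invoked multipath command) waits until the daemon is up and servicing the socket. Below is a minimal sketch of such a socket unit, assuming the abstract-namespace listen path used by upstream multipath-tools; the exact path is an assumption, not taken from this report.

# Illustrative sketch of /lib/systemd/system/multipathd.socket; the
# ListenStream path is assumed from upstream multipath-tools.
[Unit]
Description=multipathd control socket

[Socket]
# systemd holds this socket open; the first connection activates
# multipathd.service while the client waits to be served.
ListenStream=@/org/kernel/linux/storage/multipathd

[Install]
WantedBy=sockets.target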

This bug can be worked around simply by removing /lib/systemd/system/multipathd.socket.

Removing that file shouldn't cause other issues, since multipathd should be running all the time anyway. Also, if multipathd isn't currently running, multipath checks whether systemd has multipathd enabled, so that it will be started, and acts the same as if multipathd were running.

However, it should be possible to fix this so that socket autoactivation remains, but the multipath claiming code doesn't trigger the autoactivation.

Comment 3 Mike Snitzer 2019-04-17 14:25:35 UTC
Think there is a more fundamental issue for this particular testbed: the root volume is _not_ using multipathing; so the initramfs shouldn't even be doing anything related to multipath.

Comment 4 Ben Marzinski 2019-04-17 14:38:18 UTC
The initramfs isn't doing anything with multipathing.  But the /home directory isn't being mounted in the initramfs, and after the pivot root, the multipath stall waiting for multipathd on all the other devices is keeping the local device used for /home from being re-initialized in time.  Now, it does seem that since the local device is currently there, with active LVs already using it for the root directory and /boot, udev should know about it and not need to recheck it before it can mount /home.

But regardless, multipath shouldn't have to wait for multipathd to start before it can even begin to check whether a device should be claimed as a multipath path.  The intention of the code was not to wait; it's just that the socket autoactivation makes this happen. I have a patch that checks if multipathd is running without accessing the socket, so that it doesn't trigger this. The other solution is to drop the autoactivation, since multipathd should always be running.

Comment 6 Ben Marzinski 2019-04-30 21:54:40 UTC
A different solution was agreed upon upstream, so the test packages do not reflect the actual solution.  Instead of not accessing the multipathd socket at all, the multipath -u command now tries to open it non-blocking and, on failure, checks the error code to see if multipathd will be starting up later.
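For illustration, here is a minimal, self-contained C sketch of that nonblocking-connect pattern. This is not the multipath-tools code: the helper name mpathd_probe and the socket path are made up for the example, and the errno mapping (EAGAIN meaning a listener exists but cannot accept yet, ECONNREFUSED/ENOENT meaning nothing is listening) is a reading of ordinary AF_UNIX semantics, not the exact logic of the fix.

#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Assumed abstract-namespace path (leading '\0' byte); not verified. */
#define MPATHD_SOCKET "\0/org/kernel/linux/storage/multipathd"

/*
 * Probe the daemon socket without blocking.
 * Returns 1 if a listener exists (or is busy coming up), 0 if nothing
 * is listening, -1 on unexpected errors.
 */
static int mpathd_probe(void)
{
    struct sockaddr_un addr;
    socklen_t len;
    int fd, ret, err;

    fd = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, MPATHD_SOCKET, sizeof(MPATHD_SOCKET) - 1);
    len = offsetof(struct sockaddr_un, sun_path) + sizeof(MPATHD_SOCKET) - 1;

    ret = connect(fd, (struct sockaddr *)&addr, len);
    err = errno;
    close(fd);

    if (ret == 0)
        return 1;   /* a listener accepted immediately */
    if (err == EAGAIN)
        return 1;   /* listener exists but backlog is full: someone
                       (the daemon, or systemd on its behalf) is there */
    if (err == ECONNREFUSED || err == ENOENT)
        return 0;   /* no listener at all */
    return -1;
}

int main(void)
{
    int r = mpathd_probe();
    printf("multipathd socket probe: %d\n", r);
    return r < 0;
}

The point of the pattern is that the EAGAIN-style failure can be treated as "a listener will serve this soon", so multipath -u can make its claiming decision without blocking udev behind multipathd's startup.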

Comment 8 Lin Li 2019-06-05 14:17:26 UTC
Hello Barry,
Could you provide the test results from following your steps with the fixed version?
I will reproduce this issue with a large number of scsi_debug devices.
Thanks in advance!

Comment 9 Lin Li 2019-06-06 08:18:11 UTC
Hello Barry,
Could you provide me with the steps to reproduce? I want to reproduce it using your steps.
Thanks in advance!

Comment 10 Barry Marson 2019-06-10 19:41:35 UTC
I simply have a large number of multipath devices.  In my case, there are 48x8 (multipath) LUNs.

Upon booting, a local volume group does not properly initialize and we drop into maintenance mode.

Barry

Comment 23 errata-xmlrpc 2019-11-05 22:18:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3578

