Bug 2121277

Summary: [RHEL9.1] system hung at Started cancel waiting for multipath siblings of x
Product: Red Hat Enterprise Linux 9 Reporter: Ben Marzinski <bmarzins>
Component: device-mapper-multipath Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA QA Contact: Lin Li <lilin>
Severity: high Docs Contact:
Priority: high    
Version: 9.0 CC: abenoit, acardace, agk, atragler, bgalvani, bmarzins, bugproxy, cwei, dracut-maint-list, drosario, dtardon, fge, guazhang, heinzm, honli, lilin, lrintel, mgandhi, mharri, msnitzer, nyewale, pgm-rhel-tools, phess, prajnoha, pvlasin, rituagar, rkhan, rmetrich, saurav.kashyap, shangsong2, sukulkar, thaller, till, wdh, zkabelac
Target Milestone: rc Keywords: Reopened, Triaged
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: device-mapper-multipath-0.8.7-12.el9 Doc Type: Bug Fix
Doc Text:
Cause: When multipath is configured with "find_multipaths smart" (which it is when booting into anaconda) and a new storage device appears, it starts a systemd timer to wait for another path to the device to appear. If this timer expires while the initramfs is cleaning up to pivot to the regular filesystem during boot, it will restart multipathd, which will stop systemd from cleaning up the initramfs.
Consequence: Systems can hang booting into anaconda during installation if storage devices appear late enough in the initramfs portion of the bootup.
Fix: The systemd timers now conflict with initramfs cleanup, so they will automatically get stopped when the system cleans up to pivot to the regular filesystem. They also no longer restart multipathd if it has stopped running.
Result: Systems no longer hang while booting into anaconda for installation.
Story Points: ---
Clone Of: 1916168
: 2123372 Environment:
Last Closed: 2022-11-15 11:16:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1916168    
Bug Blocks: 1997272, 1916117, 1934584, 1965064, 1997257, 2024217, 2123372    
Attachments:
Description Flags
Patch to fix the hang. none

Description Ben Marzinski 2022-08-25 03:27:38 UTC
+++ This bug was initially created as a clone of Bug #1916168 +++

Description of problem:
Reserving a server failed and the system dropped into emergency mode. The console log shows that the system hung at "Started cancel waiting for multipath siblings of".

Version-Release number of selected component (if applicable):
RHEL-8.4.0-20210114.n.0 BaseOS x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install the OS on the server.

Actual results:
install failed 

Expected results:
install successful

Additional info:


[  OK  ] Started Open-iSCSI.
         Starting dracut initqueue hook...
[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Basic System.
[  OK  ] Started cancel waiting for multipath siblings of nvme0n1.
[  OK  ] Started cancel waiting for multipath siblings of sdh.
[  OK  ] Started cancel waiting for multipath siblings of sde.
[  OK  ] Started cancel waiting for multipath siblings of sdg.
[  OK  ] Started cancel waiting for multipath siblings of sdd.
[  OK  ] Started cancel waiting for multipath siblings of sdf.
[  OK  ] Started cancel waiting for multipath siblings of sdb.
[  OK  ] Started cancel waiting for multipath siblings of sda.
[  OK  ] Started cancel waiting for multipath siblings of sdc.
[  OK  ] Started cancel waiting for multipath siblings of nvme0n1.
[  OK  ] Started cancel waiting for multipath siblings of sdd.
[  OK  ] Started cancel waiting for multipath siblings of sdg.
[  OK  ] Started cancel waiting for multipath siblings of sde.
[  OK  ] Started cancel waiting for multipath siblings of sdh.
[  OK  ] Started cancel waiting for multipath siblings of sda.
[  OK  ] Started cancel waiting for multipath siblings of sdb.
[  OK  ] Started cancel waiting for multipath siblings of sdf.
[  OK  ] Started cancel waiting for multipath siblings of sdc.
[-- MARK -- Thu Jan 14 10:30:00 2021] 
[-- MARK -- Thu Jan 14 10:35:00 2021] 
[-- MARK -- Thu Jan 14 10:40:00 2021] 
[-- MARK -- Thu Jan 14 10:45:00 2021] 
[-- MARK -- Thu Jan 14 10:50:00 2021] 
[-- MARK -- Thu Jan 14 10:55:00 2021] 
[ 1632.404021] dracut-initqueue[1029]: Warning: dracut-initqueue timeout - starting timeout scripts  
[ 1639.128347] dracut-initqueue[1029]: Warning: dracut-initqueue timeout - starting timeout scripts  
[ 1645.834957] dracut-initqueue[1029]: Warning: dracut-initqueue timeout - starting timeout scripts  
[ 1652.531923] dracut-initqueue[1029]: Warning: dracut-initqueue timeout - starting timeout scripts  
[ 1659.227969] dracut-initqueue[1029]: Warning: dracut-initqueue timeout - starting timeout scripts  
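
For triage, a console log like the one above can be summarized by counting how many wait timers were started per device. The following is only a rough sketch: the function name count_wait_timers and the sample text are made up for illustration, and the pattern keys only on the message text, so it does not depend on how the "[  OK  ]" prefix is laid out.

```python
import re
from collections import Counter

# Matches the systemd status message printed when a multipath wait timer
# is started, and captures the device name (without the trailing period).
TIMER_RE = re.compile(
    r"Started cancel waiting for multipath siblings of (\S+?)\.?\s*$"
)


def count_wait_timers(log_text: str) -> Counter:
    """Count 'cancel waiting for multipath siblings' messages per device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical excerpt in the same shape as the console log.
sample = """\
[  OK  ] Reached target Basic System.
[  OK  ] Started cancel waiting for multipath siblings of nvme0n1.
[  OK  ] Started cancel waiting for multipath siblings of sdh.
[  OK  ] Started cancel waiting for multipath siblings of nvme0n1.
"""

if __name__ == "__main__":
    for device, count in sorted(count_wait_timers(sample).items()):
        print(f"{device}: {count}")
```

A device that shows up more than once had its wait timer started more than once during the initramfs stage, which is what the log above shows for every listed device.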


https://beaker.engineering.redhat.com/recipes/9390807#tasks


< comments trimmed >

--- Additional comment from Ben Marzinski on 2022-07-18 18:54:22 UTC ---

The "Started waiting (...) siblings of sda" messages should have gone away if they booted with nompath. Did they go away? "nompath" should disable both multipathd and multipath path claiming. It should pretty much completely disable multipath during that boot. So if the issue still exists when the node is booted with that command-line option, then it very likely has nothing to do with multipath.

--- Additional comment from Ben Marzinski on 2022-08-19 22:31:15 UTC ---

So, in another bug that is likely a duplicate, Bug 2059813, booting with "inst.nompath" still hangs, but booting with "nompath" does not. This makes sense, since "nompath" will take effect in the initramfs where the bug is, but "inst.nompath" will only affect anaconda, which is never reached because of the bug.
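
When comparing reports, it can help to check which of the two options a given boot actually used. A minimal sketch, assuming the kernel command line is available as a string (for example, copied from /proc/cmdline); the function name multipath_boot_options is hypothetical, and the per-option notes only restate the comment above:

```python
def multipath_boot_options(cmdline: str) -> dict:
    """Report which multipath-disabling options appear on a kernel command line."""
    # Whole-word comparison, so "inst.nompath" is not mistaken for "nompath".
    args = cmdline.split()
    return {
        # Per the comment above: takes effect in the initramfs.
        "nompath": "nompath" in args,
        # Per the comment above: only affects anaconda.
        "inst.nompath": "inst.nompath" in args,
    }


if __name__ == "__main__":
    # Hypothetical command line for illustration.
    example = "BOOT_IMAGE=vmlinuz inst.nompath console=ttyS0"
    print(multipath_boot_options(example))
```

For the example line, only "inst.nompath" is reported as present, which is the case that still hangs according to Bug 2059813.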

--- Additional comment from Ben Marzinski on 2022-08-19 22:49:27 UTC ---

After looking into Bug 2059813, which is likely a duplicate, it seems likely that this is a multipath issue. The problem is that when multipath is configured with find_multipaths "smart", which it is when booting into anaconda, multipath creates systemd timers to wait for possible siblings of path devices. If these timers expire after the initramfs starts cleaning up, they restart multipathd, which conflicts with the initramfs cleanup and causes it to stop. The solution is to make the timers themselves conflict with the initramfs cleanup, so they will be stopped when cleanup starts. Also, even if they trigger, they will no longer start up multipathd.

To verify that this is actually the problem, could you try booting with:

https://fedorapeople.org/groups/anaconda/rhbz2059813/boot.2059813.iso

instead of your regular installation iso.  This boot iso won't actually be able to install a system, since it doesn't contain any of the necessary installation sources. It will just boot you into anaconda. But since this iso has the multipath fix, you should be able to successfully boot into anaconda, without hanging in the initramfs.

Comment 3 Ben Marzinski 2022-08-25 04:03:27 UTC
Created attachment 1907478 [details]
Patch to fix the hang.

This is the patch from the test iso that fixes the issue. When multipath is configured with find_multipaths "smart" (which it is in the installer boot initramfs) it waits to see if multiple paths will appear for devices. It sets systemd timers to stop this waiting. If these timers triggered while the initramfs was cleaning up to pivot to the actual root filesystem, they would restart multipathd, which would cause the cleanup to hang.  The fix makes the timers conflict with initrd-cleanup.service, so that they get disabled when the initramfs starts cleaning up.  Also, they no longer force multipathd to restart if it has already been stopped.
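
The race and the fix can be illustrated with a toy model. This is not the patch and not how systemd is implemented; it is a sketch that assumes only the documented rule that when one unit conflicts with another, starting either one stops the other. The class ToyManager, the function boot, and the timer unit name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    active: bool = False
    conflicts: set = field(default_factory=set)


class ToyManager:
    """Tiny model of conflicting units: starting one stops the other."""

    def __init__(self):
        self.units = {}

    def add(self, name, conflicts=()):
        self.units[name] = Unit(name, conflicts=set(conflicts))

    def start(self, name):
        unit = self.units[name]
        for other in self.units.values():
            if other.name == name:
                continue
            # A conflict declared in either direction stops the other unit.
            if name in other.conflicts or other.name in unit.conflicts:
                other.active = False
        unit.active = True


def boot(fixed: bool) -> bool:
    """Return True if initramfs cleanup is still running after the timer expires."""
    cleanup = "initrd-cleanup.service"
    mpathd = "multipathd.service"
    timer = "cancel-multipath-wait-sda.timer"  # hypothetical name

    mgr = ToyManager()
    mgr.add(cleanup)
    mgr.add(mpathd, conflicts={cleanup})
    # The fix: the timer itself conflicts with initramfs cleanup.
    mgr.add(timer, conflicts={cleanup} if fixed else set())

    mgr.start(mpathd)
    mgr.start(timer)
    mgr.start(cleanup)  # stops multipathd and, when fixed, the timer too

    # Timer expiry: before the fix, a still-active timer restarts multipathd,
    # which in turn stops the cleanup. After the fix, the timer is already
    # stopped, and it would not restart multipathd anyway.
    if mgr.units[timer].active and not fixed:
        mgr.start(mpathd)

    return mgr.units[cleanup].active


if __name__ == "__main__":
    print("cleanup survives without fix:", boot(fixed=False))
    print("cleanup survives with fix:", boot(fixed=True))
```

In this model the unfixed boot ends with the cleanup unit stopped, which corresponds to the hang, while the fixed boot leaves cleanup running.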

Comment 8 Ben Marzinski 2022-09-02 14:51:13 UTC
*** Bug 2123663 has been marked as a duplicate of this bug. ***

Comment 9 Ben Marzinski 2022-09-02 15:24:29 UTC
*** Bug 2123372 has been marked as a duplicate of this bug. ***

Comment 10 Ben Marzinski 2022-09-02 23:22:54 UTC
A test iso with a patch to resolve this issue is available here:

https://people.redhat.com/bmarzins/isos/bz2121277/rhel-9.1-patched-boot.iso

Can you try booting with this iso instead of your regular installation iso? This boot iso won't actually be able to install a system, since it doesn't contain any of the necessary installation sources. It will just boot you into anaconda. But since it has the multipath fix, you should be able to successfully boot into anaconda, without hanging in the initramfs.

Comment 21 errata-xmlrpc 2022-11-15 11:16:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (device-mapper-multipath bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8313