Bug 2090169

Summary: Invalid entry in /etc/multipath/wwids causes unbootable ovirt-node
Product: [oVirt] vdsm
Reporter: Jean-Louis Dupond <jean-louis>
Component: General
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Yaning Wang <yaniwang>
Severity: high
Priority: unspecified
Version: 4.50.0.13
CC: aefrat, aesteve, ahadas, bugs, bzlotnik, cshao, michal.skrivanek, sbonazzo, ymankad
Target Milestone: ovirt-4.5.1
Flags: pm-rhel: ovirt-4.5?, michal.skrivanek: exception+
Target Release: 4.50.1.4
Hardware: Unspecified
OS: Unspecified
Fixed In Version: vdsm-4.50.1.4
Doc Type: Bug Fix
Doc Text:
Cause: The LVM check for multipath components based on the multipath wwids file is incorrect. When configuring LVM devices, LVM may skip some devices because it thinks they are multipath components. Consequence: On the next boot, the host cannot find the missing devices and boot ends in emergency mode. Fix: Disable the LVM check based on the wwids file. This check is not useful when using an LVM devices file or filter, which RHV always uses. Result: Hosts that previously failed to boot now boot correctly.
Clone Of:
Cloned to: 2095588 (view as bug list)
Last Closed: 2022-06-23 07:55:04 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 2095588

Description Jean-Louis Dupond 2022-05-25 09:35:40 UTC
Description of problem:
After upgrading an oVirt node from 4.4.10 to 4.5.0, the node didn't boot anymore and ended up in the dracut rescue shell.

In the rescue shell it was clear that it didn't boot because LVM did not activate the root devices.

When trying to manually activate the LVs, LVM refused because it considered the device a multipath (mpath) component.

But that was not the case: the device's wwid had been correctly added to the multipath blacklist by vdsm.

Finally I found out that it was caused by the fact that, despite the wwid being blacklisted, it was still listed in /etc/multipath/wwids.
This caused LVM to ignore the device, rendering the node unbootable.

Removing the device from the wwids file fixed the issue.
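
For anyone hitting the same thing, a rough sketch of the check and cleanup that worked here (assuming the affected local disk is /dev/sdb; adjust for your host):

   # look up the wwid of the local disk and check whether it is still in the wwids file
   /lib/udev/scsi_id --whitelisted --device=/dev/sdb
   grep <wwid> /etc/multipath/wwids

   # remove the stale wwid entry for the device
   multipath -w /dev/sdb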

I guess this is caused either by a newer LVM version or by the fact that the filter entry was removed in 4.5.0?

Comment 1 Benny Zlotnik 2022-06-06 13:00:24 UTC
Albert, any chance it's related to https://github.com/oVirt/vdsm/pull/228 ?

Comment 2 Albert Esteve 2022-06-06 13:50:30 UTC
> Albert, any chance it's related to https://github.com/oVirt/vdsm/pull/228 ?

It sounds similar, but I don't think so.

What triggered that change was an update in LVM that affected nodes running in rhel9 systems.
LVM now only uses event activation in rhel9, so the flag 'event_activation=0' was causing a misbehavior that left LVs inactive and broke the boot.

In this case, it seems to be a problem with LVM and multipath.

Comment 3 Arik 2022-06-06 14:36:31 UTC
We didn't test this with node/RHVH yet, so this might be a node issue that we'll face once we test upgrades with RHVH.

Comment 4 Nir Soffer 2022-06-06 15:49:31 UTC
I think this is a duplicate of bug 2076262

David already fixed this in LVM, but I don't know when the fix will be available.

Comment 6 Nir Soffer 2022-06-09 20:31:12 UTC
Jean-Louis, do you want to test the fix? You can use the rpms built
by GitHub here:
https://github.com/oVirt/vdsm/actions/runs/2470998957

Comment 7 Nir Soffer 2022-06-09 20:59:40 UTC
Updating severity: this causes node upgrades to fail and requires fixing the host
in emergency mode.

Comment 8 Jean-Louis Dupond 2022-06-10 13:06:04 UTC
Nir: What would be a proper way to test on ovirt-node? Can't I just change the lvmlocal.conf and rebuild the initramfs (how to do that correctly?)

Comment 9 Nir Soffer 2022-06-10 19:39:00 UTC
(In reply to Jean-Louis Dupond from comment #8)
> Nir: What would be a proper way to test on ovirt-node? Can't I just change
> the lvmlocal.conf and rebuild the initramfs (how to do that correctly?)

A good case to test is an existing host that has this issue: a device is listed in
/etc/multipath/wwids, and running "vdsm-tool config-lvm-filter" does not import the
vgs devices into the devices file.

Steps:
1. Update the lvmlocal.conf with the changes from the patch (new option, new revision) - see the sketch after these steps
2. Configure lvm:

   vdsm-tool config-lvm-filter
3. reboot
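
A minimal sketch of what the updated devices section in /etc/lvm/lvmlocal.conf might look like for step 1, assuming the new option from the patch is multipath_wwids_file (check the actual patch for the exact option name and revision comment):

   devices {
       # Don't treat devices listed in /etc/multipath/wwids as multipath
       # components; the vdsm-managed devices file / filter already covers this.
       multipath_wwids_file = ""
   }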

Expected results:
- use_devicesfile should be enabled in lvm.conf
- lvm filter should be removed from lvm.conf
- lvmdevices command should report all the relevant devices used by all host vgs.
- host should reboot successfully
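
Something along these lines should verify the expected results above (a sketch, not an exact procedure):

   # use_devicesfile should be enabled, and no filter should be left in lvm.conf
   grep -n use_devicesfile /etc/lvm/lvm.conf
   grep -n '^\s*filter' /etc/lvm/lvm.conf

   # every PV used by the host vgs should show up in the devices file
   lvmdevices
   pvs -o pv_name,vg_name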

I'm not sure that rebuilding the initramfs is needed since lvm does not use
the devices file during early boot. If you want to be sure, you can run

   dracut -f

This may not be enough for ovirt-node.

The other use case we need to test is upgrade - "vdsm-tool configure" installs
a new lvmlocal.conf, and we need to make sure the new file is used in the new
layer after rebooting.
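
After the upgrade and reboot into the new layer, roughly something like this should confirm it (a sketch; the option name is the same assumption as in the snippet above):

   # the new lvmlocal.conf should be present in the new image layer
   grep -n multipath_wwids_file /etc/lvm/lvmlocal.conf

   # vdsm should report the host as configured
   vdsm-tool is-configured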

Comment 14 Michal Skrivanek 2022-06-23 07:55:04 UTC
https://cbs.centos.org/koji/buildinfo?buildID=39701