Bug 2095588 - RHV/RHHI 4.4 -> 4.5 upgrade results in maintenance mode due to LVM use_devicesfile = 1
Summary: RHV/RHHI 4.4 -> 4.5 upgrade results in maintenance mode due to LVM use_devicesfile = 1
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Albert Esteve
QA Contact: Lukas Svaty
URL:
Whiteboard:
Duplicates: 2104515 (view as bug list)
Depends On: 2090169
Blocks:
 
Reported: 2022-06-09 22:43 UTC by Sean Haselden
Modified: 2023-02-16 16:28 UTC
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2090169
Environment:
Last Closed: 2022-07-11 14:22:12 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-46388 0 None None None 2022-06-09 22:46:22 UTC
Red Hat Knowledge Base (Solution) 6962678 0 None None None 2022-06-16 13:11:03 UTC

Description Sean Haselden 2022-06-09 22:43:33 UTC
+++ This bug was initially created as a clone of Bug #2090169 +++

This is a **similar** issue, but it is unclear whether the root cause is the same.


Description of problem:
After upgrading an RHV virtualization host from 4.4.10 to 4.5.0, the node no longer booted and ended up in the dracut rescue shell.

In the rescue shell it was clear that the boot failed because LVM did not activate the gluster devices, and mounting the related filesystems failed:
UUID=24092465-cad0-412b-8ef3-d3abbf9e7b5b /gluster_bricks/engine xfs inode64,noatime,nodiratime 0 0
UUID=67dce97d-13b1-4734-9158-aeafd7e57426 /gluster_bricks/data xfs inode64,noatime,nodiratime 0 0
UUID=d303fdb0-3bf4-4cab-8d66-125035a24899 /gluster_bricks/vmstore xfs inode64,noatime,nodiratime 0 0
UUID=21e8907f-122a-44d8-8f6c-cfc8dc6bc8eb /gluster_bricks/vmstore2 xfs inode64,noatime,nodiratime 0 0

The LVM physical volumes are LUKS-encrypted drives:
/dev/mapper/luks_sdb                                  gluster_vg_luks_sdb lvm2 a--   <3.20t      0    <3.20t yktKae-63L1-GIfU-mWdK-gvXS-1e0f-CLXXDu   505.50k  1020.00k     1        1   1.00m
/dev/mapper/luks_sdc                                  gluster_vg_sdc      lvm2 a--   <4.37t  35.16g   <4.37t TWNu4B-aaGD-LjNh-W2Y9-FRzW-Fzho-9nfmFA   506.50k  1020.00k     1        1   1.00m
  /dev/sda1                                                                      ---       0       0     1.00g                                               0         0      0        0      0 
  /dev/sda2                                                                      ---       0       0  <299.00g                                               0         0      0        0      0 
  /dev/sdb                                                                       ---       0       0    <3.20t                                               0         0      0        0      0 
  /dev/sdc                                                                       ---       0       0    <4.37t                                               0         0      0        0      0 

When manually scanning LVM from the rescue/maintenance shell, we see that pvscan skipped the device due to "deviceid":

17:39:12.088409 pvscan[14920] filters/filter-deviceid.c:40  /dev/mapper/luks_sdb: Skipping (deviceid)

Comparing the customer's /etc/lvm/lvm.conf before and after the upgrade, we see the following was added in 4.5:

use_devicesfile = 1

This is per: 
https://bugzilla.redhat.com/show_bug.cgi?id=2012830 

On the customer's system we can see that /etc/lvm/devices/system.devices was not properly populated with the two /dev/mapper/luks* devices, and therefore LVM ignores them.
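
A quick way to confirm this state (a hedged sketch; the device names are the ones from this host, and exact output will vary):

grep use_devicesfile /etc/lvm/lvm.conf                        ## shows use_devicesfile = 1
grep luks /etc/lvm/devices/system.devices                     ## no output - the luks PVs are missing
pvs --devicesfile= /dev/mapper/luks_sdb /dev/mapper/luks_sdc  ## PVs are still visible when the devices file is bypassed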

From maintenance mode, we did the following to work around the problem:

vi /etc/lvm/lvm.conf
    use_devicesfile = 0   ## temporarily disable so we can manually activate
vgchange -ay              ## activate all of the volume groups
vi /etc/lvm/lvm.conf
    use_devicesfile = 1   ## set it back to enabled
vgimportdevices -a        ## run the import and correctly populate /etc/lvm/devices/system.devices
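
To verify the workaround took effect (a suggested check along the same lines; VG names as on this host):

vgs gluster_vg_luks_sdb gluster_vg_sdc      ## both VGs are visible again
grep luks /etc/lvm/devices/system.devices   ## both luks PVs are now listed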


Moving to the second host, it was suggested based on Bug #2090169 to remove the device from the wwids file prior to upgrading the next node. This did not resolve the problem, and we had to apply the above workaround on two more hosts.

We have sosreports from the 4.4.10 boot, the 4.5 boot in maintenance mode, and the 4.5 boot after fixing it.

Comment 4 Nir Soffer 2022-06-10 20:17:05 UTC
Based on the info in the description (comment 0), this is not the same issue as bug 2090169.
In that case, removing the wwid from /etc/multipath/wwids should avoid this issue.

This may be an issue with LUKS-encrypted devices; I don't think this was tested
or even considered in the "vdsm-tool config-lvm-filter" tool.

We need to reproduce this by building such an RHV-H system.

It would help to get the output of "vdsm-tool config-lvm-filter" when run
on such a host before the upgrade.

As for fixing such a system, we can simplify by disabling the devices file
temporarily and importing only the required vgs. The example given is very risky if
the host has FC storage connected - it can import RHV storage domain vgs, and
even guest vgs from active lvs for raw disks.

Fixing instructions:

1. Activate the needed vgs, disabling the devices file temporarily:

    vgchange --devicesfile= -ay gluster_vg_luks_sdb gluster_vg_sdc

2. Import the devices to the system devices file

    vgimportdevices gluster_vg_luks_sdb gluster_vg_sdc
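
3. Verify the import (a suggested extra step, not part of the original instructions):

    lvmdevices

   This lists the entries in /etc/lvm/devices/system.devices; both luks PVs should now appear.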

Comment 8 Arik 2022-06-27 14:38:40 UTC
We're still looking for an environment where this happens.

Comment 24 Nir Soffer 2022-07-06 23:28:45 UTC
Lev, please check comment 22 and comment 23. I think this should be fixed
in imagebased (bind mount /gluster_bricks in the chroot?).

Comment 25 Nir Soffer 2022-07-06 23:39:56 UTC
(In reply to Sean Haselden from comment #0)
> In the rescue shell it was clear that it didn't boot because LVM did not
> activate the gluster devices, and it failed to mount the related filsystems:

This is explained by comments 22 and 23. So this is a new issue and
not related to bug 2090169.
not related to bug 2090169.

Comment 26 Lev Veyde 2022-07-07 01:33:27 UTC
(In reply to Nir Soffer from comment #24)
> Lev, please check comment 22 and comment 23. I think this should be fixed
> in imagebased (bind mount /gluster_bricks in the chroot?).

Shouldn't it access/detect it through /dev , just as it does with the LVM based volumes?

Comment 28 Arik 2022-07-07 07:59:45 UTC
*** Bug 2104515 has been marked as a duplicate of this bug. ***

Comment 32 Nir Soffer 2022-07-07 12:58:06 UTC
(In reply to Lev Veyde from comment #26)
> (In reply to Nir Soffer from comment #24)
> > Lev, please check comment 22 and comment 23. I think this should be fixed
> > in imagebased (bind mount /gluster_bricks in the chroot?).
> 
> Shouldn't it access/detect it through /dev , just as it does with the LVM
> based volumes?

No, it needs to see the mounts to detect the required lvs.

Run lsblk in the chroot - if it does not show the mountpoints for the lvs, the lvs
are not considered for creating the filter / adding to the devices file.
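
For reference, a minimal sketch of the bind-mount approach suggested in comment 24 (the chroot path /mnt/sysroot and the /proc and /sys bind mounts are assumptions for illustration, not what imagebased actually does):

mount --bind /proc /mnt/sysroot/proc       ## lsblk reads mount info from /proc
mount --bind /sys /mnt/sysroot/sys         ## and device topology from /sys
mount --bind /gluster_bricks /mnt/sysroot/gluster_bricks
chroot /mnt/sysroot lsblk                  ## the gluster lv mountpoints should now be visible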

Comment 37 bkaraore 2022-07-11 12:44:45 UTC
A temporary solution without Ansible:

Before upgrading, the following procedure can also be applied to avoid hypervisor boot issues.

* Remove LVM filters.
~~~
# sed -i /^filter/d /etc/lvm/lvm.conf
~~~
* Enable the system devices file. Search for *allow_mixed_block_sizes* in */etc/lvm/lvm.conf* and add a new line after it as follows.
~~~
# sed -i '/^allow_mixed_block_sizes/a use_devicesfile = 1' /etc/lvm/lvm.conf
~~~
* Populate the system devices file.
~~~
# vgimportdevices -a
~~~
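* Optionally verify the result before starting the upgrade (a suggested check, not part of the original procedure; both luks-backed PVs should now be listed).
~~~
# lvmdevices
~~~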

The upgrade will then proceed without this issue.

Comment 38 Arik 2022-07-11 14:22:12 UTC
The attached KCS was validated; please check out a minor suggestion to improve it in comment 35.
Since we don't have an easy way to handle this and it is a one-time issue (once fixed, it won't reproduce on future upgrades), following the KCS is the best way to go.

Comment 39 Arik 2022-07-11 14:23:28 UTC
(In reply to Arik from comment #38)
> The attached KCS was validated, please checkout a minor suggestion to
> improve it in comment 35

The only minor point I'd make about the KCS solution is to maybe use this in step 3:
  # vgimportdevices <volume group name>

Comment 40 Sean Haselden 2022-07-12 14:57:19 UTC
(In reply to Arik from comment #39)
> (In reply to Arik from comment #38)
> > The attached KCS was validated, please checkout a minor suggestion to
> > improve it in comment 35
> 
> Only minor point I'd make to the KCS solution is maybe using this in step 3:
>   # vgimportdevices <volume group name>

I assume this command will leave the current disks in the devices file and add the ones specified as part of "<volume group name>"? If so, I can make the edit.

Comment 41 Albert Esteve 2022-07-12 15:11:10 UTC
(In reply to Sean Haselden from comment #40)
> (In reply to Arik from comment #39)
> > (In reply to Arik from comment #38)
> > > The attached KCS was validated, please checkout a minor suggestion to
> > > improve it in comment 35
> > 
> > Only minor point I'd make to the KCS solution is maybe using this in step 3:
> >   # vgimportdevices <volume group name>
> 
> I assume this command will leave the current disks in the devices file and
> add the ones specified as part of "<volume group name>"? If so I can make
> the edit.

Yes, that is correct. vgimportdevices creates the devicesfile if none exists,
and appends new devices individually.

Actually, vdsm-tool invokes vgimportdevices in a loop for the proper devices
when we do "vdsm-tool config-lvm-filter".
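
For illustration only, the effect is roughly equivalent to this shell loop (a sketch, not vdsm's actual code, which drives this from Python; the VG names are the ones from this bug):

for vg in gluster_vg_luks_sdb gluster_vg_sdc; do
    vgimportdevices "$vg"
done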

