Bug 1670722 - Enabling gluster service fails when disk device names change after reboot.
Summary: Enabling gluster service fails when disk device names change after reboot.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhiv-1.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: RHHI-V 1.7
Assignee: Kaustav Majumder
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-30 09:09 UTC by Krist van Besien
Modified: 2023-09-07 19:41 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Gluster bricks were previously mounted using the device name. This device name could change after a system reboot, which made the brick unavailable. Bricks are now mounted using the UUID instead of the device name, avoiding this issue.
Clone Of:
Clones: 1693540
Environment:
Last Closed: 2020-02-13 15:57:20 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4613401 0 None None None 2019-11-26 22:43:44 UTC
Red Hat Product Errata RHBA-2020:0508 0 None None None 2020-02-13 15:57:34 UTC

Description Krist van Besien 2019-01-30 09:09:56 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Krist van Besien 2019-01-30 09:19:39 UTC
(Apparently you cannot edit the description anymore after you accidentally submit the BZ by touching the wrong key on your keyboard to dismiss an annoying dialog...)

Problem:

On an RHHI install, the gluster service was somehow not enabled during the install. Enabling it afterwards threw an error. In the log files we saw:

VDSM xx.host.local command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed: No medium found'"}


Additional info.

On these hosts the disk devices sometimes get renamed after a reboot. 

For example: 


[root@pasrhvnd00001b ~]# lsblk
NAME                                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                             8:0    0   200G  0 disk
├─sda1                                          8:1    0   200M  0 part /boot/efi
├─sda2                                          8:2    0     1G  0 part /boot
└─sda3                                          8:3    0 198.8G  0 part
  ├─rhvh-swap                                 253:0    0     4G  0 lvm  [SWAP]
  ├─rhvh-pool00_tmeta                         253:1    0     1G  0 lvm
  │ └─rhvh-pool00-tpool                       253:3    0 154.9G  0 lvm
  │   ├─rhvh-rhvh--4.2.7.5--0.20181121.0+1    253:4    0 127.9G  0 lvm  /
  │   ├─rhvh-pool00                           253:11   0 154.9G  0 lvm
  │   ├─rhvh-var_log_audit                    253:12   0     2G  0 lvm  /var/log/audit
  │   ├─rhvh-var_log                          253:13   0     8G  0 lvm  /var/log
  │   ├─rhvh-var                              253:14   0    15G  0 lvm  /var
  │   ├─rhvh-tmp                              253:15   0     1G  0 lvm  /tmp
  │   ├─rhvh-home                             253:16   0     1G  0 lvm  /home
  │   ├─rhvh-root                             253:17   0 127.9G  0 lvm
  │   └─rhvh-var_crash                        253:18   0    10G  0 lvm  /var/crash
  └─rhvh-pool00_tdata                         253:2    0 154.9G  0 lvm
    └─rhvh-pool00-tpool                       253:3    0 154.9G  0 lvm
      ├─rhvh-rhvh--4.2.7.5--0.20181121.0+1    253:4    0 127.9G  0 lvm  /
      ├─rhvh-pool00                           253:11   0 154.9G  0 lvm
      ├─rhvh-var_log_audit                    253:12   0     2G  0 lvm  /var/log/audit
      ├─rhvh-var_log                          253:13   0     8G  0 lvm  /var/log
      ├─rhvh-var                              253:14   0    15G  0 lvm  /var
      ├─rhvh-tmp                              253:15   0     1G  0 lvm  /tmp
      ├─rhvh-home                             253:16   0     1G  0 lvm  /home
      ├─rhvh-root                             253:17   0 127.9G  0 lvm
      └─rhvh-var_crash                        253:18   0    10G  0 lvm  /var/crash
sdb                                             8:16   0   150G  0 disk
└─gluster_vg_sdb-gluster_lv_engine            253:10   0   150G  0 lvm  /gluster_bricks/engine
sdc                                             8:32   0   500G  0 disk
├─gluster_vg_sdd-gluster_thinpool_sdd_tmeta   253:5    0     3G  0 lvm
│ └─gluster_vg_sdd-gluster_thinpool_sdd-tpool 253:7    0   494G  0 lvm
│   ├─gluster_vg_sdd-gluster_thinpool_sdd     253:8    0   494G  0 lvm
│   └─gluster_vg_sdd-gluster_lv_oradata       253:9    0   500G  0 lvm  /gluster_bricks/oradata
└─gluster_vg_sdd-gluster_thinpool_sdd_tdata   253:6    0   494G  0 lvm
  └─gluster_vg_sdd-gluster_thinpool_sdd-tpool 253:7    0   494G  0 lvm
    ├─gluster_vg_sdd-gluster_thinpool_sdd     253:8    0   494G  0 lvm
    └─gluster_vg_sdd-gluster_lv_oradata       253:9    0   500G  0 lvm  /gluster_bricks/oradata
sdd                                             8:48   0   500G  0 disk
└─vdo_sde                                     253:19   0   4.4T  0 vdo
  └─gluster_vg_sde-gluster_lv_oraadmin        253:24   0   500G  0 lvm  /gluster_bricks/oraadmin
sde                                             8:64   0     2T  0 disk
└─vdo_sdf                                     253:20   0  17.6T  0 vdo
  └─gluster_vg_sdf-gluster_lv_vmstore1        253:23   0     2T  0 lvm  /gluster_bricks/vmstore1
sdf                                             8:80   0     2T  0 disk
└─vdo_sdg                                     253:21   0  17.6T  0 vdo
  └─gluster_vg_sdg-gluster_lv_vmstore2        253:22   0     2T  0 lvm  /gluster_bricks/vmstore2

On this host, sdc was actually sdd during the install, as you can see from the LV names; likewise sdd was sde, and sdf was sdg.
A device named sdg no longer exists.

But the disk is still there, as the output of lvs shows:

[root@pasrhvnd00001b ~]# lvs
  /dev/sdg: open failed: No medium found
  LV                          VG             Attr       LSize    Pool                 Origin                    Data%  Meta%  Move Log Cpy%Sync Convert
  gluster_lv_engine           gluster_vg_sdb -wi-ao---- <150.00g
  gluster_lv_oradata          gluster_vg_sdd Vwi-aotz--  500.00g gluster_thinpool_sdd                           0.05
  gluster_thinpool_sdd        gluster_vg_sdd twi-aotz-- <494.00g                                                0.05   0.54
  gluster_lv_oraadmin         gluster_vg_sde -wi-ao----  500.00g
  gluster_lv_vmstore1         gluster_vg_sdf -wi-ao----    1.95t
  gluster_lv_vmstore2         gluster_vg_sdg -wi-ao----    1.95t
  home                        rhvh           Vwi-aotz--    1.00g pool00                                         4.79
  pool00                      rhvh           twi-aotz-- <154.88g                                                4.17   1.99
  rhvh-4.2.7.5-0.20181121.0   rhvh           Vwi---tz-k <127.88g pool00               root
  rhvh-4.2.7.5-0.20181121.0+1 rhvh           Vwi-aotz-- <127.88g pool00               rhvh-4.2.7.5-0.20181121.0 3.18
  root                        rhvh           Vwi-a-tz-- <127.88g pool00                                         3.10
  swap                        rhvh           -wi-ao----    4.00g
  tmp                         rhvh           Vwi-aotz--    1.00g pool00                                         38.46
  var                         rhvh           Vwi-aotz--   15.00g pool00                                         4.49
  var_crash                   rhvh           Vwi-aotz--   10.00g pool00                                         2.86
  var_log                     rhvh           Vwi-aotz--    8.00g pool00                                         7.75
  var_log_audit               rhvh           Vwi-aotz--    2.00g pool00                                         6.82

And this is exactly the error shown in the log. The debug info in /var/log/vdsm/supervdsm.log confirms that it is indeed these errors, thrown during the execution of lvs, that cause the issue.

I removed /dev/sdg and afterwards we could enable the gluster service.
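
For illustration, one way to drop such a stale SCSI device node from the kernel is via sysfs; the device name sdg below is just the one from this example and would need to be adjusted for the host in question:

# Confirm the node really has no medium behind it before removing it:
lsblk /dev/sdg
# Ask the kernel to delete the stale SCSI device (run as root):
echo 1 > /sys/block/sdg/device/delete
# lvs should no longer print "open failed: No medium found" afterwards:
lvs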

This is similar to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1659774

What we need is either a fix for vdsm to handle those missing disks more gracefully, or the RHHI docs need to describe how to set up your system so that disk device names do not change.


Comment 3 SATHEESARAN 2019-02-19 20:15:23 UTC
what is the proposed fix ?

Comment 4 Sahina Bose 2019-02-21 08:29:39 UTC
(In reply to SATHEESARAN from comment #3)
> what is the proposed fix ?

We don't know yet - investigating.

Could you provide the /var/log/messages, /var/log/glusterfs/glusterd.log and dmesg from the host?
The vdsm error "VDSM xx.host.local command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed: No medium found'"}" only results in the storage devices not being reported to the engine; it does not cause the gluster service start issue.

Comment 5 Krist van Besien 2019-02-21 09:38:42 UTC
(In reply to Sahina Bose from comment #4)
> (In reply to SATHEESARAN from comment #3)
> > what is the proposed fix ?
> 
> We don't know yet - investigating.
> 
> Could you provide the /var/log/messages, /var/log/glusterfs/glusterd.log and
> dmesg from the host ?
> The vdsm error "VDSM xx.host.local command GetStorageDeviceListVDS failed:
> Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed:
> No medium found'"}" only results in the storage devices being not reported
> to engine. Does not result in gluster service start issue.

I am no longer on site with the customer, so I am not sure I can get these files.
What I can confirm, however, is that once we removed this device, enabling the gluster service in the admin console was successful.

Comment 8 Sahina Bose 2019-03-07 05:35:55 UTC
Kaustav, is this related to Bug 1679458?

Comment 9 Kaustav Majumder 2019-03-11 10:26:31 UTC
(In reply to Sahina Bose from comment #8)
> Kaustav, is this related to Bug 1679458?

I was unable to reproduce this issue, although it seems similar to the bug mentioned, wherein the storage device names change on reboot.

Comment 11 Sahina Bose 2019-03-11 12:53:53 UTC
Adding a needinfo to look at the logs attached

Comment 17 SATHEESARAN 2019-03-21 06:58:26 UTC
As the issue has not been root-caused, this bug is removed from RHHI-V 1.6.

Comment 18 Sahina Bose 2019-03-26 07:29:17 UTC
Going through the glusterd logs, there are logs from 01-14-2019 up to 01-16-2019, but no logs from 01-28-2019, when the device change happened.
Hosted-engine_Logs.txt:2019-01-28 22:45:59,728+02 WARN  [org.ovirt.engine.core.vdsbroker.gluster.GetStorageDeviceListVDSCommand] (default task-23) [1e6487ca] Unexpected return value: Status [code=-32603, message=Internal JSON-RPC error: {'reason': "'rhvh-/dev/sdg: open failed: No medium found'"}]


It's likely that the gluster service could not be started because the bricks mentioned in the vol file could not be found, as there are errors related to gluster_vg_sd*:
./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,193::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sdg' after scanning pvs
./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,238::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sdf' after scanning pvs
./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,285::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sde' after scanning pvs

This is not a problem with the vdsm code; the change in device order is what causes these issues.

I'm not sure if there's anything that can be done with the way the LVs are created to avoid this problem. 
Sachi, should we use disk uuid for brick provisioning?
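
For context, LVM itself identifies physical volumes by their own UUIDs, which is why gluster_vg_sdd still assembles even though its disk now shows up as /dev/sdc; the sdX suffix in the VG name is only a label chosen at deployment time. A quick way to see which device currently backs each VG (a sketch, assuming the standard lvm2 tools are installed):

# Show each PV with its LVM UUID and owning VG; the PV UUID stays the same
# even when the /dev/sdX letters move around after a reboot.
pvs -o pv_name,pv_uuid,vg_name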

Comment 19 Sahina Bose 2019-03-26 07:30:11 UTC
Question to the LVM team: what could cause the device rename on reboot? Is there a way to prevent this?
Mike, adding a needinfo on you. Could you help redirect this to the relevant contact?

Comment 20 Sachidananda Urs 2019-03-26 11:11:09 UTC
(In reply to Sahina Bose from comment #18)
> Going through the glusterd logs, there are logs from 01-14-2019 upto
> 01-16-2019, but no logs from 01-28-2019 when the device change happened.
> Hosted-engine_Logs.txt:2019-01-28 22:45:59,728+02 WARN 
> [org.ovirt.engine.core.vdsbroker.gluster.GetStorageDeviceListVDSCommand]
> (default task-23) [1e6487ca] Unexpected return value: Status [code=-32603,
> message=Internal JSON-RPC error: {'reason': "'rhvh-/dev/sdg: open failed: No
> medium found'"}]
> 
> 
> It's likely that the gluster service could not be started as the bricks
> mentioned in the vol file could not be found, as there are errors related to
> gluster_vg_sd*
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,193::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sdg' after scanning pvs
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,238::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sdf' after scanning pvs
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,285::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sde' after scanning pvs
> 
> This is not a problem with vdsm code, as the device order has changed
> causing these issues.
> 
> I'm not sure if there's anything that can be done with the way the LVs are
> created to avoid this problem. 
> Sachi, should we use disk uuid for brick provisioning?

Sahina, the UUID is used while mounting the device. We can do that: instead of putting the device name, we can put the UUID of the disk. As far as I know, the UUID does not change when the disk name changes during a reboot.
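
A minimal sketch of what that looks like, using the engine brick from the lsblk output above as the example device (the mount options shown are only illustrative):

# Read the XFS filesystem UUID of the brick LV:
blkid -s UUID -o value /dev/mapper/gluster_vg_sdb-gluster_lv_engine
# fstab entry that references a device path:
#   /dev/mapper/gluster_vg_sdb-gluster_lv_engine  /gluster_bricks/engine  xfs  inode64,noatime,nodiratime  0 0
# The same mount expressed with the filesystem UUID, which is independent of how
# the kernel names the underlying disk (substitute the value blkid printed):
#   UUID=<uuid-from-blkid>  /gluster_bricks/engine  xfs  inode64,noatime,nodiratime  0 0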

Comment 21 Mike Snitzer 2019-03-26 17:56:11 UTC
(In reply to Sachidananda Urs from comment #20)

> Sahina, UUID is used while mounting the device. We can do that, instead of
> putting
> the device name, we can put the UUID of the disk. As far as I know, UUID
> does not
> change when disk name changes during reboot.

Correct, please use UUID.

Using the device name (particularly a SCSI device name) will not work reliably unless the device itself takes steps to use UUID-based naming or provides persistent naming (e.g. dm-multipath via device-mapper-multipath).
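
For whole-disk references, udev also maintains persistent symlinks under /dev/disk/ that survive sdX reordering; a quick look, assuming a standard udev setup:

# Names derived from hardware serial numbers / WWNs:
ls -l /dev/disk/by-id/
# Names derived from filesystem UUIDs:
ls -l /dev/disk/by-uuid/
# Names derived from the physical bus path:
ls -l /dev/disk/by-path/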

Comment 22 SATHEESARAN 2019-03-28 06:38:11 UTC
(In reply to Mike Snitzer from comment #21)
> (In reply to Sachidananda Urs from comment #20)
> 
> > Sahina, UUID is used while mounting the device. We can do that, instead of
> > putting
> > the device name, we can put the UUID of the disk. As far as I know, UUID
> > does not
> > change when disk name changes during reboot.
> 
> Correct, please use UUID.
> 
> Using the device name (particularly SCSI device name) will not work reliably
> unless the device itself takes steps to use UUID based naming or provides
> persistent naming (e.g. dm-multipath via device-mapper-multipath).

As I read the comments, I am getting 2 things here:
1. Create bricks with the disk UUID.
I am not sure how this would be done. Are we targeting this?

2. Mounting the XFS filesystems with the XFS UUID.
This can be done with changes in gluster-ansible.

Which are we targeting here?

Comment 24 Sachidananda Urs 2019-03-28 09:01:06 UTC
(In reply to SATHEESARAN from comment #22)
> (In reply to Mike Snitzer from comment #21)
> > (In reply to Sachidananda Urs from comment #20)
> > 

> As I read the comments, I am getting 2 things here:
> 1. Create bricks with disk UUID
> I am not sure, how is this being done ?
> Are we targeting this.

We are not targeting this. This can't be done; you only get device attributes like the label or UUID upon creation of the LVM volume or filesystem.

> 
> 2. Mounting the XFS filesystems with the XFS UUID.
> This can be done with changes in gluster-ansible
> 
> Which are we targetting here ?

We are targeting this.

Comment 25 Sachidananda Urs 2019-03-29 08:49:39 UTC
PR: https://github.com/gluster/gluster-ansible-infra/pull/55 fixes the issue.

Comment 29 SATHEESARAN 2020-01-07 12:00:46 UTC
Verified with gluster-ansible-roles-1.0.5-7

XFS filesystems are mounted using the XFS UUID:

<snip>
UUID=3bc4e3bf-c779-4af9-8783-f25741012643 /gluster_bricks/engine xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=bf0008a9-7f78-4f88-897e-a4a33f455597 /gluster_bricks/data xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=8fe40512-2c47-45af-99d8-282e681d0e13 /gluster_bricks/vmstore xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=710c8a19-fdbe-401d-994a-23a0f8438c37 /gluster_bricks/test xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
</snip>
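
A quick way to make the same check on any host (a sketch; it assumes the bricks are mounted under /gluster_bricks as above):

# Every brick entry in fstab should start with UUID= rather than a device path:
awk '$2 ~ /^\/gluster_bricks/ {print $1, $2}' /etc/fstab
# Cross-check what is actually mounted:
findmnt -t xfs -o TARGET,SOURCE,UUID | grep gluster_bricks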

Comment 31 errata-xmlrpc 2020-02-13 15:57:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0508

