Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
(Apparently you cannot edit the description anymore, after you accidentally submit the BZ by touching the wrong key on your keyboard to dismiss an annoying dialog...)

Problem:

On a RHHI install the gluster service was somehow not enabled during the install. Enabling it afterwards threw an error. In the log files we saw:

VDSM xx.host.local command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed: No medium found'"}

Additional info:

On these hosts the disk devices sometimes get renamed after a reboot. For example:

[root@pasrhvnd00001b ~]# lsblk
NAME                                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                             8:0    0   200G  0 disk
├─sda1                                          8:1    0   200M  0 part /boot/efi
├─sda2                                          8:2    0     1G  0 part /boot
└─sda3                                          8:3    0 198.8G  0 part
  ├─rhvh-swap                                 253:0    0     4G  0 lvm  [SWAP]
  ├─rhvh-pool00_tmeta                         253:1    0     1G  0 lvm
  │ └─rhvh-pool00-tpool                       253:3    0 154.9G  0 lvm
  │   ├─rhvh-rhvh--4.2.7.5--0.20181121.0+1    253:4    0 127.9G  0 lvm  /
  │   ├─rhvh-pool00                           253:11   0 154.9G  0 lvm
  │   ├─rhvh-var_log_audit                    253:12   0     2G  0 lvm  /var/log/audit
  │   ├─rhvh-var_log                          253:13   0     8G  0 lvm  /var/log
  │   ├─rhvh-var                              253:14   0    15G  0 lvm  /var
  │   ├─rhvh-tmp                              253:15   0     1G  0 lvm  /tmp
  │   ├─rhvh-home                             253:16   0     1G  0 lvm  /home
  │   ├─rhvh-root                             253:17   0 127.9G  0 lvm
  │   └─rhvh-var_crash                        253:18   0    10G  0 lvm  /var/crash
  └─rhvh-pool00_tdata                         253:2    0 154.9G  0 lvm
    └─rhvh-pool00-tpool                       253:3    0 154.9G  0 lvm
      ├─rhvh-rhvh--4.2.7.5--0.20181121.0+1    253:4    0 127.9G  0 lvm  /
      ├─rhvh-pool00                           253:11   0 154.9G  0 lvm
      ├─rhvh-var_log_audit                    253:12   0     2G  0 lvm  /var/log/audit
      ├─rhvh-var_log                          253:13   0     8G  0 lvm  /var/log
      ├─rhvh-var                              253:14   0    15G  0 lvm  /var
      ├─rhvh-tmp                              253:15   0     1G  0 lvm  /tmp
      ├─rhvh-home                             253:16   0     1G  0 lvm  /home
      ├─rhvh-root                             253:17   0 127.9G  0 lvm
      └─rhvh-var_crash                        253:18   0    10G  0 lvm  /var/crash
sdb                                             8:16   0   150G  0 disk
└─gluster_vg_sdb-gluster_lv_engine            253:10   0   150G  0 lvm  /gluster_bricks/engine
sdc                                             8:32   0   500G  0 disk
├─gluster_vg_sdd-gluster_thinpool_sdd_tmeta   253:5    0     3G  0 lvm
│ └─gluster_vg_sdd-gluster_thinpool_sdd-tpool 253:7    0   494G  0 lvm
│   ├─gluster_vg_sdd-gluster_thinpool_sdd     253:8    0   494G  0 lvm
│   └─gluster_vg_sdd-gluster_lv_oradata       253:9    0   500G  0 lvm  /gluster_bricks/oradata
└─gluster_vg_sdd-gluster_thinpool_sdd_tdata   253:6    0   494G  0 lvm
  └─gluster_vg_sdd-gluster_thinpool_sdd-tpool 253:7    0   494G  0 lvm
    ├─gluster_vg_sdd-gluster_thinpool_sdd     253:8    0   494G  0 lvm
    └─gluster_vg_sdd-gluster_lv_oradata       253:9    0   500G  0 lvm  /gluster_bricks/oradata
sdd                                             8:48   0   500G  0 disk
└─vdo_sde                                     253:19   0   4.4T  0 vdo
  └─gluster_vg_sde-gluster_lv_oraadmin        253:24   0   500G  0 lvm  /gluster_bricks/oraadmin
sde                                             8:64   0     2T  0 disk
└─vdo_sdf                                     253:20   0  17.6T  0 vdo
  └─gluster_vg_sdf-gluster_lv_vmstore1        253:23   0     2T  0 lvm  /gluster_bricks/vmstore1
sdf                                             8:80   0     2T  0 disk
└─vdo_sdg                                     253:21   0  17.6T  0 vdo
  └─gluster_vg_sdg-gluster_lv_vmstore2        253:22   0     2T  0 lvm  /gluster_bricks/vmstore2

On this host sdc was actually sdd during install, as you can see from the LV names. Likewise sdd was sde, sde was sdf, and sdf was sdg. sdg is now no longer there.
But the disk device is still there, as the output of lvs shows:

[root@pasrhvnd00001b ~]# lvs
  /dev/sdg: open failed: No medium found
  LV                          VG             Attr       LSize    Pool                 Origin                    Data%  Meta%  Move Log Cpy%Sync Convert
  gluster_lv_engine           gluster_vg_sdb -wi-ao---- <150.00g
  gluster_lv_oradata          gluster_vg_sdd Vwi-aotz--  500.00g gluster_thinpool_sdd                           0.05
  gluster_thinpool_sdd        gluster_vg_sdd twi-aotz-- <494.00g                                                0.05   0.54
  gluster_lv_oraadmin         gluster_vg_sde -wi-ao----  500.00g
  gluster_lv_vmstore1         gluster_vg_sdf -wi-ao----    1.95t
  gluster_lv_vmstore2         gluster_vg_sdg -wi-ao----    1.95t
  home                        rhvh           Vwi-aotz--    1.00g pool00                                         4.79
  pool00                      rhvh           twi-aotz-- <154.88g                                                4.17   1.99
  rhvh-4.2.7.5-0.20181121.0   rhvh           Vwi---tz-k <127.88g pool00               root
  rhvh-4.2.7.5-0.20181121.0+1 rhvh           Vwi-aotz-- <127.88g pool00               rhvh-4.2.7.5-0.20181121.0 3.18
  root                        rhvh           Vwi-a-tz-- <127.88g pool00                                         3.10
  swap                        rhvh           -wi-ao----    4.00g
  tmp                         rhvh           Vwi-aotz--    1.00g pool00                                         38.46
  var                         rhvh           Vwi-aotz--   15.00g pool00                                         4.49
  var_crash                   rhvh           Vwi-aotz--   10.00g pool00                                         2.86
  var_log                     rhvh           Vwi-aotz--    8.00g pool00                                         7.75
  var_log_audit               rhvh           Vwi-aotz--    2.00g pool00                                         6.82

And this is exactly the error shown in the log. The debug info in /var/log/vdsm/supervdsm.log confirms that it is indeed those errors thrown during the execution of lvs that cause the issue.

I removed /dev/sdg and afterwards we could enable the gluster service.

This is similar to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1659774

What we need is either a fix for vdsm to handle those missing disks more gracefully, or the RHHI docs need to describe how to set up your system so that disk device names do not change.
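For reference, removing the stale device node (or telling LVM to ignore it) can look roughly like the following; this is a sketch of the usual approach, not necessarily the exact commands that were run on this host:

<snip>
# Drop the stale SCSI device node via sysfs (only if the disk really is gone)
echo 1 > /sys/block/sdg/device/delete

# Or make LVM skip the device, in /etc/lvm/lvm.conf:
# devices {
#     global_filter = [ "r|^/dev/sdg$|" ]
# }
</snip>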
what is the proposed fix ?
(In reply to SATHEESARAN from comment #3)
> what is the proposed fix ?

We don't know yet - investigating.

Could you provide the /var/log/messages, /var/log/glusterfs/glusterd.log and dmesg from the host?

The vdsm error "VDSM xx.host.local command GetStorageDeviceListVDS failed: Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed: No medium found'"}" only results in the storage devices not being reported to engine; it does not result in a gluster service start issue.
(In reply to Sahina Bose from comment #4)
> (In reply to SATHEESARAN from comment #3)
> > what is the proposed fix ?
>
> We don't know yet - investigating.
>
> Could you provide the /var/log/messages, /var/log/glusterfs/glusterd.log and
> dmesg from the host ?
> The vdsm error "VDSM xx.host.local command GetStorageDeviceListVDS failed:
> Internal JSON-RPC error: {'reason': "'gluster_vg_sda-/dev/sdd: open failed:
> No medium found'"}" only results in the storage devices being not reported
> to engine. Does not result in gluster service start issue.

I am no longer on site with the customer, so I am not sure I can get these files. What I can confirm, however, is that once we removed this device, enabling the gluster service in the admin console was successful.
Kaustav, is this related to Bug 1679458?
(In reply to Sahina Bose from comment #8)
> Kaustav, is this related to Bug 1679458?

Unable to reproduce this issue, although it does seem similar to the bug mentioned, in that the storage device names changed on reboot.
Adding a needinfo to look at the logs attached
As the issue has not been root-caused, this bug is removed from RHHI-V 1.6.
Going through the glusterd logs, there are logs from 01-14-2019 up to 01-16-2019, but no logs from 01-28-2019 when the device change happened.

Hosted-engine_Logs.txt:2019-01-28 22:45:59,728+02 WARN [org.ovirt.engine.core.vdsbroker.gluster.GetStorageDeviceListVDSCommand] (default task-23) [1e6487ca] Unexpected return value: Status [code=-32603, message=Internal JSON-RPC error: {'reason': "'rhvh-/dev/sdg: open failed: No medium found'"}]

It's likely that the gluster service could not be started because the bricks mentioned in the vol file could not be found, as there are errors related to gluster_vg_sd*:

./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,193::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sdg' after scanning pvs
./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,238::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sdf' after scanning pvs
./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12 09:07:18,285::devicetree::718::blivet::(addUdevLVDevice) failed to find vg 'gluster_vg_sde' after scanning pvs

This is not a problem with the vdsm code; the device order has changed, causing these issues. I'm not sure if there's anything that can be done with the way the LVs are created to avoid this problem.

Sachi, should we use the disk UUID for brick provisioning?
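For context, LVM identifies PVs by the UUID stored in their metadata, so the VGs themselves still assemble after the rename (as the lsblk output in the description shows); something like the following (illustrative commands, not output captured from this host) shows which kernel device currently backs each VG:

<snip>
# Map current kernel device names to VG names and PV UUIDs
pvs -o pv_name,vg_name,pv_uuid

# Cross-check filesystem UUIDs and mountpoints against the current device names
lsblk -o NAME,UUID,MOUNTPOINT
</snip>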
Question to the LVM team: what could cause the device rename on reboot? Is there a way to prevent this?

Mike, adding a needinfo on you. Could you help redirect to the relevant contact?
(In reply to Sahina Bose from comment #18)
> Going through the glusterd logs, there are logs from 01-14-2019 upto
> 01-16-2019, but no logs from 01-28-2019 when the device change happened.
> Hosted-engine_Logs.txt:2019-01-28 22:45:59,728+02 WARN
> [org.ovirt.engine.core.vdsbroker.gluster.GetStorageDeviceListVDSCommand]
> (default task-23) [1e6487ca] Unexpected return value: Status [code=-32603,
> message=Internal JSON-RPC error: {'reason': "'rhvh-/dev/sdg: open failed: No
> medium found'"}]
>
> It's likely that the gluster service could not be started as the bricks
> mentioned in the vol file could not be found, as there are errors related to
> gluster_vg_sd*
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,193::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sdg' after scanning pvs
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,238::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sdf' after scanning pvs
> ./supervdsm.log:MainProcess|jsonrpc/6::ERROR::2019-03-12
> 09:07:18,285::devicetree::718::blivet::(addUdevLVDevice) failed to find vg
> 'gluster_vg_sde' after scanning pvs
>
> This is not a problem with vdsm code, as the device order has changed
> causing these issues.
>
> I'm not sure if there's anything that can be done with the way the LVs are
> created to avoid this problem.
> Sachi, should we use disk uuid for brick provisioning?

Sahina, the UUID is used while mounting the device. We can do that: instead of putting in the device name, we can put in the UUID of the disk. As far as I know, the UUID does not change when the disk name changes during a reboot.
(In reply to Sachidananda Urs from comment #20)
> Sahina, UUID is used while mounting the device. We can do that, instead of
> putting
> the device name, we can put the UUID of the disk. As far as I know, UUID
> does not
> change when disk name changes during reboot.

Correct, please use UUID.

Using the device name (particularly the SCSI device name) will not work reliably unless the device itself takes steps to use UUID-based naming or provides persistent naming (e.g. dm-multipath via device-mapper-multipath).
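For illustration, udev already maintains persistent symlinks under /dev/disk that survive probe-order changes (generic commands, not output from the affected host):

<snip>
ls -l /dev/disk/by-uuid/   # filesystem UUIDs
ls -l /dev/disk/by-id/     # WWN/serial-based names
ls -l /dev/disk/by-path/   # bus/path-based names
</snip>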
(In reply to Mike Snitzer from comment #21)
> (In reply to Sachidananda Urs from comment #20)
>
> > Sahina, UUID is used while mounting the device. We can do that, instead of
> > putting
> > the device name, we can put the UUID of the disk. As far as I know, UUID
> > does not
> > change when disk name changes during reboot.
>
> Correct, please use UUID.
>
> Using the device name (particularly SCSI device name) will not work reliably
> unless the device itself takes steps to use UUID based naming or provides
> persistent naming (e.g. dm-multipath via device-mapper-multipath).

As I read the comments, I am getting 2 things here:

1. Create bricks with the disk UUID.
   I am not sure how this would be done. Are we targeting this?

2. Mounting the XFS filesystems with the XFS UUID.
   This can be done with changes in gluster-ansible.

Which are we targeting here?
(In reply to SATHEESARAN from comment #22)
> (In reply to Mike Snitzer from comment #21)
> > (In reply to Sachidananda Urs from comment #20)
>
> As I read the comments, I am getting 2 things here:
> 1. Create bricks with disk UUID
> I am not sure, how is this being done ?
> Are we targeting this.

We are not targeting this. It can't be done; device attributes like the label or UUID only exist once the LVM volume or filesystem has been created ...

> 2. Mounting the XFS filesystems with the XFS UUID.
> This can be done with changes in gluster-ansible
>
> Which are we targetting here ?

We are targeting this.
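To illustrate option 2 (a sketch only, not the actual gluster-ansible change): the XFS UUID can be read with blkid and then used in the fstab entry instead of the device path:

<snip>
# Read the filesystem UUID of a brick LV (device path taken from the lsblk output above)
blkid /dev/mapper/gluster_vg_sdb-gluster_lv_engine

# fstab entry keyed on the UUID rather than the kernel device name (UUID value is only an example)
UUID=3bc4e3bf-c779-4af9-8783-f25741012643  /gluster_bricks/engine  xfs  inode64,noatime,nodiratime,_netdev  0 0
</snip>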
PR: https://github.com/gluster/gluster-ansible-infra/pull/55 fixes the issue.
Verified with gluster-ansible-roles-1.0.5-7.

XFS filesystems are mounted using the XFS UUID:

<snip>
UUID=3bc4e3bf-c779-4af9-8783-f25741012643 /gluster_bricks/engine xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=bf0008a9-7f78-4f88-897e-a4a33f455597 /gluster_bricks/data xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=8fe40512-2c47-45af-99d8-282e681d0e13 /gluster_bricks/vmstore xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
UUID=710c8a19-fdbe-401d-994a-23a0f8438c37 /gluster_bricks/test xfs inode64,noatime,nodiratime,_netdev,x-systemd.device-timeout=0,x-systemd.requires=vdo.service 0 0
</snip>
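To double-check which device a given UUID resolves to (illustrative commands, UUID taken from the fstab above):

<snip>
blkid -U 3bc4e3bf-c779-4af9-8783-f25741012643   # prints the backing device
findmnt /gluster_bricks/engine                  # shows the mounted source and options
</snip>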
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0508