Bug 1310330 - [RFE] Provide a way to remove stale LUNs from hypervisors
Status: NEW
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 4.0.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Allon Mureinik
QA Contact: Raz Tamir
Keywords: FutureFeature, Reopened
Depends On:
Blocks: 1417161
Reported: 2016-02-20 08:13 EST by Greg Scott
Modified: 2017-11-16 04:21 EST
CC: 33 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-14 04:51:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
ratamir: testing_plan_complete-


Attachments
domain dialog with warnings about unzoned iSCSI LUN (48.47 KB, image/png)
2016-10-04 11:36 EDT, Tim Speetjens


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 880738 None None None 2016-02-20 08:13 EST
Red Hat Knowledge Base (Solution) 129983 None None None 2016-11-07 19:59 EST

Description Greg Scott 2016-02-20 08:13:56 EST
Description of problem:
Bug number 880738 asks for the capability for RHEV-M to orchestrate getting rid of stale LUNs after removing a storage domain.  That RFE was satisfied in 3.6.1 by reducing the impact of stale LUNs, but stale LUN removal itself was not automated.  We were asked to put together this RFE to request that same capability.

Version-Release number of selected component (if applicable):
3.n, 4.n

How reproducible:
At will

Steps to Reproduce:
1. Remove a storage domain
2. Remove the associated LUNs advertised by the SAN
3. RHEV hypervisors still have references to these now non-existent LUNs.
4. Either log into each hypervisor individually and remove these stale LUNs by hand, or put hypervisors into maintenance mode one by one and do a rolling reboot.

Actual results:
Hours of wasted time.  With 68 hypervisors and 2500 VMs, 2 minutes to live-migrate each VM, 5 minutes to reboot each hypervisor, plus overhead, the time to clean up after removing a storage domain adds up to around 6000 minutes for each cycle (2500 VMs x 2 minutes + 68 hypervisors x 5 minutes is roughly 5340 minutes before overhead).  For customers who need to rotate storage domains regularly, the unnecessary overhead is unacceptable.

Removing all the paths and LUNs by hand on each hypervisor may take less time once the UUID paths are identified, but the process is error prone and requires an unacceptably high skill level.

Expected results:
Pasting word for word from bug number 880738:
If RHEV-H is supposed to be an appliance managed by RHEV-M, then RHEV-M should also be orchestrating the storage removal as well, all the way down to removing the paths.

Additional info:
The original bug number 880738 links to 39 support cases.  Adding this capability will save countless support and customer hours.
Comment 1 Nir Soffer 2016-02-20 16:35:38 EST
(In reply to Greg Scott from comment #0)
> Steps to Reproduce:
> 1. Remove a storage domain
> 2. Remove the associated LUNs advertised by the SAN
> 3. RHEV hypervisors still have references to these now non-existant LUNs.
> 4. Either log into each hypervisor individually and remove these stale LUNs
> by hand, or put hypervisors into maintenance mode one by one and do a
> rolling reboot.
> 
> Actual results:
> Hours of wasted time.  With 68 hypervisors and 2500 VMs, 2 minutes to
> live-migrate each VM, 5 minutes to reboot each hypervisor, plus overhead,
> the time to clean up after removing a storage domain adds up to around 6000
> minutes for each cycle.  

You don't need to live migrate vms or reboot a host to remove stale devices.

> For customers who need to rotate storage domains
> regularly, the unnecessary overhead is unacceptable.

Why do you need to rotate storage domains regularly?

What is regularly?

> Removing all the paths and LUNs by hand on each hypervisor may take less
> time once the UUID paths are identified, but the process is error prone and
> requires an unacceptably high skill level.

The storage administrator who provided the LUNs in the first place has
all the info needed to remove them - the LUN GUID. Using the GUID, you can
find the underlying devices and remove the multipath device and the
underlying SCSI devices.

There is nothing RHEV-specific about the stale devices; they are not used
by RHEV at this point. The procedure is the same for any host that
has stale devices.

I suggest we start by documenting this procedure for RHEV customers.

Maybe a tool for removing stale devices could be used to prevent errors?
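
For illustration, a minimal sketch of what such a helper could look like, assuming the WWID of the stale multipath device is known and nothing on the host still uses it. The sysfs paths and the multipath command below are the standard Linux mechanisms, not an existing RHEV tool:

#!/usr/bin/env python
# Sketch only: removes a stale multipath device and its SCSI paths by WWID.
# Assumes the LUN is already unzoned and nothing on the host still uses it.
import glob
import os
import subprocess
import sys

def find_dm_node(wwid):
    """Return the dm-N node backing the multipath map for this WWID, or None."""
    for uuid_path in glob.glob("/sys/block/dm-*/dm/uuid"):
        with open(uuid_path) as f:
            if f.read().strip() == "mpath-" + wwid:
                return uuid_path.split("/")[3]  # "dm-N"
    return None

def remove_stale_lun(wwid):
    dm_node = find_dm_node(wwid)
    if dm_node is None:
        raise SystemExit("no multipath map found for WWID %s" % wwid)
    # Record the underlying SCSI devices before the map is flushed.
    slaves = os.listdir("/sys/block/%s/slaves" % dm_node)
    # Flush (remove) the multipath map.
    subprocess.check_call(["multipath", "-f", wwid])
    # Delete each underlying SCSI device from the SCSI layer.
    for sd in slaves:
        with open("/sys/block/%s/device/delete" % sd, "w") as f:
            f.write("1")

if __name__ == "__main__":
    remove_stale_lun(sys.argv[1])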

> Expected results:
> Pasting word for word from bug number 880738:
> If RHEV-H is supposed to be an appliance managed by RHEV-M, then RHEV-M
> should also be orchestrating the storage removal as well, all the way down
> to removing the paths.

RHEV-M does not control adding the devices to the hypervisors, so it is 
arguable if it should control removal of the devices. The removal process
can be automated by other means.

RHEV-H may be harder to automate, but this should be solved by RHEV-H.

Fabian, do we have a solution for RHEV-H for automating operations in a cluster?
Comment 2 Greg Scott 2016-02-20 18:01:13 EST
> Why do you need to rotate storage domains regularly?
>
> What is regularly?

Once per month.  The application uses a few thousand Windows 7 VMs in pools, created from a template.  Every month the template is patched, and new pools and VMs are created from the newly patched template.  At least until 3.6, they had to create a new storage domain and get rid of the old one every month to keep enough contiguous free space in the SAN. Creating new pools and VMs in the same storage domain apparently caused too much fragmentation, so Red Hat recommended using new storage domains on a new set of LUNs.

So every month, the customer has to go through this process involving hours and hours and hours of manual work.

> RHEV-M does not control adding the devices to the hypervisors . . .

OK, this is technically true with fiber channel storage domains, but not for iSCSI or NFS.  For iSCSI and NFS, RHEV-M tells RHEV-H everything it needs to log in or mount and set up the correct LVM entities.  For fiber channel, rhev-m already "sees" the LUN and rhev-m tells rhev-h to do the rest.  OK, fair enough.

So when I tell rhev-m to tear down a storage domain, rhev-m knows everything it needs to instruct the SPM host to take care of business, including all the fiber channel WWID info.  It should be possible to store that info somewhere, so when the storage administrator gets rid of the underlying LUN, I can tell rhev-m to tell all the rhev-h systems to get rid of references to the now-stale LUN.

Why do it? Well, we did tell lots of paying customers we were going to do it back in 2013.  That seems like a pretty good reason to me.
Comment 3 Greg Scott 2016-02-21 00:40:53 EST
Typo above and bz doesn't let me edit comments. This sentence:

>  For fiber channel, rhev-m already "sees" the LUN and rhev-m tells rhev-h to do
> the rest.

Should say

For fiber channel, rhev-h already "sees" the LUN and rhev-m tells rhev-h to do the rest.

Now it makes sense.

- Greg
Comment 5 Fabian Deutsch 2016-02-22 10:55:14 EST
(In reply to Nir Soffer from comment #1)
> (In reply to Greg Scott from comment #0)> > Expected results:
> > Pasting word for word from bug number 880738:
> > If RHEV-H is supposed to be an appliance managed by RHEV-M, then RHEV-M
> > should also be orchestrating the storage removal as well, all the way down
> > to removing the paths.
> 
> RHEV-M does not control adding the devices to the hypervisors, so it is 
> arguable if it should control removal of the devices. The removal process
> can be automated by other means.
> 
> RHEV-H may be harder to automate, but this should be solved by RHEV-H.
> 
> Fabian, do we have a solution for RHEV-H for automating operations in a
> cluster?

Today Node itself does not do anything with storage.

FCoE/FC is getting configured manually.
iSCSI is connected using vdsm.

At the bottom line I'd expect vdsm to do the automation, or leave it for the admin to do it manually. At least I don't see a point where Node should do something.
Comment 6 Pavel Zhukov 2016-02-23 08:52:25 EST
Doesn't multipath's "deferred_remove yes" option help here? https://bugzilla.redhat.com/show_bug.cgi?id=631009
Once the SD is removed from RHEV and unzoned (all paths have failed), multipath will take care of all underlying devices and remove them.
Comment 7 Marina 2016-02-23 12:01:38 EST
Yaniv,
Bringing this RFE to your attention please.
Comment 9 Nir Soffer 2016-03-05 15:44:20 EST
Ben, would the fix for bug 631009 (deferred_remove yes) resolve this issue?

The use case is this:

1. An unused multipath device on a host is unzoned on the storage
   server.

2. All the paths on this device become faulty, since the server
   no longer exposes this LUN to this host

3. Using this multipath configuration:

defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
    deferred_remove             yes
}

devices {
    device {
        all_devs                yes
        no_path_retry           fail
    }
}

After some timeout (dev_loss_tmo?), are the paths removed from the system,
and is the multipath device removed?

Or does this only help if you manually delete the faulty paths? (I don't
remember seeing paths removed automatically.)
Comment 10 Ben Marzinski 2016-03-07 14:53:48 EST
deferred_remove won't help remove the faulty paths at all. The system should remove them after dev_loss_tmo has passed.  Once all paths to a multipath device are removed, the device itself should be removed.  Setting deferred_remove helps in cases where the device is open when multipath tries to remove it.  In this case, multipath can't remove the device, so it starts a deferred remove.  When the device is finally closed, it is automatically removed by device-mapper.

But that doesn't sound like the case you are seeing.  It sounds like the paths are failing but not being automatically removed.  This happens under multipath, in the scsi layer. dev_loss_tmo only causes a scsi device to be removed if there is a loss of connection, AFAIK. It sounds to me like the errors that the scsi device is reporting are not causing the scsi-layer to automatically remove it.
In this case, there is nothing in multipath to force-remove devices that have
been failed for too long.
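
For anyone checking which timeout actually applies on a host, the effective dev_loss_tmo for FC remote ports can be read from sysfs; a small sketch, assuming the usual RHEL sysfs layout:

# Inspect the dev_loss_tmo currently in effect for each FC remote port
# (sysfs layout as on RHEL 7; adjust if your transport differs).
import glob

for path in glob.glob("/sys/class/fc_remote_ports/rport-*/dev_loss_tmo"):
    with open(path) as f:
        print("%s = %s" % (path, f.read().strip()))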
Comment 16 Tim Speetjens 2016-10-04 11:36 EDT
Created attachment 1207249 [details]
domain dialog with warnings about unzoned iSCSI LUN

Both lines represent a LUN that was attached before but is now unzoned. The orange one was an SD that has since been removed. The other was never used: only zoned, then unzoned.
Comment 22 Greg Scott 2017-01-17 12:19:43 EST
Let's not get hung up on the word "remove."  If I'm following this, the challenge is that the raw LUNs will still exist immediately after getting rid of a storage domain, at least until the storage admin gets rid of them. How about this:

The RHEV admin tells RHEVM to get rid of a storage domain.

If the storage domain is block (FC or FCOE or iSCSI), then RHVM tells all the RHV-H systems to treat the LUNs that used to be part of that storage domain as just raw LUNs if they still exist, or just get rid of references to them if the LUNs no longer exist.

But thinking this through, there's a timing challenge.  After RHVM gets rid of the storage domain, maybe the SAN administrator gets rid of the raw LUNs, maybe not.  So RHVM needs to "know" about all the LUNs from the hosts' point of view.  And then once the storage admin gets rid of the raw LUNs, RHVM can tell the hypervisors to get rid of their references to the now stale raw LUNs.

But that gets ugly - now we have a manager tracking LUN objects over which it has no control.

So what about this - the manager gets rid of the storage domain as before. Add a GUI element and API for RHVM to tell all the hosts later on to re-enumerate all the LUNs they see, which should clean up stale LUNs after the storage admin removes them from the SAN. So the steps would be:

1 - The RHEV admin tells RHVM to get rid of the storage domain.
2 - Later on, the SAN admin gets rid of the raw LUNs from the SAN point of view.
3 - The RHEV admin clicks the RHVM GUI button to tell the hosts to clean up their stale LUNs.

- Greg
Comment 23 Nir Soffer 2017-01-17 12:34:08 EST
Here is a possible way RHV can help remove devices from hypervisors.

1. System administrator unzones the devices on the storage server

The system administrator must do this before trying to remove the devices from
a RHV setup.

RHV is not responsible for adding or removing devices, only for *discovering*
devices added by the system administrator.

To be responsible for removing devices, RHV must have control of the storage
server, similar to OpenStack Cinder.

2. System administrator selects the devices to remove

The system will show the available devices the same way we show devices for
creating a new storage domain, using a host selected by the system
administrator (Host.getDeviceList).

3. System sends a request to remove the devices to all connected hosts

The system will first send the request to all hosts except the host selected
for enumerating the devices. If removal was successful on all of those hosts
(the specified devices are no longer available on them), remove the devices on
the host selected for enumerating the devices. Finally, remove the devices from
the RHV database.

Notes:

- You cannot remove devices from the setup if the devices are not available
  on the host selected for enumerating devices.

- System cannot remove devices from hosts which are not connected.

- If the devices were not unzoned on the storage server, they will appear
  again on all hosts once we perform the next SCSI scan, and will be added back
  to the RHV database on the next creation/edit of a storage domain.

This requires adding a new vdsm API, plus new UI and an engine flow similar to
resizing a device. In that flow, the user selects a device and the system
sends a request to all hosts to resize the device.
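
To make the proposed ordering concrete, here is a rough, hypothetical sketch of the engine-side flow; the method names (remove_devices, get_device_list, delete_devices) are placeholders and do not exist in oVirt or vdsm today:

# Hypothetical sketch of the engine-side flow described above; the host/db
# method names are placeholders only - the vdsm verb does not exist yet.
def remove_stale_devices(device_guids, hosts, enumerating_host, db):
    others = [h for h in hosts if h is not enumerating_host and h.is_connected()]

    # 1. Ask every other connected host to drop the devices.
    for host in others:
        host.remove_devices(device_guids)

    # 2. Verify the devices are really gone from those hosts.
    for host in others:
        remaining = set(host.get_device_list()) & set(device_guids)
        if remaining:
            raise RuntimeError("devices still present on %s: %s" % (host, remaining))

    # 3. Only then remove them on the host used to enumerate the devices.
    enumerating_host.remove_devices(device_guids)

    # 4. Finally, drop them from the engine database.
    db.delete_devices(device_guids)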
Comment 25 Yaniv Kaul 2017-02-22 03:53:04 EST
(In reply to Nir Soffer from comment #23)
> Here is possible way RHV can help to remove devices from hypervisors.
> 
> 1. System administrator unzone the devices on the storage server
> 
> The system administrator must do this before trying to removing the devices
> from
> a RHV setup.
> 
> RHV is not responsible for adding or removing devices, only for *discovering*
> devices added by the system administrator.

When does discovery take place? If it's only a user-initiated action, then I assume we can use 'rescan-scsi-bus.sh' with '-a -r' (and perhaps '-m' as well)

> 
> To be responsible for removing devices, RHV must have control of the storage
> server, similar to OpenStack Cinder.
> 
> 2. System administrator select the devices to remove
> 
> The system will show the available devices available using the same way we
> show devices for creating new storage domain, using a host selected by
> the system administrator (Host.getDeviceList).

Is the available device seen based on Engine data or data from VDSM? Was the device unzoned already? I assume not - where exactly is the step where the storage admin unzones it, so it won't be re-discovered?


> 
> 3. System send request to remove the devices to all connected hosts
> 
> The system will first send a request to all hosts except the host selected
> for enumerating the devices. If removal was successful (specified devices are
> not available on a host) on all hosts, remove the devices on the host
> selected
> for enumerating the devices. Finally remove the devices from RHV database.
> 
> Notes:
> 
> - You cannot remove devices from the setup if the devices are not available
>   on the host selected for enumerating devices.

ACK - that means it is seen by that host, not from Engine DB?

> 
> - System cannot remove devices from hosts which are not connected.

Agreed. What happens when they come back?

> 
> - If the devices were not unzoned on the storage server, they will appear 
>   again on all hosts once we perform the next scsi scan, and be added to 
>   RHV database on the next creation/edit of storage domain.

Makes sense as well - but I'd like to make sure this is the only time re-discovery happens - what happens when a server reboots or goes back from maintenance to up? Is it seen on the host, but not in Engine?

> 
> This requires adding new vdsm api, new UI and flow in engine similar to 
> resizing of a device. In this flow the user select a device and and the
> system
> send a request to all hosts for resizing the device.



Yaniv D. - this is assigned to Allon, but not targeted yet to 4.2?
Comment 26 Nir Soffer 2017-02-22 04:51:19 EST
(In reply to Yaniv Kaul from comment #25)
> (In reply to Nir Soffer from comment #23)
> > Here is possible way RHV can help to remove devices from hypervisors.
> > 
> > 1. System administrator unzone the devices on the storage server
> > 
> > The system administrator must do this before trying to removing the devices
> > from
> > a RHV setup.
> > 
> > RHV is not responsible for adding or removing devices, only for *discovering*
> > devices added by the system administrator.
> 
> When is discovery taking place? If it's only a user initiated action, then I
> assume we can use 'rescan-scsi-bus.sh' with '-a -r' (and perhaps '-m' as
> well)

We are not using rescan-scsi-bus.sh but iscsiadm and our helper for scanning
FC (/usr/libexec/vdsm/fc-scan).

I'm not sure that using rescan-scsi-bus.sh is a good idea; it does too many
things that may not be wanted or safe for us.

Scanning is done by the system without user interaction.
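
For reference, a minimal sketch of the two scan mechanisms mentioned here, using the standard Linux interfaces (an illustration, not vdsm's actual code):

# Rough illustration of the two scan mechanisms, not vdsm's implementation.
import glob
import subprocess

def rescan_iscsi():
    # Rescan all logged-in iSCSI sessions.
    subprocess.call(["iscsiadm", "-m", "session", "-R"])

def rescan_fc():
    # Trigger a full scan ("- - -" = all channels/targets/LUNs) on every SCSI
    # host; this sysfs interface is what a helper like fc-scan relies on.
    for scan_path in glob.glob("/sys/class/scsi_host/host*/scan"):
        with open(scan_path, "w") as f:
            f.write("- - -")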

> > To be responsible for removing devices, RHV must have control of the storage
> > server, similar to OpenStack Cinder.
> > 
> > 2. System administrator select the devices to remove
> > 
> > The system will show the available devices available using the same way we
> > show devices for creating new storage domain, using a host selected by
> > the system administrator (Host.getDeviceList).
> 
> The available device is seen based on Engine data or data from VDSM? 

Based on what vdsm reports, and what engine knows about the devices
(for example, it will gray out used devices).

> Was the
> device unzoned already? I assume not - where exactly is the step the storage
> admin unzones it, so it won't be re-discovered?

It should be unzoned at this point, see step 1.

> > 3. System send request to remove the devices to all connected hosts
> > 
> > The system will first send a request to all hosts except the host selected
> > for enumerating the devices. If removal was successful (specified devices are
> > not available on a host) on all hosts, remove the devices on the host
> > selected
> > for enumerating the devices. Finally remove the devices from RHV database.
> > 
> > Notes:
> > 
> > - You cannot remove devices from the setup if the devices are not available
> >   on the host selected for enumerating devices.
> 
> ACK - that means it is seen by that host, not from Engine DB?

Yes.

We can use the engine DB as well, but we must support the case where you lost
your engine DB, or you restored an older version that does not reflect the
storage. The truth is what we actually see on storage.

If we want to make this more robust (and more complex), we can ask all hosts to
return the device list at the same time and merge the results, marking devices
that are not available on all hosts.

> > - System cannot remove devices from hosts which are not connected.
> 
> Agreed. What happens when they come back?

In the simplest solution, nothing; you have to open the dialog again
using a host that sees the device and ask to remove the device again.

In the more complex solution, the system will ask the host to remove the
device when it comes back online. This is how we implement Ceph secrets:
each time we connect, the host gets the list of secrets it must keep. Any
secrets not in this list will be removed, and new secrets will be added.
We can do a similar thing by sending the list of devices a host should see
when connecting the host to storage.

In the Kubernetes world, the host can get the list of devices from etcd,
remove unneeded devices, or rescan the bus to find devices which are not
available on the host but are listed in etcd.

I suggest we start with the simplest possible solution, which will be much
better than no solution.
 
> > - If the devices were not unzoned on the storage server, they will appear 
> >   again on all hosts once we perform the next scsi scan, and be added to 
> >   RHV database on the next creation/edit of storage domain.
> 
> Makes sense as well - but I'd like to make sure this is the only time of
> re-discovery - what happens when a server reboots or goes back from
> maintenance to up? It is seen on the host, but not in Engine?

Currently vdsm will discover devices when vdsm starts, when connecting
to a storage server, when a domain lookup in the domain cache fails, etc.
Comment 41 Gianluca Cecchi 2017-10-06 10:33:37 EDT
Hello,
Any chance of having some solution for 4.2? Or any testing in 4.1.6?
Could it be an option to have a solution for RHEV and FC-based storage domains, similar to the OS-level one in
https://access.redhat.com/solutions/20063
but richer, due to the involvement of different hosts?
Thanks
Comment 42 Yaniv Kaul 2017-10-06 14:45:13 EDT
(In reply to Gianluca Cecchi from comment #41)
> Hello,
> Any chance to have some solution for 4.2? Or any testing for 4.1.6?
> Could it be an option to have a solution for RHEV and FC bases storage
> domains, similar to OS version
> https://access.redhat.com/solutions/20063
> but richer, due to involvement of different hosts?
> Thanks

I've been looking at implementing this in Ansible (where the input is the WWN).
Of course, it's taking me a while, as it's not my day-to-day work on oVirt.
I'm still hoping to complete it soon.
