Bug 980914 - VDSM restarting right after disk plugging causes VM's conf to omit the plugged disk.
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64 Unspecified
Priority: medium  Severity: medium
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Assigned To: Amit Aviram
QA Contact: Aharon Canan
Whiteboard: storage
Depends On: 1040476
Blocks:
Reported: 2013-07-03 09:47 EDT by Elad
Modified: 2016-03-10 02:32 EST (History)
12 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-16 07:34:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (1.41 MB, application/x-gzip)
2013-07-03 09:47 EDT, Elad
vdsm+engine logs (1.19 MB, application/x-gzip)
2014-05-21 10:18 EDT, Ori Gofen

Description Elad 2013-07-03 09:47:08 EDT
Created attachment 768251 [details]
logs

Description of problem:
After a failed disk hotplug, the disk's LV remains accessible:

lvs
9e71eba5-e97e-405d-9004-a18603a7608a -wi-ao---
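The attr string in the lvs output above encodes the problem: in LVM's lv_attr field, the fifth character is the activation state ('a' = active) and the sixth is the open flag ('o' = open). An illustrative helper (not VDSM code) that decodes it:

```python
def parse_lv_attr(attr):
    """Return (active, open_) for an lvs attr string like '-wi-ao---'.

    Illustrative only: positions 5 and 6 of LVM's lv_attr field are the
    state and device-open flags respectively.
    """
    if len(attr) < 6:
        raise ValueError("unexpected attr string: %r" % attr)
    return attr[4] == 'a', attr[5] == 'o'

# The LV above parses as active and open -- i.e. still held after the
# failed hotplug:
active, open_ = parse_lv_attr("-wi-ao---")
```

Such an LV can normally be deactivated manually with `lvchange -an <vg>/<lv>`, which is exactly what VDSM's teardown fails to do here.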



Version-Release number of selected component (if applicable):
vdsm-4.11.0-69.gitd70e3d5.el6.x86_64
sanlock-2.6-2.el6.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
rhevm-3.3.0-0.5.master.el6ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. on a block pool 
2. create and run a VM with OS
3. add new disk 
4. in order to be able to fail vdsm during hotplug, create a hook script that pauses the flow for 20s when a disk is hotplugged to a VM, under vdsm's hooks directory:
vim /usr/libexec/vdsm/hooks/after_disk_hotplug/1.sh

#! /bin/bash

sleep 20s

5. make the script executable and owned by vdsm:kvm: 'chmod 755 1.sh'
6. perform disk hotplug to the VM 
7. during hotplug, restart vdsm service
8. hotplug will fail

Actual results:

Engine reports that the disk is inactive. Even though the disk hotplug failed, the LV remains accessible. vdsm cannot deactivate it and fails with:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3195, in teardownImage
    img.teardown(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 253, in teardown
    imgUUID, volUUIDs=[vol.volUUID for vol in chain])
  File "/usr/share/vdsm/storage/blockSD.py", line 1040, in deactivateVolumes
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/share/vdsm/storage/lvm.py", line 1111, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 774, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume

The user cannot activate or remove the disk.

Expected results:
After a failed disk hotplug, the LV should become inaccessible.


Additional info:
logs and lvs
Comment 1 Elad 2013-07-07 04:21:48 EDT
Easier way to reproduce:

1. on a block pool 
2. create and run a VM with OS
3. add new disk 
4. during disk hotplug, block connectivity to storage server by iptables
5. hotplug will fail
Comment 2 Ayal Baron 2013-07-08 18:08:48 EDT
(In reply to Elad from comment #0)
> [description, reproduction steps, and traceback quoted from comment #0 snipped]
> user cannot activate or remove the disk. 

There is no disk to remove, it was never plugged and as you state, engine reports that the disk is inactive (which is the correct result if hotplug fails...)

The only issue here is the leftover active LV, but that is beside the point.
On startup, vdsm could possibly have a mechanism to deactivate all LVs that are not referenced by any running VM.
Am I missing something?
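The startup mechanism suggested above could be sketched as follows (hypothetical code, not part of VDSM; `deactivate_lv` stands in for whatever performs the actual `lvchange -an`):

```python
def cleanup_orphan_lvs(active_lvs, referenced_lvs, deactivate_lv):
    """Deactivate LVs left active by an interrupted hotplug.

    Hypothetical sketch of a startup-recovery pass:
    active_lvs     -- iterable of LV names currently active on the host
    referenced_lvs -- LV names referenced by disks of running VMs
    deactivate_lv  -- callable performing the actual deactivation
    Returns the set of orphaned LVs that were deactivated.
    """
    orphans = set(active_lvs) - set(referenced_lvs)
    for lv in sorted(orphans):
        deactivate_lv(lv)
    return orphans
```

In the scenario from comment #0, the LV of the never-plugged disk would have no referencing VM and would be cleaned up on the first vdsm start after the crash.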

Comment 3 Elad 2013-07-09 03:50:00 EDT
(In reply to Ayal Baron from comment #2)
 

> There is no disk to remove, it was never plugged and as you state, engine
> reports that the disk is inactive (which is the correct result if hotplug
> fails...)

Reproduced again, interrupting the hotplug flow by restarting the vdsm service during hotplug. The restart occurred while a hook that pauses the flow for 20 seconds after the disk hotplug itself was running.
Here are my findings.
Before hotplug, the LV is inactive:

/dev/283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff -wi------

prepareImage:

Thread-2687::INFO::2013-07-09 10:25:52,842::logUtils::44::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='283c1cc8-1d44-47f6-970d-5df9f4b4dedf', spUUID='8510a3ba-2457-4df6-9140-8257f127bca5', imgUUID='4f0f5709-a11b-4744-a333-9f8580e7192a', volUUID='21bb4146-ec05-464c-a019-c71b89ef18ff')

lvchange succeeded:

Thread-2687::DEBUG::2013-07-09 10:25:53,242::lvm::310::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_
cache_state=0 disable_after_error_count=3 filter = [ \'a%1elad11370784|3514f0c542a00219d|3514f0c542a00219e|3514f0c542a0024ab|3514f0c542a0024ac|3514f0c542a0024ad|3514f0c542a0024ae|3514f0c542a0024af|3514f0c542a0024b
0|3514f0c542a0024b1|3514f0c542a0024b2|3514f0c542a0024b3|3514f0c542a0024b4|3514f0c542a0024b5|3514f0c542a0024b6|3514f0c542a0024b7|3514f0c542a0024b8%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks
=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --autobackup n --available y 283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff' (cwd None)
VM Channels Listener::DEBUG::2013-07-09 10:25:53,576::vmChannels::128::vds::(_handle_unconnected) Trying to connect fileno 29.


/dev/283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff -wi-ao---


This thread continues to run because of the hook script, and is aborted by the restart.

So, after this failure, vdsm comes up with the LV still active; vdsm does not deactivate it even though the hotplug never succeeded. Engine reports that the disk is active, and when the user tries to activate the disk again, vdsm fails while deactivating the LV during image teardown:

lvchange:

Thread-1770::DEBUG::2013-07-09 10:46:31,447::lvm::310::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_
cache_state=0 disable_after_error_count=3 filter = [ \'a%1elad11370784|3514f0c542a00219d|3514f0c542a00219e|3514f0c542a0024ab|3514f0c542a0024ac|3514f0c542a0024ad|3514f0c542a0024ae|3514f0c542a0024af|3514f0c542a0024b
0|3514f0c542a0024b1|3514f0c542a0024b2|3514f0c542a0024b3|3514f0c542a0024b4|3514f0c542a0024b5|3514f0c542a0024b6|3514f0c542a0024b7|3514f0c542a0024b8%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks
=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --autobackup n --available n 283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff' (cwd None)
Thread-1773::DEBUG::2013-07-09 10:46:31,597::BindingXMLRPC::177::vds::(wrapper) client [10.35.115.10]


vdsm fails with:



Thread-1770::ERROR::2013-07-09 10:46:36,586::task::850::TaskManager.Task::(_setError) Task=`ef4c91f8-0750-421a-be1e-1dc951898489`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3195, in teardownImage
    img.teardown(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 253, in teardown
    imgUUID, volUUIDs=[vol.volUUID for vol in chain])
  File "/usr/share/vdsm/storage/blockSD.py", line 1040, in deactivateVolumes
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/share/vdsm/storage/lvm.py", line 1111, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 774, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume:
Comment 4 Federico Simoncelli 2013-07-09 10:17:24 EDT
Probably there are two issues here:

1. vdsm side, lv is not deactivated (too bad, we can live with that for now)
2. backend gets out of sync when it hits some special error (e.g. VDS_NETWORK_ERROR)

E.g. vdsm restart:

2013-07-03 14:57:58,946 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (pool-5-thread-48) [2ee8070d] Command HotPlugDiskVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-07-03 14:57:58,946 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (pool-5-thread-48) [2ee8070d] FINISH, HotPlugDiskVDSCommand, log id: 47411ed4
2013-07-03 14:57:58,946 ERROR [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (pool-5-thread-48) [2ee8070d] Command org.ovirt.engine.core.bll.HotPlugDiskToVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022)

When HotPlugDiskVDS/HotUnPlugDiskVDS fails with unknown errors, the backend should try to assert the state of the device (plugged/unplugged), since it does not have enough information about the real state of the call (executed/not-executed).

Best thing probably is to mark the status of the device as "Unknown" and use the getVmStats polling calls to determine/update the real status.
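The proposed handling could look like the sketch below (Python pseudologic with hypothetical names; the real engine is Java, and the exact status values and stats fields are assumptions, not the engine's API):

```python
# Hypothetical device statuses; the engine's actual enum may differ.
UNKNOWN, PLUGGED, UNPLUGGED = "Unknown", "Plugged", "Unplugged"

def on_hotplug_failure(device, error_code):
    """On an indeterminate failure (e.g. VDS_NETWORK_ERROR), the real
    device state is unknowable, so mark it Unknown instead of guessing."""
    if error_code == "VDS_NETWORK_ERROR":
        device["status"] = UNKNOWN
    else:
        # A definite failure means the plug did not happen.
        device["status"] = UNPLUGGED

def reconcile_from_stats(device, vm_stats):
    """Resolve an Unknown status from a getVmStats-style device list."""
    if device["status"] != UNKNOWN:
        return
    plugged_ids = {d["deviceId"] for d in vm_stats.get("devices", [])}
    device["status"] = PLUGGED if device["id"] in plugged_ids else UNPLUGGED
```

The key design choice is refusing to commit to a state the backend cannot know: the Unknown marker is transient and is resolved by the next successful stats poll.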
Comment 5 Ayal Baron 2014-02-16 04:50:40 EST
(In reply to Federico Simoncelli from comment #4)
> [analysis and engine log quoted from comment #4 snipped]
> Best thing probably is to mark the status of the device as "Unknown" and use
> the getVmStats polling calls to determine/update the real status.

This would require the VM async-tasks infrastructure, so the bug needs to be updated to block on that work.
Comment 6 Ori Gofen 2014-05-21 10:18:35 EDT
Created attachment 898018 [details]
vdsm+engine logs

The easiest way to recreate this bug (it does not appear to depend on the domain type):

1. have two domains (one file, one block)
2. create a VM ("vm1") with a disk and run the VM
3. go to vm -> Disks -> vm1_Disk1 -> Deactivate
4. move the disk to the other domain
5. activate the disk

on av9.2
Comment 8 Allon Mureinik 2015-06-18 09:37:44 EDT
Worth reconsidering after SPM is removed.
Comment 9 Yaniv Lavi 2015-10-22 04:17:30 EDT
Can you provide a status for this issue?
Comment 10 Amit Aviram 2015-10-22 04:50:06 EDT
As this bug's severity is medium and it is targeted to 3.6, it is currently pending in favor of higher-priority issues.
Comment 11 Allon Mureinik 2015-11-16 07:34:00 EST
This is a race condition which we already solved most of in 3.6.0.
The remaining issue is extremely rare, and should be self-correcting after a while. Closing.
