Bug 980914
| Summary: | VDSM restarting right after disk plugging causes VM's conf to omit the plugged disk. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Elad <ebenahar> |
| Component: | vdsm | Assignee: | Amit Aviram <aaviram> |
| Status: | CLOSED WONTFIX | QA Contact: | Aharon Canan <acanan> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.3.0 | CC: | aaviram, amureini, bazulay, ebenahar, fsimonce, lpeer, ogofen, scohen, srevivo, tnisan, yeylon, ylavi |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-11-16 12:34:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1040476 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Easier way to reproduce:

1. on a block pool
2. create and run a VM with an OS
3. add a new disk
4. during disk hotplug, block connectivity to the storage server with iptables
5. the hotplug will fail

(In reply to Elad from comment #0)
> Created attachment 768251 [details]
> logs
>
> Description of problem:
> After a failed disk hotplug, the disk remains accessible:
>
> lvs
> 9e71eba5-e97e-405d-9004-a18603a7608a -wi-ao---
>
> Version-Release number of selected component (if applicable):
> vdsm-4.11.0-69.gitd70e3d5.el6.x86_64
> sanlock-2.6-2.el6.x86_64
> libvirt-0.10.2-18.el6_4.8.x86_64
> rhevm-3.3.0-0.5.master.el6ev.noarch
>
> How reproducible:
> 100%
>
> Steps to Reproduce:
> 1. on a block pool
> 2. create and run a VM with an OS
> 3. add a new disk
> 4. in order to be able to fail vdsm during hotplug, create a script that
> pauses vdsm for 20s when hotplugging a disk to the VM.
> Under the hooks directory in vdsm:
> vim /usr/libexec/vdsm/hooks/after_disk_hotplug/1.sh
>
> #!/bin/bash
>
> sleep 20s
>
> 5. set the script's owner to vdsm:kvm and make it executable: 'chmod 755 1.sh'
> 6. perform the disk hotplug to the VM
> 7. during the hotplug, restart the vdsm service
> 8. the hotplug will fail
>
> Actual results:
> The engine reports that the disk is inactive. Despite the failed disk
> hotplug, the lv remains accessible.
> vdsm cannot deactivate it and fails with:
>
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 3195, in teardownImage
>     img.teardown(sdUUID, imgUUID, volUUID)
>   File "/usr/share/vdsm/storage/image.py", line 253, in teardown
>     imgUUID, volUUIDs=[vol.volUUID for vol in chain])
>   File "/usr/share/vdsm/storage/blockSD.py", line 1040, in deactivateVolumes
>     lvm.deactivateLVs(self.sdUUID, volUUIDs)
>   File "/usr/share/vdsm/storage/lvm.py", line 1111, in deactivateLVs
>     _setLVAvailability(vgName, toDeactivate, "n")
>   File "/usr/share/vdsm/storage/lvm.py", line 774, in _setLVAvailability
>     raise error(str(e))
> CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume
>
> The user cannot activate or remove the disk.

There is no disk to remove; it was never plugged, and as you state, the engine reports that the disk is inactive (which is the correct result if the hotplug fails...). The only issue here is the leftover active LV, but that is beside the point. vdsm could possibly have a mechanism that, when starting up, deactivates all LVs that have no reference from running VMs. Am I missing something?

> Expected results:
> After a failed disk hotplug, the lv should become inaccessible.
>
> Additional info:
> logs and lvs

(In reply to Ayal Baron from comment #2)
> There is no disk to remove; it was never plugged, and as you state, the
> engine reports that the disk is inactive (which is the correct result if
> the hotplug fails...)

Reproduced again, interrupting the hotplug flow by restarting the vdsm service during the hotplug. The restart occurred during the run of a hook that pauses the flow for 20 seconds after the disk hotplug itself.
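Ayal's suggested mechanism above — on startup, deactivate every LV that no running VM references — can be sketched roughly as follows. This is an illustrative sketch only, not vdsm code; `list_active_lvs`, `lvs_referenced_by_vms`, and `deactivate_lv` are hypothetical stand-ins for the real LVM wrappers and VM-device bookkeeping.

```python
# Hedged sketch of the suggested startup cleanup: on vdsm start, deactivate
# every active LV in a VG that no running VM references. All three callables
# are hypothetical stand-ins, not real vdsm API.

def cleanup_orphaned_lvs(vg_name, list_active_lvs, lvs_referenced_by_vms,
                         deactivate_lv):
    """Deactivate active LVs in vg_name that no running VM references.

    Returns the list of LV names that were deactivated.
    """
    referenced = set(lvs_referenced_by_vms())
    orphans = [lv for lv in list_active_lvs(vg_name) if lv not in referenced]
    for lv in orphans:
        deactivate_lv(vg_name, lv)  # would run 'lvchange --available n'
    return orphans
```

Such a sweep would have cleared the leftover active LV left behind by the aborted hotplug in this report, without needing to know how the failure happened.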
Here are my findings. Before the hotplug, the LV is inactive:

```
/dev/283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff -wi------
```

prepareImage:

```
Thread-2687::INFO::2013-07-09 10:25:52,842::logUtils::44::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='283c1cc8-1d44-47f6-970d-5df9f4b4dedf', spUUID='8510a3ba-2457-4df6-9140-8257f127bca5', imgUUID='4f0f5709-a11b-4744-a333-9f8580e7192a', volUUID='21bb4146-ec05-464c-a019-c71b89ef18ff')
```

lvchange succeeded:

```
Thread-2687::DEBUG::2013-07-09 10:25:53,242::lvm::310::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1elad11370784|3514f0c542a00219d|3514f0c542a00219e|3514f0c542a0024ab|3514f0c542a0024ac|3514f0c542a0024ad|3514f0c542a0024ae|3514f0c542a0024af|3514f0c542a0024b0|3514f0c542a0024b1|3514f0c542a0024b2|3514f0c542a0024b3|3514f0c542a0024b4|3514f0c542a0024b5|3514f0c542a0024b6|3514f0c542a0024b7|3514f0c542a0024b8%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --autobackup n --available y 283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff' (cwd None)
VM Channels Listener::DEBUG::2013-07-09 10:25:53,576::vmChannels::128::vds::(_handle_unconnected) Trying to connect fileno 29.
```

After activation, the LV is active:

```
/dev/283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff -wi-ao---
```

This thread continues to run because of the script, and is aborted by the restart.

So, after this failure, vdsm comes up and the LV remains active; vdsm does not deactivate it, even though the hotplug never succeeded.

The engine reports that the disk is active, and when the user tries to activate the disk again, vdsm fails on the LV because it is already active.

lvchange:

```
Thread-1770::DEBUG::2013-07-09 10:46:31,447::lvm::310::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm lvchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1elad11370784|3514f0c542a00219d|3514f0c542a00219e|3514f0c542a0024ab|3514f0c542a0024ac|3514f0c542a0024ad|3514f0c542a0024ae|3514f0c542a0024af|3514f0c542a0024b0|3514f0c542a0024b1|3514f0c542a0024b2|3514f0c542a0024b3|3514f0c542a0024b4|3514f0c542a0024b5|3514f0c542a0024b6|3514f0c542a0024b7|3514f0c542a0024b8%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --autobackup n --available n 283c1cc8-1d44-47f6-970d-5df9f4b4dedf/21bb4146-ec05-464c-a019-c71b89ef18ff' (cwd None)
Thread-1773::DEBUG::2013-07-09 10:46:31,597::BindingXMLRPC::177::vds::(wrapper) client [10.35.115.10]
```

vdsm fails with:

```
Thread-1770::ERROR::2013-07-09 10:46:36,586::task::850::TaskManager.Task::(_setError) Task=`ef4c91f8-0750-421a-be1e-1dc951898489`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3195, in teardownImage
    img.teardown(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 253, in teardown
    imgUUID, volUUIDs=[vol.volUUID for vol in chain])
  File "/usr/share/vdsm/storage/blockSD.py", line 1040, in deactivateVolumes
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/share/vdsm/storage/lvm.py", line 1111, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 774, in _setLVAvailability
    raise error(str(e))
```
```
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume
```

There are probably two issues here:

1. vdsm side: the lv is not deactivated (too bad, we can live with that for now)
2. the backend gets out of sync when it hits some special error (e.g. VDS_NETWORK_ERROR)

E.g. vdsm restart:

```
2013-07-03 14:57:58,946 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (pool-5-thread-48) [2ee8070d] Command HotPlugDiskVDS execution failed. Exception: VDSNetworkException: java.net.ConnectException: Connection refused
2013-07-03 14:57:58,946 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (pool-5-thread-48) [2ee8070d] FINISH, HotPlugDiskVDSCommand, log id: 47411ed4
2013-07-03 14:57:58,946 ERROR [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (pool-5-thread-48) [2ee8070d] Command org.ovirt.engine.core.bll.HotPlugDiskToVmCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused (Failed with VDSM error VDS_NETWORK_ERROR and code 5022)
```

When HotPlugDiskVDS/HotUnPlugDiskVDS fails with an unknown error, the backend should try to assert the state of the device (plugged/unplugged), since it does not have enough information about the real state of the call (executed/not executed).

The best approach is probably to mark the status of the device as "Unknown" and use the getVmStats polling calls to determine/update the real status.

(In reply to Federico Simoncelli from comment #4)
> There are probably two issues here:
>
> 1. vdsm side: the lv is not deactivated (too bad, we can live with that for now)
> 2. the backend gets out of sync when it hits some special error (e.g.
> VDS_NETWORK_ERROR)
>
> When HotPlugDiskVDS/HotUnPlugDiskVDS fails with an unknown error, the backend
> should try to assert the state of the device (plugged/unplugged), since it
> does not have enough information about the real state of the call
> (executed/not executed).
>
> The best approach is probably to mark the status of the device as "Unknown"
> and use the getVmStats polling calls to determine/update the real status.

This would require the VM async tasks infra, so we need to update this bug to block on that work.

Created attachment 898018 [details]
vdsm+engine logs
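Federico's proposal — mark the device "Unknown" on an indeterminate failure and resolve it from getVmStats polling — could look roughly like the state machine below. The real backend is Java; this Python sketch uses invented names (`on_hotplug_failure`, `reconcile`) purely to illustrate the idea, not any actual engine API.

```python
# Illustrative sketch of the proposed device-state handling (the real
# engine code is Java; all names here are invented for illustration).

PLUGGED, UNPLUGGED, UNKNOWN = "plugged", "unplugged", "unknown"

def on_hotplug_failure(error_code, definite_failures):
    """Status to store when HotPlugDiskVDS fails with error_code."""
    if error_code in definite_failures:
        return UNPLUGGED  # vdsm definitely rejected the hotplug
    return UNKNOWN        # e.g. VDS_NETWORK_ERROR: real state unclear

def reconcile(status, disk_id, devices_from_vm_stats):
    """Resolve an UNKNOWN status from a getVmStats device list."""
    if status != UNKNOWN:
        return status
    return PLUGGED if disk_id in devices_from_vm_stats else UNPLUGGED
```

The key point is that a network error during the call carries no information about whether vdsm actually executed the hotplug, so only a later observation of the VM's device list can settle the state.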
The easiest way to recreate this bug exactly doesn't seem to have any connection with the type of the domain:

1. have two domains (one file, one block)
2. create a VM ("vm1") + disk -> run the VM
3. go to vm -> disks -> vm1_disk1 -> Deactivate
4. move the disk to the other domain
5. activate the disk

on av9.2
Worth reconsidering after SPM is removed.

Can you provide a status for this issue?

As this bug's severity is medium and it is targeted to 3.6, it is currently pending in favor of higher-priority issues.

This is a race condition, most of which we already solved in 3.6.0. The remaining issue is extremely rare and should be self-correcting after a while. Closing.
Created attachment 768251 [details]
logs

Description of problem:
After a failed disk hotplug, the disk remains accessible:

```
lvs
9e71eba5-e97e-405d-9004-a18603a7608a -wi-ao---
```

Version-Release number of selected component (if applicable):
vdsm-4.11.0-69.gitd70e3d5.el6.x86_64
sanlock-2.6-2.el6.x86_64
libvirt-0.10.2-18.el6_4.8.x86_64
rhevm-3.3.0-0.5.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. on a block pool
2. create and run a VM with an OS
3. add a new disk
4. in order to be able to fail vdsm during hotplug, create a script that pauses vdsm for 20s when hotplugging a disk to the VM. Under the hooks directory in vdsm:

vim /usr/libexec/vdsm/hooks/after_disk_hotplug/1.sh

```
#!/bin/bash

sleep 20s
```

5. set the script's owner to vdsm:kvm and make it executable: 'chmod 755 1.sh'
6. perform the disk hotplug to the VM
7. during the hotplug, restart the vdsm service
8. the hotplug will fail

Actual results:
The engine reports that the disk is inactive. Despite the failed disk hotplug, the lv remains accessible. vdsm cannot deactivate it and fails with:

```
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3195, in teardownImage
    img.teardown(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 253, in teardown
    imgUUID, volUUIDs=[vol.volUUID for vol in chain])
  File "/usr/share/vdsm/storage/blockSD.py", line 1040, in deactivateVolumes
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/share/vdsm/storage/lvm.py", line 1111, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/share/vdsm/storage/lvm.py", line 774, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume
```

The user cannot activate or remove the disk.

Expected results:
After a failed disk hotplug, the lv should become inaccessible.

Additional info:
logs and lvs
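One way the CannotDeactivateLogicalVolume failure on retry could be avoided is to make the availability change idempotent: query the LV's current state first and treat "already in the desired state" as success. A minimal sketch, assuming hypothetical `run_lvchange` and `is_active` helpers rather than vdsm's real `_setLVAvailability`:

```python
# Sketch of an idempotent LV availability toggle (illustrative only).
# run_lvchange and is_active are hypothetical stand-ins for vdsm's
# lvchange wrapper and an 'lvs'-based activation query.

def set_lv_availability(vg, lv, available, run_lvchange, is_active):
    """Set LV availability ('y'/'n'), skipping work already done.

    Returns True if lvchange was invoked, False if the LV was already
    in the desired state.
    """
    want_active = available == "y"
    if is_active(vg, lv) == want_active:
        return False  # already in the desired state; nothing to do
    run_lvchange(vg, lv, available)
    return True
```

This does not fix the underlying race, but it would let a retried teardown or activation succeed instead of raising on an LV left behind by an aborted hotplug.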