Created attachment 838263 [details]
engine, libvirt, qemu and vdsm logs

Description of problem:
hotplugDisk fails with:

Thread-63::DEBUG::2013-12-18 12:02:29,095::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive
Thread-63::ERROR::2013-12-18 12:02:29,096::vm::3448::vm.Vm::(hotplugDisk) vmId=`86a4451b-6698-407c-9444-c713c77760cc`::Hotplug failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 3446, in hotplugDisk
    self._dom.attachDevice(driveXml)
  File "/usr/share/vdsm/vm.py", line 842, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 399, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.1.rc.el6ev.x86_64
rhevm-3.3.0-0.40.rc.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a file storage pool:
1) Detach a disk from a VM and restart VDSM right after
2) Attach the same disk to the VM when VDSM takes SPM again

Actual results:
hotplugDisk fails

Additional info:
engine, libvirt, qemu and vdsm logs
Created attachment 838791 [details]
engine, libvirt, qemu and vdsm logs (new)

The wrong file was uploaded; uploading the relevant one instead.
Does this reproduce if you wait for the disk to be detached and only then restart VDSM?

Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please comment on it and return the target version to 3.3.
(In reply to Allon Mureinik from comment #2)
> Does this reproduce if you wait for the disk to be detached and only then
> restart VDSM?
> Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please
> comment on it and return the target version to 3.3.

No.
No...

What happens in the hotunplug code:
1. Detach the disk from the Python VM object.
2. Save the VM object state to disk.
3. Hot-unplug the disk through libvirt.

The restart probably happened between steps 2 and 3, as sketched below.
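For illustration only, here is a minimal sketch of that ordering; hotunplug_disk, save_state and vm.devices are hypothetical names, not the actual VDSM code, and dom.detachDevice() stands for the libvirt call in step 3:

# Minimal sketch of the ordering described above (hypothetical names).
def hotunplug_disk(vm, drive):
    vm.devices.remove(drive)          # 1. detach the disk from the Python VM object
    vm.save_state()                   # 2. persist the VM object state to disk
    vm.dom.detachDevice(drive.xml)    # 3. hot-unplug the disk through libvirt
    # A restart between steps 2 and 3 leaves the saved state claiming the disk
    # is detached while QEMU still holds its drive ID, so a later attach fails
    # with "Duplicate ID 'drive-virtio-diskN' for drive".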
Yeela, any update on this?
(In reply to Ayal Baron from comment #6)
> Yeela, any update on this?

Ping?
The bugs in that area are general and not disk-specific (for example, they are relevant for NICs as well).

The issue is that we persist the VM configuration on the VDSM side for different purposes, so it can always be inconsistent with the actual state in libvirt, and we are missing handling for that.

The solution here should be more general and not limited to this specific case. Allon, what's your take on that?
(In reply to Liron Aravot from comment #10)
> The bugs in that area are general and not disk-specific (for example, they
> are relevant for NICs as well).
>
> The issue is that we persist the VM configuration on the VDSM side for
> different purposes, so it can always be inconsistent with the actual state
> in libvirt, and we are missing handling for that.
>
> The solution here should be more general and not limited to this specific
> case. Allon, what's your take on that?

I agree.

Michal, can one of your guys take a look at this, please?
I don't see anything general; I see the hotunplugDisk() code doing exactly what's described in comment #5. Yes, it's the case for networking too (adding Dan), but it's not the case for CPU hotplug, for example.
In CPU hotplug we have a very similar bug: we set (and save) conf['smp'] slightly after setVcpusFlags() succeeds, which means that a crash would create a discrepancy between libvirt and VDSM's conf. Luckily for CPU hotplug, nothing really cares about this.

We should use vm.conf to create the domxml and then drop it. When it's needed for backward compatibility (the list verb, migration), it has to be recalculated from libvirt's domxml. However, this is a big task that can be solved piece-wise, hence I'm keeping this bug on storage.
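As a rough illustration of the recalculate-from-domxml direction (assuming a libvirt domain object `dom`; the helper name is made up and this is not a proposed patch):

# Hedged sketch: derive the current vCPU count from libvirt's domain XML
# instead of trusting a saved conf['smp']; names are illustrative only.
from xml.dom import minidom

def current_vcpus(dom):
    """Return the vCPU count libvirt actually reports for the domain."""
    vcpu = minidom.parseString(dom.XMLDesc(0)).getElementsByTagName('vcpu')[0]
    # Prefer the 'current' attribute (updated by CPU hotplug); fall back to
    # the static maximum in the element body.
    return int(vcpu.getAttribute('current') or vcpu.firstChild.data)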
This problem occurs when detaching a device from a VM that has no OS installed. (Specifically, calling libvirt to detach a device in that state returns success, while the device is actually never detached. This causes the device to be duplicated, and at some point it can no longer be attached, producing the described message.)

Elad, did you test this issue with a VM that has an OS installed?
I don't remember and we cannot know from the logs.
It seems like this bug originates in Bug 807023, which by now is fixed. (Basically, libvirt didn't check whether detaching a disk succeeded before deleting it from its pci-stubs, so when trying to attach the device again, the device ID was already taken. The fix was to verify that the device was actually detached, and only then remove it from the configuration.)

VDSM doesn't even handle the device's ID; it only requests an XML description of the disk to be attached. However, VDSM should also verify that the disk is detached, since libvirt returns success when it manages to send the guest a detach REQUEST, which doesn't mean the device is actually detached. The posted patch adds exactly that functionality. With it in place, detaching and attaching should work correctly even if unplugging did not succeed.
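A hedged sketch of that kind of verification (not the posted patch itself; `wait_for_detach` and its parameters are illustrative, assuming a libvirt domain object `dom` and a target device name such as 'vdb'):

# Illustrative only: after libvirt accepts the detach request, poll the live
# domain XML until the drive's <target dev=...> disappears or a timeout expires.
import time
from xml.dom import minidom

def wait_for_detach(dom, dev_name, timeout=30, interval=1):
    """Return True once no <target dev='dev_name'> remains in the domain XML."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        targets = minidom.parseString(dom.XMLDesc(0)).getElementsByTagName('target')
        if not any(t.getAttribute('dev') == dev_name for t in targets):
            return True
        time.sleep(interval)
    return False  # the guest never completed the detach request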
Note that if we cherry-pick this patch to 3.5.z, we must change VDSM's spec in RHEL 6 to require libvirt >= 0.10.2-38.el6, as this bug depends on the changes made in Bug 807023.
In oVirt, testing is done on a single release by default; therefore I'm removing the 3.6 flag. If you think this bug must be tested in 3.6 as well, please re-add the flag. Please note we might not have the testing resources to handle the 3.6 clone.
Verification steps for the bug:

1. Unsuccessful unplugging:
- Create a VM.
- Add a disk to it; DO NOT install any OS.
- Run the VM.
- Try to hot-unplug the disk.

VDSM should throw a timeout error for not being able to unplug the disk.

2. Successful unplugging:
- Run a VM with an OS installed.
- Hot-unplug a disk other than the one holding the OS.

The disk should be successfully unplugged.
(In reply to Amit Aviram from comment #20)
> Verification steps for the bug:
> 1. Unsuccessful unplugging:
> - Create a VM.
> - Add a disk to it; DO NOT install any OS.
> - Run the VM.
> - Try to hot-unplug the disk.
>
> VDSM should throw a timeout error for not being able to unplug the disk.

Thread-3456::ERROR::2015-11-10 16:27:22,006::vm::4091::vm.Vm::(hotunplugDisk) vmId=`426b742f-5ce6-477d-a9fc-b9d9dbd83afe`::Timeout detaching drive vdb

> 2. Successful unplugging:
> - Run a VM with an OS installed.
> - Hot-unplug a disk other than the one holding the OS.
>
> The disk should be successfully unplugged.

Disk successfully unplugged.

Verified using vt18.2.

Amit, shouldn't we also trigger a VDSM restart (as stated in comment #5)?
No. The issue described in the bug description originates in libvirt's Bug 807023, which is now fixed. The problem on our side was that we didn't check at all whether libvirt succeeded in unplugging the disk; it has nothing to do with restarting VDSM while plugging/hotplugging (which Bug 980914 describes).
oVirt 3.5.6 has been released and the bz verified, moving to closed current release.
*** Bug 1294071 has been marked as a duplicate of this bug. ***