Created attachment 838263 [details]
engine, libvirt, qemu and vdsm logs

Description of problem:
hotplugDisk fails with:

Thread-63::DEBUG::2013-12-18 12:02:29,095::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive
Thread-63::ERROR::2013-12-18 12:02:29,096::vm::3448::vm.Vm::(hotplugDisk) vmId=`86a4451b-6698-407c-9444-c713c77760cc`::Hotplug failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 3446, in hotplugDisk
    self._dom.attachDevice(driveXml)
  File "/usr/share/vdsm/vm.py", line 842, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 399, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.1.rc.el6ev.x86_64
rhevm-3.3.0-0.40.rc.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a file storage pool:
1) Detach a disk from a VM and restart VDSM right after
2) Attach the same disk to the VM when VDSM takes SPM again

Actual results:
hotplugDisk fails

Additional info:
engine, libvirt, qemu and vdsm logs
Created attachment 838791 [details]
engine, libvirt, qemu and vdsm logs (new)

The wrong file was uploaded; uploading the relevant one instead.
Does this reproduce if you wait for the disk to be detached and only then restart VDSM?

Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please comment on it and return the target version to 3.3.
(In reply to Allon Mureinik from comment #2)
> Does this reproduce if you wait for the disk to be detached and only then
> restart VDSM?
> Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please
> comment on it and return the target version to 3.3.

No.
No...

What happens in the hotunplug code:
1. Detach the disk from the Python VM object.
2. Save the VM object state to disk.
3. Hot-unplug the disk through libvirt.

The restart probably happened between steps 2 and 3, as sketched below.
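For illustration only, here is a minimal sketch of that ordering; hotunplug_disk, save_state and vm.devices are hypothetical names, not the actual VDSM code, and dom.detachDevice() stands for the libvirt call in step 3:

# Minimal sketch of the ordering described above (hypothetical names).
def hotunplug_disk(vm, drive):
    vm.devices.remove(drive)          # 1. detach the disk from the Python VM object
    vm.save_state()                   # 2. persist the VM object state to disk
    vm.dom.detachDevice(drive.xml)    # 3. hot-unplug the disk through libvirt
    # A restart between steps 2 and 3 leaves the saved state claiming the disk
    # is detached while QEMU still holds its drive ID, so a later attach fails
    # with "Duplicate ID 'drive-virtio-diskN' for drive".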
Yeela, any update on this?
(In reply to Ayal Baron from comment #6)
> Yeela, any update on this?

Ping?
The bugs in that area are general and not disk-specific (for example, they are relevant for NICs as well).

The issue is that we persist the VM configuration on the VDSM side for different purposes, so it can always be inconsistent with the actual state in libvirt, and we are missing handling for that.

The solution here should be more general and not limited to this specific case. Allon, what's your take on that?
(In reply to Liron Aravot from comment #10)
> The bugs in that area are general and not disk-specific (for example, they
> are relevant for NICs as well).
>
> The issue is that we persist the VM configuration on the VDSM side for
> different purposes, so it can always be inconsistent with the actual state
> in libvirt, and we are missing handling for that.
>
> The solution here should be more general and not limited to this specific
> case. Allon, what's your take on that?

I agree.

Michal, can one of your guys take a look at this, please?
I don't see anything general; I see the hotunplugDisk() code doing exactly what's described in comment #5. Yes, it's the case for networking too (adding Dan), but it's not the case for CPU hotplug, for example.
In CPU hotplug we have a very similar bug: we set (and save) conf['smp'] slightly after setVcpusFlags() succeeds, which means that a crash would create a discrepancy between libvirt and VDSM's conf. Luckily for CPU hotplug, nothing really cares about this.

We should use vm.conf to create the domxml and then drop it. When it's needed for backward compatibility (the list verb, migration), it has to be recalculated from libvirt's domxml. However, this is a big task that can be solved piece-wise, hence I'm keeping this bug on storage.
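As a rough illustration of the recalculate-from-domxml direction (assuming a libvirt domain object `dom`; the helper name is made up and this is not a proposed patch):

# Hedged sketch: derive the current vCPU count from libvirt's domain XML
# instead of trusting a saved conf['smp']; names are illustrative only.
from xml.dom import minidom

def current_vcpus(dom):
    """Return the vCPU count libvirt actually reports for the domain."""
    vcpu = minidom.parseString(dom.XMLDesc(0)).getElementsByTagName('vcpu')[0]
    # Prefer the 'current' attribute (updated by CPU hotplug); fall back to
    # the static maximum in the element body.
    return int(vcpu.getAttribute('current') or vcpu.firstChild.data)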
This problem occurs when detaching a device from a VM that has no OS installed. (Specifically, calling libvirt to detach a device in that state returns success, while the device is actually never detached. This causes the device to be duplicated, and at some point it can no longer be attached, producing the described message.)

Elad, did you test this issue with a VM that has an OS installed?
I don't remember and we cannot know from the logs.
It seems like this bug originates in Bug 807023, which by now is fixed. (Basically, libvirt didn't check whether detaching a disk succeeded before deleting it from its pci-stubs, so when trying to attach the device again, the device ID was already taken. The fix was to verify that the device was actually detached, and only then remove it from the configuration.)

VDSM doesn't even handle the device's ID; it only requests an XML description of the disk to be attached. However, VDSM should also verify that the disk is detached, since libvirt returns success when it manages to send the guest a detach REQUEST, which doesn't mean the device is actually detached. The posted patch adds exactly that functionality. With it in place, detaching and attaching should work correctly even if unplugging did not succeed.
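A hedged sketch of that kind of verification (not the posted patch itself; `wait_for_detach` and its parameters are illustrative, assuming a libvirt domain object `dom` and a target device name such as 'vdb'):

# Illustrative only: after libvirt accepts the detach request, poll the live
# domain XML until the drive's <target dev=...> disappears or a timeout expires.
import time
from xml.dom import minidom

def wait_for_detach(dom, dev_name, timeout=30, interval=1):
    """Return True once no <target dev='dev_name'> remains in the domain XML."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        targets = minidom.parseString(dom.XMLDesc(0)).getElementsByTagName('target')
        if not any(t.getAttribute('dev') == dev_name for t in targets):
            return True
        time.sleep(interval)
    return False  # the guest never completed the detach request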
Note that if we cherry-pick this patch to 3.5.z, we must change VDSM's spec in RHEL 6 to require libvirt >= 0.10.2-38.el6, as this bug depends on the changes made in Bug 807023.
In oVirt, testing is done on a single release by default; therefore I'm removing the 3.6 flag. If you think this bug must be tested in 3.6 as well, please re-add the flag. Please note we might not have the testing resources to handle the 3.6 clone.
Verification steps for the bug:

1. Unsuccessful unplugging:
- Create a VM.
- Add a disk to it; DO NOT install any OS.
- Run the VM.
- Try to hot-unplug the disk.

VDSM should throw a timeout error for not being able to unplug the disk.

2. Successful unplugging:
- Run a VM with an OS installed.
- Hot-unplug a disk other than the one holding the OS.

The disk should be successfully unplugged.
(In reply to Amit Aviram from comment #20)
> Verification steps for the bug:
> 1. Unsuccessful unplugging:
> - Create a VM.
> - Add a disk to it; DO NOT install any OS.
> - Run the VM.
> - Try to hot-unplug the disk.
>
> VDSM should throw a timeout error for not being able to unplug the disk.

Thread-3456::ERROR::2015-11-10 16:27:22,006::vm::4091::vm.Vm::(hotunplugDisk) vmId=`426b742f-5ce6-477d-a9fc-b9d9dbd83afe`::Timeout detaching drive vdb

> 2. Successful unplugging:
> - Run a VM with an OS installed.
> - Hot-unplug a disk other than the one holding the OS.
>
> The disk should be successfully unplugged.

Disk successfully unplugged.

Verified using vt18.2.

Amit, shouldn't we also trigger a VDSM restart (as stated in comment #5)?
No. The issue described in the bug description originates in libvirt's Bug 807023, which is now fixed. The problem on our side was that we didn't check at all whether libvirt succeeded in unplugging the disk; it has nothing to do with restarting VDSM while plugging/hotplugging (which Bug 980914 describes).
oVirt 3.5.6 has been released and the bz verified, moving to closed current release.
*** Bug 1294071 has been marked as a duplicate of this bug. ***