Bug 1044466

Summary: VDSM ignores libvirt's success/failure when trying to disconnect a disk.
Product: [oVirt] vdsm
Reporter: Elad <ebenahar>
Component: General
Assignee: Amit Aviram <aaviram>
Status: CLOSED CURRENTRELEASE
QA Contact: Aharon Canan <acanan>
Severity: high
Priority: unspecified
Version: ---
CC: aaviram, amureini, bazulay, bugs, danken, ebenahar, gklein, lpeer, mgoldboi, michal.skrivanek, rbalakri, sapandit, sbonazzo, scohen, yeylon, ylavi
Target Milestone: ovirt-3.5.6
Flags: rule-engine: ovirt-3.5.z+, ylavi: planning_ack+, amureini: devel_ack+, rule-engine: testing_ack+
Target Release: 4.16.28
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Fixed In Version: v4.17.5
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2015-12-22 13:25:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Bug Depends On: 807023
Attachments:
- engine, libvirt, qemu and vdsm logs
- engine, libvirt, qemu and vdsm logs (new)

Description Elad 2013-12-18 11:13:52 UTC
Created attachment 838263 [details]
engine, libvirt, qemu and vdsm logs

Description of problem:
hotplugDisk fails with: 

Thread-63::DEBUG::2013-12-18 12:02:29,095::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 1 edom: 10 level: 2 message: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive
Thread-63::ERROR::2013-12-18 12:02:29,096::vm::3448::vm.Vm::(hotplugDisk) vmId=`86a4451b-6698-407c-9444-c713c77760cc`::Hotplug failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 3446, in hotplugDisk
    self._dom.attachDevice(driveXml)
  File "/usr/share/vdsm/vm.py", line 842, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 399, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: internal error unable to execute QEMU command '__com.redhat_drive_add': Duplicate ID 'drive-virtio-disk1' for drive

Version-Release number of selected component (if applicable):
vdsm-4.13.2-0.1.rc.el6ev.x86_64
rhevm-3.3.0-0.40.rc.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a file storage pool:
1) Detach a disk from a VM and restart vdsm right after
2) Attach the same disk to the VM once vdsm takes SPM again

Actual results:
hotplugDisk fails


Additional info: 
engine, libvirt, qemu and vdsm logs

Comment 1 Elad 2013-12-19 07:30:23 UTC
Created attachment 838791 [details]
engine, libvirt, qemu and vdsm logs (new)

The wrong file was uploaded; uploading the relevant file (new).

Comment 2 Allon Mureinik 2013-12-19 11:14:43 UTC
Does this reproduce if you wait for the disk to be detached and only then restart VDSM?
Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please comment on it and return the target version to 3.3

Comment 3 Elad 2013-12-19 11:57:37 UTC
(In reply to Allon Mureinik from comment #2)
> Does this reproduce if you wait for the disk to be detached and only then
> restart VDSM?
> Assuming the answer is "no", deferring this to 3.3.1. If it is "yes", please
> comment on it and return the target version to 3.3

no

Comment 5 Yeela Kaplan 2013-12-30 09:05:44 UTC
no...
What happens in the hotunplug code:
1. detach the disk from the Python VM object
2. save the VM object state to disk
3. hot-unplug the disk through libvirt

The restart probably happened between steps 2 and 3.
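
A minimal sketch of that ordering (hypothetical names, not VDSM's actual code): the configuration is persisted before libvirt is asked to unplug, so a vdsm restart between steps 2 and 3 leaves the saved config and libvirt's view of the domain out of sync.

import json

def hotunplug_disk(vm_conf, conf_path, dom, drive_xml, drive_name):
    # 1. detach the disk from the in-memory VM configuration
    vm_conf['devices'] = [d for d in vm_conf['devices']
                          if d.get('name') != drive_name]
    # 2. persist the (now optimistic) configuration to disk
    with open(conf_path, 'w') as f:
        json.dump(vm_conf, f)
    # a vdsm restart here recovers a config without the disk,
    # while libvirt still reports the device attached
    # 3. only now ask libvirt to detach the device
    dom.detachDevice(drive_xml)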

Comment 6 Ayal Baron 2014-02-16 09:59:56 UTC
Yeela, any update on this?

Comment 8 Allon Mureinik 2014-03-17 10:06:54 UTC
(In reply to Ayal Baron from comment #6)
> Yeela, any update on this?
Ping?

Comment 10 Liron Aravot 2014-08-12 12:17:25 UTC
The bugs in this area are general and not disk-specific (for example, they are relevant for NICs as well).

The issue is that we persist the VM configuration on the vdsm side for different purposes, so it can always be inconsistent with the actual state in libvirt, and we are missing handling for that.

The solution here should be more general and not for this specific case. Allon, what's your take on that?

Comment 11 Allon Mureinik 2014-08-12 20:16:24 UTC
(In reply to Liron Aravot from comment #10)
> The bugs in this area are general and not disk-specific (for example, they
> are relevant for NICs as well).
> 
> The issue is that we persist the VM configuration on the vdsm side for
> different purposes, so it can always be inconsistent with the actual state
> in libvirt, and we are missing handling for that.
> 
> The solution here should be more general and not for this specific case.
> Allon, what's your take on that?

I agree.
Michal - can one of your guys take a look at this please?

Comment 12 Michal Skrivanek 2014-08-14 14:32:13 UTC
I don't see anything general; I see the hotunplugDisk() code doing exactly what's described in comment #5.
Yes, it's the case for networking too (adding Dan), but it's not the case for hotplug CPU, for example.

Comment 13 Dan Kenigsberg 2014-08-15 15:37:25 UTC
In hotplug CPU we have a very similar bug: we set (and save) conf['smp'] slightly after setVcpusFlags() succeeds, which means that a crash would create a discrepancy between libvirt and vdsm's conf. Luckily for hotplug CPU, nothing really cares about this.

We should use vm.conf to create domxml and then drop it. When it's needed for backward compatibility (list verb, migration), it has to be re-calculated from libvirt's domxml.

However, this is a big task that can be solved piecewise, hence I'm keeping this bug on storage.
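
For illustration, a minimal sketch of recalculating such a value from libvirt's own domain XML instead of trusting the persisted conf (a hypothetical helper, not VDSM code; assumes dom is a libvirt.virDomain):

from xml.etree import ElementTree

def current_vcpus(dom):
    # XMLDesc() returns the live domain XML maintained by libvirt
    root = ElementTree.fromstring(dom.XMLDesc(0))
    vcpu = root.find('vcpu')
    # the 'current' attribute holds the hotplugged vCPU count;
    # the element text holds the maximum
    return int(vcpu.get('current', vcpu.text))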

Comment 14 Amit Aviram 2015-08-25 14:19:27 UTC
This problem occurs when detaching a disk from a VM that has no OS installed. (Specifically, calling libvirt to detach a device in that state returns success, while the device is actually never detached. This causes the device to be duplicated, and at some point it can no longer be attached, failing with the described message.)

Elad, did you test this issue with a VM that has an OS installed?

Comment 15 Elad 2015-08-26 11:02:31 UTC
I don't remember and we cannot know from the logs.

Comment 16 Amit Aviram 2015-08-27 14:00:45 UTC
It seems like this bug originates in Bug 807023, which is fixed by now.
(Basically, libvirt didn't check whether detaching a disk succeeded and deleted it from its pci-stubs anyway; when trying to attach the device again, the device ID was already taken. The bug fix was to verify that the device was actually detached, and only then remove it from the configuration.)

VDSM doesn't even handle the device's ID; it only passes an XML description of the disk to be attached.
However, VDSM should also verify that the disk is detached, as libvirt returns success when it succeeds in sending the guest a detach REQUEST, which doesn't mean the device is actually detached.

The posted patch adds exactly that functionality. With it, detaching and attaching should work correctly even if unplugging did not succeed.
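
For illustration only, a sketch of that kind of verification (hypothetical helper names, not the actual posted patch): after asking libvirt to detach, poll the live domain XML until the drive's alias disappears, and raise a timeout if the guest never releases it.

import time
from xml.etree import ElementTree

class HotunplugTimeout(Exception):
    pass

def wait_for_detach(dom, alias, timeout=30, interval=1):
    deadline = time.time() + timeout
    while time.time() < deadline:
        root = ElementTree.fromstring(dom.XMLDesc(0))
        aliases = [a.get('name') for a in root.iter('alias')]
        if alias not in aliases:
            return  # the device is really gone from the domain XML
        time.sleep(interval)
    raise HotunplugTimeout("Timeout detaching drive with alias %r" % alias)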

Comment 17 Amit Aviram 2015-08-27 14:43:13 UTC
Note that if we cherry-pick this patch to 3.5.z, we must change VDSM's spec on rhel6 to require libvirt >= 0.10.2-38.el6, as this bug depends on changes made in Bug 807023.

Comment 18 Sandro Bonazzola 2015-10-30 09:36:39 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing
the 3.6 flag. If you think this bug must be tested in 3.6 as well, please
re-add the flag. Please note we might not have the testing resources to handle
the 3.6 clone.

Comment 19 Amit Aviram 2015-11-10 13:39:34 UTC
Verification steps for the bug:
-

Comment 20 Amit Aviram 2015-11-10 13:43:44 UTC
Verification steps for the bug:
1. unsuccessful unplugging:
- Create a VM
- Add a disk to it, DO NOT install any OS.
- run the VM
- Try to hotunplug the disk.

VDSM should throw a timeout error for not being able to unplug the disk.


2. successful unplugging:
- Run a VM with OS installed
- hotunplug a disk other than the one with OS.

The disk should be successfully unplugged.

Comment 21 Aharon Canan 2015-11-10 14:38:36 UTC
(In reply to Amit Aviram from comment #20)
> Verification steps for the bug:
> 1. unsuccessful unplugging:
> - Create a VM
> - Add a disk to it, DO NOT install any OS.
> - run the VM
> - Try to hotunplug the disk.
> 
> VDSM should throw a timeout error for not being able to unplug the disk.
> 

Thread-3456::ERROR::2015-11-10 16:27:22,006::vm::4091::vm.Vm::(hotunplugDisk) vmId=`426b742f-5ce6-477d-a9fc-b9d9dbd83afe`::Timeout detaching drive vdb


> 
> 2. successful unplugging:
> - Run a VM with OS installed
> - hotunplug a disk other than the one with OS.
> 
> The disk should be successfully unplugged.

Disk successfully unplugged

Verified using vt18.2


Amit - 
Shouldn't we trigger a vdsm restart (like stated in comment #5)?

Comment 22 Amit Aviram 2015-11-10 15:11:07 UTC
No. The issue described in the bug description originates in libvirt's Bug 807023, which is now fixed.
The problem on our side was that we didn't check at all whether libvirt succeeded in unplugging the disk; it has nothing to do with restarting VDSM while plugging/hotplugging (which Bug 980914 describes).

Comment 23 Sandro Bonazzola 2015-12-22 13:25:32 UTC
oVirt 3.5.6 has been released and the BZ has been verified; moving to CLOSED CURRENTRELEASE.

Comment 24 Amit Aviram 2016-01-24 13:12:02 UTC
*** Bug 1294071 has been marked as a duplicate of this bug. ***