Description of problem:
Hot plug seems to fail for multiple disks

Version-Release number of selected component (if applicable):
ovirt 3.6.0-15
vdsm-xmlrpc-4.17.9-1.el7ev.noarch
vdsm-python-4.17.9-1.el7ev.noarch
vdsm-jsonrpc-4.17.9-1.el7ev.noarch
vdsm-4.17.9-1.el7ev.noarch
vdsm-cli-4.17.9-1.el7ev.noarch
vdsm-infra-4.17.9-1.el7ev.noarch
vdsm-yajsonrpc-4.17.9-1.el7ev.noarch
libvirt-daemon-driver-nwfilter-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-interface-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-network-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-nodedev-1.2.17-13.el7.ppc64le
libvirt-daemon-kvm-1.2.17-13.el7.ppc64le
libvirt-python-1.2.17-2.el7.ppc64le
libvirt-client-1.2.17-13.el7.ppc64le
libvirt-daemon-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-secret-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-storage-1.2.17-13.el7.ppc64le
libvirt-daemon-driver-qemu-1.2.17-13.el7.ppc64le
libvirt-daemon-config-nwfilter-1.2.17-13.el7.ppc64le
libvirt-lock-sanlock-1.2.17-13.el7.ppc64le
qemu-img-rhev-2.3.0-31.el7.ppc64le
qemu-kvm-tools-rhev-2.3.0-31.el7.ppc64le
qemu-kvm-rhev-2.3.0-31.el7.ppc64le
qemu-kvm-common-rhev-2.3.0-31.el7.ppc64le

How reproducible:
2/2

Steps to Reproduce:
1. Create a vm with an OS on it, start it
2. Create 4 disks, cow - virtio, 1 GB
3.
For each disk, attach it to the vm and activate it

Actual results:
Third disk activation fails

Additional info:

2015-10-21 15:39:14,188 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-100) [146b4c93] VM 'c71aea19-bde7-4d90-bd01-f9bee1512238' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='258c99f3-eae6-43e6-ae2b-17aede7ecd59', vmId='c71aea19-bde7-4d90-bd01-f9bee1512238'}', device='vnc', type='GRAPHICS', bootOrder='0', specParams='[]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', usingScsiReservation='false'}'
2015-10-21 15:39:16,216 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Unexpected return value: StatusForXmlRpc [code=-32603, message=VM u'c71aea19-bde7-4d90-bd01-f9bee1512238' was not started yet or was shut down]
2015-10-21 15:39:16,216 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Failed in 'HotPlugDiskVDS' method
2015-10-21 15:39:16,224 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_3 command failed: VM u'c71aea19-bde7-4d90-bd01-f9bee1512238' was not started yet or was shut down
2015-10-21 15:39:16,224 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=-32603, message=VM u'c71aea19-bde7-4d90-bd01-f9bee1512238' was not started yet or was shut down]]'
2015-10-21 15:39:16,224 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] HostName = host_mixed_3
2015-10-21 15:39:16,224 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Command 'HotPlugDiskVDSCommand(HostName = host_mixed_3, HotPlugDiskVDSParameters:{runAsync='true', hostId='e3481b6a-0d8e-425d-a33d-e42817c95d56', vmId='c71aea19-bde7-4d90-bd01-f9bee1512238', diskId='392f7373-ef61-4843-bae1-9d32470fe09f', addressMap='null'})' execution failed: VDSGenericException: VDSErrorException: Failed to HotPlugDiskVDS, error = VM u'c71aea19-bde7-4d90-bd01-f9bee1512238' was not started yet or was shut down, code = -32603
2015-10-21 15:39:16,224 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugDiskVDSCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] FINISH, HotPlugDiskVDSCommand, log id: 3122eb52
2015-10-21 15:39:16,225 ERROR [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Command 'org.ovirt.engine.core.bll.HotPlugDiskToVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HotPlugDiskVDS, error = VM u'c71aea19-bde7-4d90-bd01-f9bee1512238' was not started yet or was shut down, code = -32603 (Failed with error unexpected and code 16)
2015-10-21 15:39:16,294 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Correlation ID: disks_syncAction_bf5af1e7-5064-41f6, Call Stack: null, Custom Event ID: -1, Message: Failed to plug disk disk_2_6022 to VM single_disk_snapshot_vm_iscsi (User: admin@internal).
2015-10-21 15:39:16,294 INFO [org.ovirt.engine.core.bll.HotPlugDiskToVmCommand] (ajp-/127.0.0.1:8702-12) [disks_syncAction_bf5af1e7-5064-41f6] Lock freed to object 'EngineLock:{exclusiveLocks='[392f7373-ef61-4843-bae1-9d32470fe09f=<DISK, ACTION_TYPE_FAILED_DISKS_LOCKED$diskAliases disk_2_6022>]', sharedLocks='[c71aea19-bde7-4d90-bd01-f9bee1512238=<VM, ACTION_TYPE_FAILED_VM_IS_LOCKED>]'}'
2015-10-21 15:39:16,295 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-12) [] Operation Failed: [Unexpected exception]
Created attachment 1085125 [details]
engine.log
Can you guys look at this? I'm not sure how to debug this issue or what logs to include (to properly open a bug for the RHEL project).
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Any update on this?
Hi Carlos. Can you please attach full logs of VDSM as well?
Created attachment 1090309 [details]
engine.log

You can see how the action that starts at 2015-11-05 18:54:23,296 with [disks_syncAction_b464cae9-f1e6-4385] fails later with code=-32603.
Created attachment 1090310 [details]
vdsm.log

I just realized this host has a different time zone than the engine, so the timestamps are -7 hours (execution starts at 11:46).
Comment on attachment 1085125 [details]
engine.log

Old run; please check the newer engine.log that corresponds with the vdsm.log.
It seems like the vdsm logs attached are still not the ones that correspond to the engine logs.
Carlos, can you recheck that and upload the vdsm logs which contain the right entries? It may be that what we are looking for is located in older log files in /var/log/vdsm/ rather than vdsm.log itself.
Thanks.
what's the guest OS selected in the UI? what disk interface?
(In reply to Michal Skrivanek from comment #10)
> what's the guest OS selected in the UI? what disk interface?

Michal,
This is all done via the REST API. The OS selected is RHEL 7 ppc64, and the interface is VIRTIO for all disks. This happens with both iSCSI and NFS.
(In reply to Amit Aviram from comment #9)
> It seems like the vdsm logs attached are still not the ones that correspond
> to the engine logs.
> Carlos- can you recheck that and upload the vdsm logs which contains the
> right logs?
> it may be that what we are looking for is located in older log files in
> /var/log/vdsm/ rather than vdsm.log itself.
>
> Thanks.

Hi Amit,
Yes, you're right. I'll upload the proper logs.
Created attachment 1092746 [details]
Logs for new run (in UI)

Hi,
Attached are the vdsm, engine, libvirtd, and qemu logs, plus the vm's messages. The action starts at 15:10 (vdsm's 08:10). I started a vm (Custom OS/Desktop) via the UI and added three disks (Thin/iSCSI/Virtio/1GB); after the third disk the vm shuts down at 15:15 with "VM golden_env_mixed_virtio_0 is down with error. Exit message: Lost connection with qemu process."
I couldn't find the qemu core dump in /var/log/core/ (even though I followed the instructions for the dump). If there's anything else missing please tell me. The failure reproduces 100% of the time, though.
The logs now are still different from the bug description; there's no "Failed to plug disk disk_2_6022 to VM single_disk_snapshot_vm_iscsi" message anywhere.

Also, all hotplug requests seem to be successful in the logs.

The only error in the vdsm logs is:

libvirtEventLoop::ERROR::2015-11-11 08:14:20,600::clientIF::597::vds::(dispatchLibvirtEvents) Error running VM callback
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 570, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/virt/vm.py", line 4458, in onLibvirtLifecycleEvent
    eventToString(event), detail, opaque)
  File "/usr/share/vdsm/virt/vm.py", line 171, in eventToString
    return _EVENT_STRINGS[event]
IndexError: tuple index out of range

which is probably something related to virt, or even libvirt.

Francesco, maybe you could shed some light on this: where do you think this bug belongs?
(In reply to Amit Aviram from comment #14)
> The logs now are still different than the bug description, there's no
> "Failed to plug disk disk_2_6022 to VM single_disk_snapshot_vm_iscsi"
> message anywhere..
>
> Also, all hotplug requests seems to be successful in the logs..
>
> The only error in vdsm logs is:
>
> libvirtEventLoop::ERROR::2015-11-11
> 08:14:20,600::clientIF::597::vds::(dispatchLibvirtEvents) Error running VM
> callback
> Traceback (most recent call last):
>   File "/usr/share/vdsm/clientIF.py", line 570, in dispatchLibvirtEvents
>     v.onLibvirtLifecycleEvent(event, detail, None)
>   File "/usr/share/vdsm/virt/vm.py", line 4458, in onLibvirtLifecycleEvent
>     eventToString(event), detail, opaque)
>   File "/usr/share/vdsm/virt/vm.py", line 171, in eventToString
>     return _EVENT_STRINGS[event]
> IndexError: tuple index out of range
>
> which is probably something related to virt, or even libvirt..
>
> Francesco, maybe you could shed some light on this, where do you think this
> bug belongs?

The stacktrace you just reported is virt for sure; we need to handle this case. Maybe there is one underlying libvirt bug, but we need to fix VDSM before thinking about that. I don't think this stacktrace is causing the hotplug flow to fail.

As for the original issue, AFAIU we are still hunting for the VDSM logs which highlight it, so I can't tell yet.
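For context, the IndexError in the traceback comes from indexing a fixed tuple of event-name strings with a libvirt lifecycle event id that the tuple doesn't cover (newer libvirt versions added ids beyond the original set, such as the Crashed event). A minimal sketch of the failure mode and a defensive lookup; the tuple contents and helper names here are illustrative, not VDSM's actual code:

```python
# Illustrative tuple of lifecycle event names, indexed by libvirt's
# numeric event id (a sketch, not VDSM's real _EVENT_STRINGS).
_EVENT_STRINGS = (
    "Defined", "Undefined", "Started", "Suspended",
    "Resumed", "Stopped", "Shutdown",
)

def event_to_string(event):
    # Naive lookup: raises IndexError for event ids introduced by newer
    # libvirt (e.g. the Crashed event), matching the traceback above.
    return _EVENT_STRINGS[event]

def event_to_string_safe(event):
    # Defensive variant: an unknown id yields a placeholder instead of
    # crashing the libvirt event dispatch loop.
    try:
        return _EVENT_STRINGS[event]
    except IndexError:
        return "Unknown(%d)" % event
```

With a hypothetical Crashed id of 8, `event_to_string(8)` raises `IndexError: tuple index out of range`, while `event_to_string_safe(8)` degrades gracefully to `"Unknown(8)"`.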
See patch.

Excluding catastrophic libvirt errors, the only event we don't handle yet is "Crashed", which indeed seems to have happened:

2015-11-11 13:14:20.600+0000: 92604: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff88013a20 buf={"timestamp": {"seconds": 1447247660, "microseconds": 600193}, "event": "GUEST_PANICKED", "data": {"action": "pause"}} len=120
2015-11-11 13:14:20.600+0000: 92604: debug : qemuMonitorEmitEvent:1253 : mon=0x3fff88013a20 event=GUEST_PANICKED
2015-11-11 13:14:20.600+0000: 92604: info : virObjectRef:296 : OBJECT_REF: obj=0x3fff88013a20
2015-11-11 13:14:20.600+0000: 92604: debug : qemuProcessHandleEvent:643 : vm=0x3fff8800baf0
2015-11-11 13:14:20.600+0000: 92604: info : virObjectNew:202 : OBJECT_NEW: obj=0x10002218a20 classname=virDomainQemuMonitorEvent

http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainEventCrashedDetailType

Now, qemu raises this event when - quoting the docs:

GUEST_PANICKED
--------------
Emitted when guest OS panic is detected.
Data:
- "action": Action that has been taken (json-string, currently always "pause").
Example:
{ "event": "GUEST_PANICKED", "data": { "action": "pause" } }

So it seems the guest OS crashed after the hotplug attempt. Maybe this explains (partially?) what happened here?
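The qemu monitor lines quoted above carry the event as plain QMP JSON; a small illustrative sketch of picking the documented fields out of that payload:

```python
import json

# The GUEST_PANICKED payload as it appears in the qemu monitor buffer above.
raw = ('{"timestamp": {"seconds": 1447247660, "microseconds": 600193}, '
       '"event": "GUEST_PANICKED", "data": {"action": "pause"}}')

msg = json.loads(raw)
event = msg["event"]            # "GUEST_PANICKED"
action = msg["data"]["action"]  # currently always "pause", per the qemu docs
```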
Carlos, do the original logs from your error still exist, rather than the ones you uploaded? (See if you can find the error code "-32603" somewhere in the log files.)
As Francesco mentioned, the logs uploaded describe a different bug. Anyway, both of the errors here need to be handled, so can you please also open a separate bug for the new logs?
Created attachment 1097259 [details]
all logs from new run (engine, vdsm, libvirtd)

Hi,
I ran the tests that caused the issue in the first place again (sorry for the confusion). You'll see the same error in the engine. The scenario is the same: there's a single_disk_snapshot_vm_iscsi vm up and we create 4 disks (disk_0_6022 to disk_3_6022). At 16:09:10 (from the engine) we start attaching them one by one, until the 3rd fails at:

16:09:22 VDSM host_mixed_2 command failed: VM u'cf986c6e-a941-4ec2-80fd-7e78207f4d40' was not started yet or was shut down
16:09:22 Failed to plug disk disk_2_6022 to VM single_disk_snapshot_vm_iscsi (User: admin@internal).

The vdsm log time is different; it starts at 08:01:04.
Francesco, the new logs show:

@2015-11-20 11:09:18:
libvirtEventLoop::ERROR occurred.

@2015-11-20 11:09:20:
Trying to
- "self._dom.attachDevice(driveXml)",
we get:
- "VM u'cf986c6e-a941-4ec2-80fd-7e78207f4d40' was not started yet or was shut down"

It looks like the hotplug error described is just a consequence of some other bug that caused the VM to shut down. So what do you think? Should this be addressed to virt?
(In reply to Amit Aviram from comment #19)
> Franchesco- the new logs shows:
>
> @2015-11-20 11:09:18:
> libvirtEventLoop::ERROR occured.
>
> @2015-11-20 11:09:20:
> Trying to
> - "self._dom.attachDevice(driveXml)",
> we get:
> - "VM u'cf986c6e-a941-4ec2-80fd-7e78207f4d40' was not started yet or was
> shut down"
>
> It looks like the hotplug error described is just an implication of some
> other bug that caused the VM to shut down. so what do you think? this should
> be addressed to virt?

Hi Amit, this is very interesting. What happened here is that the guest panicked, and then shut down.

Here VDSM receives the panic event:

libvirtEventLoop::ERROR::2015-11-20 11:09:18,366::clientIF::597::vds::(dispatchLibvirtEvents) Error running VM callback
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 570, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
  File "/usr/share/vdsm/virt/vm.py", line 4458, in onLibvirtLifecycleEvent
    eventToString(event), detail, opaque)
  File "/usr/share/vdsm/virt/vm.py", line 171, in eventToString
    return _EVENT_STRINGS[event]
IndexError: tuple index out of range

VDSM fails to handle this event (fixed in https://gerrit.ovirt.org/#/c/48790/ ).

Libvirt kills the qemu instance shortly afterwards:

2015-11-20 16:09:18.365+0000: 78191: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff900048d0 buf={"timestamp": {"seconds": 1448035758, "microseconds": 365580}, "event": "GUEST_PANICKED", "data": {"action": "pause"}} len=120
2015-11-20 16:09:18.365+0000: 78191: debug : qemuMonitorEmitEvent:1253 : mon=0x3fff900048d0 event=GUEST_PANICKED
2015-11-20 16:09:18.365+0000: 81794: info : virObjectNew:202 : OBJECT_NEW: obj=0x3fff48186630 classname=virDomainEventLifecycle
2015-11-20 16:09:18.365+0000: 78191: info : virObjectNew:202 : OBJECT_NEW: obj=0x10008b7e360 classname=virDomain
2015-11-20 16:09:18.366+0000: 81794: debug : processGuestPanicEvent:4045 : Preserving lock state '<null>'
2015-11-20 16:09:18.366+0000: 81794: debug : qemuProcessKill:5146 : vm=0x3fff9000bb60 name=single_disk_snapshot_vm_iscsi pid=82118 flags=1
2015-11-20 16:09:18.366+0000: 81794: debug : virProcessKillPainfully:352 : vpid=82118 force=1
2015-11-20 16:09:18.366+0000: 78191: info : virObjectRef:296 : OBJECT_REF: obj=0x3fff900048d0
2015-11-20 16:09:18.366+0000: 78191: info : qemuMonitorIOProcess:452 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff900048d0 buf={"timestamp": {"seconds": 1448035758, "microseconds": 366193}, "event": "SHUTDOWN"}

Not sure we can control this behaviour, or whether we want to. Will check the libvirt docs/sources. However, in VDSM we just see the consequences of libvirt's actions:

libvirtEventLoop::DEBUG::2015-11-20 11:09:19,618::vm::4458::virt.vm::(onLibvirtLifecycleEvent) vmId=`cf986c6e-a941-4ec2-80fd-7e78207f4d40`::event Stopped detail 2 opaque None
libvirtEventLoop::INFO::2015-11-20 11:09:19,618::vm::783::virt.vm::(_onQemuDeath) vmId=`cf986c6e-a941-4ec2-80fd-7e78207f4d40`::underlying process disconnected
libvirtEventLoop::INFO::2015-11-20 11:09:19,618::vm::3721::virt.vm::(releaseVm) vmId=`cf986c6e-a941-4ec2-80fd-7e78207f4d40`::Release VM resources

So this explains why the next attachDevice, triggered by hotplugDisk, fails.

I suggest some action items:
- (virt, myself): learn how libvirt handles panics, post patches if needed
- (storage): since the panic seems related to hotplug, learn if there is a qemu bug here, or in general why this panic happens. Maybe a known issue?
(In reply to Francesco Romani from comment #20)
> I suggest some action items:
> - (virt, myself): learn about libvirt handles panics, post patches if needed

Turns out it was quick: http://libvirt.org/formatdomain.html#elementsEvents
These are the tunables we can set. I don't see any clear benefit in this case, however.
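For reference, the tunables at that link are the per-domain lifecycle actions in the libvirt domain XML; a fragment like the following (an illustrative sketch, not the configuration used in this setup) selects what libvirt does when qemu reports a guest panic:

```xml
<!-- Lifecycle actions per http://libvirt.org/formatdomain.html#elementsEvents -->
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<!-- on GUEST_PANICKED: tear the domain down; other valid values include
     restart, preserve, coredump-destroy, coredump-restart -->
<on_crash>destroy</on_crash>
```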
Carlos, can you please provide the "rpm -qa" output on the host which runs the VM? Thanks.
Created attachment 1102201 [details]
List of packages on Host

List of packages on the host that runs the vm. Notice that qemu-kvm-rhev is a special build for bz1279052, but the behaviour is the same.
I tested this again in the new version, made sure the kernel is the proper one, and I don't see it happening anymore:

Host:
kernel: 3.10.0-327.2.1.el7.ppc64le
qemu-kvm-rhev-debuginfo-2.3.0-31.el7_2.5.ppc64le
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.2.ppc64le
qemu-kvm-tools-rhev-2.3.0-31.el7_2.5.ppc64le
qemu-kvm-common-rhev-2.3.0-31.el7_2.5.ppc64le
qemu-kvm-rhev-2.3.0-31.el7_2.5.ppc64le
qemu-img-rhev-2.3.0-31.el7_2.5.ppc64le

Guest:
kernel: 3.10.0-327.2.1.el7.ppc64le

ovirt: rhevm-3.6.1.3-0.1.el6.noarch

I'm marking this as VERIFIED (or CLOSED, since it depends on the packages, not rhevm?).
According to what you are saying it can't be verified, as it is not a bug; closing as "not a bug".