Description of problem: Linux VM get stuck at "wait for launch" state when running it with GPU attached. (restored using Fedora and RHEL VMs). When running Windows7 VM with GPU attached, VM is running properly. Version-Release number of selected component (if applicable): engine: 3.6.0-0.11.master.el6 host: vdsm-4.17.2-1.el7ev.noarch sanlock-3.2.4-1.el7.x86_64 qemu-kvm-rhev-2.3.0-18.el7.x86_64 libvirt-client-1.2.17-4.el7.x86_64 How reproducible: Not consistently Steps to Reproduce: 1. Create RHEL7 or Fedora22 VM and attach GPU to it (in my case quadro k4200) 2. Run VM. Actual results: VM get stuck at wait for launch state. Expected results: VM should run properly. Additional info: Log-collector attached. Relevant host for this bug is intel-vfio
Created attachment 1064850 [details] LogCollector
i could not find in the logs any error or any vm stuck in wait for launch, please specify vm name/id and time of the error
I forgot to mention it... VM name: fedora_intel VM ID:153 Host name: intel-vfio Start VM time: 2015-Aug-19, 15:28
vm id is: f9796cb9-758e-49cf-a23a-1484c3fb13be the vm was run on host: intel-vfio but the attached log collector contains logs only from amd-vfio host so we are missing vdsm.log from the correct host please attach this vdsm.log for the time of the run also, did you happen to check the vm status on the host, to verify it is reported in this status on the host as well? (just to rule out monitoring issues)
I did not check VM status on the host. as for the log containment, i guess it's log collector issue. VDSM log attached (vdsm.log.5.xz)
Created attachment 1065128 [details] intel-vfio host, vdsm log
vdsm log indicates the create call got stuck in libvirt
please do include relevant logs from libvirt and qemu as well, and journal/system messages, please
As i understand, these logs are included in logcollector file attached.
(In reply to Nisim Simsolo from comment #9) > As i understand, these logs are included in logcollector file attached. As I understand from comment #4 it doesn't. Without relevant logs there's nothing to look at. Perhaps this is reproducible? If so, please include only relevant time frame data. Thanks. E
Issue occurred again using engine version: RHEVM 3.6.0-0.12.master.el6. This time, i can see the host monitor from GPU card, but in the webadmin it is still in "wait for launch" state. Also, the status of the VM in the host is up. Trying to refresh browser and logout/login is not solving this issue. VDSM and engine.log attached VM ID: ef302904-aab5-4814-b33e-48a8c6de5eb6 VM name: win12_intel
Created attachment 1070934 [details] engine_new.log
Created attachment 1070935 [details] vdsm_new.log
VM start time: 2015-09-07 12:58:38
Looking at the latest logs attached I can see that we lost connectivity to vdsm due heartbeat exceeded issue. This means that vdsm haven't sent response on time and the connection was dropped. I can see that the host is heavily loaded there is bunch of places in the logs like: Thread-21431::INFO::2015-09-07 10:12:07,178::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:56157 stopped Thread-21432::DEBUG::2015-09-07 10:12:11,445::stompreactor::304::yajsonrpc.StompServer::(send) Sending response or Thread-4076::DEBUG::2015-09-07 10:12:16,893::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n691 bytes (691 B) copied, 0.000593475 s, 1.2 MB/s\n'; <rc> = 0 Reactor thread::INFO::2015-09-07 10:12:22,191::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:56158 where there is several seconds missing from the log. Please make sure that your host is able to work properly or increase your heartbeat interval.
this is related to VM events rather than VFIO The host has been rebooted in the meantime (11:57) and since then events are being ignored when VM started again
More detailed explanation: we use monotonic time in the host which is not monotonic after reboot of the host. so let's say that a VM runs on a host and after the host is being rebooted, we run the VM again on the same host - all the events for this VM will be ignored since the engine thinks it already processed newer events
Verified: rhevm-3.6.0.1-0.1.el6 vdsm-4.17.10-5.el7ev.noarch qemu-kvm-rhev-2.3.0-31.el7.x86_64 libvirt-client-1.2.17-5.el7.x86_64 Scenario: 1. Run VM with GPU attached. 2. During launching VM, reboot related host. 3. After host is up again, run VM and verify VM is running properly. 4. Repeat test case, but this time reboot related host while VM is up.