Bug 1255033 - VM status events are ignored after host is rebooted
Summary: VM status events are ignored after host is rebooted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Arik
QA Contact: Nisim Simsolo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-08-19 13:12 UTC by Nisim Simsolo
Modified: 2016-04-20 01:11 UTC (History)
13 users

Fixed In Version: 3.6.0-12
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-20 01:11:49 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments
LogCollector (18.10 MB, application/x-xz)
2015-08-19 13:21 UTC, Nisim Simsolo
intel-vfio host, vdsm log (936.59 KB, application/x-xz)
2015-08-20 07:16 UTC, Nisim Simsolo
engine_new.log (7.04 MB, text/plain)
2015-09-07 10:21 UTC, Nisim Simsolo
vdsm_new.log (9.51 MB, text/plain)
2015-09-07 10:22 UTC, Nisim Simsolo


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 45804 0 master MERGED core: fix events handling after host reboot Never
oVirt gerrit 45860 0 ovirt-engine-3.6 MERGED core: fix events handling after host reboot Never

Description Nisim Simsolo 2015-08-19 13:12:58 UTC
Description of problem:
Linux VMs get stuck in the "wait for launch" state when run with a GPU attached
(reproduced using Fedora and RHEL VMs).
A Windows 7 VM with a GPU attached runs properly.

Version-Release number of selected component (if applicable):
engine: 
3.6.0-0.11.master.el6
host: 
vdsm-4.17.2-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64
libvirt-client-1.2.17-4.el7.x86_64

How reproducible:
Not consistently

Steps to Reproduce:
1. Create RHEL7 or Fedora22 VM and attach GPU to it (in my case quadro k4200)
2. Run VM.

Actual results:
VM gets stuck in the "wait for launch" state.

Expected results:
VM should run properly.

Additional info:
Log-collector attached. Relevant host for this bug is intel-vfio

Comment 1 Nisim Simsolo 2015-08-19 13:21:20 UTC
Created attachment 1064850 [details]
LogCollector

Comment 2 Omer Frenkel 2015-08-20 06:04:04 UTC
I could not find any error in the logs, or any VM stuck in "wait for launch".
Please specify the VM name/ID and the time of the error.

Comment 3 Nisim Simsolo 2015-08-20 06:25:26 UTC
I forgot to mention it...
VM name: fedora_intel 
VM ID: 153
Host name: intel-vfio
Start VM time: 2015-Aug-19, 15:28

Comment 4 Omer Frenkel 2015-08-20 06:51:20 UTC
The VM ID is: f9796cb9-758e-49cf-a23a-1484c3fb13be
The VM was run on host intel-vfio,
but the attached log collector contains logs only from the amd-vfio host,
so we are missing vdsm.log from the correct host.
Please attach that vdsm.log covering the time of the run.

Also, did you happen to check the VM status on the host, to verify it is reported in this status there as well? (Just to rule out monitoring issues.)

Comment 5 Nisim Simsolo 2015-08-20 07:14:37 UTC
I did not check the VM status on the host.
As for the missing logs, I guess it's a log collector issue.
VDSM log attached (vdsm.log.5.xz)

Comment 6 Nisim Simsolo 2015-08-20 07:16:40 UTC
Created attachment 1065128 [details]
intel-vfio host, vdsm log

Comment 7 Michal Skrivanek 2015-08-21 13:44:25 UTC
The vdsm log indicates the create call got stuck in libvirt.

Comment 8 Michal Skrivanek 2015-08-21 14:21:00 UTC
Please include the relevant logs from libvirt and qemu as well, plus the journal/system messages.

Comment 9 Nisim Simsolo 2015-08-30 07:49:46 UTC
As I understand it, these logs are included in the attached log collector file.

Comment 10 Michal Skrivanek 2015-08-30 19:39:48 UTC
(In reply to Nisim Simsolo from comment #9)
> As i understand, these logs are included in logcollector file attached.

As I understand from comment #4, it doesn't. Without the relevant logs there's nothing to look at. Perhaps this is reproducible? If so, please include only data from the relevant time frame. Thanks.


Comment 11 Nisim Simsolo 2015-09-07 10:15:42 UTC
The issue occurred again with engine version RHEVM 3.6.0-0.12.master.el6.
This time I can see the VM's display on the monitor connected to the GPU card, but in the webadmin the VM is still in the "wait for launch" state.
Also, the status of the VM on the host is up.
Refreshing the browser and logging out/in does not resolve the issue.

VDSM and engine.log attached

VM ID: ef302904-aab5-4814-b33e-48a8c6de5eb6
VM name: win12_intel

Comment 12 Nisim Simsolo 2015-09-07 10:21:02 UTC
Created attachment 1070934 [details]
engine_new.log

Comment 13 Nisim Simsolo 2015-09-07 10:22:13 UTC
Created attachment 1070935 [details]
vdsm_new.log

Comment 14 Nisim Simsolo 2015-09-07 10:22:59 UTC
VM start time: 2015-09-07 12:58:38

Comment 15 Piotr Kliczewski 2015-09-07 11:31:39 UTC
Looking at the latest attached logs, I can see that we lost connectivity to vdsm due to a heartbeat-exceeded issue. This means vdsm did not send a response in time, so the connection was dropped. The host appears heavily loaded; there are a number of places in the logs like:

Thread-21431::INFO::2015-09-07 10:12:07,178::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:56157 stopped
Thread-21432::DEBUG::2015-09-07 10:12:11,445::stompreactor::304::yajsonrpc.StompServer::(send) Sending response

or 

Thread-4076::DEBUG::2015-09-07 10:12:16,893::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n691 bytes (691 B) copied, 0.000593475 s, 1.2 MB/s\n'; <rc> = 0
Reactor thread::INFO::2015-09-07 10:12:22,191::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:56158

where several seconds are missing from the log.

Please make sure your host is able to work properly, or increase your heartbeat interval.
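The heartbeat mechanism described above can be sketched roughly as follows. This is a minimal illustrative model, not the actual vdsm/jsonrpc implementation; the class and method names are hypothetical:

```python
import time


class HeartbeatMonitor:
    """Illustrative sketch of a heartbeat check: if the peer stays
    silent longer than the interval, the connection is considered
    dead and dropped. Names are hypothetical, not vdsm code."""

    def __init__(self, interval_s=10.0):
        self.interval_s = interval_s
        self.last_seen = time.monotonic()

    def message_received(self):
        # Any traffic from the peer resets the heartbeat timer.
        self.last_seen = time.monotonic()

    def is_expired(self):
        # True once the peer has been silent past the interval;
        # a heavily loaded host that responds late trips this check.
        return time.monotonic() - self.last_seen > self.interval_s
```

Under this model, a host too loaded to answer within the interval looks identical to a dead connection, which matches the "heartbeat exceeded" drop seen in the logs.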

Comment 16 Michal Skrivanek 2015-09-07 11:37:19 UTC
This is related to VM events rather than to VFIO.
The host was rebooted in the meantime (11:57), and since then events have been ignored when the VM is started again.

Comment 17 Arik 2015-09-07 12:15:30 UTC
A more detailed explanation: we use the host's monotonic time, which is not monotonic across a host reboot. So say a VM runs on a host, the host is rebooted, and we run the VM again on the same host: all events for this VM are then ignored, because the engine thinks it has already processed newer events.
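The stale-event filter described in comment 17 can be sketched as below. This is a simplified model of the logic, not actual ovirt-engine code (which is Java); all names are hypothetical:

```python
class VmEventTracker:
    """Illustrative sketch of stale-event filtering keyed on the
    host's monotonic timestamp. Names are hypothetical."""

    def __init__(self):
        # vm_id -> highest host timestamp processed so far
        self.last_timestamp = {}

    def should_process(self, vm_id, host_timestamp):
        last = self.last_timestamp.get(vm_id)
        if last is not None and host_timestamp <= last:
            # Event looks older than one already processed -> ignored.
            # After a host reboot the monotonic clock restarts near
            # zero, so every new event fails this check and is dropped:
            # exactly the bug in this report.
            return False
        self.last_timestamp[vm_id] = host_timestamp
        return True
```

The fix direction implied by the gerrit patches ("fix events handling after host reboot") would be to invalidate the remembered timestamps when the engine detects the host has rebooted, so post-reboot events pass the comparison again.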

Comment 18 Nisim Simsolo 2015-11-01 12:05:08 UTC
Verified:
rhevm-3.6.0.1-0.1.el6
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run a VM with a GPU attached.
2. While the VM is launching, reboot the host it runs on.
3. After the host is up again, run the VM and verify it runs properly.
4. Repeat the test case, but this time reboot the host while the VM is up.

