1255033 – VM status events are ignored after host is rebooted

Bug 1255033 - VM status events are ignored after host is rebooted

Summary: VM status events are ignored after host is rebooted

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-3.6.0-rc
Target Release:	3.6.0
Assignee:	Arik
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2015-08-19 13:12 UTC by Nisim Simsolo
Modified:	2016-04-20 01:11 UTC (History)
CC List:	13 users (show)
Fixed In Version:	3.6.0-12
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-04-20 01:11:49 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
LogCollector (18.10 MB, application/x-xz) 2015-08-19 13:21 UTC, Nisim Simsolo	no flags	Details
intel-vfio host, vdsm log (936.59 KB, application/x-xz) 2015-08-20 07:16 UTC, Nisim Simsolo	no flags	Details
engine_new.log (7.04 MB, text/plain) 2015-09-07 10:21 UTC, Nisim Simsolo	no flags	Details
vdsm_new.log (9.51 MB, text/plain) 2015-09-07 10:22 UTC, Nisim Simsolo	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	45804	0	master	MERGED	core: fix events handling after host reboot	Never
oVirt gerrit	45860	0	ovirt-engine-3.6	MERGED	core: fix events handling after host reboot	Never

Description Nisim Simsolo 2015-08-19 13:12:58 UTC

Description of problem:
Linux VM get stuck at "wait for launch" state when running it with GPU attached.
(restored using Fedora and RHEL VMs).
When running Windows7 VM with GPU attached, VM is running properly.

Version-Release number of selected component (if applicable):
engine: 
3.6.0-0.11.master.el6
host: 
vdsm-4.17.2-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64
libvirt-client-1.2.17-4.el7.x86_64

How reproducible:
Not consistently

Steps to Reproduce:
1. Create RHEL7 or Fedora22 VM and attach GPU to it (in my case quadro k4200)
2. Run VM.

Actual results:
VM get stuck at wait for launch state.

Expected results:
VM should run properly.

Additional info:
Log-collector attached. Relevant host for this bug is intel-vfio

Comment 1 Nisim Simsolo 2015-08-19 13:21:20 UTC

Created attachment 1064850 [details]
LogCollector

Comment 2 Omer Frenkel 2015-08-20 06:04:04 UTC

i could not find in the logs any error or any vm stuck in wait for launch,
please specify vm name/id and time of the error

Comment 3 Nisim Simsolo 2015-08-20 06:25:26 UTC

I forgot to mention it...
VM name: fedora_intel 
VM ID:153
Host name: intel-vfio
Start VM time: 2015-Aug-19, 15:28

Comment 4 Omer Frenkel 2015-08-20 06:51:20 UTC

vm id is: f9796cb9-758e-49cf-a23a-1484c3fb13be
the vm was run on host: intel-vfio
but the attached log collector contains logs only from amd-vfio host
so we are missing vdsm.log from the correct host
please attach this vdsm.log for the time of the run

also, did you happen to check the vm status on the host, to verify it is reported in this status on the host as well? (just to rule out monitoring issues)

Comment 5 Nisim Simsolo 2015-08-20 07:14:37 UTC

I did not check VM status on the host.
as for the log containment, i guess it's log collector issue.
VDSM log attached (vdsm.log.5.xz)

Comment 6 Nisim Simsolo 2015-08-20 07:16:40 UTC

Created attachment 1065128 [details]
intel-vfio host, vdsm log

Comment 7 Michal Skrivanek 2015-08-21 13:44:25 UTC

vdsm log indicates the create call got stuck in libvirt

Comment 8 Michal Skrivanek 2015-08-21 14:21:00 UTC

please do include relevant logs from libvirt and qemu as well, and journal/system messages, please

Comment 9 Nisim Simsolo 2015-08-30 07:49:46 UTC

As i understand, these logs are included in logcollector file attached.

Comment 10 Michal Skrivanek 2015-08-30 19:39:48 UTC

(In reply to Nisim Simsolo from comment #9)
> As i understand, these logs are included in logcollector file attached.

As I understand from comment #4 it doesn't. Without relevant logs there's nothing to look at. Perhaps this is reproducible? If so, please include only relevant time frame data. Thanks. 

E

Comment 11 Nisim Simsolo 2015-09-07 10:15:42 UTC

Issue occurred again using engine version: RHEVM 3.6.0-0.12.master.el6.
This time, i can see the host monitor from GPU card, but in the webadmin it is still in "wait for launch" state.
Also, the status of the VM in the host is up.
Trying to refresh browser and logout/login is not solving this issue.

VDSM and engine.log attached

VM ID: ef302904-aab5-4814-b33e-48a8c6de5eb6
VM name: win12_intel

Comment 12 Nisim Simsolo 2015-09-07 10:21:02 UTC

Created attachment 1070934 [details]
engine_new.log

Comment 13 Nisim Simsolo 2015-09-07 10:22:13 UTC

Created attachment 1070935 [details]
vdsm_new.log

Comment 14 Nisim Simsolo 2015-09-07 10:22:59 UTC

VM start time: 2015-09-07 12:58:38

Comment 15 Piotr Kliczewski 2015-09-07 11:31:39 UTC

Looking at the latest logs attached I can see that we lost connectivity to vdsm due heartbeat exceeded issue. This means that vdsm haven't sent response on time and the connection was dropped. I can see that the host is heavily loaded there is bunch of places in the logs like:

Thread-21431::INFO::2015-09-07 10:12:07,178::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:56157 stopped
Thread-21432::DEBUG::2015-09-07 10:12:11,445::stompreactor::304::yajsonrpc.StompServer::(send) Sending response

or 

Thread-4076::DEBUG::2015-09-07 10:12:16,893::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n691 bytes (691 B) copied, 0.000593475 s, 1.2 MB/s\n'; <rc> = 0
Reactor thread::INFO::2015-09-07 10:12:22,191::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:56158

where there is several seconds missing from the log.

Please make sure that your host is able to work properly or increase your heartbeat interval.

Comment 16 Michal Skrivanek 2015-09-07 11:37:19 UTC

this is related to VM events rather than VFIO
The host has been rebooted in the meantime (11:57) and since then events are being ignored when VM started again

Comment 17 Arik 2015-09-07 12:15:30 UTC

More detailed explanation: we use monotonic time in the host which is not monotonic after reboot of the host. so let's say that a VM runs on a host and after the host is being rebooted, we run the VM again on the same host - all the events for this VM will be ignored since the engine thinks it already processed newer events

Comment 18 Nisim Simsolo 2015-11-01 12:05:08 UTC

Verified:
rhevm-3.6.0.1-0.1.el6
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run VM with GPU attached. 
2. During launching VM, reboot related host.
3. After host is up again, run VM and verify VM is running properly.
4. Repeat test case, but this time reboot related host while VM is up.

Note You need to log in before you can comment on or make changes to this bug.