Bug 1255033 - VM status events are ignored after host is rebooted
VM status events are ignored after host is rebooted
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.6.0
Unspecified Unspecified
high Severity high
: ovirt-3.6.0-rc
: 3.6.0
Assigned To: Arik
Nisim Simsolo
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-08-19 09:12 EDT by Nisim Simsolo
Modified: 2016-04-19 21:11 EDT (History)
13 users (show)

See Also:
Fixed In Version: 3.6.0-12
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-19 21:11:49 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
LogCollector (18.10 MB, application/x-xz)
2015-08-19 09:21 EDT, Nisim Simsolo
no flags Details
intel-vfio host, vdsm log (936.59 KB, application/x-xz)
2015-08-20 03:16 EDT, Nisim Simsolo
no flags Details
engine_new.log (7.04 MB, text/plain)
2015-09-07 06:21 EDT, Nisim Simsolo
no flags Details
vdsm_new.log (9.51 MB, text/plain)
2015-09-07 06:22 EDT, Nisim Simsolo
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 45804 master MERGED core: fix events handling after host reboot Never
oVirt gerrit 45860 ovirt-engine-3.6 MERGED core: fix events handling after host reboot Never

  None (edit)
Description Nisim Simsolo 2015-08-19 09:12:58 EDT
Description of problem:
Linux VM get stuck at "wait for launch" state when running it with GPU attached.
(restored using Fedora and RHEL VMs).
When running Windows7 VM with GPU attached, VM is running properly.

Version-Release number of selected component (if applicable):
engine: 
3.6.0-0.11.master.el6
host: 
vdsm-4.17.2-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64
libvirt-client-1.2.17-4.el7.x86_64

How reproducible:
Not consistently

Steps to Reproduce:
1. Create RHEL7 or Fedora22 VM and attach GPU to it (in my case quadro k4200)
2. Run VM.

Actual results:
VM get stuck at wait for launch state.

Expected results:
VM should run properly.

Additional info:
Log-collector attached. Relevant host for this bug is intel-vfio
Comment 1 Nisim Simsolo 2015-08-19 09:21:20 EDT
Created attachment 1064850 [details]
LogCollector
Comment 2 Omer Frenkel 2015-08-20 02:04:04 EDT
i could not find in the logs any error or any vm stuck in wait for launch,
please specify vm name/id and time of the error
Comment 3 Nisim Simsolo 2015-08-20 02:25:26 EDT
I forgot to mention it...
VM name: fedora_intel 
VM ID:153
Host name: intel-vfio
Start VM time: 2015-Aug-19, 15:28
Comment 4 Omer Frenkel 2015-08-20 02:51:20 EDT
vm id is: f9796cb9-758e-49cf-a23a-1484c3fb13be
the vm was run on host: intel-vfio
but the attached log collector contains logs only from amd-vfio host
so we are missing vdsm.log from the correct host
please attach this vdsm.log for the time of the run

also, did you happen to check the vm status on the host, to verify it is reported in this status on the host as well? (just to rule out monitoring issues)
Comment 5 Nisim Simsolo 2015-08-20 03:14:37 EDT
I did not check VM status on the host.
as for the log containment, i guess it's log collector issue.
VDSM log attached (vdsm.log.5.xz)
Comment 6 Nisim Simsolo 2015-08-20 03:16:40 EDT
Created attachment 1065128 [details]
intel-vfio host, vdsm log
Comment 7 Michal Skrivanek 2015-08-21 09:44:25 EDT
vdsm log indicates the create call got stuck in libvirt
Comment 8 Michal Skrivanek 2015-08-21 10:21:00 EDT
please do include relevant logs from libvirt and qemu as well, and journal/system messages, please
Comment 9 Nisim Simsolo 2015-08-30 03:49:46 EDT
As i understand, these logs are included in logcollector file attached.
Comment 10 Michal Skrivanek 2015-08-30 15:39:48 EDT
(In reply to Nisim Simsolo from comment #9)
> As i understand, these logs are included in logcollector file attached.

As I understand from comment #4 it doesn't. Without relevant logs there's nothing to look at. Perhaps this is reproducible? If so, please include only relevant time frame data. Thanks. 

E
Comment 11 Nisim Simsolo 2015-09-07 06:15:42 EDT
Issue occurred again using engine version: RHEVM 3.6.0-0.12.master.el6.
This time, i can see the host monitor from GPU card, but in the webadmin it is still in "wait for launch" state.
Also, the status of the VM in the host is up.
Trying to refresh browser and logout/login is not solving this issue.

VDSM and engine.log attached

VM ID: ef302904-aab5-4814-b33e-48a8c6de5eb6
VM name: win12_intel
Comment 12 Nisim Simsolo 2015-09-07 06:21:02 EDT
Created attachment 1070934 [details]
engine_new.log
Comment 13 Nisim Simsolo 2015-09-07 06:22:13 EDT
Created attachment 1070935 [details]
vdsm_new.log
Comment 14 Nisim Simsolo 2015-09-07 06:22:59 EDT
VM start time: 2015-09-07 12:58:38
Comment 15 Piotr Kliczewski 2015-09-07 07:31:39 EDT
Looking at the latest logs attached I can see that we lost connectivity to vdsm due heartbeat exceeded issue. This means that vdsm haven't sent response on time and the connection was dropped. I can see that the host is heavily loaded there is bunch of places in the logs like:

Thread-21431::INFO::2015-09-07 10:12:07,178::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:56157 stopped
Thread-21432::DEBUG::2015-09-07 10:12:11,445::stompreactor::304::yajsonrpc.StompServer::(send) Sending response

or 

Thread-4076::DEBUG::2015-09-07 10:12:16,893::fileSD::169::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n691 bytes (691 B) copied, 0.000593475 s, 1.2 MB/s\n'; <rc> = 0
Reactor thread::INFO::2015-09-07 10:12:22,191::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:56158

where there is several seconds missing from the log.

Please make sure that your host is able to work properly or increase your heartbeat interval.
Comment 16 Michal Skrivanek 2015-09-07 07:37:19 EDT
this is related to VM events rather than VFIO
The host has been rebooted in the meantime (11:57) and since then events are being ignored when VM started again
Comment 17 Arik 2015-09-07 08:15:30 EDT
More detailed explanation: we use monotonic time in the host which is not monotonic after reboot of the host. so let's say that a VM runs on a host and after the host is being rebooted, we run the VM again on the same host - all the events for this VM will be ignored since the engine thinks it already processed newer events
Comment 18 Nisim Simsolo 2015-11-01 07:05:08 EST
Verified:
rhevm-3.6.0.1-0.1.el6
vdsm-4.17.10-5.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7.x86_64
libvirt-client-1.2.17-5.el7.x86_64

Scenario: 
1. Run VM with GPU attached. 
2. During launching VM, reboot related host.
3. After host is up again, run VM and verify VM is running properly.
4. Repeat test case, but this time reboot related host while VM is up.

Note You need to log in before you can comment on or make changes to this bug.