Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1640193

Summary: Rebooting VM in large scale env showing multiple vm status changes before eventually succeeding.
Product: [oVirt] ovirt-engine Reporter: Ilan Zuckerman <izuckerm>
Component: GeneralAssignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED CURRENTRELEASE QA Contact: Ilan Zuckerman <izuckerm>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.2.7.1CC: bugs, dagur, izuckerm, mburman, michal.skrivanek, mlehrer, rbarry
Target Milestone: ovirt-4.2.7Keywords: Performance
Target Release: ---Flags: rule-engine: ovirt-4.2+
rule-engine: ovirt-4.3+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-08 11:58:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
libvirt log
none
vsam log
none
engine log
none
journalctl -b issued from rebooted vm 'VM_2_HDD_Guy_1'
none
Libvirt log from host b01-h30 journalctl -u ***timezone is differs with the vm. that is why timestamps arent matching***
none
journalctl -o short-precise -k -b -1
none
journalctl -o short-precise -k -b 0
none
vdsm log from host b01-h30 xzgrep "2018-10-31 1[2-4]:" /var/log/vdsm/vdsm.log.4.xz none

Description Ilan Zuckerman 2018-10-17 14:12:02 UTC
Description of problem:
When rebooting a VM from engine web-admin, it takes up to 3 minutes till ui shows its status as 'up'.
Observing the VM from ui after clicking reboot, we see its status changed to 'reboot in progress', then after about 1min and 15sec the status changes to 'up', then after about 45sec its status changes to 'reboot in progress' again for another minute before it finally stay in 'up' state.

Version-Release number of selected component (if applicable):
-=>>rpm -qa vdsm
vdsm-4.20.42-1.el7ev.x86_64
-=>>rpm -q ovirt-engine
ovirt-engine-4.2.7.2-0.1.el7ev.noarch

#Topology
Engine / Engine DB (1 machine)
DWH DB (separate machine)
400 Hosts
4000 VMs
RHEL 7.5

There is only ONE VM active on this particular host where this behavior seen. This VM was rebooted.

How reproducible:
100%

Steps to Reproduce:
1. reboot vm 'VM_2_HDD_Guy_1' from web admin and observe the statuses it changes prior becoming 'up'. you will see the cycle of statuses i described above. this complete cycle takes up to 3 minutes to complete.

Actual results:
As described above.
Two errors seen in vdsm log for the relevant host:

ERROR (Thread-26) [virt.vm] (vmId='2cc7e22c-1374-441b-8ec9-26e2a217463a') desktopLock failed with unexpected exception (guestagent:496)

ERROR (Thread-27) [root] Reboot by QEMU Guest Agent failed (vm:5427)

Expected results:
Vm should be rebooted correctly within much shorter time period. 
This cycle of statuses which vm goes through (as described) shouldnt occur. 

Additional info:
Attaching the following logs:
engine, vdsm, libvirt

Environment details are shared within private message.
Be Advised, the system is under continuous testing and BZs reproduction.

Comment 1 Ilan Zuckerman 2018-10-17 14:13:27 UTC
Created attachment 1494869 [details]
libvirt log

Comment 2 Ilan Zuckerman 2018-10-17 14:13:52 UTC
Created attachment 1494870 [details]
vsam log

Comment 3 Ilan Zuckerman 2018-10-17 14:14:25 UTC
Created attachment 1494871 [details]
engine log

Comment 5 mlehrer 2018-10-17 15:06:41 UTC
adding version, updating bz title and lowering priority to medium as it eventually succeeds.

Comment 6 Roy Golan 2018-10-25 07:59:59 UTC
Tomasz please determine if this is really a condition that happens under load or happens every time, just that we don't see it.

Comment 7 Michal Skrivanek 2018-10-29 10:44:00 UTC
comment #1 is not a libvirt log but rather an excerpt from qemu log
libvirt log/journal would be helpful

we also need logs from the guest, it seems it disconnected before issuing the reboot, and that is aligned with the observed behavior

Comment 8 Ilan Zuckerman 2018-10-29 11:08:41 UTC
(In reply to Michal Skrivanek from comment #7)
> comment #1 is not a libvirt log but rather an excerpt from qemu log
It is libvirt log. Its taken for relevant vm from /var/log/libvirt/qemu/<vm name>
That is all the info this log held in that point.

> libvirt log/journal would be helpful
I can try to reproduce this on our scale env. But the libvirt log wouldnt be different.

> 
> we also need logs from the guest, it seems it disconnected before issuing
> the reboot, and that is aligned with the observed behavior
Guest - you mean the vm it self? if so, what logs you need from it?

Comment 9 Michal Skrivanek 2018-10-29 13:21:59 UTC
(In reply to Ilan Zuckerman from comment #8)
> (In reply to Michal Skrivanek from comment #7)
> > comment #1 is not a libvirt log but rather an excerpt from qemu log
> It is libvirt log. Its taken for relevant vm from /var/log/libvirt/qemu/<vm
> name>
> That is all the info this log held in that point.

that's a qemu log.
libvirt logs by default to systemd journal.


> > libvirt log/journal would be helpful
> I can try to reproduce this on our scale env. But the libvirt log wouldnt be
> different.
> 
> > 
> > we also need logs from the guest, it seems it disconnected before issuing
> > the reboot, and that is aligned with the observed behavior
> Guest - you mean the vm it self? if so, what logs you need from it?

anything you can get. ideally also the journal of the correct boot.

Comment 10 Ilan Zuckerman 2018-10-31 07:30:47 UTC
(In reply to Michal Skrivanek from comment #9)
> (In reply to Ilan Zuckerman from comment #8)
> > (In reply to Michal Skrivanek from comment #7)
> > > comment #1 is not a libvirt log but rather an excerpt from qemu log
> > It is libvirt log. Its taken for relevant vm from /var/log/libvirt/qemu/<vm
> > name>
> > That is all the info this log held in that point.
> 
> that's a qemu log.
> libvirt logs by default to systemd journal.
> 
> 
> > > libvirt log/journal would be helpful
> > I can try to reproduce this on our scale env. But the libvirt log wouldnt be
> > different.
> > 
> > > 
> > > we also need logs from the guest, it seems it disconnected before issuing
> > > the reboot, and that is aligned with the observed behavior
> > Guest - you mean the vm it self? if so, what logs you need from it?
> 
> anything you can get. ideally also the journal of the correct boot.

Attaching the following:
journalctl_-b.txt     boot log from vm 'VM_2_HDD_Guy_1'
journalctl_-u_libvirtd.txt    libvirt log from host b01-h30

Comment 11 Ilan Zuckerman 2018-10-31 07:32:05 UTC
Created attachment 1499221 [details]
journalctl -b   issued from rebooted vm 'VM_2_HDD_Guy_1'

Comment 12 Ilan Zuckerman 2018-10-31 07:34:07 UTC
Created attachment 1499222 [details]
Libvirt log from host b01-h30   journalctl -u   ***timezone is differs with the vm. that is why timestamps arent matching***

Comment 13 Michal Skrivanek 2018-10-31 11:11:11 UTC
(In reply to Ilan Zuckerman from comment #11)

they don't match at all, the guest is booting up at :18:02 but the reboot event in vdsm(host) happens at 13:08:11,260
Moreover it looks rather as a log from the boot after the reboot eventually succeeds. The log from previous boot (where you attempt to reboot) is the interesting one.

Comment 14 Michal Skrivanek 2018-10-31 12:50:49 UTC
possibly a duplicate of bug 1636723

Comment 15 Ilan Zuckerman 2018-10-31 13:55:16 UTC
(In reply to Michal Skrivanek from comment #13)
> (In reply to Ilan Zuckerman from comment #11)
> 
> they don't match at all, the guest is booting up at :18:02 but the reboot
> event in vdsm(host) happens at 13:08:11,260
> Moreover it looks rather as a log from the boot after the reboot eventually
> succeeds. The log from previous boot (where you attempt to reboot) is the
> interesting one.

Re - running the test in order to get 'the log from the boot when I triggered the reboot' as agreed with Michal.
I set journalctl to persist logging all of the boots (by default it is set to log only last boot).

I issued a reboot at VM time:
[root@localhost ~]# date
Wed Oct 31 09:39:04 EDT 2018

Listing all of existing boot logs after the reboot:
[root@localhost ~]# journalctl --list-boot
-2 d94d2adc14f04571bb999a81adbfb61c Wed 2018-10-31 03:18:02 EDT—Wed 2018-10-31 08:29:52 EDT
-1 887f134418e248b6af0b0d55a1270b3b Wed 2018-10-31 08:29:58 EDT—Wed 2018-10-31 09:41:55 EDT
 0 8f9d154c02e04018803029ca183e1805 Wed 2018-10-31 09:42:01 EDT—Wed 2018-10-31 09:43:43 EDT

We see that latest boot log (after issuing reboot) is '0'
And the boot log BEFORE the issuing of reboot is '-1'

Attaching both logs, just in case.

Comment 16 Ilan Zuckerman 2018-10-31 13:56:22 UTC
Created attachment 1499391 [details]
journalctl -o short-precise -k -b -1

Comment 17 Ilan Zuckerman 2018-10-31 13:56:49 UTC
Created attachment 1499392 [details]
journalctl -o short-precise -k -b 0

Comment 18 Michal Skrivanek 2018-10-31 15:10:15 UTC
ok, but now the original host log doesn't match:) can you add the relevant vdsm  and qemu log from host from this attempt?

Comment 19 Ilan Zuckerman 2018-11-01 07:30:21 UTC
(In reply to Michal Skrivanek from comment #18)
> ok, but now the original host log doesn't match:) can you add the relevant
> vdsm  and qemu log from host from this attempt?

Sure!

Please not the time differences between the VM (EDT) and the Host (UTC).
When its 03:13:13 EDT 2018 on VM, 
its      07:13:13 UTC 2018 on Host

Reboot took place at 31 09:39:04 EDT
I grabbed the vdsm log with following pattern "2018-10-31 1[2-4]:" in order to leave some margins.
The VM id is 2cc7e22c-1374-441b-8ec9-26e2a217463a   you can grab it from vdsm log in order to pinpoint the reboot events.

QEMU log for the relevant time gap was almost entirely empty. It includes only the following:

-=>>cat /var/log/libvirt/qemu/VM_2_HDD_Guy_1.log
inputs_channel_detach_tablet: 
inputs_channel_detach_tablet: 
inputs_channel_detach_tablet:

Comment 20 Ilan Zuckerman 2018-11-01 07:31:51 UTC
Created attachment 1499833 [details]
vdsm log from host b01-h30   xzgrep "2018-10-31 1[2-4]:" /var/log/vdsm/vdsm.log.4.xz

Comment 23 Michal Skrivanek 2018-11-01 08:55:11 UTC
should be fixed by bug 1635687 (different flow, so keeping on_qa)

Comment 24 Ilan Zuckerman 2018-11-05 12:34:48 UTC
Verified with the same vm on the same host with vdsm:
vdsm-4.20.43-1.el7ev.x86_64

Watched vdsm log and boot log of the vm. Did not notice unrelated statuses of the vm during reboot in the web ui.

Comment 25 Ryan Barry 2018-11-07 11:18:47 UTC
*** Bug 1636723 has been marked as a duplicate of this bug. ***