Bug 1334982
Summary: | [RFE] Gracefully shutdown Virtual Machines on Host reboot/shutdown.
---|---
Product: | Red Hat Enterprise Virtualization Manager
Reporter: | Germano Veit Michel <gveitmic>
Component: | ovirt-engine
Assignee: | Petr Kotas <pkotas>
Status: | CLOSED ERRATA
QA Contact: | Israel Pinto <ipinto>
Severity: | medium
Priority: | medium
Docs Contact: |
Version: | 3.6.5
CC: | apinnick, dcadzow, dfediuck, eheftman, fgarciad, giordy, gordon.stocks, gveitmic, ipinto, lsurette, mavital, michal.skrivanek, mtessun, pkotas, rbalakri, Rhev-m-bugs, sbonazzo, srevivo, tjelinek, ykaul
Target Milestone: | ovirt-4.2.3
Keywords: | FutureFeature
Target Release: | ---
Flags: | mavital: testing_plan_complete+
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: |
Doc Type: | Enhancement
Doc Text: | Previously, when a host had to be shut down in an emergency to preserve the data center, running virtual machines were killed by systemd without a graceful shutdown. As a result, the virtual machines' state became undefined, which was problematic for virtual machines running databases such as Oracle or SAP. In this release, virtual machines are gracefully shut down before systemd continues the host shutdown. VDSM itself is stopped only after the virtual machines have been shut down, their state has been reported to the Manager, and a 5-second grace period has elapsed for the Manager to acknowledge that the virtual machines are down.
Story Points: | ---
Clone Of: |
Cloned To: | 1401054 (view as bug list)
Environment: |
Last Closed: | 2018-05-15 17:38:32 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | Virt
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: | 1401054
Bug Blocks: | 1507935
Attachments: |
Description (Germano Veit Michel, 2016-05-11 03:34:20 UTC)
Maybe a good first step would be to get this hard-coded string

"Before", "as", 1, "libvirt-guests.service") < 0)  <-----------

out of libvirt and make it customizable. Or add another 'Before' line for a new 'rhev-guests.service'.

(In reply to Germano Veit Michel from comment #2)
> Maybe a good first step would be to get this hard-coded string
>
> "Before", "as", 1, "libvirt-guests.service") < 0)  <-----------
>
> out of libvirt and make it customizable. Or add another 'Before' line for a
> new 'rhev-guests.service'.

The above sounds like a libvirt issue, then?

(In reply to Yaniv Kaul from comment #3)
> The above sounds like a libvirt issue, then?

Yes, correct. I even considered opening a separate BZ for that. As this RFE will likely spawn a few other child RFEs (libvirt, vdsm, ...), I would prefer not to get ahead of myself and to let engineering properly decide what needs to be done in each component. So I created just this one, as a feature request whose ultimate goal is a nice, usable feature, and described the problems I found to provide some help and context.

We could go ahead and open the libvirt-specific BZ, but first it would be good to have someone else assess what exactly the best implementation on libvirt's side would be to support this RFE. Michal?

Isn't this a Virt area?

Thinking about it a bit with Moran and msivak, it seems to me that a feasible approach would be:

- put a message in /etc/motd saying that this host is part of an xyz environment and should be put into maintenance before any actual system maintenance
- drop in a shell alias to echo a similar message when admin/root types poweroff, shutdown, or reboot.

This should be fairly simple to add to host deploy and NGN. What do you think, Sandro? Other alternatives that actually block it from happening are way more complicated to do; the above should be good enough for most cases.

(In reply to Michal Skrivanek from comment #6)
> - put a message in /etc/motd saying that this host is part of an xyz
>   environment and should be put into maintenance before any actual system
>   maintenance
> - drop in a shell alias to echo a similar message when admin/root types
>   poweroff, shutdown, or reboot.

What about setting systemd to disable poweroff and shutdown [1], re-enabling them when the host is moved to maintenance?

[1] https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html

Should probably be handled by VDSM/infra in this case.

Hm, that would be great, but it seems to me it assumes a command to execute, and it would require manipulation from vdsm. Note that currently vdsm is not aware of its own maintenance status. Sounds too complex to me...

Just to summarize what I understood so far, and to have a discussion starting point: vdsm disables libvirt-guests.service, which leads to systemd killing the VMs. The default behaviour of the guests service is to suspend the VMs on shutdown (which is in any case not what vdsm/RHV wants). According to the service definition, this can be changed in /etc/sysconfig/libvirt-guests:

ON_SHUTDOWN=shutdown

The number of parallel shutdowns can also be increased (PARALLEL_SHUTDOWN=<n>); a configuration sketch follows below.
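As a concrete illustration of the settings discussed here and in the following comments (ON_BOOT comes up just below), here is a minimal sketch of /etc/sysconfig/libvirt-guests, assuming the stock keys documented by libvirt-guests (ON_BOOT, ON_SHUTDOWN, PARALLEL_SHUTDOWN, SHUTDOWN_TIMEOUT); the values are illustrative, not the defaults oVirt eventually shipped:

```
# /etc/sysconfig/libvirt-guests -- sketch only, values are illustrative
ON_BOOT=ignore          # do not let libvirt auto-start guests on boot;
                        # the engine decides what runs where
ON_SHUTDOWN=shutdown    # gracefully shut guests down instead of suspending them
PARALLEL_SHUTDOWN=10    # how many guests to shut down concurrently
SHUTDOWN_TIMEOUT=300    # seconds to wait for a guest before giving up
```

With such a configuration in place, the enable/disable checkbox proposed below would map onto `systemctl enable libvirt-guests` / `systemctl disable libvirt-guests`.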
If I understood this correctly, using a properly configured libvirt-guests service could easily solve this issue, as a reboot would then do the following:

1. Shut down the VMs via the libvirt-guests service
2. Shut down libvirt and vdsm ---> fencing eventually comes into play here, but the VMs should already have been stopped.

For a nicer approach we can add some UI elements that change the libvirt-guests configuration (enable/disable it). So the implementation could be:

- vdsm-config configures the libvirt-guests service to shut down VMs instead of suspending them.
- Enabling/disabling the service can be achieved via the UI by selecting a checkbox like "Gracefully shutdown VMs in case of external Poweroff". If this is checked the libvirt-guests service is enabled, otherwise it is disabled.

Thoughts?

Forgot to mention: ON_BOOT=ignore also needs to be set so that the VMs are not restarted by libvirt during boot, as this could lead to VMs running more than once.

I see no use in keeping the current behavior; a shutdown should be the default. The suspend is useless, as it won't work when initiated by libvirt. Another, more complicated enhancement would be to trigger host maintenance instead... but the shutdown seems like a good enough starting point.

Please make sure to also verify the case explained in https://bugzilla.redhat.com/show_bug.cgi?id=1485967

*** Bug 1485967 has been marked as a duplicate of this bug. ***

Depends on https://github.com/oVirt/ovirt-ansible/pull/62. Cannot be tested before this gets merged.

Hey Petr, I'm reviewing the Doc Text that you wrote and, where required, opening documentation bugs to cover RFEs. In the doc text, don't you think we should mention that the solution should be enabled/disabled via the UI, as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1334982#c10

Another question, Petr: is the 5 second configurable delay that you mentioned a new feature? I don't really understand the connection between what you wrote and what Martin wrote here: https://bugzilla.redhat.com/show_bug.cgi?id=1334982#c10 Can you please clarify exactly what changes will be made? Thanks!

Hi Emma, thanks for the questions.

> In the doc text, don't you think we should mention that the solution should be enabled/disabled via the UI?

Currently, there is no GUI to enable or disable this functionality. The feature is on by default.

> Is the 5 second configurable delay that you mentioned a new feature?

This is not a feature, although it is a tweakable parameter. I am not expecting it to change a lot. The 5 second delay exists only to allow the engine to finish all communication before the host machine shuts down.
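To make the "delaying the systemd process" idea from the Doc Text concrete: systemd stops units in the reverse of their start-up ordering, so a unit ordered After=vdsmd.service has its stop action executed before vdsmd is stopped. The unit below is a hypothetical sketch of that ordering trick only, not the actual implementation from the ovirt-ansible change linked above (the helper script name is invented):

```
# /etc/systemd/system/graceful-vm-shutdown.service -- hypothetical sketch
[Unit]
Description=Gracefully shut down guests before vdsmd stops (sketch)
After=vdsmd.service libvirtd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# invented helper: shuts guests down, reports their state, then waits a few
# seconds so the engine can acknowledge before the shutdown proceeds
ExecStop=/usr/local/bin/shutdown-guests.sh
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target
```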
I am testing on version 4.2.0-0.5.master.el7 (build 7). I set the host_deploy_shutdown_timeout to 6000 and ran the first case with 2 VMs on the same host, running "poweroff -f" on the host (see https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/wiki/Compute/_RFE_%20Gracefully%20shutdown%20Virtual%20Machines%20on%20Host%20reboot_shutdown_).

I see that the VMs were not stopped by vdsm before the host powered off. From the engine log I see that the VMs (test_vm, test_vm_2) are down with an error:

2017-12-04 10:48:30,206+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_vm_2 is down with error. Exit message: VM terminated with error.
2017-12-04 10:48:31,011+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] VM '635a2291-f060-48a0-9fce-c4c05d6722d8'(test_vm) moved from 'Unknown' --> 'Down'
2017-12-04 10:48:31,039+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_vm is down with error. Exit message: VM terminated with error.

From my script (a hypothetical reconstruction follows after this comment) I see that the VMs were running until the host and vdsm went down (sampling 3 times per second):

check #225 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 3min 59s ago
================================================
check #226 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 3min 59s ago
================================================
check #227 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 4min 0s ago
================================================
check #228 10:44:08
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 4min 0s ago
================================================
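The monitoring script itself is not attached; the following is a hypothetical reconstruction of that kind of sampling loop, polling the guest list and vdsmd status roughly three times per second:

```
#!/bin/sh
# hypothetical reconstruction of the sampling loop used in the test above
i=0
while :; do
    i=$((i + 1))
    echo "check #$i $(date +%H:%M:%S)"
    virsh -r list                                     # read-only guest list
    echo "vdsm: $(systemctl status vdsmd | grep 'Active:')"
    echo "================================================"
    sleep 0.33
done
```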
(In reply to Israel Pinto from comment #22) Please add logs, namely the journal from the host, vdsm.log, and the qemu log for one of those VMs.

Created attachment 1365319 [details]
logs (engine, vdsm, vms-qemu, script log)

Still need the journal and the exact time you ran poweroff.

Created attachment 1365484 [details]
logs (engine, vdsm, vms-qemu, script log, journalctl)
That doesn't seem to capture the time you did the test. Please note down when you did things and double-check the timestamps in the attached logs.

1. poweroff command date:

[root@puma42 home]# date
Sun Dec 10 08:58:16 IST 2017
[root@puma42 home]# poweroff -f
Powering off.

Michal, I see the jump in time also in /var/log/messages. I did the poweroff at 08:58:16, but I don't see that time in the journal or in /var/log/messages. If you need, I will re-run it on a different host.

Update:

Engine: Software Version: 4.2.1.6-0.1.el7
Host:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.14
LIBVIRT Version: libvirt-3.9.0-8.el7
VDSM Version: vdsm-4.20.17-1.el7ev

Steps:
1. Create a VM pool with 10 VMs
2. Shut down the host

The VMs went down with an error:

[root@compute-ge-6 ovirt-engine]# grep -ir test-2
engine.log:2018-02-07 10:07:07,563+02 INFO [org.ovirt.engine.core.bll.AddVmCommand] (default task-7) [283f0642-f07b-47ed-9e78-8326598ed7c5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[test-2=VM_NAME]', sharedLocks='[ad6dc22c-9274-4664-a6e6-e1a23e57de3d=VM_POOL, c6080af6-8fe5-48d0-a8c4-99c33a644f27=TEMPLATE, e778d44a-a7a9-4eec-8481-0fad5f008f13=DISK]'}'
engine.log:2018-02-07 10:07:09,667+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-7) [22859810] EVENT_ID: USER_ADD_VM_STARTED(37), VM test-2 creation was initiated by admin@internal-authz.
engine.log:2018-02-07 10:07:29,123+02 INFO [org.ovirt.engine.core.bll.AddVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] Lock freed to object 'EngineLock:{exclusiveLocks='[test-2=VM_NAME]', sharedLocks='[ad6dc22c-9274-4664-a6e6-e1a23e57de3d=VM_POOL, c6080af6-8fe5-48d0-a8c4-99c33a644f27=TEMPLATE, e778d44a-a7a9-4eec-8481-0fad5f008f13=DISK]'}'
engine.log:2018-02-07 10:07:29,154+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: USER_ADD_VM_FINISHED_SUCCESS(53), VM test-2 creation has been completed.
engine.log:2018-02-07 11:02:22,614+02 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vmId='7dbb3560-8299-44fc-a97d-d669ac7dd31e', vm='VM [test-2]'}), log id: 69c52db4
engine.log:2018-02-07 11:02:22,617+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] START, CreateBrokerVDSCommand(HostName = host_mixed_1, CreateVDSCommandParameters:{hostId='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vmId='7dbb3560-8299-44fc-a97d-d669ac7dd31e', vm='VM [test-2]'}), log id: 1b4977b
engine.log: <name>test-2</name>
engine.log:2018-02-07 11:02:22,803+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] EVENT_ID: USER_STARTED_VM(153), VM test-2 was started by admin@internal-authz (Host: host_mixed_1).
engine.log:2018-02-07 11:02:40,180+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'WaitForLaunch' --> 'PoweringUp'
engine.log:2018-02-07 11:02:56,199+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'PoweringUp' --> 'Up'
engine.log:2018-02-07 11:02:56,341+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] EVENT_ID: USER_RUN_VM(32), VM test-2 started on Host host_mixed_1
engine.log:2018-02-07 11:36:16,228+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1784) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM test-2 was set to the Unknown status.
engine.log:2018-02-07 11:42:35,335+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'Unknown' --> 'Down'
engine.log:2018-02-07 11:42:35,359+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] EVENT_ID: VM_DOWN_ERROR(119), VM test-2 is down with error. Exit message: VM terminated with error.
engine.log:2018-02-07 11:42:39,963+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-1904) [64f3541a] Couldn't update VM 'test-2' (7dbb3560-8299-44fc-a97d-d669ac7dd31e) version from it's template, continue with restoring stateless snapshot.

logs: https://drive.google.com/open?id=1vzLUwlP3oB6Z2PFWbq_Y8ahlLPfnpI1E

Failing on VM pools; the VM is down with an error. Failed QA, see c31.

Hi, I can see there is no libvirt-guests entry in the journal log. Is the journal from the host or from the guest? Moreover, on which RHEL were the tests done? There is a known bug in 7.4 for VM shutdown, which is fixed in 7.5. If so, can you please retest on RHEL 7.5? Next, can you please redo the test with 10 VMs that are not in a pool? This will help eliminate the pool as the source of the issue. Many thanks, Petr

let's wait for confirmation

I retested with:
Engine version: 4.2.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 851.el7.x86_64
KVM Version: 2.10.0 - 20.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.18-1.el7ev

Cases:
1. Create a VM pool with 15 VMs, run all VMs, shut down the host
2. Create 12 VMs, run all VMs, shut down the host

Results:
Case 1: 2 VMs stayed in Unknown status
Case 2: 1 VM stayed in Unknown status

See logs at: https://drive.google.com/open?id=1YdeA3msXn6w8fNMLlXVmO3qO3RPe1bab

Case 1, pool with 15 VMs: VMs pool_test_-7 and pool_test_-5 were set to "Unknown status".

shutdown command:
[root@puma42 ~]# shutdown
Shutdown scheduled for Wed 2018-02-21 09:21:03 IST, use 'shutdown -c' to cancel.
[root@puma42 ~]#
Broadcast message from root.lab.tlv.redhat.com (Wed 2018-02-21 09:20:03 IST):
The system is going down for power-off at Wed 2018-02-21 09:21:03 IST!
Engine log:

2018-02-21 09:21:10,962+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '9206a475-bcc5-4e24-a125-13085a2b02be'(pool_test_-1) moved from 'Up' --> 'PoweringDown'
2018-02-21 09:21:10,971+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '5c0b4c2f-d1f0-4e48-93c9-131c7ef61ef1'(pool_test_-3) moved from 'Up' --> 'Down'
2018-02-21 09:21:11,112+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] EVENT_ID: VM_DOWN(61), VM pool_test_-3 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:11,475+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21509) [27f2ac26] Couldn't update VM 'pool_test_-3' (5c0b4c2f-d1f0-4e48-93c9-131c7ef61ef1) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:26,176+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] VM '2ddcad0a-e94e-4f17-9268-91624d9bd9da'(pool_test_-2) moved from 'Up' --> 'Down'
2018-02-21 09:21:26,297+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_DOWN(61), VM pool_test_-2 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:26,306+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] VM '9206a475-bcc5-4e24-a125-13085a2b02be'(pool_test_-1) moved from 'PoweringDown' --> 'Down'
2018-02-21 09:21:26,341+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_DOWN(61), VM pool_test_-1 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:26,422+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21514) [3fbc9284] Couldn't update VM 'pool_test_-1' (9206a475-bcc5-4e24-a125-13085a2b02be) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:26,516+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21514) [4b3e995b] Couldn't update VM 'pool_test_-2' (2ddcad0a-e94e-4f17-9268-91624d9bd9da) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:41,395+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] VM 'b5af689a-6f2e-411f-89bf-dfca090a495e'(pool_test_-4) moved from 'Up' --> 'Down'
2018-02-21 09:21:41,536+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] EVENT_ID: VM_DOWN(61), VM pool_test_-4 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:41,549+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] VM '3d95f01e-1e92-4ad7-bbcd-7d00167929a1'(pool_test_-14) moved from 'Up' --> 'Down'
2018-02-21 09:21:41,570+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] EVENT_ID: VM_DOWN(61), VM pool_test_-14 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:41,661+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21519) [bc90f5a] Couldn't update VM 'pool_test_-4' (b5af689a-6f2e-411f-89bf-dfca090a495e) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:41,764+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21519) [2d7151cd] Couldn't update VM 'pool_test_-14' (3d95f01e-1e92-4ad7-bbcd-7d00167929a1) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:56,621+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] VM '2dd11809-dd6d-4041-ae01-b4d402f99579'(pool_test_-6) moved from 'Up' --> 'Down'
2018-02-21 09:21:56,646+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] EVENT_ID: VM_DOWN(61), VM pool_test_-6 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:56,728+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21524) [ac7ac7b] Couldn't update VM 'pool_test_-6' (2dd11809-dd6d-4041-ae01-b4d402f99579) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:11,697+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '52fd7858-cdc1-47fb-9603-34fb874c34fc'(pool_test_-8) moved from 'Up' --> 'Down'
2018-02-21 09:22:11,721+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM pool_test_-8 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:11,802+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21529) [2bb28719] Couldn't update VM 'pool_test_-8' (52fd7858-cdc1-47fb-9603-34fb874c34fc) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:26,765+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] VM 'd6d9f51c-323e-46a2-82da-f43da155425f'(pool_test_-11) moved from 'Up' --> 'Down'
2018-02-21 09:22:26,786+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] EVENT_ID: VM_DOWN(61), VM pool_test_-11 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:26,794+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] VM 'd21fd87b-95e9-435f-9155-96ec25ccb272'(pool_test_-15) moved from 'Up' --> 'Down'
2018-02-21 09:22:26,810+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] EVENT_ID: VM_DOWN(61), VM pool_test_-15 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:26,880+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21534) [127a20ab] Couldn't update VM 'pool_test_-15' (d21fd87b-95e9-435f-9155-96ec25ccb272) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:26,963+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21534) [52917cf7] Couldn't update VM 'pool_test_-11' (d6d9f51c-323e-46a2-82da-f43da155425f) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:41,857+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-7) [] VM '8f441bc7-148f-44c0-8890-a7f8db9eca57'(pool_test_-9) moved from 'Up' --> 'Down'
2018-02-21 09:22:41,899+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-7) [] EVENT_ID: VM_DOWN(61), VM pool_test_-9 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:41,981+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21537) [760ea1af] Couldn't update VM 'pool_test_-9' (8f441bc7-148f-44c0-8890-a7f8db9eca57) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:56,938+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] VM '389f1314-cd45-4e55-ba70-c62ff1e6e715'(pool_test_-10) moved from 'Up' --> 'Down'
2018-02-21 09:22:56,972+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_DOWN(61), VM pool_test_-10 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:57,041+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21542) [4aedf5e2] Couldn't update VM 'pool_test_-10' (389f1314-cd45-4e55-ba70-c62ff1e6e715) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:23:12,012+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [] VM '52ed58bf-7a49-4776-b51e-58eabceb5c60'(pool_test_-12) moved from 'Up' --> 'Down'
2018-02-21 09:23:12,186+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [] EVENT_ID: VM_DOWN(61), VM pool_test_-12 is down. Exit message: User shut down from within the guest
2018-02-21 09:23:12,432+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21547) [63fb6556] Couldn't update VM 'pool_test_-12' (52ed58bf-7a49-4776-b51e-58eabceb5c60) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:23:27,222+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-75) [] VM '7bbd15b8-6bdc-4df0-956b-8d3dd7409485'(pool_test_-13) moved from 'Up' --> 'Down'
2018-02-21 09:23:27,247+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-75) [] EVENT_ID: VM_DOWN(61), VM pool_test_-13 is down. Exit message: User shut down from within the guest
2018-02-21 09:23:27,321+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21550) [3069f554] Couldn't update VM 'pool_test_-13' (7bbd15b8-6bdc-4df0-956b-8d3dd7409485) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:25:02,768+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21578) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM pool_test_-7 was set to the Unknown status.
2018-02-21 09:25:02,778+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21578) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM pool_test_-5 was set to the Unknown status.

-------------------------------------------------------------------------------

Case 2, 12 standalone VMs; vm_2 stayed in "Unknown status".

shutdown command:
[root@puma42 ~]# shutdown
Shutdown scheduled for Wed 2018-02-21 10:30:00 IST, use 'shutdown -c' to cancel.
[root@puma42 ~]#
Broadcast message from root.lab.tlv.redhat.com (Wed 2018-02-21 10:29:00 IST):
The system is going down for power-off at Wed 2018-02-21 10:30:00 IST!

Engine log, vm_2 was set to the Unknown status:

2018-02-21 10:30:28,324+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '1ba7d522-17cf-4c68-a829-271ec4c8288f'(vm_1) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,478+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_1 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:28,491+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '49adc5c8-aa6a-4576-93ba-734750c713fb'(vm_11) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,611+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_11 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:28,620+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM 'b36b411b-864a-4cce-a6df-143ecb345538'(vm_5) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,744+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_5 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:43,804+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] VM 'c215a1be-7a75-41d8-b470-bea29d0ccd32'(vm_7) moved from 'Up' --> 'Down'
2018-02-21 10:30:43,926+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: VM_DOWN(61), VM vm_7 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:58,975+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] VM '80349a03-8886-4cf0-a577-a39f7221875d'(vm_12) moved from 'Up' --> 'Down'
2018-02-21 10:30:59,115+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID: VM_DOWN(61), VM vm_12 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:59,134+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] VM '8625c439-792b-4a46-9c71-f85d0c1cdd08'(vm_8) moved from 'Up' --> 'Down'
2018-02-21 10:30:59,159+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID: VM_DOWN(61), VM vm_8 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:14,202+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] VM 'a7822cf9-6086-45e7-9ffc-b501457709a0'(vm_6) moved from 'Up' --> 'Down'
2018-02-21 10:31:14,223+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] EVENT_ID: VM_DOWN(61), VM vm_6 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:14,233+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] VM '710c361d-e66a-49c6-a4fc-4aed3a5ca2d6'(vm_3) moved from 'Up' --> 'Down'
2018-02-21 10:31:14,253+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] EVENT_ID: VM_DOWN(61), VM vm_3 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:29,292+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM 'e46b3232-4cbb-47fb-8f24-8c1b72486482'(vm_4) moved from 'Up' --> 'Down'
2018-02-21 10:31:29,425+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID: VM_DOWN(61), VM vm_4 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:29,433+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM '3fd29d04-0e93-49a4-b1ed-464aa2b4b73c'(vm_9) moved from 'Up' --> 'Down'
2018-02-21 10:31:29,451+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID: VM_DOWN(61), VM vm_9 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:44,487+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] VM 'e61ac7f6-6026-42e9-bc8b-f42113a89815'(vm_10) moved from 'Up' --> 'Down'
2018-02-21 10:31:44,511+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] EVENT_ID: VM_DOWN(61), VM vm_10 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:59,541+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] VM 'b8da67a1-7c10-4170-a9e6-e87cf6d397e1'(vm_13) moved from 'Up' --> 'Down'
2018-02-21 10:31:59,563+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: VM_DOWN(61), VM vm_13 is down. Exit message: User shut down from within the guest
2018-02-21 10:33:33,656+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-22791) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_2 was set to the Unknown status.
2018-02-21 10:33:33,656+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-22791) [] Server failed to respond, vds_id='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vds_name='host_mixed_1', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: java.net.NoRouteToHostException: No route to host

Hi, I have examined the logs and I cannot find a conclusive result. Please repeat the test, this time with debug logs switched on. This issue cannot be solved without them. Thank you.

Engine: Software Version: 4.2.2.1-0.1.el7
Host:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 855.el7.x86_64
KVM Version: 2.10.0 - 21.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.19-1.el7ev

I reran with pools and standalone VMs and turned the logs up to debug; see logs at: https://drive.google.com/open?id=1QktaTFoo_fdq9k61-v0HvF4iRqmniKnM

Same results: we have VMs in an Unknown state after shutting down the host. Failed QA.

We've sent the event during shutdown as expected:

2018-02-27 13:36:38,841+0200 DEBUG (libvirt/events) [jsonrpc.Notification] Sending event {"params": {"7c2a20fc-12a9-44c5-95e7-1bfb3282d056": {"status": "Down", "exitMessage": "User shut down from within the guest", "vmId": "7c2a20fc-12a9-44c5-95e7-1bfb3282d056", "exitReason": 7, "exitCode": 0}, "notify_time": 4364298540}, "jsonrpc": "2.0", "method": "|virt|VM_status|7c2a20fc-12a9-44c5-95e7-1bfb3282d056"} (__init__:181)
2018-02-27 13:36:38,842+0200 DEBUG (libvirt/events) [vds] Sending notification |virt|VM_status|7c2a20fc-12a9-44c5-95e7-1bfb3282d056 with params {'7c2a20fc-12a9-44c5-95e7-1bfb3282d056': {'status': 'Down', 'exitMessage': 'User shut down from within the guest', 'vmId': '7c2a20fc-12a9-44c5-95e7-1bfb3282d056', 'exitReason': 7, 'exitCode': 0}, 'notify_time': 4364298540} (clientIF:212)
2018-02-27 13:36:43,121+0200 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Stopping Acceptor (protocoldetector:214)
2018-02-27 13:36:43,122+0200 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:703)

but there is only the following error on the engine side and nothing else:

2018-02-27 13:36:29,930+02 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-30299) [cc26175] Lock freed to object 'EngineLock:{exclusiveLocks='[b46f4e62-d5eb-4f81-ad75-00163341ac65=VM]', sharedLocks=''}'
2018-02-27 13:36:46,956+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe

Israel, the debug log level was NOT enabled on the engine side, so we cannot see whether the message was received or not. Please repeat with debug logs on all sides.
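For reference, host-side debug logging can be switched on through vdsm's standard Python logging configuration; a sketch assuming the stock /etc/vdsm/logger.conf layout (key names and levels may differ between versions, and the engine side has its own, separate logging configuration not covered here):

```
# back up, raise vdsm loggers from INFO to DEBUG, restart -- sketch only
cp /etc/vdsm/logger.conf /etc/vdsm/logger.conf.bak
sed -i 's/level=INFO/level=DEBUG/g' /etc/vdsm/logger.conf
systemctl restart vdsmd
```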
Retested with:
Engine: Software Version: 4.2.2.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 858.el7.x86_64
KVM Version: 2.10.0 - 21.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.19-1.el7ev

See logs: https://drive.google.com/open?id=1wkEIyXsmSpTywZvFfr_0k8ucFHSePu4G

This time the host did a graceful shutdown with power management. From /var/log/messages:

Mar 5 15:39:08 puma42 systemd: Starting Power-Off...
Mar 5 15:39:08 puma42 systemd: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Mar 5 15:39:08 puma42 systemd: Shutting down.
Mar 5 15:39:08 puma42 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Mar 5 15:39:08 puma42 journal: Journal stopped

After booting the host again:

Mar 5 17:41:16 puma42 journal: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.5G available → current limit 1.5G).
Mar 5 17:41:16 puma42 kernel: microcode: microcode updated early to revision 0x710, date = 2013-06-17

vm_16 was set to the Unknown state:

engine.log:2018-03-05 15:37:29,261+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-709) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_16 was set to the Unknown status.

It seems like two issues, really:
1) All VMs are shut down, but a few of the last VMs (when there are more than just a few) do not seem to send their update correctly, and the engine declares them Unknown (the event is lost).
2) All VMs are shut down as "user shut down from the guest" due to mis-identification of the shutdown reason.

Hi Derek, I think with this behaviour and HA VMs (with and without leases), we need a documentation section that explains the flow(s) so that the customer understands the impact. In case an HA VM is shut down cleanly (be it due to this BZ, from the UI, or from within the VM), the HA VM isn't restarted. While this might be expected for the UI and in-VM use cases, it is probably not 100% clear for the host shutdown use case. As such, I would like to have that clearly documented in the HA-VM section. Do you think this is feasible? Thanks! Martin

In a word, yes. Can we meet on BlueJeans after next week (travelling to the OPS QBR next week) to go through the detailed planning discussion for docs?

(In reply to Michal Skrivanek from comment #41)
> It seems like two issues, really:
> 1) all VMs are shut down, but a few of the last VMs do not seem to send
>    their update correctly and the engine declares them Unknown (event is lost)
> 2) all VMs are shut down as "user shut down from the guest" due to
>    mis-identification of the shutdown reason

Who's looking at this?

I have retested the setup mentioned by QE. All of the VMs terminated gracefully and without the error; it seems that the issue was due to an unrelated storage error. As for the second part, I have started working on a patch that sets the right status on VMs terminating during a system shutdown.

Verified with:
Engine: 4.2.3.3-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.2
VDSM Version: vdsm-4.20.27-1.el7ev
SPICE Version: 0.14.0 - 2.el7

Test run: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/testrun?id=42-4&tab=records&result=passed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488