Bug 1406033 - [atomic] HA vms do not start after successful power-management.
Summary: [atomic] HA vms do not start after successful power-management.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-guest-agent
Version: 3.6.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.0.7
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1341106
Blocks:
 
Reported: 2016-12-19 14:28 UTC by rhev-integ
Modified: 2019-12-16 07:28 UTC
CC: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when virtual machines were stopped during a system shutdown or reboot, their termination appeared to VDSM identical to a graceful shutdown initiated from within the guest operating system, and this incorrect information was passed to the Red Hat Virtualization Manager. As a result, the Manager did not restart highly available virtual machines on a different host, because it assumed the user had shut them down from within the guest operating system. Now, with the help of the Red Hat Virtualization guest agent, VDSM detects when a virtual machine was shut down by the system rather than by the user, differentiates an unplanned shutdown from a user-initiated one, and reports this to the Red Hat Virtualization Manager accordingly. Highly available virtual machines stopped during a system shutdown or reboot are therefore now restarted on a different host.
Clone Of: 1341106
Environment:
Last Closed: 2017-03-16 15:14:29 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:0553 0 normal SHIPPED_LIVE RHEVM Guest Agent for RHEL Atomic Host 2017-03-16 19:24:25 UTC
oVirt gerrit 64991 0 master MERGED virt: Try to detect non guest iniated shutdowns 2020-11-13 09:52:56 UTC
oVirt gerrit 64994 0 master MERGED Report session start and stop on all Guest OSes 2020-11-13 09:52:56 UTC
oVirt gerrit 65342 0 ovirt-4.0 MERGED Report session start and stop on all Guest OSes 2020-11-13 09:52:56 UTC
oVirt gerrit 65394 0 master MERGED Report session-startup on refresh 2020-11-13 09:52:56 UTC

Description rhev-integ 2016-12-19 14:28:51 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1341106 +++
======================================================================

Description of problem:
With power management successfully configured, VMs marked as highly available (HA) should restart on another host, but instead they remain shut down.

Steps tried so far:

A) Configure power management for the hosts.
B) Mark the VM as highly available (HIGH)

 1] Select "Restart" from the Power Management drop-down menu - [VMs remain down with Exit message: User shut down from within the guest]
 2] Run `reboot` / `init 6` / `init 0` on the host - [VMs remain down with Exit message: User shut down from within the guest]
 3] Power off the host from the hypervisor console - [VMs remain down with Exit message: User shut down from within the guest]
 4] Abrupt shutdown - [VMs restarted on another host once the host was fenced]
 5] `ifdown` on the interface - [VMs go to Unknown and start once the host is up]

Version-Release number of selected component (if applicable):
RHEVM 3.6.5

How reproducible:
Always

Steps to Reproduce:
1. Configure power management for the host.
2. Mark the VM as Highly Available 
3. Try to gracefully shutdown the host or choose from "Host --> Power Management --> (dropdown) Restart"

Host fence is successful but VM down with error: Exit message: User shut down from within the guest

Actual results:
HA VMs do not restart on another host or on the same host.

Expected results:
HA VMs should restart on another host.

Additional info:
Also tried: `echo c > /proc/sysrq-trigger`
With this, the VM was restarted on another host.

(Originally by Ulhas Surse)

Comment 5 rhev-integ 2016-12-19 14:29:24 UTC
Just some further findings:

It looks like the IMM2 board from IBM/Lenovo always sends ACPI signals to the OS.

This is why systemd steps in and kills the VM.
So we need to either
a) get systemd to ignore the ACPI signals (and thus not kill the VM), or
b) get IBM to not send ACPI signals to the OS in the case of an "Immediate Power Off"
   ("power off" without "-s"), which it obviously does in that case.


Taken from the logs prior to a Poweroff-event from the IMM
(still waiting for some further logs for final confirmation):

qemu: terminating on signal 15 from pid 1
2016-05-26 05:48:42.924+0000: starting up libvirt version: 1.2.17, package: 13.el7_2.4 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-03-02-11:10:27, x86-034.build.eng.bos.redhat.com), qemu version: 2.3.0 (qemu-kvm-rhev-2.3.0-31.el7_2.10)

(Originally by Martin Tessun)

Comment 8 rhev-integ 2016-12-19 14:29:43 UTC
Just checking the logs from my previous tests and I found the following in the messages:

2016-06-01T08:19:37.172092Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 85.839000 ms, bitrate 1721352 bps (1.641609 Mbps) LOW BANDWIDTH
inputs_connect: inputs channel client create
red_dispatcher_set_cursor_peer: 
===> qemu: terminating on signal 15 from pid 1 <=== 

This shows the shutdown is triggered by systemd before the system is powered off.
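The pid-1 check above can be scripted for triage. A minimal sketch, assuming the usual libvirt qemu log location; the helper name, log path, and sample line are illustrative, not part of any shipped tooling:

```shell
# Classify a qemu log by who sent SIGTERM: "from pid 1" means systemd
# terminated the process during host shutdown, anything else is some
# other sender (libvirtd, an admin, etc.).
classify_qemu_exit() {
    if grep -q 'terminating on signal 15 from pid 1' "$1"; then
        echo "systemd-initiated (host shutdown)"
    else
        echo "not a pid-1 SIGTERM"
    fi
}

# Example against a copy of the line seen in the log excerpt above;
# on a real host you would point this at /var/log/libvirt/qemu/<vm>.log.
tmplog=$(mktemp)
echo 'qemu: terminating on signal 15 from pid 1' > "$tmplog"
classify_qemu_exit "$tmplog"   # prints: systemd-initiated (host shutdown)
```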

Even better evidence can be found in the messages:
Jun  1 08:06:08 IDCRHLV01 root: PowerOff Test started
Jun  1 08:07:16 IDCRHLV01 systemd-logind: Power key pressed.
Jun  1 08:07:16 IDCRHLV01 systemd-logind: Powering Off...
Jun  1 08:07:16 IDCRHLV01 systemd-logind: System is powering down.
Jun  1 08:07:16 IDCRHLV01 systemd: Unmounting RPC Pipe File System...
Jun  1 08:07:16 IDCRHLV01 systemd: Stopped Dump dmesg to /var/log/dmesg.
Jun  1 08:07:16 IDCRHLV01 systemd: Stopping Dump dmesg to /var/log/dmesg...
Jun  1 08:07:16 IDCRHLV01 systemd: Stopped target Timers.
Jun  1 08:07:16 IDCRHLV01 systemd: Stopping Timers.
Jun  1 08:07:16 IDCRHLV01 systemd: Stopping LVM2 PV scan on device 8:144...
[...]

(Originally by Martin Tessun)

Comment 9 rhev-integ 2016-12-19 14:29:51 UTC
Sorry, submitted too early.
So maybe we should disable power management for hypervisors by default.

E.g.:

1. Shutdown and disable acpid
   # systemctl disable acpid
   # systemctl stop acpid

2. Change the ACPI Actions of systemd to "IGNORE":
   # mkdir -m 755 /etc/systemd/logind.conf.d
   # cat > /etc/systemd/logind.conf.d/acpi.conf <<EOF
[Login]
HandlePowerKey=ignore
HandleSuspendKey=ignore
HandleHibernateKey=ignore
HandleLidSwitch=ignore
HandleLidSwitchDocked=ignore
EOF

3. Restart systemd-logind
   # systemctl restart systemd-logind
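The three steps above can be combined into one script. This is a sketch that stages the drop-in in a scratch directory first (the `DEST` override is an illustrative assumption, not a logind feature), so the file can be reviewed before copying it to /etc/systemd/logind.conf.d on a real host and restarting systemd-logind as in step 3:

```shell
# Stage the logind ACPI-ignore drop-in from steps 1-3 above.
# On a real host, set DEST=/etc/systemd/logind.conf.d (as root) and
# also run: systemctl disable --now acpid; systemctl restart systemd-logind
dest="${DEST:-/tmp/logind.conf.d}"
mkdir -p -m 755 "$dest"
cat > "$dest/acpi.conf" <<'EOF'
[Login]
HandlePowerKey=ignore
HandleSuspendKey=ignore
HandleHibernateKey=ignore
HandleLidSwitch=ignore
HandleLidSwitchDocked=ignore
EOF
# Count how many handlers were set to ignore (expect 5).
grep -c '=ignore' "$dest/acpi.conf"
```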

(Originally by Martin Tessun)

Comment 10 rhev-integ 2016-12-19 14:29:58 UTC
(In reply to Martin Tessun from comment #7)
> ===> qemu: terminating on signal 15 from pid 1 <=== 
> 
> This shows the shutdown is triggered by systemd before the system is powered
> off.
> 
This would be OK, I guess; we should handle that and still identify it as an ungraceful shutdown. Was there perhaps an ACPI event inside the guest? What about libvirt, did it get SIGTERM too?

Generally, graceful termination is desired behavior, but then we have to rethink how HA behaves: maybe restart an HA VM regardless of what the guest does and allow shutting down an HA VM only from the UI... which might be annoying

(Originally by michal.skrivanek)

Comment 11 rhev-integ 2016-12-19 14:30:04 UTC
(In reply to Martin Tessun from comment #8)
> Sorry submitted too early.
> So maybe we should disable powermanagement for Hypervisors by default.

No, we shouldn't. In case of a disaster, I expect IT to go into the server room and shut down using the ON/OFF button, expecting a graceful shutdown.
This change would be unexpected.
I'm quite sure there's a way with IBM, via the BMC or whatnot, to ungracefully kill the host. We should look into it in the fence-agents code.

(Originally by Yaniv Kaul)

Comment 12 rhev-integ 2016-12-19 14:30:11 UTC
(In reply to Yaniv Kaul from comment #10)
> (In reply to Martin Tessun from comment #8)
> > Sorry submitted too early.
> > So maybe we should disable powermanagement for Hypervisors by default.
> 
> No, we shouldn't. In case of a disaster, I expect IT to go into the server
> room and shutdown using the ON/OFF button, expecting a graceful shutdown. 
> This change is unexpected.

Well, in case of a disaster, I don't expect anyone to go to the server room. It is a disaster, so there would probably be some risk in entering the server room.

In my 20 years of administration, I have never used the power-off button to gracefully shut down a server. Either I have a serial console I can reach, or I do a hard power-off (maybe even NMI-triggered to get a crash dump), but probably never a graceful one, as that usually does not work in these cases.

Anyway, I can accept this point of view; it would of course break the current behaviour, which might lead to other cases requesting the exact opposite.

> I'm quite sure there's a way in IBM, via BMC or whatnot, to ungracefully
> kill the host. We should look into it, in the fence-agents code.

Sure, from my point of view the IBM IMM2 / BMC card has a firmware issue, as there is an option to gracefully shut down the server (power off -s).

Still, I agree with Michal that we should somehow handle these sorts of issues (the case where the VM is killed by systemd), at least for HA VMs.

(Originally by Martin Tessun)

Comment 17 rhev-integ 2016-12-19 14:30:42 UTC
I have tried a couple more test cases related to this; maybe we can consider them in the same BZ.

When the admin logs on to the hypervisor and issues a shutdown or reboot, all the VMs running on the host exit with the same message: "User shut down from within the guest".

This means the guests will never start up again automatically, and the admin has to start all these VMs manually again.

The solution could be either:
1- enable maintenance mode on the host as part of the shutdown sequence, which would gracefully move all VMs from that host to another functional host in the cluster; or
2- forcibly kill the guest VM processes instead of attempting a shutdown; this would then be picked up by RHV-M, which would automatically start the VMs on another host.
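If option 2 were chosen, it could be prototyped as a systemd unit whose ExecStop runs during host shutdown. This is a hedged sketch only: the unit name, file contents, and the whole approach are illustrative assumptions, not the fix that actually shipped (which went through VDSM and guest-agent session reporting instead). The sketch stages the unit under a scratch prefix; a real host would use /etc/systemd/system and `systemctl enable` it.

```shell
# Sketch of a shutdown-time unit that SIGKILLs qemu-kvm processes so the
# engine sees an unclean exit and restarts HA VMs elsewhere. Hypothetical;
# not a shipped RHV component.
unit_dir="${UNIT_DIR:-/tmp/systemd-sketch}"
mkdir -p "$unit_dir"
cat > "$unit_dir/kill-qemu-at-shutdown.service" <<'EOF'
[Unit]
Description=Kill qemu-kvm ungracefully at shutdown (sketch)
DefaultDependencies=no
Before=shutdown.target

[Service]
Type=oneshot
RemainAfterExit=yes
# ExecStart is a no-op at boot; ExecStop fires when the unit is torn
# down during shutdown, before shutdown.target is reached.
ExecStart=/bin/true
ExecStop=/usr/bin/pkill -KILL qemu-kvm

[Install]
WantedBy=multi-user.target
EOF
echo "wrote $unit_dir/kill-qemu-at-shutdown.service"
```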

(Originally by Ahmed El-Rayess)

Comment 18 rhev-integ 2016-12-19 14:30:48 UTC
@aelrayes:

The workaround applied here will work for those scenarios as well (a hypervisor shut down by an administrator is not a user shutdown).

1) is the correct way for an administrator to do this

2) should be avoided unless necessary

(Originally by Vinzenz Feenstra)

Comment 23 Michal Skrivanek 2016-12-21 13:03:47 UTC
Moving back to ON_QA since it is testable. See (private) comment #22 for location

Comment 26 Artyom 2016-12-22 09:08:52 UTC
Thanks, Vinzenz, I had missed the `atomic install` and `atomic run` commands.

I can now see that the engine always restarts the HA VM, even when I shut down the VM via the guest OS (`poweroff`, `shutdown now`). Is this the expected behavior?

Comment 27 Artyom 2016-12-22 09:58:30 UTC
It looks like this happens only on the Atomic guest (maybe it is somehow connected to the fact that the guest agent runs as a Docker container). I checked on RHEL 7.3 with the package rhevm-guest-agent-common-1.0.12-4.el7ev.noarch, and there the engine does not restart the HA VM if the VM is powered off from the guest OS.

I ran the same `poweroff` command on the Atomic guest OS and on a regular RHEL 7.3 guest OS, but I get different events:

Dec 22, 2016 11:52:10 AM
VM test_atomic is down. Exit message: VM has been terminated on the host

Dec 22, 2016 11:52:01 AM
VM test_rhel7 is down. Exit message: User shut down from within the guest

If needed, I can provide you with the environment with the Atomic VM.

Comment 33 Artyom 2017-02-28 11:19:08 UTC
Checked on the image:
[root@test ~]# atomic images list
   REPOSITORY                          TAG                                                IMAGE ID       CREATED            VIRTUAL SIZE   TYPE      
>  vfeenstr/rhevm-guest-agent-docker   rhevm-4.0-rhel-7-docker-candidate-20170224020507   88039b982959   2017-02-24 07:14   473.43 MB      Docker    

[root@test ~]# atomic images info 88039b982959
Image Name: vfeenstr/rhevm-guest-agent-docker:rhevm-4.0-rhel-7-docker-candidate-20170224020507
io.k8s.description: This is the RHEVM management agent running inside the guest. The agent interfaces with the RHEV manager, supplying heart-beat info as well as run-time data from within the guest itself. The agent also accepts control commands to be run executed within the OS (like: shutdown and restart).
STOP: docker kill --signal=TERM ${NAME}
Version: 1.0.12
INSTALL: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-install.sh
vendor: Red Hat, Inc.
description: The Red Hat Enterprise Linux Base image is designed to be a fully supported foundation for your containerized applications.  This base image provides your operations and application teams with the packages, language runtimes and tools necessary to run, maintain, and troubleshoot all of your applications. This image is maintained by Red Hat and updated regularly. It is designed and engineered to be the base layer for all of your containerized applications, middleware and utilites. When used as the source for all of your containers, only one copy will ever be downloaded and cached in your production environment. Use this image just like you would a regular Red Hat Enterprise Linux distribution. Tools like yum, gzip, and bash are provided by default. For further information on how this image was built look at the /root/anacanda-ks.cfg file.
authoritative-source-url: registry.access.redhat.com
io.k8s.display-name: RHEVM Guest Agent
version: 1.0.12
vcs-ref: 25865513b0890f8e962b87893acdf93f8079e3c0
com.redhat.component: rhevm-guest-agent-docker
distribution-scope: public
run: docker run --privileged --pid=host --net=host -v /:/host -e HOST=/host -v /proc:/hostproc -v /dev/virtio-ports/com.redhat.rhevm.vdsm:/dev/virtio-ports/com.redhat.rhevm.vdsm --env container=docker --restart=always -e IMAGE=IMAGE -e NAME=NAME IMAGE
Name: rhev4/rhevm-guest-agent
vcs-type: git
com.redhat.build-host: ip-10-29-120-149.ec2.internal
Release: 10
BZComponent: rhevm-guest-agent-docker
build-date: 2017-02-24T02:06:40.898691
UNINSTALL: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-uninstall.sh
RUN: docker run --privileged --pid=host --net=host -v /:/host -e HOST=/host -v /proc:/hostproc -v /dev/virtio-ports/com.redhat.rhevm.vdsm:/dev/virtio-ports/com.redhat.rhevm.vdsm --env container=docker --restart=always -e IMAGE=IMAGE -e NAME=NAME IMAGE
name: rhev4/rhevm-guest-agent
license: ASL 2.0
summary: The RHEVM Guest Agent
architecture: x86_64
install: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-install.sh
release: 10
io.openshift.tags: base rhel7
uninstall: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-uninstall.sh


The problem described under https://bugzilla.redhat.com/show_bug.cgi?id=1406033#c27 still exists.

Comment 34 Vinzenz Feenstra [evilissimo] 2017-02-28 11:35:17 UTC
And you tested this with a 4.0 engine?

Comment 35 Artyom 2017-02-28 12:09:30 UTC
Sure, rhevm-4.0.7.3-0.1.el7ev.noarch. I can provide you with the environment if you want.

Comment 36 Vinzenz Feenstra [evilissimo] 2017-02-28 13:12:56 UTC
While part of the overall functionality is still missing, the original issue as reported is fixed: HA VMs are now restarted after host fencing/reboots, etc.

The issue of Atomic Host HA VMs also restarting when shut down from within the VM (user shutdown) remains; a new BZ should be created for it, so that the issue at hand can be marked as fixed.

Thanks.

Comment 37 Artyom 2017-02-28 13:28:03 UTC
Verified on the image:
[root@test ~]# atomic images list
   REPOSITORY                          TAG                                                IMAGE ID       CREATED            VIRTUAL SIZE   TYPE      
>  vfeenstr/rhevm-guest-agent-docker   rhevm-4.0-rhel-7-docker-candidate-20170224020507   88039b982959   2017-02-24 07:14   473.43 MB      Docker    

[root@test ~]# atomic images info 88039b982959
Image Name: vfeenstr/rhevm-guest-agent-docker:rhevm-4.0-rhel-7-docker-candidate-20170224020507
io.k8s.description: This is the RHEVM management agent running inside the guest. The agent interfaces with the RHEV manager, supplying heart-beat info as well as run-time data from within the guest itself. The agent also accepts control commands to be run executed within the OS (like: shutdown and restart).
STOP: docker kill --signal=TERM ${NAME}
Version: 1.0.12
INSTALL: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-install.sh
vendor: Red Hat, Inc.
description: The Red Hat Enterprise Linux Base image is designed to be a fully supported foundation for your containerized applications.  This base image provides your operations and application teams with the packages, language runtimes and tools necessary to run, maintain, and troubleshoot all of your applications. This image is maintained by Red Hat and updated regularly. It is designed and engineered to be the base layer for all of your containerized applications, middleware and utilites. When used as the source for all of your containers, only one copy will ever be downloaded and cached in your production environment. Use this image just like you would a regular Red Hat Enterprise Linux distribution. Tools like yum, gzip, and bash are provided by default. For further information on how this image was built look at the /root/anacanda-ks.cfg file.
authoritative-source-url: registry.access.redhat.com
io.k8s.display-name: RHEVM Guest Agent
version: 1.0.12
vcs-ref: 25865513b0890f8e962b87893acdf93f8079e3c0
com.redhat.component: rhevm-guest-agent-docker
distribution-scope: public
run: docker run --privileged --pid=host --net=host -v /:/host -e HOST=/host -v /proc:/hostproc -v /dev/virtio-ports/com.redhat.rhevm.vdsm:/dev/virtio-ports/com.redhat.rhevm.vdsm --env container=docker --restart=always -e IMAGE=IMAGE -e NAME=NAME IMAGE
Name: rhev4/rhevm-guest-agent
vcs-type: git
com.redhat.build-host: ip-10-29-120-149.ec2.internal
Release: 10
BZComponent: rhevm-guest-agent-docker
build-date: 2017-02-24T02:06:40.898691
UNINSTALL: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-uninstall.sh
RUN: docker run --privileged --pid=host --net=host -v /:/host -e HOST=/host -v /proc:/hostproc -v /dev/virtio-ports/com.redhat.rhevm.vdsm:/dev/virtio-ports/com.redhat.rhevm.vdsm --env container=docker --restart=always -e IMAGE=IMAGE -e NAME=NAME IMAGE
name: rhev4/rhevm-guest-agent
license: ASL 2.0
summary: The RHEVM Guest Agent
architecture: x86_64
install: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-install.sh
release: 10
io.openshift.tags: base rhel7
uninstall: docker run --rm --privileged --pid=host -v /:/host -e HOST=/host -e IMAGE=IMAGE -e NAME=NAME IMAGE /usr/local/bin/ovirt-guest-agent-uninstall.sh

Comment 39 errata-xmlrpc 2017-03-16 15:14:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0553

