Bug 1876817 - Can't stop hosted-engine VM
Summary: Can't stop hosted-engine VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ovirt-4.4.5
: ---
Assignee: Asaf Rachmani
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1900578
Blocks:
 
Reported: 2020-09-08 08:59 UTC by Roni
Modified: 2021-03-18 15:13 UTC (History)
12 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:13:50 UTC
oVirt Team: Integration
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker-
pm-rhel: planning_ack+
ahadas: devel_ack+
pm-rhel: testing_ack+


Attachments
engine logs (2.51 MB, application/gzip)
2020-09-08 08:59 UTC, Roni
no flags

Description Roni 2020-09-08 08:59:12 UTC
Created attachment 1714048 [details]
engine logs

Description of problem:
Can't stop hosted-engine VM from the command line


Version-Release number of selected component (if applicable):
4.4.3.1-0.7.el8ev

How reproducible:
50%

Steps to Reproduce:
1. From host CLI run: "hosted-engine --vm-shutdown"
2. Wait until 'hosted-engine --vm-status' shows that the VM is in 'down' state

Actual results:
VM state changed to "powering down" and then changed back to "up"

Expected results:
VM state should be 'down'

Additional info:
See attached logs

I assume this log line is related to the issue, but I'm not sure:
VM '66d2e1f1-6baa-4349-9d36-f838f0365299'(HostedEngine) was unexpectedly detected as 'Up' on VDS 'fcd52d79-da89-4a7e-8112-d05187e3b2e7'(host_mixed_2) (expected on 'null')

Comment 15 Sandro Bonazzola 2020-09-14 11:37:30 UTC
Did you put the system into global maintenance before shutting down the VM? Otherwise the HA agent will boot it again...

Comment 21 Michal Skrivanek 2020-09-14 14:02:58 UTC
Please attach logs from the host running the HE, as well as logs from the HE guest (system logs, journal, messages).

Comment 26 Michal Skrivanek 2020-09-29 12:16:54 UTC
This doesn't reproduce manually. It would help to get an environment with a VM that fails to shut down, so we can investigate why.

Comment 27 Asaf Rachmani 2020-09-30 08:27:06 UTC
I'm able to reproduce it (in a QE env) only if I simulate the automation script and, right after powering down the VM, execute the following command every second:
hosted-engine --vm-status |grep "Engine status" | egrep -v "unknown|up|powering down"
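As a side note on the filter: it keeps only "Engine status" lines that contain none of "unknown", "up", or "powering down". A minimal, self-contained illustration with made-up status lines (illustrative values, not taken from this bug's logs):

```shell
# Three made-up status lines piped through the same filter
# (grep -E is the modern spelling of egrep); only the "down"
# line survives, since it matches none of the excluded words.
printf '%s\n' \
  'Engine status : {"vm": "up", "health": "good"}' \
  'Engine status : {"vm": "powering down"}' \
  'Engine status : {"vm": "down", "health": "bad"}' \
  | grep "Engine status" | grep -Ev "unknown|up|powering down"
```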

The HE VM stays in powering-down status for a few minutes, goes back up, and after a few more minutes finally goes down. 

In the VDSM log there is a 'VM.shutdown succeeded' message, but 'virsh -r list' shows the HE VM is still running:
2020-09-30 09:55:37,347+0300 INFO  (jsonrpc/6) [api.virt] START shutdown(delay='120', message='VM is shutting down!', reboot=False, timeout=None, force=False) from=::1,59184, vmId=401ce9ed-bf52-4de2-b5ad-480541c21c8c (api:48)
2020-09-30 09:55:37,386+0300 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.04 seconds (__init__:312)

VDSM version: vdsm-4.40.31-1.el8ev.x86_64

I think that, as a workaround, adding a few seconds of delay between the task 'Shutdown the hosted-engine manager VM' and the task 'Check when vm status changes to down' in the automation script will most likely help.

Is this a new test that was written recently? Did you encounter this issue in 4.4.2?

Comment 28 Roni 2020-09-30 10:19:37 UTC
It takes a very long time, I think ~20 min; we can't add such a long delay to our tests.
This is not a new test and it's the first time we've encountered this, so it's a regression.

Comment 29 Michal Skrivanek 2020-10-01 12:32:24 UTC
(In reply to Asaf Rachmani from comment #27)

Arik, can you please sync with Asaf and take over? Get Avihai if you need a reproduction environment... but I believe that's a virt issue with qemu-guest-agent crashing.

Comment 30 Roni 2020-10-01 12:52:33 UTC
Just to clarify: Asaf found that calling 'hosted-engine --vm-status' immediately after 'hosted-engine --vm-shutdown'
triggers the issue, so he suggests adding a 10-second delay between '--vm-shutdown' and '--vm-status'.
I checked it and indeed it works :), so yes, it can be used as a workaround until this bug is fixed.
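The delay-then-poll workaround can be sketched as a small shell loop. This is a hypothetical sketch: the 'hosted-engine' command is stubbed with a shell function here so the snippet runs anywhere, and the status-line format is assumed; on a real host, drop the stub and use the full 10-second delay.

```shell
# Stub so this sketch is runnable without a real host; remove on a real setup.
hosted-engine() { echo 'Engine status : {"vm": "down"}'; }

hosted-engine --vm-shutdown >/dev/null
sleep 1   # on a real host, use the suggested delay: sleep 10
until hosted-engine --vm-status | grep -q '"vm": "down"'; do
    sleep 1
done
echo "HE VM is down"
```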

Comment 31 Arik 2020-10-01 14:01:06 UTC
(In reply to Michal Skrivanek from comment #29)
> (In reply to Asaf Rachmani from comment #27)
> 
> Arik, can you please sync with Asaf and take over? Get Avihai if you need
> reproduction environment...but I believe that's a virt issue with
> qemu-guest-agent crashing

This observation seems correct.

I see qemu-guest-agent receives the call to shut down the VM:
Oct  1 16:26:32 hosted-engine-06 qemu-ga[1062]: info: guest-shutdown called, mode: powerdown

Then fails:
Oct  1 16:26:32 hosted-engine-06 systemd-logind[1124]: Creating /run/nologin, blocking further logins...
Oct  1 16:26:32 hosted-engine-06 qemu-ga[1062]: **
Oct  1 16:26:32 hosted-engine-06 qemu-ga[1062]: ERROR:qga/main.c:532:send_response: assertion failed: (rsp && s->channel)
Oct  1 16:26:32 hosted-engine-06 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 1062 0 0 6 1601558792 18446744073709551615 hosted-engine-06.lab.eng.tlv2.redhat.com qemu-ga pipe failed
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Main process exited, code=killed, status=6/ABRT
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Failed with result 'signal'.

The system reaches a powering down status:
Oct  1 16:26:32 hosted-engine-06 systemd-logind[1124]: System is powering down (hypervisor initiated shutdown).

And from this point on, qemu-guest-agent.service repeatedly tries to restart:
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Oct  1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
...
Oct  1 16:47:03 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Oct  1 16:47:03 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.

A few minutes after the last log line, the guest finally shuts down.

Comment 32 Tomáš Golembiovský 2020-10-02 08:51:27 UTC
There are two issues that play together here. The direct cause is a regression in qemu-ga itself, where the agent crashes after the guest-shutdown command. It was introduced by the rebase to 4.1.0 in RHEL 8.3 and has already been fixed upstream. I have opened bug 1884531 on platform asking for a backport of the fix.

Then there seems to be another issue in systemd: a service that crashed during shutdown is caught in indecision about whether to restart or stop, until systemd finally times out, gives up, and forces shutdown of the OS. I will open a bug for this too once I manage to get a clean reproducer.
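For context on the restart loop in the logs: systemd's behavior here is governed by the unit's Restart= and RestartSec= settings, and the log shows this unit has no hold-off time (RestartSec=0). Purely as an illustration of the knob involved, a hypothetical drop-in (the path and value are assumptions, not a proposed fix for this bug):

```ini
# Hypothetical drop-in: /etc/systemd/system/qemu-guest-agent.service.d/holdoff.conf
[Service]
RestartSec=5s
```

Applied with "systemctl daemon-reload"; whether a hold-off actually avoids the stop-job race described above is untested here.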

Comment 34 Tomáš Golembiovský 2020-10-02 16:30:04 UTC
Because the problem is directly connected to --vm-shutdown and not to --vm-status, the workaround suggested in comment 30 cannot work. But it would be interesting to know how the HE in the environment where it seemingly works differs from the HE in the environment where this bug is experienced.

The systemd problem seems to be related to how the HE appliance is configured, because I cannot reproduce it on a general RHEL 8.3 VM. But since the environment got nuked overnight, I cannot investigate it anymore.

Comment 35 Tomáš Golembiovský 2020-10-02 16:39:29 UTC
The systemd problem may be related to the fact that writing the core dump failed:

    Sep 15 11:49:44 hosted-engine-04.lab.eng.tlv2.redhat.com kernel: Core dump to |/usr/lib/systemd/systemd-coredump 1050 0 0 6 1600159784 18446744073709551615 hosted-engine-04.lab.eng.tlv2.redhat.com qemu-ga pipe failed

In the normal case, when a service is aborted, systemd reports that the process exited with "code=dumped, status=6/ABRT", but in our case it says "code=killed, status=6/ABRT" -- the code is different.

Comment 42 Tomáš Golembiovský 2020-10-05 11:31:52 UTC
(In reply to Tomáš Golembiovský from comment #32)
> There are two issues that play together in this. The direct cause is a
> regression in qemu-ga itself where the agent crashes after the
> guest-shutdown command. This has been introduced by rebase to 4.1.0 in RHEL
> 8.3. The bug has been already fixed upstream. I have opened bug 1884531 on
> platform asking for backport of the fix.

Since it will probably take platform some time to get the fix into RHEL 8.3, is there any chance we could use the guest agent from RHEL-AV (which should work OK) in the appliance?

Comment 44 Arik 2020-12-02 16:57:10 UTC
Moving to 4.4.5, see https://bugzilla.redhat.com/show_bug.cgi?id=1900578#c3

Comment 45 Arik 2021-02-11 09:49:45 UTC
Bug 1900578 is verified, so all we need to do here now is produce a hosted-engine appliance with the fixed qemu-guest-agent; thus moving to the Integration team.

Comment 46 Nikolai Sednev 2021-02-11 11:17:33 UTC
Still reproducible on latest ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch with rhvm-appliance-4.4-20201117.0.el8ev.x86_64:

Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: info: guest-shutdown called, mode: powerdown
Feb 11 13:01:04 nsednev-he-1 systemd-logind[1293]: Creating /run/nologin, blocking further logins...
Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: **
Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: ERROR:qga/main.c:532:send_response: assertion failed: (rsp && s->channel)
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Main process exited, code=dumped, status=6/ABRT
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Failed with result 'core-dump'.
Feb 11 13:01:04 nsednev-he-1 systemd-logind[1293]: System is powering down (hypervisor initiated shutdown).
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.


"Powering down" changes back to "Up", and the VM never goes to "down".

alma04 ~]# virsh -r list --all
 Id   Name           State
------------------------------
 7    HostedEngine   running

Comment 47 Arik 2021-02-11 12:32:44 UTC
Nikolai, yes, that's expected - you have not yet been provided with a hosted-engine appliance that consumes the fix in qemu-guest-agent.

Comment 48 Nikolai Sednev 2021-02-11 15:43:31 UTC
hosted-engine --vm-shutdown works fine with qemu-guest-agent-4.2.0-34.module+el8.3.0+9828+7aab3355.3.x86_64 on latest ovirt-engine-setup-base-4.4.5.4-0.6.el8ev.noarch.
The engine shuts down properly and immediately.

Comment 49 Nikolai Sednev 2021-02-11 15:50:52 UTC
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
ovirt-engine-setup-4.4.5.4-0.6.el8ev.noarch
qemu-guest-agent-4.2.0-34.module+el8.3.0+9828+7aab3355.3.x86_64
Linux 4.18.0-240.1.1.el8_3.x86_64 #1 SMP Fri Oct 16 13:36:46 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)

Comment 50 Sandro Bonazzola 2021-03-18 15:13:50 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

