Created attachment 1714048 [details]
engine logs

Description of problem:
Can't stop the hosted-engine VM from the command line.

Version-Release number of selected component (if applicable):
4.4.3.1-0.7.el8ev

How reproducible:
50%

Steps to Reproduce:
1. From the host CLI run: "hosted-engine --vm-shutdown"
2. Wait until 'hosted-engine --vm-status' shows that the VM is in the 'down' state

Actual results:
The VM state changed to "powering down" and then changed back to "up".

Expected results:
The VM state should be 'down'.

Additional info:
See attached logs. I assume this kind of log line is related to the issue, but I'm not sure:

VM '66d2e1f1-6baa-4349-9d36-f838f0365299'(HostedEngine) was unexpectedly detected as 'Up' on VDS 'fcd52d79-da89-4a7e-8112-d05187e3b2e7'(host_mixed_2) (expected on 'null')
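The reproduction steps above can be sketched as a small shell helper. This is only a sketch: the `wait_for_vm_down` name is mine, and the `'down'` match is an assumption about what `hosted-engine --vm-status` prints for a stopped VM (it would also match "powering down", so a stricter pattern may be needed in practice).

```shell
#!/bin/bash
# Wait for the HE VM to report 'down'. The status command is passed in
# so the helper can be exercised without a real host; the 'down' match
# is an assumption about the --vm-status output for a stopped VM.
wait_for_vm_down() {
    local status_cmd="$1" max_tries="${2:-120}" interval="${3:-5}" i
    for ((i = 0; i < max_tries; i++)); do
        if $status_cmd | grep -q 'down'; then
            return 0           # VM reported down
        fi
        sleep "$interval"
    done
    return 1                   # timed out waiting for 'down'
}

# Intended usage on the host (not executed here):
#   hosted-engine --vm-shutdown
#   wait_for_vm_down "hosted-engine --vm-status"
```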
Did you put the system into global maintenance before shutting down the VM? Otherwise the HA agent will boot it again...
Please attach logs from the host running the HE as well as logs from the HE guest (system logs, journal, messages).
This doesn't reproduce manually. It would help to get an environment with a VM that fails to shut down, to investigate why.
I'm able to reproduce it (in the QE env) only if I simulate the automation script and, right after powering down the VM, execute the following command every second:

hosted-engine --vm-status | grep "Engine status" | egrep -v "unknown|up|powering down"

The HE VM stays in 'powering down' status for a few minutes, goes back up, and after a few more minutes finally goes down.

In the VDSM log there is a 'VM.shutdown succeeded' message, but 'virsh -r list' shows the HE VM is still running:

2020-09-30 09:55:37,347+0300 INFO (jsonrpc/6) [api.virt] START shutdown(delay='120', message='VM is shutting down!', reboot=False, timeout=None, force=False) from=::1,59184, vmId=401ce9ed-bf52-4de2-b5ad-480541c21c8c (api:48)
2020-09-30 09:55:37,386+0300 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.04 seconds (__init__:312)

VDSM version: vdsm-4.40.31-1.el8ev.x86_64

As a workaround, adding a few seconds of delay between the task `Shutdown the hosted-engine manager VM` and the task `Check when vm status changes to down` in the automation script will most likely help.

Is it a new test that was recently written? Did you encounter this issue in 4.4.2?
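The automation-simulating reproduction above can be sketched as follows. The grep/egrep pattern is copied verbatim from the comment; `STATUS_CMD` and the `poll_until_down` name are mine, added so the loop can be exercised without a real host.

```shell
#!/bin/bash
# Simulate the automation: poll the status once per second, filtering
# out "unknown"/"up"/"powering down", until some other state appears.
# STATUS_CMD defaults to the real CLI and is overridable for testing.
STATUS_CMD="${STATUS_CMD:-hosted-engine --vm-status}"

poll_until_down() {
    local max="${1:-600}" i
    for ((i = 0; i < max; i++)); do
        # A surviving line means the VM is in none of the filtered states.
        if $STATUS_CMD | grep "Engine status" \
                | egrep -v "unknown|up|powering down" >/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1                   # still in a transitional state after max tries
}

# Intended usage on the host (not executed here):
#   hosted-engine --vm-shutdown
#   poll_until_down
```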
It takes a very long time, I think ~20 min; we can't add such a long delay to our tests. This is not a new test and it's the first time we encounter this, so it's a regression.
(In reply to Asaf Rachmani from comment #27)

Arik, can you please sync with Asaf and take over? Get Avihai if you need a reproduction environment... but I believe that's a virt issue with qemu-guest-agent crashing.
Just to clarify: Asaf found that calling 'hosted-engine --vm-status' immediately after 'hosted-engine --vm-shutdown' triggers the issue, so he suggests adding a 10-second delay between '--vm-shutdown' and '--vm-status'. I checked it and indeed it works :), so yes, it can be used as a workaround until this bug is fixed.
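The suggested workaround can be sketched as below. The 10-second default comes from the comment above; `HE_CMD` and the function name are mine, added so the flow can be exercised without a host. (Note a later comment reports the workaround did not hold up, since the problem is tied to --vm-shutdown itself.)

```shell
#!/bin/bash
# Sketch of the suggested workaround: delay the first status poll
# after requesting shutdown. HE_CMD defaults to the real CLI and is
# overridable so the flow can be exercised without a real host.
HE_CMD="${HE_CMD:-hosted-engine}"

he_shutdown_with_delay() {
    local delay="${1:-10}"     # 10 seconds, per the comment above
    $HE_CMD --vm-shutdown
    sleep "$delay"             # do not poll immediately after shutdown
    $HE_CMD --vm-status
}
```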
(In reply to Michal Skrivanek from comment #29)
> (In reply to Asaf Rachmani from comment #27)
>
> Arik, can you please sync with Asaf and take over? Get Avihai if you need
> reproduction environment...but I believe that's a virt issue with
> qemu-guest-agent crashing

This observation seems correct. I see that qemu-guest-agent receives the call to shut down the VM:

Oct 1 16:26:32 hosted-engine-06 qemu-ga[1062]: info: guest-shutdown called, mode: powerdown

Then it fails:

Oct 1 16:26:32 hosted-engine-06 systemd-logind[1124]: Creating /run/nologin, blocking further logins...
Oct 1 16:26:32 hosted-engine-06 qemu-ga[1062]: **
Oct 1 16:26:32 hosted-engine-06 qemu-ga[1062]: ERROR:qga/main.c:532:send_response: assertion failed: (rsp && s->channel)
Oct 1 16:26:32 hosted-engine-06 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 1062 0 0 6 1601558792 18446744073709551615 hosted-engine-06.lab.eng.tlv2.redhat.com qemu-ga pipe failed
Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Main process exited, code=killed, status=6/ABRT
Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Failed with result 'signal'.

The system reaches a powering-down status:

Oct 1 16:26:32 hosted-engine-06 systemd-logind[1124]: System is powering down (hypervisor initiated shutdown).

And from this point on, systemd repeatedly tries to restart qemu-guest-agent.service:

Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Oct 1 16:26:32 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
...
Oct 1 16:47:03 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Oct 1 16:47:03 hosted-engine-06 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.

A few minutes after the last log line, the guest finally shuts down.
There are two issues that play together here. The direct cause is a regression in qemu-ga itself, where the agent crashes after the guest-shutdown command. It was introduced by the rebase to 4.1.0 in RHEL 8.3. The bug has already been fixed upstream; I have opened bug 1884531 on platform asking for a backport of the fix.

Then there seems to be another issue in systemd. A service that crashes during shutdown is caught in indecision between restarting and stopping, until systemd finally times out, gives up, and forces shutdown of the OS. I will open a bug for this too once I manage to get a clean reproducer.
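For context, the restart loop in the logs comes from a restart policy with no hold-off time interacting with the stop job that shutdown queues for the unit. A minimal illustrative unit is sketched below; this is not the actual qemu-guest-agent.service shipped in RHEL (whose options may differ), and the paths are placeholders.

```ini
# /etc/systemd/system/example-agent.service (illustrative only)
[Unit]
Description=Example agent that crashes during shutdown

[Service]
ExecStart=/usr/bin/example-agent
# Restart on any exit, with no hold-off time: if the process crashes
# while a shutdown-time stop job is already queued, systemd schedules
# an immediate restart but must defer it, producing the repeated
# "Stop job pending for unit, delaying automatic restart" messages.
Restart=always
RestartSec=0
```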
Because the problem is directly connected to --vm-shutdown and not to --vm-status, the WA suggested in comment 30 cannot work. But it would be interesting to know how the HE from the environment where it seemingly works differs from the HE in the environment where this bug is experienced.

The problem with systemd seems to be related to how the HE appliance is configured, because I cannot reproduce it on a general RHEL 8.3 VM. But since the environment got nuked overnight, I cannot investigate it anymore.
The systemd problem may be related to the fact that writing the core dump failed:

Sep 15 11:49:44 hosted-engine-04.lab.eng.tlv2.redhat.com kernel: Core dump to |/usr/lib/systemd/systemd-coredump 1050 0 0 6 1600159784 18446744073709551615 hosted-engine-04.lab.eng.tlv2.redhat.com qemu-ga pipe failed

In the normal case, when a service is aborted, systemd reports that the process exited with "code=dumped, status=6/ABRT", but in our case it says "code=killed, status=6/ABRT" -- the code is different.
(In reply to Tomáš Golembiovský from comment #32)
> There are two issues that play together in this. The direct cause is a
> regression in qemu-ga itself where the agent crashes after the
> guest-shutdown command. This has been introduced by rebase to 4.1.0 in RHEL
> 8.3. The bug has been already fixed upstream. I have opened bug 1884531 on
> platform asking for backport of the fix.

Since it will probably take platform some time to get the fix into RHEL 8.3, is there any chance we could use the guest agent from RHEL-AV (which should work OK) in the appliance?
Moving to 4.4.5, see https://bugzilla.redhat.com/show_bug.cgi?id=1900578#c3
bz 1900578 is verified, so all we need to do here now is to produce a hosted-engine appliance with the fixed qemu-guest-agent; thus moving to the Integration team.
Still being reproduced on latest ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch with rhvm-appliance-4.4-20201117.0.el8ev.x86_64:

Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: info: guest-shutdown called, mode: powerdown
Feb 11 13:01:04 nsednev-he-1 systemd-logind[1293]: Creating /run/nologin, blocking further logins...
Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: **
Feb 11 13:01:04 nsednev-he-1 qemu-ga[1240]: ERROR:qga/main.c:532:send_response: assertion failed: (rsp && s->channel)
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Main process exited, code=dumped, status=6/ABRT
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Failed with result 'core-dump'.
Feb 11 13:01:04 nsednev-he-1 systemd-logind[1293]: System is powering down (hypervisor initiated shutdown).
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Feb 11 13:01:04 nsednev-he-1 systemd[1]: qemu-guest-agent.service: Stop job pending for unit, delaying automatic restart.

"Powering down" changes back to "Up". The VM never goes down.

alma04 ~]# virsh -r list --all
 Id   Name           State
 ------------------------------
 7    HostedEngine   running
Nikolai, yes, that's expected - you were not provided with a hosted-engine appliance that consumes the fix in qemu-guest-agent yet.
hosted-engine --vm-shutdown works fine with qemu-guest-agent-4.2.0-34.module+el8.3.0+9828+7aab3355.3.x86_64 on the latest ovirt-engine-setup-base-4.4.5.4-0.6.el8ev.noarch. The engine is shut down properly and immediately.
ovirt-hosted-engine-ha-2.4.6-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.9-4.el8ev.noarch
ovirt-engine-setup-4.4.5.4-0.6.el8ev.noarch
qemu-guest-agent-4.2.0-34.module+el8.3.0+9828+7aab3355.3.x86_64
Linux 4.18.0-240.1.1.el8_3.x86_64 #1 SMP Fri Oct 16 13:36:46 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.3 (Ootpa)
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.