Bug 1539040 - host-deploy stops libvirt-guests triggering a shutdown of all the running VMs (including HE one)
Summary: host-deploy stops libvirt-guests triggering a shutdown of all the running VMs...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-host-deploy
Classification: oVirt
Component: Plugins.VDSM
Version: 1.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1538938 1539563
Depends On: 1458698
Blocks: 1478904
 
Reported: 2018-01-26 13:20 UTC by Simone Tiraboschi
Modified: 2018-02-12 11:53 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
libvirt-guests is now required for the new graceful shutdown feature, but host-deploy was explicitly stopping it, triggering a shutdown of all the running VMs (including the HE one).
Clone Of:
Environment:
Last Closed: 2018-02-12 11:53:41 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1539563 0 unspecified CLOSED Deploy HE failed via CLI based ansible deployment. 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1539734 0 high CLOSED HE setup fails due to ovirt_auth failure if the admin password is contained in REST API URL 2021-02-22 00:41:40 UTC
oVirt gerrit 86917 0 master ABANDONED ansible: virt: temporary mask libvirt-guests 2020-07-02 07:27:44 UTC
oVirt gerrit 86969 0 master MERGED deploy: avoid stopping libvirt-guests to preserve HE VM 2020-07-02 07:27:44 UTC
oVirt gerrit 86992 0 ovirt-host-deploy-1.7 MERGED deploy: avoid stopping libvirt-guests to preserve HE VM 2020-07-02 07:27:44 UTC

Internal Links: 1539563 1539734

Description Simone Tiraboschi 2018-01-26 13:20:02 UTC
Description of problem:

If hosted-engine-setup (ansible) fails or is interrupted midway, subsequent attempts fail with:

 [ INFO  ] TASK [Wait for the host to be up]
 [ ERROR ] Error: Failed to read response.
 [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Failed to read response."}
 [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

In the logs:
2018-01-26 11:37:44,503+0100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:81 TASK [Wait for the host to be up]
2018-01-26 11:49:07,746+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:161 ansible-playbook rc: 2


Checking what happened shows that HostedEngineLocal was shut down by libvirt-guests in the middle of the deployment:

 Jan 26 11:37:53 c74he20180108h1.localdomain systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
 Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Running guests on qemu+tls://c74he20180108h1.localdomain/system URI: HostedEngineLocal
 Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Shutting down guests on qemu+tls://c74he20180108h1.localdomain/system URI...
 Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Starting shutdown on guest: HostedEngineLocal
 Jan 26 11:37:55 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 600 seconds left
 Jan 26 11:38:00 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 595 seconds left
 Jan 26 11:38:05 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 590 seconds left
 Jan 26 11:38:11 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Shutdown of guest HostedEngineLocal complete.
 Jan 26 11:38:11 c74he20180108h1.localdomain systemd[1]: Stopped Suspend/Resume Running libvirt Guests.
 Jan 26 11:38:16 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
 Jan 26 11:38:16 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.
 Jan 26 11:38:16 c74he20180108h1.localdomain libvirt-guests.sh[20927]: libvirt-guests is configured not to start any guests on boot
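
To check a journal dump for this pattern quickly, something like the following may help. It is only a sketch: the function name guests_shut_down and the regular expression are illustrative assumptions, based solely on the message format seen in the excerpt above, not on any libvirt or oVirt tooling.

```python
import re

# Matches the message libvirt-guests.sh logs when it starts shutting a guest
# down, in the format seen in the journal excerpt (assumed stable).
SHUTDOWN_RE = re.compile(
    r"libvirt-guests\.sh\[\d+\]: Starting shutdown on guest: (\S+)"
)


def guests_shut_down(journal_lines):
    """Return the names of guests that libvirt-guests began shutting down."""
    names = []
    for line in journal_lines:
        match = SHUTDOWN_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


sample = [
    "Jan 26 11:37:53 c74he20180108h1.localdomain systemd[1]: Stopping Suspend/Resume Running libvirt Guests...",
    "Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Starting shutdown on guest: HostedEngineLocal",
    "Jan 26 11:38:11 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Shutdown of guest HostedEngineLocal complete.",
]
print(guests_shut_down(sample))
```

Fed with the lines of the journal for the libvirt-guests unit, it lists every guest the service tried to stop, which makes an unexpected HostedEngineLocal shutdown easy to spot.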


This did not happen on the first attempt, but the libvirt-guests journalctl logs still show activity:

 -- Logs begin at Fri 2018-01-26 11:13:53 CET, end at Fri 2018-01-26 12:20:17 CET. --
 Jan 26 11:14:06 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
 Jan 26 11:14:06 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.
 Jan 26 11:25:04 c74he20180108h1.localdomain systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
 Jan 26 11:25:04 c74he20180108h1.localdomain virsh[15215]: All-whitespace username.
 Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:
 Jan 26 11:25:05 c74he20180108h1.localdomain virsh[15248]: All-whitespace username.
 Jan 26 11:25:05 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 10Please enter your authentication name: Please enter your password:
 Jan 26 11:25:06 c74he20180108h1.localdomain virsh[15255]: All-whitespace username.
 Jan 26 11:25:06 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 9Please enter your authentication name: Please enter your password:
 Jan 26 11:25:07 c74he20180108h1.localdomain virsh[15262]: All-whitespace username.
 Jan 26 11:25:07 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 8Please enter your authentication name: Please enter your password:
 Jan 26 11:25:08 c74he20180108h1.localdomain virsh[15269]: All-whitespace username.
 Jan 26 11:25:08 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 7Please enter your authentication name: Please enter your password:
 Jan 26 11:25:09 c74he20180108h1.localdomain virsh[15276]: All-whitespace username.
 Jan 26 11:25:09 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 6Please enter your authentication name: Please enter your password:
 Jan 26 11:25:10 c74he20180108h1.localdomain virsh[15283]: All-whitespace username.
 Jan 26 11:25:10 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 5Please enter your authentication name: Please enter your password:
 Jan 26 11:25:11 c74he20180108h1.localdomain virsh[15290]: All-whitespace username.
 Jan 26 11:25:11 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 4Please enter your authentication name: Please enter your password:
 Jan 26 11:25:12 c74he20180108h1.localdomain virsh[15297]: All-whitespace username.
 Jan 26 11:25:12 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 3Please enter your authentication name: Please enter your password:
 Jan 26 11:25:13 c74he20180108h1.localdomain virsh[15304]: All-whitespace username.
 Jan 26 11:25:13 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 2Please enter your authentication name: Please enter your password:
 Jan 26 11:25:14 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 1Can't connect to default. Skipping.
 Jan 26 11:25:14 c74he20180108h1.localdomain systemd[1]: Stopped Suspend/Resume Running libvirt Guests.
 Jan 26 11:25:19 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
 Jan 26 11:25:19 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.


After the deployment attempt we have in /etc/sysconfig/libvirt-guests:
 ##
 # Start of VDSM configuration
 ##
 URIS=qemu+tls://c74he20180108h1.localdomain/system
 ON_BOOT=ignore
 ON_SHUTDOWN=shutdown
 PARALLEL_SHUTDOWN=0
 SHUTDOWN_TIMEOUT=600
 ##
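
Whether a host is in this state can be checked before retrying. The sketch below parses the file contents and applies a simple heuristic: a connection URI is set and ON_SHUTDOWN requests a shutdown. The helper names parse_sysconfig and stop_would_shut_down_guests are hypothetical, and the heuristic deliberately ignores the built-in defaults of libvirt-guests, so treat it as a rough indicator only.

```python
def parse_sysconfig(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def stop_would_shut_down_guests(settings):
    """Heuristic (assumption): a URI is configured and ON_SHUTDOWN=shutdown."""
    return bool(settings.get("URIS")) and settings.get("ON_SHUTDOWN") == "shutdown"


# Contents of /etc/sysconfig/libvirt-guests as left behind by the failed attempt.
leftover = """\
##
# Start of VDSM configuration
##
URIS=qemu+tls://c74he20180108h1.localdomain/system
ON_BOOT=ignore
ON_SHUTDOWN=shutdown
PARALLEL_SHUTDOWN=0
SHUTDOWN_TIMEOUT=600
##
"""
settings = parse_sysconfig(leftover)
print(stop_would_shut_down_guests(settings))
```

With the leftover configuration shown above the check is positive, which matches the behaviour seen on the second deployment attempt.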


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup      2.2.8

How reproducible:
100%

Steps to Reproduce:
1. Start hosted-engine --deploy from the CLI on a clean system.
2. Let it run until the prompt 'Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]:' appears, then stop it with Ctrl+C.
3. Retry the deployment on the same host without rebooting.

Actual results:
The deployment hangs on:
[ INFO  ] TASK [Wait for the host to be up]
[ ERROR ] Error: Failed to read response.
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Failed to read response."}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

The local engine VM was shut down by libvirt-guests.


Expected results:
hosted-engine-setup deploys correctly even after a previous failed attempt.

Additional info:

Comment 1 Simone Tiraboschi 2018-01-26 13:35:57 UTC
I think this was introduced here:
https://gerrit.ovirt.org/#/c/79840/

The commit message says:
- the setup introduced by this commit is
  non intrusive
  the libvirt-guests by default still remains
  disabled and stopped
  the libvirt-guests by default configuration
  cannot connect to the libvirt, because they
  do not have proper connection string, therefore
  they can't touch the running VMs in any way

This explains why libvirt-guests does not kick in on the first attempt: it cannot authenticate to libvirt, as shown by 'Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:'

On subsequent attempts, however, it takes effect because of the leftover configuration.

Petr, should we simply stop libvirt-guests while deploying hosted-engine?
Is there any other way to prevent libvirt-guests from acting on HostedEngineLocal?

Comment 2 Simone Tiraboschi 2018-01-26 13:42:46 UTC
Please note that something is probably also not working correctly on the host-deploy side.

Also on the first attempt, at
 [ INFO  ] TASK [Wait for the host to be up]

host-deploy has already configured and started vdsm and libvirtd (with SASL authentication) and, directly or indirectly, libvirt-guests. In that case, however, libvirt-guests still has to be restarted to pick up its new configuration, so on the first attempt we see:

 Jan 26 11:25:04 c74he20180108h1.localdomain virsh[15215]: All-whitespace username.
 Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:

just because libvirt-guests doesn't correctly authenticate to libvirtd over SASL.

Comment 3 Simone Tiraboschi 2018-01-26 14:02:53 UTC
(In reply to Simone Tiraboschi from comment #2)
> Please note that probably there is also something not correctly working on
> host-deploy side.

This is probably because libvirt and vdsm are still configured and started by host-deploy, while libvirt-guests is configured in a second step via ansible:
https://github.com/oVirt/ovirt-engine/blob/master/packaging/playbooks/roles/ovirt-host-deploy-libvirt-guests/tasks/main.yml

Comment 4 Simone Tiraboschi 2018-01-30 10:09:49 UTC
*** Bug 1539734 has been marked as a duplicate of this bug. ***

Comment 5 Petr Kotas 2018-01-30 17:13:03 UTC
We discussed this issue and could not find a suitable workaround that would serve as a hot-fix. The libvirt-guests service is configured to shut down all running VMs before the host shuts down. The VM shutdown is triggered when the service is stopped. This is the correct behavior and should not be changed.

Unfortunately, libvirt-guests is stopped during the ovirt-host-deploy phase, which stops the running engine VM.

We were able to pinpoint the source of this issue in the otopi part of ovirt-host-deploy, namely these lines:

https://github.com/oVirt/ovirt-host-deploy/blob/master/src/plugins/ovirt-host-deploy/vdsm/packages.py#L122
https://github.com/oVirt/ovirt-host-deploy/blob/master/src/plugins/ovirt-host-deploy/vdsm/packages.py#L164

The right place to solve this issue is inside the legacy otopi code.
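
As a rough illustration of the idea of not stopping libvirt-guests, the sketch below filters it out of a list of services to stop. This is not the merged patch: the PRESERVE_RUNNING set, the services_to_stop helper, and the example service list are all hypothetical and only show the general shape such a change could take.

```python
# Hypothetical: services whose stop action has side effects on running VMs
# and which a deploy step should therefore leave alone.
PRESERVE_RUNNING = frozenset({"libvirt-guests"})


def services_to_stop(candidates, preserve=PRESERVE_RUNNING):
    """Return the candidates that may be stopped, keeping their order."""
    return [name for name in candidates if name not in preserve]


# Illustrative service list, not taken from packages.py.
print(services_to_stop(["vdsmd", "supervdsmd", "libvirtd", "libvirt-guests"]))
```

Keeping the exclusion in one named set makes the reason for skipping the service explicit at the point where the stop list is built.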

Comment 6 Simone Tiraboschi 2018-01-31 11:03:32 UTC
This is not a trivial change: otopi currently only allows stopping and starting services, not restarting them in a single step:

https://github.com/oVirt/otopi/blob/master/src/plugins/otopi/services/systemd.py#L134

Comment 7 Simone Tiraboschi 2018-01-31 14:24:05 UTC
*** Bug 1538938 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Lavi 2018-02-05 08:27:05 UTC
*** Bug 1539563 has been marked as a duplicate of this bug. ***

Comment 9 Nikolai Sednev 2018-02-05 17:20:45 UTC
Works for me on these components:
rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-setup-2.2.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.4-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Moving to verified.
http://pastebin.test.redhat.com/552873

Comment 10 Sandro Bonazzola 2018-02-12 11:53:41 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

