Description of problem:

If hosted-engine-setup (ansible) fails, or is interrupted in the middle, successive attempts will fail with:

[ INFO ] TASK [Wait for the host to be up]
[ ERROR ] Error: Failed to read response.
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Failed to read response."}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

In the logs:

2018-01-26 11:37:44,503+0100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:81 TASK [Wait for the host to be up]
2018-01-26 11:49:07,746+0100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:161 ansible-playbook rc: 2

If we check what happened, we discover that HostedEngineLocal was shut down by libvirt-guests in the middle of the deployment:

Jan 26 11:37:53 c74he20180108h1.localdomain systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Running guests on qemu+tls://c74he20180108h1.localdomain/system URI: HostedEngineLocal
Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Shutting down guests on qemu+tls://c74he20180108h1.localdomain/system URI...
Jan 26 11:37:53 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Starting shutdown on guest: HostedEngineLocal
Jan 26 11:37:55 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 600 seconds left
Jan 26 11:38:00 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 595 seconds left
Jan 26 11:38:05 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Waiting for guest HostedEngineLocal to shut down, 590 seconds left
Jan 26 11:38:11 c74he20180108h1.localdomain libvirt-guests.sh[20418]: Shutdown of guest HostedEngineLocal complete.
Jan 26 11:38:11 c74he20180108h1.localdomain systemd[1]: Stopped Suspend/Resume Running libvirt Guests.
Jan 26 11:38:16 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Jan 26 11:38:16 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.
Jan 26 11:38:16 c74he20180108h1.localdomain libvirt-guests.sh[20927]: libvirt-guests is configured not to start any guests on boot

This did not happen on the first attempt, but we still see something in the libvirt-guests journalctl logs:

-- Logs begin at Fri 2018-01-26 11:13:53 CET, end at Fri 2018-01-26 12:20:17 CET. --
Jan 26 11:14:06 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Jan 26 11:14:06 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.
Jan 26 11:25:04 c74he20180108h1.localdomain systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Jan 26 11:25:04 c74he20180108h1.localdomain virsh[15215]: All-whitespace username.
Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:
Jan 26 11:25:05 c74he20180108h1.localdomain virsh[15248]: All-whitespace username.
Jan 26 11:25:05 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 10Please enter your authentication name: Please enter your password:
Jan 26 11:25:06 c74he20180108h1.localdomain virsh[15255]: All-whitespace username.
Jan 26 11:25:06 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 9Please enter your authentication name: Please enter your password:
Jan 26 11:25:07 c74he20180108h1.localdomain virsh[15262]: All-whitespace username.
Jan 26 11:25:07 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 8Please enter your authentication name: Please enter your password:
Jan 26 11:25:08 c74he20180108h1.localdomain virsh[15269]: All-whitespace username.
Jan 26 11:25:08 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 7Please enter your authentication name: Please enter your password:
Jan 26 11:25:09 c74he20180108h1.localdomain virsh[15276]: All-whitespace username.
Jan 26 11:25:09 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 6Please enter your authentication name: Please enter your password:
Jan 26 11:25:10 c74he20180108h1.localdomain virsh[15283]: All-whitespace username.
Jan 26 11:25:10 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 5Please enter your authentication name: Please enter your password:
Jan 26 11:25:11 c74he20180108h1.localdomain virsh[15290]: All-whitespace username.
Jan 26 11:25:11 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 4Please enter your authentication name: Please enter your password:
Jan 26 11:25:12 c74he20180108h1.localdomain virsh[15297]: All-whitespace username.
Jan 26 11:25:12 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 3Please enter your authentication name: Please enter your password:
Jan 26 11:25:13 c74he20180108h1.localdomain virsh[15304]: All-whitespace username.
Jan 26 11:25:13 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 2Please enter your authentication name: Please enter your password:
Jan 26 11:25:14 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Unable to connect to libvirt currently. Retrying .. 1Can't connect to default. Skipping.
Jan 26 11:25:14 c74he20180108h1.localdomain systemd[1]: Stopped Suspend/Resume Running libvirt Guests.
Jan 26 11:25:19 c74he20180108h1.localdomain systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Jan 26 11:25:19 c74he20180108h1.localdomain systemd[1]: Started Suspend/Resume Running libvirt Guests.

After the deployment attempt, /etc/sysconfig/libvirt-guests contains:

##
# Start of VDSM configuration
##
URIS=qemu+tls://c74he20180108h1.localdomain/system
ON_BOOT=ignore
ON_SHUTDOWN=shutdown
PARALLEL_SHUTDOWN=0
SHUTDOWN_TIMEOUT=600
##

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup 2.2.8

How reproducible:
100%

Steps to Reproduce:
1. Start hosted-engine --deploy from the CLI on a clean system
2. Let it run until 'Please specify the storage you would like to use (glusterfs, iscsi, fc, nfs)[nfs]:' and then stop it with Ctrl+C
3. Retry the deployment on the same host without rebooting

Actual results:
It hangs on:

[ INFO ] TASK [Wait for the host to be up]
[ ERROR ] Error: Failed to read response.
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Failed to read response."}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

The local engine VM gets shut down by libvirt-guests.

Expected results:
hosted-engine-setup deploys correctly also after a previous failure

Additional info:
I think it was introduced here: https://gerrit.ovirt.org/#/c/79840/

The commit message says:

- the setup introduced by this commit is non intrusive
  the libvirt-guests by default still remains disabled and stopped
  the libvirt-guests by default configuration cannot connect to the libvirt,
  because they do not have proper connection string, therefore they can't
  touch the running VMs in any way

This explains why libvirt-guests doesn't kick in on the first attempt (hence 'Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:'), but it becomes effective on subsequent attempts due to the leftover configuration.

Petr, should we simply stop libvirt-guests while deploying hosted-engine? Is there any other way to prevent libvirt-guests from acting on HostedEngineLocal?
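For illustration only, one conceivable mitigation (a sketch, not the project's fix) would be to neutralize the leftover ON_SHUTDOWN action before retrying the deployment, so that stopping libvirt-guests can no longer shut down HostedEngineLocal. The helper name below is hypothetical, and the demo runs against a temporary copy of the file rather than the real /etc/sysconfig/libvirt-guests:

```python
# Hypothetical mitigation sketch (not the project's fix): rewrite the
# leftover VDSM-written ON_SHUTDOWN=shutdown to ON_SHUTDOWN=ignore so a
# stop of libvirt-guests no longer shuts down the local engine VM.
import re
import tempfile

def neutralize_on_shutdown(path):
    """Rewrite ON_SHUTDOWN=<anything> to ON_SHUTDOWN=ignore in place."""
    with open(path) as f:
        text = f.read()
    text = re.sub(r'^ON_SHUTDOWN=.*$', 'ON_SHUTDOWN=ignore', text,
                  flags=re.M)
    with open(path, 'w') as f:
        f.write(text)

# Demo against a temporary copy; on a real host the file would be
# /etc/sysconfig/libvirt-guests, followed by restarting libvirt-guests
# so it re-reads the configuration.
with tempfile.NamedTemporaryFile('w', suffix='-libvirt-guests',
                                 delete=False) as demo:
    demo.write("URIS=qemu+tls://c74he20180108h1.localdomain/system\n"
               "ON_BOOT=ignore\n"
               "ON_SHUTDOWN=shutdown\n")
neutralize_on_shutdown(demo.name)
with open(demo.name) as f:
    print(f.read())  # ON_SHUTDOWN is now 'ignore'
```

Note that this only silences the shutdown action; the stale URIS entry would still be left behind, so a proper fix has to live in the deployment flow itself.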
Please note that there is probably also something not working correctly on the host-deploy side.

On the first attempt too, at:

[ INFO ] TASK [Wait for the host to be up]

host-deploy had already configured and started vdsm and libvirtd (with SASL authentication) and, directly or indirectly, libvirt-guests; but at that point libvirt-guests still had to be restarted to consume its new configuration, and so on the first attempt we see:

Jan 26 11:25:04 c74he20180108h1.localdomain virsh[15215]: All-whitespace username.
Jan 26 11:25:04 c74he20180108h1.localdomain libvirt-guests.sh[15210]: Please enter your authentication name: Please enter your password:

just because libvirt-guests doesn't correctly authenticate to libvirtd over SASL.
(In reply to Simone Tiraboschi from comment #2)
> Please note that probably there is also something not correctly working on
> host-deploy side.

This is probably because libvirt and vdsm are still configured and started by host-deploy, while libvirt-guests gets configured in a second step via ansible:
https://github.com/oVirt/ovirt-engine/blob/master/packaging/playbooks/roles/ovirt-host-deploy-libvirt-guests/tasks/main.yml
*** Bug 1539734 has been marked as a duplicate of this bug. ***
We have discussed this issue and could not find a suitable workaround that would provide a hot-fix.

The libvirt-guests service is configured so that it shuts down all running VMs before the host shuts down. The VM shutdown is triggered when the service is stopped. This is proper behavior and should not be changed. What is unfortunate is that libvirt-guests is stopped during the ovirt-host-deploy phase, which results in stopping the running engine VM.

We were able to pinpoint the source of this pressing issue to the otopi part of ovirt-host-deploy, namely these lines:

https://github.com/oVirt/ovirt-host-deploy/blob/master/src/plugins/ovirt-host-deploy/vdsm/packages.py#L122
https://github.com/oVirt/ovirt-host-deploy/blob/master/src/plugins/ovirt-host-deploy/vdsm/packages.py#L164

The right place to solve this issue is inside the legacy otopi code.
This is not a trivial change: currently otopi only allows stopping and starting services, not restarting them in a single shot:
https://github.com/oVirt/otopi/blob/master/src/plugins/otopi/services/systemd.py#L134
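To illustrate the distinction, the sketch below contrasts a two-call stop/start sequence with a single-shot restart at the level of the systemctl invocations involved. The `systemctl_command` helper is made up for this example and is not part of the otopi services API:

```python
# Hypothetical sketch: a systemd services layer ultimately builds
# 'systemctl <action> <service>.service' command lines. Today otopi's
# callers can only request 'stop' and 'start'; a single-shot restart
# would be one additional action. systemctl_command is a made-up helper.

def systemctl_command(action, service):
    """Build the systemctl invocation for a service action."""
    return ['systemctl', action, '%s.service' % service]

# What callers must do today: two separate service operations.
two_step = [systemctl_command('stop', 'libvirt-guests'),
            systemctl_command('start', 'libvirt-guests')]

# What a single-shot restart would issue instead: one invocation.
one_step = systemctl_command('restart', 'libvirt-guests')
print(one_step)  # ['systemctl', 'restart', 'libvirt-guests.service']
```

This only shows the command construction; whether and how libvirt-guests' stop-time guest handling fires across a restart is exactly the behavior the real fix has to take into account.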
*** Bug 1538938 has been marked as a duplicate of this bug. ***
*** Bug 1539563 has been marked as a duplicate of this bug. ***
Works for me on these components:

rhvm-appliance-4.2-20180202.0.el7.noarch
ovirt-hosted-engine-setup-2.2.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.4-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Moving to verified.

http://pastebin.test.redhat.com/552873
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.