Bug 1565516 - Deploy HE failed after checking host result up 120 times via cockpit based ansible deployment.
Summary: Deploy HE failed after checking host result up 120 times via cockpit based ansible deployment.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: cockpit-ovirt
Classification: oVirt
Component: Hosted Engine
Version: 0.11.20
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.3
Target Release: ---
Assignee: Phillip Bailey
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-10 08:10 UTC by Wei Wang
Modified: 2018-09-27 07:59 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-20 09:18:13 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+
ylavi: exception+
cshao: testing_ack+


Attachments
deployment fail picture (91.76 KB, image/png)
2018-04-10 08:10 UTC, Wei Wang
Log files (1.10 MB, application/x-gzip)
2018-04-10 08:11 UTC, Wei Wang
HE deploy fail logs (3.16 MB, application/x-gzip)
2018-04-18 03:54 UTC, Wei Wang

Description Wei Wang 2018-04-10 08:10:11 UTC
Created attachment 1419726 [details]
deployment fail picture

Description of problem:
Deploying HE via the cockpit-based Ansible deployment fails after the host status check retries 120 times.

[ INFO ] TASK [Add host]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}


Version-Release number of selected component (if applicable):
RHVH-4.2-20180408.0-RHVH-x86_64-dvd1.iso
cockpit-bridge-160-3.el7.x86_64
cockpit-160-3.el7.x86_64
cockpit-ws-160-3.el7.x86_64
cockpit-system-160-3.el7.noarch
cockpit-ovirt-dashboard-0.11.20-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.9-1.el7ev.noarch
rhvm-appliance-4.2-20180404.0.el7.4.2.rpm

How reproducible:
100%

Steps to Reproduce:
1. Clean install RHVH-4.2-20180408.0-RHVH-x86_64-dvd1.iso with Anaconda.
2. Deploy hosted-engine via the cockpit-based Ansible deployment.


Actual results:
HE deployment fails after the host status check retries 120 times.

Expected results:
HE deploys successfully without any error.


Additional info:
This issue cannot be reproduced with the CLI Ansible deployment.

Comment 1 Wei Wang 2018-04-10 08:11:00 UTC
Created attachment 1419727 [details]
Log files

Comment 2 Yihui Zhao 2018-04-10 09:22:31 UTC
I cannot reproduce this issue; please see https://bugzilla.redhat.com/show_bug.cgi?id=1562011#c8.

I cannot find the previous bug about this issue; I think it was fixed in ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch.

Comment 3 Wei Wang 2018-04-10 09:46:00 UTC
(In reply to Yihui Zhao from comment #2)
> I cannot reproduce this issue; please see
> https://bugzilla.redhat.com/show_bug.cgi?id=1562011#c8.
> 
> I cannot find the previous bug about this issue; I think it was fixed in
> ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch

Yes, maybe it is not 100% reproducible.

Comment 4 Wei Wang 2018-04-18 03:54:00 UTC
Tested with the new version RHVH-4.2-20180410.1-RHVH-x86_64-dvd1.iso: out of 5 deployment attempts, 1 failed.
[ INFO ] TASK [Wait for SSH to restart on the local VM]
[ ERROR ] fatal: [localhost -> localhost]: FAILED! => {"changed": false, "elapsed": 301, "msg": "Timeout when waiting for rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com:22"}
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

From the information above, this failure is different from the original one (logs attached), but deployment failures are intermittent, which makes for a poor customer experience.
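
For context, this second failure is a plain TCP wait on the engine VM's SSH port; the "elapsed": 301 matches a 300-second timeout. A minimal sketch of such a task with Ansible's wait_for module, assuming a hypothetical engine_fqdn variable (illustrative only, not the actual playbook):

# Sketch: block until port 22 on the engine VM accepts connections,
# giving up after 300 seconds.
- name: Wait for SSH to restart on the local VM
  wait_for:
    host: "{{ engine_fqdn }}"
    port: 22
    timeout: 300
  delegate_to: localhost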

Comment 5 Wei Wang 2018-04-18 03:54:47 UTC
Created attachment 1423318 [details]
HE deploy fail logs

Comment 6 Ryan Barry 2018-04-18 13:46:19 UTC
Is this reproducible on the CLI?

Also, if this is being tested in a VM, please try on physical hardware.

Comment 7 Wei Wang 2018-04-19 06:04:25 UTC
(In reply to Ryan Barry from comment #6)
> Is this reproducible on the CLI?
> 
> Also, if this is being tested in a VM, please try on physical hardware.


I retested 8 times using the CLI and could not reproduce this issue. QE tests on physical hardware in both cases, whether with cockpit or with the CLI.

Comment 8 Simone Tiraboschi 2018-04-19 15:18:50 UTC
I tried reproducing it 4 times in a row from cockpit; it worked as expected each time.

Comment 9 Sandro Bonazzola 2018-04-20 09:00:30 UTC
We need a reproducer to be able to do anything about this. Setting conditional NAK pending a reproducer.

Comment 10 Simone Tiraboschi 2018-04-20 09:18:13 UTC
Closing as WORKSFORME; please reopen if a reproducer is found.

Comment 11 Mike Goodwin 2018-05-28 21:17:05 UTC
I don't know how to reproduce this, but I found a workaround.

It happened to me with oVirt Node 4.2.3.

ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch


May 28 14:14:59 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: /usr/share/ovirt-vmconsole/ovirt-vmconsole-host/ovirt-vmconsole-host-sshd/sshd_config line 23: Deprecated option RSAAuthentication
May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: Could not load host key: /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: sshd: no hostkeys available -- exiting.


The above service fails to start because, for some reason, the SSH host key is never generated.

When I used `ssh-keygen` to generate the host key at that path, started and enabled ovirt-vmconsole-host-sshd, and re-deployed, it got past that error.

ssh-keygen -h -t rsa /etc/pki/ovirt-vmconsole/host-ssh_host_rsa

Comment 12 David Peters 2018-09-27 07:52:10 UTC
The problem seems to occur if you have "dns" listed before "files" in your nsswitch.conf.

When the setup script modifies the hosts file and tries to SSH into the engine, it cannot, because DNS points it to the IP the engine is meant to be set up on, not the one the bridge interface sets up initially.
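
If this reading is right, making name resolution prefer /etc/hosts over DNS during deployment should avoid the problem. A hedged sketch of that change as an Ansible task (my interpretation of the comment above, not a confirmed fix):

# Sketch: ensure 'files' is consulted before 'dns' so the entry the setup
# writes to /etc/hosts wins while the engine VM is still on the local bridge.
- name: Prefer /etc/hosts over DNS for host lookups
  lineinfile:
    path: /etc/nsswitch.conf
    regexp: '^hosts:'
    line: 'hosts: files dns'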

Comment 13 David Peters 2018-09-27 07:59:18 UTC
(In reply to Mike Goodwin from comment #11)
> I don't know how to reproduce this, but I found a workaround.
> 
> It happened to me with oVirt Node 4.2.3.
> 
> ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
> 
> 
> May 28 14:14:59 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]:
> /usr/share/ovirt-vmconsole/ovirt-vmconsole-host/ovirt-vmconsole-host-sshd/
> sshd_config line 23: Deprecated option RSAAuthentication
> May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: Could not
> load host key: /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
> May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: sshd: no
> hostkeys available -- exiting.
> 
> 
> The above service fails to start because, for some reason, the SSH host key
> is never generated.
> 
> When I used `ssh-keygen` to generate the host key at that path, started and
> enabled ovirt-vmconsole-host-sshd, and re-deployed, it got past that error.
> 
> ssh-keygen -h -t rsa /etc/pki/ovirt-vmconsole/host-ssh_host_rsa

I think you will need a -f if you specify the location:

ssh-keygen -h -t rsa -f /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
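
For completeness, here is the whole workaround from comment 11 with the -f fix, expressed as hedged Ansible tasks. This is an illustrative sketch, not part of any shipped playbook; I dropped the -h flag (ssh-keygen only uses it when signing certificates with -s) and assume an empty passphrase (-N ''), as is usual for host keys:

# Sketch: generate the missing vmconsole host key, then start the service.
- name: Generate the missing ovirt-vmconsole host key
  command: ssh-keygen -t rsa -N '' -f /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
  args:
    creates: /etc/pki/ovirt-vmconsole/host-ssh_host_rsa  # skip if key exists

- name: Start and enable ovirt-vmconsole-host-sshd
  systemd:
    name: ovirt-vmconsole-host-sshd
    state: started
    enabled: yes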

