Bug 1565516

Summary: Deploy HE failed after checking host result up 120 times via cockpit based ansible deployment.

Product: [oVirt] cockpit-ovirt
Component: Hosted Engine
Version: 0.11.20
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Wei Wang <weiwang>
Assignee: Phillip Bailey <phbailey>
QA Contact: Wei Wang <weiwang>
Docs Contact:
CC: bugs, cshao, david, huzhao, jiaczhan, mike, qiyuan, rbarry, stirabos, yaniwang, ycui, ylavi, yzhao
Target Milestone: ovirt-4.2.3
Target Release: ---
Flags: rule-engine: ovirt-4.2+, ylavi: exception+, cshao: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-20 09:18:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
- deployment fail picture (flags: none)
- Log files (flags: none)
- HE deploy fail logs (flags: none)

Description Wei Wang 2018-04-10 08:10:11 UTC
Created attachment 1419726 [details]
deployment fail picture

Description of problem:
HE deployment fails after the "Wait for the host to be up" task checks the host status 120 times, when deploying via the cockpit-based Ansible flow.

[ INFO ] TASK [Add host]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false}
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
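For context, the failing task is a retry loop that polls the engine API until the newly added host reports status "up"; the empty "ovirt_hosts" list above means the host never reached that state within the retry budget. A minimal sketch of such a task, assuming the ovirt_hosts_facts module and using illustrative variable names (not the playbook's exact source):

- name: Wait for the host to be up
  ovirt_hosts_facts:
    pattern: "name={{ he_host_name }} status=up"  # he_host_name is a placeholder
    auth: "{{ ovirt_auth }}"
  register: host_result_up_check
  until: host_result_up_check.ansible_facts.ovirt_hosts | length >= 1
  retries: 120   # matches the "attempts": 120 in the error above
  delay: 5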


Version-Release number of selected component (if applicable):
RHVH-4.2-20180408.0-RHVH-x86_64-dvd1.iso
cockpit-bridge-160-3.el7.x86_64
cockpit-160-3.el7.x86_64
cockpit-ws-160-3.el7.x86_64
cockpit-system-160-3.el7.noarch
cockpit-ovirt-dashboard-0.11.20-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.9-1.el7ev.noarch
rhvm-appliance-4.2-20180404.0.el7.4.2.rpm

How reproducible:
100%

Steps to Reproduce:
1. Clean install RHVH-4.2-20180408.0-RHVH-x86_64-dvd1.iso with anaconda
2. Deploy hosted-engine via the cockpit-based Ansible deployment flow.


Actual results:
HE deployment fails after the host "up" check is retried 120 times.

Expected results:
HE deployment succeeds without any errors.


Additional info:
This issue cannot be reproduced with the CLI-based Ansible deployment.

Comment 1 Wei Wang 2018-04-10 08:11:00 UTC
Created attachment 1419727 [details]
Log files

Comment 2 Yihui Zhao 2018-04-10 09:22:31 UTC
I cannot reproduce this issue; please see https://bugzilla.redhat.com/show_bug.cgi?id=1562011#c8.

I cannot find the previous bug about this issue; I think it was fixed in ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch.

Comment 3 Wei Wang 2018-04-10 09:46:00 UTC
(In reply to Yihui Zhao from comment #2)
> I cannot reproduce this issue; please see
> https://bugzilla.redhat.com/show_bug.cgi?id=1562011#c8.
> 
> I cannot find the previous bug about this issue; I think it was fixed in
> ovirt-hosted-engine-setup-2.2.15-1.el7ev.noarch.

Yes, maybe it is not 100% reproducible.

Comment 4 Wei Wang 2018-04-18 03:54:00 UTC
Tested with the new version RHVH-4.2-20180410.1-RHVH-x86_64-dvd1.iso: out of 5 deployment attempts, 1 failed.
[ INFO ] TASK [Wait for SSH to restart on the local VM]
[ ERROR ] fatal: [localhost -> localhost]: FAILED! => {"changed": false, "elapsed": 301, "msg": "Timeout when waiting for rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com:22"}
[ INFO ] TASK [include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}

From the above information we can see this failure is different from the original one (log attached), but deployment failures occur intermittently, which is a poor experience for the customer.
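The "elapsed": 301 in this second failure is consistent with an Ansible wait_for task giving up after a 300-second timeout while polling port 22 on the local engine VM. A sketch of what such a task looks like, with the host variable name as an assumption based on the error output:

- name: Wait for SSH to restart on the local VM
  wait_for:
    host: "{{ he_fqdn }}"  # e.g. rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com
    port: 22
    delay: 5
    timeout: 300
  delegate_to: localhost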

Comment 5 Wei Wang 2018-04-18 03:54:47 UTC
Created attachment 1423318 [details]
HE deploy fail logs

Comment 6 Ryan Barry 2018-04-18 13:46:19 UTC
Is this reproducible on the CLI?

Also, if this is being tested in a VM, please try on physical hardware.

Comment 7 Wei Wang 2018-04-19 06:04:25 UTC
(In reply to Ryan Barry from comment #6)
> Is this reproducible on the CLI?
> 
> Also, if this is being tested in a VM, please try on physical hardware.


Retested 8 times using the CLI and could not reproduce the issue. QE tests on physical hardware regardless of whether cockpit or the CLI is used.

Comment 8 Simone Tiraboschi 2018-04-19 15:18:50 UTC
I tried reproducing it 4 times in a row from cockpit; it worked as expected every time.

Comment 9 Sandro Bonazzola 2018-04-20 09:00:30 UTC
We need a reproducer to be able to do something about this. Setting conditional NAK pending a reproducer.

Comment 10 Simone Tiraboschi 2018-04-20 09:18:13 UTC
Closing as WORKSFORME; please reopen if a reproducer is found.

Comment 11 Mike Goodwin 2018-05-28 21:17:05 UTC
I don't know how to reproduce this, but I found a workaround.

It happened to me with oVirt Node 4.2.3 

ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch


May 28 14:14:59 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: /usr/share/ovirt-vmconsole/ovirt-vmconsole-host/ovirt-vmconsole-host-sshd/sshd_config line 23: Deprecated option RSAAuthentication
May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: Could not load host key: /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: sshd: no hostkeys available -- exiting.


The above service fails to start because for some reason the SSH host key isn't generated.

When I used `ssh-keygen` to generate the host key at that path, started/enabled ovirt-vmconsole-host-sshd, and re-deployed, it got past that error.

ssh-keygen -h -t rsa /etc/pki/ovirt-vmconsole/host-ssh_host_rsa

Comment 12 David Peters 2018-09-27 07:52:10 UTC
The problem seems to occur if you have "dns" before "files" in the hosts line of your nsswitch.conf.

When the setup script modifies the hosts file and then tries to SSH into the engine, it cannot connect: DNS points it at the IP the engine is meant to end up on, not at the address the bridge interface sets up initially (see the example below).
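To illustrate the ordering issue (a sketch based on this comment; that the Ansible flow adds a temporary /etc/hosts entry for the engine FQDN during bootstrap is an assumption drawn from the description above):

# /etc/nsswitch.conf

# problematic ordering: the DNS answer (the final engine IP) wins, and the
# temporary /etc/hosts entry pointing at the local bootstrap VM is ignored
hosts: dns files

# working ordering: /etc/hosts is consulted first, as the setup expects
hosts: files dns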

Comment 13 David Peters 2018-09-27 07:59:18 UTC
(In reply to Mike Goodwin from comment #11)
> I don't know how to reproduce this, but I found a workaround.
> 
> It happened to me with oVirt Node 4.2.3 
> 
> ovirt-hosted-engine-setup-2.2.20-1.el7.centos.noarch
> 
> 
> May 28 14:14:59 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]:
> /usr/share/ovirt-vmconsole/ovirt-vmconsole-host/ovirt-vmconsole-host-sshd/
> sshd_config line 23: Deprecated option RSAAuthentication
> May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: Could not
> load host key: /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
> May 28 14:15:00 ovn-1.vm-net2 ovirt-vmconsole-host-sshd[1629]: sshd: no
> hostkeys available -- exiting.
> 
> 
> The above service fails to start because for some reason the SSH host key
> isn't generated.
> 
> When I used `ssh-keygen` to generate the host key at that path,
> started/enabled ovirt-vmconsole-host-sshd, and re-deployed, it got past that
> error.
> 
> ssh-keygen -h -t rsa /etc/pki/ovirt-vmconsole/host-ssh_host_rsa

I think you will need a -f if you specify the location:

ssh-keygen -h -t rsa -f /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
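Putting comments 11 and 13 together, the full workaround would look roughly like this (a sketch: -N '' is added here to create the key without a passphrase, and the -h flag is dropped since per ssh-keygen(1) it only applies when signing certificates with -s):

# generate the missing vmconsole host key, then start the service and re-deploy
ssh-keygen -t rsa -N '' -f /etc/pki/ovirt-vmconsole/host-ssh_host_rsa
systemctl enable --now ovirt-vmconsole-host-sshd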