Bug 1413926 - OCP HA fails at 10% with "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer"
Summary: OCP HA fails at 10% with "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - OpenShift
Version: 1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 1.1
Assignee: Derek Whatley
QA Contact: Antonin Pagac
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-17 10:33 UTC by Antonin Pagac
Modified: 2017-02-28 01:44 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-28 01:44:26 UTC
Target Upstream Version:


Attachments
excerpt from production.log (75.49 KB, text/plain)
2017-01-17 10:33 UTC, Antonin Pagac
excerpt from ansible.log with ansible_debug enabled (33.63 KB, text/plain)
2017-01-24 13:00 UTC, Antonin Pagac


Links:
Red Hat Bugzilla 1412784 (medium, CLOSED): [Satellite] host registration fails (last updated 2021-02-22 00:41:40 UTC)
Red Hat Product Errata RHEA-2017:0335 (normal, SHIPPED_LIVE): Red Hat Quickstart Installer 1.1 (last updated 2017-02-28 06:36:13 UTC)

Internal Links: 1412784

Description Antonin Pagac 2017-01-17 10:33:00 UTC
Created attachment 1241661 [details]
excerpt from production.log

Description of problem:
Deployment of OCP HA on HW machines. I have plenty of machines and power available. RHV installs correctly without problems; OCP fails at 10% with:

"
| TASK [wait_for_host_up : Gather facts] *****************************************
 | fatal: [depl2-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.137\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.141\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-ha1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.147\r
 | ", "unreachable": true}
"

See complete log in attachment. All hosts are green in Satellite.
The deployment failed overnight; in the morning I tried to resume the failing OCP task "Actions::Fusor::Deployment::OpenShift::Deploy" and it got a step further, but failed when it tried to register to Satellite:

"
 | TASK [satellite_registration : Register to Satellite] **************************
 | fatal: [depl2-ocp-master3.example.com]: FAILED! => {"changed": false, "cmd": "subscription-manager register --activationkey OpenShift-depl2-OpenShift --org Default_Organization", "failed": true, "msg": "ERROR:  current transaction is aborted, commands ignored until end of transaction block", "rc": 70, "stderr": "ERROR:  current transaction is aborted, commands ignored until end of transaction block
"

This is a known issue; see bug https://bugzilla.redhat.com/show_bug.cgi?id=1412784

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170112.t.0

How reproducible:
Unsure; first time deploying OCP HA

Steps to Reproduce:
1. Have enough HW to deploy OCP HA
2. Kick off RHV+OCP HA deployment with NFS storage
3. RHV installs correctly, OCP fails at 10%

Actual results:
OCP HA failed to deploy

Expected results:
OCP HA deploys correctly

Additional info:

Comment 2 Antonin Pagac 2017-01-17 13:24:02 UTC
Reproduced with QCI-1.1-RHEL-7-20170116.t.0.

I will try a more powerful machine as the RHV engine and see if I can reproduce.

Comment 3 Derek Whatley 2017-01-18 15:51:36 UTC
Antonin provided me with more info on the hardware used with this deploy:

> RHV engine:
> - hp-dl320e-04
>   - 4 CPUs
>   - 4096 MB RAM

> RHV hypervisors:
> - smicro-5037-02 and smicro-5037-03
>   - both 12 CPUs
>   - both 32767 MB RAM
> - dell-r220-06 and dell-r220-10
>   - both 4 CPUs
>   - both 32009 MB RAM

It is noteworthy that four (4) hypervisors were used. Initial OCP HA testing was performed with one (1) hypervisor with 64 GB of RAM. Looking into whether this could be a complication related to placement of node VMs on the available hypervisors.

Comment 4 Antonin Pagac 2017-01-23 13:56:41 UTC
Reproduced using QCI-1.1-RHEL-7-20170120.t.0 with different error messages:

"
2017-01-23 08:00:04,246 p=4155 u=foreman |  TASK [wait_for_host_up : Gather facts] *****************************************
2017-01-23 08:02:28,606 p=4155 u=foreman |  fatal: [ocpha-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master1.example.com,192.168.235.133' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.133\r\n", "unreachable": true}
2017-01-23 08:02:52,969 p=4155 u=foreman |  fatal: [ocpha-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.137\r\n", "unreachable": true}
2017-01-23 08:03:38,596 p=4155 u=foreman |  fatal: [ocpha-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master2.example.com,192.168.235.135' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.135\r\n", "unreachable": true}
2017-01-23 08:04:24,821 p=4155 u=foreman |  fatal: [ocpha-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.141\r\n", "unreachable": true}
2017-01-23 08:04:31,017 p=4155 u=foreman |  fatal: [ocpha-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node3.example.com,192.168.235.143' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.143\r\n", "unreachable": true}
2017-01-23 08:05:24,895 p=4155 u=foreman |  ok: [ocpha-ocp-node1.example.com]
2017-01-23 08:06:23,663 p=4155 u=foreman |  fatal: [ocpha-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.147\r\n", "unreachable": true}
2017-01-23 08:06:50,395 p=4155 u=foreman |  ok: [ocpha-ocp-ha1.example.com]
"

I used one of the Dell machines as the engine, so every machine in the setup has at least 4 CPUs and 32 GB of memory.
It seems that the fatal errors are caused by the VMs not yet being in known_hosts. Two of them, however, succeeded: ocp-ha1 and ocp-node1.

Upon resuming the failing task, everything runs fine. There are no errors in ansible.log, and the installation continues with "TASK [satellite_registration : Get certificate from Satellite]", which is the next step.

Comment 5 Antonin Pagac 2017-01-24 12:59:16 UTC
Re-deployed OCP HA and reproduced the issue, this time with ansible_debug enabled. What I can see in the log:

'Failed to connect to the host via ssh: ... Incorrect RSA1 identifier Could not load \"/usr/share/foreman/.ssh/id_rsa-ocpha2\" as a RSA1 public key'

Later it tries different authentication methods:

'Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password ... start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password ... preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey ... Next authentication method: gssapi-with-mic: Unspecified GSS failure.  Minor code may provide more information ... we did not send a packet, disable method ... remaining preferred: hostbased,publickey ... Next authentication method: gssapi-keyex: No valid Key exchange context, we did not send a packet, disable method...'

And the last method, publickey:

'remaining preferred: ,publickey ... Next authentication method: publickey: Offering RSA public key: /usr/share/foreman/.ssh/id_rsa-ocpha2 ... we sent a publickey packet, wait for reply ... Connection closed by 192.168.235.157'
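
To isolate the failing publickey attempt from the GSSAPI noise, the same probe can be run by hand from the Satellite host. The following is a hypothetical one-off task (not part of QCI); the remote user name "cloud-user" is an assumption, and the key path is the one shown in the log above:

# Hypothetical probe task; forces a publickey-only login with the
# deployment key and verbose SSH output so the failure is easy to read.
- name: Probe publickey-only SSH login to the new VM
  local_action: >
    command ssh -vvv -o BatchMode=yes -o IdentitiesOnly=yes
    -o PreferredAuthentications=publickey
    -i /usr/share/foreman/.ssh/id_rsa-ocpha2
    cloud-user@192.168.235.157 true
  changed_when: false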

I'm attaching the relevant part of ansible.log and will hand my testing environment over to Derek so he can investigate further.

Comment 6 Antonin Pagac 2017-01-24 13:00:07 UTC
Created attachment 1243902 [details]
excerpt from ansible.log with ansible_debug enabled

Comment 7 Fabian von Feilitzsch 2017-01-27 20:46:10 UTC
This was an issue where cloud-init brought up SSH before the users were configured, so I added an extra step to the wait_for_host_up role that loops until it is able to log in as the correct user with the correct ssh key.

https://github.com/fusor/ansible-ovirt/pull/29
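
For context, a minimal sketch of that kind of wait step (illustrative only, not the exact contents of the PR; the fallback user name and the retry/delay values are assumptions):

# Illustrative only: keep retrying a bare SSH login until cloud-init has
# created the user and installed the authorized key.
- name: Wait until SSH accepts the deployment user and key
  local_action: >
    command ssh -o BatchMode=yes -o ConnectTimeout=10
    -o StrictHostKeyChecking=no
    -i {{ ansible_ssh_private_key_file }}
    {{ ansible_user | default('cloud-user') }}@{{ inventory_hostname }} true
  register: ssh_probe
  until: ssh_probe.rc == 0
  retries: 30
  delay: 10
  changed_when: false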

Comment 8 Dylan Murray 2017-01-30 14:51:35 UTC
PR made it into QCI-1.1-RHEL-7-20170127.t.0

Comment 9 Antonin Pagac 2017-02-06 09:55:13 UTC
Haven't seen this in a couple of deployments I did lately. The fix seems to work for me.

Comment 10 Antonin Pagac 2017-02-06 12:24:24 UTC
Verified with 20170203.t.0

Comment 12 errata-xmlrpc 2017-02-28 01:44:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335

