Bug 1413926 - OCP HA fails at 10% with "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer"
Summary: OCP HA fails at 10% with "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - OpenShift
Version: 1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 1.1
Assignee: Derek Whatley
QA Contact: Antonin Pagac
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-17 10:33 UTC by Antonin Pagac
Modified: 2017-02-28 01:44 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-28 01:44:26 UTC
Target Upstream Version:


Attachments
excerpt from production.log (75.49 KB, text/plain)
2017-01-17 10:33 UTC, Antonin Pagac
excerpt from ansible.log with ansible_debug enabled (33.63 KB, text/plain)
2017-01-24 13:00 UTC, Antonin Pagac


Links:
Red Hat Bugzilla 1412784 (medium, CLOSED): [Satellite] host registration fails (last updated 2021-02-22 00:41:40 UTC)
Red Hat Product Errata RHEA-2017:0335 (normal, SHIPPED_LIVE): Red Hat Quickstart Installer 1.1 (last updated 2017-02-28 06:36:13 UTC)

Internal Links: 1412784

Description Antonin Pagac 2017-01-17 10:33:00 UTC
Created attachment 1241661 [details]
excerpt from production.log

Description of problem:
Deployment of OCP HA on HW machines. I have plenty of machines and power available. RHV installs correctly without problems; OCP fails at 10% with:

"
| TASK [wait_for_host_up : Gather facts] *****************************************
 | fatal: [depl2-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.137\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.141\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-ha1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
 | ", "unreachable": true}
 | fatal: [depl2-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r
 | Connection closed by 192.168.235.147\r
 | ", "unreachable": true}
"

See complete log in attachment. All hosts are green in Satellite.
The deployment failed overnight; in the morning I tried to resume the failing OCP task "Actions::Fusor::Deployment::OpenShift::Deploy" and it got a step further, but failed when it tried to register to Satellite:

"
 | TASK [satellite_registration : Register to Satellite] **************************
 | fatal: [depl2-ocp-master3.example.com]: FAILED! => {"changed": false, "cmd": "subscription-manager register --activationkey OpenShift-depl2-OpenShift --org Default_Organization", "failed": true, "msg": "ERROR:  current transaction is aborted, commands ignored until end of transaction block", "rc": 70, "stderr": "ERROR:  current transaction is aborted, commands ignored until end of transaction block
"

This is a known issue; see bug https://bugzilla.redhat.com/show_bug.cgi?id=1412784

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170112.t.0

How reproducible:
Unsure; first time deploying OCP HA

Steps to Reproduce:
1. Have enough HW to deploy OCP HA
2. Kick off RHV+OCP HA deployment with NFS storage
3. RHV installs correctly, OCP fails at 10%

Actual results:
OCP HA failed to deploy

Expected results:
OCP HA deploys correctly

Additional info:

Comment 2 Antonin Pagac 2017-01-17 13:24:02 UTC
Reproduced with QCI-1.1-RHEL-7-20170116.t.0.

I will try a more powerful machine as the RHV engine and see if I can reproduce.

Comment 3 Derek Whatley 2017-01-18 15:51:36 UTC
Antonin provided me with more info on the hardware used with this deploy:

> RHV engine:
> - hp-dl320e-04
>   - 4 CPUs
>   - 4096 MB RAM

> RHV hypervisors:
> - smicro-5037-02 and smicro-5037-03
>   - both 12 CPUs
>   - both 32767 MB RAM
> - dell-r220-06 and dell-r220-10
>   - both 4 CPUs
>   - both 32009 MB RAM

It is noteworthy that four (4) hypervisors were used. Initial OCP HA testing was performed with one (1) hypervisor with 64 GB of RAM. Looking into whether this could be a complication related to placement of node VMs on the available hypervisors.

Comment 4 Antonin Pagac 2017-01-23 13:56:41 UTC
Reproduced using QCI-1.1-RHEL-7-20170120.t.0 with different error messages:

"
2017-01-23 08:00:04,246 p=4155 u=foreman |  TASK [wait_for_host_up : Gather facts] *****************************************
2017-01-23 08:02:28,606 p=4155 u=foreman |  fatal: [ocpha-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master1.example.com,192.168.235.133' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.133\r\n", "unreachable": true}
2017-01-23 08:02:52,969 p=4155 u=foreman |  fatal: [ocpha-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.137\r\n", "unreachable": true}
2017-01-23 08:03:38,596 p=4155 u=foreman |  fatal: [ocpha-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master2.example.com,192.168.235.135' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.135\r\n", "unreachable": true}
2017-01-23 08:04:24,821 p=4155 u=foreman |  fatal: [ocpha-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.141\r\n", "unreachable": true}
2017-01-23 08:04:31,017 p=4155 u=foreman |  fatal: [ocpha-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node3.example.com,192.168.235.143' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.143\r\n", "unreachable": true}
2017-01-23 08:05:24,895 p=4155 u=foreman |  ok: [ocpha-ocp-node1.example.com]
2017-01-23 08:06:23,663 p=4155 u=foreman |  fatal: [ocpha-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.147\r\n", "unreachable": true}
2017-01-23 08:06:50,395 p=4155 u=foreman |  ok: [ocpha-ocp-ha1.example.com]
"

I used one of the Dell machines as the engine, so every machine in the setup has at least 4 CPUs and 32 GB of memory.
It seems that the fatal errors are caused by the VMs not yet being in known_hosts. Two of them, however, succeeded: ocp-ha1 and ocp-node1.

Upon resuming the failing task, everything runs fine. There are no errors in ansible.log, and the installation continues with "TASK [satellite_registration : Get certificate from Satellite]", which is the next step.

Comment 5 Antonin Pagac 2017-01-24 12:59:16 UTC
Re-deployed OCP HA and reproduced the issue, this time with ansible_debug enabled. What I can see in the log:

'Failed to connect to the host via ssh: ... Incorrect RSA1 identifier Could not load \"/usr/share/foreman/.ssh/id_rsa-ocpha2\" as a RSA1 public key'

Later it tries different authentication methods:

'Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password ... start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password ... preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey ... Next authentication method: gssapi-with-mic: Unspecified GSS failure.  Minor code may provide more information ... we did not send a packet, disable method ... remaining preferred: hostbased,publickey ... Next authentication method: gssapi-keyex: No valid Key exchange context, we did not send a packet, disable method...'

And the last method, publickey:

'remaining preferred: ,publickey ... Next authentication method: publickey: Offering RSA public key: /usr/share/foreman/.ssh/id_rsa-ocpha2 ... we sent a publickey packet, wait for reply ... Connection closed by 192.168.235.157'
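
To isolate the failing publickey attempt from the GSSAPI noise, the same probe can be run by hand from the Satellite host. The following is a hypothetical one-off task (not part of QCI); the remote user name "cloud-user" is an assumption, and the key path is the one shown in the log above:

# Hypothetical probe task; forces a publickey-only login with the
# deployment key and verbose SSH output so the failure is easy to read.
- name: Probe publickey-only SSH login to the new VM
  local_action: >
    command ssh -vvv -o BatchMode=yes -o IdentitiesOnly=yes
    -o PreferredAuthentications=publickey
    -i /usr/share/foreman/.ssh/id_rsa-ocpha2
    cloud-user@192.168.235.157 true
  changed_when: false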

I'm attaching the relevant part of ansible.log and will hand my testing environment over to Derek so he can investigate further.

Comment 6 Antonin Pagac 2017-01-24 13:00:07 UTC
Created attachment 1243902 [details]
excerpt from ansible.log with ansible_debug enabled

Comment 7 Fabian von Feilitzsch 2017-01-27 20:46:10 UTC
This was an issue where cloud-init brought up SSH before the users were configured, so I added an extra step to the wait_for_host_up role that loops until it is able to log in as the correct user with the correct ssh key.

https://github.com/fusor/ansible-ovirt/pull/29
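
For context, a minimal sketch of that kind of wait step (illustrative only, not the exact contents of the PR; the fallback user name and the retry/delay values are assumptions):

# Illustrative only: keep retrying a bare SSH login until cloud-init has
# created the user and installed the authorized key.
- name: Wait until SSH accepts the deployment user and key
  local_action: >
    command ssh -o BatchMode=yes -o ConnectTimeout=10
    -o StrictHostKeyChecking=no
    -i {{ ansible_ssh_private_key_file }}
    {{ ansible_user | default('cloud-user') }}@{{ inventory_hostname }} true
  register: ssh_probe
  until: ssh_probe.rc == 0
  retries: 30
  delay: 10
  changed_when: false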

Comment 8 Dylan Murray 2017-01-30 14:51:35 UTC
PR made it into QCI-1.1-RHEL-7-20170127.t.0

Comment 9 Antonin Pagac 2017-02-06 09:55:13 UTC
Haven't seen this in a couple of deployments I did lately. The fix seems to work for me.

Comment 10 Antonin Pagac 2017-02-06 12:24:24 UTC
Verified with 20170203.t.0

Comment 12 errata-xmlrpc 2017-02-28 01:44:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335

