Created attachment 1241661 [details]
excerpt from production.log

Description of problem:
Deployment of OCP HA on HW machines. I have plenty of machines and power available. RHV installs correctly without problems; OCP fails at 10% with:

"
| TASK [wait_for_host_up : Gather facts] *****************************************
| fatal: [depl2-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r
| Connection closed by 192.168.235.137\r
| ", "unreachable": true}
| fatal: [depl2-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
| ", "unreachable": true}
| fatal: [depl2-ocp-node1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
| ", "unreachable": true}
| fatal: [depl2-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
| ", "unreachable": true}
| fatal: [depl2-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r
| Connection closed by 192.168.235.141\r
| ", "unreachable": true}
| fatal: [depl2-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
| ", "unreachable": true}
| fatal: [depl2-ocp-ha1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Read from socket failed: Connection reset by peer\r
| ", "unreachable": true}
| fatal: [depl2-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'depl2-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r
| Connection closed by 192.168.235.147\r
| ", "unreachable": true}
"

See the complete log in the attachment. All hosts are green in Satellite.

The deployment failed overnight. In the morning I tried to resume the failing OCP task "Actions::Fusor::Deployment::OpenShift::Deploy"; it got a step further but failed when it tried to register to Satellite:

"
| TASK [satellite_registration : Register to Satellite] **************************
| fatal: [depl2-ocp-master3.example.com]: FAILED! => {"changed": false, "cmd": "subscription-manager register --activationkey OpenShift-depl2-OpenShift --org Default_Organization", "failed": true, "msg": "ERROR: current transaction is aborted, commands ignored until end of transaction block", "rc": 70, "stderr": "ERROR: current transaction is aborted, commands ignored until end of transaction block
"

This is a known issue, see bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1412784

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170112.t.0

How reproducible:
Unsure; first time deploying OCP HA

Steps to Reproduce:
1. Have enough HW to deploy OCP HA
2. Kick off an RHV+OCP HA deployment with NFS storage
3. RHV installs correctly, OCP fails at 10%

Actual results:
OCP HA failed to deploy

Expected results:
OCP HA deploys correctly

Additional info:
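For anyone hitting the same wall, the quickest way to confirm the failure mode outside of the installer is to attempt the same SSH login by hand from the Satellite/Foreman host. This is a hypothetical check, not part of the original report; the key path follows the id_rsa-<deployment> pattern visible later in this bug, and the remote user is an assumption:

# Hypothetical manual probe; key path and remote user are assumptions,
# adjust to whatever the deployment actually uses.
sudo -u foreman ssh -v \
  -i /usr/share/foreman/.ssh/id_rsa-depl2 \
  -o BatchMode=yes -o StrictHostKeyChecking=no \
  root@depl2-ocp-master3.example.com true

If this also ends in "Connection closed by ...", the problem is on the VM side rather than in Ansible itself.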
Reproduced with QCI-1.1-RHEL-7-20170116.t.0. I will try with a more powerful machine as the RHV engine and see if I can reproduce.
Antonin provided me with more info on the hardware used with this deploy:

> RHV engine:
> - hp-dl320e-04
> - 4 CPUs
> - 4096 MB RAM
> RHV hypervisors:
> - smicro-5037-02 and smicro-5037-03
>   - both 12 CPUs
>   - both 32767 MB RAM
> - dell-r220-06 and dell-r220-10
>   - both 4 CPUs
>   - both 32009 MB RAM

It is noteworthy that four (4) hypervisors were used, while initial OCP HA testing was performed with one (1) hypervisor with 64 GB of RAM. Looking into whether this could be a complication related to the placement of node VMs on the available hypervisors.
Reproduced using QCI-1.1-RHEL-7-20170120.t.0 with different error messages:

"
2017-01-23 08:00:04,246 p=4155 u=foreman | TASK [wait_for_host_up : Gather facts] *****************************************
2017-01-23 08:02:28,606 p=4155 u=foreman | fatal: [ocpha-ocp-master1.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master1.example.com,192.168.235.133' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.133\r\n", "unreachable": true}
2017-01-23 08:02:52,969 p=4155 u=foreman | fatal: [ocpha-ocp-master3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master3.example.com,192.168.235.137' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.137\r\n", "unreachable": true}
2017-01-23 08:03:38,596 p=4155 u=foreman | fatal: [ocpha-ocp-master2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-master2.example.com,192.168.235.135' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.135\r\n", "unreachable": true}
2017-01-23 08:04:24,821 p=4155 u=foreman | fatal: [ocpha-ocp-node2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node2.example.com,192.168.235.141' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.141\r\n", "unreachable": true}
2017-01-23 08:04:31,017 p=4155 u=foreman | fatal: [ocpha-ocp-node3.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-node3.example.com,192.168.235.143' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.143\r\n", "unreachable": true}
2017-01-23 08:05:24,895 p=4155 u=foreman | ok: [ocpha-ocp-node1.example.com]
2017-01-23 08:06:23,663 p=4155 u=foreman | fatal: [ocpha-ocp-ha2.example.com]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added 'ocpha-ocp-ha2.example.com,192.168.235.147' (ECDSA) to the list of known hosts.\r\nConnection closed by 192.168.235.147\r\n", "unreachable": true}
2017-01-23 08:06:50,395 p=4155 u=foreman | ok: [ocpha-ocp-ha1.example.com]
"

I used one of the Dell machines as the engine, so every machine in the setup has at least 4 CPUs and 32 GB of memory.

It looks as if the fatal error is related to the VMs not yet being in known_hosts; two of them, however, succeeded: ocp-ha1 and ocp-node1. Upon resuming the failing task, everything ran fine: there are no errors in ansible.log and the installation continued with "TASK [satellite_registration : Get certificate from Satellite]", which is the next step.
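One way to test the known_hosts theory would be to check, right after a failure, whether the offending node's host key was recorded for the foreman user; a hypothetical diagnostic (the known_hosts path is an assumption):

# Hypothetical check: does the foreman user's known_hosts already contain
# this node's key? Prints the matching entry if so. Path is an assumption.
sudo -u foreman ssh-keygen -F ocpha-ocp-master1.example.com \
  -f /usr/share/foreman/.ssh/known_hosts

Worth noting that "Permanently added ... to the list of known hosts" is only a warning; the fatal part is the "Connection closed" that follows it.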
Re-deployed OCP HA and reproduced, this time with ansible_debug enabled. What I can see from the log:

'Failed to connect to the host via ssh: ...
Incorrect RSA1 identifier
Could not load "/usr/share/foreman/.ssh/id_rsa-ocpha2" as a RSA1 public key'

Later it tries different methods for authentication:

'Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
...
start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password
...
preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey
...
Next authentication method: gssapi-with-mic: Unspecified GSS failure. Minor code may provide more information
...
we did not send a packet, disable method
...
remaining preferred: hostbased,publickey
...
Next authentication method: gssapi-keyex: No valid Key exchange context, we did not send a packet, disable method
...'

And the last method, publickey:

'remaining preferred: ,publickey
...
Next authentication method: publickey: Offering RSA public key: /usr/share/foreman/.ssh/id_rsa-ocpha2
...
we sent a publickey packet, wait for reply
...
Connection closed by 192.168.235.157'

I'm attaching the relevant part of ansible.log and will hand my testing environment over to Derek so he can investigate further.
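The same negotiation can be reproduced outside of Ansible with a verbose SSH run, and the key file itself can be sanity-checked directly. A hypothetical example using the host and key from the log above (the remote user is an assumption):

# Hypothetical reproduction of the verbose negotiation seen in ansible.log;
# the remote user is an assumption.
sudo -u foreman ssh -vvv \
  -i /usr/share/foreman/.ssh/id_rsa-ocpha2 \
  root@192.168.235.157 true

# Print the key's fingerprint; if this succeeds, the key file is intact.
ssh-keygen -l -f /usr/share/foreman/.ssh/id_rsa-ocpha2

For what it's worth, the "Incorrect RSA1 identifier" line is normal OpenSSH debug chatter when loading a non-RSA1 key, so the interesting part is the server closing the connection right after the publickey offer.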
Created attachment 1243902 [details]
excerpt from ansible.log with ansible_debug enabled
This was an issue where cloud-init brought up SSH before the users were configured, so I added an extra step to the wait_for_host_up role that loops until it is able to log in as the correct user with the correct SSH key (a sketch of the idea follows).

https://github.com/fusor/ansible-ovirt/pull/29
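A minimal sketch of that kind of retry step, assuming standard Ansible connection variables; the task name, retry counts, and exact mechanics here are illustrative, not copied from the PR:

# Illustrative only -- the real change is in the linked PR. From the control
# node, keep retrying an explicit SSH login as the deployment user with the
# deployment key, so the play proceeds only once cloud-init has finished
# creating users and installing authorized keys.
- name: Wait until SSH login as the correct user actually works
  command: >
    ssh -o BatchMode=yes -o StrictHostKeyChecking=no
    -i {{ ansible_ssh_private_key_file }}
    {{ ansible_user }}@{{ inventory_hostname }} true
  delegate_to: localhost
  register: ssh_login
  until: ssh_login.rc == 0
  retries: 30
  delay: 10

Running the probe from the control node via delegate_to sidesteps the problem that an UNREACHABLE result is not retried by until/retries, which is exactly what bites the plain "Gather facts" task here.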
The PR made it into QCI-1.1-RHEL-7-20170127.t.0.
I haven't seen this in a couple of deployments I did lately. The fix seems to work for me.
Verified with 20170203.t.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335