Description of problem:
OCP HA deployment failed at 90%, with:
"
2017-01-23 12:18:36,627 p=32734 u=foreman | TASK [Login as specified OSE user] *********************************************
2017-01-23 12:18:36,996 p=32734 u=foreman | fatal: [ocpha-ocp-master1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "cloudsuite-install", "-p", "changeme"], "delta": "0:00:00.118025", "end": "2017-01-23 12:18:37.106487", "failed": true, "rc": 1, "start": "2017-01-23 12:18:36.988462", "stderr": "", "stdout": "Login failed (401 Unauthorized)", "stdout_lines": ["Login failed (401 Unauthorized)"], "warnings": []}
"

I'm able to log in as root and also as "cloudsuite-install" user:
"
[root@sat62fusor .ssh]# ssh -i id_rsa-ocpha ocpha-ocp-master1.example.com
Last login: Mon Jan 23 09:16:47 2017 from 192.168.235.10
[root@ocpha-ocp-master1 ~]# logout
[root@sat62fusor .ssh]# ssh cloudsuite-install.com
cloudsuite-install.com's password:
Last login: Mon Jan 23 12:18:36 2017 from 192.168.235.10
[cloudsuite-install@ocpha-ocp-master1 ~]$
"

I'm unable to connect to master1 webUI.

When I resumed the task, it failed with:
"
2017-01-24 04:51:09,755 p=3775 u=foreman | TASK [Create registry serviceaccount] ******************************************
2017-01-24 04:51:10,693 p=3775 u=foreman | fatal: [ocpha-ocp-master1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "create", "serviceaccount", "registry"], "delta": "0:00:00.584173", "end": "2017-01-24 04:51:15.181933", "failed": true, "rc": 1, "start": "2017-01-24 04:51:14.597760", "stderr": "Error from server: serviceaccounts \"registry\" already exists", "stdout": "", "stdout_lines": [], "warnings": []}
"

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170120.t.0

How reproducible:
Unsure, happened to me once

Steps to Reproduce:
1. Have enough HW resources to deploy OCP HA, I have multiple RHV hypervisors
2. Kick off OCP HA deployment
3. The deployment fails at 90% with unresumable error.

Actual results:
OCP HA deployment failed

Expected results:
OCP HA deployment successful

Additional info:
I'll try to reproduce this with ansible_debug enabled.
Antonin, if you do hit this again, please run 'oc get users' to see whether the cloudsuite-install user was ever created. That will help us debug why you hit this issue.
Antonin, are you able to reproduce this issue and provide logs or a machine we can ssh into?
I reproduced the issue just now, ISO QCI-1.1-RHEL-7-20170124.1. I logged into the master1 machine and ran 'oc get users':
"
[root@sat62fusor .ssh]# ssh -i id_rsa-ocpha2 ocpha2-ocp-master1.example.com
The authenticity of host 'ocpha2-ocp-master1.example.com (192.168.235.133)' can't be established.
ECDSA key fingerprint is 72:70:c0:f2:4c:fc:55:b9:de:e8:8c:8b:bc:31:a3:69.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ocpha2-ocp-master1.example.com,192.168.235.133' (ECDSA) to the list of known hosts.
Last login: Fri Jan 27 05:39:05 2017 from 192.168.235.10
[root@ocpha2-ocp-master1 ~]# oc get users
[root@ocpha2-ocp-master1 ~]#
"
The command returned no users. I'm sending credentials to the env to Dylan/John via email.
What appears to be happening is that the 'oc login' requests are being round-robin load balanced among the three master hosts. The user's credentials are stored only on the primary master host, as seen here: https://github.com/fusor/ansible-ocp/blob/master/playbooks/ha/post_install.yml#L24 so only requests routed to the primary master host succeed, and a login that lands on either of the other two masters fails with 401 Unauthorized. Derek Whatley confirmed this by monitoring the haproxy logs.
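To illustrate why the failure is intermittent, here is a toy model of the behaviour described above. It is only a sketch under stated assumptions, not the haproxy configuration or the playbook code: the host names master2 and master3 are made up (only master1 appears in this report), the per-host credential store is a plain dict, and the balancer is a simple round-robin cycle.

```python
from itertools import cycle

# Hypothetical host names; only master1 is named in the report.
MASTERS = ["master1", "master2", "master3"]

# Assumption: each master keeps its own credential store and the
# playbook populated only the primary one.
credential_store = {
    "master1": {"cloudsuite-install": "changeme"},
    "master2": {},
    "master3": {},
}


def login(host, user, password):
    """Return True if this host knows the user and the password matches."""
    return credential_store[host].get(user) == password


def simulate(requests):
    """Route `requests` logins round-robin and return (host, ok) pairs."""
    backend = cycle(MASTERS)
    return [
        (host, login(host, "cloudsuite-install", "changeme"))
        for host, _ in zip(backend, range(requests))
    ]


results = simulate(6)
for host, ok in results:
    print(host, "OK" if ok else "Login failed (401 Unauthorized)")
```

In this model two out of every three login attempts are rejected, which would be consistent with the deployment passing on some runs and failing on others depending on which master the request happens to reach.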
https://github.com/fusor/ansible-ocp/pull/17
The PR made it into QCI-1.1-RHEL-7-20170127.t.0.
Verified in QCI-1.1-RHEL-7-20170209.t.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:0335