Bug 1415979

Summary: OCP HA failed at 90%, ansible: "Login failed (401 Unauthorized)"
Product: Red Hat Quickstart Cloud Installer
Reporter: Antonin Pagac <apagac>
Component: Installation - OpenShift
Assignee: Dylan Murray <dymurray>
Status: CLOSED ERRATA
QA Contact: Antonin Pagac <apagac>
Severity: unspecified
Docs Contact: Derek <dcadzow>
Priority: unspecified
Version: 1.1
CC: apagac, arubin, bthurber, dymurray, qci-bugzillas
Target Milestone: ---
Keywords: Triaged
Target Release: 1.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-28 01:45:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Antonin Pagac 2017-01-24 09:56:56 UTC
Description of problem:
OCP HA deployment failed at 90%, with:

"
2017-01-23 12:18:36,627 p=32734 u=foreman |  TASK [Login as specified OSE user] *********************************************
2017-01-23 12:18:36,996 p=32734 u=foreman |  fatal: [ocpha-ocp-master1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "cloudsuite-install", "-p", "changeme"], "delta": "0:00:00.118025", "end": "2017-01-23 12:18:37.106487", "failed": true, "rc": 1, "start": "2017-01-23 12:18:36.988462", "stderr": "", "stdout": "Login failed (401 Unauthorized)", "stdout_lines": ["Login failed (401 Unauthorized)"], "warnings": []}
"

I'm able to log in over SSH as root and also as the "cloudsuite-install" user:

"
[root@sat62fusor .ssh]# ssh -i id_rsa-ocpha ocpha-ocp-master1.example.com
Last login: Mon Jan 23 09:16:47 2017 from 192.168.235.10
[root@ocpha-ocp-master1 ~]# logout

[root@sat62fusor .ssh]# ssh cloudsuite-install.com
cloudsuite-install.com's password: 
Last login: Mon Jan 23 12:18:36 2017 from 192.168.235.10
[cloudsuite-install@ocpha-ocp-master1 ~]$ 
"

I'm unable to connect to the master1 web UI.
When I resumed the task, it failed with:

"
2017-01-24 04:51:09,755 p=3775 u=foreman |  TASK [Create registry serviceaccount] ******************************************
2017-01-24 04:51:10,693 p=3775 u=foreman |  fatal: [ocpha-ocp-master1.example.com]: FAILED! => {"changed": true, "cmd": ["oc", "create", "serviceaccount", "registry"], "delta": "0:00:00.584173", "end": "2017-01-24 04:51:15.181933", "failed": true, "rc": 1, "start": "2017-01-24 04:51:14.597760", "stderr": "Error from server: serviceaccounts \"registry\" already exists", "stdout": "", "stdout_lines": [], "warnings": []}
"

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170120.t.0

How reproducible:
Unsure, happened to me once

Steps to Reproduce:
1. Have enough HW resources to deploy OCP HA (I have multiple RHV hypervisors).
2. Kick off OCP HA deployment.
3. The deployment fails at 90% with an error that cannot be resumed.

Actual results:
OCP HA deployment failed

Expected results:
OCP HA deployment successful

Additional info:
I'll try to reproduce this with ansible_debug enabled.

Comment 2 Dylan Murray 2017-01-25 14:10:06 UTC
Antonin, if you do hit this again, please run 'oc get users' to see whether the cloudsuite-install user was ever created. That will help debug why you hit this issue.

Comment 3 John Matthews 2017-01-26 18:36:44 UTC
Antonin,

Are you able to reproduce this issue and provide logs or a machine we can ssh into?

Comment 4 Antonin Pagac 2017-01-27 14:08:55 UTC
I reproduced the issue just now, ISO QCI-1.1-RHEL-7-20170124.1.

I logged into the master1 machine and ran 'oc get users':

"
[root@sat62fusor .ssh]# ssh -i id_rsa-ocpha2 ocpha2-ocp-master1.example.com
The authenticity of host 'ocpha2-ocp-master1.example.com (192.168.235.133)' can't be established.
ECDSA key fingerprint is 72:70:c0:f2:4c:fc:55:b9:de:e8:8c:8b:bc:31:a3:69.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ocpha2-ocp-master1.example.com,192.168.235.133' (ECDSA) to the list of known hosts.
Last login: Fri Jan 27 05:39:05 2017 from 192.168.235.10
[root@ocpha2-ocp-master1 ~]# oc get users
[root@ocpha2-ocp-master1 ~]# 
"

I'm sending credentials for the environment to Dylan/John via email.

Comment 5 Dylan Murray 2017-01-27 16:23:21 UTC
What appears to be happening is that the 'oc login' requests are being round-robin load balanced among the three master hosts. The users' credentials are stored only on the primary master host, as seen here: https://github.com/fusor/ansible-ocp/blob/master/playbooks/ha/post_install.yml#L24, so only requests routed to the primary master host succeed. Derek Whatley confirmed this by monitoring the haproxy logs.
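
To illustrate the failure mode described above, here is a minimal Python sketch of round-robin balancing over three masters when only the primary holds the user's credentials. It is a toy model, not the installer's code: the host names, the in-memory CREDENTIALS table, and the login() helper are all hypothetical, and the 200/401 codes only stand in for the success and "401 Unauthorized" responses of 'oc login'.

```python
from itertools import cycle

# Hypothetical short names for the three master hosts behind the balancer.
MASTERS = ["master1", "master2", "master3"]

# Assumption for illustration: the user's credentials exist only on the
# primary master (master1); the other two masters have no entry for it.
CREDENTIALS = {
    "master1": {"cloudsuite-install": "changeme"},
    "master2": {},
    "master3": {},
}


def login(host, user, password):
    """Return 200 if this host knows the user/password pair, else 401."""
    return 200 if CREDENTIALS[host].get(user) == password else 401


def round_robin_logins(n, user, password):
    """Route n login attempts across MASTERS in strict round-robin order."""
    backends = cycle(MASTERS)
    return [(host, login(host, user, password))
            for host, _ in zip(backends, range(n))]


results = round_robin_logins(6, "cloudsuite-install", "changeme")
for host, status in results:
    print(host, status)
```

In this model only the attempts routed to master1 succeed, so roughly two out of three logins fail, which would be consistent with a failure that does not reproduce on every run.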

Comment 6 Dylan Murray 2017-01-27 16:32:48 UTC
https://github.com/fusor/ansible-ocp/pull/17

Comment 7 Dylan Murray 2017-01-30 14:51:10 UTC
The PR made it into: QCI-1.1-RHEL-7-20170127.t.0

Comment 8 Antonin Pagac 2017-02-10 13:09:49 UTC
Verified in QCI-1.1-RHEL-7-20170209.t.0.

Comment 10 errata-xmlrpc 2017-02-28 01:45:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335