Bug 1414414

Summary: OCP HA failed at 70%, when redeploying on existing hardware; ansible: "Installed environment detected and no additional nodes specified: aborting."
Product: Red Hat Quickstart Cloud Installer
Reporter: Antonin Pagac <apagac>
Component: Installation - OpenShift
Assignee: Dylan Murray <dymurray>
Status: CLOSED ERRATA
QA Contact: Antonin Pagac <apagac>
Severity: unspecified
Docs Contact: Derek <dcadzow>
Priority: unspecified
Version: 1.1
CC: apagac, arubin, bthurber, dwhatley, jmatthew, llasmith, qci-bugzillas, smallamp
Target Milestone: ---
Keywords: Triaged
Target Release: 1.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1420441 (view as bug list)
Environment:
Last Closed: 2017-02-28 01:44:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1416509    
Bug Blocks: 1420441    
Attachments:
- excerpt from ansible.log (flags: none)
- 20170203.t.0 ansible.log (flags: none)
- ansible log of 2nd deployment of ansible (flags: none)

Description Antonin Pagac 2017-01-18 12:47:25 UTC
Created attachment 1242161 [details]
excerpt from ansible.log

Description of problem:
OCP HA deployment failed while ansible was doing "TASK [execute atomic-openshift-installer]". This deployment had to be resumed 3 times at 10% to get all nodes registered (bug 1412784). It seems that for some reason it did not detect three of the OCP nodes. It correctly summarized:

"Total OpenShift masters: 3", "Total OpenShift nodes: 6"

but then detected only 3 nodes from a total of 6. It seems that there are 3 unscheduled OCP nodes.
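For reference, the scheduling status can be double-checked from a master (assuming oc is configured there); in OCP 3.x the masters normally appear as nodes with SchedulingDisabled, which would match the three "Unscheduled" entries above:

  oc get nodes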
From ansible.log:

"
2017-01-17 12:43:11,122 p=2155 u=foreman |  fatal: [depl1-ocp-master1.example.com]: FAILED! => {"changed": true, "cmd": "atomic-openshift-installer -v -u -c /tmp/atomic-openshift-installer.answers.cfg.yml install", "delta": "0:00:17.028231", "end": "2017-01-17 12:43:11.687981", "failed": true, "rc": 1, "start": "2017-01-17 12:42:54.659750", "stderr": "", "stdout": "*** Installation Summary ***\n\nHosts:\n- depl1-ocp-ha1.example.com\n  - Load Balancer (Preconfigured)\n- depl1-ocp-node1.example.com\n  - OpenShift node (Dedicated)\n- depl1-ocp-node2.example.com\n  - OpenShift node (Dedicated)\n- depl1-ocp-node3.example.com\n  - OpenShift node (Dedicated)\n- depl1-ocp-master1.example.com\n  - OpenShift master\n  - OpenShift node (Unscheduled)\n  - Etcd\n- depl1-ocp-master2.example.com\n  - OpenShift master\n  - OpenShift node (Unscheduled)\n  - Etcd\n- depl1-ocp-master3.example.com\n  - OpenShift master\n  - OpenShift node (Unscheduled)\n  - Etcd\n\nTotal OpenShift masters: 3\nTotal OpenShift nodes: 6\n\nNOTE: Multiple masters specified, this will be an HA deployment with a separate\netcd cluster. You will be prompted to provide the FQDN of a load balancer and\na host for storage once finished entering hosts.\n\n\nGathering information from hosts...\nInstalled environment detected.\ndepl1-ocp-node1.example.com is already an OpenShift node\ndepl1-ocp-node2.example.com is already an OpenShift node\ndepl1-ocp-node3.example.com is already an OpenShift node\ndepl1-ocp-master1.example.com is already an OpenShift master\ndepl1-ocp-master2.example.com is already an OpenShift master\ndepl1-ocp-master3.example.com is already an OpenShift master\nInstalled environment detected and no additional nodes specified: aborting. If you want a fresh install, use `atomic-openshift-installer install --force`", "stdout_lines": ["*** Installation Summary ***", "", "Hosts:", "- depl1-ocp-ha1.example.com", "  - Load Balancer (Preconfigured)", "- depl1-ocp-node1.example.com", "  - OpenShift node (Dedicated)", "- depl1-ocp-node2.example.com", "  - OpenShift node (Dedicated)", "- depl1-ocp-node3.example.com", "  - OpenShift node (Dedicated)", "- depl1-ocp-master1.example.com", "  - OpenShift master", "  - OpenShift node (Unscheduled)", "  - Etcd", "- depl1-ocp-master2.example.com", "  - OpenShift master", "  - OpenShift node (Unscheduled)", "  - Etcd", "- depl1-ocp-master3.example.com", "  - OpenShift master", "  - OpenShift node (Unscheduled)", "  - Etcd", "", "Total OpenShift masters: 3", "Total OpenShift nodes: 6", "", "NOTE: Multiple masters specified, this will be an HA deployment with a separate", "etcd cluster. You will be prompted to provide the FQDN of a load balancer and", "a host for storage once finished entering hosts.", "", "", "Gathering information from hosts...", "Installed environment detected.", "depl1-ocp-node1.example.com is already an OpenShift node", "depl1-ocp-node2.example.com is already an OpenShift node", "depl1-ocp-node3.example.com is already an OpenShift node", "depl1-ocp-master1.example.com is already an OpenShift master", "depl1-ocp-master2.example.com is already an OpenShift master", "depl1-ocp-master3.example.com is already an OpenShift master", "Installed environment detected and no additional nodes specified: aborting. If you want a fresh install, use `atomic-openshift-installer install --force`"], "warnings": []}
2017-01-17 12:43:11,123 p=2155 u=foreman |  PLAY RECAP *********************************************************************
2017-01-17 12:43:11,123 p=2155 u=foreman |  depl1-ocp-master1.example.com : ok=4    changed=0    unreachable=0    failed=1
"

Ansible tried this three times. From deployment.log:

"
Setting ansible log to /usr/share/foreman/log/deployments/depl1-1/ansible.log
ansible: executing /usr/share/ansible-ocp//playbooks/ha/install.yml with /usr/share/foreman/tmp/depl1/ansible.hosts
ansible-playbook returned a non-zero exit code on attempt 1/3.
Setting ansible log to /usr/share/foreman/log/deployments/depl1-1/ansible.log
ansible: executing /usr/share/ansible-ocp//playbooks/ha/install.yml with /usr/share/foreman/tmp/depl1/ansible.hosts
ansible-playbook returned a non-zero exit code on attempt 2/3.
Setting ansible log to /usr/share/foreman/log/deployments/depl1-1/ansible.log
ansible: executing /usr/share/ansible-ocp//playbooks/ha/install.yml with /usr/share/foreman/tmp/depl1/ansible.hosts
ansible-playbook returned a non-zero exit code on attempt 3/3.
"

Attaching ansible.log excerpt. When resumed, the error is the same.

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170116.t.0

How reproducible:
Unsure; first time deploying OCP HA

Steps to Reproduce:
1. Kick off an OCP HA deployment with enough HW power
2. If node registration fails (bug 1412784), resume the task
3. Repeat step 2 until all nodes are registered to Satellite
4. Resume the task and continue with the deployment
5. Task fails at 70%

Actual results:
OCP HA deployment failed

Expected results:
OCP HA deployment successful

Additional info:

Comment 2 Antonin Pagac 2017-01-24 10:34:33 UTC
Didn't hit this with QCI-1.1-RHEL-7-20170120.t.0

Comment 3 Antonin Pagac 2017-01-25 10:26:13 UTC
Hit this with QCI-1.1-RHEL-7-20170120.t.0 while re-deploying OCP HA.

I deleted all hosts from my Satellite, rebooted the HW machines to have them re-discovered, deleted my old HA deployment, and created a new one.

Comment 4 Derek Whatley 2017-01-30 14:25:01 UTC
(In reply to Antonin Pagac from comment #3)
> Hit this with QCI-1.1-RHEL-7-20170120.t.0 while re-deploying OCP HA.
> 
> I deleted all hosts from my Satellite, rebooted the HW machines to have them
> re-discovered, deleted my old HA deployment and created new one.

From the logs, it seems like some of your hosts have been previously provisioned by the atomic-openshift-installer.

Now that BZ 1412784 (satellite registration) seems to be resolved, are you able to start from the beginning and make it all the way through the deploy process?

Comment 5 Antonin Pagac 2017-02-06 09:49:16 UTC
Created attachment 1247982 [details]
20170203.t.0 ansible.log

Derek,

I reproduced the issue with 20170203.t.0. Installed the ISO and kicked off an OCP HA deployment. It failed at 70%; attaching ansible.log.

The task is not resumable, so I'm going to delete the deployment and try again.

Comment 6 Antonin Pagac 2017-02-07 09:09:50 UTC
Reproduced also when redeploying.

Comment 7 Dylan Murray 2017-02-07 20:51:55 UTC
After some investigation I am unsure what is occurring in this deployment. All of the logs point to a scenario where the installer finished and someone is rerunning the installer without the --force option, so it's not installing from scratch. When resuming a deployment during a failed install this is sometimes possible, but it shouldn't be possible from a greenfield install. Antonin, are the machines that are being used as OCP hosts baremetal?
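For reference, forcing a clean reinstall by hand would look roughly like the command the installer suggests in its abort message (a sketch only; I'm assuming --force combines with the -u/-c options that the fusor task already passes):

  atomic-openshift-installer -v -u -c /tmp/atomic-openshift-installer.answers.cfg.yml install --force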

Comment 8 Landon LaSmith 2017-02-08 04:34:18 UTC
Created attachment 1248541 [details]
ansible log of 2nd deployment of ansible

I just hit this on a baremetal deployment of OCP HA. It made it all the way to 70% without any intervention. The 3 previous deployments didn't hit this issue.

QCI Media Version: QCI-1.1-RHEL-7-20170203.0

Comment 9 Antonin Pagac 2017-02-08 10:01:54 UTC
Hi Dylan, yes, I'm running OCP HA deployments on a baremetal setup.

I hit this again when deploying OCP HA as a second deployment, after having deleted the first one, which was OCP non-HA and CFME. I'm going to try once again with a clean ISO install.

Comment 10 John Matthews 2017-02-08 15:09:41 UTC
How often does this issue show up?

Do we have clear reproducer steps so we can recreate the issue?

Comment 11 Antonin Pagac 2017-02-08 15:34:35 UTC
Hi John,

the issue seems to appear every time after I run a deployment using re-discovered baremetal machines. This means that I do a deployment (OCP HA or non-HA), wait until it errors out or succeeds, then delete it, delete all the hosts from Satellite (all content hosts, all discovered hosts, everything), then restart all the baremetal hosts and let them PXE boot again and be discovered by Satellite. After this, I do a deployment of OCP HA and that fails at 70%.

It also seems to appear intermittently when deploying on a freshly installed Satellite, doing the first deployment, but using baremetal machines that have been used in the past to deploy OCP HA. I'm running a deployment right now to verify.

Reproducer steps would be:
1. Have enough (multiple) baremetal machines to deploy OCP HA
2. Install latest compose from ISO on baremetal, run fusor-installer
3. Restart all the baremetal machines; they boot from PXE and get discovered by Satellite
4. Deploy OCP (HA or non-HA), wait for it to finish (either successfully or not)
5. Delete the deployment via Satellite UI
6. Remove all hosts from Satellite (via the web UI or using hammer; see the sketch after this list)
7. Repeat steps 3 and 4
8. Error appears at 70% of the OCP HA deployment
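A rough hammer sketch for step 6 (assuming hammer is configured on the Satellite host; the exact subcommands for discovered hosts may differ):

  hammer host list
  hammer host delete --name <fqdn-of-host>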

Comment 12 Antonin Pagac 2017-02-08 16:17:08 UTC
With QCI-1.1-RHEL-7-20170207.t.0, clean install of Satellite and OCP HA as first deployment, I did not hit this issue. My deployment got past 70%.

Comment 13 John Matthews 2017-02-08 16:26:13 UTC
We are deferring this issue to 1.2 as the 1.1 testing cycle is nearing its end.

Comment #11 has reproducer steps.

Comment 15 Antonin Pagac 2017-02-09 15:36:56 UTC
Did not hit the issue with QCI-1.1-RHEL-7-20170208.t.0, fresh install of Satellite and OCP HA as a first deployment.

Comment 16 John Matthews 2017-02-10 17:01:18 UTC
We believe this issue is fixed in OCP 3.4.
The BZ describing the issue is https://bugzilla.redhat.com/show_bug.cgi?id=1416509

Comment 17 Dylan Murray 2017-02-10 17:10:18 UTC
The upgrade to 3.4 was in QCI-1.1-RHEL-7-20170210.t.0.

Comment 18 Antonin Pagac 2017-02-17 15:40:59 UTC
I got a successful deployment without any intervention using ISO QCI-1.1-RHEL-7-20170216.0. OCP-related packages are at version 3.4.

Marking as verified.

Comment 20 errata-xmlrpc 2017-02-28 01:44:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335