Bug 1414414
Summary: OCP HA failed at 70% when redeploying on existing hardware; ansible: "Installed environment detected and no additional nodes specified: aborting."

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Quickstart Cloud Installer | Reporter: | Antonin Pagac <apagac> |
| Component: | Installation - OpenShift | Assignee: | Dylan Murray <dymurray> |
| Status: | CLOSED ERRATA | QA Contact: | Antonin Pagac <apagac> |
| Severity: | unspecified | Docs Contact: | Derek <dcadzow> |
| Priority: | unspecified | | |
| Version: | 1.1 | CC: | apagac, arubin, bthurber, dwhatley, jmatthew, llasmith, qci-bugzillas, smallamp |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 1.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1420441 (view as bug list) | Environment: | |
| Last Closed: | 2017-02-28 01:44:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1416509 | | |
| Bug Blocks: | 1420441 | | |
| Attachments: | | | |
Description

Antonin Pagac 2017-01-18 12:47:25 UTC

Didn't hit this with QCI-1.1-RHEL-7-20170120.t.0.

Hit this with QCI-1.1-RHEL-7-20170120.t.0 while re-deploying OCP HA. I deleted all hosts from my Satellite, rebooted the HW machines to have them re-discovered, deleted my old HA deployment and created a new one.

(In reply to Antonin Pagac from comment #3)
> Hit this with QCI-1.1-RHEL-7-20170120.t.0 while re-deploying OCP HA.
>
> I deleted all hosts from my Satellite, rebooted the HW machines to have them
> re-discovered, deleted my old HA deployment and created a new one.

From the logs, it seems like some of your hosts have been previously provisioned by the atomic-openshift-installer. Now that BZ 1412784 (Satellite registration) seems to be resolved, are you able to start from the beginning and make it all the way through the deploy process?

Created attachment 1247982 [details]
20170203.t.0 ansible.log
Derek,

I reproduced the issue with 20170203.t.0. Installed the ISO and kicked off an OCP HA deployment. It failed at 70%; attaching ansible.log.

The task is not resumable, so I'm going to delete the deployment and try again.

Reproduced also when redeploying.

After some investigation I am unsure what is occurring in this deployment. All of the logs point to a scenario where the installer finished and someone is rerunning it without the --force option, so it is not installing from scratch. This is sometimes possible when resuming a deployment after a failed install, but it shouldn't be possible from a greenfield install. Antonin, are the machines being used as OCP hosts baremetal?

Created attachment 1248541 [details]
ansible log of 2nd deployment of ansible
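The diagnosis above can be illustrated with a small sketch. This is a hypothetical model of the guard implied by the error message, not the actual atomic-openshift-installer source: the run aborts when every target host already looks installed and no additional nodes were specified, unless a force flag is given.

```python
# Hypothetical model of the guard behind the error message
# "Installed environment detected and no additional nodes specified:
# aborting." -- not the real installer code, just the reported logic.

def should_abort(hosts, force=False):
    """Abort when an installed environment is detected and no
    additional (not-yet-installed) nodes were specified."""
    installed = [h for h in hosts if h.get("installed")]
    new_nodes = [h for h in hosts if not h.get("installed")]
    return bool(installed) and not new_nodes and not force

# A redeploy onto previously provisioned baremetal looks like this:
hosts = [{"name": "master1.example.com", "installed": True},
         {"name": "node1.example.com", "installed": True}]
print(should_abort(hosts))               # True: the run aborts
print(should_abort(hosts, force=True))   # False: a forced rerun proceeds
```

This matches the observed pattern: re-discovered baremetal hosts still carry installer state from the previous deployment, so a greenfield run is treated as a rerun.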
I just hit this on a baremetal deployment of OCP HA. It made it all the way to 70% without any intervention. The 3 previous deployments didn't hit this issue.
QCI Media Version: QCI-1.1-RHEL-7-20170203.0
Hi Dylan, yes, I'm running OCP HA deployments on a baremetal setup. I hit this again when deploying OCP HA as a second deployment, after having deleted the first one, which was OCP non-HA and CFME. I'm going to try once again with a clean ISO install.

How often does this issue show up? Do we have clear reproducer steps so we can recreate the issue?

Hi John, the issue seems to appear every time after I run a deployment using re-discovered baremetal machines. This means that I do a deployment (OCP HA or non-HA), wait until it errors out or succeeds, then delete it, delete all the hosts from Satellite (all content hosts, all discovered hosts, everything), then restart all the baremetal hosts and let them PXE boot again and be discovered by Satellite. After this, I do a deployment of OCP HA and it fails at 70%.

It also seems to appear intermittently when deploying on a freshly installed Satellite, doing a first deployment but using baremetal machines that have been used in the past to deploy OCP HA. I'm running a deployment right now to verify.

Reproducer steps would be:

1. Have enough (multiple) baremetal machines to deploy OCP HA
2. Install the latest compose from ISO on baremetal, run fusor-installer
3. Restart all the baremetal machines; they boot from PXE and get discovered by Satellite
4. Deploy OCP (HA or non-HA), wait for it to finish (either successfully or not)
5. Delete the deployment via the Satellite UI
6. Remove all hosts from Satellite (via the web UI or using hammer)
7. Repeat steps 3 and 4
8. The error appears at 70% of the OCP HA deployment

With QCI-1.1-RHEL-7-20170207.t.0, a clean install of Satellite and OCP HA as the first deployment, I did not hit this issue. My deployment got past 70%.

We are deferring this issue to 1.2 as the 1.1 testing cycle is nearing an end. Comment #11 has reproducer steps.

Did not hit the issue with QCI-1.1-RHEL-7-20170208.t.0, a fresh install of Satellite and OCP HA as a first deployment.
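The host cleanup in step 6 of the reproducer can be scripted. A minimal sketch follows: `hammer host delete --id` is a real hammer subcommand, but the dry-run wrapper and the example ids are illustrative assumptions (discovered hosts would need the corresponding discovery subcommand as well).

```python
# Sketch of step 6: remove hosts from Satellite with hammer before
# letting the machines PXE boot and be re-discovered.
# "hammer host delete --id" is a real subcommand; the dry_run wrapper
# and the ids below are illustrative only.
import subprocess

def delete_all_hosts(host_ids, dry_run=True):
    """Build (and optionally run) one hammer delete per host id."""
    commands = [["hammer", "host", "delete", "--id", str(i)]
                for i in host_ids]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))       # show what would run
        else:
            subprocess.run(cmd, check=True)
    return commands

delete_all_hosts([101, 102, 103])      # dry run: prints three commands
```

Driving the cleanup from a script makes the reproducer deterministic: the same set of hosts is removed before every re-discovery cycle.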
We believe this issue is fixed in OCP 3.4. The BZ describing the issue is: https://bugzilla.redhat.com/show_bug.cgi?id=1416509. The upgrade to 3.4 was in QCI-1.1-RHEL-7-20170210.t.0.

I got a successful deployment without any intervention using ISO QCI-1.1-RHEL-7-20170216.0. OCP related packages are at version 3.4. Marking as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335