Bug 1411571

Summary: RHV+OSP+CFME+OCP Deployment failed: Went to status ERROR due to "Message: Unknown, Code: Unknown"
Product: Red Hat Quickstart Cloud Installer
Reporter: Landon LaSmith <llasmith>
Component: Installation - RHELOSP
Assignee: Jason Montleon <jmontleo>
Status: CLOSED NOTABUG
QA Contact: Landon LaSmith <llasmith>
Severity: unspecified
Docs Contact: Dan Macpherson <dmacpher>
Priority: unspecified
Version: 1.1
CC: bthurber, jmatthew, llasmith, qci-bugzillas, smallamp
Keywords: Triaged
Target Release: 1.1
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-02-07 21:55:01 UTC
Type: Bug
Bug Depends On: 1353464
Attachments: Log from the deployment (flags: none)

Description Landon LaSmith 2017-01-10 01:28:17 UTC

Created attachment 1238948: Log from the deployment

Description of problem: During an all-in-one API deployment of RHV+OSP+CFME+OCP, the OSP deployment failed at 30% with the message:

ERROR: deployment failed with status: CREATE_FAILED and reason: Resource CREATE failed: ResourceInError: resources.Controller.resources[0].resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown"


QCI Media Version: QCI-1.1-RHEL-7-20170106.t.0
QCIOOO Media Version: QCIOOO-10.0-RHEL-7-20170104.t.0

How reproducible: First occurrence

Steps to Reproduce:
1. Install QCI and QCIOOO from ISO
2. Provision resources for RHV and OSP
3. Create and start a deployment of RHV+OSP+CFME+OCP

Actual results: Deployment fails during task Actions::Fusor::Deployment::OpenStack::Deploy

Expected results: OSP deployment succeeds

Comment 2 Landon LaSmith 2017-01-10 22:10:01 UTC

As a test, I attempted an OSP+CFME deployment, since the RHV deployment had succeeded. It failed with a different error, which is reported in BZ1411935.

Comment 3 Jason Montleon 2017-01-12 16:23:38 UTC

It's possible this is a duplicate of BZ1411935 even though the error is a bit different. I've seen it succeed, fail with the error in BZ1411935, and fail with yet other errors. Which follow-up commands in the script fail, if any, comes down to timing.

Please retest after https://github.com/fusor/egon/pull/92 makes it into a compose.

Comment 5 Landon LaSmith 2017-01-17 17:08:53 UTC

I've also encountered the error below on an OSP+CFME API deployment with 1 controller and 2 compute nodes. OSP+CFME deployments have also succeeded with the same ISO. I think both errors are symptoms of the same unknown issue; which error is reported just depends on whether the Compute or the Controller stack resource fails first.

ERROR: deployment failed with status: CREATE_FAILED and reason: Resource CREATE failed: ResourceInError: resources.Compute.resources[1].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

QCI Media Version: QCI-1.1-RHEL-7-20170116.t.0
QCIOOO Media Version: QCIOOO-10.0-RHEL-7-20170113.t.0
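
To compare which stack resource failed first across runs, the failing resource path and its reason can be pulled out of the error lines quoted in this report. This is only a rough sketch: failed_resource is a hypothetical helper name, and the pattern assumes the exact "ResourceInError ... Went to status ERROR due to" wording shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of QCI or the director tooling).
# Reads deployment error lines on stdin and prints, for each line that
# matches, "<resource path> | <reason>".
# Assumption: the message keeps the wording seen in this report.
failed_resource() {
    sed -n 's/.*ResourceInError: \(resources[^:]*\): Went to status ERROR due to "\(.*\)".*$/\1 | \2/p'
}

# Example (commented out so that sourcing this file has no side effects):
#   failed_resource < deployment.log
```

Lines that do not match the pattern are dropped, so the helper can be run over a whole log file.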

Comment 6 Jason Montleon 2017-01-17 18:31:45 UTC

The reason isn't unknown. It's "Message: No valid host was found. There are not enough hosts available., Code: 500"

Your hosts are locked up, possibly because an old deployment was only partially deleted or something else went wrong.

What do ironic node-list and ironic node-show <id> show for each node?
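
One way to gather that output in a single pass is sketched below. It assumes the commands are run on the undercloud with the stackrc credentials already sourced, and that ironic node-list prints its default ASCII table with the UUID in the first column. The function names node_uuids and collect_node_info are made up for illustration.

```shell
#!/bin/sh
# Print the UUID column of an `ironic node-list` table read from stdin.
# Assumption: default table layout, UUID in the first column.
node_uuids() {
    awk -F'|' 'NF > 2 { gsub(/ /, "", $2); if ($2 != "" && $2 != "UUID") print $2 }'
}

# Print the node list followed by `ironic node-show` for every node.
collect_node_info() {
    list=$(ironic node-list) || return 1
    printf '%s\n' "$list"
    for id in $(printf '%s\n' "$list" | node_uuids); do
        ironic node-show "$id"
    done
}

# Example (commented out so that sourcing this file has no side effects):
#   collect_node_info > ironic-nodes.txt
```

The resulting file can be attached to the bug so the provisioning and power state of every node is visible in one place.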

Comment 7 Landon LaSmith 2017-01-17 20:20:54 UTC

(In reply to Jason Montleon from comment #6)
> The reason isn't unknown. It's "Message: No valid host was found. There are
> not enough hosts available., Code: 500"
> 
> Your hosts are locked up, possibly from only partial deletion of an old
> deployment or something else going wrong.
> 
> What does ironic node-list and ironic node-show <id> for each show?

Comment 5 was from a clean environment: a fresh deployment of OSP+CFME with no previous stack deployment. The environment is no longer available, but I think ironic node-list showed the controller and 1 of the 2 compute nodes in an ERROR state. The other compute node was powered on and active.

Comment 13 Jason Montleon 2017-01-31 14:44:44 UTC

This is probably https://bugzilla.redhat.com/show_bug.cgi?id=1353464

Maybe try the workaround suggested there and set max_concurrent_builds to 2, or even 1.

crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2; openstack-service restart nova

It's probably caused by the load from running more builds at once than the director can keep up with in the virt environment.

Comment 14 Landon LaSmith 2017-02-07 15:38:52 UTC

(In reply to Jason Montleon from comment #13)
> This is probably https://bugzilla.redhat.com/show_bug.cgi?id=1353464
> 
> Maybe try the workaround suggested there and either set the max concurrent
> builds to 2 or even 1.
> 
> crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2;
> openstack-service restart nova
> 
> It's probably caused by load induced from doing more builds than the
> director can keep up with in the virt environment.

openstack-service isn't available as part of the QCIOOO ISO install for OSP 10, so you can replace it with the equivalent systemctl command:

crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2; systemctl restart openstack-nova-api openstack-nova-scheduler

See: https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/paged/director-installation-and-usage/chapter-9-troubleshooting-director-issues (Section 9.8 Tuning the undercloud)
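
The two restart variants from this thread can be folded into one script that works whether or not openstack-service is installed. This is a minimal sketch, not something shipped with QCI: the function names are hypothetical, and the service names and the default value of 2 are taken from the comments above.

```shell
#!/bin/sh
# Print the command that restarts the Nova services, preferring
# openstack-service when it is on PATH and falling back to systemctl.
nova_restart_cmd() {
    if command -v openstack-service >/dev/null 2>&1; then
        echo "openstack-service restart nova"
    else
        echo "systemctl restart openstack-nova-api openstack-nova-scheduler"
    fi
}

# Set max_concurrent_builds (default 2, as suggested in comment 13) and
# restart Nova. Must be run as root on the undercloud.
apply_workaround() {
    builds=${1:-2}
    crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds "$builds" &&
        $(nova_restart_cmd)
}

# Example (commented out so that sourcing this file has no side effects):
#   apply_workaround 2
```

Afterwards, crudini --get /etc/nova/nova.conf DEFAULT max_concurrent_builds should print the value that was set.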

Comment 15 Landon LaSmith 2017-02-07 21:25:37 UTC

I haven't seen any recurrence of this issue when setting max_concurrent_builds manually. QE is updating the automated job runs to include this setting in all OSP runs for more data points.

Comment 16 Sudhir Mallamprabhakara 2017-02-07 21:55:01 UTC

Closing the bug. Will re-open if needed.

- Sudhir