Bug 1411571 - RHV+OSP+CFME+OCP Deployment failed: Went to status ERROR due to "Message: Unknown, Code: Unknown"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - RHELOSP
Version: 1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 1.1
Assignee: Jason Montleon
QA Contact: Landon LaSmith
Docs Contact: Dan Macpherson
URL:
Whiteboard:
Depends On: 1353464
Blocks:
 
Reported: 2017-01-10 01:28 UTC by Landon LaSmith
Modified: 2017-02-07 21:55 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-07 21:55:01 UTC
Target Upstream Version:


Attachments (Terms of Use)
Log from the deployment (1.70 MB, application/x-gzip)
2017-01-10 01:28 UTC, Landon LaSmith

Description Landon LaSmith 2017-01-10 01:28:17 UTC
Created attachment 1238948 [details]
Log from the deployment

Description of problem: During an all-in-one API deployment of RHV+OSP+CFME+OCP, the OSP deployment failed at 30% with the message:

ERROR: deployment failed with status: CREATE_FAILED and reason: Resource CREATE failed: ResourceInError: resources.Controller.resources[0].resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown"


QCI Media Version: QCI-1.1-RHEL-7-20170106.t.0
QCIOOO Media Version: QCIOOO-10.0-RHEL-7-20170104.t.0

How reproducible: First occurrence

Steps to Reproduce:
1. Install QCI & QCIOOO from iso
2. Provision resources for RHV and OSP 
3. Create and start deployment of RHV+OSP+CFME+OCP

Actual results: Deployment fails during task Actions::Fusor::Deployment::OpenStack::Deploy

Expected results: OSP deployment succeeds

Comment 2 Landon LaSmith 2017-01-10 22:10:01 UTC
As a test, I attempted to deploy OSP+CFME, since the RHV deployment had succeeded, but it failed with a different error, which was reported in BZ1411935.

Comment 3 Jason Montleon 2017-01-12 16:23:38 UTC
It's possible this is a duplicate of BZ1411935 even though the error is a bit different. I've seen it succeed, fail with the error in BZ1411935, and fail with other errors entirely. Which follow-up commands in the script fail, if any, comes down to timing.

Please retest after https://github.com/fusor/egon/pull/92 makes it into a compose.

Comment 5 Landon LaSmith 2017-01-17 17:08:53 UTC
I've also encountered the error below on an OSP+CFME API deployment with 1 controller and 2 compute nodes. OSP+CFME deployments have succeeded with the same iso. I think both errors are symptoms of the same unknown issue; which error is reported just depends on whether the Compute or the Controller stack resource fails first.

ERROR: deployment failed with status: CREATE_FAILED and reason: Resource CREATE failed: ResourceInError: resources.Compute.resources[1].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

QCI Media Version: QCI-1.1-RHEL-7-20170116.t.0
QCIOOO Media Version: QCIOOO-10.0-RHEL-7-20170113.t.0

Comment 6 Jason Montleon 2017-01-17 18:31:45 UTC
The reason isn't unknown. It's "Message: No valid host was found. There are not enough hosts available., Code: 500"

Your hosts are locked up, possibly from partial deletion of an old deployment or something else going wrong.

What do ironic node-list and ironic node-show <id> show for each node?

Comment 7 Landon LaSmith 2017-01-17 20:20:54 UTC
(In reply to Jason Montleon from comment #6)
> The reason isn't unknown. It's "Message: No valid host was found. There are
> not enough hosts available., Code: 500"
> 
> Your hosts are locked up, possibly from partial deletion of an old
> deployment or something else going wrong.
> 
> What does ironic node-list and ironic node-show <id> for each show?

Comment 5 was from a clean environment and a fresh deployment of OSP+CFME, with no previous stack deployment. The environment is no longer available, but I recall that the controller and one of the two compute nodes were in an ERROR state in ironic node-list. The other compute node was powered on and active.

Comment 13 Jason Montleon 2017-01-31 14:44:44 UTC
This is probably https://bugzilla.redhat.com/show_bug.cgi?id=1353464

Maybe try the workaround suggested there: set max_concurrent_builds to 2, or even 1.

crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2; openstack-service restart nova

It's probably caused by load induced from doing more builds than the director can keep up with in the virt environment.
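For illustration only (not part of the original bug discussion): the crudini invocation above rewrites one key in the [DEFAULT] section of nova.conf. The same edit can be sketched with Python's stdlib configparser, run here against a local scratch copy rather than the live undercloud file:

```python
# Sketch of the edit crudini performs in:
#   crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2
# Uses only the stdlib; the path is assumed to be a local copy of nova.conf,
# not the undercloud's own file. nova services still need a restart afterwards.
import configparser


def set_max_concurrent_builds(conf_path, value=2):
    """Write max_concurrent_builds into the [DEFAULT] section of a nova.conf copy."""
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    cfg["DEFAULT"]["max_concurrent_builds"] = str(value)
    with open(conf_path, "w") as fh:
        cfg.write(fh)


if __name__ == "__main__":
    # Build a minimal scratch nova.conf, apply the edit, and read it back.
    with open("nova.conf", "w") as fh:
        fh.write("[DEFAULT]\ndebug = False\n")
    set_max_concurrent_builds("nova.conf", 2)
    check = configparser.ConfigParser()
    check.read("nova.conf")
    print(check["DEFAULT"]["max_concurrent_builds"])  # prints 2
```

On a real undercloud the crudini one-liner from comment 13 (with the systemctl restart from comment 14) remains the supported route; this sketch only shows what the config change amounts to.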

Comment 14 Landon LaSmith 2017-02-07 15:38:52 UTC
(In reply to Jason Montleon from comment #13)
> This is probably https://bugzilla.redhat.com/show_bug.cgi?id=1353464
> 
> Maybe try the workaround suggested there and either set the max concurrent
> builds to 2 or even 1.
> 
> crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2;
> openstack-service restart nova
> 
> It's probably caused by load induced from doing more builds than the
> director can keep up with in the virt environment.

openstack-service isn't available as part of the QCIOOO iso install for OSP 10, so you can replace it with the equivalent systemctl command:

crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2; systemctl restart openstack-nova-api openstack-nova-scheduler

See: https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/paged/director-installation-and-usage/chapter-9-troubleshooting-director-issues (Section 9.8 Tuning the undercloud)

Comment 15 Landon LaSmith 2017-02-07 21:25:37 UTC
I haven't seen any recurrence of this issue when setting max_concurrent_builds manually. QE is updating automated job runs to include this setting in all OSP runs for more data points.

Comment 16 Sudhir Mallamprabhakara 2017-02-07 21:55:01 UTC
Closing the bug. Will re-open if needed.

- Sudhir

