1546504 – Deployments fail for no reason. You just resume them and they complete successfully the 2nd or 3rd time.

Keywords:
Status:	CLOSED DUPLICATE of bug 1559062
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-paunch
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	13.0 (Queens)
Assignee:	Steve Baker
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-02-18 11:50 UTC by Udi Kalifon
Modified:	2018-04-06 14:14 UTC
CC List:	8 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-05 23:23:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1752036	None	None	None	2018-02-27 11:32:50 UTC
OpenStack gerrit	548230	None	None	None	2018-02-27 11:33:44 UTC
RDO	12664	None	None	None	2018-02-27 11:38:48 UTC

Description Udi Kalifon 2018-02-18 11:50:54 UTC

Description of problem:
Often, deployments fail with an error that looks something like this:

extceph.AllNodesDeploySteps.ControllerDeployment_Step4.0:
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: 12006ecd-2a62-4aa0-8a56-7ab3f2d481d9
status: CREATE_FAILED
status_reason: |
Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |
...
"stdout: 83a1fec9f2743743443cd90539616a713327034f7b86c316d90f43b1a7a2a169"
],
"changed": false,
"failed_when_result": true
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/78b02ba1-4880-4d82-8be1-47aeadd433da_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=7 changed=2 unreachable=0 failed=1

(truncated, view all with --long)
deploy_stderr: |

There appears to be no reason for this error: you can ignore it and simply resume the deployment without fixing anything or taking any other action. The deployment eventually succeeds, so it should also have succeeded on the 1st attempt.
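
For triage, the PLAY RECAP line in output like the above can be checked programmatically to see which host reported failed tasks. The following is only a minimal Python sketch and is not part of the original report: the name parse_recap_line is hypothetical, and it assumes the recap line has the "host : key=value ..." shape shown in the output above.

```python
import re

# Matches lines such as "localhost : ok=7 changed=2 unreachable=0 failed=1".
# Assumption: counters are whitespace-separated key=integer pairs.
RECAP_RE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap_line(line):
    """Return (host, counters) for a PLAY RECAP line, or None if it does not match."""
    match = RECAP_RE.match(line.strip())
    if match is None:
        return None
    counters = {}
    for pair in match.group("counters").split():
        key, value = pair.split("=")
        counters[key] = int(value)
    return match.group("host"), counters


host, counters = parse_recap_line(
    "localhost : ok=7 changed=2 unreachable=0 failed=1"
)
print(host, counters["failed"])  # prints: localhost 1
```

A non-zero "failed" counter only says that a task failed on that host; the reason still has to come from the task output above the recap.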

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-22.el7ost.noarch
openstack-heat-engine-9.0.1-3.el7ost.noarch
openstack-tripleo-ui-7.4.3-4.el7ost.noarch
openstack-tripleo-common-7.6.3-10.el7ost.noarch

How reproducible:
Very often.

Steps to Reproduce:
1. Use the attached templates. This deployment is for 3 controllers, 2 computes, and external Ceph storage. The included environments cover network isolation, SSL on the overcloud, and containers.
2. I started the deployment from the GUI but I'm sure you can also recreate the problem without the GUI.
3. When deployment failed on step 4, I saved the failures list and the sosreports from the controller that failed - and resumed the deployment without really fixing anything. To resume the deployment I used "openstack overcloud plan deploy extceph" (extceph is the name of the plan).
4. Deployment failed on step 5 this time, and I took the same logs again and resumed the deployment once more.

Actual results:
Deployment eventually passes. The failures along the way seem to have been for no reason, because I didn't fix any configuration or take any real action.

Expected results:
Deployment should have passed on the 1st attempt, without requiring several resumes.

Additional info:
There were 2 additional failures for which I didn't capture the logs. The first happened shortly after the beginning of the deployment, when I got a "no valid host" error and one of the controllers was in ERROR state. There are no sosreport plugins installed on the undercloud, and I only found an IPMI error in one of the logs. That error did not reproduce manually, so I ignored this issue as well and resumed the deployment. After that I got the errors in steps 4 and 5 as described in the bug, and then there was one more failure, caused by the undercloud not being pingable on vlan10. That was the only "real" issue; I resumed the deployment after fixing it and got a successful deployment. The overcloud has a problem with the storage configuration and I can't create images, but that still requires investigation.

Comment 2 Steve Baker 2018-02-18 21:31:08 UTC

It looks like you're seeing intermittent failures while pulling images directly from registry.access.redhat.com, which is why the documentation recommends pulling from the local registry[1]. 

Also, for any stack failure, more helpful feedback is likely available by running the failures command with the --long argument:

  openstack stack failures list --long overcloud

I'm going to close this for now, feel free to reopen if you're seeing the same issues when pulling from the undercloud registry.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/Configuring-Registry_Details#Configuring-Registry_Details-Local
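
Because the pull failures are described as intermittent, a retry around the pull illustrates why a plain resume can succeed without any other change. The following Python sketch is an illustration only, not the fix tracked in this bug or in the linked reviews. The function pull_with_retry and its parameters are hypothetical, and it assumes that "docker pull <image>" is the pull command and that it exits non-zero on failure.

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Pull `image`, retrying when the pull exits with a non-zero code.

    Returns the 1-based number of the attempt that succeeded.
    `run` and `sleep` are injectable (hypothetical parameters) so the
    retry logic can be exercised without a Docker daemon.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
    raise RuntimeError(f"pulling {image} failed after {attempts} attempts")
```

Retrying hides transient registry errors but not persistent ones, so the final failure is still raised once the attempts are used up.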

Comment 3 Steve Baker 2018-02-27 11:30:20 UTC

Reopening, I've heard reports of failures when pulling from the undercloud registry.

Comment 10 Steve Baker 2018-04-05 23:23:11 UTC


*** This bug has been marked as a duplicate of bug 1559062 ***
