Bug 1546504

Summary:	Deployments fail for no reason. You just resume them and they complete successfully the 2nd or 3rd time.
Product:	Red Hat OpenStack	Reporter:	Udi Kalifon <ukalifon>
Component:	python-paunch	Assignee:	Steve Baker <sbaker>
Status:	CLOSED DUPLICATE	QA Contact:	nlevinki <nlevinki>
Severity:	high	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	aschultz, dbecker, emacchi, mburns, morazi, rhel-osp-director-maint, sbaker, sclewis
Target Milestone:	rc	Keywords:	Reopened, Triaged
Target Release:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2018-04-05 23:23:11 UTC	Type:	Bug

Description Udi Kalifon 2018-02-18 11:50:54 UTC

Description of problem:
Often, deployments fail with an error that looks something like this:

extceph.AllNodesDeploySteps.ControllerDeployment_Step4.0:
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: 12006ecd-2a62-4aa0-8a56-7ab3f2d481d9
status: CREATE_FAILED
status_reason: |
Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |
...
"stdout: 83a1fec9f2743743443cd90539616a713327034f7b86c316d90f43b1a7a2a169"
],
"changed": false,
"failed_when_result": true
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/78b02ba1-4880-4d82-8be1-47aeadd433da_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=7 changed=2 unreachable=0 failed=1

(truncated, view all with --long)
deploy_stderr: |

There appears to be no real cause for this error: you can ignore it and resume the deployment without fixing anything or taking any other action. The deployment eventually succeeds, so it should also have succeeded on the 1st attempt.
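As a rough illustration of how the failure shows up in the output above, the PLAY RECAP line can be parsed to see whether a host reported failed tasks. This is only a minimal sketch and not part of TripleO, Heat, or Ansible: the names parse_recap_line and host_failed are hypothetical, and the regular expression assumes the recap format shown in this report.

```python
import re

# Hypothetical helper (not an Ansible API): matches one PLAY RECAP line such as
# "localhost : ok=7 changed=2 unreachable=0 failed=1".
RECAP_RE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap_line(line):
    """Return (host, counters) for a recap line, or None if it does not match."""
    match = RECAP_RE.match(line)
    if match is None:
        return None
    counters = {}
    for pair in match.group("counters").split():
        key, value = pair.split("=")
        counters[key] = int(value)
    return match.group("host"), counters


def host_failed(line):
    """True if the recap line reports failed tasks or an unreachable host."""
    parsed = parse_recap_line(line)
    if parsed is None:
        return False
    _, counters = parsed
    return counters.get("failed", 0) > 0 or counters.get("unreachable", 0) > 0
```

For the recap line in this report, host_failed returns True because failed=1, even though the same playbook passes when the deployment is resumed.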

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-22.el7ost.noarch
openstack-heat-engine-9.0.1-3.el7ost.noarch
openstack-tripleo-ui-7.4.3-4.el7ost.noarch
openstack-tripleo-common-7.6.3-10.el7ost.noarch

How reproducible:
Very often.

Steps to Reproduce:
1. Use the attached templates. This deployment is for 3 controllers, 2 computes and external Ceph storage. The included environments cover network isolation, SSL on the overcloud, and containers.
2. I started the deployment from the GUI, but I'm sure you can also reproduce the problem without the GUI.
3. When the deployment failed on step 4, I saved the failures list and the sosreports from the controller that failed, then resumed the deployment without fixing anything. To resume the deployment I used "openstack overcloud plan deploy extceph" (extceph is the name of the plan).
4. The deployment failed on step 5 this time, so I collected the same logs again and resumed the deployment once more.

Actual results:
The deployment eventually passes. The failures along the way seem to have happened for no reason, because I didn't fix any configuration or take any real action.

Expected results:
Deployment should have passed on the 1st attempt, without requiring several resumes.

Additional info:
There were 2 additional failures for which I didn't capture the logs. The first happened shortly after the deployment began: I got a "no valid host" error and one of the controllers was in ERROR state. There are no sosreport plugins installed on the undercloud, and I only found an IPMI error in one of the logs. That error did not reproduce manually, so I ignored this issue as well and resumed the deployment. After that I got the errors in steps 4 and 5 as described in the bug, followed by one more failure caused by the undercloud not being pingable on vlan10. That was the only "real" issue; I fixed it, resumed the deployment, and got a successful deployment. The overcloud still has a problem with the storage configuration and I can't create images, but that requires further investigation.

Comment 2 Steve Baker 2018-02-18 21:31:08 UTC

It looks like you're seeing intermittent failures while pulling images directly from registry.access.redhat.com, which is why the documentation recommends pulling from the local registry[1]. 

Also, for any stack failure, you will likely get more helpful feedback by running the failures command with the --long argument:

  openstack stack failures list --long overcloud

I'm going to close this for now; feel free to reopen if you're seeing the same issues when pulling from the undercloud registry.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/Configuring-Registry_Details#Configuring-Registry_Details-Local
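For intermittent failures like the image pulls described above, one generic mitigation is to retry the operation with an increasing delay. The following is a minimal sketch of that idea only; it is not how paunch or TripleO handle pulls, and the retry function and its parameters are hypothetical names chosen for illustration.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the original error to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff before the next try.
            sleep(base_delay * (2 ** attempt))
```

Here operation would wrap whatever performs the pull and raise an exception on failure. The sleep parameter is injectable so the backoff can be exercised without actually waiting.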

Comment 3 Steve Baker 2018-02-27 11:30:20 UTC

Reopening: I've heard reports of failures when pulling from the undercloud registry.

Comment 10 Steve Baker 2018-04-05 23:23:11 UTC


*** This bug has been marked as a duplicate of bug 1559062 ***