Bug 1546504
| Summary: | Deployments fail for no reason. You just resume them and they complete successfully the 2nd or 3rd time. | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Kalifon <ukalifon> |
| Component: | python-paunch | Assignee: | Steve Baker <sbaker> |
| Status: | CLOSED DUPLICATE | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 13.0 (Queens) | CC: | aschultz, dbecker, emacchi, mburns, morazi, rhel-osp-director-maint, sbaker, sclewis |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | 13.0 (Queens) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-04-05 23:23:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
It looks like you're seeing intermittent failures while pulling images directly from registry.access.redhat.com, which is why the documentation recommends pulling from the local registry[1]. Also for any stack failure there is likely more helpful feedback available by running the failures command with the --long argument: openstack stack failures list --long overcloud I'm going to close this for now, feel free to reopen if you're seeing the same issues when pulling from the undercloud registry. [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/Configuring-Registry_Details#Configuring-Registry_Details-Local Reopening, I've heard reports of failures when pulling from the undercloud registry. *** This bug has been marked as a duplicate of bug 1559062 *** |
Description of problem: Often, deployments fail with an error that looks something like this: extceph.AllNodesDeploySteps.ControllerDeployment_Step4.0: resource_type: OS::Heat::StructuredDeployment physical_resource_id: 12006ecd-2a62-4aa0-8a56-7ab3f2d481d9 status: CREATE_FAILED status_reason: | Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2 deploy_stdout: | ... "stdout: 83a1fec9f2743743443cd90539616a713327034f7b86c316d90f43b1a7a2a169" ], "changed": false, "failed_when_result": true } to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/78b02ba1-4880-4d82-8be1-47aeadd433da_playbook.retry PLAY RECAP ********************************************************************* localhost : ok=7 changed=2 unreachable=0 failed=1 (truncated, view all with --long) deploy_stderr: | It appears that there is no reason for this error to happen, because you can ignore the error and just resume the deployment without fixing anything or taking any action. Deployment eventually succeeds, so it also should have succeeded on the 1st attempt. Version-Release number of selected component (if applicable): openstack-tripleo-heat-templates-7.0.3-22.el7ost.noarch openstack-heat-engine-9.0.1-3.el7ost.noarch openstack-heat-engine-9.0.1-3.el7ost.noarch openstack-tripleo-ui-7.4.3-4.el7ost.noarch openstack-tripleo-common-7.6.3-10.el7ost.noarch How reproducible: Very often. Steps to Reproduce: 1. Use the attached templates. This deployment is for 3 controllers, 2 computes and external ceph storage. Included environments include network isolation, ssl on the overcloud, and containers. 2. I started the deployment from the GUI but I'm sure you can also recreate the problem without the GUI. 3. When deployment failed on step 4, I saved the failures list and the sosreports from the controller that failed - and resumed the deployment without really fixing anything. To resume the deployment I used "openstack overcloud plan deploy extceph" (extceph is the name of the plan). 4. Deployment failed on step 5 this time, and I took the same logs again and resumed the deployment once more. Actual results: Deployment eventually passes. The failures along the way seem to have been for no reason, because I didn't fix any configuration or took any real action. Expected results: Deployment should have passed on the 1st attempt, without requiring several resumes. Additional info: There were 2 additional failures for which I didn't capture the logs. The first happened shortly after the beginning of the deployment - where I got a "no valid host" error and one of the controllers was in ERROR state. There are no sosreport plugins installed on the undercloud and I only found an IPMI error in one of the logs, but this error did not reproduce manually so I ignored this issue as well and resumed the deployment... After that I got the errors in steps 4 and 5 as described in the bug, and after that there was an additional failure which was due to the undercloud not being pingable on vlan10. That was the only "real" issue and I resumed the deployment after fixing that and got a successful deployment. The overcloud has a problem with the storage configuration, and I can't create images, but that still requires investigation.