Description of problem:
Deployment in a 3+1 environment fails with:

  Failed to call refresh: nova-manage cell_v2 discover_hosts returned 1 instead of one of [0]

This resembles https://bugs.launchpad.net/nova/+bug/1656276.

It happens in a virtual lab with 3 controllers and 1 compute node; a deployment with 1 controller and 1 compute node is fine. I can run step 3 of the .pp file manually later on without problems.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
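For context on what that error message means: puppet's Exec resource treats any exit status outside its "returns" list (default [0]) as a failure, and a refresh that hits this produces exactly the message above. A minimal sketch of that check, simulated with true/false rather than the real nova-manage call:

```shell
# Sketch of how puppet's Exec judges a refreshed command: the exit status
# must be in the "returns" list (here just 0); anything else fails.
# Simulated with /bin/true and /bin/false, not the real nova-manage call.
check_returns() {
    "$@" && rc=0 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "refresh ok (rc=$rc)"
    else
        echo "Failed to call refresh: $1 returned $rc instead of one of [0]"
    fi
}

check_returns true
check_returns false
```

So the "returned 1 instead of one of [0]" text only tells us the command exited non-zero; the actual cause has to come from the command's own logs.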
Created attachment 1283314 [details] journalctl controller 0
(...)
| ControllerDeployment_Step3 | c22047f9-e5ce-48fb-824e-e71a207bbc98 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T19:48:56Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |
| 0 | f42baa55-4929-495c-80ac-67cc2a89ba03 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T20:03:16Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step3-c2lmoz7zwosr |
(...)

In addition, I think I had a couple of other failed resources, but I'm currently running a stack update to see whether this resolves itself on a second run.
I re-ran openstack overcloud deploy to trigger a stack update, and this time it fails at step 4:

[stack@undercloud-8 ~]$ heat resource-list -n5 overcloud | grep FAIL
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps | b544aca6-eb0c-42f7-81e4-8fe40a60c53f | OS::TripleO::PostDeploySteps | UPDATE_FAILED | 2017-05-29T21:23:29Z | overcloud |
| 0 | 68eb39cd-3f07-4e84-9041-9fcf061373e1 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| 1 | 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| 2 | f152f422-4e6e-4373-aa35-44d14d04cc8b | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| ComputeDeployment_Step4 | 27300316-7183-40c8-821b-cc01ad0660c9 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |
| ControllerDeployment_Step4 | a378a932-da96-4fee-ac55-a76b3cde3ef6 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |

[stack@undercloud-8 ~]$ heat deployment-show 68eb39cd-3f07-4e84-9041-9fcf061373e1 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Command exceeded timeout

[stack@undercloud-8 ~]$ heat deployment-show 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!
[stack@undercloud-8 ~]$ heat deployment-show f152f422-4e6e-4373-aa35-44d14d04cc8b | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!

[stack@undercloud-8 ~]$ heat deployment-show 27300316-7183-40c8-821b-cc01ad0660c9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: 27300316-7183-40c8-821b-cc01ad0660c9

[stack@undercloud-8 ~]$ heat deployment-show a378a932-da96-4fee-ac55-a76b3cde3ef6 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: a378a932-da96-4fee-ac55-a76b3cde3ef6

[stack@undercloud-8 ~]$
On the next attempt, it fails at step 5:

WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps | b544aca6-eb0c-42f7-81e4-8fe40a60c53f | OS::TripleO::PostDeploySteps | UPDATE_FAILED | 2017-05-29T21:58:53Z | overcloud |
| 0 | 46faf711-7f16-4a29-8232-6d9c798c6a13 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step5-ozqcxvc7ip3a |
| ControllerDeployment_Step5 | e72c3e21-44b9-451e-9b23-a88faed63ef9 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27
Finally, after the next update, it goes through:

2017-05-29 23:02:42Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2017-05-29 23:02:52Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: http://10.0.0.6:5000/v2.0
Overcloud Deployed
[stack@undercloud-8 ~]$
The journalctl log is nearly unreadable, but I did snag this out as the reason the cell_v2 command is failing:

DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'cn.uuid' in 'field list'")

That's a legitimate reason to fail, so it's not a bug in that command AFAICT. I also see timeouts running other things, like neutron's db sync, so I suspect something systemic and related to the DB.
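For anyone digging through the same logs: the captured output stores newlines as literal \n sequences, so unescaping them before grepping makes errors like this one findable. A self-contained sketch on a simulated fragment (sample text only, not the actual journal):

```shell
# Simulated log fragment with literal "\n" sequences, as heat stores them;
# unescape with sed (GNU sed turns \n in the replacement into a newline),
# then grep for the error line.
sample='nova-manage output:\nDBError: (pymysql.err.InternalError) (1054, u"Unknown column '\''cn.uuid'\'' in '\''field list'\''")\ndone'
printf '%s\n' "$sample" | sed 's/\\n/\n/g' | grep -i 'dberror'
```

This is the same sed trick used on the heat deployment-show output above, just applied to a canned string so the pipeline can be tried anywhere.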
"as the reason the cell_v2 command is failing. That's a legit reason to fail, so it's not a bug with that command AFAICT. I see other timeouts running things like neutron's db sync, so I kinda suspect something systemic and related to the DB." Am I the only one so far having / reporting this issue with OSP 11 deployment and 3 controllers? Because in my virtual env, I can consistently reproduce this on every redeploy of my environment, so I could likely provide an env for analysis.
In this virtual env, do all of the VMs share the same physical disk?
Yes, it's a virtual environment, and the VMs use qcow2 images on the same physical disk. The exact same lab environments work with OSP 7 through 10. OSP 11 works with a 1 controller + 2 compute deployment, but not with 3 controllers + 1 compute.
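If shared-spindle contention is the suspect, a quick synchronous-write probe run inside each controller VM would make it visible. A rough sketch, with arbitrary block size/count and assuming GNU dd (for oflag=dsync):

```shell
# Force each 64 KiB block to disk before the next write, so the reported
# throughput reflects synchronous write latency rather than cache speed.
# On a disk shared by four busy VMs that figure collapses, which would be
# consistent with db syncs blowing past puppet's exec timeout.
dd if=/dev/zero of=ioprobe.tmp bs=64k count=16 oflag=dsync 2>&1 | tail -n 1
rm -f ioprobe.tmp
```

Running the probe on all controllers at once approximates what happens when all three run their db syncs in parallel during a deploy step.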
Try this: https://review.openstack.org/463495 Or buy an SSD :-) They're pretty cheap now.
*** This bug has been marked as a duplicate of bug 1434279 ***