Bug 1456608 - Deployment in 3+1 environment fails with Failed to call refresh: nova-manage cell_v2 discover_hosts returned 1 instead of one of [0]
Summary: Deployment in 3+1 environment fails with Failed to call refresh: nova-manage ...
Keywords:
Status: CLOSED DUPLICATE of bug 1434279
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-29 21:27 UTC by Andreas Karis
Modified: 2017-06-02 14:26 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-02 14:10:51 UTC
Target Upstream Version:
Embargoed:


Attachments
journalctl controller 0 (3.70 MB, text/plain)
2017-05-29 21:27 UTC, Andreas Karis
no flags

Description Andreas Karis 2017-05-29 21:27:00 UTC
Description of problem:
Deployment in 3+1 environment fails with Failed to call refresh: nova-manage  cell_v2 discover_hosts returned 1 instead of one of [0]

This resembles:
https://bugs.launchpad.net/nova/+bug/1656276

This happens in a virtual lab with 3 controllers and 1 compute node. A deployment with 1 controller and 1 compute node works fine.

I can run step 3 of the .pp file later on without problems.
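
The failing refresh itself can be reproduced by hand on a controller with roughly the same command puppet runs (a sketch; the exact arguments puppet-nova passes may differ):

    sudo nova-manage cell_v2 discover_hosts --verbose

When it exits non-zero, the traceback it prints is the real error hidden behind the "returned 1 instead of one of [0]" message.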


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Andreas Karis 2017-05-29 21:27:30 UTC
Created attachment 1283314 [details]
journalctl controller 0

Comment 2 Andreas Karis 2017-05-29 21:30:16 UTC
(...)
| ControllerDeployment_Step3                   | c22047f9-e5ce-48fb-824e-e71a207bbc98                                                              | OS::Heat::StructuredDeploymentGroup                                                                                    | CREATE_FAILED      | 2017-05-29T19:48:56Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27                                                                                            |
| 0                                            | f42baa55-4929-495c-80ac-67cc2a89ba03                                                              | OS::Heat::StructuredDeployment                                                                                         | CREATE_FAILED      | 2017-05-29T20:03:16Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step3-c2lmoz7zwosr                                                    |
(...)

In addition, I think I had a couple of other failed resources. I'm currently running a stack update to see whether they resolve themselves on a second run.

Comment 3 Andreas Karis 2017-05-29 21:55:53 UTC
I re-ran the openstack overcloud deploy to trigger a stack update, and this time it fails at step 4:

[stack@undercloud-8 ~]$ heat resource-list -n5 overcloud | grep FAIL
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps                          | b544aca6-eb0c-42f7-81e4-8fe40a60c53f                                                              | OS::TripleO::PostDeploySteps                                                                                           | UPDATE_FAILED   | 2017-05-29T21:23:29Z | overcloud                                                                                                                             |
| 0                                            | 68eb39cd-3f07-4e84-9041-9fcf061373e1                                                              | OS::Heat::StructuredDeployment                                                                                         | CREATE_FAILED   | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj                                                    |
| 1                                            | 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9                                                              | OS::Heat::StructuredDeployment                                                                                         | CREATE_FAILED   | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj                                                    |
| 2                                            | f152f422-4e6e-4373-aa35-44d14d04cc8b                                                              | OS::Heat::StructuredDeployment                                                                                         | CREATE_FAILED   | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj                                                    |
| ComputeDeployment_Step4                      | 27300316-7183-40c8-821b-cc01ad0660c9                                                              | OS::Heat::StructuredDeploymentGroup                                                                                    | CREATE_FAILED   | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27                                                                                            |
| ControllerDeployment_Step4                   | a378a932-da96-4fee-ac55-a76b3cde3ef6                                                              | OS::Heat::StructuredDeploymentGroup                                                                                    | CREATE_FAILED   | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27                                                                                            |
[stack@undercloud-8 ~]$ heat deployment-show 68eb39cd-3f07-4e84-9041-9fcf061373e1 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Command exceeded timeout
[stack@undercloud-8 ~]$ heat deployment-show 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!
[stack@undercloud-8 ~]$ heat deployment-show f152f422-4e6e-4373-aa35-44d14d04cc8b | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!
[stack@undercloud-8 ~]$ heat deployment-show 27300316-7183-40c8-821b-cc01ad0660c9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: 27300316-7183-40c8-821b-cc01ad0660c9
[stack@undercloud-8 ~]$ heat deployment-show a378a932-da96-4fee-ac55-a76b3cde3ef6 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: a378a932-da96-4fee-ac55-a76b3cde3ef6
[stack@undercloud-8 ~]$
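
Side note for anyone retracing this: both failures above can be chased directly on the controllers. The neutron-db-sync refresh that timed out can be retried by hand, roughly (stock config paths assumed, adjust if your plugin config differs):

    sudo neutron-db-manage --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini upgrade heads

and the scheduler start failure should show its real cause in:

    sudo systemctl status openstack-nova-scheduler
    sudo journalctl -u openstack-nova-scheduler --no-pager -n 50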

Comment 4 Andreas Karis 2017-05-29 22:30:07 UTC
On the next attempt, it fails at step 5:

WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps                          | b544aca6-eb0c-42f7-81e4-8fe40a60c53f                                                              | OS::TripleO::PostDeploySteps                                                                                           | UPDATE_FAILED   | 2017-05-29T21:58:53Z | overcloud                                                                                                                             |
| 0                                            | 46faf711-7f16-4a29-8232-6d9c798c6a13                                                              | OS::Heat::StructuredDeployment                                                                                         | CREATE_FAILED   | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step5-ozqcxvc7ip3a                                                    |
| ControllerDeployment_Step5                   | e72c3e21-44b9-451e-9b23-a88faed63ef9                                                              | OS::Heat::StructuredDeploymentGroup                                                                                    | CREATE_FAILED   | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27

Comment 5 Andreas Karis 2017-05-29 23:19:09 UTC
Finally, after the next update, it goes through:

2017-05-29 23:02:42Z [AllNodesDeploySteps]: UPDATE_COMPLETE  state changed
2017-05-29 23:02:52Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully

 Stack overcloud UPDATE_COMPLETE 

Overcloud Endpoint: http://10.0.0.6:5000/v2.0
Overcloud Deployed
[stack@undercloud-8 ~]$

Comment 7 Dan Smith 2017-06-01 21:43:54 UTC
The journalctl log is nearly unreadable, but I did snag this out:

DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'cn.uuid' in 'field list'")

as the reason the cell_v2 command is failing. That's a legit reason to fail, so it's not a bug with that command AFAICT. I see other timeouts running things like neutron's db sync, so I kinda suspect something systemic and related to the DB.
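
If 'cn' is the usual alias for compute_nodes, that column comes from a nova schema migration, so a quick sanity check on a controller is whether the cell DB is fully migrated, e.g. (assuming the cell database is simply named nova and the root mysql client on the controller is set up):

    mysql nova -e "SHOW COLUMNS FROM compute_nodes LIKE 'uuid';"
    sudo nova-manage db sync
    sudo nova-manage api_db sync

If the column only appears after re-running the syncs, that points at an ordering/race problem between the db sync and discover_hosts during the deploy steps rather than at discover_hosts itself.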

Comment 8 Andreas Karis 2017-06-01 22:07:43 UTC
"as the reason the cell_v2 command is failing. That's a legit reason to fail, so it's not a bug with that command AFAICT. I see other timeouts running things like neutron's db sync, so I kinda suspect something systemic and related to the DB."

Am I the only one so far hitting / reporting this issue with an OSP 11 deployment and 3 controllers? In my virtual env I can reproduce it consistently on every redeploy, so I could likely provide an environment for analysis.

Comment 9 Ollie Walsh 2017-06-01 22:12:39 UTC
In this virtual env, do all of the VMs share the same physical disk?

Comment 10 Andreas Karis 2017-06-01 22:41:30 UTC
Yes, it's a virtual environment, and the VMs use qcow2 images on the same physical disk. The exact same lab environments work with OSP 7 through 10. OSP 11 works with 1 controller + 2 compute deployments, but not with 3 controllers + 1 compute.

Comment 11 Ollie Walsh 2017-06-01 22:48:25 UTC
Try this: https://review.openstack.org/463495
Or buy an SSD :-) They're pretty cheap now.
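
If taking that review isn't an option, the same idea can probably be approximated by raising the db sync exec timeouts via hieradata, assuming the puppet-neutron/puppet-nova versions in this release already expose a db_sync_timeout parameter (worth checking in the modules first). A sketch of an extra environment file:

    parameter_defaults:
      ExtraConfig:
        neutron::db::sync::db_sync_timeout: 900
        nova::db::sync::db_sync_timeout: 900

passed with -e on the next overcloud deploy. To confirm the shared disk really is the bottleneck, watching 'iostat -x 5' (or 'sar -d 5') on the hypervisor host during steps 3-5 should show await/%util pegged on the backing device.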

Comment 13 Ollie Walsh 2017-06-02 14:26:19 UTC

*** This bug has been marked as a duplicate of bug 1434279 ***

