Description of problem:
Deployment in a 3+1 environment fails with:

  Failed to call refresh: nova-manage cell_v2 discover_hosts returned 1 instead of one of [0]

This resembles https://bugs.launchpad.net/nova/+bug/1656276.

It happens in a virtual lab with 3 controllers and 1 compute node; a deployment with 1 controller and 1 compute node is fine. I can run step 3 of the .pp file manually later on without problems.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
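For context on what that error message means: puppet's Exec resource treats any exit status outside its "returns" list (default [0]) as a failure, and a refresh that hits this produces exactly the message above. A minimal sketch of that check, simulated with true/false rather than the real nova-manage call:

```shell
# Sketch of how puppet's Exec judges a refreshed command: the exit status
# must be in the "returns" list (here just 0); anything else fails.
# Simulated with /bin/true and /bin/false, not the real nova-manage call.
check_returns() {
    "$@" && rc=0 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "refresh ok (rc=$rc)"
    else
        echo "Failed to call refresh: $1 returned $rc instead of one of [0]"
    fi
}

check_returns true
check_returns false
```

So the "returned 1 instead of one of [0]" text only tells us the command exited non-zero; the actual cause has to come from the command's own logs.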
Created attachment 1283314 [details] journalctl controller 0
(...)
| ControllerDeployment_Step3 | c22047f9-e5ce-48fb-824e-e71a207bbc98 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T19:48:56Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |
| 0 | f42baa55-4929-495c-80ac-67cc2a89ba03 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T20:03:16Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step3-c2lmoz7zwosr |
(...)

In addition, I think I had a couple of other failed resources, but I'm currently running a stack update to see whether this resolves itself on a second run.
I re-ran openstack overcloud deploy to trigger a stack update, and this time it fails at step 4:

[stack@undercloud-8 ~]$ heat resource-list -n5 overcloud | grep FAIL
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps | b544aca6-eb0c-42f7-81e4-8fe40a60c53f | OS::TripleO::PostDeploySteps | UPDATE_FAILED | 2017-05-29T21:23:29Z | overcloud |
| 0 | 68eb39cd-3f07-4e84-9041-9fcf061373e1 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| 1 | 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| 2 | f152f422-4e6e-4373-aa35-44d14d04cc8b | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step4-f5zxxa7cexwj |
| ComputeDeployment_Step4 | 27300316-7183-40c8-821b-cc01ad0660c9 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |
| ControllerDeployment_Step4 | a378a932-da96-4fee-ac55-a76b3cde3ef6 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T21:37:13Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27 |

[stack@undercloud-8 ~]$ heat deployment-show 68eb39cd-3f07-4e84-9041-9fcf061373e1 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Neutron::Db::Sync/Exec[neutron-db-sync]: Command exceeded timeout

[stack@undercloud-8 ~]$ heat deployment-show 029ca4a1-c596-4aaa-8f7f-a4bdac2c33d9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!
[stack@undercloud-8 ~]$ heat deployment-show f152f422-4e6e-4373-aa35-44d14d04cc8b | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Notice: /Stage[main]/Ceilometer::Agent::Notification/Ceilometer_config[notification/ack_on_event_error]/ensure: created
Notice: /Stage[main]/Swift::Proxy/Swift_proxy_config[pipeline:main/pipeline]/value: value changed 'catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk tempurl ratelimit copy container-quotas account-quotas slo dlo versioned_writes proxy-logging proxy-server' to 'catch_errors healthcheck proxy-logging cache ratelimit bulk tempurl formpost authtoken keystone staticweb copy container_quotas account_quotas slo dlo versioned_writes ceilometer proxy-logging proxy-server'
Error: Systemd start for openstack-nova-scheduler failed!
Error: /Stage[main]/Nova::Scheduler/Nova::Generic_service[scheduler]/Service[nova-scheduler]/ensure: change from stopped to running failed: Systemd start for openstack-nova-scheduler failed!

[stack@undercloud-8 ~]$ heat deployment-show 27300316-7183-40c8-821b-cc01ad0660c9 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: 27300316-7183-40c8-821b-cc01ad0660c9

[stack@undercloud-8 ~]$ heat deployment-show a378a932-da96-4fee-ac55-a76b3cde3ef6 | sed 's/\\n/\n/g' | grep -i erro
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
Deployment not found: a378a932-da96-4fee-ac55-a76b3cde3ef6

[stack@undercloud-8 ~]$
On the next attempt, it fails at step 5:

WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| AllNodesDeploySteps | b544aca6-eb0c-42f7-81e4-8fe40a60c53f | OS::TripleO::PostDeploySteps | UPDATE_FAILED | 2017-05-29T21:58:53Z | overcloud |
| 0 | 46faf711-7f16-4a29-8232-6d9c798c6a13 | OS::Heat::StructuredDeployment | CREATE_FAILED | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27-ControllerDeployment_Step5-ozqcxvc7ip3a |
| ControllerDeployment_Step5 | e72c3e21-44b9-451e-9b23-a88faed63ef9 | OS::Heat::StructuredDeploymentGroup | CREATE_FAILED | 2017-05-29T22:18:09Z | overcloud-AllNodesDeploySteps-itdhcw3rbf27
Finally, after the next update, it goes through:

2017-05-29 23:02:42Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2017-05-29 23:02:52Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: http://10.0.0.6:5000/v2.0
Overcloud Deployed
[stack@undercloud-8 ~]$
The journalctl log is nearly unreadable, but I did snag this out as the reason the cell_v2 command is failing:

DBError: (pymysql.err.InternalError) (1054, u"Unknown column 'cn.uuid' in 'field list'")

That's a legitimate reason to fail, so it's not a bug in that command AFAICT. I also see timeouts running other things, like neutron's db sync, so I suspect something systemic and related to the DB.
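For anyone digging through the same logs: the captured output stores newlines as literal \n sequences, so unescaping them before grepping makes errors like this one findable. A self-contained sketch on a simulated fragment (sample text only, not the actual journal):

```shell
# Simulated log fragment with literal "\n" sequences, as heat stores them;
# unescape with sed (GNU sed turns \n in the replacement into a newline),
# then grep for the error line.
sample='nova-manage output:\nDBError: (pymysql.err.InternalError) (1054, u"Unknown column '\''cn.uuid'\'' in '\''field list'\''")\ndone'
printf '%s\n' "$sample" | sed 's/\\n/\n/g' | grep -i 'dberror'
```

This is the same sed trick used on the heat deployment-show output above, just applied to a canned string so the pipeline can be tried anywhere.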
"as the reason the cell_v2 command is failing. That's a legit reason to fail, so it's not a bug with that command AFAICT. I see other timeouts running things like neutron's db sync, so I kinda suspect something systemic and related to the DB." Am I the only one so far having / reporting this issue with OSP 11 deployment and 3 controllers? Because in my virtual env, I can consistently reproduce this on every redeploy of my environment, so I could likely provide an env for analysis.
In this virtual env, do all of the VMs share the same physical disk?
Yes, it's a virtual environment, and the VMs use qcow2 images on the same physical disk. The exact same lab environments work with OSP 7 through 10. OSP 11 works with a 1 controller + 2 compute deployment, but not with 3 controllers + 1 compute.
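If shared-spindle contention is the suspect, a quick synchronous-write probe run inside each controller VM would make it visible. A rough sketch, with arbitrary block size/count and assuming GNU dd (for oflag=dsync):

```shell
# Force each 64 KiB block to disk before the next write, so the reported
# throughput reflects synchronous write latency rather than cache speed.
# On a disk shared by four busy VMs that figure collapses, which would be
# consistent with db syncs blowing past puppet's exec timeout.
dd if=/dev/zero of=ioprobe.tmp bs=64k count=16 oflag=dsync 2>&1 | tail -n 1
rm -f ioprobe.tmp
```

Running the probe on all controllers at once approximates what happens when all three run their db syncs in parallel during a deploy step.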
Try this: https://review.openstack.org/463495 Or buy an SSD :-) They're pretty cheap now.
*** This bug has been marked as a duplicate of bug 1434279 ***