This service will be undergoing maintenance at 00:00 UTC, 2017-10-23 It is expected to last about 30 minutes
Bug 1474990 - openstack-ironic: OSP12 overcloud deployment status ERROR in nova list, the nodes weren't turned on by ironic.
openstack-ironic: OSP12 overcloud deployment status ERROR in nova list, the n...
Status: VERIFIED
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova (Show other bugs)
12.0 (Pike)
Unspecified Unspecified
high Severity high
: rc
: 12.0 (Pike)
Assigned To: Dan Smith
Archit Modi
: Triaged
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-25 15:03 EDT by Alexander Chuzhoy
Modified: 2017-10-16 16:53 EDT (History)
16 users (show)

See Also:
Fixed In Version: openstack-nova-16.0.1-0.20170921091002.edd59ae.el7.centos
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
sosreports from undercloud (11.61 MB, application/x-xz)
2017-07-25 15:11 EDT, Alexander Chuzhoy
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1713095 None None None 2017-08-25 11:55 EDT
OpenStack gerrit 497966 None None None 2017-09-08 10:58 EDT
OpenStack gerrit 502082 None None None 2017-09-27 05:56 EDT

  None (edit)
Description Alexander Chuzhoy 2017-07-25 15:03:26 EDT
openstack-ironic: OSP12 overcloud deployment status ERROR in nova list, the nodes weren't turned on by ironic.

Environment:

openstack-puppet-modules-10.0.0-0.20170315222135.0333c73.el7.1.noarch
openstack-ironic-common-8.0.1-0.20170719072039.d9983f1.el7ost.noarch
puppet-ironic-11.2.0-0.20170718122857.e15b0fd.el7ost.noarch
python-ironic-inspector-client-1.12.0-0.20170331023022.045fd9e.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170718190543.el7ost.noarch
instack-undercloud-7.1.1-0.20170714211622.el7ost.noarch
python-ironic-lib-2.8.0-0.20170621200136.56e7619.el7ost.noarch
openstack-ironic-conductor-8.0.1-0.20170719072039.d9983f1.el7ost.noarch
python-ironicclient-1.14.0-0.20170714004320.bd7ccdd.el7ost.noarch
openstack-ironic-inspector-5.1.1-0.20170705203602.c38596e.el7ost.noarch
openstack-ironic-api-8.0.1-0.20170719072039.d9983f1.el7ost.noarch
python-virtualbmc-1.0.1-0.20170307140838.ef2c881.el7.noarch

Steps to reproduce:
Attempt to deploy OC OSP12 using remote registry for images (without uploading the images to the local registry).


Result:

The deployment fails:
CREATE_COMPLETE  state changed", "2017-07-25 18:34:40Z [overcloud.Controller.0.NodeAdminUserData]: CREATE_COMPLETE  state changed", "2017-07-25 18:34:40Z [overcloud.Controller.0.NodeUserData]: CREATE_COMPLETE  state changed", "2017-07-25 18:34:40Z [overcloud.Controller.0.UserData]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:41Z [overcloud.Controller.0.UserData]: CREATE_COMPLETE  state changed", "2017-07-25 18:34:41Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:44Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:34:44Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:34:45Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:34:45Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:34:46Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed", "2017-07-25 18:34:46Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed", "2017-07-25 18:34:49Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:50Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:51Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:34:51Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:34:51Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:34:51Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:34:52Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed", "2017-07-25 18:34:53Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed", "2017-07-25 18:34:57Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:58Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:34:59Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:34:59Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:00Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:35:00Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:01Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed", "2017-07-25 18:35:01Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed", "2017-07-25 18:35:11Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:35:11Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:35:12Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:35:12Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:13Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:35:13Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:14Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed", "2017-07-25 18:35:15Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed", "2017-07-25 18:35:32Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:35:33Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:35:34Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:35:34Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:34Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:35:34Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed", "2017-07-25 18:35:36Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed", "2017-07-25 18:35:36Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed", "2017-07-25 18:36:09Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:36:09Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed", "2017-07-25 18:36:10Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:11Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:11Z [overcloud.Controller.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:11Z [overcloud.Controller.0]: CREATE_FAILED  ResourceInError: resources[0].resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:11Z [overcloud.Controller]: UPDATE_FAILED  ResourceInError: resources[0].resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:12Z [overcloud.Compute.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:12Z [overcloud.Controller]: CREATE_FAILED  resources.Controller: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:12Z [overcloud.Compute]: CREATE_FAILED  CREATE aborted", "2017-07-25 18:36:12Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.Controller: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "2017-07-25 18:36:13Z [overcloud.Compute.0]: CREATE_FAILED  CREATE aborted", "2017-07-25 18:36:13Z [overcloud.Compute]: UPDATE_FAILED  Operation cancelled", "", " Stack overcloud CREATE_FAILED ", "", "overcloud.Controller.0.Controller:", "  resource_type: OS::TripleO::ControllerServer", "  physical_resource_id: 98b6a18c-1435-4305-b019-94bca4c2985f", "  status: CREATE_FAILED", "  status_reason: |", "    ResourceInError: resources.Controller: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\"", "overcloud.Compute.0.NovaCompute:", "  resource_type: OS::TripleO::ComputeServer", "  physical_resource_id: 7be74e74-9602-4f74-a65a-c41ba216b42e", "  status: CREATE_FAILED", "  status_reason: |", "    ResourceInError: resources.NovaCompute: Went to status ERROR due to \"Message: No valid host was found. There are not enough hosts available., Code: 500\""]}


Checking the status of nova nodes:
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+----------+
| ID                                   | Name                   | Status | Task State | Power State | Networks |
+--------------------------------------+------------------------+--------+------------+-------------+----------+
| 7be74e74-9602-4f74-a65a-c41ba216b42e | overcloud-compute-0    | ERROR  | -          | NOSTATE     |          |
| 98b6a18c-1435-4305-b019-94bca4c2985f | overcloud-controller-0 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+------------------------+--------+------------+-------------+----------+


The nodes in ironic remained off:
(undercloud) [stack@undercloud-0 ~]$ ironic node-list
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| dbed9827-9525-4590-8889-4f092a1304fb | compute-0    | None          | power off   | available          | False       |
| 4394beec-0ea5-4847-b66b-4e62ebabc819 | controller-0 | None          | power off   | available          | False       |




Notes:

1) Simply re-deploying on the same setup without any changes works fine
2) Waiting for 10 minutes right before starting OC deployment also works fine. This explains why we didn't see the issue with local registry as it takes time to upload images.
Comment 1 Alexander Chuzhoy 2017-07-25 15:11 EDT
Created attachment 1304421 [details]
sosreports from undercloud
Comment 2 Bob Fournier 2017-08-03 17:38:46 EDT
Hi Sasha - sounds like this is a remote registry issue if it works fine if delaying before starting deployment?

Is it related to this bug - https://bugzilla.redhat.com/show_bug.cgi?id=1456986?

I'm seeing quite a few swift errors with Permission denied, not sure if that is related:
Jul 25 14:35:02 undercloud-0 object-server: Exception dumping recon cache: #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 3108, in dump_recon_cache#012    with lock_file(cache_file, lock_timeout, unlink=False) as cf:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2294, in lock_file#012    fd = os.open(filename, flags)#012OSError: [Errno 13] Permission denied: '/var/cache/swift/object.recon'
Jul 25 14:45:02 undercloud-0 object-server: Exception dumping recon cache: #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 3108, in dump_recon_cache#012    with lock_file(cache_file, lock_timeout, unlink=False) as cf:#012  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__#012    return self.gen.next()#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2294, in lock_file#012    fd = os.open(filename, flags)#012OSError: [Errno 13] Permission denied: '/var/cache/swift/object.recon'
Comment 3 Alexander Chuzhoy 2017-08-03 19:11:52 EDT
The thing is that the same remote registry used after the 10 mins wait, so I don't see how it's related.

With local registry though it takes something like 10 minutes to upload the images to the local registry and maybe that's why we don't see the issue.
Comment 4 Bob Fournier 2017-08-04 12:10:40 EDT
Thanks Sasha.  I'd like to have the nova team take a look at this.  There's really not anything Ironic can do if these images aren't available.  

Compute DFG - please send this back if this is deemed hardware provisioning related but this timing between local and remote registries makes it seems like its not. Thanks.
Comment 5 Sahid Ferdjaoui 2017-08-11 10:13:06 EDT
According to the logs of nova-scheduler:

sts for the request with instance ID '53499109-aa69-4650-8728-8638503104c3'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'TripleOCapabilitiesFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)', 'RamFilter: (start: 1, end: 1)', 'DiskFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 0)']

The 'ComputeFilter' is returning 0 host. I'm not a specialist of the scheduler/filter but that one seems to indicate whether the compute node is enabled or disabled.

In nova-compute.log we can see that error:

No compute node record for host undercloud-0.redhat.local: ComputeHostNotFound_Remote: Compute host undercloud-0.redhat.local could not be found.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 123, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/objects/compute_node.py", line 437, in get_all_by_host
    use_slave=use_slave)

  File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 235, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/objects/compute_node.py", line 432, in _db_compute_node_get_all_by_host
    return db.compute_node_get_all_by_host(context, host)

  File "/usr/lib/python2.7/site-packages/nova/db/api.py", line 298, in compute_node_get_all_by_host
    return IMPL.compute_node_get_all_by_host(context, host)

  File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 280, in wrapped
    return f(context, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 713, in compute_node_get_all_by_host
    raise exception.ComputeHostNotFound(host=host)

ComputeHostNotFound: Compute host undercloud-0.redhat.local could not be found.


It's not clear for me what does that mean... Lets see if team have an idea
Comment 6 Ollie Walsh 2017-08-11 11:18:51 EDT
(In reply to Bob Fournier from comment #2)
> Hi Sasha - sounds like this is a remote registry issue if it works fine if
> delaying before starting deployment?

I'm assuming that is a remote *docker* registry and not involved in the overcloud node deployment. However it means the ironic nodes are deployed much sooner after undercloud install as there is no time spent populating the local docker registry with images.
Sasha also mentions that the issues does not occur after 10 minutes.

This suggests some sort of timing issue in the service startup after the undercloud is installed.
Comment 7 Ollie Walsh 2017-08-23 21:08:38 EDT
Looking at this a bit closer...

Seems ComputeFilter should be first on the list of filters https://bugs.launchpad.net/nova/+bug/1579213. It's not: https://github.com/openstack/instack-undercloud/blob/caa64ac1a912c8c59f1dfe4c2bfe0c44311b5a26/elements/puppet-stack-config/puppet-stack-config.yaml.template#L433

However that doesn't look like the root cause here.

Seems nova-compute blocks is waiting for 10 minutes for a reply to a message:

2017-07-25 14:26:38.631 2502 ERROR oslo.messaging._drivers.impl_rabbit [req-a6cb755e-959b-4069-b494-0b6727e77d1e - - - - -] [da19e175-0dad-4c26-84d1-283bbbc0964d] AMQP server on 192.168.24.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-07-25 14:26:39.642 2502 INFO oslo.messaging._drivers.impl_rabbit [req-a6cb755e-959b-4069-b494-0b6727e77d1e - - - - -] [da19e175-0dad-4c26-84d1-283bbbc0964d] Reconnected to AMQP server on 192.168.24.1:5672 via [amqp] client with port 48580.
2017-07-25 14:36:39.665 2502 CRITICAL nova [req-a6cb755e-959b-4069-b494-0b6727e77d1e - - - - -] Unhandled error: MessagingTimeout: Timed out waiting for a reply to message ID 51a6ec286d124515b6c545250cd5e7d1
2017-07-25 14:36:39.665 2502 ERROR nova Traceback (most recent call last):
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/bin/nova-compute", line 10, in <module>
2017-07-25 14:36:39.665 2502 ERROR nova     sys.exit(main())
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/cmd/compute.py", line 56, in main
2017-07-25 14:36:39.665 2502 ERROR nova     topic=CONF.compute_topic)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/service.py", line 240, in create
2017-07-25 14:36:39.665 2502 ERROR nova     periodic_interval_max=periodic_interval_max)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/service.py", line 116, in __init__
2017-07-25 14:36:39.665 2502 ERROR nova     self.manager = manager_class(host=self.host, *args, **kwargs)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 534, in __init__
2017-07-25 14:36:39.665 2502 ERROR nova     self.driver = driver.load_compute_driver(self.virtapi, compute_driver)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/virt/driver.py", line 1636, in load_compute_driver
2017-07-25 14:36:39.665 2502 ERROR nova     virtapi)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 44, in import_object
2017-07-25 14:36:39.665 2502 ERROR nova     return import_class(import_str)(*args, **kwargs)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 152, in __init__
2017-07-25 14:36:39.665 2502 ERROR nova     self._refresh_hash_ring(nova_context.get_admin_context())
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 541, in _refresh_hash_ring
2017-07-25 14:36:39.665 2502 ERROR nova     ctxt, self._get_hypervisor_type())
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2017-07-25 14:36:39.665 2502 ERROR nova     args, kwargs)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 240, in object_class_action_versions
2017-07-25 14:36:39.665 2502 ERROR nova     args=args, kwargs=kwargs)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-07-25 14:36:39.665 2502 ERROR nova     retry=self.retry)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-07-25 14:36:39.665 2502 ERROR nova     timeout=timeout, retry=retry)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-07-25 14:36:39.665 2502 ERROR nova     retry=retry)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2017-07-25 14:36:39.665 2502 ERROR nova     result = self._waiter.wait(msg_id, timeout)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2017-07-25 14:36:39.665 2502 ERROR nova     message = self.waiters.get(msg_id, timeout=timeout)
2017-07-25 14:36:39.665 2502 ERROR nova   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2017-07-25 14:36:39.665 2502 ERROR nova     'to message ID %s' % msg_id)
2017-07-25 14:36:39.665 2502 ERROR nova MessagingTimeout: Timed out waiting for a reply to message ID 51a6ec286d124515b6c545250cd5e7d1
2017-07-25 14:36:39.665 2502 ERROR nova 

ironic-conductor connected to the message bus a few seconds after the message was sent at 14:26:41:
2017-07-25 14:26:41.051 1360 INFO oslo.messaging._drivers.impl_rabbit [req-f968bd19-1358-42ad-ba97-0fc08cd18fd2 - - - - -] [b5d7cb06-f274-4edb-9eff-3fad669455be] Reconnected to AMQP server on 192.168.24.1:5672 via [amqp] client with port 48628.

Don't know much about the messaging setup (and I've probably forgotten everything I ever knew about rabbitmq) but it looks like a dropped message is the root cause.
Comment 8 Ollie Walsh 2017-08-25 10:19:53 EDT
nova-conductor log:
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager Traceback (most recent call last):
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 920, in schedule_and_build_instances
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     instance_uuids)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 624, in _schedule_instances
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     request_spec, instance_uuids)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 464, in wrapped
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     return func(*args, **kwargs)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 52, in select_destinations
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     instance_uuids)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     return getattr(self.instance, __name)(*args, **kwargs)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 33, in select_destinations
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     instance_uuids)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 136, in select_destinations
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     return cctxt.call(ctxt, 'select_destinations', **msg_args)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     retry=self.retry)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     timeout=timeout, retry=retry)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     retry=retry)
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager     raise result
2017-07-25 14:36:09.788 2928 ERROR nova.conductor.manager NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Comment 9 Ollie Walsh 2017-08-25 11:42:19 EDT
Ignore comment #8, the fact that the exception is coming from amqpdriver.py made me think it was an issue connecting to rabbitmq when in fact it has connected successfully and is raising an Exception it received over the message bus in this traceback.
Comment 11 Alexander Chuzhoy 2017-10-16 14:46:38 EDT
Don't reproduce the issue now.

Note You need to log in before you can comment on or make changes to this bug.