Description of problem:
Trying to deploy on 8.0 and I got:

resources.Controller: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance d3dfa06a-9489-43f7-95bf-98e537116982. Last exception: [u'Traceback (most recent call last): \n', u' File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1, Code: 500"

The only errors in the logs are in nova-compute.log:

"ERROR nova.compute.manager [req-c06a802c-ce96-49d3-8a24-33533570e213 - - - - -] No compute node record for host puma01.scl.lab.tlv.redhat.com"

It seems to be calling an ironic API (http://172.16.0.1:6385/v1/nodes/detail) and getting an empty {"nodes": []} in return. I'm deploying on bare metal, and this log is from the undercloud. Also, all nodes are in ERROR state - except for one.

I tried to redeploy and trap the errors on the console with IPMI and got the following results: ironic lists the nodes as "power on" while it doesn't look like they're really turned on. Only rarely, after a long wait, does the deployment really begin. When it does begin, it also often fails in the middle of "deploy_ramdisk...". Eventually the node usually goes to ERROR state in nova, and the deployment fails.

Version-Release number of selected component (if applicable):
openstack-ironic-inspector-2.2.2-2.el7ost.noarch
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
python-ironicclient-0.8.1-1.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
python-ironic-inspector-client-1.2.0-6.el7ost.noarch

How reproducible:
100% on my hardware (HP ProLiant DL170e G6)

Steps to Reproduce:
1. Deploy an overcloud
2. watch 'nova list && ironic node-list'
3. See how the nodes go on and off, and try catching what's going on in their console

Additional info:
On this system I had problems passing introspection, as I was hitting bug #1301694.
I applied the workaround from there and replaced /tftpboot/undionly.kpxe with the latest one from http://boot.ipxe.org/undionly.kpxe.
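To make the first symptom concrete: nova's "No compute node record" error follows directly from the empty {"nodes": []} response from Ironic, because the nova-compute ironic driver builds its compute node records from that list. A minimal sketch of the relationship (the helper name is hypothetical; the real logic lives in nova's resource tracker):

```python
import json

def expected_compute_nodes(nodes_detail_response):
    """Return the node UUIDs nova-compute would track, given the JSON body
    returned by Ironic's /v1/nodes/detail endpoint (hypothetical helper;
    the real resource tracker lives in nova.compute.manager)."""
    payload = json.loads(nodes_detail_response)
    return [node["uuid"] for node in payload.get("nodes", [])]

# The undercloud observed an empty response, so nova has no compute node
# records to schedule onto -- hence "No compute node record for host ...".
print(expected_compute_nodes('{"nodes": []}'))  # -> []
```

So the scheduling failure in the Heat stack is downstream of the node registration problem, not an independent nova bug.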
Hello! Please provide: 1) the ironic-conductor logs, and 2) the output of 'ironic node-list' after the failed deployment.
Hi @Udi, as @Dmitry already pointed out, it's hard to figure out which error we are talking about without any logs or the exact error message. Ironic nodes have a field called "last_error"; in addition to the logs, can you please paste the output of "ironic node-show" for one of the nodes in the error state?
Created attachment 1131227 [details]
ironic-conductor log

I recreated the bug in the latest 8.0 beta, and monitored the ironic-conductor log as the deployment was in progress. You can see many exceptions in the log like:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 630, in passthru_handler
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
    node = self._find_node_by_macs(context, mac_addresses)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
    'database.') % mac_addresses)
NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.

2016-02-28 08:28:29.141 27940 ERROR oslo_messaging.rpc.dispatcher [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] Exception during message handling: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 609, in driver_vendor_passthru
    ret = vendor_func(context, **info)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 633, in passthru_handler
    LOG.exception(passthru_logmessage, api_method)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 630, in passthru_handler
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
    node = self._find_node_by_macs(context, mac_addresses)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
    'database.') % mac_addresses)
NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.

2016-02-28 08:28:29.144 27940 ERROR oslo_messaging._drivers.common [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] Returning exception No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database. to caller
2016-02-28 08:28:29.144 27940 ERROR oslo_messaging._drivers.common [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] ['Traceback (most recent call last):\n', ... (same traceback as above) ..., "NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.\n"]

ironic node-list after the failure looks like this:

+------------+------+---------------+-------------+--------------------+-------------+
| UUID       | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+------------+------+---------------+-------------+--------------------+-------------+
| c96e332... | None | None          | power off   | available          | False       |
| 73c14c8... | None | None          | power off   | available          | False       |
| 09ac6d7... | None | None          | power off   | available          | False       |
| 6dae280... | None | c0ff5917-e0...| power on    | active             | False       |
| 8a5ab62... | None | None          | power off   | available          | False       |
| f6697ea... | None | 92496b25-36...| power on    | wait call-back     | False       |
| ecb8a7f... | None | None          | power off   | available          | False       |
+------------+------+---------------+-------------+--------------------+-------------+
Created attachment 1135910 [details]
ramdisk stuck
Hi,

> File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
>     node = self._find_node_by_macs(context, mac_addresses)
> File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
>     'database.') % mac_addresses)
> NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.

OK, this error is just saying that Ironic does not have any ports registered with these MAC addresses. Were the nodes you're trying to deploy inspected first? Can you post the output of the command below:

$ ironic port-list

Check whether the MAC addresses registered in Ironic correspond to the ones found on the nodes.
Immediately after registration, the MAC addresses in "ironic port-list" match the ones from instackenv.json. Then, after introspection, they change. The following is the output from before and after introspection, on the same environment.

Before introspection:

[stack@puma01 ~]$ ironic port-list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| ff31da22-1563-47ed-bb51-0a20e32c67bb | 44:1e:a1:71:a9:c2 |
| 6cec5742-a335-46f5-8bde-55e3b62043c2 | 44:1e:a1:73:34:be |
| cbc309d4-43f4-4b34-ae18-1df594bc7212 | 44:1e:a1:73:39:56 |
| 1bd0ff8f-e8b9-498d-92f9-7f31476b6660 | 44:1e:a1:71:a7:c8 |
| f42de1a8-0011-4a88-958c-12228774301c | 44:1e:a1:73:38:60 |
| ab35fd32-3776-49d2-9a6c-eb6424331631 | 44:1e:a1:73:36:ee |
| 7a2943c4-667a-4a73-9413-795f3be2f541 | 44:1e:a1:71:a6:a6 |
+--------------------------------------+-------------------+

After introspection:

[stack@puma01 ~]$ ironic port-list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 8ec6a132-684d-4e94-a0f8-3eafbc6a0691 | 44:1e:a1:71:a9:c3 |
| cef404cc-8dbe-4e76-9f8d-d7257d82cfa8 | 44:1e:a1:73:39:57 |
| 60c38987-ff7d-4eea-8ae7-660a9d4e3fe9 | 44:1e:a1:73:34:bf |
| a6234639-005e-4341-82ef-3594e7093183 | 44:1e:a1:73:38:61 |
| 255ce7e7-d79d-44d9-94de-686fc8be7077 | 44:1e:a1:71:a7:c9 |
| aa778533-3871-4e72-bf6a-cfce146c908a | 44:1e:a1:71:a6:a7 |
| a3362f36-41fd-4ae0-8c51-11c1c93bb257 | 44:1e:a1:73:36:ef |
+--------------------------------------+-------------------+
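For what it's worth, comparing the two listings shows a clear pattern: every post-introspection MAC is the corresponding pre-introspection MAC with its last octet incremented by one, which would suggest introspection registered the adjacent port of each (presumably dual-port) NIC rather than the one in instackenv.json. A quick sketch to verify, with the addresses copied from the output above:

```python
# Port lists from before and after introspection, copied from the
# "ironic port-list" output above.
before = ["44:1e:a1:71:a9:c2", "44:1e:a1:73:34:be", "44:1e:a1:73:39:56",
          "44:1e:a1:71:a7:c8", "44:1e:a1:73:38:60", "44:1e:a1:73:36:ee",
          "44:1e:a1:71:a6:a6"]
after = ["44:1e:a1:71:a9:c3", "44:1e:a1:73:39:57", "44:1e:a1:73:34:bf",
         "44:1e:a1:73:38:61", "44:1e:a1:71:a7:c9", "44:1e:a1:71:a6:a7",
         "44:1e:a1:73:36:ef"]

def mac_plus_one(mac):
    """Return the MAC with its last octet incremented (wraps at 0xff)."""
    octets = mac.split(":")
    octets[-1] = format((int(octets[-1], 16) + 1) % 256, "02x")
    return ":".join(octets)

# Every registered address moved to the sibling port's address.
assert {mac_plus_one(m) for m in before} == set(after)
```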
Thanks @Udi,

So, it is expected that inspection will change the MAC addresses. The problem is that, according to the logs provided in the bug*, the node (or nodes) does not have ports whose MACs match the ones that IPA reports when trying to look up the node and continue the deploy.

* logs:
> File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
>     node = self._find_node_by_macs(context, mac_addresses)
> File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
>     'database.') % mac_addresses)
> NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.
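The lookup that fails here can be sketched roughly as follows: IPA reports all of the node's MAC addresses, and the conductor needs at least one of them to match a registered port. This is a hypothetical simplification of ironic's _find_node_by_macs (the port table values are illustrative, not from this setup):

```python
# Ports registered in the Ironic DB, mapping MAC -> node (example values).
registered_ports = {
    "44:1e:a1:73:39:57": "node-2",
}

def find_node_by_macs(mac_addresses, ports):
    """Simplified stand-in for _find_node_by_macs: return the node owning
    any of the reported MACs, or raise if none match."""
    matches = {ports[m] for m in mac_addresses if m in ports}
    if not matches:
        raise LookupError("No ports matching the given MAC addresses "
                          "%s exist in the database." % mac_addresses)
    return matches.pop()

# The MACs from the traceback: neither is registered, hence NodeNotFound.
try:
    find_node_by_macs(["00:9c:02:b0:8f:98", "44:1e:a1:73:39:56"],
                      registered_ports)
except LookupError as exc:
    print(exc)
```

So a single stale or shifted MAC is enough to strand the ramdisk: the agent boots fine, but the conductor can never associate its heartbeat with a node.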
Udi, re comment #7, note: https://bugzilla.redhat.com/show_bug.cgi?id=1275480. In general you should have the MACs of the provisioning NICs.
The issue keeps on reproducing with 8.0 on this setup. Note that the MACs in the instackenv.json file were rectified. Getting the following in the ironic-conductor.log:

2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor [-] Asynchronous exception for node fc1f038b-c793-49a5-a2e4-ec7c87214965: Node failed to get image for deploy. exception: ('Connection aborted.', error(110, 'ETIMEDOUT'))
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 345, in heartbeat
    self.continue_deploy(task, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 134, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 873, in continue_deploy
    uuid_dict_returned = do_agent_iscsi_deploy(task, self._client)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 437, in do_agent_iscsi_deploy
    result = agent_client.start_iscsi_target(node, iqn)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_client.py", line 133, in start_iscsi_target
    wait=True)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_client.py", line 72, in _command
    data=body)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(110, 'ETIMEDOUT'))

Notes:
1. One node out of 7 is able to load the overcloud and boot. It is the same node across attempts to re-deploy. Removing this node from the ironic DB and re-deploying doesn't cause another node to succeed.
2. All the nodes have the same HW and firmware version and are patched to the same switch, while all the switch ports have exactly the same configuration (MTU, VLAN, negotiation).
3. Once the deploy image is loaded we don't see anything on the console until the machine is rebooted (it's serial redirection, which is missing from the deploy image).
4. Ironic shows the node as "deploying", followed by "deploy failed"... Then more unsuccessful attempts to deploy on the same node are repeated.
I checked the environment and here are my findings so far:

The node that got provisioned has the following driver_internal_info:

{u'agent_url': u'http://10.35.190.228:9999',
 u'root_uuid_or_disk_id': u'ded6b6c9-e52b-4301-8a14-f952bb73395f',
 u'is_whole_disk_image': False,
 u'agent_last_heartbeat': 1458222794}

10.35.190.228 is part of the external network, not the ctlplane network, and it's provided via DHCP. In the openstack-ironic-conductor log we can see that this IP was also used for the iSCSI target when writing the image:

Mar 17 15:53:15 ironic : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ironic-rootwrap /etc/ironic/rootwrap.conf dd if=/var/lib/ironic/images/6d54c046-46e5-418e-a43f-4608b610d727/disk of=/dev/disk/by-path/ip-10.35.190.228:3260-iscsi-iqn.2008-10.org.openstack:6d54c046-46e5-418e-a43f-4608b610d727-lun-1-part2 bs=1M oflag=direct

I'm not sure what the agent_url is used for, but I believe it should use the ctlplane network, which should also be used for transferring the image.
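To illustrate the mismatch: of the interfaces this node reports, only one carries an address on the provisioning CIDR (192.0.2.0/24 here, inferred from the addresses in this bug), and that is the one agent_url and the iSCSI target should presumably be bound to. A minimal sketch of such a selection (the helper is hypothetical; IPA's real logic differs):

```python
import ipaddress

# Interfaces as reported by introspection for the node that got provisioned
# (addresses copied from the data in this bug).
interfaces = {
    "enp4s0f0": None,
    "enp4s0f1": "10.35.190.228",   # external network -- what agent_url used
    "enp5s0f0": "10.35.163.21",
    "enp5s0f1": "192.0.2.103",     # ctlplane / provisioning network
}

def ctlplane_ip(ifaces, provisioning_cidr="192.0.2.0/24"):
    """Pick the interface whose address falls inside the provisioning CIDR
    (hypothetical helper; shown only to illustrate the expected choice)."""
    net = ipaddress.ip_network(provisioning_cidr)
    for name, ip in ifaces.items():
        if ip and ipaddress.ip_address(ip) in net:
            return name, ip
    return None

print(ctlplane_ip(interfaces))  # -> ('enp5s0f1', '192.0.2.103')
```

Instead, the agent advertised 10.35.190.228, so both the heartbeat callback and the iscsi dd went over the external network.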
This is the NIC-related introspection data for the deployed node:

{
  "all_interfaces": {
    "enp4s0f0": {"ip": null, "mac": "00:9c:02:b0:8f:a8"},
    "enp4s0f1": {"ip": "10.35.190.228", "mac": "00:9c:02:b0:8f:ac"},
    "enp5s0f0": {"ip": "10.35.163.21", "mac": "44:1e:a1:71:a7:c8"},
    "enp5s0f1": {"ip": "192.0.2.103", "mac": "44:1e:a1:71:a7:c9"}
  },
  "boot_interface": "44:1e:a1:71:a7:c9",
  "interfaces": {
    "enp5s0f1": {"ip": "192.0.2.103", "mac": "44:1e:a1:71:a7:c9"}
  },
  "inventory": {
    "interfaces": [
      {"ipv4_address": "10.35.190.228", "mac_address": "00:9c:02:b0:8f:ac", "name": "enp4s0f1", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": null, "mac_address": "00:9c:02:b0:8f:a8", "name": "enp4s0f0", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": "192.0.2.103", "mac_address": "44:1e:a1:71:a7:c9", "name": "enp5s0f1", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": "10.35.163.21", "mac_address": "44:1e:a1:71:a7:c8", "name": "enp5s0f0", "switch_chassis_descr": null, "switch_port_descr": null}
    ]
  }
}

This is the introspection data for a node that times out:

{
  "all_interfaces": {
    "enp4s0f0": {"ip": "10.35.160.15", "mac": "00:9c:02:b0:8f:70"},
    "enp4s0f1": {"ip": "10.35.190.152", "mac": "00:9c:02:b0:8f:74"},
    "enp5s0f0": {"ip": "10.35.162.22", "mac": "44:1e:a1:73:34:be"},
    "enp5s0f1": {"ip": "192.0.2.101", "mac": "44:1e:a1:73:34:bf"}
  },
  "boot_interface": "44:1e:a1:73:34:bf",
  "interfaces": {
    "enp5s0f1": {"ip": "192.0.2.101", "mac": "44:1e:a1:73:34:bf"}
  },
  "inventory": {
    "interfaces": [
      {"ipv4_address": "10.35.190.152", "mac_address": "00:9c:02:b0:8f:74", "name": "enp4s0f1", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": "10.35.160.15", "mac_address": "00:9c:02:b0:8f:70", "name": "enp4s0f0", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": "192.0.2.101", "mac_address": "44:1e:a1:73:34:bf", "name": "enp5s0f1", "switch_chassis_descr": null, "switch_port_descr": null},
      {"ipv4_address": "10.35.162.22", "mac_address": "44:1e:a1:73:34:be", "name": "enp5s0f0", "switch_chassis_descr": null, "switch_port_descr": null}
    ]
  }
}

In this particular environment I think the timeout is caused by asymmetric routing, since the undercloud's external interface has an IP address in the 10.35.160.0/24 subnet. For the node that times out, requests originating from the undercloud to 10.35.190.152 arrive on the enp4s0f1 interface, but the replies will be sent over the enp4s0f0 interface, which doesn't work by default - https://access.redhat.com/solutions/53031
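The asymmetric-routing claim above can be sketched with the addresses from the introspection data: the node replies to an on-link destination out of the interface sharing that subnet, not the interface the request arrived on. The undercloud IP below is a hypothetical example (any host on 10.35.160.0/24 illustrates the point), and the routing model is deliberately simplified:

```python
import ipaddress

# Two NICs of the node that times out, from the introspection data above.
node_ifaces = {
    "enp4s0f0": ipaddress.ip_interface("10.35.160.15/24"),
    "enp4s0f1": ipaddress.ip_interface("10.35.190.152/24"),
}
undercloud_ip = ipaddress.ip_address("10.35.160.1")  # hypothetical

def reply_interface(src, ifaces):
    """Pick the interface the node would use to reply to `src`: a directly
    connected subnet wins over the default route (simplified model)."""
    for name, iface in ifaces.items():
        if src in iface.network:
            return name
    return "default-route interface"

# The request from the undercloud reaches enp4s0f1 (10.35.190.152), but the
# reply egresses enp4s0f0 because the undercloud is on-link there --
# asymmetric routing, which reverse-path filtering drops by default.
print(reply_interface(undercloud_ip, node_ifaces))  # -> 'enp4s0f0'
```

That matches the observed behaviour: connections to 10.35.190.152 time out even though the interface is up and addressed.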
So, the introspection seemed to work correctly in the last case. Do I get it right that the problem is that IPA starts listening on the external NIC instead of the one on the provisioning network? And then the undercloud can't connect to it?
(In reply to Dmitry Tantsur from comment #12)
> So, the introspection seemed to work correctly in the last case. Do I get it
> right that the problem is that IPA starts listening on the external NIC
> instead of one on the provisioning network? And then undercloud can't
> connect to it?

Yes, introspection runs fine, but during the overcloud image deploy the iSCSI target is listening on the external NIC instead of the provisioning network (the 192.0.2.x addresses).
Got it. I started working on a fix upstream, but as I'll be on PTO next week, I'm not assigning the bug to myself, in the hope that Lucas finishes what I've started :)
Hi, an update on this problem. I looked at Udi's environment and we found two problems there. Inspection was failing with the following message:

Introspection for UUID c5a871a2-ae4a-4057-b9c4-a9daaf8b276c finished with error: Unexpected exception TypeError during processing: int() argument must be a string or a number, not 'NoneType'

I've fixed it locally and put a patch upstream to inspector: https://review.openstack.org/#/c/295327/ (we should open another BZ for this; I will do it). Inspection passed after that.

Then I created another ramdisk* with the patch @Dmitry pointed out in comment #14 and tried the deployment, and it worked: http://pastebin.test.redhat.com/358545

The patch for the deployment was approved upstream; now we are waiting for it to merge, and then we can work on backporting it.

* I've used the upstream tools.
The patch is now merged downstream
Patch is included in ironic-python-agent-doc-1.1.0-8.el7ost.noarch.rpm
This is still failing, also with the latest undionly.kpxe. I passed introspection but deployment still behaves as described in the bug.
Udi, could you please provide your logs showing the wrong IP address and the introspection data showing all IP addresses? (just like above)
Created attachment 1148122 [details]
ironic logs after failed deployment

Ironic logs after the failed deployment are attached, and the output of ironic node-list looks as follows:

[stack@puma01 ~]$ ironic node-list
+----------+------+------------------+-------------+--------------------+-------------+
| UUID     | Name | Instance UUID    | Power State | Provisioning State | Maintenance |
+----------+------+------------------+-------------+--------------------+-------------+
| fec0e... | None | c32338ce-030d489 | power off   | deploying          | False       |
| ee970... | None | 2d5e14e0-045160b | power off   | deploying          | False       |
| f0f62... | None | c8bf2f1e-43471af | power off   | available          | False       |
| 59fba... | None | 4ce17653-9b011bc | power on    | active             | False       |
| e958e... | None | 5a656682-850a073 | power on    | wait call-back     | False       |
| 8f3fd... | None | None             | power off   | available          | False       |
+----------+------+------------------+-------------+--------------------+-------------+

*** Note that the failure is in deployment, not in introspection. The introspection issue was resolved.
Still, please show the introspection data (see http://tripleo.org/advanced_deployment/introspection_data.html). Also, what's your IPA image version?
Marking the bug as verified. It was reopened by mistake after I had the wrong version of the images rpms. Verified in: python-tripleoclient-0.3.4-3.el7ost.noarch rhosp-director-images-8.0-20160415.1.el7ost.noarch rhosp-director-images-ipa-8.0-20160415.1.el7ost.noarch