Bug 1308981 - IPA may advertise an IP address on a wrong network
IPA may advertise an IP address on a wrong network
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-python-agent (Show other bugs)
8.0 (Liberty)
x86_64 Linux
urgent Severity urgent
: async
: 8.0 (Liberty)
Assigned To: Lucas Alvares Gomes
Udi
: TestOnly
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-02-16 11:23 EST by Udi
Modified: 2016-04-19 10:04 EDT (History)
13 users (show)

See Also:
Fixed In Version: openstack-ironic-python-agent-1.1.0-8.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-19 10:04:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
ironic-conductor log (580.03 KB, application/x-gzip)
2016-02-28 04:39 EST, Udi
no flags Details
ramdisk stuck (108.18 KB, image/png)
2016-03-13 12:09 EDT, Omri Hochman
no flags Details
ironic logs after failed deployment (1.33 MB, application/x-gzip)
2016-04-18 03:56 EDT, Udi
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1558956 None None None 2016-03-18 04:39 EDT
OpenStack gerrit 294162 None None None 2016-03-17 12:53 EDT

  None (edit)
Description Udi 2016-02-16 11:23:23 EST
Description of problem:
Trying to deploy on 8.0 and I got: resources.Controller:
ResourceInError: resources[0].resources.Controller: Went to status
ERROR due to "Message: Exceeded maximum number of retries. Exceeded
max scheduling attempts 3 for instance
d3dfa06a-9489-43f7-95bf-98e537116982. Last exception: [u'Traceback
(most recent call last): \n', u' File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1,
Code: 500"

The only errors in the logs are in nova-compute.log: "ERROR
nova.compute.manager [req-c06a802c-ce96-49d3-8a24-33533570e213 - - - -
-] No compute node record for host puma01.scl.lab.tlv.redhat.com". It
seems to be calling an ironic api
(http://172.16.0.1:6385/v1/nodes/detail) and getting an empty
{"nodes": []} in return.

I'm deploying on bare metals, and this log is from the undercloud.

Also, all nodes are in ERROR state - except for one... I tried to
redeploy and trap the errors on the console with ipmi and got the
following results: Ironic lists the nodes as "power on" while it
doesn't look like they're really turned on... Only rarely, if you wait
for a long time, the deployment really begins. When it begins it also
often fails in the middle of "deploy_ramdisk...". Eventually the node
would usually go to ERROR state in nova, and the deployment fails.

Version-Release number of selected component (if applicable):
openstack-ironic-inspector-2.2.2-2.el7ost.noarch
openstack-ironic-conductor-4.2.2-4.el7ost.noarch
python-ironicclient-0.8.1-1.el7ost.noarch
openstack-ironic-common-4.2.2-4.el7ost.noarch
openstack-ironic-api-4.2.2-4.el7ost.noarch
python-ironic-inspector-client-1.2.0-6.el7ost.noarch


How reproducible:
100% on my hardware (HP Proliant DL170e G6)


Steps to Reproduce:
1. Deploy an overcloud
2. watch 'nova list && ironic node-list'
3. See how the nodes go on and off, and try catching what's going on in their console


Additional info:
On this system I had problems passing introspection, as I was hitting bug #1301694. I applied the workaround from there and patched the file /tftpboot/undionly.kpxe with the latest one from http://boot.ipxe.org/undionly.kpxe.
Comment 2 Dmitry Tantsur 2016-02-19 06:48:37 EST
Hello! Please provide 1. ironic-conductor logs, 2. output of 'ironic node-list' after the failed deployment.
Comment 3 Lucas Alvares Gomes 2016-02-19 12:21:10 EST
Hi @Udi,

As @Dmitry already pointed out, it's hard to figure out which error we are talking about without any logs or the exact error message.

Ironic node's has a a field called "last_error", in addition to the logs can you please paste the output of the "ironic node-show" of one of the nodes in error state?
Comment 4 Udi 2016-02-28 04:39 EST
Created attachment 1131227 [details]
ironic-conductor log

I recreated the bug in the latest 8.0 beta, and monitored the ironic-conductor log as the deployment was in progress. You can see many exceptions in the log like:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 630, in passthru_handler
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
    node = self._find_node_by_macs(context, mac_addresses)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
    'database.') % mac_addresses)
NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.
 
2016-02-28 08:28:29.141 27940 ERROR oslo_messaging.rpc.dispatcher [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] Exception during message handling: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 609, in driver_vendor_passthru
    ret = vendor_func(context, **info)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 633, in passthru_handler
    LOG.exception(passthru_logmessage, api_method)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 630, in passthru_handler
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup
    node = self._find_node_by_macs(context, mac_addresses)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs
    'database.') % mac_addresses)
NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.

2016-02-28 08:28:29.144 27940 ERROR oslo_messaging._drivers.common [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] Returning exception No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database. to caller
2016-02-28 08:28:29.144 27940 ERROR oslo_messaging._drivers.common [req-a9cc4b66-cf51-411a-bd4c-82357336be18 ] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n    executor_callback))\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n    executor_callback)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 129, in _do_dispatch\n    result = func(ctxt, **new_args)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 142, in inner\n    return func(*args, **kwargs)\n', '  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 609, in driver_vendor_passthru\n    ret = vendor_func(context, **info)\n', '  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 633, in passthru_handler\n    LOG.exception(passthru_logmessage, api_method)\n', '  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 204, in __exit__\n    six.reraise(self.type_, self.value, self.tb)\n', '  File "/usr/lib/python2.7/site-packages/ironic/drivers/base.py", line 630, in passthru_handler\n    return func(*args, **kwargs)\n', '  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 434, in lookup\n    node = self._find_node_by_macs(context, mac_addresses)\n', '  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 514, in _find_node_by_macs\n    \'database.\') % mac_addresses)\n', "NodeNotFound: No ports matching the given MAC addresses [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.\n"]

ironic node-list after the failure looks like this:
+------------+------+---------------+-------------+--------------------+-------------+
| UUID       | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+------------+------+---------------+-------------+--------------------+-------------+
| c96e332... | None | None          | power off   | available          | False       |        
| 73c14c8... | None | None          | power off   | available          | False       |        
| 09ac6d7... | None | None          | power off   | available          | False       |        
| 6dae280... | None | c0ff5917-e0...| power on    | active             | False       |
| 8a5ab62... | None | None          | power off   | available          | False       |
| f6697ea... | None | 92496b25-36...| power on    | wait call-back     | False       |
| ecb8a7f... | None | None          | power off   | available          | False       |
+------------+------+---------------+-------------+--------------------+-------------+
Comment 5 Omri Hochman 2016-03-13 12:09 EDT
Created attachment 1135910 [details]
ramdisk stuck
Comment 6 Lucas Alvares Gomes 2016-03-14 05:44:53 EDT
Hi,

> "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.
> py", line 434, in lookup
>     node = self._find_node_by_macs(context, mac_addresses)
>   File
> "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.
> py", line 514, in _find_node_by_macs
>     'database.') % mac_addresses)
> NodeNotFound: No ports matching the given MAC addresses
> [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.

Ok this error is just saying that Ironic do not have any nodes registered with these MAC addresses. The nodes you're trying to deploy were inspected first?

Can you post the output of the command below:

$ ironic port-list

Check if the MAC addresses registered in Ironic do correspond with the ones found in the nodes.
Comment 7 Udi 2016-03-14 11:53:24 EDT
Immediately after registration, the MAC addresses in "ironic port-list" match the ones from instackenv.json. Then, after introspection, they change. The following is the output from before and after introspection, on the same environment:

[stack@puma01 ~]$ ironic port-list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| ff31da22-1563-47ed-bb51-0a20e32c67bb | 44:1e:a1:71:a9:c2 |
| 6cec5742-a335-46f5-8bde-55e3b62043c2 | 44:1e:a1:73:34:be |
| cbc309d4-43f4-4b34-ae18-1df594bc7212 | 44:1e:a1:73:39:56 |
| 1bd0ff8f-e8b9-498d-92f9-7f31476b6660 | 44:1e:a1:71:a7:c8 |
| f42de1a8-0011-4a88-958c-12228774301c | 44:1e:a1:73:38:60 |
| ab35fd32-3776-49d2-9a6c-eb6424331631 | 44:1e:a1:73:36:ee |
| 7a2943c4-667a-4a73-9413-795f3be2f541 | 44:1e:a1:71:a6:a6 |
+--------------------------------------+-------------------+

[stack@puma01 ~]$ ironic port-list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 8ec6a132-684d-4e94-a0f8-3eafbc6a0691 | 44:1e:a1:71:a9:c3 |
| cef404cc-8dbe-4e76-9f8d-d7257d82cfa8 | 44:1e:a1:73:39:57 |
| 60c38987-ff7d-4eea-8ae7-660a9d4e3fe9 | 44:1e:a1:73:34:bf |
| a6234639-005e-4341-82ef-3594e7093183 | 44:1e:a1:73:38:61 |
| 255ce7e7-d79d-44d9-94de-686fc8be7077 | 44:1e:a1:71:a7:c9 |
| aa778533-3871-4e72-bf6a-cfce146c908a | 44:1e:a1:71:a6:a7 |
| a3362f36-41fd-4ae0-8c51-11c1c93bb257 | 44:1e:a1:73:36:ef |
+--------------------------------------+-------------------+
Comment 8 Lucas Alvares Gomes 2016-03-14 12:08:16 EDT
Thanks @Udi,

So, it is expected that inspection will change the MAC addresses. The problem is that, according to the logs provided in the bug*, the nodes (or is it a node?) do not have the MACs to match the ones that IPA uses when trying to lookup for that node and continue the deploy.

* logs:

> "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.
> py", line 434, in lookup
>     node = self._find_node_by_macs(context, mac_addresses)
>   File
> "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.
> py", line 514, in _find_node_by_macs
>     'database.') % mac_addresses)
> NodeNotFound: No ports matching the given MAC addresses
> [u'00:9c:02:b0:8f:98', u'44:1e:a1:73:39:56']exist in the database.
Comment 9 Alexander Chuzhoy 2016-03-14 13:59:35 EDT
Udi,
re comment #7,

Note: https://bugzilla.redhat.com/show_bug.cgi?id=1275480

In general you should have the MACs of the provisoning NICs.
Comment 10 Alexander Chuzhoy 2016-03-15 18:12:28 EDT
The issue keeps on reproducig with 8.0 on this setup.
Note that the MACs in the instackenv.json file were rectified.


Getting the following in the ironic-conductor.log:
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor [-] Asynchronous exception for node fc1f038b-c793-49a5-a2e4-ec7c87214965: Node failed to get image for deploy. exception: ('Connection aborted.', error(110, 'ETIMEDOUT'))
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor Traceback (most recent call last):
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 345, in heartbeat
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     self.continue_deploy(task, **kwargs)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 134, in wrapper
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     return f(*args, **kwargs)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 873, in continue_deploy
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     uuid_dict_returned = do_agent_iscsi_deploy(task, self._client)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 437, in do_agent_iscsi_deploy
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     result = agent_client.start_iscsi_target(node, iqn)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_client.py", line 133, in start_iscsi_target
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     wait=True)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_client.py", line 72, in _command
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     data=body)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in post
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     return self.request('POST', url, data=data, json=json, **kwargs)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     resp = self.send(prep, **send_kwargs)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     r = adapter.send(request, **kwargs)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor     raise ConnectionError(err, request=request)
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor ConnectionError: ('Connection aborted.', error(110, 'ETIMEDOUT'))
2016-03-15 23:47:18.741 16984 ERROR ironic.drivers.modules.agent_base_vendor





Notes:
1. One node out of 7 is able to load the overcloud and boot. It is the same node across attempts to re-deploy. Removing this node from the ironic DB and re-deploying doesn't cause another node to succeed.
2. All the nodes have the same HW and firmware version and are patched to the same switch, while all the switch ports have exactly the same configuration (MTU,VLAN,negotiation).
3. Once the deploy image is loaded we don't see anything on the console until the machine is rebooted (it's serial redirection, which is missing on the deploy image)
4. The ironic shows node as "deploying", folowed by "deploy failed"... Then more unsuccessful attempts to deploy on the same node are repeated.
Comment 11 Marius Cornea 2016-03-17 10:55:42 EDT
I checked the environment and here are my findings so far:

The node that got provisioned has the follwing driver_internal_info:

{u'agent_url': u'http://10.35.190.228:9999', u'root_uuid_or_disk_id':  
u'ded6b6c9-e52b-4301-8a14-f952bb73395f', u'is_whole_disk_image': False,
u'agent_last_heartbeat': 1458222794}                                   

10.35.190.228 is part of the external network not the ctlplane network and it's provided via dhcp.

In the openstack-ironic-conductor log we can see that this ip was also used for the iscsi target for writing the image:

Mar 17 15:53:15 ironic : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ironic-rootwrap /etc/ironic/rootwrap.conf dd if=/var/lib/ironic/images/6d54c046-46e5-418e-a43f-4608b610d727/disk of=/dev/disk/by-path/ip-10.35.190.228:3260-iscsi-iqn.2008-10.org.openstack:6d54c046-46e5-418e-a43f-4608b610d727-lun-1-part2 bs=1M oflag=direct

I'm not sure what the agent_url is used for but I believe it should use the ctlplane network, which should also be used for transferring the image. 

These are the nic related introspection data for the deployed node:

{
    "all_interfaces": {
        "enp4s0f0": {
            "ip": null,
            "mac": "00:9c:02:b0:8f:a8"
        },
        "enp4s0f1": {
            "ip": "10.35.190.228",
            "mac": "00:9c:02:b0:8f:ac"
        },
        "enp5s0f0": {
            "ip": "10.35.163.21",
            "mac": "44:1e:a1:71:a7:c8"
        },
        "enp5s0f1": {
            "ip": "192.0.2.103",
            "mac": "44:1e:a1:71:a7:c9"
        }
    },
    "boot_interface": "44:1e:a1:71:a7:c9",
    "interfaces": {
        "enp5s0f1": {
            "ip": "192.0.2.103",
            "mac": "44:1e:a1:71:a7:c9"
        }
    },
    "inventory": {
        "interfaces": [
            {
                "ipv4_address": "10.35.190.228",
                "mac_address": "00:9c:02:b0:8f:ac",
                "name": "enp4s0f1",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": null,
                "mac_address": "00:9c:02:b0:8f:a8",
                "name": "enp4s0f0",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": "192.0.2.103",
                "mac_address": "44:1e:a1:71:a7:c9",
                "name": "enp5s0f1",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": "10.35.163.21",
                "mac_address": "44:1e:a1:71:a7:c8",
                "name": "enp5s0f0",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            }
        ],
    },
 
This is the introspection data for a node that times out:

{
    "all_interfaces": {
        "enp4s0f0": {
            "ip": "10.35.160.15",
            "mac": "00:9c:02:b0:8f:70"
        },
        "enp4s0f1": {
            "ip": "10.35.190.152",
            "mac": "00:9c:02:b0:8f:74"
        },
        "enp5s0f0": {
            "ip": "10.35.162.22",
            "mac": "44:1e:a1:73:34:be"
        },
        "enp5s0f1": {
            "ip": "192.0.2.101",
            "mac": "44:1e:a1:73:34:bf"
        }
    },
    "boot_interface": "44:1e:a1:73:34:bf",
    "interfaces": {
        "enp5s0f1": {
            "ip": "192.0.2.101",
            "mac": "44:1e:a1:73:34:bf"
        }
    },
    "inventory": {
        "interfaces": [
            {
                "ipv4_address": "10.35.190.152",
                "mac_address": "00:9c:02:b0:8f:74",
                "name": "enp4s0f1",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": "10.35.160.15",
                "mac_address": "00:9c:02:b0:8f:70",
                "name": "enp4s0f0",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": "192.0.2.101",
                "mac_address": "44:1e:a1:73:34:bf",
                "name": "enp5s0f1",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            },
            {
                "ipv4_address": "10.35.162.22",
                "mac_address": "44:1e:a1:73:34:be",
                "name": "enp5s0f0",
                "switch_chassis_descr": null,
                "switch_port_descr": null
            }
        ],

In this particular environment I think the timeout is caused by asymmetric routing since the undercloud's external interface has an IP address in the 10.35.160.0/24 subnet. For the node that times out requests originating from the undercloud to 10.35.190.152 reach on enp4s0f1 interface but the replies will be sent over the enp4s0f0 interface which doesn't work by default - https://access.redhat.com/solutions/53031
Comment 12 Dmitry Tantsur 2016-03-17 11:56:12 EDT
So, the introspection seemed to work correctly in the last case. Do I get it right that the problem is that IPA starts listening on the external NIC instead of one on the provisioning network? And then undercloud can't connect to it?
Comment 13 Marius Cornea 2016-03-17 12:06:09 EDT
(In reply to Dmitry Tantsur from comment #12)
> So, the introspection seemed to work correctly in the last case. Do I get it
> right that the problem is that IPA starts listening on the external NIC
> instead of one on the provisioning network? And then undercloud can't
> connect to it?

Yes, introspection runs fine but during the overcloud image deploy the iscsi target is listening on the external NIC instead of the provisioning network(the 192.0.2.x addresses).
Comment 14 Dmitry Tantsur 2016-03-17 12:53:24 EDT
Got it. I started working on a fix upstream, but as I'll be on PTO the next week, I'm not assigning the bug to myself in hope that Lucas finishes what I've started :)
Comment 15 Lucas Alvares Gomes 2016-03-21 13:02:25 EDT
Hi, an update on this problem.

I looked at Udi's environment and we found two problems there, inspection was failing with the following message:

Introspection for UUID c5a871a2-ae4a-4057-b9c4-a9daaf8b276c finished with error: Unexpected exception TypeError during processing: int() argument must be a string or a number, not 'NoneType'

I've fixed it locally and put a patch upstream to inspector: https://review.openstack.org/#/c/295327/  (We should open another BZ for this, I will do it).

Inspector passed after that.

Then I created another ramdisk* with the patch @Dmitry pointed out on comment #14 and tried to deployment and it worked: http://pastebin.test.redhat.com/358545

The patch for the deployment was approved upstream, now we are waiting it to merge and then we can work on backporting it.

* I've used the upstream tools.
Comment 16 Lucas Alvares Gomes 2016-03-23 05:36:14 EDT
The patch is now merged downstream
Comment 17 Thierry Vignaud 2016-03-23 10:09:36 EDT
Patch is included in ironic-python-agent-doc-1.1.0-8.el7ost.noarch.rpm
Comment 23 Udi 2016-04-17 06:43:20 EDT
This is still failing, also with the latest undionly.kpxe. I passed introspection but deployment still behaves as described in the bug.
Comment 24 Dmitry Tantsur 2016-04-18 03:36:43 EDT
Udi, could you please provide your logs showing the wrong IP address and the introspection data showing all IP addresses? (just like above)
Comment 25 Udi 2016-04-18 03:56 EDT
Created attachment 1148122 [details]
ironic logs after failed deployment

Ironic logs after the failed deployment are attached, and the output of ironic node-list looks as follows:

[stack@puma01 ~]$ ironic node-list
+----------+------+------------------+-------------+--------------------+-------------+
| UUID     | Name | Instance UUID    | Power State | Provisioning State | Maintenance |
+----------+------+------------------+-------------+--------------------+-------------+
| fec0e... | None | c32338ce-030d489 | power off   | deploying          | False       |     
| ee970... | None | 2d5e14e0-045160b | power off   | deploying          | False       |     
| f0f62... | None | c8bf2f1e-43471af | power off   | available          | False       |     
| 59fba... | None | 4ce17653-9b011bc | power on    | active             | False       |     
| e958e... | None | 5a656682-850a073 | power on    | wait call-back     | False       |     
| 8f3fd... | None | None             | power off   | available          | False       |     
+----------+------+------------------+-------------+--------------------+-------------+

*** Note that the failure is in deployment, not in introspection. The introspection issue was resolved.
Comment 26 Dmitry Tantsur 2016-04-18 04:14:11 EDT
Still, please show the introspection data (see http://tripleo.org/advanced_deployment/introspection_data.html). Also, what's your IPA image version?
Comment 28 Udi 2016-04-18 10:24:49 EDT
Marking the bug as verified. It was reopened by mistake after I had the wrong version of the images rpms. Verified in:
python-tripleoclient-0.3.4-3.el7ost.noarch
rhosp-director-images-8.0-20160415.1.el7ost.noarch
rhosp-director-images-ipa-8.0-20160415.1.el7ost.noarch

Note You need to log in before you can comment on or make changes to this bug.