+++ This bug was initially created as a clone of Bug #1322892 +++

Description of problem:

I've been deploying TripleO all day today, and after rebuilding the undercloud a couple of times, I always hit the same issue when introspecting. Here is an example with just one node:

[stack@undercloud ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: 304e594a-5585-4c25-a754-01864d872346
Waiting for introspection to finish...
Introspection for UUID 304e594a-5585-4c25-a754-01864d872346 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}
Setting manageable nodes to available...
Introspection completed with errors:
304e594a-5585-4c25-a754-01864d872346: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}

Physical configuration

The physical configuration was working before in Liberty, but just in case, I'm describing it here for your input. The undercloud (virtual) server connects with NIC 1 to the internet and NIC 2 to a non-internet-connected switch. The overcloud servers connect with NIC 1 to the same switch, and NIC 2 to the internet, or are unplugged.

Console

When looking at the console on the overcloud node, I can see that the machine obtains a valid DHCP lease on NIC 1 and sends a DHCPDISCOVER on NIC 2 that gets nowhere.
A warning message appears as:

ironic-python-agent[895]: date 895 WARNING ironic_python_agent.ironic_api_client [-] POST failed: HTTPConnectionPool(host='127.0.0.1', port=6385): Max retries exceeded with url: /v1/drivers/agent_ipmitool/vendor_passthru/lookup (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x2617f10>: Failed to establish a new connection: [Errno 111] Connection refused',))

If I try to run the command again, the server will not PXE boot again.

Mar 30 22:29:23 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:25 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:30 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Dmitry Tantsur on 2016-03-31 10:50:30 EDT ---

I realized we don't store ramdisk logs in this case. Could you please set always_store_ramdisk_logs to true in /etc/ironic-inspector/inspector.conf, restart openstack-ironic-inspector and rerun introspection? This should give you the ramdisk logs in /var/log/ironic-inspector/ramdisk.

--- Additional comment from Ignacio Bravo on 2016-03-31 12:11 EDT ---

Including the ramdisk log after modifying /etc/ironic-inspector/inspector.conf.

--- Additional comment from Dmitry Tantsur on 2016-04-01 07:29:44 EDT ---

Hmm, that's an error I thought we had fixed. Where does your introspection image (ironic-python-agent) come from? Did you build it yourself or download it from somewhere?

--- Additional comment from Ignacio Bravo on 2016-04-01 08:38:33 EDT ---

I tried both ways.
I first built it myself and had the errors, then downloaded it from https://ci.centos.org/artifacts/rdo/images/mitaka/delorean/stable/ and, after removing it from glance, reimporting it and recreating the ironic nodes, I got the same behavior.

--- Additional comment from Dmitry Tantsur on 2016-04-01 10:21:51 EDT ---

I've unpacked the image, and it has the fix in question. Still, in the logs I see:

Mar 31 16:01:20 localhost ironic-python-agent[823]: 2016-03-31 16:01:17.448 823 INFO ironic_python_agent.inspector [-] network interfaces: {'eth1': {'ip': None, 'mac': '10:60:4b:a9:b8:7c'}, 'eth0': {'ip': None, 'mac': '10:60:4b:a9:b8:78'}}
...
Mar 31 16:01:20 localhost dhcp-all-interfaces.sh[826]: Inspecting interface: eth0...Configured eth0
Mar 31 16:01:20 localhost ifup[883]: Determining IP information for eth0... done.

cc dprince

--- Additional comment from Dan Prince on 2016-04-01 10:26:17 EDT ---

Can traffic on NIC1 see the traffic on NIC2 in any way? If so, I'm wondering if this might be related to ARP flux (essentially, the DHCP queries are causing MAC address confusion on the DHCP server or something). If this is the cause, then some sysctl tuning of arp_filter, arp_ignore and arp_announce might help. Either that, or you can disconnect your NIC1/NIC2 switches so the traffic isn't getting across.

Another idea would be to try using the stable-interface-names element when building your IPA discovery ramdisk. This should help ensure NIC ordering upon reboots.

Just some ideas.

--- Additional comment from Dmitry Tantsur on 2016-04-01 11:08:37 EDT ---

In the previous (RDO-specific) ramdisk we had a lot of sleeps waiting for DHCP. I think we might introduce something like that.

--- Additional comment from Ignacio Bravo on 2016-04-01 11:28:00 EDT ---

(In reply to Dan Prince from comment #6)
> Can traffic on NIC1 see the traffic on NIC2 in any way.

No, both NICs belong to separate networks, configured on the blade enclosures.
The second NIC is either unconnected or connected to the Internet. NIC 1 is connected to an internal network with just the OpenStack servers, so there are no DHCP conflicts of any kind.

--- Additional comment from Dmitry Tantsur on 2016-04-06 13:39:50 EDT ---

Fixed in the Newton release. Unfortunately, it's not something we can backport...

--- Additional comment from Christopher Brown on 2016-04-14 10:49:43 EDT ---

I am seeing this too on RDO TripleO Mitaka and Red Hat OSP 8 RC.
Note: backported to Mitaka/OSPd9; this is still an issue for OSPd8.
I'm seeing this in OSP8 GA. The upstream fix [1] makes inspection wait 60 seconds for *all* NICs to get their IP addresses. It looks like a custom image with this change should be a workaround.

A crude workaround is to delete and re-introspect in a loop until all of the nodes return no errors [2]. As it's a race condition, this crude approach worked for me on the third iteration, with six nodes introspecting with ironic-python-agent-8.0-20160415.1.el7ost.tar.

[1] https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3fba1ee8db0aa0b1519ef2135e602268488570f4

[2]
while true; do
    for uuid in $(ironic node-list | awk '{print $2}' | egrep -v "UUID|^$"); do
        ironic node-set-power-state $uuid off
        ironic node-delete $uuid
    done
    openstack baremetal import --json ~/instackenv.json
    openstack baremetal configure boot
    openstack baremetal introspection bulk start
    intro_done=$(openstack baremetal introspection bulk status | awk '{print $6}' | egrep -v "^$|\||None")
    if [[ -z "$intro_done" ]]; then
        break
    fi
done
Since I've got the exact same problem, I generated the ramdisk logs while the action was failing; you can download them from here [1]. The output of the introspection was the one mentioned above:

...
Introspection completed with errors:
94f12e03-c04e-474f-95ca-6c323e9aa6b0: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:f4:42'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:f4:45'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:f4:40'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:f4:44'}}
065dfc49-4831-4037-9aee-4b4b61765b34: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:d6:da'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:d6:dd'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:d6:d8'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:d6:dc'}}
41135c0b-2584-479c-9b10-226f94d4c41b: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:e3:82'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:e3:85'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:e3:80'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:e3:84'}}
fce0d063-ed57-4f48-beb1-a10b0af3631d: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5a'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:ea:58'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5d'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5c'}}
37063fce-a18b-42a9-9a85-5f67d2fe4ed5: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:d7:02'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:d7:05'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:d7:00'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:d7:04'}}

[1] http://file.rdu.redhat.com/~rscarazz/BZ1327255/
Raoul, weird, it looks like your images are old. Could you please paste the output of the following:

curl -s -H "X-Auth-Token: $(openstack token issue -f value -c id)" http://127.0.0.1:5050/v1/introspection/<UUID>/data | jq '.inventory.interfaces'

where <UUID> is any node with failed introspection?
Oh, one fix: the command should be run for a node with *successful* introspection, not failed. Otherwise the data won't be there.
Yes, in fact running the command on an unsuccessful node gives just "null"; on a successful one this is the output:

[stack@macb8ca3a66dcd8 ~]$ curl -s -H "X-Auth-Token: $(openstack token issue -f value -c id)" http://127.0.0.1:5050/v1/introspection/41135c0b-2584-479c-9b10-226f94d4c41b/data | jq '.inventory.interfaces'
[
  {
    "mac_address": "b8:ca:3a:66:e3:80",
    "ipv4_address": "10.1.241.7",
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": true,
    "name": "eth1"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:84",
    "ipv4_address": null,
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": false,
    "name": "eth0"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:85",
    "ipv4_address": null,
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": false,
    "name": "eth2"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:82",
    "ipv4_address": "192.0.2.103",
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": true,
    "name": "eth3"
  }
]
And one more thing: please grab the inspector logs for the relevant time (sudo journalctl -u openstack-ironic-inspector)
Done, they are here: http://file.rdu.redhat.com/~rscarazz/BZ1327255/20160506_openstack-ironic-inspector.logs
Thanks, that seems to explain the problem. We rely on /sys/class/net/XXX/carrier to detect whether we have to wait for an IP address. In your case carrier=0 for all interfaces, even though that's obviously nonsense. We have to stop relying on it.
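[Editorial note] For illustration, here is a minimal shell sketch of the kind of carrier check described in the comment above. The real logic lives in ironic-python-agent's Python code; the `has_carrier` helper name and the sysfs-root parameter are hypothetical, the latter added only so the sketch can be exercised against a fake sysfs tree.

```shell
# Hypothetical sketch of a sysfs-based carrier check.
# /sys/class/net/<iface>/carrier reads "1" when the kernel believes a
# link is present and "0" otherwise; the read can also fail outright
# (e.g. while the interface is administratively down), which this
# sketch treats the same as "no carrier".
has_carrier() {
    # $1: interface name; $2: sysfs root (defaults to the real one)
    local sysfs_root=${2:-/sys/class/net}
    [ "$(cat "$sysfs_root/$1/carrier" 2>/dev/null)" = "1" ]
}
```

The bug discussed here is precisely that this flag can read 0 even on a cabled, working NIC, so it is not a safe signal for deciding whether to wait for DHCP.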
Hi Dmitry, I recently patched os-net-config to use operstate instead. Not sure if it's worth backporting that patch.
I don't think operstate is any more reliable. It's also not strictly what I'm looking for: I wanted to optimize the case where the cable is not there at all, but I'll just drop the optimization in favor of a slower but more reliable code path.
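[Editorial note] The "slower but more reliable code path" mentioned above amounts to polling until a condition holds or a timeout expires, instead of skipping interfaces whose carrier flag reads 0. A generic shell sketch of that retry pattern follows; the `wait_for` helper is hypothetical (the actual fix is in ironic-python-agent's Python code, with the 60-second wait described in the upstream commit), and the check command is injectable only to keep the sketch self-contained.

```shell
# Hypothetical sketch of a poll-until-timeout loop. In the real fix
# the condition would be "this interface obtained an IP address";
# here it is an arbitrary command evaluated once per second.
wait_for() {
    # $1: check command (evaluated with eval); $2: timeout in seconds
    local check=$1 timeout=${2:-60} elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if eval "$check"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

For example, `wait_for "ip -4 -o addr show dev eth0 | grep -q inet" 60` would wait up to a minute for eth0 to get an IPv4 address, regardless of what the carrier flag claims.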
Hi Chris,

Can you advise whether patching os-net-config to use operstate worked and solved the problem?

Thanks,
Raviv
*** Bug 1337659 has been marked as a duplicate of this bug. ***
(In reply to Raviv Bar-Tal from comment #13)
> Hi Chris,
> Can you advice if the patching of os-net-config to use operstate, worked and
> solved the problem?

I have not tried it in this scenario. My primary reason is that Lenovo and IBM hardware comes with USB-to-Ethernet adapters for out-of-band management, which the current os-net-config evaluates as a valid interface and attempts to provision if you use the nic1, nic2, etc. naming scheme for network interfaces in the overcloud nic-config YAML files.
Hello, this is still not working despite updating to current-ci and rebuilding the images.
Hi! If this is about OSPd, please provide your image versions. If it's about RDO, please move it to https://bugzilla.redhat.com/show_bug.cgi?id=1322892, and provide the version of IPA available in your repositories and the ramdisk logs as explained in https://bugzilla.redhat.com/show_bug.cgi?id=1322892#c1. This bug is about OSPd specifically.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1229