+++ This bug was initially created as a clone of Bug #1322892 +++

Description of problem:

I've been deploying TripleO all day today, and after rebuilding the undercloud a couple of times, I always hit the same issue when introspecting. Here is an example with just one node:

[stack@undercloud ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: 304e594a-5585-4c25-a754-01864d872346
Waiting for introspection to finish...
Introspection for UUID 304e594a-5585-4c25-a754-01864d872346 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}
Setting manageable nodes to available...
Introspection completed with errors:
304e594a-5585-4c25-a754-01864d872346: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}

Physical configuration

The physical configuration was working before in Liberty, but just in case, I'm describing it here for your input. The undercloud (virtual) server connects with NIC 1 to the internet and NIC 2 to a non-internet-connected switch. The overcloud servers connect with NIC 1 to the same switch, and NIC 2 to the internet, or are unplugged.

Console

When looking at the console on the overcloud node, I can see that the machine obtains a valid DHCP lease on NIC 1 and sends a DHCPDISCOVER on NIC 2 that gets nowhere.
A warning message appears as:

ironic-python-agent[895]: date 895 WARNING ironic_python_agent.ironic_api_client [-] POST failed: HTTPConnectionPool(host='127.0.0.1', port=6385): Max retries exceeded with url: /v1/drivers/agent_ipmitool/vendor_passthru/lookup (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x2617f10>: Failed to establish a new connection: [Errno 111] Connection refused',))

If I try to run the command again, the server will not PXE boot again.

Mar 30 22:29:23 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:25 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:30 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Dmitry Tantsur on 2016-03-31 10:50:30 EDT ---

I realized we don't store ramdisk logs in this case. Could you please set always_store_ramdisk_logs to true in /etc/ironic-inspector/inspector.conf, restart openstack-ironic-inspector and rerun introspection? This should give you the ramdisk logs in /var/log/ironic-inspector/ramdisk.

--- Additional comment from Ignacio Bravo on 2016-03-31 12:11 EDT ---

Including the ramdisk log after modifying /etc/ironic-inspector/inspector.conf.

--- Additional comment from Dmitry Tantsur on 2016-04-01 07:29:44 EDT ---

Hmm, that's an error I thought we had fixed. Where does your introspection image (ironic-python-agent) come from? Did you build it yourself or download it from somewhere?

--- Additional comment from Ignacio Bravo on 2016-04-01 08:38:33 EDT ---

I tried both ways.
I first built it myself and had the errors, then downloaded it from https://ci.centos.org/artifacts/rdo/images/mitaka/delorean/stable/ and, after removing it from glance, reimporting it and recreating the ironic nodes, I got the same behavior.

--- Additional comment from Dmitry Tantsur on 2016-04-01 10:21:51 EDT ---

I've unpacked the image, and it has the fix in question. Still, in the logs I see:

Mar 31 16:01:20 localhost ironic-python-agent[823]: 2016-03-31 16:01:17.448 823 INFO ironic_python_agent.inspector [-] network interfaces: {'eth1': {'ip': None, 'mac': '10:60:4b:a9:b8:7c'}, 'eth0': {'ip': None, 'mac': '10:60:4b:a9:b8:78'}}
...
Mar 31 16:01:20 localhost dhcp-all-interfaces.sh[826]: Inspecting interface: eth0...Configured eth0
Mar 31 16:01:20 localhost ifup[883]: Determining IP information for eth0... done.

cc dprince

--- Additional comment from Dan Prince on 2016-04-01 10:26:17 EDT ---

Can traffic on NIC1 see the traffic on NIC2 in any way? If so, I'm wondering if this might be related to ARP flux (essentially, the DHCP queries are causing MAC address confusion on the DHCP server or something). If this is the cause, then some sysctl tuning of arp_filter, arp_ignore and arp_announce might help. Either that, or you can disconnect your NIC1/NIC2 switches so the traffic isn't getting across.

Another idea would be to try using the stable-interface-names element when building your IPA discovery ramdisk. This should help ensure NIC ordering upon reboots.

Just some ideas.

--- Additional comment from Dmitry Tantsur on 2016-04-01 11:08:37 EDT ---

In the previous (RDO-specific) ramdisk we had a lot of sleeps waiting for DHCP. I think we might introduce something like that.

--- Additional comment from Ignacio Bravo on 2016-04-01 11:28:00 EDT ---

(In reply to Dan Prince from comment #6)
> Can traffic on NIC1 see the traffic on NIC2 in any way.

No, both NICs belong to separate networks, configured on the blade enclosures.
The second NIC is either unconnected or connected to the Internet. NIC 1 is connected to an internal network with just the OpenStack servers, so there are no DHCP conflicts of any kind.

--- Additional comment from Dmitry Tantsur on 2016-04-06 13:39:50 EDT ---

Fixed in the Newton release. Unfortunately, it's not something we can backport...

--- Additional comment from Christopher Brown on 2016-04-14 10:49:43 EDT ---

I am seeing this too on RDO TripleO Mitaka and Red Hat OSP 8 RC.
Note: backported to Mitaka/OSPd9; this is still an issue for OSPd8.
I'm seeing this in OSP8 GA. The upstream fix [1] makes inspection wait 60 seconds for *all* NICs to get their IP addresses. It looks like a custom image with this change should be a workaround.

A crude workaround is to delete and re-introspect in a loop until all of the nodes return no errors [2]. As it's a race condition, this crude approach worked for me on the third iteration, with six nodes introspecting with ironic-python-agent-8.0-20160415.1.el7ost.tar.

[1] https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3fba1ee8db0aa0b1519ef2135e602268488570f4

[2]
while true; do
    for uuid in $(ironic node-list | awk '{print $2}' | egrep -v "UUID|^$"); do
        ironic node-set-power-state $uuid off
        ironic node-delete $uuid
    done
    openstack baremetal import --json ~/instackenv.json
    openstack baremetal configure boot
    openstack baremetal introspection bulk start
    intro_done=$(openstack baremetal introspection bulk status | awk '{print $6}' | egrep -v "^$|\||None")
    if [[ -z "$intro_done" ]]; then
        break
    fi
done
Since I've got the exact same problem, I generated the ramdisk logs while the action was failing; you can download them from here [1]. The output of the introspection was the one mentioned above:

...
Introspection completed with errors:
94f12e03-c04e-474f-95ca-6c323e9aa6b0: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:f4:42'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:f4:45'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:f4:40'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:f4:44'}}
065dfc49-4831-4037-9aee-4b4b61765b34: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:d6:da'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:d6:dd'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:d6:d8'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:d6:dc'}}
41135c0b-2584-479c-9b10-226f94d4c41b: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:e3:82'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:e3:85'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:e3:80'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:e3:84'}}
fce0d063-ed57-4f48-beb1-a10b0af3631d: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5a'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:ea:58'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5d'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:ea:5c'}}
37063fce-a18b-42a9-9a85-5f67d2fe4ed5: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth3': {'ip': None, 'mac': u'b8:ca:3a:66:d7:02'}, u'eth2': {'ip': None, 'mac': u'b8:ca:3a:66:d7:05'}, u'eth1': {'ip': None, 'mac': u'b8:ca:3a:66:d7:00'}, u'eth0': {'ip': None, 'mac': u'b8:ca:3a:66:d7:04'}}

[1] http://file.rdu.redhat.com/~rscarazz/BZ1327255/
Raoul, weird, it looks like your images are old. Could you please paste the output of the following:

curl -s -H "X-Auth-Token: $(openstack token issue -f value -c id)" http://127.0.0.1:5050/v1/introspection/<UUID>/data | jq '.inventory.interfaces'

where <UUID> is any node with failed introspection?
Oh, one fix: the command should be run for a node with *successful* introspection, not failed. Otherwise the data won't be there.
Yes, in fact running the command on an unsuccessful node gives just "null"; on a successful one this is the output:

[stack@macb8ca3a66dcd8 ~]$ curl -s -H "X-Auth-Token: $(openstack token issue -f value -c id)" http://127.0.0.1:5050/v1/introspection/41135c0b-2584-479c-9b10-226f94d4c41b/data | jq '.inventory.interfaces'
[
  {
    "mac_address": "b8:ca:3a:66:e3:80",
    "ipv4_address": "10.1.241.7",
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": true,
    "name": "eth1"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:84",
    "ipv4_address": null,
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": false,
    "name": "eth0"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:85",
    "ipv4_address": null,
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": false,
    "name": "eth2"
  },
  {
    "mac_address": "b8:ca:3a:66:e3:82",
    "ipv4_address": "192.0.2.103",
    "switch_chassis_descr": null,
    "switch_port_descr": null,
    "has_carrier": true,
    "name": "eth3"
  }
]
And one more thing: please grab the inspector logs for the relevant time (sudo journalctl -u openstack-ironic-inspector)
Done, they are here: http://file.rdu.redhat.com/~rscarazz/BZ1327255/20160506_openstack-ironic-inspector.logs
Thanks, that seems to explain the problem. We rely on /sys/class/net/XXX/carrier to detect whether we have to wait for an IP address. In your case carrier=0 for all interfaces, even though that's obviously nonsense. We have to stop relying on it.
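[Editorial note] For illustration, here is a minimal shell sketch of the kind of carrier check described in the comment above. The real logic lives in ironic-python-agent's Python code; the `has_carrier` helper name and the sysfs-root parameter are hypothetical, the latter added only so the sketch can be exercised against a fake sysfs tree.

```shell
# Hypothetical sketch of a sysfs-based carrier check.
# /sys/class/net/<iface>/carrier reads "1" when the kernel believes a
# link is present and "0" otherwise; the read can also fail outright
# (e.g. while the interface is administratively down), which this
# sketch treats the same as "no carrier".
has_carrier() {
    # $1: interface name; $2: sysfs root (defaults to the real one)
    local sysfs_root=${2:-/sys/class/net}
    [ "$(cat "$sysfs_root/$1/carrier" 2>/dev/null)" = "1" ]
}
```

The bug discussed here is precisely that this flag can read 0 even on a cabled, working NIC, so it is not a safe signal for deciding whether to wait for DHCP.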
Hi Dmitry, I recently patched os-net-config to use operstate instead. Not sure if it's worth backporting that patch.
I don't think operstate is any more reliable. It's also not strictly what I'm looking for: I wanted to optimize the case where the cable is not there at all, but I'll just drop the optimization in favor of a slower but more reliable code path.
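[Editorial note] The "slower but more reliable code path" mentioned above amounts to polling until a condition holds or a timeout expires, instead of skipping interfaces whose carrier flag reads 0. A generic shell sketch of that retry pattern follows; the `wait_for` helper is hypothetical (the actual fix is in ironic-python-agent's Python code, with the 60-second wait described in the upstream commit), and the check command is injectable only to keep the sketch self-contained.

```shell
# Hypothetical sketch of a poll-until-timeout loop. In the real fix
# the condition would be "this interface obtained an IP address";
# here it is an arbitrary command evaluated once per second.
wait_for() {
    # $1: check command (evaluated with eval); $2: timeout in seconds
    local check=$1 timeout=${2:-60} elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if eval "$check"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

For example, `wait_for "ip -4 -o addr show dev eth0 | grep -q inet" 60` would wait up to a minute for eth0 to get an IPv4 address, regardless of what the carrier flag claims.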
Hi Chris,

Can you advise whether patching os-net-config to use operstate worked and solved the problem?

Thanks,
Raviv
*** Bug 1337659 has been marked as a duplicate of this bug. ***
(In reply to Raviv Bar-Tal from comment #13)
> Hi Chris,
> Can you advice if the patching of os-net-config to use operstate, worked and
> solved the problem?

I have not tried it in this scenario. My primary reason is that Lenovo and IBM hardware comes with USB-to-Ethernet adapters for out-of-band management, which the current os-net-config evaluates as a valid interface and attempts to provision if you use the nic1, nic2, etc. naming scheme for network interfaces in the overcloud nic-config YAML files.
Hello, this is still not working despite updating to current-ci and rebuilding the images.
Hi! If this is about OSPd, please provide your image versions. If it's about RDO, please move it to https://bugzilla.redhat.com/show_bug.cgi?id=1322892, and provide the version of IPA available in your repositories and the ramdisk logs as explained in https://bugzilla.redhat.com/show_bug.cgi?id=1322892#c1. This bug is about OSPd specifically.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1229