Bug 1322892 - No valid interfaces found during introspection
Summary: No valid interfaces found during introspection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: RDO
Classification: Community
Component: openstack-ironic-discoverd
Version: trunk
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
: trunk
Assignee: Dmitry Tantsur
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks: 1327255 1346022
Reported: 2016-03-31 14:31 UTC by Ignacio Bravo
Modified: 2020-08-13 08:25 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1327255 (view as bug list)
Environment:
Last Closed: 2017-06-19 20:29:57 UTC


Attachments
Ironic Inspector logs. Issue described occurred on 2015-03-30 (11.74 MB, text/plain)
2016-03-31 14:31 UTC, Ignacio Bravo
ramdisk logs (22.12 KB, application/x-gzip)
2016-03-31 16:11 UTC, Ignacio Bravo


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1564954 | 0 | None | None | None | 2016-04-01 15:08:37 UTC
OpenStack gerrit | 300548 | 0 | None | MERGED | Wait for the interfaces to get IP addresses before inspection | 2020-07-30 18:04:16 UTC
OpenStack gerrit | 305916 | 0 | None | MERGED | Wait for the interfaces to get IP addresses before inspection | 2020-07-30 18:04:16 UTC

Description Ignacio Bravo 2016-03-31 14:31:22 UTC
Created attachment 1142241 [details]
Ironic Inspector logs. Issue described occurred on 2015-03-30

Description of problem:
I've been deploying TripleO all day today, and after rebuilding the undercloud a couple of times, I always hit the same issue when introspecting. Here is an example with just one node:

[stack@undercloud ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: 304e594a-5585-4c25-a754-01864d872346
Waiting for introspection to finish...
Introspection for UUID 304e594a-5585-4c25-a754-01864d872346 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}
Setting manageable nodes to available...
Introspection completed with errors:
304e594a-5585-4c25-a754-01864d872346: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}


Physical configuration
The physical configuration worked before on Liberty, but just in case, I'm describing it here for your input.
The undercloud (virtual) server connects NIC 1 to the Internet and NIC 2 to a switch with no Internet connectivity.
The overcloud servers connect NIC 1 to that same switch; NIC 2 is either connected to the Internet or unplugged.

Console
When looking at the console on the overcloud node, I can see that the machine obtains a valid DHCP lease on NIC 1 and sends a DHCPDISCOVER on NIC 2 that gets nowhere.
A warning message appears as:

ironic-python-agent[895]: date 895 WARNING ironic_python_agent.ironic_api_client [-] POST failed: HTTPConnectionPool(host='127.0.0.1', port=6385): Max retries exceeded with url: /v1/drivers/agent_ipmitool/vendor_passthru/lookup (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x2617f10>: Failed to establish a new connection: [Errno 111] Connection refused',))

If I try to run the command again, the server will not PXE boot again.

Mar 30 22:29:23 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:25 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:30 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Dmitry Tantsur 2016-03-31 14:50:30 UTC
I realized we don't store ramdisk logs in this case. Could you please set always_store_ramdisk_logs to true in /etc/ironic-inspector/inspector.conf, restart openstack-ironic-inspector, and rerun introspection? This should give you the ramdisk logs in /var/log/ironic-inspector/ramdisk.
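
The configuration change requested above could be scripted roughly as follows. This is a minimal sketch, not an official procedure: the helper name `enable_ramdisk_logs` is made up for illustration, and it assumes the option lives in the `[processing]` section of inspector.conf.

```shell
# Hypothetical helper (not part of ironic-inspector): force
# "always_store_ramdisk_logs = true" in the [processing] section of the
# given config file. Assumption: the option belongs in [processing].
enable_ramdisk_logs() {
    erl_conf="$1"
    erl_tmp="$(mktemp)"
    awk '
        # On every section header: if we are leaving [processing] without
        # having written the option, write it before the next header.
        /^\[/ {
            if (in_sec && !done) { print "always_store_ramdisk_logs = true"; done = 1 }
            in_sec = ($0 == "[processing]")
        }
        # Replace an existing (possibly commented-out) setting in place.
        in_sec && /^#?[ \t]*always_store_ramdisk_logs[ \t]*=/ {
            if (!done) { print "always_store_ramdisk_logs = true"; done = 1 }
            next
        }
        { print }
        # Option never written: append it, adding the section if needed.
        END {
            if (!done) {
                if (!in_sec) print "[processing]"
                print "always_store_ramdisk_logs = true"
            }
        }
    ' "$erl_conf" > "$erl_tmp" && cat "$erl_tmp" > "$erl_conf"
    rm -f "$erl_tmp"
}

# Usage on the undercloud (run as root), followed by the restart:
#   enable_ramdisk_logs /etc/ironic-inspector/inspector.conf
#   systemctl restart openstack-ironic-inspector
```

After the next introspection run, the logs should appear under /var/log/ironic-inspector/ramdisk, as described above.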

Comment 2 Ignacio Bravo 2016-03-31 16:11:58 UTC
Created attachment 1142296 [details]
ramdisk logs

Including the ramdisk log after modifying /etc/ironic-inspector/inspector.conf

Comment 3 Dmitry Tantsur 2016-04-01 11:29:44 UTC
Hmm, that's an error which I thought we'd fixed. Where does your introspection image (ironic-python-agent) come from? Did you build it yourself or download it from somewhere?

Comment 4 Ignacio Bravo 2016-04-01 12:38:33 UTC
I tried both ways. I first built it myself and got the errors. I then downloaded the images from https://ci.centos.org/artifacts/rdo/images/mitaka/delorean/stable/ and, after removing the old ones from Glance, re-importing, and recreating the Ironic nodes, I got the same behavior.

Comment 5 Dmitry Tantsur 2016-04-01 14:21:51 UTC
I've unpacked the image, and it has the fix in question. Still, in logs I see:

Mar 31 16:01:20 localhost ironic-python-agent[823]: 2016-03-31 16:01:17.448 823 INFO ironic_python_agent.inspector [-] network interfaces: {'eth1': {'ip': None, 'mac': '10:60:4b:a9:b8:7c'}, 'eth0': {'ip': None, 'mac': '10:60:4b:a9:b8:78'}}
...
Mar 31 16:01:20 localhost dhcp-all-interfaces.sh[826]: Inspecting interface: eth0...Configured eth0
Mar 31 16:01:20 localhost ifup[883]: Determining IP information for eth0... done.

cc dprince

Comment 6 Dan Prince 2016-04-01 14:26:17 UTC
Can traffic on NIC1 see the traffic on NIC2 in any way? If so, I'm wondering if this might be related to ARP flux (essentially, the DHCP queries are causing MAC address confusion on the DHCP server or something). If this is the cause, then some sysctl tuning of arp_filter, arp_ignore, and arp_announce might help. Either that, or you can disconnect your NIC1 and NIC2 switches so the traffic isn't getting across.
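
For reference, the sysctl tuning mentioned above might look like the sketch below. The values (arp_filter=1, arp_ignore=1, arp_announce=2) are commonly used ARP flux mitigations and are an assumption here, not settings taken from this bug; the function name and the target file name are hypothetical.

```shell
# Hypothetical helper: print sysctl settings commonly used against ARP flux.
# The values are assumptions for illustration, not tested against this bug.
arp_flux_sysctl_conf() {
    for scope in all default; do
        # Answer ARP only on the interface that would route back to the sender.
        printf 'net.ipv4.conf.%s.arp_filter = 1\n' "$scope"
        # Reply only if the target IP is configured on the receiving interface.
        printf 'net.ipv4.conf.%s.arp_ignore = 1\n' "$scope"
        # Always use the best local source address for ARP requests.
        printf 'net.ipv4.conf.%s.arp_announce = 2\n' "$scope"
    done
}

# Usage (as root) on the host that has both NICs on the same segment:
#   arp_flux_sysctl_conf > /etc/sysctl.d/90-arp-flux.conf
#   sysctl --system
```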

Another idea would be to try using the stable-interface-names element too when building your IPA discovery ramdisk. This should help ensure NIC ordering upon reboots.

Just some ideas.

Comment 7 Dmitry Tantsur 2016-04-01 15:08:37 UTC
In the previous (RDO-specific) ramdisk we had a lot of sleeps waiting for DHCP. I think we might introduce something like that.
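
The idea of waiting for DHCP before collecting interface data can be sketched as a polling loop with a timeout. This is only an illustration of the approach in shell; it is not the ironic-python-agent implementation, and the function names `has_ipv4` and `wait_for_ips` are hypothetical.

```shell
# has_ipv4 IFACE: succeeds once IFACE has at least one IPv4 address.
has_ipv4() {
    ip -4 -o addr show dev "$1" 2>/dev/null | grep -q 'inet '
}

# wait_for_ips TIMEOUT IFACE...: poll once per second until every listed
# interface has an IPv4 address, or give up after TIMEOUT seconds.
wait_for_ips() {
    timeout="$1"; shift
    elapsed=0
    while :; do
        missing=""
        for iface in "$@"; do
            has_ipv4 "$iface" || missing="$missing $iface"
        done
        # All interfaces configured: inspection can proceed.
        [ -z "$missing" ] && return 0
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "no IPv4 address after ${timeout}s on:$missing" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage: wait up to 60 seconds for both NICs from the report.
#   wait_for_ips 60 eth0 eth1
```

Polling with a deadline, rather than a fixed sleep, lets the fast case finish immediately while still bounding the slow case. Waiting on every interface does mean that an unplugged NIC, like NIC 2 in this report, always costs the full timeout.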

Comment 8 Ignacio Bravo 2016-04-01 15:28:00 UTC
(In reply to Dan Prince from comment #6)
> Can traffic on NIC1 see the traffic on NIC2 in any way.

No, the two NICs belong to separate networks, configured on the blade enclosures. The second NIC is either unconnected or connected to the Internet. NIC 1 is connected to an internal network with just the OpenStack servers, so there are no DHCP conflicts of any kind.

Comment 9 Dmitry Tantsur 2016-04-06 17:39:50 UTC
Fixed in the newton release. Unfortunately, it's not something we can backport...

Comment 10 Christopher Brown 2016-04-14 14:49:43 UTC
I am seeing this too on RDO tripleo Mitaka and Red Hat OSP 8 RC.

Comment 11 Dmitry Tantsur 2016-04-14 15:17:19 UTC
After more upstream discussion, we decided that this bug fix is backportable.

Comment 12 Christopher Brown 2016-04-14 15:25:49 UTC
(In reply to Dmitry Tantsur from comment #11)
> After more upstream discussion, we decided that this bug fix is backportable.

Thanks, please let me know if you need more logs, testing etc.

Comment 13 John Fulton 2016-05-04 01:58:27 UTC
I'm seeing this in OSP8 GA. The upstream fix [1] makes inspection wait 60 seconds for *all* NICs to get their IP addresses. A custom image with this change should work around the problem. A cruder workaround is to delete and re-introspect in a loop until all of the nodes return no errors [2]. As it's a race condition, this crude approach worked for me on the third iteration, with six nodes introspecting with ironic-python-agent-8.0-20160415.1.el7ost.tar.

[1] https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3fba1ee8db0aa0b1519ef2135e602268488570f4

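[2]
# Crude workaround (bash): delete, re-import, and re-introspect all nodes
# until no node reports an introspection error.
while true; do
    # Power off and delete every node currently registered in Ironic.
    for uuid in $(ironic node-list | awk '{print $2}' | grep -Ev 'UUID|^$'); do
        ironic node-set-power-state "$uuid" off
        ironic node-delete "$uuid"
    done

    # Re-register the nodes and start introspection again.
    openstack baremetal import --json ~/instackenv.json
    openstack baremetal configure boot
    openstack baremetal introspection bulk start

    # Field 6 is the first word of the Error column; stop once it is
    # empty or "None" for every node.
    intro_done=$(openstack baremetal introspection bulk status | awk '{print $6}' | grep -Ev '^$|\||None')
    if [[ -z "$intro_done" ]]; then
        break
    fi
done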
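
The loop above retries forever if a node can never pass introspection. A bounded variant of the same retry idea could look like this sketch; `retry_until_clean` and `introspect_once` are hypothetical names, not existing commands.

```shell
# retry_until_clean MAX CMD [ARGS...]: run CMD until it succeeds,
# giving up after MAX attempts.
retry_until_clean() {
    max="$1"; shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max attempts" >&2
    return 1
}

# Usage: wrap one delete/import/introspect pass in a function that returns
# non-zero when any node reports an error (hypothetical), then:
#   retry_until_clean 5 introspect_once
```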

Comment 14 Dmitry Tantsur 2016-05-04 13:02:29 UTC
John, we have a tracking bug for OSPd8, please move your comment there: https://bugzilla.redhat.com/show_bug.cgi?id=1327255

Comment 15 Dmitry Tantsur 2016-05-11 13:20:20 UTC
We've made one more patch to stable/mitaka. Hopefully, it will fix this problem as soon as it propagates to RDO (which may take substantial time).

Comment 16 Graeme Gillies 2016-05-26 03:59:25 UTC
Hi,

I am hitting this issue with RDO Mitaka. When can we expect the backport to make it to RDO?

Regards,

Graeme

Comment 17 Dmitry Tantsur 2016-05-26 07:51:36 UTC
It depends on when the images will get updated to contain the fix.

Comment 18 Alan Pevec 2016-05-30 23:38:14 UTC
(In reply to Dmitry Tantsur from comment #17)
> It depends on when the images will get updated to contain the fix.

Which image? And isn't there a procedure for users to rebuild the image on their own?
When moving to MODIFIED please provide NVR in Fixed In Version field.

For fixes which landed on stable/mitaka, you can use packages from RDO Mitaka trunk repo: https://trunk.rdoproject.org/centos7-mitaka/current-passed-ci/delorean.repo
Images built by the tripleo-quickstart CI gate job for that repo are saved at http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/delorean/

Comment 19 Christopher Brown 2016-05-31 06:58:44 UTC
I'd really suggest this warrants a release to Stable-Mitaka rather than asking users to rebuild from mitaka trunk.

Comment 20 Charles Short 2016-06-10 10:23:15 UTC
> Images build by tripleo-quickstart CI gate job for that repo are saved at
> http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/delorean/

Hi, 

I just installed TripleO Mitaka Stable and had the same issue (with 15 HP Blades).
I was using my own CentOS images. I replaced the images with the ones you suggested above and now the issue has gone away.
Thank you

Charles

Comment 21 Charles Short 2016-06-17 08:03:10 UTC
(In reply to Charles Short from comment #20)
> > Images build by tripleo-quickstart CI gate job for that repo are saved at
> > http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/delorean/
> 
> Hi, 
> 
> I just installed TripleO Mitaka Stable and had the same issue (with 15 HP
> Blades).
> I was using my own CentOS images. I replaced the images with the ones you
> suggested above and now the issue has gone away.
> Thank you
> 
> Charles

To add: the images do not have python-redis installed, so in an HA TripleO deployment Ceilometer fails to start in Pacemaker. This can be resolved by simply installing python-redis.

Charles

Comment 22 jwang 2016-10-08 10:43:59 UTC
I hit the same issue in an OSP 9 virtual test environment.
===================================================

10-08 06:33:22.695 25401 ERROR ironic_inspector.utils [-] [node: MAC 52:54:00:5a:6b:12] The following failures happened during running 
Preprocessing hook validate_interfaces: No suitable interfaces found in {u'ens4': {'ip': None, 'mac': u'52:54:00:5a:6b:12'}, u'ens3': {'ip': u'172.16.0.150', 'mac': u'52:54:00:5a:6b:11'}}
 Look up error: Could not find a node for attributes {'bmc_address': u'', 'mac': None}

Comment 23 jwang 2016-10-08 13:22:04 UTC
After changing the NIC model of the undercloud and overcloud nodes from e1000 to virtio and following the steps in Bug 1234601 Comment 19, introspection succeeded.
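
For libvirt-based virtual environments, the NIC model change could be done along these lines. This is a sketch under assumptions: the filter name `e1000_to_virtio` and the domain name `node0` are hypothetical, and the steps from Bug 1234601 Comment 19 are not reproduced here.

```shell
# Hypothetical filter: rewrite e1000 NIC models to virtio in libvirt
# domain XML read from stdin. Handles single- or double-quoted attributes.
e1000_to_virtio() {
    sed -E "s/(<model type=['\"])e1000(['\"] *\/>)/\1virtio\2/"
}

# Usage with a domain named node0 (assumed), while the domain is shut off:
#   virsh dumpxml node0 | e1000_to_virtio > node0.xml
#   virsh define node0.xml
```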

https://bugzilla.redhat.com/show_bug.cgi?id=1234601#c19

Comment 24 Christopher Brown 2017-06-19 20:29:57 UTC
OK, since this is fixed in the updated images, closing.

