Bug 1480597 - 'openstack introspection bulk start' fails randomly with error: Preprocessing hook validate_interfaces: No suitable interfaces found
'openstack introspection bulk start' fails randomly with error: Preprocessin...
Status: POST
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic (Show other bugs)
10.0 (Newton)
Unspecified Linux
medium Severity medium
: Upstream M1
: 13.0 (Queens)
Assigned To: Bob Fournier
mlammon
: Reopened, Triaged
Depends On:
Blocks: 1473267 1499268
  Show dependency treegraph
 
Reported: 2017-08-11 09:00 EDT by KOSAL RAJ I
Modified: 2017-10-06 11:09 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1499268 (view as bug list)
Environment:
Last Closed: 2017-09-20 13:10:34 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
ramdisk logs requested (30.68 KB, application/x-gzip)
2017-09-25 12:32 EDT, Jim Bagwell
no flags Details
2nd set of ramdisk logs (30.46 KB, application/x-gzip)
2017-09-25 12:33 EDT, Jim Bagwell
no flags Details
3rd set of ramdisk logs (30.17 KB, application/x-gzip)
2017-09-25 12:33 EDT, Jim Bagwell
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1721051 None None None 2017-10-04 14:31 EDT
OpenStack gerrit 509467 None None None 2017-10-04 14:32 EDT

  None (edit)
Description KOSAL RAJ I 2017-08-11 09:00:45 EDT
Description of problem:
'openstack introspection bulk start' fails randomly with error: Preprocessing hook validate_interfaces: No suitable interfaces found

Version-Release number of selected component (if applicable):
RHOSP 10

How reproducible:
Launch introspection of nodes following instructions in the beta docs.

Steps to Reproduce:
After importing nodes into ironic, 'openstack introspection bulk start' with nodes in a manageable state fails with finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found

Actual results:
Introspection fails on some nodes

Expected results:
Introspection should succeed on all nodes.

Additional info:
Randomly - the work around is to just restart the bulk introspection and then it completes successfully.

Hardware - On Nokia Airframe hardware
Comment 2 Bob Fournier 2017-08-23 09:47:31 EDT
Hi Kosal,

I'm not able to access the sosreport at the link provided - http://collab-shell.usersys.redhat.com/01908302/, I get a 404 error.  Can you attach it to this bug report? Thanks.
Comment 3 Bob Fournier 2017-08-23 10:20:15 EDT
Kosal,

Actually I was able to retrieve the sosreport and ironic-inspector logs as attachments in the case although that link still appears to be invalid.  So no need to provide that info here.  Thanks.

However, I'm leaving the NeedInfo open as would like get ramdisk logs from the node when the failures occurs.  To get the ramdisk logs you can set these in
/etc/ironic-inspector/inspector.conf

# If set, logs from ramdisk will be stored in this directory. (string
# value)
ramdisk_logs_dir = /var/log/ironic-inspector/ramdisk/

# Whether to store ramdisk logs even if it did not return an error
# message (dependent upon "ramdisk_logs_dir" option being set).
# (boolean value)
always_store_ramdisk_logs = true

Note that in the case you indicated that the version of NetworkManager was correct for this not to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1384187.  However, it is the version of NetworkManager on the IPA image that matters, not the Undercloud.  If they are using older IPA images with a NetworkManager that has the bug, this could still be a problem.  Can you verify the NetworkManager version running in the IPA image?
Comment 4 Bob Fournier 2017-08-23 10:58:25 EDT
From the ironic-inspector logs it looks like the interface that should be used for pxe boot does not have an IP address assigned.  Just prior to the "No suitable interfaces found" is the message:

2017-08-07 19:06:42.742 24556 DEBUG ironic_inspector.plugins.standard [-] [node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Skipping interface ens4f0 as it did not have an IP address assigned during the ramdisk run _validate_interfaces /usr/lib/python2.7/site-packages/ironic_inspector/plugins/standard.py

and ens4f0 does not have a IP address. 
{u'ens4f0': {'ip': None, 'mac': u'54:ab:3a:cc:48:0c'}, u'ens12f1': {'ip': u'172.31.255.107', 'mac': u'90:e2:ba:cc:72:0d'}, u'ens12f0': {'ip': u'172.31.255.118', 'mac': u'90:e2:ba:cc:72:0c'}, u'ens4f1': {'ip': None, 'mac': u'54:ab:3a:cc:48:0d'}, u'ens1f1': {'ip': u'172.31.255.105', 'mac': u'90:e2:ba:cc:6b:39'}, u'ens1f0': {'ip': u'172.31.255.114', 'mac': u'90:e2:ba:cc:6b:38'}, u'ens255f0': {'ip': None, 'mac': u'54:ab:3a:ca:37:97'}, u'ens255f1': {'ip': None, 'mac': u'54:ab:3a:ca:37:98'}}

Again, would like to get the ramdisk log for the node that fails introspection to see what is going on with this interface.
Comment 5 Bob Fournier 2017-08-23 14:30:42 EDT
I noticed that each node has 8 nodes on the provisioning network, all of which are running DHCP and retrieving addresses from the Ironic inspector dnsmasq. It looks like a few interfaces per node either do not get addresses or do not receive the DHCP ACK. When it occurs that the boot_interface does not get a dhcp address, introspection will fail. with "No suitable interface found"  You can see below that of the 8 interfaces per node, random interfaces end up not receiving dhcp addresses.

For all nodes, ens4f0 is always the boot interface.

Node 1
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens1f1 with MAC "90:e2:ba:ca:d1:c1" and IP address "172.31.255.117" 
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens1f0 with MAC "90:e2:ba:ca:d1:c0" and IP address "172.31.255.120" 
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens4f1 with MAC "54:ab:3a:cc:47:e1" and IP address "172.31.255.112" 
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens4f0 with MAC "54:ab:3a:cc:47:e0" and IP address "172.31.255.103" 
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens12f1 with MAC "90:e2:ba:ca:d0:7d" and IP address "172.31.255.113" 
 [node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens12f0 with MAC "90:e2:ba:ca:d0:7c" and IP address "172.31.255.115" 
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens255f1 with MAC "54:ab:3a:ca:35:3d" and IP address "None"
[node: MAC 54:ab:3a:cc:47:e0 BMC 192.168.118.66] Found interface ens255f0 with MAC "54:ab:3a:ca:35:3c" and IP address "None"

Node 2
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens1f1 with MAC "90:e2:ba:cc:6b:39" and IP address "172.31.255.105"
 [node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens1f0 with MAC "90:e2:ba:cc:6b:38" and IP address "172.31.255.114" 
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens4f1 with MAC "54:ab:3a:cc:48:0d" and IP address "None" 
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens4f0 with MAC "54:ab:3a:cc:48:0c" and IP address "None"   <=== this has introspection failure
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens12f1 with MAC "90:e2:ba:cc:72:0d" and IP address "172.31.255.107" 
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens12f0 with MAC "90:e2:ba:cc:72:0c" and IP address "172.31.255.118"
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens255f1 with MAC "54:ab:3a:ca:37:98" and IP address "None" 
[node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Found interface ens255f0 with MAC "54:ab:3a:ca:37:97" and IP address "None"

Node 3
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens1f1 with MAC "90:e2:ba:cc:71:cd" and IP address "172.31.255.109" 
 [node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens1f0 with MAC "90:e2:ba:cc:71:cc" and IP address "172.31.255.110" 
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens4f1 with MAC "54:ab:3a:cc:47:6d" and IP address "172.31.255.108"
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens4f0 with MAC "54:ab:3a:cc:47:6c" and IP address "172.31.255.100" 
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens12f1 with MAC "90:e2:ba:ca:ce:5d" and IP address "172.31.255.111" 
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens12f0 with MAC "90:e2:ba:ca:ce:5c" and IP address "None" 
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens255f1 with MAC "54:ab:3a:ca:3a:d7" and IP address "None" 
[node: MAC 54:ab:3a:cc:47:6c BMC 192.168.118.68] Found interface ens255f0 with MAC "54:ab:3a:ca:3a:d6" and IP address "None"

Node 4
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens1f1 with MAC "90:e2:ba:ca:cd:55" and IP address "172.31.255.104"
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens1f0 with MAC "90:e2:ba:ca:cd:54" and IP address "172.31.255.106"
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens4f1 with MAC "54:ab:3a:cc:47:99" and IP address "172.31.255.116" 
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens4f0 with MAC "54:ab:3a:cc:47:98" and IP address "172.31.255.101"
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens12f1 with MAC "90:e2:ba:cc:77:99" and IP address "172.31.255.119" 
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens12f0 with MAC "90:e2:ba:cc:77:98" and IP address "None"
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens255f1 with MAC "54:ab:3a:ca:37:8c" and IP address "None" 
[node: MAC 54:ab:3a:cc:47:98 BMC 192.168.118.67] Found interface ens255f0 with MAC "54:ab:3a:ca:37:8b" and IP address "None"

I would recommend setting up the attached switch such that only the port attached to ens4f0 is on the provisioning network.  That should stop the introspection failures.

Also, as requested earlier, please provide ramdisk logs for the node that has an introspection failure to help determine why occasionally ens4f0 does not get the dhcp response.
Comment 6 Bob Fournier 2017-08-24 10:55:28 EDT
Note also, it looks like there is a valid dhcp sequence from introspection dnsmasq server so its not clear why IPA doesn't set this address (172.31.255.102) in ramdisk interfaces entry for 54:ab:3a:cc:48:0c

Aug 09 19:03:26 undercloud dnsmasq-dhcp[25505]: DHCPREQUEST(br-ctlplane) 172.31.255.102 54:ab:3a:cc:48:0c
Aug 09 19:03:26 undercloud dnsmasq-dhcp[25505]: DHCPACK(br-ctlplane) 172.31.255.102 54:ab:3a:cc:48:0c
Aug 09 19:03:45 undercloud dnsmasq-dhcp[25505]: DHCPREQUEST(br-ctlplane) 172.31.255.102 54:ab:3a:cc:48:0c
Aug 09 19:03:45 undercloud dnsmasq-dhcp[25505]: DHCPACK(br-ctlplane) 172.31.255.102 54:ab:3a:cc:48:0c
Aug 09 19:04:18 undercloud ironic-inspector[25502]: 2017-08-09 19:04:18.761 25502 DEBUG ironic_inspector.main [-] [node: MAC 54:ab:3a:cc:48:0c BMC 192.168.118.70] Received data from the ramdisk: {u'cpu_arch': u'x86_64', u'root_disk': {u'rotational': False, u'vendor': u'ATA', u'name': u'/dev/sdb', u'wwn_vendor_extension': None, u'wwn_with_extension': u'0x55cd2e404c2926ba', u'model': u'INTEL SSDSC2BA20', u'wwn': u'0x55cd2e404c2926ba', u'serial': u'BTHV630202TE200MGN', u'size': 200049647616}, u'interfaces': {u'ens4f0': {u'ip': None, u'mac': u'54:ab:3a:cc:48:0c'},  <= missing IP here

The requested ramdisk logs will add some insight.
Comment 7 Jim Bagwell 2017-08-28 10:24:38 EDT
I do not have the ramdisk logs as its not enabled by default. Since this issue seemingly happens at random times, the setup in the logs has been redeployed and is not currently experiencing this issue. In order to produce ramdisk logs, I would need to enable the logging, and then hope to encounter this issue on a setup (which can be done, but not with any regularity. I'll upload the logs as soon as I run into another failure.
Comment 8 Bob Fournier 2017-08-28 11:20:03 EDT
Thanks Jim.  Note that having all of the interfaces on the provisioning network isn't a recommended configuration - we recommend having just one nic on provisioning net, but we are curious as to why it would result in this "No suitable interfaces" issue.  It looks like dhcp may fail on some interfaces when so many are on provisioning.
Comment 9 Dmitry Tantsur 2017-09-20 13:10:34 EDT
Please reopen the issue when/if you are ready to provide the requested logs. Thanks!

P.S.
I'll second that having so many NICs on the provisioning network is not a recommended configuration. It's known to produce issues with iPXE. I'd suggest leaving only one NIC there.
Comment 10 Jim Bagwell 2017-09-25 12:29:04 EDT
Attaching the ramdisk logs
Comment 11 Jim Bagwell 2017-09-25 12:32 EDT
Created attachment 1330615 [details]
ramdisk logs requested
Comment 12 Jim Bagwell 2017-09-25 12:33 EDT
Created attachment 1330616 [details]
2nd set of ramdisk logs
Comment 13 Jim Bagwell 2017-09-25 12:33 EDT
Created attachment 1330618 [details]
3rd set of ramdisk logs
Comment 14 Bob Fournier 2017-09-27 12:55:49 EDT
Thanks for the ramdisk logs.

What appears to be happening is that, when so many interfaces are on the provisioning network, they all get IP addresses assigned but on some of them the dhcp lease expires (set to 120 seconds) before the introspection data is sent, so IPA reports no IP addresses for these interfaces.

All of the supplied ramdisk logs appear to have the ens4f0 NIC IP set to "None" so I assume that all failed introspection.  

From one of the logs (975d698a-d178-4d26-92ea-58cc284df626_20170924-073057.011142.tar) we can see that ens4f0, ens1f1, and ens12f0 all get addresses but then have their leases expired and the addresses removed:

journal:Sep 24 03:30:38 localhost.localdomain NetworkManager[661]: <info>  [1506238238.3844] dhcp4 (ens4f0): state changed bound -> expire
journal:Sep 24 03:30:38 localhost.localdomain NetworkManager[661]: <info>  [1506238238.4167] dhcp4 (ens4f0): canceled DHCP transaction, DHCP client pid 928
journal:Sep 24 03:30:38 localhost.localdomain NetworkManager[661]: <info>  [1506238238.4168] dhcp4 (ens4f0): state changed expire -> done
journal:Sep 24 03:30:38 localhost.localdomain NetworkManager[661]: <info>  [1506238238.4169] device (ens4f0): scheduling DHCPv4 restart in 120 seconds, 3 tries left (reason: lease expired)
journal:Sep 24 03:30:41 localhost.localdomain dhclient[861]: DHCPREQUEST on ens1f1 to 255.255.255.255 port 67 (xid=0x57abd07b)
journal:Sep 24 03:30:43 localhost.localdomain NetworkManager[661]: <info>  [1506238243.5419] dhcp4 (ens1f1): state changed bound -> expire
journal:Sep 24 03:30:43 localhost.localdomain NetworkManager[661]: <info>  [1506238243.5742] dhcp4 (ens1f1): canceled DHCP transaction, DHCP client pid 861
journal:Sep 24 03:30:43 localhost.localdomain NetworkManager[661]: <info>  [1506238243.5742] dhcp4 (ens1f1): state changed expire -> done
journal:Sep 24 03:30:43 localhost.localdomain NetworkManager[661]: <info>  [1506238243.5745] device (ens1f1): scheduling DHCPv4 restart in 120 seconds, 3 tries left (reason: lease expired)
journal:Sep 24 03:30:46 localhost.localdomain NetworkManager[661]: <info>  [1506238246.8396] dhcp4 (ens12f0): state changed bound -> expire
journal:Sep 24 03:30:46 localhost.localdomain NetworkManager[661]: <info>  [1506238246.8558] dhcp4 (ens12f0): canceled DHCP transaction, DHCP client pid 956
journal:Sep 24 03:30:46 localhost.localdomain NetworkManager[661]: <info>  [1506238246.8558] dhcp4 (ens12f0): state changed expire -> done
journal:Sep 24 03:30:46 localhost.localdomain NetworkManager[661]: <info>  [1506238246.8559] device (ens12f0): scheduling DHCPv4 restart in 120 seconds, 3 tries left (reason: lease expired)


IPA then reports ip set to "None" for all three of these interfaces but not for the interfaces whose dhcp lease did not expire.
j03:30:52.806 1533 INFO ironic_python_agent.inspector [-] network interfaces: {'ens4f0': {'ip': None, 'mac': '54:ab:3a:cc:47:88'}, 'ens12f1': {'ip': '172.31.255.142', 'mac': '90:e2:ba:c9:d0:c1'}, 'ens12f0': {'ip': None, 'mac': '90:e2:ba:c9:d0:c0'}, 'ens4f1': {'ip': '172.31.255.141', 'mac': '54:ab:3a:cc:47:89'}, 'ens1f1': {'ip': None, 'mac': '90:e2:ba:cc:70:49'}, 'ens1f0': {'ip': '172.31.255.156', 'mac': '90:e2:ba:cc:70:48'}, 'ens255f0': {'ip': None, 'mac': '54:ab:3a:ca:39:d7'}, 'ens255f1': {'ip': None, 'mac': '54:ab:3a:ca:39:d8'}}
bfournie-OSX:ramdisk-2 bfournie$ 

It appears there are a couple workarounds:
1. As recommended, leave only one NIC on the provisioning network.  This is the recommended configuration, it looks like the long time for dhcp to complete in IPA is causing this lease expiration.
2. OR - try increasing the dhcp lease time by editing /etc/ironic-inspector/dnsmasq.conf
and setting the last value in dhcp-range to 240, e.g.
change:
dhcp-range=10.8.146.100,10.8.146.120,29
to:
dhcp-range=10.8.146.100,10.8.146.120,240

(Note in the config above that 29 is below the dnsmasq minimum of 120 seconds, so 120 is used by default).

We should consider a patch that will increase the default lease time that inspector uses.

Note You need to log in before you can comment on or make changes to this bug.