Bug 1384187

Summary: Node introspection fails with "No suitable interfaces found"
Product: Red Hat OpenStack
Component: openstack-ironic-python-agent
Version: 10.0 (Newton)
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Chris Dearborn <christopher_dearborn>
Assignee: Bob Fournier <bfournie>
QA Contact: Raviv Bar-Tal <rbartal>
CC: arkady_kanevsky, bfournie, cdevine, christopher_dearborn, dcain, dtantsur, goneri, John_walsh, kasmith, kurt_hey, mburns, morazi, randy_perryman, richard_pioso, slinaber, smerrow, sreichar
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2016-11-14 20:17:49 UTC
Bug Blocks: 1335596, 1356451
Attachments:
- inspector conf and logs
- DCIM_NICView of 1 failing node
- DCIM_NICView of other failing node

Description Chris Dearborn 2016-10-12 18:38:30 UTC
Description of problem:
Node introspection is run on 9 nodes.  It succeeds on some and fails on others with: "Preprocessing hook validate_interfaces: No suitable interfaces found"

Version-Release number of selected component (if applicable):
OSP10

How reproducible:
Launch introspection of nodes following instructions in the beta docs.

Steps to Reproduce:
1. Set node state to manage
2. Run "openstack overcloud node introspect --all-manageable --provide"

Actual results:
Introspection fails on some nodes:

[stack@director ~]$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 72ea6e9b-1e11-4cec-9006-9f09b164fa12
Waiting for introspection to finish...
Introspection for UUID af2785cb-cbd5-4cec-9f51-8877c99d9abc finished successfully.
Introspection for UUID c83bed06-f55a-48de-b4d4-809c8c4ecff2 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:14:49:90'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:14:49:92'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:32:71:4a'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:6e:34'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:32:71:48'}, u'em1': {'ip': u'192.168.120.159', 'mac': u'a0:36:9f:14:49:92'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:6e:33'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:6e:32'}}
Introspection for UUID 50fe65a7-a74a-4a3b-b066-83d5b25ad182 finished successfully.
Introspection for UUID 7a6e2e84-0824-4d44-b10b-0d5d689de7e7 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:07:8e:48'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:07:8e:4a'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:14:47:46'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:64:40'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:14:47:44'}, u'em1': {'ip': u'192.168.120.160', 'mac': u'a0:36:9f:07:8e:48'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:64:3f'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:64:3e'}}
Introspection for UUID b1e6e951-029a-49ed-9210-37af619d7f4b finished successfully.
Introspection for UUID da9c2661-50b0-4113-adb8-222b527d4848 finished successfully.
Introspection for UUID 18b65faf-5278-4189-9ed6-6f5e913cff09 finished successfully.
Introspection for UUID 62bfadd8-6cdb-424a-9d4e-04a9812af826 finished successfully.
Introspection for UUID aa20ef25-986b-4c43-b212-9e42db742021 finished successfully.
Introspection completed with errors:
c83bed06-f55a-48de-b4d4-809c8c4ecff2: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:14:49:90'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:14:49:92'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:32:71:4a'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:6e:34'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:32:71:48'}, u'em1': {'ip': u'192.168.120.159', 'mac': u'a0:36:9f:14:49:92'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:6e:33'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:6e:32'}}
7a6e2e84-0824-4d44-b10b-0d5d689de7e7: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:07:8e:48'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:07:8e:4a'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:14:47:46'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:64:40'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:14:47:44'}, u'em1': {'ip': u'192.168.120.160', 'mac': u'a0:36:9f:07:8e:48'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:64:3f'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:64:3e'}}

Expected results:
Introspection should succeed on all nodes.

Additional info:

Comment 1 Bob Fournier 2016-10-12 20:42:19 UTC
Chris - would it be possible to get contents of these files (on undercloud) when problem occurs? 
/var/log/ironic-inspector/ironic-inspector.log
/etc/ironic-inspector/inspector.conf

Thanks.

Comment 2 Dmitry Tantsur 2016-10-13 08:18:08 UTC
So, no interfaces succeeded in receiving IP addresses. This is strange. Chris, could you please also grab ramdisk logs associated with this run from /var/log/ironic-inspector/ramdisk?

Also, is it bare metal or VM? Which model of bare metal?

Comment 3 Chris Dearborn 2016-10-13 17:35:16 UTC
Had to slick my deployment, but will be redeploying, retesting, and will supply all of the above.
Introspection is running on bare metal: PowerEdge R720s and R730xds.  Not sure which models the error occurred on.  Will get that info shortly as well.  Thanks!

Comment 4 Chris Dearborn 2016-10-13 22:00:11 UTC
New test case:
[stack@director pilot]$ openstack overcloud node introspect --all-manageable --provide
Started Mistral Workflow. Execution ID: 9ad4c72a-de9e-4bef-b0cf-2053269d3b5a
Waiting for introspection to finish...
Introspection for UUID 0e73fd03-25d6-411f-b2bd-8b85748a37b6 finished successfully.
Introspection for UUID 175936b1-87f3-4f99-a114-62b247aff1fa finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:14:49:90'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:14:49:92'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:32:71:4a'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:6e:34'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:32:71:48'}, u'em1': {'ip': u'192.168.120.159', 'mac': u'a0:36:9f:14:49:90'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:6e:33'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:6e:32'}}
Introspection for UUID 78ce500f-1c24-4fec-91ff-970a60e9b493 finished successfully.
Introspection for UUID 8e19f0fc-651b-4590-bf08-29b91d0a3fe3 finished successfully.
Introspection for UUID db5d9fbc-2fc2-4caa-b481-bff978b8ea7c finished successfully.
Introspection for UUID b6acc449-cce4-4d45-ade2-3e7e64e0517c finished successfully.
Introspection for UUID 6f45bba8-a6ab-4733-822d-ecc1c345da3a finished successfully.
Introspection for UUID e3470310-97d6-45e9-8358-1dc18e0a350d finished successfully.
Introspection for UUID 22cab020-a272-4cd3-9f3c-0c5b23b54a78 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:07:95:10'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:07:95:12'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:07:87:5e'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:58:04'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:07:87:5c'}, u'em1': {'ip': u'192.168.120.160', 'mac': u'a0:36:9f:07:87:5e'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:58:03'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:58:02'}}
Introspection completed with errors:
175936b1-87f3-4f99-a114-62b247aff1fa: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:14:49:90'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:14:49:92'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:32:71:4a'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:6e:34'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:32:71:48'}, u'em1': {'ip': u'192.168.120.159', 'mac': u'a0:36:9f:14:49:90'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:6e:33'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:6e:32'}}
22cab020-a272-4cd3-9f3c-0c5b23b54a78: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'p4p1': {'ip': None, 'mac': u'a0:36:9f:07:95:10'}, u'p4p2': {'ip': None, 'mac': u'a0:36:9f:07:95:12'}, u'p6p2': {'ip': None, 'mac': u'a0:36:9f:07:87:5e'}, u'em4': {'ip': None, 'mac': u'd4:ae:52:89:58:04'}, u'p6p1': {'ip': None, 'mac': u'a0:36:9f:07:87:5c'}, u'em1': {'ip': u'192.168.120.160', 'mac': u'a0:36:9f:07:87:5e'}, u'em3': {'ip': None, 'mac': u'd4:ae:52:89:58:03'}, u'em2': {'ip': None, 'mac': u'd4:ae:52:89:58:02'}}

Comment 5 Chris Dearborn 2016-10-13 22:03:31 UTC
Both nodes that failed are bare metal PowerEdge R720s.  Note that there are 6 R720s in the cluster, and introspection on the remaining 4 worked.

Comment 6 Chris Dearborn 2016-10-13 22:09:46 UTC
Created attachment 1210274 [details]
inspector conf and logs

Comment 7 Chris Dearborn 2016-10-13 22:15:19 UTC
A zip with the inspector logs and conf file is attached.

You can see in the error messages above that em1 has an IP.  The 192.168.120.0/24 network is the provisioning network, and the expectation is that it would choose this interface.

Note that I've run introspection with OSP 9 and the same cluster, many, many times and have not seen this error with 9.

Comment 8 Chris Dearborn 2016-10-13 22:15:58 UTC
One final piece of info is that the logs contain data from 2 introspection runs.

Comment 9 Chris Dearborn 2016-10-14 20:30:44 UTC
When I encountered this problem, I was running RHEL 7.2 on my director node with OSP10.  I switched to the RHEL 7.3 beta, and have only run one test, but the problem is not occurring & introspection is completing successfully.  Will try another test shortly.

Comment 10 Bob Fournier 2016-10-14 21:09:00 UTC
Thanks for the update Chris.  This is what I see from the logs you provided:

This is coming from ironic-inspector/plugins/standard.py. The error is logged because the interface MAC doesn't match the PXE MAC, which is stored in 'boot_interface' in the Ironic database:

    if (CONF.processing.add_ports == 'pxe' and pxe_mac
            and mac != pxe_mac):
        LOG.debug('Skipping interface %s as it was not PXE booting',
                  name, data=data)

It looks like the mac address that is stored for em1 in the Ironic 'interfaces' entry is incorrect.

Here is a case that generated an introspection error: the boot_interface does not match em1's mac, and a "No suitable interfaces found" message was eventually generated:
=====
u'interfaces': {u'p4p1': {u'ip': None, u'mac': u'a0:36:9f:14:49:90'}, u'p4p2': {u'ip': None, u'mac': u'a0:36:9f:14:49:92'}, u'p6p2': {u'ip': None, u'mac': u'a0:36:9f:32:71:4a'}, u'em4': {u'ip': None, u'mac': u'd4:ae:52:89:6e:34'}, u'p6p1': {u'ip': None, u'mac': u'a0:36:9f:32:71:48'}, u'em1': {u'ip': u'192.168.120.159', u'mac': u'a0:36:9f:14:49:90'}, u'em3': {u'ip': None, u'mac': u'd4:ae:52:89:6e:33'}, u'em2': {u'ip': None, u'mac': u'd4:ae:52:89:6e:32'}}, u'cpus': 24, u'boot_interface': u'd4:ae:52:89:6e:31',

If you look at the mac addresses in the 'interfaces' entry, it looks like the mac for em1 has been overwritten by the mac for p4p1.  You would expect em1 to have mac u'd4:ae:52:89:6e:31', since the em addresses are sequential, but instead it has the same mac as p4p1:
=====
u'p4p1': u'a0:36:9f:14:49:90'
u'p4p2': u'a0:36:9f:14:49:92'
u'p6p1': u'a0:36:9f:32:71:48'
u'p6p2': u'a0:36:9f:32:71:4a'
u'em1': u'a0:36:9f:14:49:90'
u'em2': u'd4:ae:52:89:6e:32'
u'em3': u'd4:ae:52:89:6e:33'
u'em4': u'd4:ae:52:89:6e:34'

u'boot_interface': u'd4:ae:52:89:6e:31',

Because the boot_interface doesn't match any interface, no suitable interface could be found and the error occurs.

Compare this to a node that succeeded: here the 'boot_interface' matches em1's mac, and the macs for em1 through em4 are sequential:
u'interfaces': {u'p4p1': {u'ip': None, u'mac': u'a0:36:9f:07:89:c0'}, u'p4p2': {u'ip': None, u'mac': u'a0:36:9f:07:89:c2'}, u'p6p2': {u'ip': None, u'mac': u'00:1b:21:d7:c7:3d'}, u'em4': {u'ip': None, u'mac': u'd4:ae:52:89:63:fc'}, u'p6p1': {u'ip': None, u'mac': u'00:1b:21:d7:c7:3c'}, u'em1': {u'ip': u'192.168.120.153', u'mac': u'd4:ae:52:89:63:f9'}, u'em3': {u'ip': None, u'mac': u'd4:ae:52:89:63:fb'}, u'em2': {u'ip': None, u'mac': u'd4:ae:52:89:63:fa'}}, u'cpus': 24, u'boot_interface': u'd4:ae:52:89:63:f9'
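The filtering described above can be sketched in a few lines. This is a standalone approximation of the inspector's behavior, not the actual ironic-inspector code; `suitable_interfaces` is a hypothetical helper, and the data is taken from the error output in this bug:

```python
def suitable_interfaces(interfaces, pxe_mac, add_ports='pxe'):
    """Return interfaces that survive inspector-style PXE filtering.

    interfaces: dict of name -> {'ip': ..., 'mac': ...} as in the errors above.
    pxe_mac: the 'boot_interface' MAC stored by Ironic for the node.
    """
    result = {}
    for name, iface in interfaces.items():
        if not iface.get('mac') or not iface.get('ip'):
            continue  # interfaces without an IP are never suitable
        if add_ports == 'pxe' and pxe_mac and iface['mac'] != pxe_mac:
            continue  # skipped: "not PXE booting"
        result[name] = iface
    return result

# Failing node from comment 10: em1 carries p4p1's MAC, so nothing matches
failing = {
    'em1':  {'ip': '192.168.120.159', 'mac': 'a0:36:9f:14:49:90'},
    'p4p1': {'ip': None, 'mac': 'a0:36:9f:14:49:90'},
    'em2':  {'ip': None, 'mac': 'd4:ae:52:89:6e:32'},
}
print(suitable_interfaces(failing, pxe_mac='d4:ae:52:89:6e:31'))  # -> {}

# Succeeding node: boot_interface matches em1's real MAC
ok = {'em1': {'ip': '192.168.120.153', 'mac': 'd4:ae:52:89:63:f9'}}
print(suitable_interfaces(ok, pxe_mac='d4:ae:52:89:63:f9'))
```

An empty result is exactly the "No suitable interfaces found" case.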

Comment 11 Bob Fournier 2016-10-15 13:04:05 UTC
Note also that in the ramdisk logs the physical interface eth0 is detected with the correct mac - d4:ae:52:89:6e:31:

journal:Oct 13 16:33:43 localhost.localdomain kernel: tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address d4:ae:52:89:6e:31
journal:Oct 13 16:33:43 localhost.localdomain kernel: tg3 0000:01:00.1 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address d4:ae:52:89:6e:32
journal:Oct 13 16:33:43 localhost.localdomain kernel: tg3 0000:02:00.0 eth2: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address d4:ae:52:89:6e:33
journal:Oct 13 16:33:43 localhost.localdomain kernel: tg3 0000:02:00.1 eth3: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address d4:ae:52:89:6e:34

Yet em1 has the wrong one - p4p1's mac (a0:36:9f:14:49:90).

bfournie-OSX:ramdisk bfournie$ cat ip_addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:14:49:90 brd ff:ff:ff:ff:ff:ff
    inet 192.168.120.159/24 brd 192.168.120.255 scope global dynamic em1
       valid_lft 95sec preferred_lft 95sec
    inet6 fe80::a236:9fff:fe14:4990/64 scope link 
       valid_lft forever preferred_lft forever
3: em2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether d4:ae:52:89:6e:32 brd ff:ff:ff:ff:ff:ff
4: em3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether d4:ae:52:89:6e:33 brd ff:ff:ff:ff:ff:ff
5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether d4:ae:52:89:6e:34 brd ff:ff:ff:ff:ff:ff
6: p6p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:32:71:48 brd ff:ff:ff:ff:ff:ff
7: p6p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:32:71:4a brd ff:ff:ff:ff:ff:ff
8: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:14:49:90 brd ff:ff:ff:ff:ff:ff
9: p4p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:14:49:92 brd ff:ff:ff:ff:ff:ff

Seems to be some kind of mapping problem in IPA.
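A quick way to spot this kind of overwrite in the reported data is to look for the same MAC appearing on more than one interface. A throwaway sketch (`duplicate_macs` is a hypothetical helper, not part of IPA):

```python
from collections import defaultdict

def duplicate_macs(interfaces):
    """Map each MAC that appears on more than one interface to those names."""
    by_mac = defaultdict(list)
    for name, iface in interfaces.items():
        by_mac[iface['mac']].append(name)
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}

# Interface data from the failing node above
interfaces = {
    'em1':  {'ip': '192.168.120.159', 'mac': 'a0:36:9f:14:49:90'},
    'p4p1': {'ip': None, 'mac': 'a0:36:9f:14:49:90'},
    'p4p2': {'ip': None, 'mac': 'a0:36:9f:14:49:92'},
    'em2':  {'ip': None, 'mac': 'd4:ae:52:89:6e:32'},
}
print(duplicate_macs(interfaces))  # -> {'a0:36:9f:14:49:90': ['em1', 'p4p1']}
```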

Comment 12 Chris Dearborn 2016-10-17 18:03:41 UTC
Created attachment 1211490 [details]
DCIM_NICView of 1 failing node

Comment 13 Chris Dearborn 2016-10-17 18:04:18 UTC
Created attachment 1211491 [details]
DCIM_NICView of other failing node

Comment 14 Chris Dearborn 2016-10-17 18:10:30 UTC
Good eyes Bob!  I totally missed that.  I've attached a dump of the DCIM_NICView for both of the failing nodes.  You can see that the MAC that's showing up on 2 interfaces is actually the MAC on the "p*p*" interface and that em1 has a different MAC that's not being correctly reported in the error messages above.  This is true on both nodes.

Not sure why this would be failing on RHEL 7.2, and succeeding on RHEL 7.3 beta.  I'm using the same Oct. 5th puddle in both cases, the same instackenv.json (I'm not providing any MACs in it), and the python version shows as 2.7.5 on both versions of RHEL.

Comment 15 Bob Fournier 2016-10-18 00:44:17 UTC
Thanks Chris. So it looks like the mapping of ethX to emX is done by NetworkManager, as can be seen in the ramdisk logs; perhaps that is why it's different in RHEL 7.3 vs 7.2.

There are a couple interesting log messages:

There are these mapping messages that show eth0 mapped to em1 as expected:
journal:Oct 13 16:33:44 localhost.localdomain NetworkManager[426]: <info>  [1476390824.2782] device (eth0): interface index 2 renamed iface from 'eth0' to 'em1'

but then, a second later, eth0 is mapped to p4p1.
journal:Oct 13 16:33:45 localhost.localdomain NetworkManager[426]: <info>  [1476390825.3173] device (eth0): interface index 8 renamed iface from 'eth0' to 'p4p1'

As shown earlier, em1 is being reported with p4p1's mac address.

Perhaps more surprising is this "set-cloned MAC" message, in which em1's mac is set to p4p1's MAC; this is the source of the problem.
journal:Oct 13 16:33:48 localhost.localdomain NetworkManager[426]: <info>  [1476390828.2222] device (em1): set-hw-addr: set-cloned MAC address to A0:36:9F:14:49:90 (permanent)  <==

It's only on em1 that this "cloned-mac" message is seen.

So there seems to be some interaction with NetworkManager that is causing this...
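The journal messages quoted above can be mined mechanically. A sketch of that, with the regexes written against the exact message format shown here (they are not guaranteed to match other NetworkManager versions):

```python
import re

# Patterns for the two NetworkManager journal messages quoted above
RENAME = re.compile(
    r"device \((\S+)\): interface index \d+ renamed iface from '\S+' to '(\S+)'")
CLONED = re.compile(
    r"device \((\S+)\): set-hw-addr: set-cloned MAC address to (\S+)")

def nm_events(lines):
    """Collect the last rename seen per device, and any cloned-MAC writes."""
    renames, cloned = {}, {}
    for line in lines:
        m = RENAME.search(line)
        if m:
            renames[m.group(1)] = m.group(2)
        m = CLONED.search(line)
        if m:
            cloned[m.group(1)] = m.group(2)
    return renames, cloned

journal = [
    "NetworkManager[426]: <info>  [1476390824.2782] device (eth0): interface index 2 renamed iface from 'eth0' to 'em1'",
    "NetworkManager[426]: <info>  [1476390825.3173] device (eth0): interface index 8 renamed iface from 'eth0' to 'p4p1'",
    "NetworkManager[426]: <info>  [1476390828.2222] device (em1): set-hw-addr: set-cloned MAC address to A0:36:9F:14:49:90 (permanent)",
]
renames, cloned = nm_events(journal)
print(renames)  # the second eth0 rename overwrites the first
print(cloned)   # em1 ends up carrying p4p1's MAC
```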

Comment 16 Chris Dearborn 2016-10-18 16:00:16 UTC
Just ran another test case with RHEL 7.3, and the problem is occurring there now as well.

It should be noted that this worked perfectly with OSP9 on RHEL 7.2, which leads me to believe that there was a code change in introspection that is causing this problem.

Comment 17 Chris Dearborn 2016-10-18 19:13:47 UTC
Interestingly enough, only 1 node is failing, and it is not either of the 2 that were failing with RHEL 7.2.

Also, the IPA ramdisk that is being loaded and is discovering this information is the same whether RHEL 7.2 or RHEL 7.3 is installed on the director node...

Comment 18 Bob Fournier 2016-10-25 13:45:27 UTC
I've created a NetworkManager bug for this, to look at the MAC overwrite issue, while we see if we can come up with a workaround - https://bugzilla.redhat.com/show_bug.cgi?id=1388286

Comment 19 Bob Fournier 2016-10-27 13:16:27 UTC
Per discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1388286, this appears to be a NetworkManager bug.  Until that is resolved, we should follow one of two approaches:

1. Disable NetworkManager in all images by setting
"networkmanager": {"uninstall": true}
in package-installs.json used by disk-image-builder.
2. Leave NetworkManager running but disable interface naming by biosdevname. It's possible this could be done by setting kernel flags for the ramdisk image and by using virt-customize for the other images.  The recommended settings are:
"net.ifnames=0 biosdevname=0"

In either case, all images will need to be modified to ensure consistent interface names across images.

Investigation of the best approach is ongoing.

Comment 20 Bob Fournier 2016-10-27 19:42:27 UTC
I tried workaround #2 and verified network device naming is no longer being done, which should fix the introspection failure.  I tested it as follows:

1. Copied over agent.ramdisk and agent.kernel from rpms being tested by Dell.
2. Set net.ifnames=0 biosdevname=0 in inspector.ipxe

cat /httpboot/inspector.ipxe
#!ipxe

:retry_dhcp
dhcp || goto retry_dhcp

:retry_boot
imgfree
kernel --timeout 60000 http://10.8.146.1:8088/agent.kernel ipa-inspection-callback-url=http://10.8.146.1:5050/v1/continue ipa-inspection-collectors=default,extra-hardware,logs systemd.journald.forward_to_console=yes BOOTIF=${mac} ipa-debug=1 ipa-inspection-dhcp-all-interfaces=1 ipa-collect-lldp=1 initrd=agent.ramdisk net.ifnames=0 biosdevname=0 || goto retry_boot
initrd --timeout 60000 http://10.8.146.1:8088/agent.ramdisk || goto retry_boot
boot

3. Ran introspection with and without these kernel changes and verified renaming isn't done with the kernel change.
[stack@host01 ramdisk-rhel7.3]$ grep renamed *
journal:Oct 27 14:40:28 localhost.localdomain NetworkManager[351]: <info>  [1477593628.2570] device (eth1): interface index 3 renamed iface from 'eth1' to 'em2'
journal:Oct 27 14:40:28 localhost.localdomain NetworkManager[351]: <info>  [1477593628.2581] device (eth0): interface index 2 renamed iface from 'eth0' to 'em1'
journal:Oct 27 14:40:28 localhost.localdomain NetworkManager[351]: <info>  [1477593628.2591] device (eth2): interface index 4 renamed iface from 'eth2' to 'p2p1'
journal:Oct 27 14:40:28 localhost.localdomain NetworkManager[351]: <info>  [1477593628.2608] device (eth3): interface index 5 renamed iface from 'eth3' to 'p2p2'

This method would also require changes to other images via virt-customize to set the same kernel params.
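The inspector.ipxe edit in step 2 amounts to appending the two flags to the kernel line. A hypothetical helper for scripting that edit (it assumes the single-kernel-line layout shown above; `add_kernel_args` is not an existing tool):

```python
def add_kernel_args(ipxe_text, extra="net.ifnames=0 biosdevname=0"):
    """Append extra kernel args to every 'kernel ...' line of an iPXE script."""
    out = []
    for line in ipxe_text.splitlines():
        if line.startswith("kernel ") and extra not in line:
            if "||" in line:
                # keep any trailing '|| goto retry_boot' fallback at the end
                cmd, _, fallback = line.partition("||")
                line = cmd.rstrip() + " " + extra + " || " + fallback.strip()
            else:
                line = line + " " + extra
        out.append(line)
    return "\n".join(out)

script = "#!ipxe\nkernel http://10.8.146.1:8088/agent.kernel ipa-debug=1 || goto retry_boot\nboot"
print(add_kernel_args(script))
```

The `extra not in line` check makes the edit idempotent, so re-running it against an already-patched script changes nothing.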


Note that there is also a third workaround as described in https://bugzilla.redhat.com/show_bug.cgi?id=1388286.  This has not been tried yet, but the process would be:
1. Unpack ramdisk image by "gzip -dc </httpboot/agent.ramdisk | cpio --extract"
2. Create file 'etc/NetworkManager/conf.d/90-cloned-mac-address-preserve.conf' with content:
  [connection-cloned-mac-address-preserve]
  ethernet.cloned-mac-address=preserve
3. Recreate ramdisk image by "find . | cpio -oc | gzip -c -9>| ~/agent.ramdisk"

This third method should not require additional changes to other images; it just prevents the mac overwrite and introspection error in the ramdisk.  A script can be provided for this.
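Step 2's conf file is plain INI; dropping it into the unpacked ramdisk tree from a script might look like the following (a sketch only - `write_nm_preserve_conf` is a hypothetical helper, and the unpack/repack steps around it are the cpio commands listed above):

```python
import os
import tempfile

def write_nm_preserve_conf(ramdisk_root):
    """Create the cloned-mac-address=preserve override in an unpacked ramdisk tree."""
    conf_dir = os.path.join(ramdisk_root, "etc/NetworkManager/conf.d")
    os.makedirs(conf_dir, exist_ok=True)
    path = os.path.join(conf_dir, "90-cloned-mac-address-preserve.conf")
    with open(path, "w") as f:
        f.write("[connection-cloned-mac-address-preserve]\n"
                "ethernet.cloned-mac-address=preserve\n")
    return path

# Example against a scratch directory standing in for the unpacked ramdisk
root = tempfile.mkdtemp()
conf_path = write_nm_preserve_conf(root)
print(open(conf_path).read())
```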

Comment 21 Dmitry Tantsur 2016-11-03 09:55:35 UTC
Chris, do any of the workarounds fix the problem for you? The fix seems to belong in NetworkManager itself, so I'd like to document this as a known issue until they release an update.

Comment 22 Gonéri Le Bouder 2016-11-07 20:50:57 UTC
Hi there,

I face the same issue with the latest OSP8 (not 10!) puddles.

Introspection for UUID 9d5a678a-998b-42d9-87c2-5d10317ad9af finished with error: Preprocessing hook validate_interfaces: No valid interfaces found for node with BMC 192.168.110.107, got {u'p5p1': {u'ip': None, u'mac': u'a0:36:9f:2a:2a:d4'}, u'p5p2': {u'ip': None, u'mac': u'a0:36:9f:2a:2a:d6'}, u'em4': {u'ip': None, u'mac': u'b8:ca:3a:67:3f:6b'}, u'em1': {u'ip': u'192.168.120.157', u'mac': u'a0:36:9f:14:47:08'}, u'em3': {u'ip': None, u'mac': u'b8:ca:3a:67:3f:6a'}, u'em2': {u'ip': None, u'mac': u'b8:ca:3a:67:3f:69'}, u'p7p2': {u'ip': None, u'mac': u'a0:36:9f:14:47:0a'}, u'p7p1': {u'ip': None, u'mac': u'a0:36:9f:14:47:08'}}
Introspection for UUID f03fdd4e-86c6-4665-ae89-9f1ff382936b finished with error: Preprocessing hook validate_interfaces: No valid interfaces found for node with BMC 192.168.110.105, got {u'p5p1': {u'ip': None, u'mac': u'a0:36:9f:2c:66:10'}, u'p5p2': {u'ip': None, u'mac': u'a0:36:9f:2c:66:12'}, u'em4': {u'ip': None, u'mac': u'b8:ca:3a:68:67:57'}, u'em1': {u'ip': u'192.168.120.158', u'mac': u'a0:36:9f:2c:66:12'}, u'em3': {u'ip': None, u'mac': u'b8:ca:3a:68:67:56'}, u'em2': {u'ip': None, u'mac': u'b8:ca:3a:68:67:55'}, u'p7p2': {u'ip': None, u'mac': u'a0:36:9f:2c:5e:a6'}, u'p7p1': {u'ip': None, u'mac': u'a0:36:9f:2c:5e:a4'}}

This is the first time I see something like that and I'm a bit lost,
because the MACs (a0:36:9f:14:47:08, a0:36:9f:2c:66:12) from the errors do
not match the MACs from the architecture diagram, nor what ipmitool returns
(b8:ca:3a:68:67:54, b8:ca:3a:67:3f:68):

[stack@director ~]$ ipmitool -I lanplus -H 192.168.110.105 -L ADMINISTRATOR -U root delloem mac
Password: 

System LOMs
NIC Number MAC Address Status

0 b8:ca:3a:68:67:54 Enabled
1 b8:ca:3a:68:67:55 Enabled
2 b8:ca:3a:68:67:56 Enabled
3 b8:ca:3a:68:67:57 Enabled

iDRAC7 MAC Address 5c:f9:dd:f6:6c:9c
[stack@director ~]$ ipmitool -I lanplus -H 192.168.110.107 -L ADMINISTRATOR -U root delloem mac 
Password: 

System LOMs
NIC Number MAC Address Status

0 b8:ca:3a:67:3f:68 Enabled
1 b8:ca:3a:67:3f:69 Enabled
2 b8:ca:3a:67:3f:6a Enabled
3 b8:ca:3a:67:3f:6b Enabled

iDRAC7 MAC Address 5c:f9:dd:f5:78:c0

  {
  "pm_type": "pxe_ipmitool",
  "pm_user": "root",
  "pm_password": "foo",
  "pm_addr": "192.168.110.107",
  "mac": [
  "b8:ca:3a:67:3f:68"
  ],
  "cpu": "20",
  "memory": "131072",
  "disk": "931",
  "arch": "x86_64"
  },
  {
  "pm_type": "pxe_ipmitool",
  "pm_user": "root",
  "pm_password": "foo",
  "pm_addr": "192.168.110.105",
  "mac": [
  "b8:ca:3a:68:67:54"
  ],
  "cpu": "20",
  "memory": "131072",
  "disk": "931",
  "arch": "x86_64"
  },

Sounds like the NIC driver suddenly decided to use another address. This is definitely not a Neutron issue.

[stack@director ~]$ rpm -qa 'rhosp*'
rhosp-director-images-8.0-20161031.1.el7ost.noarch
rhosp-director-images-ipa-8.0-20161031.1.el7ost.noarch

Comment 23 Bob Fournier 2016-11-07 21:27:14 UTC
Goneri - yes, this is the same issue; it is not specific to OSP-10 but occurs due to NetworkManager in RHEL 7.3.

NetworkManager is overwriting the stored MAC addresses of the interface used for provisioning in the Ironic database. From your output, you can see that em1's mac
gets overwritten with the mac from p7p1:

u'em1': {u'ip': u'192.168.120.157', u'mac': u'a0:36:9f:14:47:08'}, 
u'p7p1': {u'ip': None, u'mac':u'a0:36:9f:14:47:08'}}

This is the NetworkManager bug causing the problem which has been fixed upstream:
https://bugzilla.redhat.com/show_bug.cgi?id=1388286

Would you be able to try either of the workarounds in Comment 20?

Comment 24 Gonéri Le Bouder 2016-11-08 18:31:29 UTC
Hi Bob.

These two solutions work fine here:
  - net.ifnames=0 biosdevname=0 in the kernel parameters
  - the 90-cloned-mac-address-preserve.conf in the initrd

I managed to do 2 full introspections of my 9 nodes in both cases.

Comment 25 Chris Dearborn 2016-11-08 22:04:57 UTC
Hey Bob, I tested workaround #2 and verified it worked on 2 consecutive in-band introspection runs on 9 nodes with both the idrac driver and the ipmi driver.

Yes, I know the driver shouldn't make a difference in this case, but I wanted to be sure.

So that's 4 consecutive runs on 9 nodes per run.  I'll incorporate this workaround into our solution.

Thanks for all the hard work on this!

Comment 26 Gonéri Le Bouder 2016-11-10 19:29:37 UTC
At Bob's request, I patched the initrd with the following RPMs and restarted my tests:

- NetworkManager-1.4.0-13.test.rh1388286.01.el7_3.x86_64.rpm
- NetworkManager-libnm-1.4.0-13.test.rh1388286.01.el7_3.x86_64.rpm
- NetworkManager-tui-1.4.0-13.test.rh1388286.01.el7_3.x86_64.rpm
- NetworkManager-config-server-1.4.0-13.test.rh1388286.01.el7_3.x86_64.rpm
- NetworkManager-team-1.4.0-13.test.rh1388286.01.el7_3.x86_64.rpm

This solves the problem.

Comment 27 Chris Dearborn 2016-11-10 19:37:27 UTC
I did the same, and it DID NOT solve the problem on my cluster.  The issue reproduced on the first run.

Comment 28 Gonéri Le Bouder 2016-11-10 19:48:20 UTC
I did this to patch my image:

gzip -dc </httpboot/agent.ramdisk | cpio --extract
cp -r /tmp/my_rpms tmp
chroot .
rpm -Uvh /tmp/my_rpms/*.rpm
exit
find . | cpio -oc | gzip -c -9>| ~/agent.ramdisk
cp ~/agent.ramdisk /httpboot/agent.ramdisk
restorecon -v /httpboot

Comment 29 Chris Dearborn 2016-11-10 19:58:55 UTC
(In reply to Chris Dearborn from comment #27)
> I did the same, and it DID NOT solve the problem on my cluster.  The issue
> reproduced on the first run.

Argh!  I patched the overcloud-full image, which is obviously not the right one.  Too many cold meds.  Will try again.

Comment 30 Chris Dearborn 2016-11-10 20:25:27 UTC
(In reply to Gonéri Le Bouder from comment #28)
> I did this to patch my image:
> 
> gzip -dc </httpboot/agent.ramdisk | cpio --extract
> cp /tmp/my_rpms tmp
> chroot .
> rpm -Uvh /tmp/my_rpms/*.rpm
> exit
> find . | cpio -oc | gzip -c -9>| ~/agent.ramdisk
> cp ~/agent.ramdisk /httpboot/agent.ramdisk
> restorecon -v /httpboot

Thanks Goneri!

Comment 31 Chris Dearborn 2016-11-10 21:15:25 UTC
The NetworkManager RPM worked perfectly on 2 runs of 9 nodes.

Comment 32 Dmitry Tantsur 2016-11-11 11:21:49 UTC
Cool! Folks, what's our path forward wrt this bug? Should we close it in favour of NM bug? Should we close it as known issue with workarounds for now?

Comment 33 Chris Dearborn 2016-11-14 18:45:01 UTC
Hey guys,

If the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1388286 makes it into OSP10, then I think this bug can be closed as a duplicate of that bug and no further changes are necessary.

If the fix for 1388286 DOES NOT make it into OSP10, then I think a release note needs to be added that describes this issue and lists the workarounds that are known to work.  Whether that is done under this bug or 1388286 is all the same to me.


Thanks!

Chris

Comment 34 Bob Fournier 2016-11-14 20:17:49 UTC
Per Chris's comments, closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286

The workarounds are already described in that bug.

*** This bug has been marked as a duplicate of bug 1388286 ***