Bug 1400764 - Wrong MAC address set by the ramdisk on the provisioning interface leads to a failure to provision
Summary: Wrong MAC address set by the ramdisk on the provisioning interface leads to a failure to provision
Keywords:
Status: CLOSED DUPLICATE of bug 1388286
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-inspector
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Bob Fournier
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-02 02:05 UTC by Alexandre Maumené
Modified: 2020-03-11 15:27 UTC
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-19 19:07:04 UTC
Target Upstream Version:


Attachments
introspection data (32.31 KB, text/plain)
2016-12-02 02:05 UTC, Alexandre Maumené
logs of the ramdisk (170.81 KB, application/x-gzip)
2016-12-02 02:05 UTC, Alexandre Maumené
screen shot wrong mac (36.24 KB, image/png)
2016-12-02 02:06 UTC, Alexandre Maumené
nic-config for controller (6.63 KB, text/plain)
2016-12-05 02:00 UTC, Alexandre Maumené

Description Alexandre Maumené 2016-12-02 02:05:14 UTC
Created attachment 1227126 [details]
introspection data

Description of problem:
When deploying, the servers boot through PXE into the deploy ramdisk. This ramdisk sends a DHCP request to get an IP from the undercloud. Randomly, the MAC address of a 10GE link (eno50) is "applied" to the provisioning NIC (eno1), as you can see on the screenshots.

Version-Release number of selected component (if applicable):
rhosp-director-images-ipa-10.0-20161102.1.el7ost.noarch
rhosp-director-images-ipa-10.0-20161129.2.el7ost.noarch

How reproducible:
Every time, on at least one node out of 9.

Steps to Reproduce:
1. Run a deploy
2. Wait
3.

Actual results:
A server where the ramdisk sets the MAC from an interface other than the one that should be used for provisioning is not able to get an IP through DHCP; hence the server doesn't have any connectivity and can't be provisioned.

Expected results:
The ramdisk should not mess with the MAC addresses.

Additional info:
Screenshots, logs from the ramdisk (after setting a password on it), and introspection data.

Comment 1 Alexandre Maumené 2016-12-02 02:05:38 UTC
Created attachment 1227127 [details]
logs of the ramdisk

Comment 2 Alexandre Maumené 2016-12-02 02:06:12 UTC
Created attachment 1227128 [details]
screen shot wrong mac

Comment 3 Jon Schlueter 2016-12-02 13:12:28 UTC
As someone pointed out in an email thread, https://bugzilla.redhat.com/show_bug.cgi?id=1388286 might be a related issue.

Comment 4 Bob Fournier 2016-12-02 14:46:32 UTC
Yes, this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286:
the same MAC (14:02:ec:83:e8:c3) is being assigned to eno1 and eno50.

Note that there are a few workarounds described here (which is also a duplicate of 1388286):
https://bugzilla.redhat.com/show_bug.cgi?id=1384187

Comment 5 Alexandre Maumené 2016-12-04 08:02:44 UTC
Hi,

I don't think this is a duplicate of the other bug, or at least not exactly the same, because:
- It only happens during node provisioning and NOT during introspection.
- The NetworkManager patch doesn't work and makes things worse, as eno1 isn't brought up on any node, so no nodes are actually deployed.
- The ethernet.cloned-mac-address=preserve fix didn't work either and ended up in the same situation.

Comment 7 Alexandre Maumené 2016-12-04 09:26:39 UTC
Hi,

The RPM fix hasn't actually been tried; I misunderstood what the colleague working on this with me said. It hasn't been tried for the simple reason that I don't know how/where to download these packages. Could you please let me know where to find them?

Thanks in advance.

Comment 8 Bob Fournier 2016-12-04 21:52:27 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1388286, the rhel 7.3 packages
can be downloaded here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12071007

Comment 9 Alexandre Maumené 2016-12-04 23:00:43 UTC
Hi,

I tried with the net.ifnames=0 biosdevname=0 ramdisk parameters and managed to provision the nodes successfully, but then the nodes were not able to download os-net-config/config.json. I assume it is because the introspection stored the ifnames as ethX, but since that's not predictable I can't really use this either.

So, back to the RPMs. Thanks for the link, but I already had it from the other BZ. I just can't find where to download the RPMs from that webpage. Sorry for my ignorance, but I'm not a developer and I've never worked with this tool. Also, if this works, I'd like to know where we stand in terms of support for the customer.

Comment 10 Bob Fournier 2016-12-04 23:50:11 UTC
OK, it's good you're able to provision when interface naming is turned off.  Yes, using the net.ifnames=0 biosdevname=0 parameters will result in the interfaces not being renamed, i.e. ethX will be used.  Are you relying on the renamed (i.e. enoX) interface names for a particular reason?  Perhaps the templates will need to be modified so that os-net-config uses these new names.  Could you clarify what you were seeing when "not able to download os-net-config/config.json"?

Is the "ethernet.cloned-mac-address=preserve" fix still not working for you? Were you able to use the ramskThat workaround would not affect the interface naming.

I think we should hold off on the RPMs, as those were intended only for internal testing to verify the fix. Sorry about that.

Comment 11 Alexandre Maumené 2016-12-05 00:04:09 UTC
Hi,

I've tried to use nicX in the templates, but os-net-config/config.json was still empty. I'm going to try with the ethX naming instead.

The "ethernet.cloned-mac-address=preserve" fix didn't work because when provisioning the ramdisk was not able to configure the provisioning NIC, hence it wasn't able to download the overcloud image and the deployment just failed.

I understand that the RPMs are definitely not a supported solution, and we need to find one, as this will be in production as soon as the GA is released.

Comment 12 Alexandre Maumené 2016-12-05 01:59:52 UTC
Hi,

1) When I use net.ifnames=0 biosdevname=0 only on the provisioning ramdisk, we can't deploy, since the introspection stored the enoX names and thus nothing matches any node the undercloud knows.

2) When I use net.ifnames=0 biosdevname=0 on both the introspection and provisioning ramdisks, we can't deploy because os-net-config/config.json is either empty (on ceph and compute nodes) or wrong on the controllers. On the controllers it looks like this:
"network_config": [{"use_dhcp": true, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic1", "primary": true}]}]} which makes me think it's the default configuration or something like that. Anyway, it's supposed to set up nic1 as ctlplane and eno49 & eno50 in a bond (I've attached the nic-config for the controller).

3) I can't use net.ifnames=0 biosdevname=0 on the introspection ramdisk, provisioning ramdisk and overcloud image, as the interfaces will come up in a random order. We need predictable ifnames when deploying OSP.

Comment 13 Alexandre Maumené 2016-12-05 02:00:24 UTC
Created attachment 1227959 [details]
nic-config for controller

Comment 17 Alexandre Maumené 2016-12-06 05:35:17 UTC
Hi,

I finally managed to apply a fix. I would need to know if it's supported before handing it over to the customer.

1) Edit /httpboot/inspector.ipxe so it looks like this:
[...] initrd=agent.ramdisk net.ifnames=0 biosdevname=0 || goto retry_boot [...]
This way the interfaces reported to the undercloud have ethX names. Run the introspection.

2) Run an infinite loop like this:
while true; do
  sudo find /httpboot -name config -exec sed -i 's/ramdisk coreos/ramdisk net.ifnames=0 biosdevname=0 coreos/' {} \;
done
This way the provisioning ramdisk will boot with the arguments to ignore predictable naming, and hence it will not fail.

3) Use the nicX naming scheme in the nic-configs templates so your config.json looks like this:
{"network_config": [{"dns_servers": ["10.126.162.11", "10.126.162.10"], "addresses": [{"ip_netmask": "172.17.9.201/24"}, {"ip_netmask": "10.61.80.217/26"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "172.17.9.254"}, {"default": true, "next_hop": "172.17.9.254"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"type": "ovs_bridge", "name": "br-bond", "members": [{"type": "linux_bond", "bonding_options": "mode=4 lacp_rate=1 miimon=100", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "interface", "name": "nic3"}], "name": "bond0"}, {"device": "bond0", "type": "vlan", "addresses": [{"ip_netmask": "172.17.1.217/24"}], "vlan_id": 11}, {"device": "bond0", "type": "vlan", "addresses": [{"ip_netmask": "172.17.2.217/24"}], "vlan_id": 12}]}]}

And os-net-config will manage to configure the network on your nodes.
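
The file edits in steps 1 and 2 boil down to two sed substitutions. As a minimal sketch, here they are demonstrated on scratch copies with hypothetical sample lines, since the real targets live under /httpboot on the undercloud (and need sudo there):

```shell
# Demonstrate the sed substitutions from steps 1 and 2 on scratch files;
# on a real undercloud the targets are /httpboot/inspector.ipxe and the
# per-node /httpboot config files. The sample line contents are made up.
work=$(mktemp -d)

# Step 1: append the kernel args after "initrd=agent.ramdisk"
echo 'kernel agent.kernel initrd=agent.ramdisk || goto retry_boot' > "$work/inspector.ipxe"
sed -i 's/initrd=agent\.ramdisk/initrd=agent.ramdisk net.ifnames=0 biosdevname=0/' "$work/inspector.ipxe"

# Step 2: the substitution performed inside the while loop
echo 'append initrd=agent.ramdisk coreos.configdrive=0' > "$work/config"
sed -i 's/ramdisk coreos/ramdisk net.ifnames=0 biosdevname=0 coreos/' "$work/config"

cat "$work/inspector.ipxe" "$work/config"
```

The loop in step 2 is needed because the per-node config files are regenerated during provisioning, so a one-shot edit would be overwritten.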

I tried this solution last week but I don't know why it didn't work then. I retried today with the latest puddle and it seems to work.

I would like to know if this fix will be supported. But, more importantly for the customer, I would like to have a timeline for a proper fix in OSP. The customer will replicate this deployment several times and doesn't really want to have this ugly fix in his KB.

Comment 18 Alexandre Maumené 2016-12-06 06:17:26 UTC
I've just realized a wrong choice of words: it's a workaround, not a fix.

Comment 20 Bob Fournier 2016-12-06 14:11:34 UTC
Alexandre - it's good that the workaround worked; it's more evidence that this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286

At this point can we mark this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286?

Comment 21 Bob Fournier 2016-12-06 21:07:56 UTC
One more bit of evidence that this is a duplicate of BZ 1388286 is this log message:
messages:Dec  1 20:45:52 localhost NetworkManager[407]: <info>  [1480643152.5479] device (eno1): set-hw-addr: set-cloned MAC address to 14:02:EC:83:E8:3D (permanent)

Comparing this with the "screen shot wrong mac" attachment, 14:02:EC:83:E8:3D is
the MAC that is being assigned to both eno1 and eno50 even though it belongs
to eno50 (eno1 should have MAC 94:18:82:09:0a:34).

This same "set-hw-addr: set-cloned MAC address to <mac> (permanent)" is seen
on the interface when the duplicated mac is written here: https://bugzilla.redhat.com/show_bug.cgi?id=1388286#c0

Comment 22 Thomas Haller 2016-12-07 10:50:01 UTC
(In reply to Alexandre Maumené from comment #9)

> So back to the RPMs. Thanks for the link but I already have it from the
> other BZ. I just can't find where to download the RPM out of this webpage.
> Sorry for my ignorance but I'm not a developer and I've never worked with
> this tool.

The previous scratch build expired (which is non-obvious in the web UI).
Please find a new scratch build here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12209250

(This is the intended solution for 1388286. It would be great if you could verify that it solves your issue.)

Comment 23 Bob Fournier 2016-12-07 14:23:30 UTC
Thanks Thomas!

Alexandre - it would be great if you have a chance to verify this fix using Thomas's rpm. This is just intended for testing, though.

Note that in 1384187, Gonéri Le Bouder did this to patch the ramdisk image with the new rpm:
gzip -dc </httpboot/agent.ramdisk | cpio --extract
cp -r /tmp/my_rpms tmp
chroot .
rpm -Uvh /tmp/my_rpms/*.rpm
exit
find . | cpio -oc | gzip -c -9 >| ~/agent.ramdisk
cp ~/agent.ramdisk /httpboot/agent.ramdisk
restorecon -v /httpboot

The overcloud image may also need patching with the same rpm to ensure consistent naming.

Comment 24 Alexandre Maumené 2016-12-08 04:57:46 UTC
Hi,

Thanks for your answers.
I've tried the rpms (I added them to the ramdisk and the undercloud). Introspection works and reports the interfaces correctly (but that was already the case previously), but when I deploy, the provisioning doesn't work. It seems that the ramdisk is not able to match against a node to deploy and just reboots the node quickly (so I'm not able to log in and have a look at what's going on). I'm using node placement with the same configuration that worked with my workaround. I could provide access to the environment remotely for a short time if you want to have a look yourself.

Comment 25 Thomas Haller 2016-12-08 12:22:59 UTC
(In reply to Alexandre Maumené from comment #24)
> I've tried the rpm [...]

Hi Alexandre,

Sorry, I don't fully understand.

Are you unable to test 1.4.0-13.test.rh1388286.02.el7_3, or are you saying that the test package doesn't fix your issue?

In the latter case, please try to gather logs from NetworkManager. Preferably, enable them via a config file '/etc/NetworkManager/conf.d/99-more-logging.conf' with
   [logging]
   level=TRACE
   domains=ALL

Thanks!

Comment 26 Bob Fournier 2016-12-08 13:08:56 UTC
Alexandre - are you still seeing the duplicate mac address as in the attached screen shot or is that issue fixed?  We can take a look at the template and node configs once we get past the duplicate mac issue.  It would also be useful to see the ramdisk logs.

Comment 27 Bob Fournier 2016-12-08 18:35:50 UTC
Also, a useful log to check for the duplicate mac issue would be the node logs during introspection.  These are normally only sent to the undercloud on an introspection failure, but to always retrieve them you can do the following
on the undercloud:
1. In  /etc/ironic-inspector/inspector.conf, add this line:
always_store_ramdisk_logs = true

You can put it right under the commented out default line:
#always_store_ramdisk_logs = false

2. restart ironic-inspector
sudo systemctl restart openstack-ironic-inspector.service

3. Run introspection

The logs should be in a .gz file in /var/log/ironic-inspector/ramdisk/
identified by node uuid, e.g.
869fd77d-667b-411d-a321-b41798a3e12e_20161207-195042.685401.tar.gz
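
Steps 1 and 2 can be scripted; as a sketch, here is the config edit demonstrated on a scratch copy of inspector.conf (on the undercloud the file is /etc/ironic-inspector/inspector.conf, the sed needs sudo, and the [processing] section header shown is an assumption):

```shell
# Append the override right under the commented-out default, as suggested
# above; demonstrated on a scratch file (the [processing] section is assumed).
conf=$(mktemp)
printf '%s\n' '[processing]' '#always_store_ramdisk_logs = false' > "$conf"
sed -i '/^#always_store_ramdisk_logs/a always_store_ramdisk_logs = true' "$conf"
cat "$conf"
# On the real undercloud, follow with step 2:
#   sudo systemctl restart openstack-ironic-inspector.service
```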

Comment 28 Bob Fournier 2016-12-09 02:05:15 UTC
Alexandre - I realize it was not explained above, which may be the source of the problems with the newly-patched images: the patched agent.ramdisk also needs to be updated in glance, as that is where the deployment command will take it from during the provisioning phase (introspection uses the image in /httpboot).  After updating the images, the nodes will need to be re-imported, or else they will have the wrong ramdisk id.

Recommended steps are:

1. Copy the image with the new NetworkManager rpms to the directory where all your images are, e.g.:

cp agent.ramdisk ~/images/ironic-python-agent.initramfs

2. Delete images in glance

glance image-list

For all images, using the uuids from 'image-list' above, run
glance image-delete <uuid>

3. Upload all images including patched one.

openstack overcloud image upload --image-path /home/stack/images/

4. Delete all ironic nodes

ironic node-list

For all nodes, using the uuids from 'node-list', run
ironic node-delete <uuid>

5. Import nodes

openstack baremetal import instackenv.json

6. Run introspection

openstack baremetal introspection bulk start

7. Run deploy command with whatever options you are using

openstack overcloud deploy …
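
The delete loops in steps 2 and 4 follow the same uuid-loop pattern. A hedged sketch, using stub functions in place of the glance/ironic clients so it can run without an undercloud (on the real system, substitute `glance image-list`/`glance image-delete` and the matching ironic pair):

```shell
# The uuid-loop pattern from steps 2 and 4, with stubs standing in for the
# real clients; the uuids below are fabricated placeholders.
image_list()   { printf '%s\n' 11111111-aaaa 22222222-bbbb; }  # stub for: glance image-list
image_delete() { echo "deleted image $1"; }                    # stub for: glance image-delete <uuid>

for uuid in $(image_list); do
  image_delete "$uuid"
done
```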


Let me know if I can help and we can find a common time to look at this. Thanks.

BTW, eventually the overcloud-full image should be updated with the new rpms using the virt-customize command as described here: https://access.redhat.com/articles/1556833#sect5, but that can wait until after provisioning is working.

Comment 29 Alexandre Maumené 2016-12-09 02:32:21 UTC
Hi Bob,

Sorry for the lack of an answer, and if I was not clear enough in my last reply.

Yesterday I tried exactly what you are describing. I can confirm that the duplicate MAC address is not happening any more. But I wasn't able to get a successful deployment, because of what I reported in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1403083

I don't think it's related, so I've opened another BZ, but I can't confirm 100% that the fix for the duplicate MAC is working, because this other issue is blocking any deployment from succeeding.

Regards,

Comment 30 Bob Fournier 2016-12-09 02:55:53 UTC
Alexandre - yes, I am seeing the same thing now; see my update in 1403083.  I think it is due to the way the rpms were added to the image (comment 23); at least with my setup I'm seeing, for example:

[stack@host01 test-of-new-image]$ ls -al ./etc/NetworkManager/dispatcher.d/04-iscsi
-rwxr-xr-x. 1 stack stack 100 Dec  8 20:27 ./etc/NetworkManager/dispatcher.d/04-iscsi

These files are owned by stack, which is what is causing the error.  I think the sequence in comment 23 must be run as root.

It's strange that this image works fine in the introspection stage.

Comment 31 Bob Fournier 2016-12-09 14:09:40 UTC
I verified on my setup that the ramdisk image with the new rpms is functionally fine when it is generated as root per comment 23.  As the duplicate mac doesn't occur on my H/W, I haven't been able to confirm that it fixes this duplicate mac issue.  I've closed out https://bugzilla.redhat.com/show_bug.cgi?id=1403083.

Please regenerate the image as root and retry it.  Thanks.

Note: if the deploy works, I'd recommend adding the new NetworkManager packages to the overcloud-full image using the instructions here: https://access.redhat.com/articles/1556833#sect5

Comment 34 Bob Fournier 2016-12-19 19:07:04 UTC
Marking as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286

*** This bug has been marked as a duplicate of bug 1388286 ***

