| Summary: | Wrong MAC address set by the ramdisk on the provisioning interface leads to a failure to provision | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexandre Maumené <amaumene> | ||||||||||
| Component: | openstack-ironic-inspector | Assignee: | Bob Fournier <bfournie> | ||||||||||
| Status: | CLOSED DUPLICATE | QA Contact: | Raviv Bar-Tal <rbartal> | ||||||||||
| Severity: | urgent | Docs Contact: | |||||||||||
| Priority: | unspecified | ||||||||||||
| Version: | 10.0 (Newton) | CC: | asimonel, bfournie, dshanbha, jschluet, mburns, mlammon, pablo.iranzo, slinaber, thaller | ||||||||||
| Target Milestone: | --- | ||||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | x86_64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2016-12-19 19:07:04 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Attachments: |
Created attachment 1227127 [details]
logs of the ramdisk
Created attachment 1227128 [details]
screen shot wrong mac
As someone pointed out in an email thread, https://bugzilla.redhat.com/show_bug.cgi?id=1388286 might be a related issue.

Yes, this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286: the same MAC (14:02:ec:83:e8:c3) is being assigned to eno1 and eno50. Note that there are a few workarounds described here (which is also a duplicate of 1388286): https://bugzilla.redhat.com/show_bug.cgi?id=1384187

Hi,
I don't think this is a duplicate of the other bug, or at least not exactly the same, because:
- It only happens during node provisioning and NOT during introspection.
- The NetworkManager patch doesn't work and makes things worse, as eno1 isn't brought up on any node, so no nodes are actually deployed.
- The ethernet.cloned-mac-address=preserve fix didn't work either and ends up in the same situation.

Hi,
The RPM fix hasn't in fact been tried; I misunderstood what the colleague working on this with me said. It hasn't been tried for the simple reason that I don't know how or where to download these packages. Could you please let me know where to find them? Thanks in advance.

From https://bugzilla.redhat.com/show_bug.cgi?id=1388286, the RHEL 7.3 packages can be downloaded here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12071007

Hi,
I tried with the net.ifnames=0 biosdevname=0 ramdisk parameters and I managed to provision the nodes successfully, but then the nodes were not able to download the os-net-config/config.json. I assume it is because the introspection stored the ifnames as ethX, but since those are not predictable I can't really use this either.
So back to the RPMs. Thanks for the link but I already have it from the other BZ. I just can't find where to download the RPM out of this webpage. Sorry for my ignorance but I'm not a developer and I've never worked with this tool.
Also I'd like to know, if this works, where do we stand in terms of support for the customer?
OK, it's good you're able to provision when interface naming is turned off. Yes, using the net.ifnames=0 biosdevname=0 parameters will result in the interfaces not being renamed, i.e. ethX will be used. Are you relying on the renamed (i.e. enoX) interface names for a particular reason? Perhaps the templates will need to be modified so that os-net-config uses these new names. Could you clarify what you were seeing when "not able to download os-net-config/config.json"?

Is the "ethernet.cloned-mac-address=preserve" fix still not working for you? Were you able to use the ramdisk [...]? That workaround would not affect the interface naming.

I think we should hold off on the RPMs as those were intended for internal testing to verify the fix only, sorry about that.

Hi,
I've tried to use nicX in the templates but the os-net-config/config.json was still empty. I'm going to try with the ethX naming in it instead.
The "ethernet.cloned-mac-address=preserve" fix didn't work because, when provisioning, the ramdisk was not able to configure the provisioning NIC, hence it wasn't able to download the overcloud image and the deployment just failed.
I understand that the RPMs are definitely not a supported solution and we need to find one, as this will be in production as soon as the GA is released.

Hi,
1) When I use net.ifnames=0 biosdevname=0 only on the provisioning ramdisk, we can't deploy, since the introspection stored the enoX names and so it doesn't match any node the undercloud knows.
2) When I use net.ifnames=0 biosdevname=0 on both the introspection and the provisioning ramdisk, we can't deploy because the os-net-config/config.json is either empty (on ceph and compute nodes) or wrong (on controllers). On controllers it looks like this:
{"network_config": [{"use_dhcp": true, "type": "ovs_bridge", "name": "br-ex", "members": [{"type": "interface", "name": "nic1", "primary": true}]}]}
which makes me think it's the default configuration or something like that. Anyway, it's supposed to set up nic1 as ctlplane and eno49 & eno50 in a bond (I've attached the nic-config for controller).
3) I can't use net.ifnames=0 biosdevname=0 on the introspection ramdisk, the provisioning ramdisk and the overcloud image, as the interfaces will come up in a random order. We need to have predictable ifnames when deploying OSP.
Created attachment 1227959 [details]
nic-config for controller
Hi,
I finally managed to apply one fix. I need to know whether it's supported before handing it over to the customer.
1) Edit /httpboot/inspector.ipxe so it looks like this:
[...] initrd=agent.ramdisk net.ifnames=0 biosdevname=0 || goto retry_boot [...]
This way the interfaces reported to the undercloud have ethX names. Run the introspection.
2) Run an infinite loop like this:
while true;
do
sudo find /httpboot -name config -exec sed -i 's/ramdisk coreos/ramdisk net.ifnames=0 biosdevname=0 coreos/' {} \;
done
This way the provisioning ramdisk will boot with the arguments to ignore predictable naming, hence it will not fail.
3) Use the nicX naming scheme in the nic-config templates so your config.json looks like this:
{"network_config": [{"dns_servers": ["10.126.162.11", "10.126.162.10"], "addresses": [{"ip_netmask": "172.17.9.201/24"}, {"ip_netmask": "10.61.80.217/26"}], "routes": [{"ip_netmask": "169.254.169.254/32", "next_hop": "172.17.9.254"}, {"default": true, "next_hop": "172.17.9.254"}], "use_dhcp": false, "type": "interface", "name": "nic1"}, {"type": "ovs_bridge", "name": "br-bond", "members": [{"type": "linux_bond", "bonding_options": "mode=4 lacp_rate=1 miimon=100", "members": [{"type": "interface", "name": "nic2", "primary": true}, {"type": "interface", "name": "nic3"}], "name": "bond0"}, {"device": "bond0", "type": "vlan", "addresses": [{"ip_netmask": "172.17.1.217/24"}], "vlan_id": 11}, {"device": "bond0", "type": "vlan", "addresses": [{"ip_netmask": "172.17.2.217/24"}], "vlan_id": 12}]}]}
And os-net-config will manage to configure the network on your nodes.
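
Step 2 can also be wrapped in a small function, so that it can be tried on a scratch copy of the directory before it is pointed at /httpboot. This is only a sketch: the function name patch_pxe_configs is made up for illustration, the sed expression is the one from the loop above, and GNU sed is assumed (for -i).

```shell
#!/usr/bin/env bash
# Sketch only: patch_pxe_configs is a hypothetical helper, not part of any tool.
# It applies the sed substitution from step 2 to every file named "config"
# below the given directory.
patch_pxe_configs() {
    local dir=$1
    # Idempotent: after the first run "ramdisk coreos" no longer occurs in the
    # file, so running it again changes nothing.
    find "$dir" -name config -exec \
        sed -i 's/ramdisk coreos/ramdisk net.ifnames=0 biosdevname=0 coreos/' {} \;
}
```

From a root shell on the undercloud, the loop from step 2 then becomes `while true; do patch_pxe_configs /httpboot; done`. Adding a short `sleep` to the loop would lower the CPU load, at the cost of a wider window in which a node could fetch an unpatched config.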
I tried this solution last week but I don't know why it didn't work. I retried today with the latest puddle and it seems to work.
I would like to know if this fix will be supported. But more importantly for the customer, I would like to have a timeline for a proper fix in OSP. The customer will replicate this deployment several times and doesn't really want to have this ugly fix in its KB.
I've just realized: wrong choice of word, it's a workaround and not a fix.

Alexandre - that's good that the workaround worked; it's more evidence this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286

At this point can we mark this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286?

One more bit of evidence that this is a duplicate of BZ 1388286 is this log message:
messages:Dec 1 20:45:52 localhost NetworkManager[407]: <info> [1480643152.5479] device (eno1): set-hw-addr: set-cloned MAC address to 14:02:EC:83:E8:3D (permanent)
Comparing this with the "screen shot wrong mac" attachment, 14:02:EC:83:E8:3D is the MAC that is being assigned to both eno1 and eno50 even though it belongs to eno50 (eno1 should have MAC 94:18:82:09:0a:34). This same "set-hw-addr: set-cloned MAC address to <mac> (permanent)" is seen on the interface when the duplicated MAC is written here: https://bugzilla.redhat.com/show_bug.cgi?id=1388286#c0

(In reply to Alexandre Maumené from comment #9)
> So back to the RPMs. Thanks for the link but I already have it from the
> other BZ. I just can't find where to download the RPM out of this webpage.
> Sorry for my ignorance but I'm not a developer and I've never worked with
> this tool.
The previous scratch build expired (which is non-obvious in the web UI). Please find a new scratch build here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12209250 (this is the intended solution for 1388286. It would be great if you could verify that this solves your issue).

Thanks Thomas! Alexandre - it would be great if you have a chance to verify this fix using Thomas's rpm. This is just intended for test though. Note that in 1384187, Gonéri Le Bouder did this to patch the ramdisk image with the new rpm:
gzip -dc < /httpboot/agent.ramdisk | cpio --extract
cp -r /tmp/my_rpms tmp
chroot .
rpm -Uvh /tmp/my_rpms/*.rpm
exit
find . | cpio -oc | gzip -c -9 >| ~/agent.ramdisk
cp ~/agent.ramdisk /httpboot/agent.ramdisk
restorecon -v /httpboot
The overcloud image may also need patching with the same rpm to ensure consistent naming.

Hi,
Thanks for your answers. I've tried the rpm (I've added them to the ramdisk and the undercloud). Introspection works and reports the interfaces correctly (but that was already the case previously), but when I deploy, the provisioning doesn't work. It seems that the ramdisk is not able to match against a node to deploy and just reboots the node quickly (so I'm not able to log in and have a look at what's going on). I'm using node placement with the same configuration that worked with my workaround. I could provide access to the environment remotely for a short time if you want to have a look yourself.

(In reply to Alexandre Maumené from comment #24)
> I've tried the rpm [...]
Hi Alexandre,
Sorry, I don't fully understand. Are you unable to test 1.4.0-13.test.rh1388286.02.el7_3, or are you saying that the test package doesn't fix your issue? In the latter case, please try to gather logs from NetworkManager. Preferably, enable it via a config file '/etc/NetworkManager/conf.d/99-more-logging.conf' with
[logging]
level=TRACE
domains=ALL
Thanks!

Alexandre - are you still seeing the duplicate MAC address as in the attached screenshot, or is that issue fixed? We can take a look at the template and node configs once we get past the duplicate MAC issue. It would also be useful to see the ramdisk logs.

Also, a useful log to check for the duplicate MAC issue would be the node logs during introspection. These are normally only sent to the undercloud on an introspection failure, but to retrieve them, if you are not already doing so, you can do the following on the undercloud:
1. In /etc/ironic-inspector/inspector.conf, add this line:
always_store_ramdisk_logs = true
You can put it right under the commented-out default line:
#always_store_ramdisk_logs = false
2. Restart ironic-inspector:
sudo systemctl restart openstack-ironic-inspector.service
3. Run introspection.
The logs should be in a .gz file in /var/log/ironic-inspector/ramdisk/ identified by node UUID, e.g. 869fd77d-667b-411d-a321-b41798a3e12e_20161207-195042.685401.tar.gz

Alexandre - I realize that it was not explained above, and this may be the source of the problems with the newly patched images, that the patched agent.ramdisk also needs to be uploaded to glance, as that is where the deployment command will take it from during the provisioning phase (introspection uses the image in /httpboot). After updating the images, the nodes will need to be re-imported, or else the nodes will have the wrong ramdisk ID. Recommended steps are:
1. Copy the image with the new NetworkManager rpms to the directory where all your images are, e.g.:
cp agent.ramdisk ~/images/ironic-python-agent.initramfs
2. Delete the images in glance:
glance image-list
For all images, using the UUIDs from 'image-list' above, run:
glance image-delete <uuid>
3. Upload all images, including the patched one:
openstack overcloud image upload --image-path /home/stack/images/
4. Delete all ironic nodes:
ironic node-list
For all nodes, using the UUIDs from 'node-list', run:
ironic node-delete <uuid>
5. Import the nodes:
openstack baremetal import instackenv.json
6. Run introspection:
openstack baremetal introspection bulk start
7. Run the deploy command with whatever options you are using:
openstack overcloud deploy …
Let me know if I can help and we can find a common time to look at this. Thanks.

BTW, eventually the overcloud-full image should be updated with the new rpms using the virt-customize command as described here: https://access.redhat.com/articles/1556833#sect5, but that can wait until after provisioning is working.

Hi Bob,
Sorry for the lack of answer and if I was not clear enough in my last reply. Yesterday I tried exactly what you are describing. I can confirm that the duplicate MAC address is not happening any more.
But I wasn't able to get a successful deployment because of what I've reported in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1403083
I don't think it's related, so I've opened another BZ, but I can't confirm 100% that the fix for the duplicate MAC is working because this other issue is blocking any deployment from succeeding.
Regards,

Alexandre - yes, I am seeing the same thing now, see my update in 1403083. I think it is the way the rpms were added to the image (comment 23); at least with my setup I'm seeing, for example:
[stack@host01 test-of-new-image]$ ls -al ./etc/NetworkManager/dispatcher.d/04-iscsi
-rwxr-xr-x. 1 stack stack 100 Dec 8 20:27 ./etc/NetworkManager/dispatcher.d/04-iscsi
These files are owned by stack, which is what is causing the error. I think the commands in comment 23 must be run as root. It's strange that this image works fine in the introspection stage.

I verified on my setup that the ramdisk image with the new rpms is functionally fine when it is generated as root as in comment 23. As the duplicate MAC doesn't occur on my H/W, I haven't been able to confirm it fixes this duplicate MAC issue. I've closed out https://bugzilla.redhat.com/show_bug.cgi?id=1403083. Please regenerate the image as root and retry it. Thanks.

Note, if the deploy works, I'd recommend adding the new NetworkManager packages to the overcloud-full image using the instructions here: https://access.redhat.com/articles/1556833#sect5

Marking as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1388286

*** This bug has been marked as a duplicate of bug 1388286 ***
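
Since the error above came from files in the unpacked ramdisk tree being owned by stack, a quick ownership check before repacking may help. This is a minimal sketch, assuming the ramdisk was unpacked into a directory of its own; list_foreign_owned is a hypothetical helper name, and the default owner UID 0 (root) is an assumption about what the image expects. Some files in a real image may legitimately belong to other users, so a non-empty list is a hint rather than proof of a problem.

```shell
#!/usr/bin/env bash
# Sketch only: list_foreign_owned is a hypothetical helper name.
# Prints every path below DIR whose owner is not OWNER_UID (default 0, root).
list_foreign_owned() {
    local dir=$1
    local owner_uid=${2:-0}
    find "$dir" ! -uid "$owner_uid" -print
}
```

For the tree shown in the listing above, this could be called as `list_foreign_owned ./test-of-new-image`.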
Created attachment 1227126 [details]
introspection data

Description of problem:
When deploying, the servers boot through PXE into the deploy ramdisk. This ramdisk sends a DHCP request to get an IP from the undercloud. Randomly, the MAC address of a 10GE link (eno50) is "applied" to the provisioning NIC (eno1), as you can see in the screenshots.

Version-Release number of selected component (if applicable):
rhosp-director-images-ipa-10.0-20161102.1.el7ost.noarch
rhosp-director-images-ipa-10.0-20161129.2.el7ost.noarch

How reproducible:
Every time, on at least one node out of 9.

Steps to Reproduce:
1. Run a deploy
2. Wait
3.

Actual results:
A server on which the ramdisk sets the provisioning NIC's MAC from an interface other than the one that should be used for provisioning is not able to get an IP through DHCP, hence the server doesn't have any connectivity and can't be provisioned.

Expected results:
The ramdisk should not mess with the MAC addresses.

Additional info:
Screenshots, logs from the ramdisk (after having set a password on it) and introspection data.
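
The ramdisk logs can be scanned for the symptom described here. The following is a sketch only: find_shared_macs is a hypothetical helper name, and it relies on the "set-hw-addr: set-cloned MAC address to" log line format quoted in this report, which other NetworkManager versions may not use. It prints each MAC address that was set on more than one device, together with those devices.

```shell
#!/usr/bin/env bash
# Sketch only: find_shared_macs is a hypothetical helper name.
# Reads NetworkManager log lines on stdin and prints every MAC address that was
# set as the cloned MAC on more than one device, followed by those devices.
find_shared_macs() {
    # Reduce each matching log line to "<mac> <device>".
    sed -n 's/.*device (\([^)]*\)): set-hw-addr: set-cloned MAC address to \([0-9A-Fa-f:]*\).*/\2 \1/p' |
        awk '
            {
                mac = toupper($1)            # compare MACs case-insensitively
                if (!((mac, $2) in seen)) {  # count each device once per MAC
                    seen[mac, $2] = 1
                    count[mac]++
                    if (mac in devs) {
                        devs[mac] = devs[mac] " " $2
                    } else {
                        devs[mac] = $2
                    }
                }
            }
            END {
                for (mac in count) {
                    if (count[mac] > 1) {
                        print mac ": " devs[mac]
                    }
                }
            }
        ' | sort
}
```

It could be run against the messages file collected from the ramdisk, for example `find_shared_macs < messages`; no output means no MAC was set on two devices in that log.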