Bug 1293422

Summary: Overcloud provisioning network interface created with incorrect MAC address, suspected (old) FW issue
Product: Red Hat OpenStack
Reporter: Bradford Nichols <bradnichols>
Component: rhosp-director
Assignee: Hugh Brock <hbrock>
Status: CLOSED NOTABUG
QA Contact: Shai Revivo <srevivo>
Severity: medium
Priority: unspecified
Version: 8.0 (Liberty)
CC: dsneddon, jcoufal, mburns, mlopes, rhel-osp-director-maint
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: All
Doc Type: Known Issue
Doc Text:
IBM x3550 M5 servers require minimum firmware versions to work with Red Hat OpenStack Platform. Consequently, older firmware levels must be upgraded prior to deployment. Affected systems need to be upgraded to the following versions (or newer): DSA 10.1, IMM2 1.72, UEFI 1.10, Bootcode NA, Broadcom GigE 17.0.4.4a. After upgrading the firmware, deployment should proceed as expected.
Last Closed: 2016-10-14 18:53:33 UTC
Type: Bug

Description Bradford Nichols 2015-12-21 17:52:47 UTC
Description of problem:
During overcloud deployment, the overcloud nodes were created with a broken provisioning (ControlPlane) network. It seems a new interface was created for the provisioning network with a new, incorrect MAC address.

Version-Release number of selected component (if applicable):
Beta2 on IBM x3550 M5 servers

How reproducible:
Yes

Steps to Reproduce:
1. Deploy an overcloud
2. Use IBM x3550 M5 servers with FW: DSA 10.0, IMM2 1.02, UEFI 1.03, Bootcode 1.38, Broadcom GigE 16.8.0

Actual results:
heat overcloud-create completes after a long time, but the overcloud nodes are unusable because they have a broken control plane network.

Expected results:
A completed usable overcloud deployment.

Additional info:
During testing I found that deploying the Beta 2 overcloud to one group of servers succeeded while deploying to another group failed, using the same undercloud and the same network YAML definitions. The servers all have the same vendor, model, and network configuration. The only difference I am aware of is the FW level: the failing servers run older firmware from about 12 months ago.

Servers: IBM x3550 M5

Server FW versions which worked:
FW: DSA 10.1, IMM2 1.72, UEFI 1.10, Bootcode NA, Broadcom GigE 17.0.4.4a

Server FW versions which didn’t work:
FW: DSA 10.0, IMM2 1.02, UEFI 1.03, Bootcode 1.38, Broadcom GigE 16.8.0
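
The two firmware sets above can be compared mechanically before a deployment is attempted. The following is a minimal sketch and not part of director: the helper names version_at_least and check_firmware are hypothetical, the "installed" values are simply copied by hand from the failing servers, and the comparison assumes GNU sort with the -V (version sort) option is available.

```shell
#!/usr/bin/env bash
# Sketch only: version_at_least is a hypothetical helper, not a director command.
# It succeeds when installed version $1 is equal to or newer than minimum $2,
# using GNU sort -V (version sort) to order the two strings.
version_at_least() {
    local installed="$1" minimum="$2" lowest
    lowest="$(printf '%s\n%s\n' "$installed" "$minimum" | sort -V | head -n 1)"
    [ "$lowest" = "$minimum" ]
}

# Reads lines of "<component> <installed> <minimum>" and reports each one.
check_firmware() {
    local name installed minimum
    while read -r name installed minimum; do
        if version_at_least "$installed" "$minimum"; then
            echo "OK      $name $installed (minimum $minimum)"
        else
            echo "UPGRADE $name $installed (minimum $minimum)"
        fi
    done
}

# Installed values are from the failing servers, minimums from the working ones.
# Bootcode is omitted because the working set reports it as NA.
check_firmware <<'EOF'
DSA 10.0 10.1
IMM2 1.02 1.72
UEFI 1.03 1.10
Broadcom-GigE 16.8.0 17.0.4.4a
EOF
```

With the failing servers' values, every listed component is reported as needing an upgrade. How the installed versions are collected is left open here; in this sketch they are typed in manually.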

For the servers which failed, the problem was that a new interface had been created with a new MAC address for the provisioning network. It looks like the initial PXE load, OS load, and some customization completed, but in the end a broken configuration was created for the provisioning network. In addition, I noticed that another NIC, which I had configured in my YAML files to be ignored, had been given a new interface name and was configured. I tried a fix for NIC enumeration issues that I had seen on rhos-tech, which modifies the overcloud image, but it did not correct the situation.

virt-customize -a overcloud-full.qcow2 --run-command "sed -i 's/net.ifnames=0 //g' /etc/default/grub"
virt-customize -a overcloud-full.qcow2 --run-command "grub2-mkconfig -o /boot/grub2/grub.cfg"
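
To make clear what the sed expression in the first command does, here is a small sketch that applies the same expression to a sample kernel command line outside the image. The sample GRUB_CMDLINE_LINUX values and the helper name strip_ifnames are assumptions for illustration; the actual contents of /etc/default/grub in overcloud-full.qcow2 may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the same sed expression used with virt-customize.
strip_ifnames() {
    sed 's/net.ifnames=0 //g'
}

# Assumed sample line: the option is followed by a space, so it is removed.
printf '%s\n' 'GRUB_CMDLINE_LINUX="console=tty0 net.ifnames=0 crashkernel=auto"' | strip_ifnames

# Assumed sample line: the option is last before the closing quote, so the
# pattern (which requires a trailing space) does not match and nothing changes.
printf '%s\n' 'GRUB_CMDLINE_LINUX="console=tty0 net.ifnames=0"' | strip_ifnames
```

The second case shows that the expression only removes the option when a space follows it. Assuming the libguestfs tools that provide virt-customize are installed, virt-cat -a overcloud-full.qcow2 /etc/default/grub can be used to inspect the real file before and after the change.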

I’m not surprised that the solution can be FW sensitive where PXE and NICs are concerned. I have put in a request to the lab to have all the servers brought up to the latest available FW.

My recommendation is that the documentation or release notes should advise that (A) the director solution can be sensitive to server FW versions, and (B) as a best practice, servers should be updated to the latest FW before starting an install.

I couldn’t find anything like that in the beta or 7.0 documentation.

Regards,
Brad

Comment 2 Bradford Nichols 2016-01-04 18:22:44 UTC
Upgrading the equipment to the latest FW appears to eliminate the incorrect MAC issue.
Deployments of both non-HA and HA overcloud configurations succeed.

Could this be moved to a doc defect? That is, we need a warning or recommendation to make sure current FW is in use when deploying with OSP-D.

/Brad

Comment 4 Mike Burns 2016-04-07 21:00:12 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 6 Dan Sneddon 2016-10-14 18:53:33 UTC
Doc text was updated to indicate that a firmware upgrade is required for these particular servers.