Description of problem:
Deployment fails when deploying OpenStack with SR-IOV on compute nodes whose interface names differ, for example one compute node with interface p4p1 and another with p7p1.

This is part of the output of openstack stack failures list overcloud --long:

    Error: /sys/class/net/p4p1/device/sriov_numvfs doesn't exist. Check if p4p1 is a valid network interface supporting SR-IOV
    Error: /Stage[main]/Tripleo::Host::Sriov/Sriov_vf_config[p4p1:5]/ensure: change from absent to present failed: /sys/class/net/p4p1/device/sriov_numvfs doesn't exist. Check if p4p1 is a valid network interface supporting SR-IOV

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenStack with SR-IOV on at least 2 compute nodes with different interface names, using (for example) the following in the templates:

    NovaPCIPassthrough:
      - devname: "p4p1"
        physical_network: "tenant1"
    NeutronSriovNumVFs: "p4p1:5"

   where p4p1 is a valid interface on one of the compute nodes.
2. After the deployment fails, run openstack stack failures list overcloud --long and verify that you get the above failure.

Actual results:

Expected results:

Additional info:
All compute nodes in a role are expected to have the same NIC names. Both parameters, NovaPCIPassthrough and NeutronSriovNumVFs, are designed on that assumption. One way to work around this is to associate the differing compute nodes with different roles and provide role-specific parameters, like:

    parameter_defaults:
      ComputeSriov1Parameters:
        NeutronSriovNumVFs: "p4p1:5"
      ComputeSriov2Parameters:
        NeutronSriovNumVFs: "p7p1:5"

This role-specific feature is supported from OSP12 onwards. Ignoring the error when the interface is not found would also hide any actual problems that occur during the deployment.
Hi, I'm not sure that having one role for every compute node is a workable solution.
Franck,

Fixing this requires either NIC numbering support for NeutronSriovNumVFs or some similar solution. Maybe we should remove NeutronSriovNumVFs and support SR-IOV VFs with os-net-config (?). Please let us know what priority you give to this issue.

Note: NovaPCIPassthrough cannot be supported in the same manner. But there we should be able to give duplicate entries, as it is just configuration.
Wouldn't it be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1418335 ?
As I stated in Comment #1, it will be fixed if we assign a role per variant. But I am not sure how different the naming will be in practice. For example, in a cluster with 20 compute nodes, if 10 computes fall into one group, then it is fine. But if every 2 compute nodes have a different NIC name, then it is not practical to maintain multiple roles.
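To get a feel for how many roles the per-variant approach from Comment #1 would need, here is a minimal Python sketch. It is an illustration only, not part of TripleO: the inventory mapping, the function names, and the ComputeSriov<N> role naming are assumptions, and the VF count of 5 is taken from the example in the description.

```python
from collections import defaultdict


def group_nodes_by_nic(node_nics):
    """Group node names by the SR-IOV interface name each node reports."""
    groups = defaultdict(list)
    for node, nic in sorted(node_nics.items()):
        groups[nic].append(node)
    return dict(groups)


def role_parameters(node_nics, num_vfs=5, role_prefix="ComputeSriov"):
    """Build one role-specific parameter block per distinct interface name.

    The role names (ComputeSriov1, ComputeSriov2, ...) are hypothetical and
    are numbered in sorted order of the interface names.
    """
    groups = group_nodes_by_nic(node_nics)
    params = {}
    for index, nic in enumerate(sorted(groups), start=1):
        params[f"{role_prefix}{index}Parameters"] = {
            "NeutronSriovNumVFs": f"{nic}:{num_vfs}",
        }
    return {"parameter_defaults": params}


if __name__ == "__main__":
    # Hypothetical inventory: node name -> SR-IOV capable interface name.
    inventory = {"compute-0": "p4p1", "compute-1": "p7p1", "compute-2": "p4p1"}
    print(group_nodes_by_nic(inventory))
    print(role_parameters(inventory))
```

The number of keys under parameter_defaults is the number of roles that would have to be maintained, which is the practicality concern raised above.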
(In reply to Saravanan KR from comment #5)
> As I stated in Comment #1, it will be fixed if we assign a role per variant.
> But I am not sure in practical, how different the naming will be? For
> example in a cluster with 20 compute nodes, if 10 computes falls in a group,
> then it is fine. But if every 2 compute nodes has a different nic name, then
> it is not practical to maintain multiple roles.

If the computes are different, they will have a different number of NICs. So referencing the NICs by udev name or "nic1/2/3" won't make any difference. Also, if all compute nodes are different, we're not in a datacenter anymore but in an R&D lab ;-)
(In reply to Franck Baudin from comment #6)
> (In reply to Saravanan KR from comment #5)
> > As I stated in Comment #1, it will be fixed if we assign a role per variant.
> > But I am not sure in practical, how different the naming will be? For
> > example in a cluster with 20 compute nodes, if 10 computes falls in a group,
> > then it is fine. But if every 2 compute nodes has a different nic name, then
> > it is not practical to maintain multiple roles.
>
> If the computes are different, they will have different number of NICs. So
> referencing the NICs by udev name or "nic1/2/3" won't make any difference.
> Also, if all compute nodes are different, we're not in a datacenter anymore
> but in an R&D lab ;-)

Which means you are OK with comment #1 approach and we can close this BZ?
I saw this behavior when using a blade chassis where every node had different NIC names. I'm against closing this BZ.
(In reply to Itzik Brown from comment #8)
> I saw this behavior when using a Blade and every node had a different NIC
> names.
> I'm against closing this BZ.

This looks like a udev rule issue rather than a TripleO one. Can you open a udev BZ and close this one as a duplicate of the udev BZ, if it makes sense? Thanks!
I have a Dell PowerEdge FC430. It doesn't seem like a udev problem, because there is no such problem when provisioning the host without TripleO.
I've recently posted a patch that allows "puppet apply" to pass if the PCI device isn't available when it runs. This was required to solve an issue where a physical function is used by a guest instance when an upgrade or similar is run. Please see:

- https://review.openstack.org/#/c/478503/
- https://bugs.launchpad.net/tripleo/+bug/1701284

This might be useful for this issue as well.
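To make the trade-off concrete, here is a minimal Python sketch of the two behaviours: failing when sriov_numvfs is missing versus tolerating it. This is not the code from the review linked above, which lives in the Puppet modules. The function names, the tolerate_missing flag, and the sysfs_root parameter are assumptions made for illustration; only the sysfs path layout and the "interface:count" format come from the error output in the description.

```python
import os


def sriov_numvfs_path(interface, sysfs_root="/sys/class/net"):
    """Return the sysfs file that exposes the VF count for an interface."""
    return os.path.join(sysfs_root, interface, "device", "sriov_numvfs")


def check_vf_config(vf_config, sysfs_root="/sys/class/net", tolerate_missing=False):
    """Check an 'interface:count' entry such as 'p4p1:5'.

    Returns True if the sriov_numvfs file exists, and False if it is missing
    and tolerate_missing is set. Raises FileNotFoundError if it is missing
    and tolerate_missing is not set.
    """
    interface, _, count = vf_config.partition(":")
    if not interface or not count.isdigit():
        raise ValueError(f"expected 'interface:count', got {vf_config!r}")

    path = sriov_numvfs_path(interface, sysfs_root)
    if os.path.exists(path):
        return True
    if tolerate_missing:
        # Hypothetical lenient mode: skip the interface instead of failing.
        return False
    raise FileNotFoundError(
        f"{path} doesn't exist. Check if {interface} is a valid "
        "network interface supporting SR-IOV"
    )


if __name__ == "__main__":
    for entry in ("p4p1:5", "p7p1:5"):
        print(entry, check_vf_config(entry, tolerate_missing=True))
```

In the lenient mode a mistyped interface name is skipped silently in the same way as a genuinely absent device, which is the concern raised in comment #1 about hiding real deployment problems.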
Closing based on comment #11: the patch avoids the failure if the interface is not present. A missing interface could still be a genuine failure caused by wrong naming, but this bug should be resolved by that fix. Please re-validate the BZ and re-open it if the problem persists.