Description of problem:

An OS with a network interface named 'rhevm' causes the RHV 4.0 hosted-engine deployment to fail network setup with the following error:

[ INFO ] Configuring the management bridge
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to setup networks {'ovirtmgmt': {'nic': u'em1_1', 'vlan': 250, 'ipaddr': u'x.x.132.101', 'netmask': u'255.255.254.0', 'bootproto': u'none', 'gateway': u'x.x.132.1', 'defaultRoute': True}}. Error code: "29" message: "ERROR : [/usr/sbin/ifup] ERROR: could not add vlan 250 as em1_1.250 on dev em1_1"
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20160927184322.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160927184125-ninavq.log

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup 2.0.1.5-1, installed on an RHV host deployed from RHVH-4.0-20160907.4-RHVH-x86_64-dvd1.iso

How reproducible:
100%

Steps to Reproduce:
1. Use the RHVH 4.0 ISO to build a hypervisor.
2. Name a network interface 'rhevm' (a VLAN on the interface was in use; this may not be required).
3. Run 'hosted-engine --deploy'.

Actual results:
'hosted-engine --deploy' fails during stage 'Misc configuration' with error 29: "could not add vlan 250 as em1_1.250 on dev em1_1"

Expected results:
'hosted-engine --deploy' installs and configures the hosted-engine RHV Manager.

Additional info:
This was a new install of an RHVH 4.0 host for a greenfield RHV environment. The only way to fix the situation was to manually rename the 'rhevm' interface to 'em1_1.250', both in the DEVICE name inside the config file and in the config file name under /etc/sysconfig/network-scripts/. For good measure, I also stopped and disabled NetworkManager on the RHVH 4.0 OS. After these changes, 'hosted-engine --deploy' ran successfully.
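For reference, the renamed config file ended up looking roughly like this (reconstructed from the values above, with the same x.x redaction; the exact options on the customer's system may have differed slightly):

# /etc/sysconfig/network-scripts/ifcfg-em1_1.250 (approximate reconstruction)
DEVICE=em1_1.250
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=x.x.132.101
NETMASK=255.255.254.0
GATEWAY=x.x.132.1
DEFROUTE=yes
NM_CONTROLLED=no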
Please attach:
/var/log/ovirt-hosted-engine-setup/*
/var/log/vdsm/*

How did you originally rename it to 'rhevm'? Did you then reboot and verify that everything worked (before starting 'hosted-engine --deploy')?
Jason, may I also ask how you've created the 'rhevm' interface, and why? Is it a bridge? And why is this marked as urgent severity?
The customer named the interface 'rhevm' during the original install via the RHVH 4.0 ISO, using the Anaconda OS installer interface. The interface was not in a bond, but a single NIC (em1_1) carrying a single VLAN (VLAN 250 on em1_1). The system was rebooted after changing the NIC interface name from 'rhevm' to 'em1_1.250'. After the reboot, the network functioned correctly as before, and I again worked with the customer to run 'hosted-engine --deploy'. There were no problems with the RHVM deployment after the network interface was renamed and the system was rebooted.

I guess the severity could be lowered. But if a customer names an interface 'rhevm' during the OS install, the 'hosted-engine --deploy' process will fail without stating why, other than an error 29 about setting up the network. IMHO, at a minimum, the hosted-engine installer should report 'why' the deployment failed, stating that the network uses the reserved name 'rhevm'. The other option would be to have the hosted-engine installer automatically fix this issue during deployment (rename the network interface to the NIC.VLAN-id name). I have not tested whether the same happens without a VLAN (naming interface em1_1 itself 'rhevm'), nor have I tested this with any bonding or bridging on the NIC.

Steps from original problem to fix:
1. Install the OS via the RHVH 4.0 ISO.
2. During the OS install, name the management network interface 'rhevm'.
3. After the OS install, the network functions, but hosted-engine fails to deploy the manager.
4. Rename the 'rhevm' network interface to 'em1_1.250' (both the ifcfg file name and the DEVICE name in the ifcfg file); a rough sketch of the commands follows below.
5. Reboot the OS.
6. Verify the network is working.
7. Run 'hosted-engine --deploy' again; this time it correctly deploys an RHV Manager.
8. The RHVM 4.0 hosted engine boots and is available for use.

I will be attaching the files requested.
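For anyone hitting the same thing, the rename in step 4 boils down to something like the following (a sketch of what I did, assuming the same em1_1/VLAN 250 layout; adjust the NIC name and VLAN id to your environment):

cd /etc/sysconfig/network-scripts
# rename the config file and the device it defines
mv ifcfg-rhevm ifcfg-em1_1.250
sed -i 's/^DEVICE=rhevm$/DEVICE=em1_1.250/' ifcfg-em1_1.250
# what I did for good measure (optional)
systemctl stop NetworkManager
systemctl disable NetworkManager
# reboot so the VLAN device comes back up under the new name
reboot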
Jason, I believe you still have not explained why the NIC was renamed, and how. It is highly unusual. 'rhevm' used to be the name of the management network. Having a NIC named that way is bound to fail at one point, just as trying to define a VM network named "eth0" would. We recommend that users stick to systemd-allocated NIC names. Can your customer do the same?
It would be nicer for hosted-engine --deploy to fail more verbosely. But from my perspective, the more important bit is to deter users from playing with NIC names.
Simone, can you please review the logs and see what the real failure is here?
The issue seems to be here: the VLAN interface over em1_1 was named 'rhevm' instead of 'em1_1.250'. The user chose to create the management bridge over 'rhevm', but then hosted-engine-setup sent:

2016-09-27 18:00:35 DEBUG otopi.plugins.gr_he_common.network.bridge bridge._misc:400 networks: {'ovirtmgmt': {'nic': u'em1_1', 'vlan': 250, 'ipaddr': u'172.27.132.101', 'netmask': u'255.255.254.0', 'bootproto': u'none', 'gateway': u'172.27.132.1', 'defaultRoute': True}}

with 'nic': u'em1_1' and 'vlan': 250. And it failed:

2016-09-27 18:00:39 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/network/bridge.py", line 403, in _misc
    _setupNetworks(conn, networks, bonds, options)
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-common/network/bridge.py", line 421, in _setupNetworks
    'message: "%s"' % (networks, code, message))
RuntimeError: Failed to setup networks {'ovirtmgmt': {'nic': u'em1_1', 'vlan': 250, 'ipaddr': u'172.27.132.101', 'netmask': u'255.255.254.0', 'bootproto': u'none', 'gateway': u'172.27.132.1', 'defaultRoute': True}}. Error code: "29" message: "ERROR : [/usr/sbin/ifup] ERROR: could not add vlan 250 as em1_1.250 on dev em1_1"
2016-09-27 18:00:39 ERROR otopi.context context._executeMethod:151 Failed to execute stage 'Misc configuration': Failed to setup networks {'ovirtmgmt': {'nic': u'em1_1', 'vlan': 250, 'ipaddr': u'172.27.132.101', 'netmask': u'255.255.254.0', 'bootproto': u'none', 'gateway': u'172.27.132.1', 'defaultRoute': True}}. Error code: "29" message: "ERROR : [/usr/sbin/ifup] ERROR: could not add vlan 250 as em1_1.250 on dev em1_1"

If we check the supervdsm logs, it tried to remove /etc/sysconfig/network-scripts/ifcfg-em1_1.250 (but the VLAN interface was named 'rhevm'):

MainProcess|jsonrpc.Executor/2::DEBUG::2016-09-27 18:00:39,787::ifcfg::404::root::(restoreAtomicBackup) Removing empty configuration backup /etc/sysconfig/network-scripts/ifcfg-em1_1.250
MainProcess|jsonrpc.Executor/2::DEBUG::2016-09-27 18:00:39,787::ifcfg::302::root::(_removeFile) Removed file /etc/sysconfig/network-scripts/ifcfg-em1_1.250
MainProcess|jsonrpc.Executor/2::INFO::2016-09-27 18:00:39,788::ifcfg::409::root::(restoreAtomicBackup) Restored /etc/sysconfig/network-scripts/ifcfg-em1_1.250

and so it failed when it tried to bring up em1_1.250, since VLAN 250 was still in use by the 'rhevm' device.
MainProcess|jsonrpc.Executor/2::ERROR::2016-09-27 18:00:39,789::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 94, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 248, in setupNetworks
    _setup_networks(networks, bondings, options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 275, in _setup_networks
    netswitch.setup(networks, bondings, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 117, in setup
    _setup_legacy(legacy_nets, legacy_bonds, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 138, in _setup_legacy
    bondings, _netinfo)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 471, in add_missing_networks
    _netinfo=_netinfo, **attrs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 180, in wrapped
    return func(network, configurator, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/legacy_switch.py", line 250, in _add_network
    net_ent_to_configure.configure(**options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/models.py", line 186, in configure
    self.configurator.configureBridge(self, **opts)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 110, in configureBridge
    bridge.port.configure(**opts)
  File "/usr/lib/python2.7/site-packages/vdsm/network/models.py", line 142, in configure
    self.configurator.configureVlan(self, **opts)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 118, in configureVlan
    _ifup(vlan)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 850, in _ifup
    _exec_ifup(iface, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 807, in _exec_ifup
    _exec_ifup_by_name(iface.name, cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/network/configurators/ifcfg.py", line 793, in _exec_ifup_by_name
    raise ConfigNetworkError(ERR_FAILED_IFUP, out[-1] if out else '')
ConfigNetworkError: (29, 'ERROR : [/usr/sbin/ifup] ERROR: could not add vlan 250 as em1_1.250 on dev em1_1')
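In other words, given 'nic' and 'vlan' from setupNetworks, the legacy ifcfg configurator derives the VLAN device name, and therefore the ifcfg file it backs up, removes and rewrites, by plain concatenation. Schematically (an illustrative sketch only, not the actual vdsm code; the function name here is made up):

# Illustrative sketch only, not the actual vdsm code.
def expected_vlan_device(nic, vlan_id):
    # vdsm assumes the RAW_PLUS_VID scheme: <nic>.<vid>
    return '%s.%s' % (nic, vlan_id)

# setupNetworks was given nic=em1_1, vlan=250, so vdsm backs up,
# removes and rewrites /etc/sysconfig/network-scripts/ifcfg-em1_1.250,
# while the VLAN was actually defined in .../ifcfg-rhevm.
print('ifcfg-' + expected_vlan_device('em1_1', 250))  # -> ifcfg-em1_1.250

The existing 'rhevm' definition is therefore never touched, and ifup fails with error 29 because VLAN 250 already exists on em1_1 under another name.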
OK, reproduced also using a VLAN interface named vlan0015, which is perfectly valid according to the VLAN_NAME_TYPE_PLUS_VID name type (please see https://github.com/torvalds/linux/blob/master/net/8021q/vlan.c#L245).

My initial configuration was:

[root@c72he20160922h1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1a:4a:4f:bd:10 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.124/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 172783sec preferred_lft 172783sec
    inet6 fe80::21a:4aff:fe4f:bd10/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1a:4a:4f:bd:18 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::21a:4aff:fe4f:bd18/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:1a:4a:4f:bd:19 brd ff:ff:ff:ff:ff:ff
6: vlan0015@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1a:4a:4f:bd:18 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.124/24 brd 192.168.3.255 scope global vlan0015
       valid_lft forever preferred_lft forever
    inet6 fe80::21a:4aff:fe4f:bd18/64 scope link
       valid_lft forever preferred_lft forever

[root@c72he20160922h1 ~]# ls /etc/sysconfig/network-scripts/ifcfg-*
/etc/sysconfig/network-scripts/ifcfg-eth0  /etc/sysconfig/network-scripts/ifcfg-eth1
/etc/sysconfig/network-scripts/ifcfg-lo    /etc/sysconfig/network-scripts/ifcfg-vlan0015

[root@c72he20160922h1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-vlan0015
VLAN=yes
DEVICE=vlan0015
VLAN_NAME_TYPE=VLAN_NAME_TYPE_PLUS_VID
PHYSDEV=eth1
BOOTPROTO=static
ONBOOT=yes
TYPE=Ethernet
IPADDR=192.168.3.124
NETMASK=255.255.255.0

[root@c72he20160922h1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
TYPE=Ethernet
BOOTPROTO=static
NAME=eth1
DEVICE=eth1
ONBOOT=yes

hosted-engine-setup (though I think this is not hosted-engine specific) correctly sent:

MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,487::legacy_switch::461::root::(add_missing_networks) Adding network u'ovirtmgmt'
MainProcess|jsonrpc/2::INFO::2016-10-13 14:07:37,488::netconfpersistence::58::root::(setNetwork) Adding network ovirtmgmt({'ipv6autoconf': False, 'nameservers': ['192.168.1.1', '0.0.0.0'], u'nic': u'eth1', u'vlan': 15, u'ipaddr': u'192.168.3.124', u'netmask': u'255.255.255.0', 'mtu': 1500, 'switch': 'legacy', 'dhcpv6': False, 'stp': False, 'bridged': True, u'gateway': u'192.168.3.1', u'defaultRoute': True})
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,488::legacy_switch::204::root::(_add_network) Validating network...
MainProcess|jsonrpc/2::INFO::2016-10-13 14:07:37,488::legacy_switch::215::root::(_add_network) Adding network ovirtmgmt with vlan=15, bonding=None, nic=eth1, mtu=1500, bridged=True, defaultRoute=True, options={'switch': 'legacy', 'stp': False}

So supervdsm tried to write or rewrite /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt and /etc/sysconfig/network-scripts/ifcfg-eth1.15, but nothing removed /etc/sysconfig/network-scripts/ifcfg-vlan0015, and hence the issue.
MainProcess|jsonrpc/2::INFO::2016-10-13 14:07:37,488::legacy_switch::242::root::(_add_network) Configuring device ovirtmgmt
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,489::ifcfg::496::root::(_persistentBackup) backing up ifcfg-ovirtmgmt: # original file did not exist
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,489::ifcfg::401::root::(writeBackupFile) Persistently backed up /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt (until next 'set safe config')
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,489::ifcfg::560::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt configuration:
# Generated by VDSM version 4.18.999-725.gitcfbfec8.el7.centos
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
IPADDR=192.168.3.124
NETMASK=255.255.255.0
GATEWAY=192.168.3.1
BOOTPROTO=none
MTU=1500
DEFROUTE=yes
NM_CONTROLLED=no
IPV6INIT=no
DNS1=192.168.1.1
DNS2=0.0.0.0

MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,492::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-3 /usr/sbin/ifdown ovirtmgmt (cwd None)
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,690::commands::93::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,691::ifcfg::496::root::(_persistentBackup) backing up ifcfg-eth1.15: # original file did not exist

The issue is here ------------------------------^
"original file did not exist": instead we have /etc/sysconfig/network-scripts/ifcfg-vlan0015 (which is valid in the VLAN_NAME_TYPE_PLUS_VID scheme), and nothing backs it up or removes it.

MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,692::ifcfg::401::root::(writeBackupFile) Persistently backed up /var/lib/vdsm/netconfback/ifcfg-eth1.15 (until next 'set safe config')
MainProcess|jsonrpc/2::DEBUG::2016-10-13 14:07:37,692::ifcfg::560::root::(writeConfFile) Writing to file /etc/sysconfig/network-scripts/ifcfg-eth1.15 configuration:
# Generated by VDSM version 4.18.999-725.gitcfbfec8.el7.centos
DEVICE=eth1.15
VLAN=yes
BRIDGE=ovirtmgmt
ONBOOT=yes
MTU=1500
NM_CONTROLLED=no
IPV6INIT=no
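(A side note for anyone triaging a similar setup: even when the device name does not encode the VLAN id, the tag is still visible on the kernel device, e.g. with

ip -d link show vlan0015

which reports the 802.1Q id (15 here) and the parent device, so a nonstandard name can be mapped back to the <nic>.<vid> form that the ifcfg configurator expects.)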
Correct, Vdsm is very picky about the ifcfg names it can process. Can you explain why you mind that? What created vlan0015? Can it use VLAN_NAME_TYPE_RAW_PLUS_VID instead?
(In reply to Dan Kenigsberg from comment #12)
> Correct, Vdsm is very picky about the ifcfg names it can process. Can you
> explain why you mind that? What created vlan0015? Can it use
> VLAN_NAME_TYPE_RAW_PLUS_VID instead?

I manually set it up before running hosted-engine-setup, and nothing alerted me; of course I could also have used VLAN_NAME_TYPE_RAW_PLUS_VID. It was just to reproduce the user-reported behavior on a valid system configuration (VLAN_NAME_TYPE_PLUS_VID is in theory valid; naming the interface 'rhevm' as the user did wasn't).

IMO, if we don't want to support naming schemes other than VLAN_NAME_TYPE_RAW_PLUS_VID, we should at least check and fail with a clear error, so that it is easier for the user to detect and fix the issue. For the record, a rough sketch of such a check follows below.
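Such a check could be quite small. A sketch of the idea (not a proposed patch; it assumes the 8021q module is loaded so that /proc/net/vlan/config exists, and it only covers the legacy ifcfg flow):

#!/usr/bin/python
# Rough sketch: list VLAN devices whose name does not follow <nic>.<vid>,
# the only scheme the legacy ifcfg configurator manages.
# Assumes the 8021q module is loaded (/proc/net/vlan/config present).


def nonconforming_vlans(config='/proc/net/vlan/config'):
    bad = []
    with open(config) as f:
        # skip the two header lines; rows look like "<dev> | <vid> | <parent>"
        for line in f.readlines()[2:]:
            dev, vid, parent = [field.strip() for field in line.split('|')]
            if dev != '%s.%s' % (parent, vid):
                bad.append((dev, vid, parent))
    return bad


if __name__ == '__main__':
    for dev, vid, parent in nonconforming_vlans():
        print('VLAN device %s (id %s on %s) should be named %s.%s '
              'for setupNetworks to handle it' % (dev, vid, parent, parent, vid))

Run before deployment, this would have flagged both 'rhevm' and 'vlan0015' immediately.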
Thank you, Simone. You are obviously correct, but as it stands, supporting anything other than the good old ifcfg-eth0.400 naming scheme is not on our current roadmap - unless Edy tells me that it's terribly easy to do.
"IMO, if we don't want to support naming schema other than VLAN_NAME_TYPE_RAW_PLUS_VID, we should at least check and fail with a clear error so that for the user will be easier to detect the issue and fix." I fully agree with this statement. It is not important to support a new naming convention. But please have engine setup report an error stating this is the reason. As it is now, the user is given a generic error and is left to decipher the true cause, which will likely generate a support call. Thank you for the work put in to this.