Description of problem:
Network management bridge was not created during the hosted-engine deployment. Failed to deploy the HE on Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev) over iSCSI because configuring the management bridge on the host failed during deployment of the HE: the bridge was not created and connectivity to the host was lost. After restarting the host manually, the bridge was created, but the FQDN of the host became "localhost" instead of the one originally received from DHCP.

Version-Release number of selected component (if applicable):
ovirt-node-plugin-rhn-3.2.3-23.el7.noarch
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-node-branding-rhev-3.2.3-23.el7.noarch
ovirt-node-selinux-3.2.3-23.el7.noarch
ovirt-hosted-engine-ha-1.2.7.2-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-23.el7.noarch
ovirt-node-plugin-snmp-3.2.3-23.el7.noarch
ovirt-node-3.2.3-23.el7.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
libvirt-1.2.8-16.el7_1.4.x86_64
mom-0.4.1-5.el7ev.noarch
qemu-kvm-rhev-2.1.2-23.el7_1.10.x86_64
vdsm-4.16.27-1.el7ev.x86_64
sanlock-3.2.2-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install a clean Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev) on the host.
2. Deploy HE over iSCSI via the TUI.

Actual results:
Management bridge not created and deployment failed, with the customer being disconnected from the host.

Expected results:
Deployment should pass.

Additional info:
See logs attached.
Created attachment 1086560 [details]
logs from the host (black)
The real reason for the failure is:

Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 554, in setupNetworks
  File "/usr/share/vdsm/API.py", line 1398, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'

This looks like a node issue.
This is a node issue in the sense that vdsm is attempting to unlink a bind-mounted file, but bind mounting is part and parcel of persistence, and the persistence of ifcfg files is handled directly by vdsm. There's no clean way to handle this from the node side, since ifcfg-rhevm does not exist before hosted-engine-setup runs, and we don't have any kind of daemon monitoring for persistence or wrapping calls to unlink.

There are two problems:

MainProcess|Thread-17::INFO::2015-10-26 16:29:36,759::__init__::507::root.ovirt.node.utils.fs::(_persist_file) File "/etc/sysconfig/network-scripts/ifcfg-enp4s0" successfully persisted
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,761::utils::739::root::(execCmd) /usr/sbin/ifdown enp4s0 (cwd None)
sourceRoute::DEBUG::2015-10-26 16:29:36,879::sourceroutethread::39::root::(process_IN_CLOSE_WRITE_filePath) Responding to DHCP response in /var/run/vdsm/sourceRoutes/1445876976
sourceRoute::INFO::2015-10-26 16:29:36,880::sourceroutethread::60::root::(process_IN_CLOSE_WRITE_filePath) interface enp4s0 is not a libvirt interface
sourceRoute::WARNING::2015-10-26 16:29:36,880::utils::129::root::(rmFile) File: /var/run/vdsm/trackedInterfaces/enp4s0 already removed
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,967::utils::759::root::(execCmd) FAILED: <err> = 'bridge rhevm does not exist!\n'; <rc> = 1
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,967::utils::739::root::(execCmd) /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp4s0 (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:37,125::utils::759::root::(execCmd) SUCCESS: <err> = 'Running as unit run-18679.scope.\n'; <rc> = 0
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:37,125::utils::739::root::(execCmd) /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup rhevm (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:42,693::utils::759::root::(execCmd) FAILED: <err> = 'Running as unit run-18709.scope.\n'; <rc> = 1
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:42,694::utils::739::root::(execCmd) /usr/sbin/ifdown rhevm (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,078::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,078::utils::739::root::(execCmd) /usr/sbin/ifdown enp4s0 (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,525::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-17::INFO::2015-10-26 16:29:43,525::ifcfg::332::root::(restoreAtomicNetworkBackup) Rolling back logical networks configuration (restoring atomic logical networks backup)
MainProcess|Thread-17::INFO::2015-10-26 16:29:43,525::ifcfg::372::root::(restoreAtomicBackup) Rolling back configuration (restoring atomic backup)
MainProcess|Thread-17::ERROR::2015-10-26 16:29:43,526::utils::132::root::(rmFile) Removing file: /etc/sysconfig/network-scripts/ifcfg-rhevm failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 126, in rmFile
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'
MainProcess|Thread-17::ERROR::2015-10-26 16:29:43,526::supervdsmServer::106::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 104, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 224, in setupNetworks
    return setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/network/api.py", line 696, in setupNetworks
  File "/usr/share/vdsm/network/configurators/__init__.py", line 54, in __exit__
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 75, in rollback
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 454, in restoreBackups
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 375, in restoreAtomicBackup
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 126, in rmFile
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'

The first problem is that it looks like configuring the networks failed and VDSM attempted to roll back. hosted-engine-setup handles creating that bridge, and I haven't seen this issue with NFS hosted engines in recent testing. Nikolai: does this work with other storage backends?

The second problem is that, even though vdsm.api.network checks whether it's running on a node and imports the node persistence functions, vdsm.network.configurators.ifcfg.ConfigWriter.restoreAtomicBackup uses utils.rmFile directly, without checking whether it's running on node and whether that file is persisted. That should be a different (low-priority) bug.

The main problem here is with configuring the networks. Reassigning back.
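For illustration, a minimal sketch (a hypothetical helper, not vdsm code) of how a node-aware rmFile could cope with a persisted ifcfg file. The EBUSY shows up because the persisted copy is bind-mounted over the path, so the mount has to go away before unlink() can succeed; a real fix would go through the node's unpersist machinery rather than a raw umount:

import errno
import os
import subprocess

def rm_maybe_persisted(path):
    # Try the plain unlink first; off-node this is all that's needed.
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno != errno.EBUSY:
            raise
        # EBUSY: the path is a bind-mount target (ovirt-node persistence).
        # Drop the mount, then remove the underlying file. Note that the
        # persisted copy (under /config on RHEV-H) would still need cleaning
        # up too; a full unpersist would also take care of that.
        subprocess.check_call(['umount', path])
        os.unlink(path)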
I only tried this on iSCSI, not yet with NFS.
I encountered bug 1270587 on a RHEL 7.2 host before:

Bug 1270587 - [hosted-engine-setup] Deployment fails in setup networks, 'bridged' is configured as 'True' by default by vdsm
Bug 1263311 - setupNetworks fails with a KeyError exception on 'bridged'

Version:
# rpm -qa kernel ovirt-hosted-engine-setup vdsm
kernel-3.10.0-322.el7.x86_64
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.9-1.el7ev.noarch
kernel-3.10.0-320.el7.x86_64
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)

As in this bug's description, I can also reproduce "failure in configuring the management bridge on RHEL 7.2 host during deployment of the HE, bridge was not created and connectivity was lost to the RHEL 7.2 host" (ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch, vdsm-4.17.9-1.el7ev.noarch), and I tried this with NFSv4. So it is not a node-specific issue.
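For reference, a hypothetical illustration (not taken from the attached logs) of the request shape that reaches vdsm, matching the setupNetworks(networks, bondings, **options) signature visible in the tracebacks above. Bugs 1270587/1263311 revolve around the 'bridged' key, so a caller that needs a bridge would pass it explicitly instead of relying on vdsm's default:

networks = {
    'rhevm': {               # management network; the bridge takes its name
        'nic': 'enp4s0',     # physical NIC to enslave to the bridge
        'bridged': True,     # explicit, so vdsm's default never matters
        'bootproto': 'dhcp',
    },
}
bondings = {}
options = {'connectivityCheck': True}  # vdsm rolls back if connectivity is lost
# supervdsm then invokes: setupNetworks(networks, bondings, **options)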
(In reply to Ryan Barry from comment #3)
…
> The second problem is that, even though vdsm.api.network checks whether it's
> running on a node and imports the node persistence functions,
> vdsm.network.configurators.ifcfg.ConfigWriter.restoreAtomicBackup directly
> uses utils.rmFile without checking whether it's running on node, and whether
> that file is persisted, which should be a different (low priority bug).

This should be fixed on the vdsm side; moving it over.
Nikolai, rhev-h-3.5.6 is going to be released on top of rhel-7.2. Please reproduce the problem on a true rhev-h-3.5.6 build in order to block that version.
(In reply to Dan Kenigsberg from comment #10)
> Nikolai, rhev-h-3.5.6 is going to be released on top of rhel-7.2.
> Please reproduce the problem on a true rhev-h-3.5.6 build in order to block
> that version.

Just for reference: following comment #6, Ying found this issue on RHEL 7.2 BETA, while RHEVH 7.2 (20151105.132.el7ev) was built on RHEL 7.2, NOT on the BETA. Since RHEL 7.2 BETA was actually built on top of RHEL 7.1, I guess the same root cause reproduced on RHEVH 7.1 (20151015.0.el7ev).

Test results:
Failed to reproduce the issue that was originally found on Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev). Installed RHEVH 7.2 (20151105.132.el7ev) on the host from DiskOnKey (boot from flash):

Host:
Linux version 3.10.0-327.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 17:29:29 EDT 2015
ovirt-node-selinux-3.2.3-26.el7.noarch
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-node-branding-rhev-3.2.3-26.el7.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
ovirt-node-plugin-rhn-3.2.3-26.el7.noarch
ovirt-node-3.2.3-26.el7.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-26.el7.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
ovirt-node-plugin-snmp-3.2.3-26.el7.noarch
mom-0.4.1-5.el7ev.noarch
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.16.29-1.el7ev.x86_64
sanlock-3.2.4-1.el7.x86_64

RHEVM version 3.5.6.2-0.1.el6ev was installed on iSCSI, over PXE:
rhevm-guest-agent-gdm-plugin-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-guest-agent-kdm-plugin-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-pam-module-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-debuginfo-1.0.10-2.el6ev.x86_64
rhevm-3.5.6.2-0.1.el6ev.noarch
Linux version 2.6.32-573.7.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Sep 10 13:42:16 EDT 2015
To clarify: ovirt-node-3.2.3-26.el7.noarch (as noted by Nikolai in comment 11) is for RHEV 3.5.z, thus Nikolai is seeing this bug on RHEV-H 3.5 for RHEV 3.5.6. In addition, IIUIC this is a TestBlocker for them.

To me the bug looks like a missing unpersist call.
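For illustration, a minimal sketch of what such an unpersist-aware unlink could look like. It assumes the ovirt.node.utils.fs persistence module seen in the attached logs (_persist_file); the Config class and its unpersist() method are assumptions here, not verified against this node build:

import os

def unlink_with_unpersist(path):
    try:
        # Only present on ovirt-node/RHEV-H images.
        from ovirt.node.utils.fs import Config
    except ImportError:
        os.unlink(path)  # plain host: a regular unlink is enough
        return
    # Assumed API: unpersist() drops the bind mount and the persisted copy
    # under /config, which is exactly what a raw rmFile cannot do.
    Config().unpersist(path)
    if os.path.exists(path):
        os.unlink(path)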
If I understand comment 11 properly, the issue does NOT reproduce on a valid build of rhev-h-3.5.6 on top of rhel-7.2. Therefore, I don't think we should (or even can) pursue this bug any further. Please reopen if it reproduces.
Yes, my bad, I completely misread comment 11.
(In reply to Dan Kenigsberg from comment #13)
> If I understand comment 11 properly, the issue does NOT reproduce on a valid
> build of rhev-h-3.5.6 on top of rhel-7.2.
>
> Therefore, I don't think we should (or even can) pursue this bug any
> further. Please reopen if it reproduces.

It should not be closed as NOTABUG but as WONTFIX, IMHO: it was reproduced on 3.5.5 RHEVH 7.1. 3.5.6 has no connection to this bug; please don't mix it up with 3.5.6.
The bug does not reproduce in the upcoming release of RHEV-H. Tweaking the reason.