Bug 1275371 - Network management bridge was not created during the hosted-engine deployment.
Summary: Network management bridge was not created during the hosted-engine deployment.
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Assignee: Ido Barkan
QA Contact: Meni Yakove
URL:
Whiteboard: network
Depends On:
Blocks: 1252796
 
Reported: 2015-10-26 17:08 UTC by Nikolai Sednev
Modified: 2016-02-10 19:15 UTC
CC: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1277937
Environment:
Last Closed: 2015-11-10 10:05:45 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
logs from the host (black) (131.15 KB, application/x-gzip)
2015-10-26 17:09 UTC, Nikolai Sednev

Description Nikolai Sednev 2015-10-26 17:08:29 UTC
Description of problem:
Network management bridge was not created during the hosted-engine deployment.
Deploying the HE on Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev) over iSCSI failed because configuration of the management bridge on the host failed during HE deployment: the bridge was not created and connectivity to the host was lost.
After restarting the host manually, the bridge was created, but the host's FQDN became "localhost" instead of the name originally received from DHCP.

Version-Release number of selected component (if applicable):
ovirt-node-plugin-rhn-3.2.3-23.el7.noarch
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-node-branding-rhev-3.2.3-23.el7.noarch
ovirt-node-selinux-3.2.3-23.el7.noarch
ovirt-hosted-engine-ha-1.2.7.2-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-23.el7.noarch
ovirt-node-plugin-snmp-3.2.3-23.el7.noarch
ovirt-node-3.2.3-23.el7.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
libvirt-1.2.8-16.el7_1.4.x86_64
mom-0.4.1-5.el7ev.noarch
qemu-kvm-rhev-2.1.2-23.el7_1.10.x86_64
vdsm-4.16.27-1.el7ev.x86_64
sanlock-3.2.2-2.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Install a clean Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev) on the host.
2. Deploy the HE over iSCSI via the TUI.

Actual results:
The management bridge is not created and the deployment fails, leaving the user disconnected from the host.

Expected results:
Deployment should succeed.

Additional info:
See logs attached.

Comment 1 Nikolai Sednev 2015-10-26 17:09:10 UTC
Created attachment 1086560 [details]
logs from the host (black)

Comment 2 Sandro Bonazzola 2015-10-27 12:07:58 UTC
The real reason for the failure is:
Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 554, in setupNetworks
  File "/usr/share/vdsm/API.py", line 1398, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'

This looks like a Node issue.
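
For context, EBUSY from unlink() is the signature of removing a file that is the target of a bind mount, which is how Node persists configuration files. A minimal sketch reproducing the error (scratch temp files stand in for Node's real layout; requires root):

# Unlinking a bind-mount target fails with EBUSY, the same error
# vdsm's rmFile hits on ifcfg-rhevm. Paths here are scratch files,
# not Node's actual /config layout.
import os
import subprocess
import tempfile

fd, persisted = tempfile.mkstemp()  # stands in for the persisted copy
os.close(fd)
fd, target = tempfile.mkstemp()     # stands in for the /etc/... path
os.close(fd)

subprocess.check_call(['mount', '--bind', persisted, target])
try:
    os.unlink(target)  # OSError: [Errno 16] Device or resource busy
except OSError as e:
    print('unlink failed as on Node: %s' % e)
finally:
    subprocess.check_call(['umount', target])
    os.unlink(target)
    os.unlink(persisted)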

Comment 3 Ryan Barry 2015-10-27 14:58:34 UTC
This is a Node issue in the sense that vdsm is attempting to unlink a bind-mounted file. But bind mounting is part and parcel of persistence, and the persistence of ifcfg files is handled directly by vdsm. There is no clean way to handle this from the Node side: ifcfg-rhevm does not exist before hosted-engine-setup runs, and we have no daemon monitoring for persistence or wrapping calls to unlink.

There are two problems:

MainProcess|Thread-17::INFO::2015-10-26 16:29:36,759::__init__::507::root.ovirt.node.utils.fs::(_persist_file) File "/etc/sysconfig/network-scripts/ifcfg-enp4s0" successfully persisted
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,761::utils::739::root::(execCmd) /usr/sbin/ifdown enp4s0 (cwd None)
sourceRoute::DEBUG::2015-10-26 16:29:36,879::sourceroutethread::39::root::(process_IN_CLOSE_WRITE_filePath) Responding to DHCP response in /var/run/vdsm/sourceRoutes/1445876976
sourceRoute::INFO::2015-10-26 16:29:36,880::sourceroutethread::60::root::(process_IN_CLOSE_WRITE_filePath) interface enp4s0 is not a libvirt interface
sourceRoute::WARNING::2015-10-26 16:29:36,880::utils::129::root::(rmFile) File: /var/run/vdsm/trackedInterfaces/enp4s0 already removed
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,967::utils::759::root::(execCmd) FAILED: <err> = 'bridge rhevm does not exist!\n'; <rc> = 1
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:36,967::utils::739::root::(execCmd) /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup enp4s0 (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:37,125::utils::759::root::(execCmd) SUCCESS: <err> = 'Running as unit run-18679.scope.\n'; <rc> = 0
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:37,125::utils::739::root::(execCmd) /usr/bin/systemd-run --scope --slice=vdsm-dhclient /usr/sbin/ifup rhevm (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:42,693::utils::759::root::(execCmd) FAILED: <err> = 'Running as unit run-18709.scope.\n'; <rc> = 1
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:42,694::utils::739::root::(execCmd) /usr/sbin/ifdown rhevm (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,078::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,078::utils::739::root::(execCmd) /usr/sbin/ifdown enp4s0 (cwd None)
MainProcess|Thread-17::DEBUG::2015-10-26 16:29:43,525::utils::759::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-17::INFO::2015-10-26 16:29:43,525::ifcfg::332::root::(restoreAtomicNetworkBackup) Rolling back logical networks configuration (restoring atomic logical networks backup)
MainProcess|Thread-17::INFO::2015-10-26 16:29:43,525::ifcfg::372::root::(restoreAtomicBackup) Rolling back configuration (restoring atomic backup)
MainProcess|Thread-17::ERROR::2015-10-26 16:29:43,526::utils::132::root::(rmFile) Removing file: /etc/sysconfig/network-scripts/ifcfg-rhevm failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 126, in rmFile
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'
MainProcess|Thread-17::ERROR::2015-10-26 16:29:43,526::supervdsmServer::106::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 104, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 224, in setupNetworks
    return setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/network/api.py", line 696, in setupNetworks
  File "/usr/share/vdsm/network/configurators/__init__.py", line 54, in __exit__
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 75, in rollback
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 454, in restoreBackups
  File "/usr/share/vdsm/network/configurators/ifcfg.py", line 375, in restoreAtomicBackup
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 126, in rmFile
OSError: [Errno 16] Device or resource busy: '/etc/sysconfig/network-scripts/ifcfg-rhevm'

The first problem is that configuring the networks failed and VDSM attempted to roll back. hosted-engine-setup handles creating that bridge, and I have not seen this issue with NFS-backed hosted engines in recent testing.

Nikolai: does this work with other storage backends?

The second problem is that, even though vdsm.api.network checks whether it is running on a Node and imports the Node persistence functions, vdsm.network.configurators.ifcfg.ConfigWriter.restoreAtomicBackup uses utils.rmFile directly, without checking whether it is running on a Node or whether the file is persisted. That should be filed as a separate (low-priority) bug.
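
A hedged sketch of the Node-aware removal this suggests: check for a persisted master copy, drop the bind mount, then unlink. The /config layout follows Node's persistence convention, but the helper below is illustrative and not existing vdsm or Node API:

# Illustrative Node-aware rmFile: on EBUSY, assume a persistence bind
# mount is pinning the path, unmount it, and retry the unlink.
import errno
import os
import subprocess

NODE_CONFIG_DIR = '/config'  # Node's store of persisted copies (assumption)

def rm_file_node_aware(path):
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno == errno.EBUSY and os.path.isfile(NODE_CONFIG_DIR + path):
            subprocess.check_call(['umount', path])  # drop the bind mount
            os.unlink(path)                          # now removable
            os.unlink(NODE_CONFIG_DIR + path)        # drop the master copy
        elif e.errno != errno.ENOENT:
            raise

With something like this in restoreAtomicBackup's removal path, the rollback would not abort on ifcfg-rhevm.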

The problem is with configuring networks. Reassigning back.

Comment 4 Nikolai Sednev 2015-10-27 16:52:28 UTC
I have only tried this on iSCSI, not yet with NFS.

Comment 5 Ying Cui 2015-10-28 03:23:58 UTC
I encountered bug 1270587 on a RHEL 7.2 host before.
Bug 1270587 - [hosted-engine-setup] Deployment fails in setup networks, 'bridged' is configured as 'True' by default by vdsm
Bug 1263311 - setupNetworks fails with a KeyError exception on 'bridged'
Version:
# rpm -qa kernel ovirt-hosted-engine-setup vdsm
kernel-3.10.0-322.el7.x86_64
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.9-1.el7ev.noarch
kernel-3.10.0-320.el7.x86_64
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 Beta (Maipo)

Per this bug's description, I can also reproduce "failure in configuring the management bridge on the RHEL 7.2 host during deployment of the HE, bridge was not created and connectivity was lost to the RHEL 7.2 host" (with ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch and vdsm-4.17.9-1.el7ev.noarch), and I tried this with NFSv4.

So this is not a Node-specific issue.

Comment 6 Fabian Deutsch 2015-11-02 08:50:45 UTC
(In reply to Ryan Barry from comment #3)
…
> The second problem is that, even though vdsm.api.network checks whether it's
> running on a node and imports the node persistence functions,
> vdsm.network.configurators.ifcfg.ConfigWriter.restoreAtomicBackup directly
> uses utils.rmFile without checking whether it's running on node, and whether
> that file is persisted, which should be a different (low priority bug).

This should be fixed on the vdsm side, moving it over.

Comment 10 Dan Kenigsberg 2015-11-05 15:15:37 UTC
Nikolai, rhev-h-3.5.6 is going to be released on top of rhel-7.2.
Please reproduce the problem on a true rhev-h-3.5.6 build in order to block that version.

Comment 11 Nikolai Sednev 2015-11-09 17:00:31 UTC
(In reply to Dan Kenigsberg from comment #10)
> Nikolai, rhev-h-3.5.6 is going to be released on top of rhel-7.2.
> Please reproduce the problem on a true rhev-h-3.5.6 build in order to block
> that version.

Just for reference:
Following comment #6, Ying found this issue on RHEL 7.2 BETA, while RHEV-H 7.2 (20151105.132.el7ev) was built on RHEL 7.2, not BETA. Since RHEL 7.2 BETA was actually built on RHEL 7.1, I guess the same root cause was reproduced on RHEV-H 7.1 (20151015.0.el7ev).

Test results:
I failed to reproduce the issue originally found on Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev) when testing Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151105.132.el7ev), installed on the host from DiskOnKey (boot from flash):

Host:
Linux version 3.10.0-327.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 17:29:29 EDT 2015
ovirt-node-selinux-3.2.3-26.el7.noarch
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-node-branding-rhev-3.2.3-26.el7.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
ovirt-node-plugin-rhn-3.2.3-26.el7.noarch
ovirt-node-3.2.3-26.el7.noarch
ovirt-hosted-engine-ha-1.2.8-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-26.el7.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
ovirt-node-plugin-snmp-3.2.3-26.el7.noarch
mom-0.4.1-5.el7ev.noarch
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.16.29-1.el7ev.x86_64
sanlock-3.2.4-1.el7.x86_64

RHEVM Version: 3.5.6.2-0.1.el6ev was installed on iSCSI, over PXE:
rhevm-guest-agent-gdm-plugin-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-guest-agent-kdm-plugin-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-pam-module-1.0.10-2.el6ev.x86_64
rhevm-guest-agent-debuginfo-1.0.10-2.el6ev.x86_64
rhevm-3.5.6.2-0.1.el6ev.noarch
Linux version 2.6.32-573.7.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Sep 10 13:42:16 EDT 2015

Comment 12 Fabian Deutsch 2015-11-10 09:13:58 UTC
To clarify: ovirt-node-3.2.3-26.el7.noarch (as noted by Nikolai in comment 11) is for RHEV 3.5.z, so Nikolai is seeing this bug on RHEV-H 3.5 for RHEV 3.5.6.

In addition, IIUIC, this is a TestBlocker for them.

To me the bug looks like a missing unpersist call.
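
In Node's terms, "unpersist" means dropping the bind mount that pins the file and removing the master copy under /config. A hedged sketch of that step (the function below is illustrative; it is not the actual ovirt.node.utils.fs API, which is not shown in this bug):

# Illustrative unpersist, mirroring what Node's persistence layer
# undoes: unmount the pinned path, then delete the persisted copy.
import os
import subprocess

def unpersist(path, config_root='/config'):
    subprocess.check_call(['umount', path])  # unpin the /etc/... file
    persisted = os.path.join(config_root, path.lstrip('/'))
    if os.path.isfile(persisted):
        os.unlink(persisted)

Calling something like this before utils.rmFile would let the rollback in restoreAtomicBackup proceed without EBUSY.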

Comment 13 Dan Kenigsberg 2015-11-10 10:05:45 UTC
If I understand comment 11 properly, the issue does NOT reproduce on a valid build of rhev-h-3.5.6 on top of rhel-7.2.

Therefore, I don't think we should (or even can) pursue this bug any further. Please reopen if it reproduces.

Comment 14 Fabian Deutsch 2015-11-10 10:06:20 UTC
Yes, my bad, I completely misread comment 11.

Comment 15 Nikolai Sednev 2015-11-10 11:55:24 UTC
(In reply to Dan Kenigsberg from comment #13)
> If I understand comment 11 properly, the issue does NOT reproduce on a valid
> build of rhev-h-3.5.6 on top of rhel-7.2.
> 
> Therefore, I don't think we should (or even can) pursue this bug any
> further. Please reopen if it reproduces.

IMHO it should not be closed as not a bug, but as WONTFIX; it was reproduced on 3.5.5 RHEV-H 7.1.
3.5.6 has no connection to this bug; please don't mix it up with 3.5.6.

Comment 16 Dan Kenigsberg 2015-11-10 13:35:30 UTC
The bug does not reproduce in the upcoming release of RHEV-H. Tweaking the reason.

