Bug 860279

Summary: 3.1 - rhevm interface is removed together with other networks
Product: Red Hat Enterprise Linux 6
Reporter: Martin Pavlik <mpavlik>
Component: ovirt-node
Assignee: Mike Burns <mburns>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: abaron, acathrow, achan, bazulay, bsarathy, cpelland, cshao, danken, eblake, fdeutsch, gklein, gouyang, hambrose, iheim, jboggs, leiwang, lpeer, mavital, mburns, mjenner, ovirt-maint, ycui, ykaul
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: network
Fixed In Version: ovirt-node-2.5.0-5.el6
Doc Type: Bug Fix
Doc Text:
When networks were removed from host network interfaces via the Setup Network dialog on a Red Hat Enterprise Virtualization Hypervisor 3.1 host, the Red Hat Enterprise Virtualization Manager (rhevm) interface was also removed, although its ifcfg file remained on the system. This caused the host to become inaccessible. The cause was a subtle timing issue with ifcfg- files being available from the previous setup. This was fixed to ensure a single point for configuration files throughout the life of the system.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-28 16:40:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 863165    
Attachments:
Description Flags
sosreport-dell-r210ii-06-20120925093244-c9e9.tar none
sosreport-mp-rhevm31-20120925105832-d754.tar none
notes.txt none

Description Martin Pavlik 2012-09-25 13:06:35 UTC
Description of problem:
When removing networks from host NICs via the Setup Network dialog on a rhevh 3.1 host, the rhevm interface gets removed as well (not desired); its ifcfg file remains on the system.


Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Manager Version: '3.1.0-16.el6ev' 
vdsm-bootstrap-4.9.6-34.0.el6_3.noarch

Red Hat Enterprise Virtualization Hypervisor release 6.3 (20120920.0.rhev31.el6_3)
vdsm-4.9.6-34.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64

How reproducible:
Rarely (I have seen it once or twice before; however, we did not manage to find a reliable reproducer)

Steps to Reproduce:
The problem appears when running the Network Sanity test ( https://tcms.engineering.redhat.com/plan/6487/rhevmnetworking31-features-sanity ), in its last case ( https://tcms.engineering.redhat.com/case/175205/?from_plan=6487 ), when networks are removed from the host.


Actual results:
rhevm interface gets removed -> host is inaccessible

Expected results:
rhevm interface should not be removed

Additional info:
sosreports from the setup (sosreport-mp-rhevm31-20120925105832-d754.tar) and from the host (sosreport-dell-r210ii-06-20120925093244-c9e9.tar) are attached, along with notes.txt, which contains various command output.

This is probably the related error

MainThread::ERROR::2012-09-25 08:41:33,371::vdsm::73::vds::(run) Exception raised
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm", line 71, in run
    serve_clients(log)
  File "/usr/share/vdsm/vdsm", line 39, in serve_clients
    cif = clientIF.clientIF(log)
  File "/usr/share/vdsm/clientIF.py", line 66, in __init__
  File "/usr/share/vdsm/clientIF.py", line 130, in _syncLibvirtNetworks
  File "/usr/share/vdsm/configNetwork.py", line 205, in removeLibvirtNetwork
  File "/usr/share/vdsm/configNetwork.py", line 200, in _removeNetwork
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1957, in undefine
libvirtError: cannot remove config file '/etc/libvirt/qemu/networks/vdsm-sw2.xml': Device or resource busy



rhevm interface is gone

[root@dell-r210ii-06 ~]# ip a l
1: lo: <LOOPBACK> mtu 16436 qdisc noqueue state DOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: p1p1: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN qlen 1000
    link/ether 90:e2:ba:04:28:c0 brd ff:ff:ff:ff:ff:ff
3: p1p2: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN qlen 1000
    link/ether 90:e2:ba:04:28:c1 brd ff:ff:ff:ff:ff:ff
4: em1: <BROADCAST,MULTICAST,PROMISC> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether d0:67:e5:f0:83:5e brd ff:ff:ff:ff:ff:ff
5: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
    link/ether d0:67:e5:f0:83:5f brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: bond4: <BROADCAST,MULTICAST,PROMISC,MASTER> mtu 9000 qdisc noqueue state DOWN 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
17: bond1: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
18: bond2: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
19: bond3: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

routes are gone
[root@dell-r210ii-06 ~]# ip route

Comment 1 Martin Pavlik 2012-09-25 13:10:15 UTC
Created attachment 617026 [details]
sosreport-dell-r210ii-06-20120925093244-c9e9.tar

Comment 2 Martin Pavlik 2012-09-25 13:10:45 UTC
Created attachment 617027 [details]
sosreport-mp-rhevm31-20120925105832-d754.tar

Comment 3 Martin Pavlik 2012-09-25 13:11:25 UTC
Created attachment 617028 [details]
notes.txt

Comment 4 Dan Kenigsberg 2012-09-29 22:11:59 UTC
On net config rollback, Vdsm stops the network service (let's not go into "why"). Right after that, libvirt.networkDefineXML fails in a bizarre way:

MainProcess|Thread-1104::DEBUG::2012-09-25 08:39:48,566::__init__::1164::Storage.Misc.excCmd::(_log) '/etc/init.d/network stop' (cwd None)
MainProcess|Thread-1104::DEBUG::2012-09-25 08:39:50,025::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
MainProcess|Thread-1104::INFO::2012-09-25 08:39:50,030::configNetwork::262::root::(restoreAtomicNetworkBackup) Rolling back logical networks configuration (restoring atomic logical networks backup)
MainProcess|Thread-1104::ERROR::2012-09-25 08:39:50,082::configNetwork::1367::setupNetworks::(setupNetworks) cannot rename file '/etc/libvirt/qemu/networks/vdsm-sw1.xml.new' as '/etc/libvirt/qemu/networks/vdsm-sw1.xml': Device or resource busy
Traceback (most recent call last):
  File "/usr/share/vdsm/configNetwork.py", line 1362, in setupNetworks
  File "/usr/share/vdsm/configNetwork.py", line 367, in restoreBackups
  File "/usr/share/vdsm/configNetwork.py", line 276, in restoreAtomicNetworkBackup
  File "/usr/share/vdsm/configNetwork.py", line 173, in _createNetwork
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2836, in networkDefineXML
libvirtError: cannot rename file '/etc/libvirt/qemu/networks/vdsm-sw1.xml.new' as '/etc/libvirt/qemu/networks/vdsm-sw1.xml': Device or resource busy



in libvirtd.log I see

2012-09-25 08:39:48.565+0000: 8179: error : virNetworkDeleteConfig:1735 : cannot remove config file '/etc/libvirt/qemu/networks/vdsm-sw1.xml': Device or resource busy
2012-09-25 08:39:50.035+0000: 8178: error : virNetworkDeleteConfig:1735 : cannot remove config file '/etc/libvirt/qemu/networks/vdsm-sw1.xml': Device or resource busy
2012-09-25 08:39:50.069+0000: 8182: error : virFileRewrite:405 : cannot rename file '/etc/libvirt/qemu/networks/vdsm-sw1.xml.new' as '/etc/libvirt/qemu/networks/vdsm-sw1.xml': Device or resource busy
2012-09-25 08:41:32.879+0000: 8161: error : virNetSocketReadWire:999 : End of file while reading data: Input/output error
2012-09-25 08:41:33.207+0000: 8178: error : virNetworkDeleteConfig:1735 : cannot remove config file '/etc/libvirt/qemu/networks/vdsm-rhevm.xml': Device or resource busy
2012-09-25 08:41:33.370+0000: 8184: error : virNetworkDeleteConfig:1735 : cannot remove config file '/etc/libvirt/qemu/networks/vdsm-sw2.xml': Device or resource busy
2012-09-25 08:41:34.049+0000: 8161: error : virNetSocketReadWire:999 : End of file while reading data: Input/output error


where virNetworkDeleteConfig:1735 refers to this piece of code


    if (unlink(configFile) < 0) {
        virReportSystemError(errno,
                             _("cannot remove config file '%s'"),
                             configFile);
        goto error;
    }


Now why would unlink(3) return EBUSY?
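
For reference, a minimal Python sketch of the same pattern as the C snippet above: attempt the unlink and report the symbolic errno name on failure, so EBUSY can be told apart from other errors. The helper name try_unlink is hypothetical; it is not part of vdsm or libvirt.

```python
import errno
import os


def try_unlink(path):
    """Remove path; return None on success or the symbolic errno name on failure."""
    try:
        os.unlink(path)
    except OSError as exc:
        # errno.errorcode maps numeric codes to names such as 'EBUSY' or 'ENOENT'.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Judging by the libvirtd.log entries above, a call such as try_unlink('/etc/libvirt/qemu/networks/vdsm-sw1.xml') on the affected host would presumably return 'EBUSY'.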

Comment 5 Eric Blake 2012-10-01 21:57:19 UTC
(In reply to comment #4)
> On net config rollback, Vdsm stops the network service (let's not go into
> "why"). Right after that, libvirt.networkDefineXML fails in a bizarre way:
> 

> 2012-09-25 08:41:33.370+0000: 8184: error : virNetworkDeleteConfig:1735 :
> cannot remove config file '/etc/libvirt/qemu/networks/vdsm-sw2.xml': Device
> or resource busy
> 2012-09-25 08:41:34.049+0000: 8161: error : virNetSocketReadWire:999 : End
> of file while reading data: Input/output error
> 
> 
> where virNetworkDeleteConfig:1735 refers to this piece of code
> 
> 
>     if (unlink(configFile) < 0) {
>         virReportSystemError(errno,
>                              _("cannot remove config file '%s'"),
>                              configFile);
>         goto error;
>     }
> 
> 
> Now why would unlink(3) return EBUSY?

Does this file live on NFS or on a cifs share?  I seem to recall that NFS can fail with EBUSY when a file still has some process holding it open, and that is certainly the semantics that Windows file systems tend to implement.  That said, although POSIX permits unlink() to fail with EBUSY in this circumstance, most Unix file systems allow unlink()ing an in-use file, so it's not a common error seen during development.
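
The usual local-file-system behaviour mentioned above can be checked with a short Python sketch: it unlinks a temporary file while a descriptor is still open and then reads the data back through that descriptor. The function name read_after_unlink is made up for illustration, and the sketch assumes a typical local Linux file system (not NFS or CIFS).

```python
import os
import tempfile


def read_after_unlink(data=b"still readable"):
    """Unlink a file while it is open, then read it back via the open descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.unlink(path)  # expected to succeed on typical local Linux file systems
        gone = not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        return gone, os.read(fd, len(data))
    finally:
        os.close(fd)
```

If the unlink raised OSError with errno EBUSY here, that would point at a file system or mount setup with different semantics.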

Comment 6 Dan Kenigsberg 2012-10-02 19:02:28 UTC
(In reply to comment #5)

> > Now why would unlink(3) return EBUSY?
> 
> Does this file live on NFS or on a cifs share?

No. But I'm told that the /etc/libvirt/qemu/networks/ directory is bind-mounted. Could this be related?

Comment 7 Mike Burns 2012-10-02 20:16:32 UTC
(In reply to comment #5)
> 
> Does this file live on NFS or on a cifs share?  I seem to recall that NFS
> can fail with EBUSY when a file still has some process holding it open, and
> that is certainly the semantics that Windows file systems tend to implement.
> That said, although POSIX permits unlink() to fail with EBUSY in this
> circumstance, most Unix file systems allow unlink()ing an in-use file, so
> it's not a common error seen during development.

It's actually bind-mounted.  Whether this works or fails depends on whether or not you have rebooted since you created the network files.

I ran a quick test:

mkdir /etc/mburns
echo test > /etc/mburns/testfile
persist /etc/mburns
unlink /etc/mburns/testfile <-- works correctly

echo test > /etc/mburns/testfile
reboot
unlink /etc/mburns/testfile <-- fails

It turns out that they're mounted differently before and after the reboot.  Before the reboot, the /etc/mburns directory is mounted; after the reboot, the file /etc/mburns/testfile itself is mounted.
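
A rough Python sketch of how the two cases could be told apart: on Linux, unlink() fails with EBUSY when the path is itself a mount point, so it is enough to check whether the file appears as a mount target in /proc/mounts-style text. The function names and the sample mount tables (including the /dev/example device) are hypothetical and only illustrate the before/after difference described above.

```python
import os


def mount_targets(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    targets = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # The second field is the mount point; spaces are escaped as \040.
            targets.add(fields[1].replace("\\040", " "))
    return targets


def unlink_would_hit_mount(path, mounts_text):
    """True if path itself is a mount point, where unlink() fails with EBUSY."""
    return os.path.normpath(path) in mount_targets(mounts_text)


# Hypothetical mount tables for the two cases (made-up device name).
BEFORE_REBOOT = "/dev/example /etc/mburns ext4 rw 0 0\n"
AFTER_REBOOT = "/dev/example /etc/mburns/testfile ext4 rw 0 0\n"
```

On a live host the text could be read from /proc/mounts; with the sample tables, only the after-reboot case reports /etc/mburns/testfile as a mount point.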

This is a problem in ovirt-node, so I'm moving this bz to ovirt-node in 6.4 and flagging for 6.3.z.


Upstream patch posted:  http://gerrit.ovirt.org/#/c/8317/

Comment 15 errata-xmlrpc 2013-02-28 16:40:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0556.html