Bug 1114720 - Network interface fails when rebooting a VM with an NWFilter
Summary: Network interface fails when rebooting a VM with an NWFilter
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-06-30 18:31 UTC by Frederick N. Brier
Modified: 2016-04-13 22:01 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-13 22:01:24 UTC
Embargoed:



Description Frederick N. Brier 2014-06-30 18:31:25 UTC
Description of problem:

The network interface fails to initialize properly when rebooting a VM with an NWFilter applied.  If the guest VM (not the host) is fully shut down and then restarted, the network interface comes up.

Version-Release number of selected component (if applicable):
libvirt: 0.10.2


How reproducible:

I've tested this with two Ubuntu 12.04 LTS VMs.  Both fail when rebooting, and the network interface is operational when the VM is completely shut down and restarted.


Steps to Reproduce:
1. Create a VM (this was experienced using a bare bones Ubuntu 12.04 LTS).
2. Define a network filter.
3. Shutdown the guest VM.
4. Use virsh edit to add the network filter to the bridged interface (see the example interface XML after these steps).
5. Start the VM - it works.
6. ssh into the VM, and "sudo reboot".
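
For reference, after step 4 the bridged interface in the domain XML looks roughly like this (the MAC address is a placeholder; 'br0' is the bridge in use here and 'web-eth0' stands for whatever filter was defined in step 2):

<interface type='bridge'>
  <mac address='52:54:00:xx:xx:xx'/>
  <source bridge='br0'/>
  <model type='virtio'/>
  <filterref filter='web-eth0'/>
</interface>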

Actual results:

"ifconfig -a" after the reboot does not show a DHCP assigned IPV4 address.  It does show the eth0 interface and MAC Address, it is just not operational.  Running "ifup eth0" will hang, although you can interrupt the command.

Expected results:

"ifconfig -a" should show a properly assigned DHCP address.

Additional info:

There appears to be a difference in how the VM and the hypervisor treat a reboot vs. a shutdown/start.  It appears that the virtio device ( "Host device vnet0 (Bridge 'br0')" ) is not being properly initialized when an NWFilter is applied.  Could this be because the device state is not properly cleaned up during the shutdown portion of a reboot, and the dirty state prevents the nwfilter-applied-interface initialization code from executing?  Or perhaps on a reboot the nwfilter-applied-interface initialization code is never executed at all?

The workaround is to completely shut down the VM ("shutdown -h now") and then restart it.  Thank you for any assistance.

These log messages from /var/log/libvirt/libvirtd.log may help.

2014-06-30 17:36:23.800+0000: 19890: error : qemuMonitorIO:614 : internal error End of file from monitor
2014-06-30 17:36:23.801+0000: 19890: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet0" not in key map
2014-06-30 17:36:23.820+0000: 19890: error : virNetDevGetIndex:653 : Unable to get index for interface vnet0: No such device

Comment 1 Frederick N. Brier 2014-06-30 18:38:38 UTC
After the reboot, there is also a long delay during OS startup while it waits for the network interface.  After a full shutdown, the subsequent boot completes very quickly, as it is a bare-bones install.

Comment 2 Frederick N. Brier 2014-07-01 21:26:01 UTC
The problem seems to be occurring even after the VM has successfully booted and the interface comes up.  I had a terminal window open where I had been pinging a machine on the Internet.  Some time passed, and when I tried to ping another machine on the Internet (google.com), it reported an unknown host.  So something caused the interface to fail.

At this point, if I do an "ifup eth0", it responds with "ifup: interface eth0 is already configured".  It will allow "ifdown eth0"; however, at that point an "ifup eth0" will just hang.  So this has become more critical, as there is no workaround.

Comment 3 Jiri Denemark 2014-07-02 06:41:52 UTC
The libvirt community is mainly interested in the most recent release (1.2.6 is the latest one as of today).  Please try the latest libvirt release to check whether the bug is fixed there, and if so, file a bug for your distribution to backport the fix.

Comment 4 Frederick N. Brier 2014-07-03 04:11:57 UTC
Understandable.  How would I go about installing libvirt-1.2.6 on an up-to-date CentOS 6.5?  Do I download the source and build it?  And if so, what dependencies will it have on other, newer virtualization or kernel packages?  Is this a small effort, or a rabbit hole?  I am not married to CentOS.  I installed it as a host because it was the default for Eucalyptus.  I use Fedora for my workstation and Ubuntu 12.04 LTS for nodes because it is well supported by Chef and the Chef community.  Should I go back to Fedora as my KVM host to get the latest libvirt?

Comment 5 Jiri Denemark 2014-07-03 07:12:58 UTC
There should be no problem compiling libvirt-1.2.6 on CentOS 6.5.  You can just run "./configure && make rpm" and install the resulting RPMs.  However, you won't get libvirt-python, since that is now a separate package, and upstream libvirt may not work very well with the qemu-kvm binary from CentOS 6.5.  It should mostly work, but some things will definitely not work (e.g., you cannot use the <cpu> element in domain XML, and some hotplug- and snapshot-related things won't work either).  That said, it should be just fine for checking whether this bug is fixed upstream.
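
A minimal sketch of those steps (the build dependency list is approximate and may need adjusting on your system):

  # install common build requirements (package names are approximate)
  yum install gcc make rpm-build libxml2-devel gnutls-devel \
      device-mapper-devel libnl-devel yajl-devel libpciaccess-devel

  # build RPMs from the upstream 1.2.6 source tree
  tar xf libvirt-1.2.6.tar.gz
  cd libvirt-1.2.6
  ./configure && make rpm

  # install the resulting packages (default rpmbuild output location)
  yum localinstall ~/rpmbuild/RPMS/x86_64/libvirt*1.2.6*.rpm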

Comment 6 Frederick N. Brier 2014-07-03 17:32:36 UTC
After adding the appropriate build packages, libvirt and its sub-RPMs built.  I set it up as a local repo, but when I did a yum update, as you had indicated, libvirt-python wants the 0.10.2.29 libvirt-client RPM, which was just removed by the new libvirt-1.2.6.  --skip-broken was going to skip the new 1.2.6 RPMs, which seems counterproductive :).  I downloaded the libvirt-python 1.2.6 source, but never having built a .noarch Python RPM, I am probably doing something wrong.  The command "python setup.py bdist_rpm" says that it can't open generator.py, even though it is in the same directory.  Running the individual command works, so obviously I don't know the secret-sauce command to make it build.  Do you?  Please let me know if you do, and also whether installing the 1.2.6 version of libvirt-python is going to cause even more problems.

Comment 7 Frederick N. Brier 2014-07-03 19:45:13 UTC
I have now confirmed that this bug exists in libvirt-1.2.6.  I initially tried to build libvirt-python, but it complained that it could not find libvirt-lxc.  So I uninstalled libvirt-python, which also uninstalled python-virtinst and virt-manager.  libvirt-1.2.6 then installed cleanly.  I started up the VM using virsh, then rebooted it.  After the reboot, the network interface no longer worked.  When I do a complete shutdown and start, it works.  So the bug is still there.

Comment 8 Frederick N. Brier 2014-07-03 22:00:25 UTC
Below is one of the nwfilters I have created.  It is the smallest, and it fails as well.  I can replace the OS on one of my KVM hosts, but if you could test this on one of your systems, perhaps running Fedora, we could determine whether this is something that happens only with CentOS 6.5 and its set of versioned packages, or whether it affects all installations.  Or perhaps my nwfilter has an invalid rule that is confusing libvirt, although it is accepted when I define it.

<filter name='web-eth0'>
  <!-- reference the clean traffic filter to prevent
       MAC, IP and ARP spoofing. By not providing
       an IP address parameter, libvirt will detect the
       IP address the VM is using. -->
  <filterref filter='clean-traffic'/>
  <filterref filter='allow-dhcp-server'>
    <parameter name='DHCPSERVER' value='192.168.1.1'/>
  </filterref>


  <!-- enable TCP ports 22 (ssh) and 80 (http) to be reachable -->
  <rule action='accept' direction='in'>
    <tcp dstportstart='22'/>
  </rule>

  <rule action='accept' direction='in'>
    <tcp dstportstart='80'/>
  </rule>

  <!-- enable general ICMP traffic to be initiated by the VM;
       this includes ping traffic -->
  <rule action='accept' direction='out'>
    <icmp/>
  </rule>

  <!-- enable Connection back to the Chef-Server -->
  <rule action='accept' direction='out'>
    <tcp dstipaddr='192.168.1.8' dstportstart='443'/>
  </rule>

  <!-- enable outgoing DNS lookups using UDP -->
  <rule action='accept' direction='out'>
    <udp dstportstart='53'/>
  </rule>

  <!-- Block access to the local network, but enable outgoing HTTP, so package updates work -->
  <rule action='drop' direction='out'>
    <tcp dstipaddr='192.168.1.0' dstipmask='255.255.255.0' dstportstart='80'/>
  </rule>
  <rule action='accept' direction='out'>
    <tcp dstportstart='80'/>
  </rule>

  <!-- drop all other traffic -->
  <rule action='drop' direction='inout'>
    <all/>
  </rule>

</filter>

Comment 9 Frederick N. Brier 2014-07-22 00:45:35 UTC
Here is some odd additional information, but you need some background first.  I needed a workaround for this problem, so I created two daemon processes.  The first runs in the VM: if it detects that the eth0 interface is no longer operational, it shuts the VM down from the inside.  The other daemon runs on the KVM host: it monitors a list of VMs, and if one of them is inactive/shut down, it restarts it.  So within a minute or so (30 seconds to detect and shut down, and 30 seconds to detect and start up), the VMs are back up and running.
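
Roughly, the two daemons do the equivalent of the following (an illustrative sketch, not the actual daemon code; the intervals and domain names are just the ones mentioned here):

  # guest-side watchdog: shut down if eth0 loses its IPv4 address
  while true; do
      if ! ip -4 addr show dev eth0 | grep -q 'inet '; then
          shutdown -h now
      fi
      sleep 30
  done

  # host-side vmrestartd loop: restart any monitored domain that is not running
  while true; do
      for dom in Haboob Gale Squall; do
          if ! virsh domstate "$dom" | grep -q running; then
              virsh start "$dom" && logger -t vmrestartd "Successfully restarted domain $dom"
          fi
      done
      sleep 30
  done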

The vmrestart daemon uses syslog, and these are the messages over a period of time.  Here is the interesting part: 1 minute and 30 seconds after Haboob fails and is restarted, Gale fails and restarts.  16 minutes and 30 seconds after Gale fails/restarts, Squall fails and restarts.  I am not sure what this means, but it is consistent over days of logs with a variation of +/- 1 second.  Is there some timed process involved in libvirt's nwfilter/networking that could be causing a failure?  There is one other VM running on this host, but it does not have an nwfilter applied.  There were 13 VMs defined, but only those 4 are active.  Hope this helps.

Jul 18 00:56:47 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 18 00:58:17 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 18 01:14:48 derecho vmrestartd[17486]: Successfully restarted domain Squall
Jul 18 09:28:49 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 18 09:30:19 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 18 09:46:50 derecho vmrestartd[17486]: Successfully restarted domain Squall
Jul 18 18:00:51 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 18 18:02:22 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 18 18:18:52 derecho vmrestartd[17486]: Successfully restarted domain Squall
Jul 19 02:32:53 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 19 02:34:24 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 19 02:50:54 derecho vmrestartd[17486]: Successfully restarted domain Squall
Jul 19 11:04:55 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 19 11:06:26 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 19 11:22:56 derecho vmrestartd[17486]: Successfully restarted domain Squall
Jul 19 19:36:57 derecho vmrestartd[17486]: Successfully restarted domain Haboob
Jul 19 19:38:28 derecho vmrestartd[17486]: Successfully restarted domain Gale
Jul 19 19:54:59 derecho vmrestartd[17486]: Successfully restarted domain Squall

Comment 10 Cole Robinson 2016-04-13 22:01:24 UTC
Sorry this didn't receive much follow-up.  I just tried to reproduce with libvirt git: my VM rebooted fine with complete network connectivity, and I didn't see any of the errors you mentioned at the end of Comment #0.  So I assume this is fixed upstream; closing.

If you are still hitting issues on the latest RHEL/CentOS, please file a bug there.

