Red Hat Bugzilla – Full Text Bug Listing
|Summary:||libvirt network fails rarely - maybe dnsmasq problem|
|Product:||[Fedora] Fedora||Reporter:||Steven Dake <sdake>|
|Component:||libvirt||Assignee:||Libvirt Maintainers <libvirt-maint>|
|Status:||CLOSED WONTFIX||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||16||CC:||berrange, calfonso, clalancette, crobinso, dougsland, itamar, jforbes, jyang, laine, libvirt-maint, veillard, virt-maint|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2012-06-07 17:06:18 EDT||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Steven Dake 2012-04-18 11:19:22 EDT
Description of problem:
Rarely the libvirt network seems to fail, resulting in the inability of the VM to get its DHCP address. Since it can't get a DHCP address, it can't boot.

Killing dnsmasq manually, then running:
  virsh net-destroy default
  virsh net-start default
fixes the problem.

Version-Release number of selected component (if applicable):
Name    : libvirt
Version : 0.9.6
Release : 5.fc16

How reproducible:
2%

Steps to Reproduce:
1. We run oz a bunch of times to generate images and eventually the network gets wedged.
2. It may take several days of oz running.
3.

Actual results:

Expected results:

Additional info:
I know the bug is light on data - I don't see any helpful diagnostic information. If you have some recommendations for data to capture next time it happens, please let us know.
Comment 1 Daniel Veillard 2012-04-18 11:28:41 EDT
strace/ltrace-ing dnsmasq when it gets into that situation may help us understand where the problem comes from. When you hit the issue, don't kill the process immediately, but run
  strace -o /tmp/dnsmasq.log -p `pidof dnsmasq`
and try to boot a VM. Then, after it has failed, kill the process and attach the log. Thanks!
Daniel
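A minimal sketch of the capture step above, in Python, for hosts where more than one dnsmasq instance may be running (for example, one per libvirt network). In that case `pidof dnsmasq` prints several PIDs, and each one needs its own -p option. The helper names strace_argv and dnsmasq_pids are hypothetical; only the strace options -o and -p from the comment above are used.

```python
import subprocess


def strace_argv(pids, log_path="/tmp/dnsmasq.log"):
    """Build the strace command line that attaches to every given PID."""
    if not pids:
        raise ValueError("no dnsmasq PIDs given; is dnsmasq running?")
    argv = ["strace", "-o", log_path]
    for pid in pids:
        # int() rejects anything that is not a plain PID
        argv += ["-p", str(int(pid))]
    return argv


def dnsmasq_pids():
    """Return the PIDs printed by `pidof dnsmasq` (empty list if none)."""
    result = subprocess.run(["pidof", "dnsmasq"], capture_output=True, text=True)
    return result.stdout.split()


if __name__ == "__main__":
    # Attaching usually requires root. Stop the trace with Ctrl-C
    # after the guest has failed to get its lease.
    subprocess.run(strace_argv(dnsmasq_pids()))
```

Run it as root while the network is wedged, boot a guest, and interrupt the trace once the guest has failed to get an address.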
Comment 2 Steven Dake 2012-04-18 11:47:15 EDT
DV,
Thanks. We will give that a go. We will also try booting a different image rather than an oz-based one when it locks up, just to verify it isn't some weird oz output wedging libvirt (if it is, we could provide the output, which may be helpful). Afterwards we will kill -HUP to see if that restarts the network. This problem doesn't happen all that often, unfortunately.
Regards
-steve
Comment 3 chris alfonso 2012-05-21 10:34:45 EDT
Created attachment 585828
This is the strace while booting the guest vm.
Comment 4 chris alfonso 2012-05-21 10:35:58 EDT
Created attachment 585829
This is a screenshot with virt-viewer showing the guest config and the host network interfaces.
Comment 5 chris alfonso 2012-05-21 10:53:50 EDT
I had originally created an openstack nova network using virbr0 as the bridge. After removing that network and creating a new nova network using a different arbitrary name of demonetbr0, the network on the guest comes up without any problems.
Comment 6 Cole Robinson 2012-06-07 16:15:54 EDT
chris, yeah, that virbr0 name was likely clashing with libvirt's default network.

Steven, is killing dnsmasq manually a requirement? Or does virsh net-destroy on its own work? Any chance something could be mucking with firewall rules on the host? That can wipe out the rules that libvirt needs for NAT.
Comment 7 Steven Dake 2012-06-07 16:42:25 EDT
net-destroy gets the job done, if I recall. We are using openstack on the system, and it makes all kinds of iptables changes.
Comment 8 Cole Robinson 2012-06-07 17:06:18 EDT
Long-known issue which won't be fixed until we have firewalld by default, which libvirt and all other iptables users talk to. That is around the F18 time frame, so this is WONTFIX for F16.
Comment 9 Steven Dake 2012-06-07 18:23:59 EDT
Cole,
It is unclear how one can conclude that changing the firewall will break dnsmasq without clear evidence.
Comment 10 Laine Stump 2012-06-07 22:03:27 EDT
libvirt adds iptables rules to (among other things) allow incoming DHCP from the virt guests to the host. If somebody else messes with the iptables rules and happens to add another rule above this particular rule, DHCP requests from the guest will no longer make it to the dnsmasq running on the host. This is just one example of the many problems that can occur because there is no central controlling authority for iptables rules, and no concept of priority that would keep the ordering of the rules consistent regardless of the order in which they were inserted.

To verify whether this is the source of the problem, during a time when the system is "wedged", just run "iptables -S" and see if there is a REJECT or DROP rule that would match the DHCP packets and that occurs above the rule to allow them.

Also, when the networking is in its wedged state, try restarting libvirtd to see if that un-wedges it. Restarting libvirtd will reload libvirt's iptables rules and re-enable ip_forward without making any other changes to the network plumbing.
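The "iptables -S" check described above can be sketched as a small Python helper. This is a rough heuristic under stated assumptions, not a full iptables matcher: it assumes the bridge is virbr0 and that the allow rule is an INPUT rule accepting UDP port 67 on that bridge, and it looks only at -i, -p, --dport and -j. Negated options, port ranges, source addresses and user-defined chains are ignored, so anything it reports still needs a human look. The function names are hypothetical.

```python
import shlex


def parse_rule(line):
    """Turn one `iptables -S` line into a dict; return None for non -A lines."""
    tokens = shlex.split(line)
    if len(tokens) < 2 or tokens[0] != "-A":
        return None
    rule = {"chain": tokens[1], "iface": None, "proto": None,
            "dport": None, "target": None}
    flags = {"-i": "iface", "-p": "proto", "--dport": "dport", "-j": "target"}
    for i, tok in enumerate(tokens[:-1]):
        if tok in flags:
            rule[flags[tok]] = tokens[i + 1]
    return rule


def could_match_dhcp(rule, bridge):
    """True if the parsed fields do not rule out a DHCP request from a guest."""
    return (rule["iface"] in (None, bridge)
            and rule["proto"] in (None, "udp", "all")
            and rule["dport"] in (None, "67"))


def blocking_rules(iptables_s_output, bridge="virbr0", chain="INPUT"):
    """List REJECT/DROP rules that come before the DHCP ACCEPT rule.

    If no DHCP ACCEPT rule is found at all, every candidate REJECT/DROP
    rule in the chain is returned.
    """
    blockers = []
    for line in iptables_s_output.splitlines():
        rule = parse_rule(line)
        if rule is None or rule["chain"] != chain:
            continue
        if not could_match_dhcp(rule, bridge):
            continue
        if (rule["target"] == "ACCEPT" and rule["proto"] == "udp"
                and rule["dport"] == "67"):
            break  # reached the allow rule; later rules cannot shadow it
        if rule["target"] in ("REJECT", "DROP"):
            blockers.append(line.strip())
    return blockers
```

Feed it the text printed by "iptables -S" while the network is wedged. An empty list means this heuristic found no REJECT or DROP rule ahead of the allow rule.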
Comment 11 Cole Robinson 2012-06-17 10:57:47 EDT
Steven, sorry, I wasn't trying to be rash; it's just that 95% of all networking issues filed against libvirt over the years have been some incarnation of this root issue. If you find evidence to the contrary, such as what Laine requested in Comment #10, please reopen this bug and we can go from there. But until then, keeping this open isn't helpful IMO.
Comment 12 Laine Stump 2012-06-17 12:53:14 EDT
BTW, just a couple of days ago I made a change to the system firewall with the firewall applet, hit "Apply", and found that guests could no longer acquire a DHCP lease. When I looked at the iptables output, I found that, as we've discussed above, the rule to allow DHCP packets on the INPUT chain had been removed (along with most/everything else added by libvirt). Restarting libvirtd was enough to reload libvirt's iptables rules and get dnsmasq working properly again.

So, this isn't conclusive, but I did experience the exact same symptoms and the cause was just as Cole surmised.
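The situation above, where the allow rule is gone entirely, can be detected with a short presence check. This is a sketch that assumes the rule has the form "-A INPUT -i virbr0 -p udp ... --dport 67 -j ACCEPT" in "iptables -S" output; the function name and the default bridge name are assumptions for illustration.

```python
import shlex


def has_dhcp_accept(iptables_s_output, bridge="virbr0"):
    """True if an INPUT rule accepting UDP port 67 on the bridge is present."""
    for line in iptables_s_output.splitlines():
        tokens = shlex.split(line)
        if tokens[:2] != ["-A", "INPUT"]:
            continue
        # Map every token to the token that follows it, so that
        # opts["-p"] is the protocol, opts["--dport"] the port, and so on.
        opts = dict(zip(tokens[2:], tokens[3:]))
        if (opts.get("-i") == bridge and opts.get("-p") == "udp"
                and opts.get("--dport") == "67" and opts.get("-j") == "ACCEPT"):
            return True
    return False
```

If this returns False while guests cannot get a lease, restarting libvirtd as described above should put the rule back.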