Red Hat Bugzilla – Bug 813853
libvirt network fails rarely - maybe dnsmasq problem
Last modified: 2016-04-26 15:34:40 EDT
Description of problem:
On rare occasions the libvirt network seems to fail, leaving the VM unable to get its DHCP address. Since it can't get a DHCP address, it can't boot.
Killing dnsmasq manually and then running:
virsh net-destroy default
virsh net-start default
fixes the problem.
Version-Release number of selected component (if applicable):
Name : libvirt
Version : 0.9.6
Release : 5.fc16
Steps to Reproduce:
1. We run oz a bunch of times to generate images and eventually the network gets wedged.
2. It may take several days of oz running.
I know the bug is light on data - I don't see any helpful diagnostic information. If you have some recommendations for data to capture next time it happens please let us know.
Running strace or ltrace on dnsmasq when it gets into that state may help us understand
where the problem comes from.
When you hit the issue, don't kill the process immediately; instead run
strace -o /tmp/dnsmasq.log -p `pidof dnsmasq`
and try to boot a VM. After the boot has failed, kill the process and
attach the log.
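On a host where more than one dnsmasq is running (for example one for libvirt and one started by another service), `pidof dnsmasq` returns several PIDs. A minimal sketch of one way to handle that, assuming a POSIX shell; the helper name `pid_args` is hypothetical and simply turns a PID list into repeated `-p` options, which strace accepts:

```shell
# Hypothetical helper: turn a list of PIDs into repeated "-p PID" options.
pid_args() {
    out=""
    for pid in "$@"; do
        # Append "-p PID", separated by a space after the first entry.
        out="${out:+$out }-p $pid"
    done
    printf '%s\n' "$out"
}

# Intended use (as root, while the network is wedged):
#   strace -tt -o /tmp/dnsmasq.log $(pid_args $(pidof dnsmasq))
# -tt adds timestamps, which makes it easier to line the trace up
# with the moment the guest sends its DHCP request.
```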
We will give that a go. We will also try booting a different image rather than an oz-based one when it locks up, just to verify it isn't some weird oz output wedging libvirt (if it is, we could provide the output, which may be helpful). After that we will kill -HUP to see if that restarts the network. This problem doesn't happen all that often, unfortunately.
Created attachment 585828 [details]
This is the strace while booting the guest vm.
Created attachment 585829 [details]
This is a screenshot with virt-viewer showing the guest config and the host network interfaces
I had originally created an openstack nova network using virbr0 as the bridge. After removing that network and creating a new nova network using a different arbitrary name of demonetbr0, the network on the guest comes up without any problems.
Chris, yeah, that virbr0 name was likely clashing with libvirt's default network.
Steven, is killing dnsmasq manually a requirement? Or does virsh net-destroy on its own work? Any chance something could be mucking with firewall rules on the host? That can wipe out the rules that libvirt needs for NAT.
net-destroy gets the job done, if I recall.
We are using OpenStack on the system, and it makes all kinds of iptables changes.
This is a long-known issue that won't be fixed until we have firewalld by default, which libvirt and all other iptables users will talk to. That is roughly the F18 time frame, so this is WONTFIX for F16.
It is unclear how one can conclude that changing the firewall will break dnsmasq without clear evidence.
libvirt adds iptables rules to (among other things) allow incoming DHCP from the virt guests to the host. If somebody else messes with the iptables rules and happens to add another rule above this particular rule, dhcp requests from the guest will no longer make it to the dnsmasq running on the host. This is just one example of many problems that can occur due to the fact that there is no central controlling authority for iptables rules, and no concept of priority so that the ordering of the rules can remain consistent regardless of the ordering of their insertion.
To verify if this is the source of the problem, during a time when the system is "wedged", just run "iptables -S" and see if there is a REJECT or DROP rule that would match the dhcp packets that occurs above the rule to allow them.
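The check described above can be roughly automated. What follows is a minimal sketch, not a libvirt tool: the function name `check_dhcp_rule` is hypothetical, and it assumes libvirt's DHCP rule has the form `-A INPUT -i virbr0 -p udp -m udp --dport 67 -j ACCEPT`. It only looks at interface, protocol, destination port and target, and it ignores negated matches and other match modules, so treat the result as a hint for where to look in the full output.

```shell
# Reads `iptables -S INPUT` output on stdin and prints one of:
#   ok        the DHCP ACCEPT rule comes before any matching REJECT/DROP
#   shadowed  a REJECT/DROP that could match DHCP comes first
#   missing   no DHCP ACCEPT rule was found on the INPUT chain
# $1 is the bridge name (assumed default: virbr0).
check_dhcp_rule() {
    awk -v br="${1:-virbr0}" '
        $1 != "-A" || $2 != "INPUT" { next }
        {
            iface = ""; proto = ""; dport = ""; target = ""
            for (i = 3; i < NF; i++) {
                if ($i == "-i") iface = $(i + 1)
                else if ($i == "-p") proto = $(i + 1)
                else if ($i == "--dport") dport = $(i + 1)
                else if ($i == "-j") target = $(i + 1)
            }
            # Skip rules that cannot match DHCP requests from the bridge.
            if (iface != "" && iface != br) next
            if (proto != "" && proto != "udp") next
            if (dport != "" && dport != "67") next
            if (target == "ACCEPT" && proto == "udp" && dport == "67") {
                state = blocked ? "shadowed" : "ok"
                exit
            }
            if (target == "REJECT" || target == "DROP") blocked = 1
        }
        END { print (state != "" ? state : "missing") }
    '
}

# Intended use (as root, while the network is wedged):
#   iptables -S INPUT | check_dhcp_rule virbr0
```

A result of "shadowed" or "missing" would support the firewall explanation; "ok" would suggest looking elsewhere, for example at the strace output from dnsmasq.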
Also, when the networking is in its wedged state, try restarting libvirtd to see if that un-wedges it. Restarting libvirtd will reload libvirt's iptables rules and re-enable ip_forward without making any other changes to the network plumbing.
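To see what a libvirtd restart actually changes, one option is to snapshot the rules before and after the restart and list whatever reappeared. This is only a sketch: `rules_added` is a hypothetical helper, the file paths are placeholders, and `systemctl restart libvirtd.service` is assumed to be the right restart command on the affected host.

```shell
# Hypothetical helper: print rules present in the "after" dump ($2)
# that were not present in the "before" dump ($1).
rules_added() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$1" "$2"
}

# Intended use (as root, while the network is wedged):
#   iptables -S > /tmp/rules.before
#   cat /proc/sys/net/ipv4/ip_forward      # note the value before the restart
#   systemctl restart libvirtd.service     # assumed restart command
#   iptables -S > /tmp/rules.after
#   cat /proc/sys/net/ipv4/ip_forward
#   rules_added /tmp/rules.before /tmp/rules.after
```

If the restart fixes the guest and the listed rules include the DHCP ACCEPT rule for the bridge, that points at the rule having been removed or reordered by something else.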
Steven, sorry, wasn't trying to be rash, it's just that 95% of all networking issues filed against libvirt over the years have been some incarnation of this root issue.
If you find evidence to the contrary, such as the data Laine requested in Comment #10, please reopen this bug and we can go from there. But until then, keeping this open isn't helpful IMO.
BTW, just a couple of days ago I made a change to the system firewall with the firewall applet, hit "Apply", and found that guests could no longer acquire a DHCP lease. When I looked at the iptables output, I found that, as we've discussed above, the rule to allow DHCP packets on the INPUT chain had been removed (along with most or all of the other rules added by libvirt). Restarting libvirtd was enough to reload libvirt's iptables rules and get dnsmasq working properly again. So, this isn't conclusive, but I did experience the exact same symptoms and the cause was just as Cole surmised.