Bug 813853

Summary: libvirt network fails rarely - maybe dnsmasq problem
Product: [Fedora] Fedora
Reporter: Steven Dake <sdake>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 16
CC: berrange, calfonso, clalancette, crobinso, dougsland, itamar, jforbes, jyang, laine, libvirt-maint, veillard, virt-maint
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2012-06-07 21:06:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments:
Description: This is the strace while booting the guest vm. (Flags: none)
Description: This is a screenshot with virt-viewer showing the guest config and the host network interfaces (Flags: none)

Description Steven Dake 2012-04-18 15:19:22 UTC
Description of problem:
Rarely the libvirt network seems to fail resulting in inability of the VM to get its DHCP address.  Since it can't get a DHCP address, it can't boot.

Killing dnsmasq manually and then running:
virsh net-destroy default
virsh net-start default

fixes the problem.

Version-Release number of selected component (if applicable):
Name        : libvirt
Version     : 0.9.6
Release     : 5.fc16


How reproducible:
2%

Steps to Reproduce:
1. We run oz a bunch of times to generate images and eventually the network gets wedged.
2. It may take several days of oz running.
  
Actual results:


Expected results:


Additional info:
I know the bug is light on data - I don't see any helpful diagnostic information.  If you have some recommendations for data to capture next time it happens please let us know.

Comment 1 Daniel Veillard 2012-04-18 15:28:41 UTC
s/l/tracing dnsmasq when it comes into that situation may help understand
where the problem comes from.

when you hit the issue, don't kill the process immediately but run

strace -o /tmp/dnsmasq.log -p `pidof dnsmasq`

and try to boot a VM, then after it failed, kill the process and
append the log,

 thanks !

Daniel
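
Editorial sketch for the strace step above: `pidof dnsmasq` can print more than one PID when several dnsmasq instances are running on the host, and `strace -p` then needs to be pointed at the right one. The helper below reads the PID from a pid file instead. The function name `dnsmasq_pid` is hypothetical, and the default path `/var/run/libvirt/network/NAME.pid` is an assumption about where libvirt records the PID for a network; verify it on the host before relying on it.

```shell
#!/bin/sh
# dnsmasq_pid NETWORK [PIDFILE]
# Print the PID stored in the pid file for a libvirt network.
# ASSUMPTION: the pid file lives at /var/run/libvirt/network/NETWORK.pid
# unless an explicit PIDFILE is given as the second argument.
dnsmasq_pid() {
    pidfile="${2:-/var/run/libvirt/network/$1.pid}"
    if [ ! -r "$pidfile" ]; then
        echo "no readable pid file: $pidfile" >&2
        return 1
    fi
    # Strip the trailing newline and any other whitespace.
    pid=$(tr -d '[:space:]' < "$pidfile")
    # Refuse anything that is not a plain decimal number.
    case "$pid" in
        ''|*[!0-9]*)
            echo "unexpected content in $pidfile" >&2
            return 1
            ;;
    esac
    echo "$pid"
}

# Usage (as root, while the network is wedged):
#   strace -f -tt -o /tmp/dnsmasq.log -p "$(dnsmasq_pid default)"
```

The `-f` and `-tt` options are additions to the command in the comment above: they follow child processes and add timestamps, which may make the log easier to line up with the guest's boot attempt.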

Comment 2 Steven Dake 2012-04-18 15:47:15 UTC
DV

Thanks
We will give that a go.  We will also try booting a different image rather than an oz-based one when it locks, just to verify it isn't some weird oz output wedging libvirt (if it is, we could provide the output, which may be helpful).  After that we will kill -HUP to see if that restarts the network.  This problem doesn't happen all that often, unfortunately.

Regards
-steve

Comment 3 chris alfonso 2012-05-21 14:34:45 UTC
Created attachment 585828 [details]
This is the strace while booting the guest vm.

Comment 4 chris alfonso 2012-05-21 14:35:58 UTC
Created attachment 585829 [details]
This is a screenshot with virt-viewer showing the guest config and the host network interfaces

Comment 5 chris alfonso 2012-05-21 14:53:50 UTC
I had originally created an openstack nova network using virbr0 as the bridge.  After removing that network and creating a new nova network using a different arbitrary name of demonetbr0, the network on the guest comes up without any problems.

Comment 6 Cole Robinson 2012-06-07 20:15:54 UTC
chris, yeah that virbr0 name was likely clashing with libvirt's default network.

Steven, is killing dnsmasq manually a requirement? Or does virsh net-destroy on its own work? Any chance something could be mucking with firewall rules on the host? This can wipe out the rules that libvirt needs for NAT.

Comment 7 Steven Dake 2012-06-07 20:42:25 UTC
net-destroy gets the job done if I recall

We are using openstack on the system, and it makes all kinds of iptables changes.

Comment 8 Cole Robinson 2012-06-07 21:06:18 UTC
Long known issue which won't be fixed until we have firewalld by default, which libvirt and all other iptables users talk to. Which is like F18 time frame. So this is WONTFIX for F16

Comment 9 Steven Dake 2012-06-07 22:23:59 UTC
Cole,

Unclear how a conclusion can be made that changing the firewall will break dnsmasq without clear evidence.

Comment 10 Laine Stump 2012-06-08 02:03:27 UTC
libvirt adds iptables rules to (among other things) allow incoming DHCP from the virt guests to the host. If somebody else messes with the iptables rules and happens to add another rule above this particular rule, dhcp requests from the guest will no longer make it to the dnsmasq running on the host. This is just one example of many problems that can occur due to the fact that there is no central controlling authority for iptables rules, and no concept of priority so that the ordering of the rules can remain consistent regardless of the ordering of their insertion.

To verify if this is the source of the problem, during a time when the system is "wedged", just run "iptables -S" and see if there is a REJECT or DROP rule that would match the dhcp packets that occurs above the rule to allow them.

Also, when the networking is in its wedged state, try restarting libvirtd to see if that un-wedges it - restarting libvirtd will reload libvirt's iptables rules and re-enable ip_forward without making any other changes to the network plumbing.
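
Editorial sketch of the ordering check described above, assuming the bridge is virbr0 and that libvirt's DHCP rule appears in `iptables -S` output as an INPUT rule for UDP destination port 67 with `-j ACCEPT`. The function name `check_dhcp_rule_order` is hypothetical. It is only a heuristic: it lists every REJECT or DROP rule that precedes the DHCP rule in the INPUT chain, without evaluating whether that rule would really match DHCP packets, so each reported line still needs to be read by hand.

```shell
#!/bin/sh
# check_dhcp_rule_order [BRIDGE]
# Reads `iptables -S` output on stdin.
# Exit status: 0 = DHCP ACCEPT rule found with no REJECT/DROP before it,
#              1 = REJECT/DROP rules precede it (candidates to inspect),
#              2 = no DHCP ACCEPT rule for the bridge was found at all.
# ASSUMPTIONS: bridge defaults to virbr0; DHCP server port is 67/udp.
check_dhcp_rule_order() {
    awk -v br="${1:-virbr0}" '
        # Only rules appended to the INPUT chain affect guest-to-host DHCP.
        $1 == "-A" && $2 == "INPUT" {
            # Remember blocking rules seen before the DHCP ACCEPT rule.
            if (/-j (REJECT|DROP)/ && !accept_seen) {
                blockers = blockers "\n  " $0
            }
            if (!accept_seen && index($0, "-i " br " ") &&
                /-p udp/ && /--dport 67( |$)/ && /-j ACCEPT/) {
                accept_seen = 1
            }
        }
        END {
            if (!accept_seen) {
                print "MISSING: no DHCP ACCEPT rule for " br
                exit 2
            }
            if (blockers != "") {
                print "SUSPECT: REJECT/DROP before the DHCP ACCEPT rule:" blockers
                exit 1
            }
            print "OK: no REJECT/DROP precedes the DHCP ACCEPT rule for " br
        }'
}

# Usage (as root, while the network is wedged):
#   iptables -S | check_dhcp_rule_order virbr0
```

A MISSING result would point at the rules having been flushed by something else, in which case restarting libvirtd as suggested above should put them back.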

Comment 11 Cole Robinson 2012-06-17 14:57:47 UTC
Steven, sorry, wasn't trying to be rash, it's just that 95% of all networking issues filed against libvirt over the years have been some incarnation of this root issue.

If you find evidence to the contrary, as Laine requested in Comment #10, please reopen this bug and we can go from there. But until then keeping this open isn't helpful IMO

Comment 12 Laine Stump 2012-06-17 16:53:14 UTC
BTW, just a couple days ago I made a change to the system firewall with the firewall applet, and hit "Apply", and found that guests could no longer acquire a DHCP lease. When I looked at the iptables output, I found that, as we've discussed above, the rule to allow dhcp packets on the INPUT chain had been removed (along with most/everything else added by libvirt). Restarting libvirtd was enough to reload libvirt's iptables rules and get dnsmasq working properly again. So, this isn't conclusive, but I did experience the exact same symptoms and the cause was just as Cole surmised.
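
Editorial sketch for capturing the kind of evidence described above: save the `iptables -S` output while guests work, save it again once the network is wedged, and list the rules that disappeared in between. The function name `lost_rules` and the snapshot file names are hypothetical. Note that `iptables -S` without `-t` only covers the filter table, which is where the INPUT-chain DHCP rule discussed here lives.

```shell
#!/bin/sh
# lost_rules BEFORE AFTER
# Print the rules present in file BEFORE but absent from file AFTER.
# Both files are expected to hold saved `iptables -S` output.
lost_rules() {
    _before_sorted=$(mktemp) && _after_sorted=$(mktemp) || return 1
    # comm requires sorted input; rule order is ignored on purpose here.
    sort -u "$1" > "$_before_sorted"
    sort -u "$2" > "$_after_sorted"
    # -23 suppresses lines unique to AFTER and lines common to both.
    comm -23 "$_before_sorted" "$_after_sorted"
    rm -f "$_before_sorted" "$_after_sorted"
}

# Usage (as root):
#   iptables -S > /tmp/rules.working     # while guests get DHCP leases
#   iptables -S > /tmp/rules.wedged      # after the network is wedged
#   lost_rules /tmp/rules.working /tmp/rules.wedged
```

Because the comparison ignores ordering, it only shows removed rules; a rule inserted above libvirt's would not appear in this output.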