Bug 850606

Summary: libvirtd should check for properly running dnsmasq on networks presumed "active" at startup (and start one if necessary)
Product: [Fedora] Fedora
Reporter: Scott Baker <scott>
Component: libvirt
Assignee: Laine Stump <laine>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: berrange, clalancette, crobinso, itamar, jforbes, jyang, laine, libvirt-maint, veillard, virt-maint
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2012-10-01 21:37:36 UTC
Type: Bug

Description Scott Baker 2012-08-21 23:31:03 UTC
Description of problem:
When I start libvirtd with a "default" network, it does not spawn dnsmasq to serve DHCP requests.

Version-Release number of selected component (if applicable):
libvirt-0.9.6.1-1.fc16.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Install libvirtd
2. Use the default network configuration 
  
Actual results:
Libvirt starts, and the default network "starts" but dnsmasq isn't started to serve DHCP.

----------------------------------------------------------

:cat /etc/libvirt/qemu/networks/default.xml 
<network>
  <name>default</name>
  <uuid>06974f33-83e7-4780-9532-bf3d7acefa7c</uuid>
  <bridge name="virbr0" />
  <mac address='52:54:00:CA:D3:5D'/>
  <forward/>
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>

:virsh net-info default
Name            default
UUID            06974f33-83e7-4780-9532-bf3d7acefa7c
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0

:ps aux | grep dnsmasq
root     22874  0.0  0.0 109248   884 pts/4    S+   16:28   0:00 grep --color=auto dnsmasq

If I manually start dnsmasq it works, but it should start automatically when libvirtd brings up the "default" network:

/usr/sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --listen-address 192.168.122.1 --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/default.leases --dhcp-lease-max=253 --dhcp-no-override
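For reference, every value in that command line can be read straight out of the network XML shown earlier. The following is only a rough sketch of that mapping, not libvirt's actual code: `build_dnsmasq_argv` is a hypothetical helper, and the rule that `--dhcp-lease-max` equals the number of addresses in the DHCP range is an assumption inferred from the numbers in this report (2 through 254 gives 253).

```python
import ipaddress
import xml.etree.ElementTree as ET

# Trimmed copy of the "default" network definition from this report.
NETWORK_XML = """
<network>
  <name>default</name>
  <bridge name="virbr0" />
  <ip address="192.168.122.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.122.2" end="192.168.122.254" />
    </dhcp>
  </ip>
</network>
"""


def build_dnsmasq_argv(network_xml):
    """Hypothetical helper: derive a dnsmasq argv from a network definition."""
    net = ET.fromstring(network_xml)
    name = net.findtext("name")
    ip = net.find("ip")
    rng = ip.find("dhcp/range")
    start = ipaddress.IPv4Address(rng.get("start"))
    end = ipaddress.IPv4Address(rng.get("end"))
    # Assumption: the lease cap is the number of addresses in the range.
    lease_max = int(end) - int(start) + 1
    return [
        "/usr/sbin/dnsmasq",
        "--strict-order",
        "--bind-interfaces",
        f"--pid-file=/var/run/libvirt/network/{name}.pid",
        "--conf-file=",
        "--except-interface", "lo",
        "--listen-address", ip.get("address"),
        "--dhcp-range", f"{start},{end}",
        f"--dhcp-leasefile=/var/lib/libvirt/dnsmasq/{name}.leases",
        f"--dhcp-lease-max={lease_max}",
        "--dhcp-no-override",
    ]


argv = build_dnsmasq_argv(NETWORK_XML)
print(" ".join(argv))
```

The sketch only handles a single IPv4 `<ip>` element with one `<range>`, which is all the default network above contains.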

Not sure if it matters, but it appears that CentOS 6.3 does the same thing.

Comment 1 Scott Baker 2012-08-22 15:08:00 UTC
I'm using "network" instead of "NetworkManager" if that's somehow related.

Comment 2 Scott Baker 2012-08-22 17:55:04 UTC
virsh net-destroy default; virsh net-start default

causes dnsmasq to restart. Working with laine on IRC, we found that if dnsmasq crashes or is killed while virbr0 is still present, restarting libvirtd will *NOT* restart dnsmasq.

dnsmasq is only started by libvirtd if it thinks the network is not already up, which it determines by seeing if the virbr0 device is present. Part of libvirtd turning up the networks should probably confirm that dnsmasq is running, and if not, start it.
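The missing check could look something like the sketch below. It is an illustration only, not what libvirtd does internally: `dnsmasq_is_running` is a made-up name, and it assumes the pid file path from the manual command line above (`/var/run/libvirt/network/default.pid`). It probes the recorded pid with signal 0, which tests for existence without actually delivering a signal.

```python
import os


def dnsmasq_is_running(pid_file):
    """Hypothetical check: is the process recorded in pid_file still alive?"""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or garbled pid file: treat as not running.
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the pid exists
    except ProcessLookupError:
        return False  # stale pid file, the process is gone
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Path taken from the manual dnsmasq invocation in the description.
print(dnsmasq_is_running("/var/run/libvirt/network/default.pid"))
```

Note that this only proves that some process owns that pid. A stricter check would also confirm the process really is dnsmasq, since pids can be reused.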

Comment 3 Laine Stump 2012-08-24 06:11:16 UTC
I agree we should be checking for dnsmasq and restarting it if needed (and probably giving it a SIGHUP even if it's there, just for good measure).


An aside:

In our discussion on IRC, you figured out that you had run "/etc/init.d/dnsmasq restart" and that had killed dnsmasq. But that script doesn't exist on F16, because it has switched to using systemd. (And when I run "service dnsmasq restart" or "systemctl restart dnsmasq.service", it fails and doesn't kill all of libvirtd's dnsmasq instances.)

Was that original behavior only seen on CentOS, and you just verified the result on F16 by manually killing the dnsmasq processes? Or is there some other weird circumstance that causes dnsmasq processes to be killed?

Comment 4 Scott Baker 2012-08-24 15:09:12 UTC
It was originally seen on CentOS. I tried it on F16 too, but by forcibly killing dnsmasq first, to see whether it would restart.

Comment 5 Laine Stump 2012-08-24 18:16:37 UTC
Okay, so there isn't a separate "dnsmasq is mysteriously dying" bug on F16. That's good to know :-)

I've changed the summary of this BZ to more accurately reflect what's needed from libvirt.

Thanks for the report and extra investigation!

Comment 6 Laine Stump 2012-09-23 18:21:10 UTC
Upstream libvirt has been enhanced to restart radvd/dnsmasq when needed when libvirtd is restarted. It will also send a SIGHUP to all dnsmasq and radvd processes when libvirtd is restarted. The following two commits are required for this new behavior. I'm not sure how easily they will backport to the libvirt that's in F16 (which this BZ is filed against) or F17, but they will be in 0.10.2, which means they will automatically be in F18.

If the backport isn't trivial, we may want to consider marking this as CLOSED/NEXTRELEASE or CLOSED/UPSTREAM instead.

commit 4cf974b67427e33e3ce38df4787cddd6e2822d67
Author: Laine Stump <laine>
Date:   Sun Sep 16 21:22:27 2012 -0400

    network: restart radvd/dnsmasq if needed when libvirtd is restarted
    
    A user on IRC had accidentally killed all of his libvirt-started
    dnsmasq instances (due to a buggy dnsmasq service script in Fedora
    16), and had hoped that libvirtd would notice this on restart and
    reload all the dnsmasq daemons (as it does with iptables
    rules). Unfortunately this was not the case - as long as the network
    object had a pid registered for dnsmasq and/or radvd, it assumed that
    the processes were running.
    
    This patch takes advantage of the new utility functions in
    bridge_driver.c to do a "refresh" of all radvd and dnsmasq processes
    started by libvirt each time libvirtd is restarted - this function
    attempts to do a SIGHUP of each existing process, and if that fails,
    it restarts the process, rebuilding all the associated config files
    and commandline parameters in the process. This normally has no
    effect, but will be useful in solving the occasional "odd situation"
    without needing to take the drastic step of destroying/re-starting the
    network.

commit 1ce4922e720e125421b3f8061d0eb6fdd152c41a
Author: Laine Stump <laine>
Date:   Mon Aug 20 00:59:46 2012 -0400

    network: reorganize dnsmasq and radvd config file / startup
    
    This patch splits the starting of dnsmasq and radvd into multiple
    files, and adds new networkRefreshXX() and networkRestartXX()
    functions for each. These new functions are currently commented out
    because they won't be used until the next commit, and the compile options
    require all static functions to be used.
    
    networkRefreshXX() - rewrites any file-based config for dnsmasq/radvd,
    and sends SIGHUP to the process to make it reread its config. If the
    program isn't already running, it's just started.
    
    networkRestartXX() - kills the given program, waits for it to exit
    (see the comments in the function networkKillDaemon()), then calls
    networkStartXX().
    
    This commit is here mostly as a checkpoint to verify no change in
    functional behavior after refactoring networkStartXX() functions to
    fit in with these new functions.
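
As a rough illustration of the "refresh" behavior described in the commit messages above (try a SIGHUP, and start the process if it isn't there), here is a minimal sketch. This is not the code from bridge_driver.c: `refresh_daemon` and its `start` callback are hypothetical names, and the step that rewrites the config files is left out.

```python
import os
import signal


def refresh_daemon(pid, start):
    """Hypothetical refresh: SIGHUP a live daemon, otherwise start a new one.

    pid   -- pid recorded for the daemon, or None if none is known
    start -- callable that launches the daemon (a stand-in for the real
             start routine; this sketch assumes it takes no arguments)
    """
    if pid is not None and pid > 0:
        try:
            # Ask the running daemon to reread its configuration.
            os.kill(pid, signal.SIGHUP)
            return "reloaded"
        except ProcessLookupError:
            # The recorded process is gone: fall through and start it.
            pass
    start()
    return "started"


# With no recorded pid the daemon is simply started.
print(refresh_daemon(None, lambda: print("starting dnsmasq")))
```

A PermissionError from the signal call is deliberately left to propagate here, on the assumption that the caller runs with enough privilege to signal its own child daemons.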

Comment 7 Cole Robinson 2012-10-01 21:25:03 UTC
Amazingly, these patches apply cleanly to F16 maint. However, given the size of the changes, the (hopefully) low frequency of the issue, and the fact that there's a workaround (destroy, start), I don't plan on backporting these to the maintenance branches.

Moving to F18.

Comment 8 Cole Robinson 2012-10-01 21:37:36 UTC
Aaaaand libvirt 0.10.2 is already in F18, so just closing as CURRENTRELEASE