Bug 1154953

Summary:

Virtual machine fails to start due to a problem with dnsmasq

Product:

Red Hat Enterprise Linux 6

Reporter:

Matthias Scheutz <matthias.scheutz>

Component:

dnsmasq

Assignee:

Pavel Šimerda (pavlix) <psimerda>

Status:

CLOSED ERRATA

QA Contact:

Jan Ščotka <jscotka>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

6.7

CC:

dmitry, dyuan, ftaylor, jdenemar, jscotka, matthias.scheutz, mzhan, psklenar, rbalakri, rmy, salmy, thomas.j.thompson, thozza, tlavigne, vvasilev, wonczak

Target Milestone:

Keywords:

Patch, Regression

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

dnsmasq-2.48-17.el6

Doc Type:

Bug Fix

Last Closed:

2016-05-11 01:04:02 UTC

Type:

Bug

Attachments:

Description	Flags
Patch adapted to RHEL 6.	none

Description Matthias Scheutz 2014-10-21 06:31:32 UTC

Description of problem:

After the most recent system update, *without any changes to the configuration of the VM or the bridged networking*, the virtual machine now fails to start due to what seems to be a DHCP problem in dnsmasq:

/var/log/messages
Oct 21 02:12:47 hrilab avahi-daemon[1758]: Joining mDNS multicast group on interface virbr0.IPv4 with address 192.168.122.1.
Oct 21 02:12:47 hrilab avahi-daemon[1758]: New relevant interface virbr0.IPv4 for mDNS.
Oct 21 02:12:47 hrilab avahi-daemon[1758]: Registering new address record for 192.168.122.1 on virbr0.IPv4.
Oct 21 02:12:47 hrilab dnsmasq[20359]: failed to bind DHCP server socket: Address already in use
Oct 21 02:12:47 hrilab dnsmasq[20359]: FAILED to start up
Oct 21 02:12:47 hrilab avahi-daemon[1758]: Interface virbr0.IPv4 no longer relevant for mDNS.
Oct 21 02:12:47 hrilab avahi-daemon[1758]: Leaving mDNS multicast group on interface virbr0.IPv4 with address 192.168.122.1.
Oct 21 02:12:47 hrilab avahi-daemon[1758]: Withdrawing address record for 192.168.122.1 on virbr0.
Oct 21 02:12:47 hrilab kernel: device virbr0-nic left promiscuous mode
Oct 21 02:12:47 hrilab kernel: virbr0: port 1(virbr0-nic) entering disabled state

Version-Release number of selected component (if applicable):

libvirtd (libvirt) 0.10.2

How reproducible:

Reboot the whole system and the VM will fail to start, or run "service libvirtd reload"

Steps to Reproduce:
1.
2.
3.

Actual results:

interface virbr0 never gets created

Expected results:

virbr0 gets created and the VM starts

Additional info:

Comment 2 Jiri Denemark 2014-10-21 13:19:03 UTC

Are you sure you don't have a conflicting dnsmasq running on the host? What is the output of "ps -fwC dnsmasq" command?

Comment 3 Matthias Scheutz 2014-10-21 13:27:53 UTC

Not that I know of, the output of "ps -fwC dnsmasq" is this:
UID        PID  PPID  C STIME TTY          TIME CMD

and the dnsmasq service is stopped

Comment 4 Matthias Scheutz 2014-10-24 14:05:21 UTC

I should add that when I stop the dhcp server using

  service dhcpd stop

and then reload libvirtd

  service libvirtd reload

then I can start the VM using virt-manager and it works fine.  However, it is then not possible to restart dhcpd, i.e., 

  service dhcpd start

always fails.  So it seems that the latest update to libvirt or related packages somehow changed the port dhcpd is listening on in the virtual network when dnsmasq gets called as part of starting libvirtd.  Is that possible?  Please advise.

Comment 5 Matthias Scheutz 2014-10-30 12:46:14 UTC

Any suggestions on how to resolve this?  Right now, we cannot run the VM and the DHCP server at the same time as we used to.  And if I turn the DHCP server off and start the VM, this is what I get from "netstat -aunp":

udp        0      0 192.168.122.1:53           0.0.0.0:*                               3523/dnsmasq        
udp        0      0 192.168.0.254:53            0.0.0.0:*                               1788/named          
udp        0      0 127.0.0.1:53                0.0.0.0:*                               1788/named          
udp        0      0 0.0.0.0:67                  0.0.0.0:*                               3523/dnsmasq

Comment 6 Dr. Stephan Wonczak 2014-10-31 10:03:56 UTC

I had the same problem, too (albeit on CentOS 6). Here downgrading
dnsmasq-2.48-14.el6.x86_64
to
dnsmasq-2.48-13.el6.x86_64
solved the problem for me.
While googling, I stumbled on an old FC18 bug report, which described the exact same problem
https://bugzilla.redhat.com/show_bug.cgi?id=977555
which was fixed at that time. Maybe for some reason this bug got resurrected in the latest dnsmasq package.

Comment 7 Matthias Scheutz 2014-10-31 12:48:09 UTC

Yup, downgrading to dnsmasq-2.48-13.el6.x86_64 worked, thanks for the tip Stephan!

For the RH developers:

When I do "netstat -aunp" now I get:

udp        0      0 192.168.122.1:53            0.0.0.0:*                               2786/dnsmasq        
udp        0      0 192.168.0.254:53            0.0.0.0:*                               1789/named          
udp        0      0 127.0.0.1:53                0.0.0.0:*                               1789/named          
udp        0      0 0.0.0.0:67                  0.0.0.0:*                               2786/dnsmasq        
udp        0      0 0.0.0.0:67                  0.0.0.0:*                               2410/dhcpd  

So, with the older version of dnsmasq, it seems dhcpd can be listening at 0.0.0.0:67 as well.

It would be great if this regression could be fixed upstream.
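
For reference, the coexistence visible in the netstat output above can be reproduced in isolation. The following is a minimal sketch, assuming Linux UDP semantics; `bind_udp_wildcard` is a hypothetical helper, not part of dnsmasq or dhcpd, and it is meant to be used with an ephemeral port instead of 67 so that it can run unprivileged.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a UDP socket to 0.0.0.0:*port.
   If *port is 0 the kernel picks a port and it is written back.
   Returns the bound fd, or -1 with errno preserved from the failing call. */
int bind_udp_wildcard(unsigned short *port, int reuseaddr)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int one = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(*port);

    /* The option has to be set before bind() to have any effect. */
    if ((reuseaddr &&
         setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == -1) ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == -1 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    *port = ntohs(sa.sin_port);
    return fd;
}
```

Calling it twice with `reuseaddr` set and the same port should give two bound sockets on Linux; a further call without the option should fail with EADDRINUSE, the errno behind "Address already in use" in the log from the description.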

Comment 8 Matthias Scheutz 2015-03-06 22:37:31 UTC

Is there an update on when this will be fixed?

Comment 10 TJ 2015-08-19 17:01:08 UTC

An ETA would be appreciated from me as well.  I hit this problem w/o running dhcpd, libvirtd can't start the virtual networks at all in my particular case.

Here's the traceback in virt manager when attempting to start a network (it's nearly identical to the F18 bug IIRC):


Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 44, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/network.py", line 82, in start
    self.net.create()
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2128, in create
    if ret == -1: raise libvirtError ('virNetworkCreate() failed', net=self)
libvirtError: internal error Child process (/usr/sbin/dnsmasq --strict-order --pid-file=/var/run/libvirt/network/isolated_local.pid --conf-file= --except-interface lo --bind-interfaces --listen-address 192.168.100.1 --dhcp-option=3 --no-resolv --dhcp-range 192.168.100.128,192.168.100.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/isolated_local.leases --dhcp-lease-max=127 --dhcp-no-override --dhcp-hostsfile=/var/lib/libvirt/dnsmasq/isolated_local.hostsfile --addn-hosts=/var/lib/libvirt/dnsmasq/isolated_local.addnhosts) unexpected exit status 2: 
dnsmasq: failed to set SO_REUSE{ADDR|PORT} on DHCP socket: Protocol not available

Note that downgrading dnsmasq is a viable workaround for me at this point, but I'd rather the repo have a fixed version...

Comment 11 Pavel Šimerda (pavlix) 2015-09-02 14:26:17 UTC

(In reply to Dr. Stephan Wonczak from comment #6)
> I had the same problem, too (albeit on Centos 6). Here downgrading
> dnsmasq-2.48-14.el6.x86_64
> to
> dnsmasq-2.48-13.el6.x86_64
> solved the problem for me.

I have just compared the two versions and they appear to only differ in the initscript. As far as I know, libvirt is not supposed to use the initscript at all and therefore the change is unlikely to affect it, unless I missed something.

(In reply to TJ from comment #10)
> An ETA would be appreciated from me as well.  I hit this problem w/o running
> dhcpd, libvirtd can't start the virtual networks at all in my particular
> case.

Have you checked whether any daemons are using the DHCP server port other than instances of dnsmasq started by libvirt?

> Here's the traceback in virt manager when attempting to start a network
> (it's nearly identical to the F18 bug IIRC):
> 
> /usr/sbin/dnsmasq --strict-order
> --pid-file=/var/run/libvirt/network/isolated_local.pid --conf-file=
> --except-interface lo --bind-interfaces --listen-address 192.168.100.1
> --dhcp-option=3 --no-resolv --dhcp-range 192.168.100.128,192.168.100.254
> --dhcp-leasefile=/var/lib/libvirt/dnsmasq/isolated_local.leases
> --dhcp-lease-max=127 --dhcp-no-override
> --dhcp-hostsfile=/var/lib/libvirt/dnsmasq/isolated_local.hostsfile
> --addn-hosts=/var/lib/libvirt/dnsmasq/isolated_local.addnhosts

Note the `--bind-interfaces` option, which the dnsmasq source code below also refers to.

> dnsmasq: failed to set SO_REUSE{ADDR|PORT} on DHCP socket: Protocol not
> available

This error message uniquely identifies the code where the failure happens:

  /* When bind-interfaces is set, there might be more than one dnsmasq
     instance binding port 67. That's OK if they serve different networks.
     Need to set REUSEADDR|REUSEPORT to make this possible.
     Handle the case that REUSEPORT is defined, but the kernel doesn't 
     support it. This handles the introduction of REUSEPORT on Linux. */
  if (option_bool(OPT_NOWILD) || option_bool(OPT_CLEVERBIND))
    {
      int rc = 0;

#ifdef SO_REUSEPORT
      if ((rc = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &oneopt, sizeof(oneopt))) == -1 && 
          errno == ENOPROTOOPT)
        rc = 0;
#endif
      
      if (rc != -1)
        rc = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &oneopt, sizeof(oneopt));
      
      if (rc == -1)
        die(_("failed to set SO_REUSE{ADDR|PORT} on DHCP socket: %s"), NULL, EC_BADNET);


    }

From the source code, it looks like the error message is only printed when `setsockopt(..., SO_REUSEADDR, ...)` fails with ENOPROTOOPT, unless I missed something. The SO_REUSEPORT part looks safe from ENOPROTOOPT to me.

> Note that the downgrade in dnsmasq is a viable work around at this point for
> me, but I'd rather the repo have a fixed version...

Can you confirm that the downgrade actually helps? Do we have a simple reproducer not involving libvirt? I will later attempt to run the command above launched by libvirt in case it is enough to reproduce the issue.

Comment 12 TJ 2015-09-02 20:09:16 UTC

Here's how I reproduce the issue:

- Stopped all virt networks using virt manager
- Verified nothing on the DHCP ports using:  sudo lsof -iUDP
- Upgrade the dnsmasq package
- Try to start a virtual network (via virt manager) and it fails as noted above.
- Downgrade dnsmasq 
- Retry network start.  Networks start as expected. 
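
The lsof check in the steps above can also be approximated by reading the kernel's UDP socket table directly. A minimal sketch, assuming the Linux `/proc/net/udp` layout (IPv4 only, with the local address and port printed in hexadecimal); `udp_port_listed` is a hypothetical helper name:

```c
#include <stdio.h>

/* Hypothetical helper: scan a table in /proc/net/udp format and report
   whether any row has the given local port.
   Returns 1 if found, 0 if not found, -1 if the stream is NULL. */
int udp_port_listed(FILE *table, unsigned port)
{
    char line[512];
    unsigned local_port;

    if (table == NULL)
        return -1;

    /* The first line is the column header; skip it. */
    if (fgets(line, sizeof(line), table) == NULL)
        return 0;

    while (fgets(line, sizeof(line), table) != NULL) {
        /* Row shape: "  sl: LOCALADDR:PORT REMADDR:PORT st ..." in hex. */
        if (sscanf(line, " %*d: %*8[0-9A-Fa-f]:%x", &local_port) == 1 &&
            local_port == port)
            return 1;
    }
    return 0;
}
```

On a live system one would pass `fopen("/proc/net/udp", "r")` and port 67; a return value of 1 means some socket still holds the DHCP server port.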

libvirt version is:  libvirt-0.10.2-46.el6.x86_64

I'll see if I can reproduce the problem manually...

Comment 13 Pavel Šimerda (pavlix) 2015-09-07 13:46:25 UTC

(In reply to TJ from comment #12)
> - Downgrade dnsmasq

Have you also tried to just restart dnsmasq instead of downgrading at this point?

Comment 14 Matthias Scheutz 2016-01-04 16:25:19 UTC

We tried that; it does not work.  The only thing that works is downgrading to dnsmasq.x86_64 0:2.48-13.el6.  It would be great if this could be fixed.  Matthias

Comment 15 Pavel Šimerda (pavlix) 2016-01-05 14:13:04 UTC

With a bit of stracing of the -13 and -14 versions of dnsmasq, we confirmed that the difference was not in the code but in the build environment, specifically the version of the kernel headers. The former build didn't detect support for SO_REUSEPORT at compile time and therefore uses SO_REUSEADDR, while the latter uses SO_REUSEPORT. The upstream solution is to use both SO_REUSEADDR and SO_REUSEPORT.
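
The compile-time versus run-time distinction can be checked directly. A minimal sketch, assuming a Linux build environment; the two function names are hypothetical and not taken from dnsmasq:

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* 1 if the headers used for this build define SO_REUSEPORT, else 0. */
int reuseport_in_headers(void)
{
#ifdef SO_REUSEPORT
    return 1;
#else
    return 0;
#endif
}

/* Probe the running kernel with a throwaway UDP socket.
   Returns 1 if SO_REUSEPORT is accepted, 0 if it is unusable
   (ENOPROTOOPT, or not defined at build time), -1 on any other error. */
int reuseport_in_kernel(void)
{
#ifdef SO_REUSEPORT
    int one = 1;
    int result;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) == 0)
        result = 1;
    else
        result = (errno == ENOPROTOOPT) ? 0 : -1;

    close(fd);
    return result;
#else
    return 0;
#endif
}
```

A build where `reuseport_in_headers()` returns 1 but `reuseport_in_kernel()` returns 0 is the case the dnsmasq source comment describes: REUSEPORT is defined, but the kernel doesn't support it.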

The respective upstream commit follows...

commit ffbad34b310ab2db6a686c85f5c0a0e52c0680c8
Author: Simon Kelley <simon.uk>
Date:   Wed Aug 14 15:53:57 2013 +0100

    Set SOREUSEADDR as well as SOREUSEPORT on DHCP sockets when both available.

diff --git a/src/dhcp.c b/src/dhcp.c
index 333a327..b95a4ba 100644
--- a/src/dhcp.c
+++ b/src/dhcp.c
@@ -70,15 +70,15 @@ static int make_fd(int port)
      support it. This handles the introduction of REUSEPORT on Linux. */
   if (option_bool(OPT_NOWILD) || option_bool(OPT_CLEVERBIND))
     {
-      int rc = -1, porterr = 0;
+      int rc = 0;
 
 #ifdef SO_REUSEPORT
       if ((rc = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &oneopt, sizeof(oneopt))) == -1 && 
-         errno != ENOPROTOOPT)
-       porterr = 1;
+         errno == ENOPROTOOPT)
+       rc = 0;
 #endif
       
-      if (rc == -1 && !porterr)
+      if (rc != -1)
        rc = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &oneopt, sizeof(oneopt));
       
       if (rc == -1)
diff --git a/src/dhcp6.c b/src/dhcp6.c
index 17e03e5..89af7dd 100644
--- a/src/dhcp6.c
+++ b/src/dhcp6.c
@@ -55,15 +55,15 @@ void dhcp6_init(void)
      support it. This handles the introduction of REUSEPORT on Linux. */
   if (option_bool(OPT_NOWILD) || option_bool(OPT_CLEVERBIND))
     {
-      int rc = -1, porterr = 0;
+      int rc = 0;
 
 #ifdef SO_REUSEPORT
       if ((rc = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &oneopt, sizeof(oneopt))) == -1 &&
-         errno != ENOPROTOOPT)
-       porterr = 1;
+         errno == ENOPROTOOPT)
+       rc = 0;
 #endif
       
-      if (rc == -1 && !porterr)
+      if (rc != -1)
        rc = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &oneopt, sizeof(oneopt));
       
       if (rc == -1)
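
To make the effect of the change easier to follow, the control flow before and after the patch can be modelled without real sockets. This is a sketch under stated assumptions: SO_REUSEPORT is taken to be compiled in, `setsockopt(SO_REUSEADDR)` is assumed to always succeed, and `before_patch`, `after_patch`, and the flag names are hypothetical.

```c
#include <errno.h>

/* Hypothetical result flags: which options ended up set, or failure. */
enum { SET_REUSEPORT = 1, SET_REUSEADDR = 2, SETUP_FAILED = 4 };

/* port_errno: 0 if setsockopt(SO_REUSEPORT) would succeed,
   otherwise the errno it would fail with. */
int before_patch(int port_errno)
{
    int rc = -1, porterr = 0, set = 0;

    rc = port_errno ? -1 : 0;              /* the SO_REUSEPORT call */
    if (rc == 0)
        set |= SET_REUSEPORT;
    if (rc == -1 && port_errno != ENOPROTOOPT)
        porterr = 1;

    if (rc == -1 && !porterr) {            /* only a fallback path */
        rc = 0;                            /* the SO_REUSEADDR call */
        set |= SET_REUSEADDR;
    }

    if (rc == -1)
        set |= SETUP_FAILED;
    return set;
}

int after_patch(int port_errno)
{
    int rc = 0, set = 0;

    rc = port_errno ? -1 : 0;              /* the SO_REUSEPORT call */
    if (rc == 0)
        set |= SET_REUSEPORT;
    if (rc == -1 && port_errno == ENOPROTOOPT)
        rc = 0;                            /* kernel lacks it: not an error */

    if (rc != -1) {                        /* now runs in both cases */
        rc = 0;                            /* the SO_REUSEADDR call */
        set |= SET_REUSEADDR;
    }

    if (rc == -1)
        set |= SETUP_FAILED;
    return set;
}
```

With `port_errno` 0 (the kernel accepts SO_REUSEPORT), `before_patch` yields only `SET_REUSEPORT` while `after_patch` yields both flags, matching the commit summary. With ENOPROTOOPT both versions fall back to `SET_REUSEADDR` alone.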

Comment 17 Pavel Šimerda (pavlix) 2016-01-06 09:23:26 UTC

*** Bug 1176224 has been marked as a duplicate of this bug. ***

Comment 18 Pavel Šimerda (pavlix) 2016-01-06 15:07:27 UTC

Created attachment 1112206 [details]
Patch adapted to RHEL 6.

Comment 29 Matthias Scheutz 2016-02-11 16:54:30 UTC

We just updated to dnsmasq.x86_64 0:2.48-16.el6_7 and kernel-2.6.32-573.18.1.el6.x86_64, but the problem is still there...

Comment 32 errata-xmlrpc 2016-05-11 01:04:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0949.html