Bug 1400909 - DNS response SERVFAIL fails resolving, instead of trying the next DNS server
Keywords:
Status: NEW
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-12-02 09:35 UTC by Kamil Páral
Modified: 2018-07-18 15:07 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Kamil Páral 2016-12-02 09:35:44 UTC
Description of problem:
In our office, a misconfigured DNS server returned SERVFAIL when resolving certain hosts (in my case, e.g., fedoraproject.org). On my host computer I didn't notice anything, because the system automatically tried the next DNS server. This was visible, for example, in the nslookup output:

$ nslookup fedoraproject.org
;; Got SERVFAIL reply from <DNS1_IP>, trying next server 
... (standard successful nslookup output from DNS2_IP follows)

However, when I do the same from my libvirt KVM VM, all applications fail to resolve such hosts:

$ nslookup fedoraproject.org
Server: 192.168.11.1
Address: 192.168.11.1#53

** server can't find fedoraproject.org: SERVFAIL

(where 192.168.11.1 is the address of the default route for VMs in my virtual network (NAT type)).

So all my VMs have broken networking. I believe libvirt's DNS forwarding should behave the same way as the host's resolver: if the first DNS server returns an error, it should try the next one.
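To make the expected behaviour concrete, here is a minimal Python sketch of "try the next server on SERVFAIL". It is only an illustration of what I am asking for, not how dnsmasq or glibc are implemented; `resolve_with_fallback` and the `lookup` callback are hypothetical names introduced for this example.

```python
from typing import Callable, Optional, Sequence, Tuple

NOERROR, SERVFAIL = 0, 2  # DNS RCODE values defined in RFC 1035

# Hypothetical callback: lookup(server, name) -> (rcode, answer or None).
Lookup = Callable[[str, str], Tuple[int, Optional[str]]]


def resolve_with_fallback(
    name: str, servers: Sequence[str], lookup: Lookup
) -> Tuple[int, Optional[str]]:
    """Ask each server in order and move on whenever one answers SERVFAIL."""
    last: Tuple[int, Optional[str]] = (SERVFAIL, None)
    for server in servers:
        last = lookup(server, name)
        if last[0] != SERVFAIL:
            # Any non-SERVFAIL reply (including NXDOMAIN) is treated as final.
            return last
    # Every server failed (or none was configured): report the last failure.
    return last
```

With a `lookup` that returns SERVFAIL for the first server and a normal answer for the second, the function returns the second server's answer, which matches what nslookup does on the host.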

My network looks like this:

$ virsh net-dumpxml default
<network>
  <name>default</name>
  <uuid>568a8830-529c-481c-b5f6-1a0be531a074</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:6d:09:26'/>
  <domain name='default'/>
  <ip address='192.168.11.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.11.10' end='192.168.11.50'/>
    </dhcp>
  </ip>
</network>



Version-Release number of selected component (if applicable):
libvirt-2.2.0-2.fc25.x86_64


How reproducible:
always

Steps to Reproduce:
1. You need a failing DNS server (one that returns SERVFAIL); I don't know how to simulate that.
2. See that resolving works from your host (it skips to a second DNS server), but fails completely from your VMs.
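One possible way to simulate the failing server from step 1 is a tiny UDP responder that answers every query with SERVFAIL. The following is an untested-in-this-setup sketch using only the Python standard library; the function names, the loopback address and port 5300 are assumptions chosen for illustration (binding to port 53 normally requires root).

```python
import socket
import struct

SERVFAIL = 2  # RCODE for "server failure" (RFC 1035)


def build_servfail(query: bytes) -> bytes:
    """Build a SERVFAIL response that echoes the first question of `query`."""
    if len(query) < 12:
        raise ValueError("truncated DNS header")
    ident, flags, qdcount = struct.unpack("!HHH", query[:6])
    # Walk the QNAME labels to find where the first question ends.
    pos = 12
    if qdcount:
        while query[pos] != 0:
            pos += 1 + query[pos]
        pos += 1 + 4  # root label, then QTYPE and QCLASS (2 bytes each)
    question = query[12:pos]
    # QR=1 (response), keep OPCODE and RD from the query, set RA, RCODE=SERVFAIL.
    resp_flags = 0x8000 | (flags & 0x7900) | 0x0080 | SERVFAIL
    header = struct.pack("!HHHHHH", ident, resp_flags, 1 if qdcount else 0, 0, 0, 0)
    return header + question


def serve(host: str = "127.0.0.1", port: int = 5300) -> None:
    """Answer every UDP DNS query on host:port with SERVFAIL (runs forever)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, addr = sock.recvfrom(4096)
            try:
                sock.sendto(build_servfail(data), addr)
            except (ValueError, IndexError):
                pass  # ignore malformed packets
```

Calling `serve()` and then running something like `dig @127.0.0.1 -p 5300 fedoraproject.org` should show a SERVFAIL status. To reproduce the bug itself, the responder would have to be listed as the first upstream nameserver on the host, ahead of a working one.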

Comment 1 Cole Robinson 2017-05-03 20:54:06 UTC
laine, any thoughts on this?

Comment 2 Laine Stump 2017-05-04 01:55:03 UTC
So the guests are of course only pointing at 192.168.11.1 for DNS, meaning that the guest can't possibly look to a secondary DNS server if the primary fails. Are you saying, then, that the dnsmasq that is responding to the query from the guest should look to a secondary? Is there a dnsmasq option to control that? If so, possibly libvirt should set that option, but beyond that it's really out of libvirt's control.

Or did I misunderstand your issue?

Comment 3 Kamil Páral 2017-05-05 12:35:57 UTC
The service that is running on the host (is that dnsmasq?) should try the next DNS server if the first one fails, and only then return the result to the guest. I don't know how libvirt communicates with the host regarding DNS. But it seems very weird that when the primary DNS server fails, I can still ping/wget/use Firefox just fine on the host, yet the network in the guest is completely non-functional.

Comment 4 Laine Stump 2017-05-06 02:54:20 UTC
All DNS duties for libvirt networks are handled by a dnsmasq process started by libvirt, so any idiosyncrasies in responses to DNS requests would be a result of the dnsmasq conf file created by libvirt and dnsmasq's own code.

Looking through dnsmasq.conf, I see the "strict-order" option which seems to control how dnsmasq deals with multiple servers. libvirt always adds this option, and has done so since libvirt-0.2.3 (released sometime in 2007). Here's what is said about strict-order in dnsmasq.conf:

  # By  default,  dnsmasq  will  send queries to any of the upstream
  # servers it knows about and tries to favour servers to are  known
  # to  be  up.  Uncommenting this forces dnsmasq to try each query
  # with  each  server  strictly  in  the  order  they   appear   in
  # /etc/resolv.conf
  #strict-order

and here's what was said in the comments of libvirt commit 6a12fee1, which added --strict-order to libvirt's invocations of dnsmasq:

+    /*
+     * Needed to ensure dnsmasq uses same algorithm for processing
+     * multiple nameserver entries in /etc/resolv.conf as GLibC.
+     */
+    APPEND_ARG(*argv, i++, "--strict-order");

To see if changing this option causes the behavior you desire, can you try doing the following:

1) edit the file /var/lib/libvirt/dnsmasq/${netname}.conf  (where ${netname} is the name of the libvirt network you're connecting to), remove the line that says "strict-order", and save the file.

2) ps -AlF | grep dnsmasq | grep ${netname} to learn the pid and full commandline of the dnsmasq process.

3) kill ${dnsmasq-pid}

4) re-run exactly the same commandline you saw in the ps output

Your network will be in a strange mode where libvirt no longer knows the pid of dnsmasq, so if you attempt to destroy the network it won't be able to kill dnsmasq, *but* in the meantime you'll have a dnsmasq running without "strict-order", and you can try the same query that was failing before.
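If hand-editing the file is inconvenient, step 1 can be sketched in Python as below. This is only an illustration: `drop_strict_order` is a hypothetical helper, it works on the file's text rather than touching the file itself, and the sample conf in the usage note is made up rather than real libvirt output.

```python
def drop_strict_order(conf_text: str) -> str:
    """Return conf_text without any line that consists solely of 'strict-order'."""
    kept = [
        line
        for line in conf_text.splitlines(keepends=True)
        if line.strip() != "strict-order"  # commented-out lines are left alone
    ]
    return "".join(kept)
```

For example, `drop_strict_order("strict-order\ninterface=virbr0\n")` returns `"interface=virbr0\n"`. Writing the result back to /var/lib/libvirt/dnsmasq/${netname}.conf would need root, and steps 2 to 4 still have to be done by hand.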

Note that it's also possible you're just seeing a bug in dnsmasq behavior. I'm Cc'ing Simon (dnsmasq author) to see if he has anything to add.

Also, maybe Daniel Berrange has something to say (he's the person who added "--strict-order" to libvirt's invocations of dnsmasq all the way back in 2007). It could be that we can't even consider removing strict-order for some reason I'm not aware of.

Comment 5 Kamil Páral 2017-05-09 08:48:55 UTC
(In reply to Laine Stump from comment #4)
> +    /*
> +     * Needed to ensure dnsmasq uses same algorithm for processing
> +     * multiple nameserver entries in /etc/resolv.conf as GLibC.
> +     */
> +    APPEND_ARG(*argv, i++, "--strict-order");

Assuming all the system tools like ping, and even web browsers, use the default glibc behavior, either the comment isn't correct or there's indeed a bug in the --strict-order implementation (or in the description of what it means). It seems that the current implementation returns the response of the first DNS server, even if it's a failure, and doesn't try the next one.

> To see if changing this option causes the behavior you desire, can you try
> doing the following:

I'm sorry, our DNS server is no longer misconfigured (it was a temporary issue, due to which I discovered this problem), and I have no idea how to emulate that (I have very little networking knowledge).

Comment 6 Sergi Jimenez Romero 2017-09-05 11:08:01 UTC
I believe the option you're looking for is:

--all-servers
    By default, when dnsmasq has more than one upstream server available, it will send queries to just one server. Setting this flag forces dnsmasq to send all queries to all available servers. The reply from the server which answers first will be returned to the original requester.

This problem also happens on OCP: when one of the servers doesn't have the answer, dnsmasq returns that response and doesn't query the next DNS server. I haven't tried the above option yet, but it sounds like the solution.

