Description of problem:
In our office we had a misconfigured DNS server that returned SERVFAIL when resolving certain hosts (in my case, e.g., fedoraproject.org). On my host computer I didn't notice anything, because the system automatically tried the next DNS server. This was visible, e.g., in the nslookup output:

    $ nslookup fedoraproject.org
    ;; Got SERVFAIL reply from <DNS1_IP>, trying next server
    ... (standard successful nslookup output from DNS2_IP follows)

However, when doing the same from my libvirt KVM VM, all applications fail to resolve such hosts:

    $ nslookup fedoraproject.org
    Server:         192.168.11.1
    Address:        192.168.11.1#53

    ** server can't find fedoraproject.org: SERVFAIL

(where 192.168.11.1 is the address of the default route for VMs in my virtual network, which is of NAT type).

So all my VMs have a broken network. I believe the libvirt DNS forwarding should behave the same way the system resolver does: if there's an error from the first DNS server, it should try the next one.

My network looks like this:

    $ virsh net-dumpxml default
    <network>
      <name>default</name>
      <uuid>568a8830-529c-481c-b5f6-1a0be531a074</uuid>
      <forward mode='nat'>
        <nat>
          <port start='1024' end='65535'/>
        </nat>
      </forward>
      <bridge name='virbr0' stp='on' delay='0'/>
      <mac address='52:54:00:6d:09:26'/>
      <domain name='default'/>
      <ip address='192.168.11.1' netmask='255.255.255.0'>
        <dhcp>
          <range start='192.168.11.10' end='192.168.11.50'/>
        </dhcp>
      </ip>
    </network>

Version-Release number of selected component (if applicable):
libvirt-2.2.0-2.fc25.x86_64

How reproducible:
always

Steps to Reproduce:
1. You need to have a failing DNS server; I don't know how to simulate that.
2. See that resolving works from your host (which skips to a second DNS server), but fails completely from your VMs.
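One possible way to simulate the failing server from step 1 is a tiny UDP responder that answers every query with SERVFAIL. The following is only a minimal sketch, not something used in this report: the function names, the 127.0.0.1 address and port 5353 are assumptions made for illustration, it handles UDP only, and it assumes queries with uncompressed names (which is what ordinary clients send).

```python
import socket
import struct

SERVFAIL = 2  # RCODE "Server failure" from RFC 1035


def question_end(msg: bytes, qdcount: int) -> int:
    """Return the offset just past the question section of a DNS message."""
    pos = 12  # the fixed DNS header is 12 bytes long
    for _ in range(qdcount):
        while True:
            length = msg[pos]
            pos += 1
            if length == 0:  # zero-length root label ends the name
                break
            pos += length
        pos += 4  # QTYPE and QCLASS, two bytes each
    if pos > len(msg):
        raise ValueError("truncated question section")
    return pos


def build_servfail(query: bytes) -> bytes:
    """Build a SERVFAIL response that echoes the query's ID and question."""
    if len(query) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    ident, flags, qdcount = struct.unpack("!HHH", query[:6])
    # Set QR (response) and RA, keep opcode and RD from the query, set RCODE.
    resp_flags = 0x8000 | (flags & 0x7900) | 0x0080 | SERVFAIL
    header = struct.pack("!HHHHHH", ident, resp_flags, qdcount, 0, 0, 0)
    # Copy only the question section; answer/authority/additional stay empty.
    return header + query[12:question_end(query, qdcount)]


def serve(host: str = "127.0.0.1", port: int = 5353) -> None:
    """Answer every UDP query with SERVFAIL (hypothetical test helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, addr = sock.recvfrom(4096)
            try:
                sock.sendto(build_servfail(data), addr)
            except (ValueError, IndexError):
                continue  # ignore malformed packets
```

Calling serve() and then running something like "dig @127.0.0.1 -p 5353 fedoraproject.org" should show a SERVFAIL status. To reproduce the report itself, the responder would presumably have to listen on port 53 (which needs root) and be listed as the first nameserver on the host, ahead of a working one.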
laine, any thoughts on this?
So the guests are of course only pointing at the virtual network's own address (192.168.11.1 in your case) for DNS, meaning that the guest can't possibly look to a secondary DNS if the primary fails. Are you saying, then, that the dnsmasq that is responding to the query from the guest should look to a secondary? Is there a dnsmasq option to control that? If so, possibly libvirt should set that option, but beyond that it's really out of libvirt's control. Or did I misunderstand your issue?
The service that is running on the host (is that dnsmasq?) should try the next DNS server if the first one fails, and only then return the result to the guest. I don't know how libvirt communicates with the host regarding DNS. But it seems very weird that if the primary DNS server fails, I can still ping/wget/use Firefox just fine on the host, but I have a completely non-functional network in the guest.
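To make the requested behavior concrete, here is a toy model of the two strategies being contrasted: returning whatever the first server says versus moving on to the next server after a SERVFAIL. It is only an illustration under stated assumptions. The function names and the way a "server" is modelled (a plain Python callable) are invented for this sketch, and it is not how dnsmasq, glibc, or libvirt are actually implemented.

```python
from typing import Callable, Optional, Sequence, Tuple

NOERROR, SERVFAIL = 0, 2  # DNS RCODE values from RFC 1035

# A "server" is modelled as a function: hostname -> (rcode, address or None).
Server = Callable[[str], Tuple[int, Optional[str]]]


def resolve_first_only(name: str, servers: Sequence[Server]) -> Tuple[int, Optional[str]]:
    """Return the first server's reply, even if it is a failure."""
    return servers[0](name)


def resolve_with_fallback(name: str, servers: Sequence[Server]) -> Tuple[int, Optional[str]]:
    """Try servers in order and move on whenever one replies SERVFAIL."""
    last: Tuple[int, Optional[str]] = (SERVFAIL, None)
    for query in servers:
        rcode, address = query(name)
        if rcode != SERVFAIL:
            return rcode, address
        last = (rcode, address)
    return last  # every server failed


# Hypothetical upstream servers: the first is broken, the second works.
def broken(name: str) -> Tuple[int, Optional[str]]:
    return SERVFAIL, None


def working(name: str) -> Tuple[int, Optional[str]]:
    return NOERROR, "192.0.2.10"  # documentation-range address, not a real answer
```

With the list [broken, working], resolve_first_only returns the SERVFAIL that the guests see, while resolve_with_fallback returns the second server's answer, which matches what nslookup did on the host.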
All DNS duties for libvirt networks are handled by a dnsmasq process started by libvirt, so any idiosyncrasies in responses to DNS requests would be a result of the dnsmasq conf file created by libvirt and dnsmasq's own code.

Looking through dnsmasq.conf, I see the "strict-order" option, which seems to control how dnsmasq deals with multiple servers. libvirt always adds this option, and has done so since libvirt-0.2.3 (released sometime in 2007). Here's what is said about strict-order in dnsmasq.conf:

    # By default, dnsmasq will send queries to any of the upstream
    # servers it knows about and tries to favour servers to are known
    # to be up. Uncommenting this forces dnsmasq to try each query
    # with each server strictly in the order they appear in
    # /etc/resolv.conf
    #strict-order

and here's what was said in the comments of libvirt commit 6a12fee1, which added --strict-order to libvirt's invocations of dnsmasq:

    +    /*
    +     * Needed to ensure dnsmasq uses same algorithm for processing
    +     * multiple nameserver entries in /etc/resolv.conf as GLibC.
    +     */
    +    APPEND_ARG(*argv, i++, "--strict-order");

To see if changing this option causes the behavior you desire, can you try doing the following:

1) Edit the file /var/lib/libvirt/dnsmasq/${netname}.conf (where ${netname} is the name of the libvirt network you're connecting to), remove the line that says "strict-order", and save the file.

2) Run "ps -AlF | grep dnsmasq | grep ${netname}" to learn the pid and full commandline of the dnsmasq process.

3) kill ${dnsmasq-pid}

4) Re-run exactly the same commandline you saw in the ps output.

Your network will then be in a strange mode where libvirt no longer knows the pid of dnsmasq, so if you attempt to destroy the network it won't be able to kill dnsmasq, *but* in the meantime you'll have a dnsmasq running without "strict-order", and you can try the same query that was failing before.

Note that it's also possible you're just seeing a bug in dnsmasq behavior.
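Step 1 above can also be done with a few lines of Python rather than a text editor. This is a minimal sketch: the function name drop_option is made up for illustration, and the sample configuration text in the comment is only a guess at what such a file might contain, not a copy of a real libvirt-generated file.

```python
def drop_option(conf_text: str, option: str = "strict-order") -> str:
    """Return conf_text without the lines consisting solely of `option`."""
    kept = [
        line
        for line in conf_text.splitlines(keepends=True)
        if line.strip() != option  # commented-out or longer lines are kept
    ]
    return "".join(kept)


# Illustrative input only; a real conf file will contain other settings.
SAMPLE_CONF = (
    "strict-order\n"
    "interface=virbr0\n"
    "#strict-order\n"
)

# drop_option(SAMPLE_CONF) keeps "interface=virbr0" and the commented line.
```

To apply it, one would read /var/lib/libvirt/dnsmasq/${netname}.conf as root, pass its contents through drop_option, and write the result back before continuing with steps 2 to 4. libvirt may regenerate the file when the network is restarted, so this is only suitable for the temporary experiment described above.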
I'm Cc'ing Simon (dnsmasq author) to see if he has anything to add. Also, maybe Daniel Berrange has something to say (he's the person who added "--strict-order" to libvirt's invocations of dnsmasq all the way back in 2007). It could be that we can't even consider removing strict-order for some reason I'm not aware of.
(In reply to Laine Stump from comment #4)
> +    /*
> +     * Needed to ensure dnsmasq uses same algorithm for processing
> +     * multiple nameserver entries in /etc/resolv.conf as GLibC.
> +     */
> +    APPEND_ARG(*argv, i++, "--strict-order");

Assuming all the system tools like ping or even web browsers use the default glibc behavior, the comment doesn't seem to be correct, or there is indeed a bug in the --strict-order implementation (or in the description of what it means). It seems that the current implementation returns the response of the first DNS server, even if it's a failure, and doesn't try the next one.

> To see if changing this option causes the behavior you desire, can you try
> doing the following:

I'm sorry, our DNS server is no longer misconfigured (it was a temporary issue, which is how I discovered this problem), and I have no idea how to emulate that (I have very little networking knowledge).
I believe the option you're looking for is:

    --all-servers
        By default, when dnsmasq has more than one upstream server
        available, it will send queries to just one server. Setting this
        flag forces dnsmasq to send all queries to all available servers.
        The reply from the server which answers first will be returned to
        the original requester.

This problem also happens on OCP: when one of the servers doesn't have the answer, dnsmasq returns that response and doesn't query the next DNS server. I haven't tried the above option yet, but it sounds like the solution.
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise. If you none the less feel the issue is still important, you may choose to report it again at the new project issue tracker https://gitlab.com/libvirt/libvirt/-/issues The project also welcomes contribution from anyone who believes they can provide a solution.