Hide Forgot
Description of problem: Nscd randomly fails to resolve hosts which will cause browsers to intermittently show resolve failures or website components on other/sub domains go missing. After the cached negative entry expires resolving the domain may succeed again. Version-Release number of selected component (if applicable): - nscd-2.23.1-8.fc24.x86_64 - NetworkManager-1.2.2-2.fc24.x86_64 - systemd-229-12.fc24.x86_64 How reproducible: Install nscd, enable it, reboot the system and start browsing your favourite sites. Actual results: With Firefox some sites fail to load completely in a random fashion while some components might be missing on others (e.g., favicons of the search results on duckduckgo.com). Expected results: resolving hosts should be improved by nscd Additional info: I reproduced the problem in a VirtualBox VM on the same host and the problem seems to arise more often when the caches are dropped prior to starting the VM (I also disabled nscd on the host just to be sure). Unfortunately, the problem does not always show up. Also, sometimes almost all sites will fail, while at other times only a couple will. Working around the problem can be done simply by restarting the nscd daemon (or manually clearing the hosts cache). Another work around is to start the daemon after the multi-user.target via the "After=" directive of the systemd service. My network runs an openwrt router with caching dnsmasq daemon, but the problem does not occur when the above systemd unit file work around is applied, so this is most likely not a problem in an external component. The network also is IPv6 enabled (via 6to4) but even without IPv6 enabled the problem occurs in the VirtualBox VM.
You need to reproduce this without Virtualbox. Their DNS implementation is known to be broken: https://sourceware.org/ml/libc-help/2015-10/msg00012.html
It's not just VirtualBox, it's also my main machine, but that way changing parameters and rebooting gets very time consuming. I only used VirtualBox to reproduce it easily. I can of course use KVM but it is unlikely to make a big difference when the problem also arises on a real box.
(In reply to Christian Klomp from comment #2) > It's not just VirtualBox, it's also my main machine, but that way changing > parameters and rebooting gets very time consuming. I only used VirtualBox to > reproduce it easily. I can of course use KVM but it is unlikely to make a > big difference when the problem also arises on a real box. Okay, we'll ignore the VirtualBox aspect for now. One potential explanation for the After= difference is that nscd keeps sending DNS queries to servers which were listed in resolv.conf, but subsequently replaced. Could you please check if this is happening when the bug hits, perhaps using tcpdump? (We will likely need full-length, binary tcpdump packet captures for port 53 traffic (“tcpdump -i … -s 0 -w dns.cap port 53”) eventually if it's not the known resolv.conf issue.)
I first migrated the VM to KVM and it is definitely not a VirtualBox problem. For the packet capture tcpdump listened on the any interface with the advised parameters. The hosts cache had been cleared prior to rebooting. The packet dump shows that the random failing queries are sent to the loopback address (first queries failed and the last ones succeed): $ tshark -r dns.cap -Y "dns.qry.name == duckduckgo.com" 143 32.976997 127.0.0.1 → 127.0.0.1 DNS 76 Standard query 0x74d3 A duckduckgo.com 144 32.977029 127.0.0.1 → 127.0.0.1 DNS 76 Standard query 0x5cb7 AAAA duckduckgo.com 145 32.977065 127.0.0.1 → 127.0.0.1 DNS 76 Standard query 0x74d3 A duckduckgo.com 279 62.896094 192.168.122.29 → 192.168.122.1 DNS 76 Standard query 0x9883 A duckduckgo.com 280 62.896117 192.168.122.29 → 192.168.122.1 DNS 76 Standard query 0x6eea AAAA duckduckgo.com 281 62.906613 192.168.122.1 → 192.168.122.29 DNS 172 Standard query response 0x9883 A duckduckgo.com A 176.34.135.167 A 54.229.105.92 A 176.34.131.233 A 176.34.155.20 A 54.229.105.203 A 46.51.197.89 282 62.911047 192.168.122.1 → 192.168.122.29 DNS 132 Standard query response 0x6eea AAAA duckduckgo.com SOA ns0.dnsmadeeasy.com In the full dump there are more hosts that fail randomly (the browser only failed on duckduckgo.com). Resolv.conf is generated by NetworkManager and only shows 192.168.122.1 as nameserver. Nscd.log shows that this file doesn't exist at boot but it does pick up the change eventually.
Created attachment 1191735 [details] tcpdump that shows failure to resolve duckduckgo.com tcpdump -i any -n -s 0 port 53 -w /tmp/dns.cap
Created attachment 1191736 [details] nscd.log with debug_level 10
Still a problem with: - nscd-2.23.1-10.fc24.x86_64 - NetworkManager-1.2.4-2.fc24.x86_64 - systemd-229-13.fc24.x86_64
Is myhostname listed in /etc/nsswitch.conf, as in: hosts: files dns myhostname ?
Yes, nsswitch.conf is in the default state: hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
Is this bug 1374228? A transient failure in DNS causes myhostname to attempt a resolution which then caches a negative hit to be stored by nscd? Note that this is a bug of myhostname, but the only way to fix this is for glibc to provide a new API for myhostname to use because otherwise myhostname is unimplementable. Therefore the workaround is to remove myhostname. Can you try that and see if it helps? [1] https://bugzilla.redhat.com/show_bug.cgi?id=1374228
@Carlos O'Donell I'm not sure if it is related to 1374228, removing myhostname from nsswitch.conf doesn't seem to help. Instead I think it is related to https://bugzilla.redhat.com/show_bug.cgi?id=1367551 because when I substitute the symbolic link for a file and populate it with the same information then the system won't send queries to the localhost any more. What might be happening is that if nscd starts before NetworkManager starts, it cannot follow the /etc/resolv.conf symlink because NetworkManager hasn't created the file yet. Nscd then uses the localhost as nameserver and then for some reason after NetworkManager has started, nscd keeps using the localhost alongside any new nameservers that NetworkManager picks up. I like having resolv.conf managed by NetworkManager so changing the nscd.service unit file so that it starts sufficiently late enough still seems to be the best workaround.
It appears to me nscd doesn't *really* read the new /etc/resolv.conf after NetworkManager brings up the network. I also have been having issues with nscd on fedora 24 and 25 breaking hostname lookups on boot. E.g., [root@grad ~]# getent hosts www.google.com [root@grad ~]# telnet www.google.com telnet: www.google.com: Name or service not known [root@grad ~]# host www.google.com www.google.com has address 172.217.1.4 www.google.com has IPv6 address 2607:f8b0:400c:c06::69 I typically reduce *-time-to-live to 5 seconds in /etc/nscd.conf, so it's not a negative cache issue. removing "myhostname" from nsswitch.conf did not help. "nscd -i hosts" restores functionality until the next boot. Running the tcpdump in comment #5 shows requests going to localhost: 18:49:17.769996 IP localhost.40821 > localhost.domain: 21912+ A? www.yahoo.com. (31) 18:49:17.770038 IP localhost.40821 > localhost.domain: 26613+ AAAA? www.yahoo.com. (31) In the below, the network was configured by Networkmanager at 18:44:02 [root@grad ~]# systemctl status nscd -l | cat ● nscd.service - Name Service Cache Daemon Loaded: loaded (/usr/lib/systemd/system/nscd.service; enabled; vendor preset: disabled) Active: active (running) since Thu 2016-12-29 18:43:49 EST; 7min ago Process: 1323 ExecStart=/usr/sbin/nscd $NSCD_OPTIONS (code=exited, status=0/SUCCESS) Main PID: 1337 (nscd) Tasks: 11 (limit: 4915) CGroup: /system.slice/nscd.service └─1337 /usr/sbin/nscd Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 disabled inotify-based monitoring for file `/etc/resolv.conf': No such file or directory Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 stat failed for file `/etc/resolv.conf'; will try again later: No such file or directory Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring file `/etc/services` (5) Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring directory `/etc` (2) Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring file `/etc/netgroup` (6) Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring directory `/etc` (2) Dec 29 18:43:49 localhost.localdomain systemd[1]: Started Name Service Cache Daemon. Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitored file `/etc/resolv.conf` changed (mtime) Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitoring file `/etc/resolv.conf` (7) Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitoring directory `/etc` (2) Short of disabling nscd, a workaround is to create /etc/systemd/system/nscd.service.d/systemd-wait-online.conf with the contents: [Unit] After=NetworkManager-wait-online.service Wants=NetworkManager-wait-online.service nscd will correctly follow just changing data in /etc/resolv.conf - as a plain file, i.e., watch tcpdump traffic and 'watch systemctl status nscd' (.16.1 and .16.2 are our internal DNS servers): echo -e 'search cam.nist.gov\nnameserver 129.6.16.2' > /etc/resolv.conf echo -e 'search cam.nist.gov\nnameserver 129.6.16.1' > /etc/resolv.conf shows it changing nameservers for lookups. I tried (but failed) to break it by removing resolv.conf, restarting nscd, creating a broken symlink to "foo", waiting a bit, then populating foo with various entries. I can't seem to replicate the scenario that happens at boot.
Not that I fully understand this, but experimentally speaking, with /etc/systemd/system/nscd.service.d/network-online.conf containing: [Unit] After=network-online.target Wants=network-online.target and rebooting, host resolution was not working again. By looking at the journal: journalctl -b -u nscd -u NetworkManager -u network-online.target -u NetworkManager-wait-online.target it was clear systemd started nscd before NetworkManager, well-before the network was even configured. Here are some snippets: Jan 05 12:17:18 localhost.localdomain systemd[1]: Starting Name Service Cache Daemon... Jan 05 12:17:20 localhost.localdomain systemd[1]: Starting Network Manager... Jan 05 12:17:20 localhost.localdomain systemd[1]: Started Name Service Cache Daemon. Jan 05 12:17:22 localhost.localdomain NetworkManager[1360]: <info> [1483636642.0293] NetworkManager (version 1.4.2-2.fc25) is starting... Jan 05 12:17:33 localhost.localdomain dhclient[1540]: bound to 129.6.88.146 -- renewal in 158521 seconds. Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info> [1483636653.4606] device (eno1): Activation: successful, device activated. Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info> [1483636653.4627] connectivity: check for uri 'http://fedoraproject.org/static/hotspot.txt' failed with 'Error resolving 'fedoraproject.org': Name or service not known' ~20 mins passes...I ran "nscd -i hosts" to get resolution working again: Jan 05 12:37:34 grad.cam.nist.gov NetworkManager[1360]: <info> [1483637854.9200] manager: NetworkManager state is now CONNECTED_GLOBAL So it seems After+Wants=network-online doesn't do what is stated in the docs, at least in Fedora 25. Using After+Wants=NetworkManager-wait-online.service seems to do what is desired/expected. But still, it does not explain the issues with nscd. Perhaps worth noting, in one of these reboots, it appeared IPv6 hostname resolution worked, but not IPv4. While our network supports IPv6, the system does not have an IPv6 address. E.g., [root@grad ~]# getent hosts fedoraproject.org 2607:f188::dead:beef:cafe:fed1 fedoraproject.org 2605:bc80:3010:600:dead:beef:cafe:feda fedoraproject.org 2604:1580:fe00:0:dead:beef:cafe:fed1 fedoraproject.org 2605:bc80:3010:600:dead:beef:cafe:fed9 fedoraproject.org 2610:28:3090:3001:dead:beef:cafe:fed3 fedoraproject.org [root@grad ~]# telnet www.google.com telnet: www.google.com: Name or service not known [root@grad ~]# getent hosts www.google.com 2607:f8b0:400c:c0b::69 www.google.com [root@grad ~]# getent hosts www.microsoft.com [root@grad ~]# getent hosts www.amazon.com
Still happening on Fedora 25.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.