Bug 1367565

Summary: nscd sometimes fails to resolve hosts
Product: [Fedora] Fedora Reporter: Christian Klomp <christianklomp>
Component: glibcAssignee: glibc team <glibc-bugzilla>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 25CC: af, arjun.is, bugzilla, christianklomp, codonell, dj, fweimer, jakub, law, mfabian, pfrankli, siddhesh
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-12 10:25:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
tcpdump that shows failure to resolve duckduckgo.com
none
nscd.log with debug_level 10 none

Description Christian Klomp 2016-08-16 19:25:09 UTC
Description of problem:
Nscd randomly fails to resolve hosts which will cause browsers to intermittently show resolve failures or website components on other/sub domains go missing. After the cached negative entry expires resolving the domain may succeed again.

Version-Release number of selected component (if applicable):
- nscd-2.23.1-8.fc24.x86_64
- NetworkManager-1.2.2-2.fc24.x86_64
- systemd-229-12.fc24.x86_64

How reproducible:
Install nscd, enable it, reboot the system and start browsing your favourite sites.

Actual results:
With Firefox some sites fail to load completely in a random fashion while some components might be missing on others (e.g., favicons of the search results on duckduckgo.com).

Expected results:
resolving hosts should be improved by nscd

Additional info:
I reproduced the problem in a VirtualBox VM on the same host and the problem seems to arise more often when the caches are dropped prior to starting the VM (I also disabled nscd on the host just to be sure). Unfortunately, the problem does not always show up. Also, sometimes almost all sites will fail, while at other times only a couple will.

Working around the problem can be done simply by restarting the nscd daemon (or manually clearing the hosts cache). Another work around is to start the daemon after the multi-user.target via the "After=" directive of the systemd service.

My network runs an openwrt router with caching dnsmasq daemon, but the problem does not occur when the above systemd unit file work around is applied, so this is most likely not a problem in an external component.

The network also is IPv6 enabled (via 6to4) but even without IPv6 enabled the problem occurs in the VirtualBox VM.

Comment 1 Florian Weimer 2016-08-16 19:32:23 UTC
You need to reproduce this without Virtualbox.  Their DNS implementation is known to be broken:

  https://sourceware.org/ml/libc-help/2015-10/msg00012.html

Comment 2 Christian Klomp 2016-08-16 19:54:13 UTC
It's not just VirtualBox, it's also my main machine, but that way changing parameters and rebooting gets very time consuming. I only used VirtualBox to reproduce it easily. I can of course use KVM but it is unlikely to make a big difference when the problem also arises on a real box.

Comment 3 Florian Weimer 2016-08-16 20:32:06 UTC
(In reply to Christian Klomp from comment #2)
> It's not just VirtualBox, it's also my main machine, but that way changing
> parameters and rebooting gets very time consuming. I only used VirtualBox to
> reproduce it easily. I can of course use KVM but it is unlikely to make a
> big difference when the problem also arises on a real box.

Okay, we'll ignore the VirtualBox aspect for now.

One potential explanation for the After= difference is that nscd keeps sending DNS queries to servers which were listed in resolv.conf, but subsequently replaced.  Could you please check if this is happening when the bug hits, perhaps using tcpdump?

(We will likely need full-length, binary tcpdump packet captures for port 53 traffic (“tcpdump -i … -s 0 -w dns.cap port 53”) eventually if it's not the known resolv.conf issue.)

Comment 4 Christian Klomp 2016-08-17 19:32:12 UTC
I first migrated the VM to KVM and it is definitely not a VirtualBox problem.

For the packet capture tcpdump listened on the any interface with the advised parameters. The hosts cache had been cleared prior to rebooting.

The packet dump shows that the random failing queries are sent to the loopback address (first queries failed and the last ones succeed):

$ tshark -r dns.cap -Y "dns.qry.name == duckduckgo.com"
143  32.976997    127.0.0.1 → 127.0.0.1    DNS 76 Standard query 0x74d3 A duckduckgo.com
144  32.977029    127.0.0.1 → 127.0.0.1    DNS 76 Standard query 0x5cb7 AAAA duckduckgo.com
145  32.977065    127.0.0.1 → 127.0.0.1    DNS 76 Standard query 0x74d3 A duckduckgo.com
279  62.896094 192.168.122.29 → 192.168.122.1 DNS 76 Standard query 0x9883 A duckduckgo.com
280  62.896117 192.168.122.29 → 192.168.122.1 DNS 76 Standard query 0x6eea AAAA duckduckgo.com
281  62.906613 192.168.122.1 → 192.168.122.29 DNS 172 Standard query response 0x9883 A duckduckgo.com A 176.34.135.167 A 54.229.105.92 A 176.34.131.233 A 176.34.155.20 A 54.229.105.203 A 46.51.197.89
282  62.911047 192.168.122.1 → 192.168.122.29 DNS 132 Standard query response 0x6eea AAAA duckduckgo.com SOA ns0.dnsmadeeasy.com

In the full dump there are more hosts that fail randomly (the browser only failed on duckduckgo.com). Resolv.conf is generated by NetworkManager and only shows 192.168.122.1 as nameserver. Nscd.log shows that this file doesn't exist at boot but it does pick up the change eventually.

Comment 5 Christian Klomp 2016-08-17 19:34:09 UTC
Created attachment 1191735 [details]
tcpdump that shows failure to resolve duckduckgo.com

tcpdump -i any -n -s 0 port 53 -w /tmp/dns.cap

Comment 6 Christian Klomp 2016-08-17 19:34:59 UTC
Created attachment 1191736 [details]
nscd.log with debug_level 10

Comment 7 Christian Klomp 2016-08-20 12:31:29 UTC
Still a problem with:
- nscd-2.23.1-10.fc24.x86_64
- NetworkManager-1.2.4-2.fc24.x86_64
- systemd-229-13.fc24.x86_64

Comment 8 Florian Weimer 2016-08-20 17:03:45 UTC
Is myhostname listed in /etc/nsswitch.conf, as in:

hosts:      files dns myhostname

?

Comment 9 Christian Klomp 2016-08-20 17:39:32 UTC
Yes, nsswitch.conf is in the default state:
hosts:      files mdns4_minimal [NOTFOUND=return] dns myhostname

Comment 10 Carlos O'Donell 2016-09-08 09:59:45 UTC
Is this bug 1374228?

A transient failure in DNS causes myhostname to attempt a resolution which then caches a negative hit to be stored by nscd?

Note that this is a bug of myhostname, but the only way to fix this is for glibc to provide a new API for myhostname to use because otherwise myhostname is unimplementable.

Therefore the workaround is to remove myhostname. Can you try that and see if it helps?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1374228

Comment 11 Christian Klomp 2016-09-11 19:28:22 UTC
@Carlos O'Donell
I'm not sure if it is related to 1374228, removing myhostname from nsswitch.conf doesn't seem to help.

Instead I think it is related to https://bugzilla.redhat.com/show_bug.cgi?id=1367551 because when I substitute the symbolic link for a file and populate it with the same information then the system won't send queries to the localhost any more.

What might be happening is that if nscd starts before NetworkManager starts, it cannot follow the /etc/resolv.conf symlink because NetworkManager hasn't created the file yet. Nscd then uses the localhost as nameserver and then for some reason after NetworkManager has started, nscd keeps using the localhost alongside any new nameservers that NetworkManager picks up.

I like having resolv.conf managed by NetworkManager so changing the nscd.service unit file so that it starts sufficiently late enough still seems to be the best workaround.

Comment 12 Chris Schanzle 2016-12-30 00:38:49 UTC
It appears to me nscd doesn't *really* read the new /etc/resolv.conf after NetworkManager brings up the network.

I also have been having issues with nscd on fedora 24 and 25 breaking hostname lookups on boot.  E.g.,

[root@grad ~]# getent hosts www.google.com

[root@grad ~]# telnet www.google.com
telnet: www.google.com: Name or service not known

[root@grad ~]# host www.google.com
www.google.com has address 172.217.1.4
www.google.com has IPv6 address 2607:f8b0:400c:c06::69

I typically reduce *-time-to-live to 5 seconds in /etc/nscd.conf, so it's not a negative cache issue.

removing "myhostname" from nsswitch.conf did not help.

"nscd -i hosts" restores functionality until the next boot.

Running the tcpdump in comment #5 shows requests going to localhost:

18:49:17.769996 IP localhost.40821 > localhost.domain: 21912+ A? www.yahoo.com. (31)
18:49:17.770038 IP localhost.40821 > localhost.domain: 26613+ AAAA? www.yahoo.com. (31)


In the below, the network was configured by Networkmanager at 18:44:02

[root@grad ~]# systemctl status nscd -l | cat
● nscd.service - Name Service Cache Daemon
   Loaded: loaded (/usr/lib/systemd/system/nscd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-12-29 18:43:49 EST; 7min ago
  Process: 1323 ExecStart=/usr/sbin/nscd $NSCD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1337 (nscd)
    Tasks: 11 (limit: 4915)
   CGroup: /system.slice/nscd.service
           └─1337 /usr/sbin/nscd

Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 disabled inotify-based monitoring for file `/etc/resolv.conf': No such file or directory
Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 stat failed for file `/etc/resolv.conf'; will try again later: No such file or directory
Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring file `/etc/services` (5)
Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring directory `/etc` (2)
Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring file `/etc/netgroup` (6)
Dec 29 18:43:48 localhost.localdomain nscd[1337]: 1337 monitoring directory `/etc` (2)
Dec 29 18:43:49 localhost.localdomain systemd[1]: Started Name Service Cache Daemon.
Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitored file `/etc/resolv.conf` changed (mtime)
Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitoring file `/etc/resolv.conf` (7)
Dec 29 18:44:06 grad.cam.nist.gov nscd[1337]: 1337 monitoring directory `/etc` (2)


Short of disabling nscd, a workaround is to create /etc/systemd/system/nscd.service.d/systemd-wait-online.conf with the contents:
[Unit]
After=NetworkManager-wait-online.service
Wants=NetworkManager-wait-online.service

nscd will correctly follow just changing data in /etc/resolv.conf - as a plain file, i.e., watch tcpdump traffic and 'watch systemctl status nscd' (.16.1 and .16.2 are our internal DNS servers):

echo -e 'search cam.nist.gov\nnameserver 129.6.16.2' > /etc/resolv.conf
echo -e 'search cam.nist.gov\nnameserver 129.6.16.1' > /etc/resolv.conf

shows it changing nameservers for lookups.

I tried (but failed) to break it by removing resolv.conf, restarting nscd, creating a broken symlink to "foo", waiting a bit, then populating foo with various entries.  I can't seem to replicate the scenario that happens at boot.

Comment 13 Chris Schanzle 2017-01-05 18:41:27 UTC
Not that I fully understand this, but experimentally speaking, with /etc/systemd/system/nscd.service.d/network-online.conf containing:

[Unit]
After=network-online.target
Wants=network-online.target

and rebooting, host resolution was not working again.  By looking at the journal:

journalctl -b -u nscd -u NetworkManager -u network-online.target -u NetworkManager-wait-online.target

it was clear systemd started nscd before NetworkManager, well-before the network was even configured.  Here are some snippets:

Jan 05 12:17:18 localhost.localdomain systemd[1]: Starting Name Service Cache Daemon...
Jan 05 12:17:20 localhost.localdomain systemd[1]: Starting Network Manager...
Jan 05 12:17:20 localhost.localdomain systemd[1]: Started Name Service Cache Daemon.
Jan 05 12:17:22 localhost.localdomain NetworkManager[1360]: <info>  [1483636642.0293] NetworkManager (version 1.4.2-2.fc25) is starting...
Jan 05 12:17:33 localhost.localdomain dhclient[1540]: bound to 129.6.88.146 -- renewal in 158521 seconds.
Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483636653.4606] device (eno1): Activation: successful, device activated.
Jan 05 12:17:33 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483636653.4627] connectivity: check for uri 'http://fedoraproject.org/static/hotspot.txt' failed with 'Error resolving 'fedoraproject.org': Name or service not known'

~20 mins passes...I ran "nscd -i hosts" to get resolution working again:

Jan 05 12:37:34 grad.cam.nist.gov NetworkManager[1360]: <info>  [1483637854.9200] manager: NetworkManager state is now CONNECTED_GLOBAL

So it seems After+Wants=network-online doesn't do what is stated in the docs, at least in Fedora 25.  Using After+Wants=NetworkManager-wait-online.service seems to do what is desired/expected.

But still, it does not explain the issues with nscd.

Perhaps worth noting, in one of these reboots, it appeared IPv6 hostname resolution worked, but not IPv4.  While our network supports IPv6, the system does not have an IPv6 address.  E.g.,

[root@grad ~]# getent hosts  fedoraproject.org
2607:f188::dead:beef:cafe:fed1 fedoraproject.org
2605:bc80:3010:600:dead:beef:cafe:feda fedoraproject.org
2604:1580:fe00:0:dead:beef:cafe:fed1 fedoraproject.org
2605:bc80:3010:600:dead:beef:cafe:fed9 fedoraproject.org
2610:28:3090:3001:dead:beef:cafe:fed3 fedoraproject.org

[root@grad ~]# telnet www.google.com
telnet: www.google.com: Name or service not known

[root@grad ~]# getent hosts www.google.com
2607:f8b0:400c:c0b::69 www.google.com
[root@grad ~]# getent hosts www.microsoft.com
[root@grad ~]# getent hosts www.amazon.com

Comment 14 Christian Klomp 2017-02-08 10:58:43 UTC
Still happening on Fedora 25.

Comment 15 Fedora End Of Life 2017-11-16 18:44:10 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 16 Fedora End Of Life 2017-12-12 10:25:45 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.