Bug 505105

Summary:

getaddrinfo() with AI_ADDRCONFIG doesn't suppress AAAA DNS queries on IPv4-only networks

Product:

[Fedora] Fedora

Reporter:

mrmx1 <mrmx1>

Component:

glibc

Assignee:

Carlos O'Donell <codonell>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

low

Version:

CC:

adler, amcnabb, andre.ocosta, antoine, ayourtch, behdad, bernd.bartmann, bugsrep, chris.stone, clodoaldo.pinto.neto, cpanceac, davem, david.halliwell, drepper, drfudgeboy, dvlasenk, fweimer, harshad.rj, horsley1953, hugh, jakub, jclere, jonathansteffan, jonathan.underwood, jrowens.fedora, kai, kernel, kevin, k.georgiou, leon, mads, markzzzsmith, marsmagic3000, matt, matzilla, mhuhtala, mike, mishu, mkanat, mschmidt, neo, nobody, oliver.henshaw, otis5842, paolini, pasteur, perobins, posguy99, psimerda, pw, raina, rayvd, redhat-bugzilla, ricardo.arguello, ric, rmj, samelstob, savitasinghsv265, scottt.tw, sonarguy, sumstultussedesquoque, tim, timo, tobias, tore, twaugh, tyler.kohler, vanhoof, vanmeeuwen+fedora, vossman77, wacker

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Workaround: http://www.fedorafaq.org/f10/#dns-slow

Fixed In Version:

Doc Type:

Bug Fix

Clone Of:

459756

Environment:

Last Closed:

2016-11-24 11:55:05 UTC

Bug Blocks:

883152, 477145

Attachments:

Description	Flags
python/twisted script to send two DNS queries "fast" i.e. back-to-back like glibc 2.10	none
wireshark capture of dns failure	none
wireshark capture of failed DNS session	none
wireshark capture	none
wireshark of DNS traffic that works	none
Capture of dns trafic while wget	none
Solution 1/2: Make getaddrinfo()+AI_ADDRCONFIG ignore link-locals	none
Solution 2/2: Make Mozilla Firefox use AI_ADDRCONFIG when calling getaddrinfo()	none

Description mrmx1@live.com 2009-06-10 16:47:56 UTC

+++ This bug was initially created as a clone of Bug #459756 +++

Description of problem:

Just installed Fedora 11 from x86_64 ISO image, and I encountered the same problem as with Fedora 10 when I first installed it: "DNS resolver not reliable", which was earlier reported as Bug #459756.

What I see:
yum does not connect to external repositories.
ping does resolve external names, and works OK.
Firefox does resolve Internet names only if "network.dns.disableIPv6" is set to TRUE in about:config
Evolution does not connect to my mailboxes, I believe due to DNS failure.

My Network Configuration Ethernet Device is configured (in the GUI) with "Enable IPv6 configuration for this interface" unchecked.

My uname -a reports: 
Linux localhost.localdomain 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

My glibc is:
glibc-2.10.1-2

In my case this is reproducible always (firefox, yum), as it was with Fedora 10.
glibc-2.9-3 resolved the problem in Fedora 10 fine for me, with my DNS.
Maybe this is reproducible for me because my old DNS has no IPv6 at all, I guess.

Comment 1 Jonathan Steffan 2009-06-10 22:11:51 UTC

I will also confirm this issue. We are using load balanced instances of bind-9.3.4-6.0.2.P1.el5_2 for our DNS services. Default is to hand out that DNS service as a resolver target. If resolv.conf is manually changed to remove the DNS VIP and put a resolver that does not go across our firewalls, everything works as expected.

Comment 2 Ulrich Drepper 2009-06-11 08:33:41 UTC

First: why haven't any of you tested the test releases?

Second: where is the data?  Logs of the DNS traffic etc.

There is a work-around in place to catch broken installations which should catch this.  It will definitely delay the operation but this is your own fault and it's documented in the release notes.

Comment 3 mrmx1@live.com 2009-06-11 21:09:24 UTC

I have to apologize because, after such a long time, I realized that at least in my case it isn't a problem with Fedora - I managed to get an updated firmware for my DSL box (which includes my DNS) from somewhere in Australia and as a kind of magic everything seems to be working fine, now...

Most misleading was that all previous releases of Fedora and Red Hat Linux worked just fine since ever.

Comment 4 Vladimir 2009-06-14 17:14:39 UTC

  I have Fedora 10, Fedora 11 and Windows XP on my computer.  I have just installed Fedora 11.  In F10 and WinXP domain names are resolved quickly while in F11 it can take up to 10-15 seconds in both Firefox 3.5 beta 4 and Konqueror.

  After finding this bug report I changed network.dns.disableIPv6 from false to true.  As soon as I did, domain names began to resolve immediately.  If set back to false, name resolution is very slow again.

  I do not think that this is a real cure, probably just hiding the problem.

Comment 5 Ulrich Drepper 2009-06-14 18:15:31 UTC

(In reply to comment #4)
>   I have Fedora 10, Fedora 11 and Windows XP on my computer.  I have just
> installed Fedora 11.  In F10 and WinXP domain names are resolved quickly while
> in F11 it can take up to 10-15 seconds in both Firefox 3.5 beta 4 and
> Konqueror.

There was one little additional problem which creates this behavior.  Every lookup is slow in the F11 GA release's glibc.  I've fixed that since.  Now, as it should be, only the first lookup in a process is slow.

And yes, that will be the final solution.  It is not acceptable that people with broken setups prevent progress for all the rest.  You can set the appropriate option in resolv.conf and always get fast lookups in your broken environment but the default will be as it is now.

Comment 6 Jonathan Steffan 2009-06-19 21:32:45 UTC

We have identified the network device that is causing issues with the new glibc dns resolution behaviour. We have a Juniper SSG-320M that does stateful inspection of the DNS UDP traffic. The issue is that the new behaviour sends two packets with the same signature. "Signature" as in it has the same source and destination for host/port and this causes only one packet to make it back. In many/most cases, that is the AAAA packet and the A has to be requested multiple times before a response properly comes back. We are going to contact the vendor, but I expect others would run into similar issues.

Is it feasible to have both requests go out at the same time but have different source ports, or is that going to be just as slow as requesting A, then AAAA?

Comment 7 Phil Oester 2009-06-19 22:03:53 UTC

Regarding comment #6, that's similar to what I reported on the original bug.  copied from https://bugzilla.redhat.com/show_bug.cgi?id=459756#c108:

Just a further datapoint on this, since I too spent a few days scratching my
head on it.  It looks like what changed in F10 is that both the AAAA and A
requests are sent using the SAME SOURCE PORT, while pre-F10 used different
source ports for the two requests.  

For me, that change spelled trouble in the form of a race for my loadbalancer. 
I saw this:

1) receive A request, creating session table entry with NAT'd reply IP
2) receive AAAA request on port x, reusing session table entry from #1
3) respond to AAAA request on port x and remove session table entry
4) loadbalancer receives response from DNS server for A request, but since
session table entry (with VIP response IP) is gone, it simply forwards the
traffic, so client receives a reply from a different IP (the IP of the server
itself, NOT the vip) and ignores it

So for me, the simple solution to this is to go back to the old behaviour of
having the A and AAAA requests use unique source ports.  Wouldn't that be more
secure anyway?  Seems like a step backward to reuse the port.
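
The four steps above can be sketched as a toy model. This is only an illustration of the described race, not the load balancer's actual logic; the class name SessionTable, the addresses, and the port numbers are hypothetical.

```python
# Toy model of the load-balancer race described above (hypothetical names).
VIP = "192.0.2.53"        # virtual IP the client sends queries to (assumed)
REAL_SERVER = "10.0.0.5"  # backend DNS server behind the VIP (assumed)


class SessionTable:
    """Session table keyed by client (ip, port); one entry per source port."""

    def __init__(self):
        self.sessions = {}

    def on_request(self, client):
        # Steps 1 and 2: a second request from the same source port
        # reuses the existing entry instead of creating a new one.
        self.sessions.setdefault(client, VIP)

    def on_response(self, client):
        # Steps 3 and 4: the first response consumes the entry; a later one
        # finds no entry and is forwarded with the backend's own address.
        return self.sessions.pop(client, REAL_SERVER)


def reply_sources(client_ports):
    """Return the source IP the client sees for each reply, in order."""
    table = SessionTable()
    clients = [("198.51.100.7", port) for port in client_ports]
    for client in clients:
        table.on_request(client)
    return [table.on_response(client) for client in clients]


# Same source port for A and AAAA: the second reply comes from the wrong IP.
print(reply_sources([40000, 40000]))
# Distinct source ports: both replies are rewritten to the VIP.
print(reply_sources([40000, 40001]))
```

In this model the shared port yields one reply from the VIP and one from the backend address, which a client expecting the VIP would ignore; with distinct ports both replies come from the VIP.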

Comment 8 D. Hugh Redelmeier 2009-06-19 22:36:46 UTC

re comments #6 and #7: are these not bugs in your Juniper SSG-320M and load balancer?  Should you not get them fixed?

I discussed this in https://bugzilla.redhat.com/show_bug.cgi?id=459756#c116

Ulrich: here's what I understand your fix to do:  if the first AAAA query doesn't get answered, never ask for AAAA records again.  But since UDP is unreliable, this seems actually wrong.  Wrong for everyone, not just those with broken networks.

To avoid damage from these broken devices, glibc would have to use a distinct port for each query in flight.  That would take some book-keeping in glibc, I would think (unless this case is the only way more than one query could be in flight).

Comment 9 Phil Oester 2009-06-19 23:22:57 UTC

But the question remains, WHY did the behavior change?  Originally, glibc DID use unique ports for the AAAA and A queries.  From a "predictability" perspective, that is a more secure approach, no?  Similar to how ISNs are now randomized in TCP.  

It seems many people's problems would be solved by going back to the (arguably more secure) method of using distinct ports for the A and AAAA queries.

Comment 10 D. Hugh Redelmeier 2009-06-20 00:27:16 UTC

I don't know why the behaviour changed.

To randomize the port, glibc would have to ask the OS to allocate a port for each DNS query (and, actually, two in the case we are talking about because it is actually two queries).  And free the port afterwards.  Port allocation is done by the kernel, on a per-interface basis.  So this would multiply the number of system calls (modestly).  I don't know enough about how expensive these system calls are (they should be cheap).
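
A minimal sketch of what "ask the OS to allocate a port for each query" looks like from user space, written in Python rather than glibc's C. Binding to 127.0.0.1 and the helper name open_query_sockets are assumptions made so the example runs without a network; it does not reflect how glibc itself manages sockets.

```python
import socket


def open_query_sockets(count=2):
    """Open `count` UDP sockets, each bound to a kernel-chosen port."""
    socks = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("127.0.0.1", 0))  # port 0 asks the kernel to pick a free port
        socks.append(s)
    return socks


socks = open_query_sockets()
ports = [s.getsockname()[1] for s in socks]
print(ports)  # two different ephemeral port numbers
for s in socks:
    s.close()  # releasing each port costs one more system call per query
```

Each query therefore costs an extra socket creation and close compared with reusing one socket, which is the overhead being weighed here.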

The way to secure DNS is through DNSSec.  I've been interested in that for a decade.  Looks like it might happen soon.  I cannot understand why, for all the crap that 911 justified, it didn't spur the deployment of DNSSec.

Comment 11 Jonathan Steffan 2009-06-22 19:36:12 UTC

(In reply to comment #7, #8)

re: #8, Sorta. The addition of UDP "sessions" is not always/ever a great idea. Individual vendor implementations might have different methods that work in some/most cases and others that don't work at all. However, stateful packet inspection is something that everyone has to deal with and I would be surprised if our particular firewall vendor is the only one in the world that will run into this issue. As a note, we also have this issue with our Juniper ISGs.

re: #7, We also have an issue using our Foundry ServerIron 450s as load balancers for our DNS traffic. We have "worked around" our issues with a few of our firewalls by adding rules to allow the sessions but have yet to solve how to properly implement a working LB situation for DNS clients that behave the way that glibc is right now.

If it's not that expensive to open up two ports to send the packets out at the same time not waiting for one response or the other, it might be best.

Comment 12 Phil Mayers 2009-06-23 15:05:19 UTC

For reference, we also see this with our Cisco ACE20s, though Cisco are investigating whether the behaviour is intentional or not. I hold no opinion on the wisdom or not of using the same source port.

I am interested (purely from a curiosity PoV) why glibc doesn't send a single query packet with A and AAAA queries. Presumably there are some broken DNS servers that eat the whole packet?

Ulrich - re #5 which specific Fedora update resolved the slowness to be once per-process? Then I can ask our Unix team to pre-apply it to the build.

FYI, we are having timeouts at login - the number of DNS lookups involved in resolving the LDAP server, kerberos TCP & UDP SRV & A records exceeds 60 seconds!

Comment 13 Ulrich Drepper 2009-06-23 16:46:50 UTC

A couple of points:

- almost no server handles two or more requests in one packet.  Very unfortunate
  but true.

- there always was the option to keep the socket open but hardly any program
  took advantage of this (setting the RES_STAYOPEN flag)

- any solution will punish those broken setups.  I'll implement a second fallback.
  If the environment really prevents reuse of the same source port then we'll
  automatically use the fallback, but only once separate requests on the same port fail.

I really hope that people will file this misbehavior of the routers as bugs.  This behavior has always been valid for a DNS library.

Comment 14 Phil Mayers 2009-06-23 16:56:12 UTC

Ah, I suspected servers would choke on A & AAAA in the same packet.

FYI, in this case it's actually a load-balancer. Cisco currently believe the problem is actually timing related - effectively, glibc sends both DNS requests so quickly that the load-balancer is still setting up the session when the 2nd packet arrives, and it's a race-condition of sorts.

They seem receptive to addressing it. I suspect many other NATs and load-balancers might not be completely broken per se, but racy when two requests come in so "quickly". Obviously this is a slightly different issue than broken DNS servers which don't answer at all.

For those who are interested, I'll attach a python/twisted script that reproduces glibc behaviour and can be run on-demand. This is useful when opening a bug with SLB/NAT vendors.

Comment 15 Phil Mayers 2009-06-23 17:00:07 UTC

Created attachment 349114 [details]
python/twisted script to send two DNS queries "fast" i.e. back-to-back like glibc 2.10

This script can be used to test the issue. It behaves (I think) similarly to glibc, in that it very quickly sends an A and AAAA query to the given DNS server from the same source port.
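
For readers without the attachment, the same idea can be sketched with only the Python standard library. This is not the attached twisted script; the function names, the fixed transaction IDs, and the 512-byte receive size are assumptions chosen for illustration.

```python
import socket
import struct

QTYPE_A = 1      # IPv4 address record
QTYPE_AAAA = 28  # IPv6 address record


def build_query(name, qtype, txid):
    """Build a minimal DNS query packet for `name` (class IN)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def send_back_to_back(server, name, timeout=5.0):
    """Send A and AAAA queries from one socket; return the raw replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    replies = []
    try:
        sock.connect((server, 53))  # one source port for both queries
        # Fixed IDs keep the sketch readable; real resolvers randomize them.
        sock.send(build_query(name, QTYPE_A, 0x1111))
        sock.send(build_query(name, QTYPE_AAAA, 0x2222))
        for _ in range(2):
            try:
                replies.append(sock.recv(512))
            except OSError:
                break  # timeout or ICMP error: the symptom being tested for
    finally:
        sock.close()
    return replies
```

Calling send_back_to_back() with the address of the DNS server under test and any hostname returns up to two raw replies; fewer than two suggests that something on the path dropped or misrouted one of them.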

Comment 16 Jonathan Steffan 2009-06-23 22:27:11 UTC

We have worked around the issue by adding custom policies on our firewalls. Additionally, we discovered the failure of the subsequent DNS request to be resolved when adding the commands listed below to the Foundry ServerIron450 on the Virtual Server configuration:

Port dns udp-normal-age
Udp-age 2

Basically, rather than closing the session after the first packet makes it back through, the LB will now consider that session valid for a longer period of time and DNS is working as expected.

Comment 17 Phil Mayers 2009-06-24 13:39:45 UTC

Cisco have confirmed that this is a timing related bug, fixed in ACE software 1.5. The Cisco bug number is CSCsw52831.

I have also found another interesting behaviour, which I'll document here for reference; we have a 2nd DNS IP that passes through our ACE (but is not handled by the ACE). This was also suffering problems, but my test script was not.

The difference appears to be in the use of connected versus unconnected UDP sockets. Specifically, unconnected UDP sockets seem (under Linux) to always have an IP ID==0, and pass through the ACE fine. Connected UDP sockets seem to have incrementing IP IDs, and seem to get treated in a session-aware manner, and subject to the same timing bug.

glibc seems to use connected sockets, and thus hits the bug.
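
The API difference between the two socket styles can be sketched as follows. This only illustrates connected versus unconnected UDP sockets on loopback; it does not reproduce the IP ID observation, which depends on the kernel, and the loopback "server" is a stand-in assumed for the example.

```python
import socket

# A stand-in "DNS server" on loopback so the sketch needs no network access.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(5.0)
server_addr = server.getsockname()

# Unconnected: the destination is given on every sendto() call, and the
# socket accepts datagrams from any source.
unconnected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
unconnected.sendto(b"query-1", server_addr)

# Connected: connect() fixes the peer once; send() needs no address, and the
# kernel only delivers datagrams whose source matches that peer.
connected = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
connected.connect(server_addr)
connected.send(b"query-2")

received = sorted(server.recvfrom(512)[0] for _ in range(2))
peer_matches = connected.getpeername() == server_addr
print(received, peer_matches)

for s in (server, unconnected, connected):
    s.close()
```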

I hope this info is of interest. If someone knows which version of the F11 glibc RPM contains the "only 1st lookup is slow" fix, that would be useful.

Comment 18 Andrew Yourtchenko 2009-06-24 14:15:28 UTC

Phil, I verified, and the C test program with which I reproduced the problem yesterday does send an IP ID of 0 (with the DF bit set), so I'd still maintain it is just timing-related for the ACE. :-)

Comment 19 Andrew McNabb 2009-06-24 18:26:58 UTC

Somebody pointed out that there may be security reasons to use two separate ports.  Wouldn't it be best to just always use two separate ports?  I don't see any drawback to having separate ports.

Comment 20 Andrew Yourtchenko 2009-06-24 18:55:52 UTC

With my protocol purist hat on, I agree with Ulrich that it is a pure bug in the middleboxes, whatever they are - but the practicalities make this tougher to get in if the same queries are sent in the bangbang manner over the same four-tuple. OTOH, the separate ports suck by doubling the state on the "dumb NAPT" boxes in the middle. But other than that (and the added PITA of handling two sockets) - Ulrich/Jakub, are there any obstacles why this is bad, besides the extra code to deal with 2 sockets on the clientside ?

With the practical hat on, having the queries originate from different port should be very practical. Of course, there are "dumb NAPTs", but there the issue is mostly the CPU spent, while for the stateful boxes those would be some significant changes in the code, if they were not coded with the assumption that there can be parallel outstanding requests over the same 4-tuple.

Of course I'd be biased to have this fixed in the middleboxes, but the trouble is that some customers will be in the different administrative domain than the middleboxes - so it's gonna be very challenging for them.

Comment 21 Ulrich Drepper 2009-06-25 07:43:58 UTC

Opening two sockets is slow and eats up precious resources (yes, I consider 64k sockets a precious resource).  I already said I'll implement this as a fallback solution in case everything else fails and the fallback mode will be automatically reached (after various timeouts).  That'll likely happen next week.  But it won't be the default.  If workarounds to broken behavior is the default the underlying bugs will never be fixed.  If people are inconvenienced because of the bugs but it still works they hopefully continue to harass the responsible parties.

The good news is that at least some of the vendors are responsive.  Cisco apparently already fixed some of their code.  Let's hope this continues to be the case for others as well.

Comment 22 Ulrich Drepper 2009-06-26 10:54:57 UTC

I've pushed to the upstream repository a change which implements the second fallback mode.  It is automatically used (after appropriate timeouts) or it can be requested with the new resolver option single-request-reopen.
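
Requesting the option explicitly means adding it to an "options" line in /etc/resolv.conf. A small sketch, assuming a placeholder nameserver address and a hypothetical helper resolver_options(), that shows the line and checks for it:

```python
SAMPLE_RESOLV_CONF = """\
# Hypothetical /etc/resolv.conf; the nameserver address is a placeholder.
nameserver 192.0.2.1
options single-request-reopen
"""


def resolver_options(text):
    """Return the set of option names found on `options` lines."""
    found = set()
    for line in text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "options":
            found.update(fields[1:])
    return found


print("single-request-reopen" in resolver_options(SAMPLE_RESOLV_CONF))
```

To check a real system, the same function can be given the contents of /etc/resolv.conf instead of the sample text.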

Andreas will hopefully build a scratch glibc sometime soon (perhaps for F12, but that version could be used on F11, too).  When this happens please test it and don't wait until after F12 is released, as it happened this time once again.

Comment 23 Andreas Schwab 2009-06-26 13:40:21 UTC

An F12 build is available here: <http://koji.fedoraproject.org/koji/taskinfo?taskID=1436854>.

Comment 24 Jean-frederic Clere 2009-07-09 20:15:06 UTC

It seems to be the same problem F10 had after its release:
ping ok.
wget, FF, etc. not ok.
On F10 it was https://bugzilla.redhat.com/show_bug.cgi?id=459756
The first fix was:
+++
I have fixed the problem by installing glibc-2.9-3 from koji
http://koji.fedoraproject.org/koji/buildinfo?buildID=73861
+++

Comment 25 Niels Haase 2009-07-14 22:04:37 UTC

*** Bug 504951 has been marked as a duplicate of this bug. ***

Comment 26 Frederick Dean 2009-07-15 16:37:21 UTC

Created attachment 353850 [details]
wireshark capture of dns failure


I had this problem with F10, and now again with F11.
It prevents me from connecting to yum, firefox, or
ssh, which makes fixing it difficult, although
dig and ping work just fine.   

I have captured the network traffic of a failed ssh
session which is attached.  The Fedora 11 client sends
out two queries (both A and AAAA) for the same name
at the same time from the same port.  The queries have
different transaction ID numbers so they both get a 
response, but unfortunately the socket appears to be
closed after the first response, because the second
response elicits an ICMP destination unreachable from
the client.  Unfortunately my DNS server is really
fast for the AAAA failures, so all my IPV4 connections
fail.  It's a Motorola DSL modem from AT&T.

If the two transactions need to use the same socket,
libc needs to leave the socket open to get both
responses.  The impact is huge.
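
A minimal sketch of what "leave the socket open to get both responses" amounts to: match each reply to an outstanding query by its transaction ID and stop only when nothing is pending. The helper names and the fake reply payloads are hypothetical, and this is not glibc's implementation.

```python
import struct


def transaction_id(packet):
    """Return the 16-bit ID from the first two bytes of a DNS message."""
    return struct.unpack("!H", packet[:2])[0]


def collect_replies(packets, pending_ids):
    """Match incoming packets to outstanding query IDs.

    `packets` is any iterable of raw replies (for example, successive
    recv() results). The caller should close the socket only after this
    returns with nothing left pending.
    """
    pending = set(pending_ids)
    matched = {}
    for packet in packets:
        txid = transaction_id(packet)
        if txid in pending:
            matched[txid] = packet
            pending.discard(txid)
        if not pending:
            break  # both answers are in; now it is safe to close
    return matched, pending


# The AAAA reply (ID 0x2222) arrives first, as in the capture described above.
replies = [b"\x22\x22aaaa-reply", b"\x11\x11a-reply"]
matched, pending = collect_replies(replies, [0x1111, 0x2222])
print(sorted(hex(i) for i in matched), pending)
```

If the loop is cut off after the first packet, the A query (0x1111) remains pending, which corresponds to the failure seen in the capture.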

Comment 27 Robert Scheck 2009-07-15 16:42:39 UTC

Frederick, blame the vendor/administrator of your DNS server for this,
please. Try the RPM packages from comment #23 and return some feedback
here. These packages can be used on Fedora 11 (or at least should be).

Comment 28 Andreas Schwab 2009-07-15 16:46:44 UTC

There is also an F11 build here: <http://koji.fedoraproject.org/koji/taskinfo?taskID=1475122>.

Comment 29 Niels Haase 2009-07-15 18:55:21 UTC

*** Bug 509166 has been marked as a duplicate of this bug. ***

Comment 30 Frederick Dean 2009-07-17 15:19:00 UTC

Created attachment 354162 [details]
wireshark capture of failed DNS session

(In reply to comment #27)

Sorry for the slow response but I tested the new glibc-2.10.1-3 
i686/i586.  Unfortunately it has not fixed my problem.  The ICMP 
messages have gone away, but the DNS server is responding in the 
same order as the requests which may explain it.  A capture
is attached...

  [fdd@jibsheet ~]$ ssh -v fdd.com
  OpenSSH_5.2p1, OpenSSL 0.9.8k-fips 25 Mar 2009
  debug1: Reading configuration data /etc/ssh/ssh_config
  debug1: Applying options for *
  ssh: Could not resolve hostname fdd.com: Temporary failure in name resolution

Strangely, the new glibc also breaks gnome-terminal.  The window
comes up with menus but no text in terminal area.  There is a
pop-up error of, "There was an error creating the child process for 
this terminal."  xterm works fine, so I tried running 
gnome-terminal on the command line, but it prints 
nothing more informative to stdout.  I tried strace'ing 
gnome-terminal and it never forks.  Going back and forth 
between glibc versions shows the breakage is clearly
dependent on the new glibc being installed.  I don't need
to reboot, but I do need to kill all gnome-terminals.
Apparently it is bug 509632, and the remount command of
bug 509632 comment 9 fixes it.

Comment 31 Ulrich Drepper 2009-07-17 17:20:13 UTC

(In reply to comment #30)
> Unfortunately it has not fixed my problem.  The ICMP 
> messages have gone away, but the DNS server is responding in the 
> same order as the requests which may explain it.  A capture
> is attached... 

That data makes no sense.  Or more correctly: it makes no sense that you're having problems.

The log shows that the server correctly answers both requests.  I see the same pattern here (as expected) and everything works fine.

Where is the data captured?  On the client machine (172.22.2.7 in your case)?  Is there a firewall running?

Also, please use getent for the tests:

   getent ahosts fdd.com

That's more predictable, all the code involved comes from glibc.

Furthermore, I assume fdd.com is your domain.  It doesn't happen for other domains?

Comment 32 Frederick Dean 2009-07-17 19:05:54 UTC

Created attachment 354195 [details]
wireshark capture


I agree it's hard to understand.  It is not fixed by 
"sudo /etc/init.d/iptables stop".  The capture was made 
on the client machine 172.22.2.7.  It happens for all domains, 
unless I add them to the /etc/hosts file.  I can ping and 
dig them just fine.  I'm attaching an unfiltered capture
of this session...

  [root@jibsheet ~]# rpm -q glibc
  glibc-2.10.1-3.i686
  [root@jibsheet ~]# iptables-save
  # Generated by iptables-save v1.4.3.1 on Fri Jul 17 13:48:59 2009
  *filter
  :INPUT ACCEPT [534:96543]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [136:21713]
  COMMIT
  # Completed on Fri Jul 17 13:48:59 2009
  [root@jibsheet ~]# ping -c 1 google.com
  PING google.com (74.125.45.100) 56(84) bytes of data.
  64 bytes from yx-in-f100.google.com (74.125.45.100): icmp_seq=1 ttl=54 time=39.4 ms
  
  --- google.com ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 51ms
  rtt min/avg/max/mdev = 39.430/39.430/39.430/0.000 ms
  [root@jibsheet ~]# ssh -v google.com
  OpenSSH_5.2p1, OpenSSL 0.9.8k-fips 25 Mar 2009
  debug1: Reading configuration data /etc/ssh/ssh_config
  debug1: Applying options for *
  ssh: Could not resolve hostname google.com: Temporary failure in name resolution
  [root@jibsheet ~]# getent ahosts fdd.com
  [root@jibsheet ~]# getent ahosts google.com


The machine is newly installed F11, with updates applied by yum 
after adding hostnames to /etc/hosts.  I've added vlc
from rpmfusion, and now your glibc, but basically nothing 
else.

Trying again with glibc-2.10.1-2 produces the same 
results as -3, i.e. no ICMP.  I'm sorry to be such trouble.

Comment 33 Frederick Dean 2009-07-17 22:03:22 UTC

Created attachment 354226 [details]
wireshark of DNS traffic that works

(In reply to comment #32)

The computer works fine elsewhere, so the difference must
be in the DNS AAAA response.  The only difference I can
see is that the working response includes a SOA record
that the failing one doesn't.  The successful capture
is attached.

Comment 34 Casey Dahlin 2009-07-25 17:55:19 UTC

I'm getting this same issue, and I also have an AT&T Motorola modem. Considering buying a router in hopes that it will act as a DNS relay and protect me from whatever quirk is causing this (I'm having serious internet withdrawal :). Going into the modem interface and asking it to resolve a name directly seems to work.

Comment 35 Steve Chapel 2009-08-26 17:59:38 UTC

I had this issue with x86-64 Fedora 11, and I'm having the same issue with x86-64 Fedora 12 alpha. Let me know what I can do to help.

Comment 36 Ulrich Drepper 2009-08-27 14:34:39 UTC

(In reply to comment #35)
> I had this issue with x86-64 Fedora 11, and I'm having the same issue with
> x86-64 Fedora 12 alpha. Let me know what I can do to help.  

Look through the comments.  What broken hardware do you use?  What are the requests that are sent etc etc.

Comment 37 Steve Chapel 2009-08-27 15:19:20 UTC

I have a HomePortal 1000HW wireless DSL modem from 2wire. The problem occurs whether I connect to the router wirelessly or with an Ethernet cable. My ISP is AT&T and I use their domain name servers.

I don't know how to see the requests that are sent. I'm not an expert at TCP/IP networking. Are the comments saying that this is essentially a problem with the DNS server at my ISP? I'm able to use Firefox if I go to about:config and disable IPv6. Will disabling IPv6 in Fedora work around the problem?

Comment 38 Steve Chapel 2009-08-28 03:16:35 UTC

I added the line:
install ipv6 /bin/true
to the top of the file /etc/modprobe.d/dist.conf to disable IPv6, and the problems have gone away.

Comment 39 Clodoaldo Pinto Neto 2009-08-28 15:27:06 UTC

I have two machines, one F10 and the other F11, behind the same ADSL router, dlink DSL 500B. F11 has no problems.

The DNS servers are set in resolv.conf to those of opendns.com

In F10 Firefox only works if network.dns.disableIPv6 is set to true. The weather applet can't connect. Yum has to try many mirrors before it finds one that works. folding@home can't connect at all.

Added these lines to modprobe.conf and rebooted:
alias net-pf-10 off
alias ipv6 off

After that Firefox works with network.dns.disableIPv6 set to false. The weather applet connects. Yum still has to try many mirrors and folding@home can't connect.

Is this the same issue reported in this bug?

# cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=d2.localdomain

# cat /etc/networks
default 0.0.0.0
loopback 127.0.0.0
link-local 169.254.0.0

# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
DEVICE=eth0
BOOTPROTO=none
DNS1=208.67.220.220
DNS2=208.67.222.222
DNS3=10.1.1.1
GATEWAY=10.1.1.1
HWADDR=00:21:97:00:79:21
IPADDR=10.1.1.110
NETMASK=255.0.0.0
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
IPV6INIT=yes
NM_CONTROLLED=yes
PEERDNS=yes

Also tried with IPV6INIT=no.

Comment 40 Clodoaldo Pinto Neto 2009-08-29 09:58:12 UTC

I reverted to the default configuration, IPv6 enabled.

While capturing the traffic in wireshark there is no IPv6 traffic reaching the eth0 interface. When I disable IPv6 in Firefox I can see the IPv4 traffic in wireshark. So I guess this has nothing to do with the issue reported in this bug. I will post in the Fedora list and later, if necessary, open another bug.

Steve #38, does Yum work correctly with IPv6 disabled, I mean, it finds a suitable server in the first tries?

Comment 41 David Miller 2009-08-30 01:27:05 UTC

The bug has nothing to do with whether IPV6 traffic is generated or not.

If you have ipv6 enabled, GLIBC will try AAAA DNS requests, and the
behavior of how this is done in conjunction with normal A record
DNS requests is what causes problems with some equipment.

But these DNS queries happen over ipv4 in your configuration.
So if you want to "see it happen" you need to trace ipv4 traffic,
looking for DNS queries.
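
As a rough illustration of where the AAAA requests come from: a lookup with an unspecified address family asks for both IPv4 and IPv6 results, while an application that only wants IPv4 can say so and needs only A records. The sketch below uses Python's socket.getaddrinfo wrapper; the helper name resolve() is hypothetical, and the numeric address is used only so the example runs without sending any DNS queries.

```python
import socket


def resolve(host, port, ipv4_only=False):
    """Resolve `host` and return a list of (family, address) pairs."""
    # AF_UNSPEC allows both IPv4 and IPv6 results; AF_INET limits the
    # lookup to IPv4, so only A records are needed.
    family = socket.AF_INET if ipv4_only else socket.AF_UNSPEC
    infos = socket.getaddrinfo(host, port, family=family,
                               type=socket.SOCK_STREAM)
    return [(info[0], info[4][0]) for info in infos]


# A numeric address keeps this runnable without any DNS traffic.
print(resolve("127.0.0.1", 80, ipv4_only=True))
```

Running resolve() on a real hostname with ipv4_only=False while capturing port 53 traffic is one way to observe the paired A and AAAA queries on the wire.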

Comment 42 Steve Chapel 2009-08-31 02:15:19 UTC

(In reply to comment #40)
> 
> Steve #38, does Yum work correctly with IPv6 disabled, I mean, it finds a
> suitable server in the first tries?  

With Fedora 12 alpha, I cannot get any software update to work at all until I apply my workaround. After my workaround, I still have some problems (bug #516957) but at least software update works to some extent.

Comment 43 Steve Chapel 2009-09-01 16:51:35 UTC

I was also able to work around the problem by going into IPv4 Settings in the Network Connections Preferences and selecting Automatic (DHCP) addresses only and typing in 208.67.222.222, 208.67.220.220 for the DNS servers to use OpenDNS, then rebooting. I still have the same problems with software update as I did with IPv6 disabled.

Comment 44 Matt Benjamin 2009-09-17 10:31:37 UTC

I have repeated this behavior on Fedora 11.  While yum updates are succeeding with some timeouts, the larger problem is that other applications do fail.  I've further had no success with the workarounds (other than the explicit fix for Firefox).

I have tried:

a. installation of glibc-2.10.90-22, and using the option single-request-reopen in resolv.conf

b. disabling ipv6

I would appreciate any other workaround ideas, ideally not "it's the fault of other equipment."

Comment 45 Clodoaldo Pinto Neto 2009-09-17 10:59:14 UTC

(In reply to comment #44)

I opened another bug as this one is not the one I'm experiencing:

https://bugzilla.redhat.com/show_bug.cgi?id=520304

Comment 46 Matthieu Araman 2009-09-27 11:43:46 UTC

Created attachment 362819 [details]
Capture of dns trafic while wget

Capture of DNS traffic while running wget lwn.net on Fedora 11 with normal updates.
The provider is Orange/France Telecom. The DNS server appears to be the IP of the Livebox (there are several million users behind a Livebox, I think).

wget takes a little more than 10s:
real	0m10.922s
user	0m0.005s
sys	0m0.013s

The DNS capture shows:
Fedora sends two requests, one A and one AAAA.
The answer for A is given at once.
After 5s, the A query is retried (why? we have already received the result ...).
The A answer comes back immediately.
Then the AAAA query is retried, this time after the A answer has come.

After 10s there is no DNS traffic, but wget decides to get the page (it was resolving the host before).

After 15s and 20s, the DNS server times out the AAAA answer.

This looks timing related and dependent on the behaviour of the DNS server, as the same PC behaves differently when connecting to another provider.
As the DNS server answers immediately with a good IP, why is Linux waiting for so long?

So my questions:
Is Linux correctly getting the first A answer, or is it rejecting it (believing it to be an answer that does not match the AAAA query)?
What timings are used?

Comment 47 Matthieu Araman 2009-09-27 11:48:45 UTC

I'm using glibc 2.10.1-5
Disabling IPv6 in Firefox's about:config also works around the problem (I didn't try disabling it globally, as this should work by default).

Comment 48 Jean-frederic Clere 2009-12-23 08:21:08 UTC

I have changed the DNS proxy option to off in my router (Netopia-3000); that fixes the problems on my boxes (F11 and F12).

Comment 49 Casey Dahlin 2010-04-06 04:55:50 UTC

This is reliable. Some programs have perfect DNS, some don't work at all.

[sadmac@foucault coding]$ ping edge.launchpad.net
PING edge.launchpad.net (91.189.89.225) 56(84) bytes of data.
64 bytes from wildcard-launchpad-net.banana.canonical.com (91.189.89.225): icmp_seq=1 ttl=42 time=101 ms
64 bytes from wildcard-launchpad-net.banana.canonical.com (91.189.89.225): icmp_seq=2 ttl=42 time=102 ms
64 bytes from wildcard-launchpad-net.banana.canonical.com (91.189.89.225): icmp_seq=3 ttl=42 time=102 ms
64 bytes from wildcard-launchpad-net.banana.canonical.com (91.189.89.225): icmp_seq=4 ttl=42 time=102 ms
^C
--- edge.launchpad.net ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3816ms
rtt min/avg/max/mdev = 101.601/101.983/102.238/0.235 ms
[sadmac@foucault coding]$ bzr co lp:libnih libnih-error
bzr: ERROR: Connection error: Could not resolve 'edge.launchpad.net' [Errno -2] Name or service not known
[sadmac@foucault coding]$

Comment 50 Casey Dahlin 2010-04-06 04:56:18 UTC

Above is on latest F12

Comment 51 Bug Zapper 2010-04-27 14:44:26 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 52 Piotr Romanus 2010-04-30 15:56:09 UTC

Just like Casey I am experiencing the same problem on F12.

Clipper:~ $ ping peach.mycompany.com
PING peach.mycompany.com (10.26.1.61) 56(84) bytes of data.
64 bytes from peach.mycompany.com (10.26.1.61): icmp_seq=1 ttl=63 time=0.267 ms
64 bytes from peach.mycompany.com (10.26.1.61): icmp_seq=2 ttl=63 time=0.311 ms
^C
--- peach.mycompany.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1371ms
rtt min/avg/max/mdev = 0.267/0.289/0.311/0.022 ms
Clipper:~ $ ssh peach.mycompany.com
ssh: Could not resolve hostname peach.mycompany.com: No address associated with hostname


Here is the output from Wireshark when ssh command is issued:

24921	4305.188943	10.4.1.236	10.1.1.151	DNS	Standard query A peach.mycompany.com
24922	4305.188983	10.4.1.236	10.1.1.151	DNS	Standard query AAAA peach.mycompany.com
24923	4305.189460	10.1.1.151	10.4.1.236	DNS	Standard query response A 10.26.1.61
24924	4305.189475	10.1.1.151	10.4.1.236	DNS	Standard query response

The only way for me to fix this problem is to  put the following line in /etc/hosts 
10.26.1.61 peach

I should mention that this problem is _not_ unique to this network - the F12 machine is a laptop and I can see this problem at work as well as at home. It does not happen all the time but often enough to be annoying. Not sure what's the trigger.


Clipper:~ $ rpm -qa |grep glibc
glibc-2.11.1-4.i686
glibc-2.11.1-4.x86_64
glibc-headers-2.11.1-4.x86_64
glibc-common-2.11.1-4.x86_64
glibc-devel-2.11.1-4.x86_64
glibc-debuginfo-2.11.1-1.x86_64


Let me know if I can help in any way to debug it.

Comment 53 Bug Zapper 2010-06-28 12:52:15 UTC

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 54 Tore Anderson 2011-01-31 18:06:21 UTC

There is no question that the underlying problem here is defective DNS resolvers that choke on perfectly legitimate AAAA queries. That said, there are a couple of issues present in software shipped by Fedora that cause the problem to manifest itself as slowdowns noticeable by end users:

1) When called with the AI_ADDRCONFIG flag, libc's getaddrinfo() function does not disregard link-local IPv6 addresses when determining whether or not the local host has usable IPv6 connectivity. Since every IPv6-capable OS will have link-local IPv6 addresses assigned to all interfaces - regardless of any external connectivity being available or not - this essentially makes AI_ADDRCONFIG on Linux useless for the purpose of suppressing AAAA queries when they're not useful.

I've submitted a bug to the GNU libc upstream about this issue at <http://sourceware.org/bugzilla/show_bug.cgi?id=12377>.

getaddrinfo() on other operating systems (such as Apple Mac OS X and Microsoft Windows) does disregard link-local IPv6 addresses when called with AI_ADDRCONFIG, which is why the problem appears to affect GNU/Linux distributions more than other operating systems.

2) Many applications do not set the AI_ADDRCONFIG flag when calling getaddrinfo(). This includes, notably, Mozilla Firefox. However, a patch to correct this has recently been committed to the mozilla-central development repo and will likely be part of Firefox 4.0 beta 11 (hopefully also 3.6.15), see <https://bugzilla.mozilla.org/show_bug.cgi?id=614526>. Microsoft Windows enables the use of AI_ADDRCONFIG as the system-wide default, as far as I know, which explains why it is able to cope better with those broken middleware boxes. Mac OS X does not set AI_ADDRCONFIG by default; however, it has an extremely short timeout waiting for AAAA responses after the A query has been answered (around 125ms), which in turn hides the problem from most end users. Additionally, most major browsers (except Firefox) do set AI_ADDRCONFIG explicitly, which suppresses the problematic AAAA queries in the first place.

So what Fedora could do to avoid this problem is 1) to develop and include a patch to glibc that makes getaddrinfo() ignore link-local addresses for AI_ADDRCONFIG purposes, and 2) to back-port the NSPR patch already committed to mozilla-central to the version of Firefox shipped (or wait until Mozilla releases a new version with the patch already included).

Tore
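
The two readings of AI_ADDRCONFIG contrasted in point 1 can be illustrated with a small sketch. This is not glibc's code: the function name has_usable_ipv6 and the ignore_link_local switch are hypothetical, the sample addresses are invented, and only Python's standard ipaddress module is assumed.

```python
import ipaddress


def has_usable_ipv6(configured, ignore_link_local=True):
    """Return True if any configured address should count as IPv6 connectivity.

    `configured` is an iterable of address strings, as they might be read
    from the host's interfaces. Loopback never counts (RFC 3493 wording);
    link-local (fe80::/10) is skipped only when `ignore_link_local` is set,
    which models the behaviour proposed in this bug (hypothetical switch).
    """
    for text in configured:
        addr = ipaddress.ip_address(text)
        if addr.version != 6 or addr.is_loopback:
            continue
        if ignore_link_local and addr.is_link_local:
            continue
        return True
    return False


# An IPv4-only Ethernet host still carries an autoconfigured link-local address.
ipv4_only_host = ["127.0.0.1", "::1", "192.0.2.10", "fe80::21a:2bff:fe3c:4d5e"]
```

With ignore_link_local=False the sample host counts as IPv6-capable, so AAAA queries would still be sent; with the default it does not, which is the behaviour attributed above to Mac OS X and Windows.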

Comment 55 Tore Anderson 2011-02-11 16:08:49 UTC

Created attachment 478268 [details]
Solution 1/2: Make getaddrinfo()+AI_ADDRCONFIG ignore link-locals

Comment 56 Tore Anderson 2011-02-11 16:09:36 UTC

Created attachment 478270 [details]
Solution 2/2: Make Mozilla Firefox use AI_ADDRCONFIG when calling getaddrinfo()

Comment 57 Tore Anderson 2011-02-11 16:13:39 UTC

The two patches I've just attached solve this problem for most users:

The first makes getaddrinfo() ignore link-local addresses when called with the AI_ADDRCONFIG flag set. This makes getaddrinfo() avoid querying for AAAAs when the host has no IPv6 connectivity, provided that the AI_ADDRCONFIG flag is set. This brings glibc's getaddrinfo() behaviour in line with Mac OS X and Windows.

The second makes Mozilla Firefox use AI_ADDRCONFIG when calling getaddrinfo(). Note that the Mozilla release drivers have already approved this patch for inclusion on the 3.6.x branch, and it has already been committed to Firefox 4.0 (it's included in beta11).

Please apply.

(Of course, there might be applications other than Mozilla Firefox that do not set AI_ADDRCONFIG as well, which would require similar patches. However, Mozilla Firefox is the obvious one and likely the source of most user complaints.)

Tore

Comment 58 Tore Anderson 2012-12-16 13:15:25 UTC

Okay, so this is still a problem. What happens is:

1) the user enters some host name into his web browser or other application of choice running on a machine connected to an IPv4-only Ethernet network
2) the application kicks off a getaddrinfo() call for the host name, using AF_UNSPEC and AI_ADDRCONFIG
3) getaddrinfo() transmits both IN A (IPv4) and IN AAAA (IPv6) DNS queries to the upstream resolver
4) The upstream resolver, which is typically some cheapo home gateway or something, doesn't understand the IN AAAA queries and either doesn't respond to them at all, or screws them up somehow
5) getaddrinfo() doesn't get a valid answer for the IN AAAA queries (valid answer could include NXDOMAIN or NODATA status codes), retransmits them, sits around waiting
6) user wonders why the web page or whatever takes "forever" to load, goes to submit/comment on bugs such as this one
7) getaddrinfo() finally times out the IN AAAA queries, returns IPv4 results to the application
8) lather rinse repeat

AI_ADDRCONFIG *should* have solved this issue, by suppressing IN AAAA queries from IPv4-only machines. However, the auto-configured IPv6 link-local addresses on all Ethernet interfaces cause getaddrinfo() to consider that the machine has IPv6, and therefore it won't suppress IN AAAA queries anymore. More info here:

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG#Problem_2:_IN_AAAA_DNS_query_suppression_from_Ethernet-connected_IPv4-only_hosts
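
For readers unfamiliar with the call in step 2, here is a minimal sketch of how an application might request a dual-stack lookup with AI_ADDRCONFIG, using Python's socket wrapper around getaddrinfo(). The helper names lookup_hints and resolve are hypothetical; whether AAAA queries are actually suppressed depends on the C library underneath, which is the subject of this bug.

```python
import socket


def lookup_hints():
    """Keyword arguments for a dual-stack lookup that opts in to AI_ADDRCONFIG."""
    return {
        "family": socket.AF_UNSPEC,     # ask for both IPv4 and IPv6 results
        "type": socket.SOCK_STREAM,     # one entry per address, stream sockets only
        "flags": socket.AI_ADDRCONFIG,  # skip families the host has no address for
    }


def resolve(host, port):
    """Resolve `host` as step 2 describes; this performs a real lookup."""
    return socket.getaddrinfo(host, port, **lookup_hints())
```

Calling resolve("lwn.net", 80) on an affected host would be expected to show the delay described in steps 3 to 7.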

Comment 59 Pavel Šimerda (pavlix) 2012-12-16 13:32:03 UTC

I see no reason why this shouldn't be fixed. We are working on solutions, all information here:

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG

Related fedora feature page:

https://fedoraproject.org/wiki/Features/DualstackNetworking

Adding to the 'dualstack' tracker bug and modified the summary.

Comment 60 Pavel Šimerda (pavlix) 2012-12-16 14:50:29 UTC

*** Bug 697149 has been marked as a duplicate of this bug. ***

Comment 61 Pavel Šimerda (pavlix) 2012-12-16 17:30:15 UTC

*** Bug 459756 has been marked as a duplicate of this bug. ***

Comment 62 Fedora Admin XMLRPC Client 2013-01-28 20:08:21 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 63 Fedora End Of Life 2013-04-03 19:59:59 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 64 markzzzsmith 2013-04-14 01:12:07 UTC

I disagree with making getaddrinfo() consider a host with only IPv6 link-local addresses to not have IPv6 connectivity. It does, it has IPv6 connectivity to all other IPv6 hosts on the directly attached links which also all have link-local addresses. This is actually the point of hosts automatically configuring link-local addresses on all interfaces all the time, and the IPv6 Addressing Architecture specifying that all interfaces must have link-local addresses  - so that they can at a minimum always reach their on-link neighbors via link-local addressing. This is also why protocols such as IPv6 neighbor discovery, Multicast Listener Discovery and routing protocols such as OSPF use link-local addresses as source and/or destination addresses to reach their neighbors.

Combine autoconfigured IPv6 link-local addresses with a service discovery protocol such as Multicast DNS or SSDP, and you have Zero Configuration networking without any user intervention. Compare that with IPv4, where support for 169.254.0.0/16 is patchy, because it is done in userspace via DHCPv4. IPv6 will universally and reliably provide zero configuration networking.

I think it is reasonable to make hosts more robust against broken devices in the network, but ignoring IPv6 link-local connectivity and then suppressing AAAA queries is not the solution. Happy Eyeballs (RFC6555) and IPv6 source and destination selection (RFC6724) are.

Comment 65 Tore Anderson 2013-04-14 08:56:13 UTC

(In reply to comment #64)
> I disagree with making getaddrinfo() consider a host with only IPv6
> link-local addresses to not have IPv6 connectivity.

That's not what this bug report is about. It's about suppressing DNS "IN AAAA" queries if the host only has link-local addresses and AI_ADDRCONFIG is supplied.

> Combine autoconfigured IPv6 link-local addresses with a service discovery
> protocol such as Multicast DNS or SSDP,

This bug report is specifically about DNS. It's not about MDNS, SSDP, /etc/hosts, or any other NSS backends.

> I think it is reasonable to make hosts more robust against broken devices in
> the network, but ignoring IPv6 link-local connectivity and then suppressing
> AAAA queries is not the solution.

How is it useful for a host with only link-local addresses to perform "IN AAAA" DNS queries? Keep in mind that in order to use a link-local address, you need to supply an interface scope (e.g., "fe80::1%eth0"), and that DNS cannot supply this information.

Tore
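
To make the scope argument concrete, the following sketch splits a textual address into its address and zone parts and reports whether it carries enough information to be used. The helper names split_scope and is_connectable are hypothetical; an AAAA record can only ever carry the part before the "%".

```python
import ipaddress


def split_scope(text):
    """Split 'fe80::1%eth0' into (IPv6Address, zone); zone is '' if absent."""
    addr, _, zone = text.partition("%")
    return ipaddress.IPv6Address(addr), zone


def is_connectable(text):
    """A link-local address needs a zone to be usable; other addresses do not."""
    addr, zone = split_scope(text)
    return bool(zone) or not addr.is_link_local
```

Under this model, a bare "fe80::1" as it would come back from DNS is not connectable, while "fe80::1%eth0" and any global address are.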

Comment 66 markzzzsmith 2013-04-14 21:01:58 UTC

(In reply to comment #65)
> (In reply to comment #64)
> > I disagree with making getaddrinfo() consider a host with only IPv6
> > link-local addresses to not have IPv6 connectivity.
> 
> That's not what this bug report is about. It's about suppressing DNS "IN
> AAAA" queries if the host only has link-local addresses and AI_ADDRCONFIG is
> supplied.
> 

I understand that.

The AI_ADDRCONFIG flag does not preclude link-local addresses. From RFC3493:


"  If the AI_ADDRCONFIG flag is specified, IPv4 addresses shall be
   returned only if an IPv4 address is configured on the local system,
   and IPv6 addresses shall be returned only if an IPv6 address is
   configured on the local system.  The loopback address is not
   considered for this case as valid as a configured address."

Note that loopback addresses are excluded, so the designers specifically thought about excluding address types.


> > Combine autoconfigured IPv6 link-local addresses with a service discovery
> > protocol such as Multicast DNS or SSDP,
> 
> This bug report is specifically about DNS. It's not about MDNS, SSDP,
> /etc/hosts, or any other NSS backends.
> 

getaddrinfo() is not a DNS protocol specific API call, and is used in front of all those NSS backends, so that applications don't have to be exposed to how the address information was determined. For example, I run MDNS at home, and when I enable it, all of my IPv6 applications automatically work with it.

Here is what RFC3493 describes it as:

6.1 Protocol-Independent Nodename and Service Name Translation

   Nodename-to-address translation is done in a protocol-independent
   fashion using the getaddrinfo() function.


getaddrinfo() can return all the information necessary to use a link-local address, i.e. both the address and the interface index, via the sin6_scope_id field of the sockaddr_in6 structure that is returned via the ai_addr field.

By making AI_ADDRCONFIG ignore link-local addresses, the getaddrinfo() call becomes broken for NSS backends that can provide both the link-local address and the corresponding interface index, such as MDNS or any other future ones.

Perhaps the DNS NSS backend could provide it, by returning the interface index of the interface it received the response on if a link-local address is returned. On the common single-homed host, this is likely to be the correct interface index for the link-local address.

> > I think it is reasonable to make hosts more robust against broken devices in
> > the network, but ignoring IPv6 link-local connectivity and then suppressing
> > AAAA queries is not the solution.
> 
> How is it useful for a host with only link-local addresses to perform "IN
> AAAA" DNS queries? Keep in mind that in order to use a link-local address,
> you need to supply an interface scope (e.g., "fe80::1%eth0"), and that DNS
> cannot supply this information.
> 

Something else outside of DNS could provide the interface information, and the application combines them. Specifying a hostname (perhaps in /etc/hosts) and an interface will be much simpler than specifying literal link-local addresses because getaddrinfo() won't lookup IPv6 addresses when the host only has link-local addresses.

The "Happy Eyeballs" technique (RFC6555) wasn't just intended to be applied to web browsers, according to this draft from Fred Baker:

"Happier Eyeballs"
https://www.ietf.org/id/draft-baker-happier-eyeballs-00.txt

and could probably be applied to the DNS "application". For example, off the top of my head:

1) issue a standard DNS query including both A and AAAA queries.
2) if no response is received after 400ms (roughly half way around the world), issue two individual queries, one for an A and one for an AAAA.

That way you're not stopping getaddrinfo() from being used on IPv6 hosts with just link-local addresses, and it won't penalise people who have resolvers in their CPE that do the right thing. Those with broken CPE see a slight delay, but not a significant one, and one that most people won't notice.

Comment 67 Tom Horsley 2013-04-14 21:51:34 UTC

I thought the obvious problem with this was addressed way up near the top in comment 20 when someone pointed out that using the same port for the IPv4 and IPv6 queries gave firewalls fits. Ulrich Drepper had one of his standard "purity over practicality" tantrums and refused to change it to use two different ports to accommodate dumb firewalls, but since Ulrich is gone now, perhaps saner heads could revisit that? (And perhaps revisit all other bug fixes rejected over the years by Ulrich? :-).

Comment 68 Pavel Šimerda (pavlix) 2013-04-15 09:56:56 UTC

(In reply to comment #66)
> The AI_ADDRCONFIG flag does not preclude link-local addresses. From RFC3493:
> 
> 
> "  If the AI_ADDRCONFIG flag is specified, IPv4 addresses shall be
>    returned only if an IPv4 address is configured on the local system,
>    and IPv6 addresses shall be returned only if an IPv6 address is
>    configured on the local system.  The loopback address is not
>    considered for this case as valid as a configured address."
> 
> Note that loopback addresses are excluded, so the designers specifically
> thought about excluding address types.

We are not discussing the designers' virtues but the technical issues. The RFC is (1) INFORMATIONAL and (2) wrong. For more detailed information, see:

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG

(any comments of technical value welcome)

> For example, I run MDNS at home,
> and when I enable it, all of my IPv6 applications automatically work with it.

This is not true. Just try mDNS with link-local addresses (which you mentioned) and you will realize that this feature is absent with the current glibc and nss-mdns.

> getaddrinfo() can return all the information necessary to use a link-local
> address, i.e. both the address and the interface index, via the
> sin6_scope_id field of the sockaddr_in6 structure that is returned via the
> ai_addr field.

getaddrinfo() can, while the NSS backends cannot. Therefore currently getaddrinfo() would only return scope_id for IPv6 literals, not mDNS nor any similar protocol.

> By making AI_ADDRCONFIG ignore link-local addresses, the getaddrinfo() call
> becomes broken for NSS backends

Currently false. You can't break a feature that is absent.

> Perhaps the DNS NSS backend could provide it,

You don't need scope_id for global addresses and you don't need this information with DNS responses at all.

> Something else outside of DNS could provide the interface information, and
> the application combines them.

I don't see the need for that. DNS returns global addresses. Global addresses don't need scope_id.

> Specifying a hostname (perhaps in /etc/hosts)

Not sure whether /etc/hosts can be used to provide scope_id.

> and an interface will be much simpler

There is currently no standard way to do that. And I don't think it is valuable enough to seek standardization for that.

> than specifying literal link-local
> addresses because getaddrinfo() won't lookup IPv6 addresses when the host
> only has link-local addresses.

There's a much easier solution. Just don't apply the same rules to mDNS you apply to DNS. I believe all of this is already described in:

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG

> The "Happy Eyeballs" technique (RFC6555) wasn't just intended to be applied
> to web browsers, according to this draft from Fred Baker:
> 
> "Happier Eyeballs"
> https://www.ietf.org/id/draft-baker-happier-eyeballs-00.txt
> 
> and could probably be applied to the DNS "application". For example, off the
> top of my head:
> 
> 1) issue a standard DNS query including both A and AAAA queries.

In the case described by this bug report, there's no need to query AAAA as global routing is not available anyway. That's all.

> 2) if no response is received after 400ms (roughly half way around the
> world), issue two individual queries, one for an A and one for an AAAA.

I think the glibc resolver implementation is so dumb that it's not worth adding a bunch of hacks. And glibc is, at the same time, too important to be played with on a daily basis whenever another broken name server is discovered.

Even important features like split DNS are missing with the glibc resolver. Therefore my current recommendation is that *all distributions* start deploying a local recursive DNS server like unbound or dnsmasq and perform all DNS tweaking in the specialized software. This is much more useful and much more maintainable. Any libc and any DNS-enabled software can make use of the features then.

> That way you're not stopping getaddrinfo() from being used on IPv6 hosts
> with just link-local addresses, and it won't penalise people who have
> resolvers in their CPE that do the right thing.

If you consider a CPE with DNS resolver a good solution, then I think a local full-fledged DNS resolver is an even better one (it would still use the CPE one as its upstream source).

(In reply to comment #67)
> I thought the obvious problem with this was addressed way up near the top in
> comment 20 when someone pointed out that using the same port for the IPv4
> and IPv6 queries gave firewalls fits.

> Ulrich Drepper had one of his standard
> "purity over practicality" tantrums and refused to change it to use two
> different ports to accommodate dumb firewalls, but since Ulrich is gone now,
> perhaps saner heads could revisit that?

Revisit? We certainly can. But I think this exact bug report is about saving the AAAA query when it is apparently not needed. Unfortunately the patch had a side effect on non-DNS cases and therefore was removed.

The glibc getaddrinfo() is rather broken for many cases (corner cases for some, day-to-day cases for others), see:

https://fedoraproject.org/wiki/Features/FixNetworkNameResolution

> (And perhaps revisit all other bug fixes rejected over the years by Ulrich? :-).

I already did some bugkeeping upstream:

http://sourceware.org/bugzilla/buglist.cgi?quicksearch=getaddrinfo

But none of the bug reports are actually specific to DNS protocol processing. I personally believe that the DNS processing in the GNU C Library should be as simple as possible and hack-free. Distributions should use local resolving DNS servers that work correctly from the glibc side and perform any necessary hacks on the external side.

Such software is much more easily testable, replaceable (e.g. with an instance with more debugging enabled) and maintainable.

Comment 69 Tore Anderson 2013-04-15 10:54:42 UTC

(Pavel already responded to most of your points, so I'll try to avoid just repeating his points.)

(In reply to comment #66)

> getaddrinfo() is not a DNS protocol specific API call, and is used in front
> of all those NSS backends, so that applications don't have to be exposed to
> how the address information was determined. For example, I run MDNS at home,
> and when I enable it, all of my IPv6 applications automatically work with it.

Well, again, this isn't about MDNS. Nobody is suggesting to make the MDNS backend ignore link-locals if called from getaddrinfo() w/AI_ADDRCONFIG. This bug is specifically about the DNS backend's behaviour; MDNS is out of scope.

> By making AI_ADDRCONFIG ignore link-local addresses, the getaddrinfo() call
> becomes broken for NSS backends that can provide both the link-local address
> and the corresponding interface index, such as MDNS or any other future ones.

See above, this is about the DNS backend *only*.

> Perhaps the DNS NSS backend could provide it, by returning the interface
> index of the interface it received the response on if a link-local address
> is returned. On the common single-homed host, this is likely to be the
> correct interface index for the link-local address.

This is a flawed assumption, even in the single-homed host case. One obvious example: If you run a local caching resolver (which I believe NetworkManager has native support for doing these days), you'll end up with all the returned link-local addresses being scoped to the "lo" interface, which is probably not what you want.

> Something else outside of DNS could provide the interface information, and
> the application combines them. Specifying a hostname (perhaps in /etc/hosts)
> and an interface will be much simpler than specifying literal link-local
> addresses because getaddrinfo() won't lookup IPv6 addresses when the host
> only has link-local addresses.

It would appear to me that the proper thing for such an application to do is to simply not use AI_ADDRCONFIG. However, does such an application actually exist, or are you inventing it just to support your position?

> 1) issue a standard DNS query including both A and AAAA queries.

Not possible. There is no single DNS query that requests both A and AAAA responses. (If, by any chance, you want to say "ANY" right now - don't, it doesn't do what you think it does.)

> 2) if no response is received after 400ms (roughly half way around the
> world), issue two individual queries, one for an A and one for an AAAA.

Two individual queries are what's being done today, and that's the only thing you can do. Also, 400ms, even multiple seconds, is too short a timeout - the major part of a DNS lookup isn't the single RTT to the resolver listed in /etc/resolv.conf - it's waiting for that resolver to actually find the record in question. This is the sum of all RTTs to all the authoritative name servers in the delegation chain, potentially including timeouts and retransmits at some of the steps.
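
As an illustration of why two messages are needed, here is a minimal sketch that builds the wire format of a DNS query by hand. Each question carries exactly one QTYPE field, so A (1) and AAAA (28) go out as separate messages. The function name build_query and the transaction IDs are hypothetical; nothing is sent on the network.

```python
import struct

QTYPE_A = 1      # RFC 1035
QTYPE_AAAA = 28  # RFC 3596
QCLASS_IN = 1


def build_query(name, qtype, txid):
    """Build a minimal DNS query message with exactly one question."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


# A dual-stack lookup therefore needs two messages, one per record type.
queries = [
    build_query("lwn.net", QTYPE_A, 0x1234),
    build_query("lwn.net", QTYPE_AAAA, 0x1235),
]
```

The two messages differ only in the transaction ID and the two-byte QTYPE, which matches the paired A and AAAA queries visible in the Wireshark output earlier in this report.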

The only way to get Happy Eyeballs-ish behaviour using getaddrinfo() is if you run an IPv4-only thread with getaddrinfo(AF_INET)->connect(AF_INET), and a similar one for AF_INET6. You can't do it within getaddrinfo(), as it must wait for all responses (or timeouts) before it can return anything. In any case, in the dual thread case it doesn't make sense to use AI_ADDRCONFIG since you're requesting an explicit address family.
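
The per-family approach described above can be sketched for the resolution step only (the connect() half is left out). This is an illustration under assumptions, not a recommended implementation: the function name first_family_to_answer and its resolver parameter are hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import concurrent.futures
import socket


def first_family_to_answer(host, port, resolver=socket.getaddrinfo, timeout=5.0):
    """Run one lookup per address family and return the first that succeeds.

    Returns (family, results). Raises OSError if both families fail, or
    TimeoutError if neither finishes within `timeout` seconds.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(resolver, host, port, family=fam, type=socket.SOCK_STREAM): fam
        for fam in (socket.AF_INET, socket.AF_INET6)
    }
    try:
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return futures[fut], fut.result()
            except OSError:
                continue  # this family failed; wait for the other one
        raise OSError("no address family could be resolved")
    finally:
        # Do not block on the slower lookup; its thread finishes on its own.
        pool.shutdown(wait=False)
```

As noted above, AI_ADDRCONFIG is deliberately not passed, since each lookup names an explicit family. A lookup that hangs still occupies its worker thread until the underlying resolver times out.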

> That way you're not stopping getaddrinfo() from being used on IPv6 hosts
> with just link-local addresses,

So again, this is not about stopping getaddrinfo() from being used on such hosts, it's just to stop it from emitting useless and potentially harmful IN AAAA DNS queries. Nothing else.

Tore

Comment 70 markzzzsmith 2013-04-15 21:12:25 UTC

(In reply to comment #69)
> (Pavel already responded to most of your points, so I'll try to avoid just
> repeating his points.)
> 
> (In reply to comment #66)
> 
> > getaddrinfo() is not a DNS protocol specific API call, and is used in front
> > of all those NSS backends, so that applications don't have to be exposed to
> > how the address information was determined. For example, I run MDNS at home,
> > and when I enable it, all of my IPv6 applications automatically work with it.
> 
> Well, again, this isn't about MDNS. Nobody is suggesting to make the MDNS
> backend ignore link-locals if called from getaddrinfo() w/AI_ADDRCONFIG.
> This bug is specifically about the DNS backend's behaviour; MDNS is out of
> scope.
> 

There were no qualifiers on your described behaviour. You may have been talking about DNS, but the description of the change of behaviour to AI_ADDRCONFIG did not specify that it was limited to the DNS backend.

> > By making AI_ADDRCONFIG ignore link-local addresses, the getaddrinfo() call
> > becomes broken for NSS backends that can provide both the link-local address
> > and the corresponding interface index, such as MDNS or any other future ones.
> 
> See above, this is about the DNS backend *only*.
> 

Again, you had no qualifiers.

> > Perhaps the DNS NSS backend could provide it, by returning the interface
> > index of the interface it received the response on if a link-local address
> > is returned. On the common single-homed host, this is likely to be the
> > correct interface index for the link-local address.
> 
> This is a flawed assumption, even in the single-homed host case. One obvious
> example: If you run a local caching resolver (which I believe NetworkManager
> has native support for doing these days), you'll end up with all the
> returned link-local addresses being scoped to the "lo" interface, which is
> probably not what you want.
> 

Then the cache is broken. It should be caching all the information that would be returned in the sockaddr structure returned to getaddrinfo(), not just the returned IP addresses.

> > Something else outside of DNS could provide the interface information, and
> > the application combines them. Specifying a hostname (perhaps in /etc/hosts)
> > and an interface will be much simpler than specifying literal link-local
> > addresses because getaddrinfo() won't lookup IPv6 addresses when the host
> > only has link-local addresses.
> 
> It would appear to me that the proper thing for such an application to do is
> to simply not use AI_ADDRCONFIG. However, does such an application actually
> exist, or are you inventing it just to support your position?
> 

I don't know if an application like this exists, but I doubt you know absolutely that it doesn't exist. Where are the restrictions that say such an application can't exist? The definition of AI_ADDRCONFIG didn't prohibit them, or even make recommendations against them.

You're asserting that link-locals aren't being stored in DNS. How do you know that? Have you queried all the DNS space in the world?

There have been no prohibitions on link-local addresses being put in DNS, and now you are creating one, and are asserting that as you've never seen a reason to do so, nobody else has either.

Here is a realistic scenario where link-local addresses would usefully be stored in DNS.

An organisation may want to create organisation-wide unique static link-local addresses, assigning them to their routers' interfaces. This would make the link-local addresses independent of the MAC addresses of the routers' interfaces, and would also make the use of link-local addresses as e.g. static route next hops simpler and less error prone, because there are no intentional duplicates. For example, their first router's first configured interface would have fe80::1, and their 10th router's first configured interface might be fe80::15, depending on how many interfaces the other routers have.

To document the static link-local addresses, the following sorts of DNS records are created (using router10, interface eth0 as an example):

eth0.rtr10.example.com. IN AAAA fe80::15
eth0.rtr10.example.com. IN TXT "Ethernet 0 on Router 10, MAC addr 02:00:00:00:00:01"

5.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa.  IN  PTR eth0.rtr10.example.com.
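
The owner name of a PTR record like the one above can be cross-checked mechanically. A minimal sketch using Python's standard ipaddress module follows; the helper name ptr_owner_name is hypothetical.

```python
import ipaddress


def ptr_owner_name(address):
    """Return the fully qualified reverse-lookup name for an IP address."""
    # reverse_pointer yields the nibble-reversed ip6.arpa name without a
    # trailing dot; the dot is appended to match zone-file notation.
    return ipaddress.ip_address(address).reverse_pointer + "."


router_if = ipaddress.ip_address("fe80::15")
```

For an IPv6 address the result has 32 single-nibble labels followed by ip6.arpa, so ptr_owner_name("fe80::15") can be compared directly against a hand-written record.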


> > 1) issue a standard DNS query including both A and AAAA queries.
> 
> Not possible. There is no single DNS query that requests both A and AAAA
> responses. (If, by any chance, you want to say "ANY" right now - don't, it
> doesn't do what you think it does.)
> 

This was a basic example to demonstrate the concept of the happy eyeballs approach ("off the top of my head"), not a detailed proposal. 

> > 2) if no response is received after 400ms (roughly half way around the
> > world), issue two individual queries, one for an A and one for an AAAA.
> 
> Two individual queries is what's being done today, and that's the only thing
> you can do. Also, 400ms, even multiple seconds, is too short a timeout - the
> major part a DNS lookup isn't the single RTT to the resolver listed in
> /etc/resolv.conf - it's waiting for that resolver to actually find the
> record in question. This is the sum of all RTTs to all the authoritative
> name servers in the delegation chain, potentially including timeouts and
> retransmits at some of the steps.
> 
> The only way to get Happy Eyeballs-ish behaviour using getaddrinfo() is if
> you run an IPv4-only thread with getaddrinfo(AF_INET)->connect(AF_INET), and
> a similar one for AF_INET6. You can't do it within getaddrinfo(), as it must
> wait for all responses (or timeouts) before it can return anything. In any
> case, in the dual thread case it doesn't make sense to use AI_ADDRCONFIG
> since you're requesting an explicit address family.
> 

The happy eyeball behaviour would be within the DNS backend, hidden from the application behind the getaddrinfo(,AI_ADDRCONFIG) call.

> > That way you're not stopping getaddrinfo() from being used on IPv6 hosts
> > with just link-local addresses,
> 
> So again, this is not about stopping getaddrinfo() from being used on such
> hosts, it's just to stop it from emitting useless and potentially harmful IN
> AAAA DNS queries. Nothing else.
> 

How do you know they're useless? How are they harmful?

The things that are useless and harmful here are the broken CPE, not link-locals in DNS, or IPv4/IPv6 hosts that only have link-local IPv6 addresses.

Mark.

Comment 71 Tore Anderson 2013-04-15 21:38:53 UTC

(In reply to comment #70)

> There were no qualifiers on your described behaviour. You may have been
> talking about DNS, but the description of the change of behaviour to
> AI_ADDRCONFIG did not specify that it was limited to the DNS backend.

The title of this bug is:

«getaddrinfo() with AI_ADDRCONFIG doesn't suppress ****AAAA DNS queries**** on IPv4-only networks»

(emphasis mine)

If that's not a crystal clear qualifier, I don't know what is.

> Then the cache is broken. It should be caching all the information that
> would be returned in the sockaddr structure returned to getaddrinfo(), not
> just the returned IP addresses.

glibc speaks to a caching resolver (e.g. dnsmasq) on 127.0.0.1 or ::1, using the regular DNS protocol. The DNS protocol has no means of communicating an interface scope ID. So how exactly would this work?

> You're asserting that link-locals aren't being stored in DNS. How do you
> know that? Have you queried all the DNS space in the world?

No, but I am asserting that storing link-locals in DNS is completely pointless, as it cannot possibly work, because there is no way the DNS protocol can communicate a scope ID to the querier.

> There has been no prohibitions on link-local addresses being put in DNS, and
> now you are creating one, and are asserting that as you've never seen reason
> to do so, nobody else has either.
> 
> Here is a realistic scenario where link-local addresses would usefully be
> stored in DNS.
> 
> An organisation may want to create organisation wide unique static
> link-local addresses, assigning them to their routers' interfaces. This
> would make the link-local addresses independent of the MAC addresses of the
> routers interfaces, and would also make the use of link-local addresses as
> e.g., static route next hops, simpler and less error prone because there are
> no intentional duplicates. e.g., their first router's first configured
> interface would have fe80::1, their e.g. 10th router's first configured
> interface might be fe80::15, depending on how many interfaces the other
> routers have.
> 
> To document the static link local addresses the following sorts of DNS
> records are created (using router10, interface eth0 as an example)
> 
> eth0.rtr10.example.com. IN AAAA fe80::15
> eth0.rtr10.example.com. IN TXT "Ethernet 0 on Router 10, MAC addr
> 02:00:00:00:00:01"
> 
> 5.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. 
> IN  PTR eth0.rtr10.example.com.

Red herring. Making getaddrinfo() with AI_ADDRCONFIG suppress IN AAAA queries has nothing to do with this use of DNS as essentially a documentation tool, and would not "prohibit" this in any way.

What you can't do, though, is e.g. "ssh eth0.rtr10.example.com" - regardless of presence of IPv4 addresses on the host, IPv6 addresses on the host, and whether or not the ssh application uses AI_ADDRCONFIG.

> The happy eyeball behaviour would be within the DNS backend, hidden from the
> application behind the getaddrinfo(,AI_ADDRCONFIG) call.

Well, you can say that this behaviour is already present within getaddrinfo(). If called without AI_ADDRCONFIG (or with it on a host that has the required global addresses configured), it will fire off two queries in parallel, wait for replies to come back for both (or for either to time out), sort the result set according to /etc/gai.conf, and then return the results to the caller.

The only thing not "Happy" about this is that the timeout is multi-second. But it has to be, due to the way that DNS works. I think you can say "options timeout:1" in /etc/resolv.conf, but that will make queries that trigger deep recursion randomly fail.
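
As a rough illustration of the "two queries, one deadline" behaviour described above (a model only, not glibc's actual code), the lookups can be written as two tasks that are waited on together. The names `lookup_both`, `query_a` and `query_aaaa` are hypothetical, and putting AAAA results first merely stands in for the /etc/gai.conf sorting:

```python
from concurrent.futures import ThreadPoolExecutor, wait


def lookup_both(name, query_a, query_aaaa, timeout=5.0):
    """Run an A and an AAAA lookup in parallel under one shared deadline.

    query_a and query_aaaa are caller-supplied callables (assumptions of
    this sketch) that take a name and return a list of address strings.
    A lookup that has not answered by the deadline contributes nothing.
    """
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(query_aaaa, name), pool.submit(query_a, name)]
    done, _pending = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on a lookup that timed out
    # Keep submission order (AAAA first) as a stand-in for gai.conf sorting.
    return [addr for fut in futures if fut in done for addr in fut.result()]
```

The point of the model is that the caller sees nothing until both lookups have either answered or timed out, which is why a short timeout trades slow answers for missing ones.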

> > So again, this is not about stopping getaddrinfo() from being used on such
> > hosts, it's just to stop it from emitting useless and potentially harmful IN
> > AAAA DNS queries. Nothing else.
> 
> How do you know they're useless? How are they harmful?

They're useless because DNS cannot communicate the interface scope, and without the interface scope you cannot communicate with the link-local address.
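
To make the missing piece concrete: an AAAA record carries exactly 16 bytes of address, whereas an IPv6 socket address has an extra scope ID field. A minimal sketch in Python, where IPv6 socket addresses are the tuple (host, port, flowinfo, scope_id); the helper names are illustrative only:

```python
import ipaddress


def aaaa_rdata(address):
    """Return the 16-byte payload an AAAA record holds for this address."""
    return ipaddress.IPv6Address(address).packed


def sockaddr_from_aaaa(rdata, port, scope_id=0):
    """Build a Python IPv6 sockaddr tuple from AAAA record data.

    The record supplies only the address; scope_id has to come from
    somewhere else, and 0 means "no interface specified".
    """
    host = str(ipaddress.IPv6Address(rdata))
    return (host, port, 0, scope_id)


def needs_scope_id(address):
    """Link-local addresses are ambiguous without an interface scope."""
    return ipaddress.IPv6Address(address).is_link_local
```

For fe80::15 the record data is 16 bytes, `needs_scope_id` is true, and the resulting sockaddr has scope_id 0 unless the caller supplies an interface index from another source.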

AI_ADDRCONFIG is a heuristic an application can use to request only addresses that are likely to be useful for communication. Link-local addresses found in DNS will simply never be useful in this manner.

Tore

Comment 72 markzzzsmith 2013-04-16 09:01:45 UTC

(In reply to comment #71)
> (In reply to comment #70)
> 
> > There were no qualifiers on your described behaviour. You may have been
> > talking about DNS, but the description of the change of behaviour to
> > AI_ADDRCONFIG did not specify that it was limited to the DNS backend.
> 
> The title of this bug is:
> 
> «getaddrinfo() with AI_ADDRCONFIG doesn't suppress ****AAAA DNS queries****
> on IPv4-only networks»
> 
> (emphasis mine)
> 
> If that's not a crystal clear qualifier, I don't know what is.
> 

You continue to miss the point. *Your* proposal on how to *fix* the problem was to change the behaviour of AI_ADDRCONFIG, regardless of the backend, because you didn't *specify* what backend your solution applied to.

Even then, the title is not actually saying what the problem is. The actual problem is CPE or DNS servers that did not handle IPv6 AAAA queries correctly. 

> > Then the cache is broken. It should be caching all the information that
> > would be returned in the sockaddr structure returned to getaddrinfo(), not
> > just the returned IP addresses.
> 
> glibc speaks to a caching resolver (e.g. dnsmasq) on 127.0.0.1 or ::1, using
> the regular DNS protocol.

It can also speak to a caching resolver directly, as it does with nscd. That is the cache that I thought you were talking about, because it is part of glibc.

> The DNS protocol has no means of communicating an
> interface scope ID. So how exactly would this work?
> 
> > You're asserting that link-locals aren't being stored in DNS. How do you
> > know that? Have you queried all the DNS space in the world?
> 
> No, but I am asserting that storing link-locals in DNS is completely
> pointless,

Well, as you saw, I pointed out a valid and reasonable use for storing link-locals in DNS, so it isn't completely pointless. 

> as it cannot possibly work, because there is no way the DNS
> protocol can communicate a scope ID to the querier.
> 

It doesn't need to; that information can be gleaned from some other source, such as a command line option or a configuration file. In this instance, DNS is still useful, because it provides a much simpler and easier to type name for an IPv6 link-local address, even though the name by itself isn't enough information to use the returned link-local address.

> > There has been no prohibitions on link-local addresses being put in DNS, and
> > now you are creating one, and are asserting that as you've never seen reason
> > to do so, nobody else has either.
> > 
> > Here is a realistic scenario where link-local addresses would usefully be
> > stored in DNS.
> > 
> > An organisation may want to create organisation wide unique static
> > link-local addresses, assigning them to their routers' interfaces. This
> > would make the link-local addresses independent of the MAC addresses of the
> > routers interfaces, and would also make the use of link-local addresses as
> > e.g., static route next hops, simpler and less error prone because there are
> > no intentional duplicates. e.g., their first router's first configured
> > interface would have fe80::1, their e.g. 10th router's first configured
> > interface might be fe80::15, depending on how many interfaces the other
> > routers have.
> > 
> > To document the static link local addresses the following sorts of DNS
> > records are created (using router10, interface eth0 as an example)
> > 
> > eth0.rtr10.example.com. IN AAAA fe80::15
> > eth0.rtr10.example.com. IN TXT "Ethernet 0 on Router 10, MAC addr
> > 02:00:00:00:00:01"
> > 
> > 5.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. 
> > IN  PTR eth0.rtr10.example.com.
> 
> Red herring. Making getaddrinfo() with AI_ADDRCONFIG suppress IN AAAA
> queries have nothing to do with this use of DNS as essentially a
> documentation tool, and would not "prohibit" this in any way.
> 

Yes it would. You'd stop me being able to use any of the standard command line tools that use getaddrinfo() to resolve a name to an address. 

> What you can't do, though, is e.g. "ssh eth0.rtr10.example.com" - regardless
> of presence of IPv4 addresses on the host, IPv6 addresses on the host, and
> whether or not the ssh application uses AI_ADDRCONFIG.
> 

You assume I actually want to ssh to the host. All I might want to do is see the link-local address behind the name. Once ssh starts connecting, I'll abort it via Control C, because I've got the information I wanted. I'd more likely use ping or telnet for this purpose, but ssh and anything else that calls getaddrinfo() would do - I have plenty of choices.

> > The happy eyeball behaviour would be within the DNS backend, hidden from the
> > application behind the getaddrinfo(,AI_ADDRCONFIG) call.
> 
> Well, you can say that this behaviour is already present within
> getaddrinfo(). If called without AI_ADDRCONFIG (or with on a host that has
> the required global addresses configured), it will fire off to queries in
> parallel, wait for replies to come back for both (or either to time out),
> sort the result set according to /etc/gai.conf, and then return the results
> to the caller.
> 

Why can't this behaviour occur all the time, regardless of whether AI_ADDRCONFIG is set or not? 

> The only thing not "Happy" about this is that the timeout is multi-second.
> But it has to be, due to the way that DNS work. I think you can say "options
> timeout:1" in /etc/resolv.conf, but that will make queries that trigger deep
> recursion to randomly fail.
> 
> > > So again, this is not about stopping getaddrinfo() from being used on such
> > > hosts, it's just to stop it from emitting useless and potentially harmful IN
> > > AAAA DNS queries. Nothing else.
> > 
> > How do you know they're useless? How are they harmful?
> 
> They're useless because DNS cannot communicate the interface scope, and
> without the interface scope you cannot communicate with the link-local
> address.
>

As demonstrated, sometimes you don't need the interface scope, for the DNS information still to be useful.
 
> AI_ADDRCONFIG is a heuristic an application can use to request only
> addresses that it is likely useful for communication. Link-local addresses
> found in DNS will simply never be useful in this manner.
> 


You seem to believe your solution is both the only one and the best one. Yet it changes the external behaviour of the AI_ADDRCONFIG flag. As you don't consider the drawbacks I've pointed out to be valid based on your point of view and experience, you think it is fine to burden everybody else with them because they apparently won't affect you, even though you most likely can't accurately predict the future.

If there is a solution which doesn't change the external behaviour of AI_ADDRCONFIG *at all*, significantly improves the situation for those who are affected by the actual problem - broken resolvers in CPE -, and doesn't have any effect on anybody who has working resolvers, that is a better solution than yours. In my opinion, the Happy Eyeballs approach, applied to the DNS backend, is the better solution. The Happy Eyeballs model has worked extremely effectively in Firefox, Chrome and Safari. There is no reason it can't be applied to DNS.

Mark.

Comment 73 Tore Anderson 2013-04-16 10:11:21 UTC

(In reply to comment #72)

> You continue to miss the point. *Your* proposal on how to *fix* the problem
> was to change the behaviour of AI_ADDRCONFIG, regardless of the backend,
> because you didn't *specify* what backend your solution applied to.

My very first response to you in this thread, in comment #65, began like this:

«That's not what this bug report is about. It's about suppressing DNS "IN AAAA" queries [...]»

So how you can claim that I am not clear about talking specifically about DNS is beyond me, to be honest. It should in any case be clear by now, I hope.

> Even then, the title is not actually saying what the problem is. The actual
> problem is CPE or DNS servers that did not handle IPv6 AAAA queries
> correctly.

Such CPEs and DNS servers are buggy, true. However, it is in our users' best interest to help them avoid tickling these bugs, because it leads to crappy user experiences and bug reports with a huge number of subscribers:

https://bugzilla.redhat.com/show_bug.cgi?id=459756
https://bugs.launchpad.net/ubuntu/+source/eglibc/+bug/417757

It sucks extra, because this is perceived to be a Linux-specific problem. MS Windows and Apple Mac OS X do interpret the AI_ADDRCONFIG flag in the proposed way (i.e., they will suppress IN AAAA queries if the host only has link-local addresses configured). (I haven't verified that this behaviour is still in place in the latest versions of those operating systems, though.)

> > glibc speaks to a caching resolver (e.g. dnsmasq) on 127.0.0.1 or ::1, using
> > the regular DNS protocol.
> 
> It can also speak to a caching resolver directly, as it does with nscd. That
> is the cache that I though you were talking about, because it is part of
> glibc.

NSCD is *not* a resolver. NSCD knows nothing of AAAA queries or the DNS protocol at all. The only thing NSCD can do is to cache results that came from NSS backends, such as - you guessed it - DNS.

> Well, as you saw, I pointed out a valid and reasonable use for storing
> link-locals in DNS, so it isn't completely pointless. 

This is still a red herring. If you use DNS as a documentation tool like you've outlined, there's no reason why you'd use AI_ADDRCONFIG when extracting the records. Otherwise the "documentation" would look different when read on a computer with no IPv6 addresses (not even link-locals), or on a Mac/Windows computer with IPv6 link-locals, than it would when read on a machine with global IPv6 (or a Linux machine with IPv6 link-locals). I find it far-fetched that anyone would use getaddrinfo() for "reading" such "DNS documentation" to begin with, as you cannot retrieve your TXT records with it, for example.

> > as it cannot possibly work, because there is no way the DNS
> > protocol can communicate a scope ID to the querier.
> 
> It doesn't need to, that information can be gleaned from some other source,
> such as a command line option, or a configuration file.

FWIW, it simply doesn't work in a fully updated Fedora 18:

tore@wrath:~$ host ll.fud.no
ll.fud.no has IPv6 address fe80::230:1bff:febc:7f23
tore@wrath:~$ ssh ll.fud.no%eth0
ssh: Could not resolve hostname ll.fud.no%eth0: Name or service not known
tore@wrath:~$ ssh fe80::230:1bff:febc:7f23%eth0
The authenticity of host 'fe80::230:1bff:febc:7f23%eth0 (fe80::230:1bff:febc:7f23%eth0)' can't be established.
[...]

> > Red herring. Making getaddrinfo() with AI_ADDRCONFIG suppress IN AAAA
> > queries have nothing to do with this use of DNS as essentially a
> > documentation tool, and would not "prohibit" this in any way.
> 
> Yes it would. You'd stop me being able to use any of the standard command
> line tools that use getaddrinfo() to resolve a name to an address.

The standard CLI frontend for getaddrinfo(), "getent ahosts", *doesn't* use AI_ADDRCONFIG, for a very good reason.

AI_ADDRCONFIG is counter-productive if your ultimate goal is to learn what address records are in DNS, because it would arbitrarily hide records from you depending on the machine you're running it on - IN AAAA records would be hidden on IPv4-only machines, and IN A records would be hidden on IPv6-only machines.
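
The hiding effect can be stated as a tiny decision table. This is a simplified model of the flag's intent, not any libc's implementation; the function name is hypothetical, and what counts as "having" a family (any address, or only non-link-local ones) is left to the caller because that is exactly what this bug is about:

```python
def record_types_to_query(has_ipv4, has_ipv6, addrconfig):
    """Return the DNS record types a lookup would ask for.

    Without AI_ADDRCONFIG both families are always queried. With it,
    a family is queried only if the host is considered to have an
    address of that family. What a real resolver does when neither
    family is configured is not modelled here.
    """
    if not addrconfig:
        return ["A", "AAAA"]
    types = []
    if has_ipv4:
        types.append("A")
    if has_ipv6:
        types.append("AAAA")
    return types
```

With the flag set, an IPv4-only host asks only for A and an IPv6-only host only for AAAA, so the same name "looks" different depending on where the lookup runs.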

AI_ADDRCONFIG is only useful if the main goal isn't to dump the addresses, but to actually get an address that in turn will be used as a destination for communication. As in, when getaddrinfo() is just a mandatory step towards the ultimate goal of making a connect() somewhere. This is when you'd use AI_ADDRCONFIG, and this is when the link-locals in DNS have no use.

> You assume I actually want to ssh to the host. All I might want to do is see
> the link-local address behind the name. Once ssh starts connecting, I'll
> abort it via Control C, because I've got the information I wanted.

So I take it this is the information you wanted?

tore@wrath:~$ host ll.fud.no
ll.fud.no has IPv6 address fe80::230:1bff:febc:7f23
tore@wrath:~$ ssh ll.fud.no
ssh: connect to host ll.fud.no port 22: Invalid argument
tore@wrath:~$ ssh ll.fud.no%wlan0
ssh: Could not resolve hostname ll.fud.no%wlan0: Name or service not known

> > Well, you can say that this behaviour is already present within
> > getaddrinfo(). If called without AI_ADDRCONFIG (or with on a host that has
> > the required global addresses configured), it will fire off to queries in
> > parallel, wait for replies to come back for both (or either to time out),
> > sort the result set according to /etc/gai.conf, and then return the results
> > to the caller.
> 
> Why can't this behaviour occur all the time, regardless of whether
> AI_ADDRCONFIG is set or not? 

It *does* occur all the time! This is how getaddrinfo() fundamentally works.

> You seem to believe your solution is both the only one and the best one. Yet
> it changes the external behaviour of the AI_ADDRCONFIG flag. As you don't
> consider the drawbacks I've pointed out to be valid based on your point of
> view and experience, you think it is fine to burden everybody else with them
> because they   apparently won't effect you, even though you most likely
> can't accurately predict the future.

To be completely honest with you, I don't think that you (like me) have seen an actual real-world use case where you 1) put link-locals in DNS IN AAAA records, and 2) need to resolve those using getaddrinfo()+AI_ADDRCONFIG.

Your example use case of making link-locals unique within an organisation and stuffing them in DNS for documentation purposes seems to me to be made up on the spot, solely to support your position (not to mention ludicrous - I'd *really* like to see you post it to a mailing list like NANOG, ipv6-ops, or IETF v6ops and see what kind of feedback you'd get). This goes especially when you're describing how you'd use SSH to look up the addresses, which simply doesn't seem to work.

I would personally prefer to discuss real issues. Do you have any of those, by any chance? If so, I'd like to echo Pavel's suggestion that you contribute to the feature page at https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG .

> If there is a solution which doesn't change the external behaviour of
> AI_ADDRCONFIG *at all*, significantly improves the situation for those who
> are effected by the actual problem - broken resolvers in CPE -, and doesn't
> have any effect to anybody who has working resolvers, that is a better
> solution than yours. In my opinion, the Happy Eyeballs approach, applied to
> the DNS backend, is the better solution. The Happy Eyeballs model has worked
> extremely effectively in Firefox, Chrome and Safari. There is no reason it
> can't be applied to DNS.

Happy Eyeballs is orthogonal to this particular issue; you can implement it fine without AI_ADDRCONFIG, and as I've pointed out earlier, getaddrinfo() already implements "HE" by issuing queries in parallel and timing them out independently.

BTW: You could also say that MS Windows' and Mac OS X's interpretation of AI_ADDRCONFIG has "worked extremely effectively", as they avoid ruining the experience of the users that happen to have a broken CPE. Linux users aren't so lucky, cf. the two bugs I linked to above.

Tore

Comment 74 Pavel Šimerda (pavlix) 2013-04-16 13:42:35 UTC

(In reply to comment #73)
> My very first response to you in this thread, in comment #65, began like
> this:
> 
> «That's not what this bug report is about. It's about suppressing DNS "IN
> AAAA" queries [...]»
> 
> So how you can claim that I am not clear about talking specifically about
> DNS is beyond me, to be honest. It should in any case be clear by now, I
> hope.

I think it is crystal clear and if anyone wants to have a good summary, there's still:

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG

> > Even then, the title is not actually saying what the problem is. The actual
> > problem is CPE or DNS servers that did not handle IPv6 AAAA queries
> > correctly.
> 
> Such CPEs and DNS servers are buggy, true. However, it is in our users' best
> interest to help them avoid tickling these bugs, because it leads to crappy
> user experiences and bug reports with a huge number of subscribers:

+1

> It's sucks extra, because this is perceived to be a Linux-specific problem.
> MS Windows and Apple Max OS X does interpret the AI_ADDRCONFIG flag in the
> proposed way (i.e., it will suppress IN AAAA queries if the host only has
> link-local addresses configured.

Please keep being specific whether you're talking about DNS or generally. I don't think the Apple folks would break their link-local name resolution deliberately.

> NSCD is *not* a resolver.

+1

Using NSCD with network name resolution and AI_ADDRCONFIG sounds dangerous to me.

> > Well, as you saw, I pointed out a valid and reasonable use for storing
> > link-locals in DNS, so it isn't completely pointless. 

This is irrelevant to the problem in this bug report, as NSS backends currently don't convey scope_id at all.

With that in mind, I think we should stop polluting this bug report with link-local in DNS as it's irrelevant in the current situation. Start a new bug report and link it from here, if you're still interested, and describe your use case there.

> The standard CLI frontend for getaddrinfo(), "getent ahosts", *doesn't* use
> AI_ADDRCONFIG, for a very good reason.

+1

According to my own micro-research, AI_ADDRCONFIG is only good for one specific purpose, which is a loop over getaddrinfo results with connect() in each step.

https://fedoraproject.org/wiki/Networking/NameResolution#Connecting_to_services_using_getaddrinfo.28.29
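
A minimal sketch of that loop, written in Python for brevity (the C version has the same shape). The function name `connect_first` and the injectable `resolve` and `make_socket` parameters are assumptions of this sketch, added only so the loop can be exercised without a network; this is not code taken from the wiki page:

```python
import socket


def connect_first(host, port, flags=socket.AI_ADDRCONFIG,
                  resolve=socket.getaddrinfo, make_socket=socket.socket):
    """Try each address from getaddrinfo() in order; return the first
    socket that connects, or raise the last connection error."""
    last_error = None
    results = resolve(host, port, socket.AF_UNSPEC,
                      socket.SOCK_STREAM, 0, flags)
    for family, socktype, proto, _canonname, sockaddr in results:
        sock = make_socket(family, socktype, proto)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # This address did not work; remember why and try the next.
            last_error = exc
            sock.close()
    if last_error is None:
        raise OSError("no addresses returned for %r" % (host,))
    raise last_error
```

Here the flag only trims the candidate list before the loop starts; the loop itself would still cope with unreachable addresses by moving on to the next one.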
 
> AI_ADDRCONFIG is counter-productive if your ultimate goal is to learn what
> address records are in DNS, because it would arbitrarily hide records from
> you depending on the machine you're running it on - IN AAAA records would be
> hidden on IPv4-only machines, and IN A records would be hidden on IPv6-only
> machines.

+1

> AI_ADDRCONFIG is only useful if the main goal isn't to dump the addresses,
> but to actually get an address that in turn will be used as a destination
> for communication with. As in, when getaddrinfo() is just a mandatory step
> towards the ultimate goal making a connect() somewhere.

Exactly. Any other discussions than those related to getaddrinfo()+connect() should be kept off this bug report.

> host tore@wrath:~$ host ll.fud.no
> ll.fud.no has IPv6 address fe80::230:1bff:febc:7f23
> tore@wrath:~$ ssh ll.fud.no
> ssh: connect to host ll.fud.no port 22: Invalid argument
> tore@wrath:~$ ssh ll.fud.no%wlan0
> ssh: Could not resolve hostname ll.fud.no%wlan0: Name or service not known

AFAIK this is not even the standard notation. The percent sign is IMO only used with literal addresses.

> I would personally prefer to discuss real issues.

+1

Real issues and in their actual context.

> Happy Eyeballs are orthogonal to this particular issue, you can implement it
> fine without AI_ADDRCONFIG, and as I've pointed out earlier, getaddrinfo()
> already implements "HE" by issuing queries in parallel and timing them out
> independently.

+1

I kindly ask to move any further discussions about link-local in DNS to a separate resource (bug report, wiki page, mailing list, whatever). It doesn't belong here and it isn't affected by the solution of this problem, as the current getaddrinfo NSS backends don't work with that anyway.

Of course if you have anything for the discussion of *global* addresses in *DNS*, feel free to contribute here. The *link-local* case in *Multicast DNS* was only discussed because of a flawed patch. We know about the non-DNS issues, we are thinking about them, talking about them and we have already documented them publicly:

https://fedoraproject.org/wiki/Networking/NameResolution

https://fedoraproject.org/wiki/Networking/NameResolution/ADDRCONFIG

If you have anything to add regarding AI_ADDRCONFIG, that is not yet described at the above wiki page, please let me know off bugzilla. I will be happy to summarize your information and/or use cases there. Or you can add a new section yourself, if you wish so.

Comment 75 Fedora End Of Life 2015-01-09 21:39:41 UTC

This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 76 Oliver Henshaw 2015-01-14 14:49:35 UTC

"options single-request-reopen" in /etc/resolv.conf seems to be an effective workaround for a broken DNS resolver, even for applications that don't use AI_ADDRCONFIG.

Maybe this behaviour could be chosen unconditionally on nodes where the only IPv6 addresses are link-local? Or is this best discussed in another bug?

Comment 77 Tom Horsley 2015-01-14 15:15:17 UTC

(In reply to Phil Oester from comment #9)
> But the question remains, WHY did the behavior change?  Originally, glibc
> DID use unique ports for the AAAA and A queries.  From a "predictability"
> perspective, that is a more secure approach, no?  Similar to how ISNs are
> now randomized in TCP.  
> 
> It seems many people's problems would be solved by going back to the
> (arguably more secure) method of using distinct ports for the A and AAAA
> queries.

Since Ulrich is no longer around to defend to the death indefensible decisions, maybe it is time to just go ahead and put back the separate ports, the elimination of which caused all the problems in the first place.

Comment 78 Carlos O'Donell 2015-01-14 15:31:30 UTC

(In reply to Tom Horsley from comment #77)
> (In reply to Phil Oester from comment #9)
> > But the question remains, WHY did the behavior change?  Originally, glibc
> > DID use unique ports for the AAAA and A queries.  From a "predictability"
> > perspective, that is a more secure approach, no?  Similar to how ISNs are
> > now randomized in TCP.  
> > 
> > It seems many people's problems would be solved by going back to the
> > (arguably more secure) method of using distinct ports for the A and AAAA
> > queries.
> 
> Since Ulrich is no longer around to defend to the death indefensible
> decisions, maybe it is time to just go ahead and put back the separate
> ports, the elimination of which caused all the problems in the first place.

The glibc community is consensus driven. Someone needs to write up a plan and drive it forward. The glibc team can do this, but this particular issue is lower on the overall priority list for stub resolver fixes. Principally we have no way to test this easily, so we're trying to build out our testing infrastructure to get coverage. In the past this was all tested by hand, and we can see how badly that turned out.

Comment 79 Oliver Henshaw 2015-01-15 18:57:45 UTC

Testing on the F21 live image, I don't have a problem.

Probably this is https://sourceware.org/git/?p=glibc.git;a=commit;h=16b293a7a6f65d8ff348a603d19e8fd4372fa3a9 in glibc 2.20. I wonder if this resolves all broken DNS resolver issues - is there anyone on F21 who still has problems with bad routers and AAAA DNS queries in getaddrinfo()?

Comment 80 Jan Kurik 2015-07-15 15:21:40 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 81 Fedora End Of Life 2016-11-24 10:23:51 UTC

This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 82 Oliver Henshaw 2016-11-24 11:50:24 UTC

Seems like no-one has reported a problem since the F21 release.

Comment 83 Florian Weimer 2016-11-24 11:55:05 UTC

(In reply to Oliver Henshaw from comment #79)
> Testing on the F21 live image, I don't have a problem.
> 
> Probsbly this is
> https://sourceware.org/git/?p=glibc.git;a=commit;
> h=16b293a7a6f65d8ff348a603d19e8fd4372fa3a9 in glibc 2.20. I wonder if this
> resolves all broken DNS resolver issues - is there anyone on F21 who still
> has problems with bad routers and AAAA DNS queries in getaddrinfo()?

Agreed.  We have not seen further reports of the issue, so closing this bug.

Comment 84 cornel panceac 2016-11-24 11:59:08 UTC

However, I have sometimes seen a web page take many seconds to load, but since I've not investigated the problem, the root cause may be completely different.