Bug 1252609 - Order of CNAME and A record in reply matter for successful
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glibc
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Carlos O'Donell
QA Contact: qe-baseos-tools
Reported: 2015-08-11 17:36 EDT by Paul Wouters
Modified: 2016-11-24 07:21 EST
Doc Type: Bug Fix
Last Closed: 2015-08-17 14:48:54 EDT
Type: Bug

Description Paul Wouters 2015-08-11 17:36:08 EDT
Description of problem:

glibc assumes that a CNAME pointing to an A record always appears before that A record in a DNS reply. The spec does not guarantee this ordering (although BIND does order replies this way, which makes the failure less common).

See:

 https://www.ietf.org/mail-archive/web/ietf/current/msg94277.html

 https://github.com/skynetservices/skydns/issues/217

(prob should be cloned to rhel6 too)
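
To make the reported behaviour concrete, here is a minimal Python sketch of an order-sensitive, single-pass answer parser. It is a simplified model and not glibc's actual resolver code: the `Record` tuple layout, the function name `resolve_single_pass`, and the example names and addresses are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (owner name, record type, data)
Record = Tuple[str, str, str]


def resolve_single_pass(qname: str, answers: List[Record]) -> List[str]:
    """Collect A-record addresses in one forward pass over the answer section.

    A record is accepted only if its owner matches the name expected at that
    point, so a CNAME must be seen before the A record it leads to.
    """
    expected = qname.lower()
    addresses = []
    for owner, rtype, data in answers:
        if owner.lower() != expected:
            continue  # owner does not match the name expected right now
        if rtype == "CNAME":
            expected = data.lower()  # follow the alias to its target
        elif rtype == "A":
            addresses.append(data)
    return addresses


# Same two records, in the two possible orders (example data only).
cname_first = [
    ("www.example.com", "CNAME", "host.example.com"),
    ("host.example.com", "A", "192.0.2.10"),
]
a_first = list(reversed(cname_first))

print(resolve_single_pass("www.example.com", cname_first))
print(resolve_single_pass("www.example.com", a_first))
```

In this model the (CNAME,A) order yields the address, while the (A,CNAME) order yields an empty list even though the reply carries the same data, which is the kind of lookup failure described above.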
Comment 2 Carlos O'Donell 2015-08-17 14:48:54 EDT
(In reply to Paul Wouters from comment #0)
> Description of problem:
> 
> glibc assumes that CNAME's to A records always appear before the A record in
> a DNS reply. This is not guaranteed according to the spec (although bind
> software does do this, making this error not quite as common)
> 
> See:
> 
>  https://www.ietf.org/mail-archive/web/ietf/current/msg94277.html
> 
>  https://github.com/skynetservices/skydns/issues/217
> 
> (prob should be cloned to rhel6 too)

It isn't entirely clear what should happen. The stub resolver in glibc has an expectation that *could* be changed, but changing it would negatively impact the performance of all clients. That negative impact could be avoided by keeping the (CNAME,A) ordering and having the recursive resolver sort the results. In truth, the IETF discussion appears to border on changing the standard to give guidance to recursive resolvers and require (CNAME,A) ordering.

In the case of SkyDNS it seems clear from the upstream ticket that SkyDNS will be changed to comply with (CNAME,A) ordering to allow simple stub resolvers to exist without requiring more complicated implementations.

Therefore I'm marking this as CLOSED/WONTFIX until there is more consensus upstream and in the IETF standard to make an implementation decision in glibc.
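
As a rough illustration of the "recursive resolver sorts the results" option mentioned in this comment, the following Python sketch reorders an answer section so that the CNAME chain starting at the query name comes first. It is an assumption-laden model, not code from any real resolver: the tuple layout and the name `sort_cname_chain_first` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical record layout: (owner name, record type, data)
Record = Tuple[str, str, str]


def sort_cname_chain_first(qname: str, answers: List[Record]) -> List[Record]:
    """Return the answers with the CNAME chain from qname placed first.

    Records that are not part of the chain keep their relative order.
    """
    cnames = {rec[0].lower(): rec for rec in answers if rec[1] == "CNAME"}
    chain = []
    visited = set()
    name = qname.lower()
    # Walk the alias chain; the visited set guards against CNAME loops.
    while name in cnames and name not in visited:
        visited.add(name)
        rec = cnames[name]
        chain.append(rec)
        name = rec[2].lower()
    rest = [rec for rec in answers if rec not in chain]
    return chain + rest


shuffled = [
    ("host.example.com", "A", "192.0.2.10"),
    ("alias.example.com", "CNAME", "host.example.com"),
    ("www.example.com", "CNAME", "alias.example.com"),
]
for record in sort_cname_chain_first("www.example.com", shuffled):
    print(record)
```

With this kind of sorting done once on the server side, a simple stub that reads the answer section front to back would see each alias before the record it points to.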
Comment 3 Paul Wouters 2015-08-18 09:42:51 EDT
DNS is pretty latency driven on the stub. I find arguments about performance pretty unconvincing.

Some discussion in IETF is still happening, but I don't expect an outcome in the form of an updated RFC or errata. If it does, I will update this bug.

I suspect that glibc will continue to see rare but hard to diagnose DNS failures without a fix.
Comment 4 Carlos O'Donell 2015-08-24 23:22:57 EDT
(In reply to Paul Wouters from comment #3)
> DNS is pretty latency driven on the stub. I find arguments about performance
> pretty unconvincing.

At some point, on a heavily loaded DNS server, supporting the alternate ordering will make a difference in performance.
 
> Some discussion in IETF is still happening, but I don't expect an outcome in
> the form of an updated RFC or errata. If it does, I will update this bug.

If there is no updated RFC or errata, then the process has failed to achieve consensus, which would be disappointing. In this case the standard is underspecified and interoperability is up to the implementor. You could go directly upstream in this case with a patch and performance data showing that supporting both orderings does not unduly increase CPU usage (which costs money now).
 
> I suspect that glibc will continue to see rare but hard to diagnose DNS
> failures without a fix.

How is this different from where we are today? We could still have hard to diagnose DNS failures for other underspecified parts of the standard?
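
The trade-off discussed in this comment can be sketched in Python as an order-tolerant parser next to the order-sensitive one, with a small timing helper. Everything here is hypothetical (function names, tuple layout, example data), and Python timings say nothing about the cost in glibc's C code; the sketch only shows the shape of the comparison that a patch with performance data would need to make.

```python
import timeit


def resolve_single_pass(qname, answers):
    """Order-sensitive model: one pass, CNAME must precede its target."""
    expected = qname.lower()
    addresses = []
    for owner, rtype, data in answers:
        if owner.lower() != expected:
            continue
        if rtype == "CNAME":
            expected = data.lower()
        elif rtype == "A":
            addresses.append(data)
    return addresses


def resolve_any_order(qname, answers):
    """Order-tolerant model: index all records, then follow the alias chain."""
    aliases = {}
    addresses = {}
    for owner, rtype, data in answers:
        key = owner.lower()
        if rtype == "CNAME":
            aliases[key] = data.lower()
        elif rtype == "A":
            addresses.setdefault(key, []).append(data)
    name = qname.lower()
    visited = set()
    # The visited set stops the walk if the aliases form a loop.
    while name in aliases and name not in visited:
        visited.add(name)
        name = aliases[name]
    return addresses.get(name, [])


def time_resolver(resolver, qname, answers, repeats=10000):
    """Return the total seconds taken by `repeats` calls to the resolver."""
    return timeit.timeit(lambda: resolver(qname, answers), number=repeats)


cname_first = [
    ("www.example.com", "CNAME", "host.example.com"),
    ("host.example.com", "A", "192.0.2.10"),
]
a_first = list(reversed(cname_first))

for label, resolver in [("single pass", resolve_single_pass),
                        ("any order", resolve_any_order)]:
    seconds = time_resolver(resolver, "www.example.com", cname_first)
    print(f"{label}: {seconds:.4f}s for 10000 lookups")
```

The order-tolerant version returns the same address for both orderings at the cost of building two small dictionaries per reply; whether that cost matters in practice is exactly the point in dispute here.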
Comment 5 Paul Wouters 2015-08-24 23:48:55 EDT
(In reply to Carlos O'Donell from comment #4)

> At some point, on a heavily loaded DNS server, supporting the alternate
> ordering will make a difference in performance.

A DNS server's load does not come from getaddrinfo() or gethostbyname() as it is using its own DNS server code for all DNS processing - it has nothing to do with glibc.

If you mean a server that is loaded in general, then if it falls over based on this one (or a few or a few hundred) calls to getaddrinfo(), it is bound to fall over soon over something else, like a cronjob with "echo MARK" starting :P

> > Some discussion in IETF is still happening, but I don't expect an outcome in
> > the form of an updated RFC or errata. If it does, I will update this bug.
> 
> If there is no updated RFC or errata, then the process has failed to achieve
> consensus, which would be disappointing.

The consensus might still be "do nothing because the majority of implementations do it right already". That doesn't make glibc right :P

> In this case the standard is
> underspecified and interoperability up to the implementer.

Except that it has been pointed out that this is a real-life implementation interop failure. I'm fine with you or upstream saying "this is a bug we should fix, but it is a very low priority and we might not get to it". But saying "this is a feature because it might use more CPU" is rather silly.

> You could go
> directly upstream in this case with a patch and performance data showing
> that you don't unduly increase CPU usage (which costs money now) for
> supporting both orderings.

Well, glibc is outside my area of expertise. I have my own packages, upstream projects, and bugs that I can work on much more efficiently, so that other people who are not familiar with my code base can spend their time on things they are more skilled at :P

> > I suspect that glibc will continue to see rare but hard to diagnose DNS
> > failures without a fix.
> 
> How is this different from where we are today? We could still have hard to
> diagnose DNS failures for other underspecified parts of the standard?

It's not different from today, other than that we now know there is an interoperability issue between glibc and other implementations that some, but not all, consider a bug. It's not worse, but it is also not improving.
Comment 6 Carlos O'Donell 2015-08-25 00:19:28 EDT
(In reply to Paul Wouters from comment #5)
> (In reply to Carlos O'Donell from comment #4)
> 
> > At some point, on a heavily loaded DNS server, supporting the alternate
> > ordering will make a difference in performance.
> 
> A DNS server's load does not come from getaddrinfo() or gethostbyname() as
> it is using its own DNS server code for all DNS processing - it has nothing
> to do with glibc.
> 
> If you mean a server that is loaded in general, then if it falls over based
> on this one (or a few or a few hundred) calls to getaddrinfo() , it is bound
> to fall over soon over something else, like a cronjob with "echo MARK"
> starting :P

My apologies, I meant to say "DNS *using* server."

The point is that getaddrinfo calls add up across all the applications that use them.

> > > Some discussion in IETF is still happening, but I don't expect an outcome in
> > > the form of an updated RFC or errata. If it does, I will update this bug.
> > 
> > If there is no updated RFC or errata, then the process has failed to achieve
> > consensus, which would be disappointing.
> 
> The consensus might still be "do nothing because the majority of
> implementations do it right already". That doesn't make glibc right :P

I was careful not to use the word "right" in my response. There is no moral judgement going on here, simply that the implementors have chosen a particular implementation-defined detail (though the standard doesn't call it that). That choice may not be the most robust, but it is a choice based on project design goals, e.g. performance. Just as, say, ld.so assumes all ELF files are laid out as it expects and performs little to no checking of the file before execution, regardless of all of the undefined parts of the ELF standard that would need checking for robust execution.

> > In this case the standard is
> > underspecified and interoperability up to the implementer.
> 
> Except that it has been pointed out that this is a real life implementation
> interop failure. I'm fine with you or upstream saying "this is a bug we
> should fix, but it is a very low priority and we might not get to it". But
> saying "this is feature because it might use more CPU" is rather silly.

That is indeed what I am saying. It costs less CPU to assume (CNAME,A), and because every other DNS server returns that response there is certainly an interop issue with servers that don't. Until there is a more concrete response from the standards body I will not support a change in glibc. I do not prioritize robustness over design goals.

Vis-à-vis: https://sourceware.org/ml/libc-alpha/2015-07/msg00555.html

> > You could go
> > directly upstream in this case with a patch and performance data showing
> > that you don't unduly increase CPU usage (which costs money now) for
> > supporting both orderings.
> 
> well, glibc is out of the area of my expertise. I have my own packages and
> upstream projects and bugs that I can work on much more efficiently so that
> other people who are not familiar with my code base can use their time
> better on other things they are more skilled at :P

I appreciate all of the work you do, and I hope the above comment did not come across as a slight against you or that work. All I am saying is that I have given you my opinion, but you need not stop there if you feel strongly about the issue.

> > > I suspect that glibc will continue to see rare but hard to diagnose DNS
> > > failures without a fix.
> > 
> > How is this different from where we are today? We could still have hard to
> > diagnose DNS failures for other underspecified parts of the standard?
> 
> It's not different from today other than that we now know there is a bug in
> glibc with interoperating with other implementations that some but not all
> consider a bug. It's not worse, but it is also not improving.

Correct, it is not improving this particular use case. To generalize from that and say glibc is not improving elsewhere would be untrue.

We should probably consider writing tooling to help admins detect this invalid case, perhaps a Wireshark plugin?
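
A minimal sketch of what such detection tooling might check, written in Python rather than as an actual Wireshark plugin. It assumes the answer section has already been decoded into hypothetical `(owner, type, data)` tuples; the function name `find_misordered_records` is invented for this example.

```python
from typing import List, Tuple

# Hypothetical record layout: (owner name, record type, data)
Record = Tuple[str, str, str]


def find_misordered_records(answers: List[Record]) -> List[int]:
    """Return indices of records that appear before the CNAME pointing at them.

    A record counts as misordered when its owner name is the target of a
    CNAME that occurs later in the same answer section.
    """
    # Position of the first CNAME that targets each name.
    cname_position = {}
    for index, (owner, rtype, data) in enumerate(answers):
        if rtype == "CNAME":
            cname_position.setdefault(data.lower(), index)

    misordered = []
    for index, (owner, rtype, data) in enumerate(answers):
        position = cname_position.get(owner.lower())
        if position is not None and position > index:
            misordered.append(index)
    return misordered


reply = [
    ("host.example.com", "A", "192.0.2.10"),
    ("www.example.com", "CNAME", "host.example.com"),
]
for index in find_misordered_records(reply):
    print("record", index, "precedes the CNAME that leads to it:", reply[index])
```

An empty result means every alias precedes the record it points to, so a reply that passes this check should be safe for an order-sensitive stub.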
Comment 7 Paul Wouters 2015-08-25 09:23:52 EDT
Looks like this might be an instance of that problem: https://bugzilla.redhat.com/show_bug.cgi?id=1217710
