Bug 436254

Summary: Updated bind falls in love with ip6
Product: [Fedora] Fedora
Reporter: Ray Todd Stevens <raytodd>
Component: bind
Assignee: Adam Tkac <atkac>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: low
Version: 9
CC: ovasik
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-05-28 13:33:06 UTC
Attachments:
Description                   Flags
my /etc/sysconfig/named file  none
my named.conf                 none

Description Ray Todd Stevens 2008-03-06 05:06:36 UTC
Description of problem:

This may be related to bug 432615 which has been closed.

Several of my name servers have started insisting on doing ip6 even though the
configuration files have ip6 disabled and preferred-glue set to a.

Still the name server will only request AAAA records for the name servers.  
Since we don't do ip6 and our upstream provider doesn't do ip6 this is a big
problem.   We get no name servers found and a time out error.


Comment 1 Adam Tkac 2008-03-10 12:37:11 UTC
I'm not sure if I understand correctly. You mean that named doesn't ask for A
glues and asks only for AAAA glues? Would it be possible to attach named.conf and
the named system log, please?

Comment 2 Ray Todd Stevens 2008-03-10 16:59:29 UTC
It sounds like you are understanding the issue exactly correctly.

This is an intermittent problem. I find it interesting that it is even
attempting AAAAs, as I am using the -4 switch in the /etc/sysconfig/named file.
It seems to be triggered by a name server for a search domain (in resolv.conf)
disappearing for a few minutes. But that is only a guess. Once the server is
back it still only tries to find it using AAAA requests. Since none of
these servers have ip6 addresses, this means that the only way to fix the problem
is to reset the name server. HUP and init.d named restart both work to fix the
problem if the offending name server is back up.



Don't have it doing it right now.   I will try and find an old log, but these
are pretty big logs.



Comment 3 Ray Todd Stevens 2008-03-10 17:09:36 UTC
Created attachment 297463 [details]
my /etc/sysconfig/named file

With this file it would seem to me that named should never be trying AAAA at
all, rather than switching to trying only them.

Comment 4 Ray Todd Stevens 2008-03-10 17:10:37 UTC
Created attachment 297464 [details]
my named.conf

As I understand it, this configuration should also make named try harder on A
records.

Comment 5 Ray Todd Stevens 2008-03-10 17:11:19 UTC
Now you can see kind of my confusion and why I filed this report.

Comment 6 Adam Tkac 2008-03-10 19:16:40 UTC
The -4 option doesn't tell named not to ask for AAAA records. You can ask for AAAA
records through an IPv4 interface and for A records through an IPv6 interface. The -4
option tells named to listen only on IPv4 interfaces. Also I'm not sure how you
found that only AAAA glues are requested. You can try "rndc dumpdb" and look
into %{chroot}/var/named/data/cache_dump.db to check that both glues are requested and
cached. It doesn't look like a problem to me, does it?
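
A minimal sketch of the distinction made in the comment above, in Python (not part of the original thread): the record type in a DNS question (A or AAAA) is independent of the address family of the socket that carries it. The helper names `build_query` and `qtype_of` are hypothetical, and the query is sent only to a local loopback socket standing in for a name server, not to a real one.

```python
import socket
import struct

QTYPE_A = 1      # A record type code (RFC 1035)
QTYPE_AAAA = 28  # AAAA record type code (RFC 3596)


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message (hypothetical helper)."""
    # Header: ID, flags with RD set, QDCOUNT=1, other counts zero.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero-length label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def qtype_of(message: bytes) -> int:
    """Return the QTYPE of the first question in a DNS message."""
    pos = 12  # skip the fixed 12-byte header
    while message[pos] != 0:
        pos += message[pos] + 1
    (qtype,) = struct.unpack_from("!H", message, pos + 1)
    return qtype


# Stand-in for a name server: an IPv4-only UDP socket on loopback.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)

# Send an AAAA question over an IPv4 (AF_INET) socket.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(build_query("example.com", QTYPE_AAAA), receiver.getsockname())
received, _ = receiver.recvfrom(512)

sender.close()
receiver.close()

received_qtype = qtype_of(received)
print("transport: IPv4, question type:", received_qtype)
```

Seeing AAAA questions in a capture on an IPv4-only network is therefore expected, and by itself does not show that A questions are missing.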

Comment 7 Ray Todd Stevens 2008-03-10 20:15:38 UTC
Good point on what is being asked for.   Had not thought of that.

On the other hand it should actually be asking for A records, and is not. We
have a Linux (fc8) router and ran Wireshark on it. According to it, the only
outgoing name server packets were asking for AAAA records.

Now part of the problem is that none of them were answered.   Which is a whole
separate issue.

That is where the problem is.

Comment 8 Adam Tkac 2008-03-12 14:03:05 UTC
I tried to reproduce the problem with bind-9.5.0-29.b2.fc9 (i386 platform) with your
configuration and I wasn't successful. Would it be possible to try this, please?

$ named -4 -g -u named -d 99 > log 2>&1

then in a different terminal run

$ dig @127.0.0.1 <some_name>

then simply terminate named and attach the produced log file to bugzilla, please.

Comment 9 Ray Todd Stevens 2008-03-13 01:33:26 UTC
Part of the problem here is found in comment 2. This is an annoyingly
intermittent problem. The servers are all running fine, then one or more will
switch into this really annoying mode. I have been trying to find a common
denominator. If I can get one to fail I will do this, and I will also try to
send you a wireshark trace if I can grab one too. I should mention that this
is happening not only on the fc9 unit, but also on two fc8 ones.

In the meantime I will work on finding a way to make it fail.

Comment 10 Ray Todd Stevens 2008-03-13 02:57:27 UTC
OK I can't make this fail this way right now.   I will keep working on it.

However as I mentioned I am experiencing a very similar problem on FC8.   One of
the FC8 machines went into failure just now.   With the sales weenies out of the
office I had time to leave it in failure mode while I did some testing.  
Testing is good, wireshark is even better. ;-)

I found the problem with these machines. Now it may have been me, but I am not
sure. It is also possible that something wrote a bad named.conf. If you tell
me I did it I will believe it. However, I was able to find that the named.conf
has not changed between when it worked and when it didn't work and these
intermittent errors started showing up. I was also able to bring an old
unused, non-updated fc7 machine out of storage and put the problem config on it.
It ran with no problem. I upgraded to fc8: no problem. I upgraded to current rpm
status and I can get the failure. Now I will admit that I am doing something
weird or at least semi-weird.

Most of our servers either are routers or can be configured as routers.   We
also use virtual ip addresses.   That is an eth0 eth0:1 eth0:2.

Now in the named.conf I use a listen-on statement and include only one of
the virtual addresses. The purpose of this is to make sure I am not
accidentally using the wrong address, and that the name server can move to a
different machine just by moving the ip address. Let's use an example of
eth0:63.250.67.65 eth0:1:63.250.67.66 eth0:2:63.250.67.67 with a listen-on of
only 63.250.67.66.

Now I have been using a query-source of * port 53 in order to support our
firewalling.    

Now to success and failure. In the successes (fc7, fc8 pre-update) the packets
sent to query the outside name servers are sent from port 53, ip 63.250.67.66.
Now for the failure (i.e. fc8 post-update). In the failures the source port of
the request packets is still 53, BUT the ip address is apparently a random
selection of the ips available on the server. We use 10.x.x.x addresses and
some of the packets go out that interface with a 10.x.x.x address randomly selected
from the ones assigned to this net. The packets go out interfaces on which we don't
even have any ips being listened to. They also go out the 63.250.67.x interface.
Unfortunately they seem to use a semi-random ip, with 63.250.67.65 (an address not
being listened on) being by far the favorite. So return packets come in
to that same ip address, which is not in the listen-on list, and guess what, they
are apparently ignored.

Changing the query-source to 63.250.67.66 port 53 makes the world good.

Never would have found this without wireshark.  I just found another reason to
love the program.   ;-)

Now just my two cents worth here.   I fully admit and claim that what I am doing
is a bit esoteric.   Frankly if I were you or the bind guys I would not waste
much programming time fixing this.   However, it does occur to me that some kind
of a faq/knowledgebase/etc entry saying something along the lines of:

If you use listen-on statements which do not include all of the interfaces and
ips assigned to the machine, you must use a query-source statement which lists
an ip which you listen on.

Just my two cents worth on that.
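
A minimal sketch of the rule suggested above, in Python (not part of the original thread). The helper `render_options` is hypothetical: it renders a named.conf `options` block and refuses a `query-source` address that is not in the `listen-on` list. The address used is the example one from this comment.

```python
from typing import List


def render_options(listen_on: List[str], query_source: str, port: int = 53) -> str:
    """Render a named.conf options block (hypothetical helper).

    Raises ValueError when query_source is not one of the listen_on
    addresses, following the rule of thumb proposed in this comment.
    """
    if query_source not in listen_on:
        raise ValueError(
            f"query-source {query_source} is not in listen-on {listen_on}"
        )
    addresses = " ".join(f"{address};" for address in listen_on)
    return (
        "options {\n"
        f"    listen-on {{ {addresses} }};\n"
        f"    query-source address {query_source} port {port};\n"
        "};\n"
    )


fragment = render_options(["63.250.67.66"], "63.250.67.66")
print(fragment)
```

The check is deliberately strict: it encodes only the workaround described here, not a statement about what named itself accepts.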

Comment 11 Ray Todd Stevens 2008-03-13 21:20:29 UTC
I am good! --- I am good!   --- I am good!    and wireshark is even better!!!!

I found the problem on the fc9 machine.  It is a different and really annoying
manifestation of the same darned problem that was on the fc8 machines.   It is
fixed the same way.

Here is the deal: it is indeed sending out "A" requests and "AAAA" requests
and both are being answered appropriately. However, if you are using a
restrictive listen-on statement then the two are handled differently. The A
requests are sent out as specified above. That is, the system seems to fall in
love with sending them out on the first ip of the interface even if it is not
being listened to. Once in a while they will go out on the other ones. If
they go out on an ip that is being listened to then all is well and good. If
you get a response you are hunky-dory. If you don't get a response you get a
log entry. But if the packet goes out on an ip not being listened to then no
log entry is made, as it appears the attempt is not even being made. Since
normally this is the case, it looks to the logs like the A entries are not being
tried.

Now for the AAAA entries. Why they would be handled differently is well beyond
me, but they appear to be. They go out on a truly random basis. Like the A
entries, if they go out on a non-listened-to ip then no log entry is made. Now
none of the servers I am querying handles AAAA at this point, so I don't have any
"yep, this is it" responses. But if it goes out on a listened-to ip then it is
logged. Now for the fun part: since these are random, they are in fact showing up
in the log. So it looks in the log file like these are the only attempts.

Of course using an ip in the query source line solves the problem.

I guess this is ready for a close with hopefully some kind of a note to people
like me on how to set this up.



Comment 12 Adam Tkac 2008-03-26 12:08:10 UTC
Hm. It looks like you're right. The ARM says that named uses INADDR_ANY if you don't
specify the query-source option. I think query-source should inherit
addresses from listen-on if query-source is not specified. Let me discuss this
with upstream.

Comment 13 Ray Todd Stevens 2008-03-26 12:15:54 UTC
I would say that at a bare minimum it would be nice if the documentation was
changed to say that if you use a listen-on statement with ips specified you
must also use a query-source with ips listed.

By the way what is ARM?

Comment 14 Adam Tkac 2008-03-26 12:41:31 UTC
(In reply to comment #13)
> By the way what is ARM?

It is the Administrator Reference Manual
(http://www.isc.org/index.pl?/sw/bind/arm95/ for the bind 9.5 series). I am still
thinking about the correct behavior and I have to take back my previous statement. It
shouldn't matter that named doesn't listen on the interface which is used as the source
IP, because when you send something the outgoing port/interface stays open until the
response comes. I'm not sure, but this might be the problem: named tries to send a query
which uses one virtual interface (I expect this is a network bridge), so sending works as
expected. But when the response comes back it is not delivered to named, and I'm not sure
why. It might point to a kernel issue, because the response should be delivered, or to an
iptables misconfiguration (are you using a stateful firewall? You should use a
"-m state --state ESTABLISHED,RELATED -j ACCEPT" rule). In my experience this
works (for example, many people use named listening only on 127.0.0.1 while
outgoing queries go out through a network interface, and the responses are accepted fine).
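
A minimal sketch of the socket behavior discussed above, in Python (not part of the original thread, and not how named is implemented). A UDP socket bound to the wildcard address (INADDR_ANY) leaves the source address of each packet to the kernel, while a socket bound to a specific address always sends from it; in both cases the reply is delivered back to the sending socket. Everything runs over loopback, and `ask` is a hypothetical helper.

```python
import socket


def ask(bind_address: str, server: socket.socket):
    """Send one datagram from a socket bound to bind_address (hypothetical helper).

    Returns (local address reported by the client socket,
             source address the server saw,
             reply payload the client received).
    """
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    client.bind((bind_address, 0))  # port 0: let the kernel pick a port
    local_address = client.getsockname()[0]

    client.sendto(b"query", server.getsockname())
    payload, source = server.recvfrom(512)
    server.sendto(b"reply to " + payload, source)  # answer the observed source
    reply, _ = client.recvfrom(512)
    client.close()
    return local_address, source[0], reply


# Stand-in for an upstream name server on loopback.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(2.0)

wildcard = ask("0.0.0.0", server)   # wildcard, as with "query-source * port 53"
pinned = ask("127.0.0.1", server)   # fixed address, as with "query-source address <ip>"
server.close()

print("wildcard:", wildcard)
print("pinned:  ", pinned)
```

On a host with several addresses, the wildcard case is the one where the kernel's routing picks the source address, which would be consistent with the varying source ips seen in the wireshark traces earlier in this report.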

Comment 15 Ray Todd Stevens 2008-03-26 12:57:58 UTC
Certainly if it was working this way everything would be hunky-dory. But it
is not; that kind of is the point.

I am using this rule for iptables.

I find it interesting, and a clue that the problem appears to be in bind, because
of the log messages. If you query on an ip which is in the listen-on list and
receive no response there is a timeout log message. If the queries go out on
an interface not in the listen-on list and you don't get a response, nothing
happens in the logs. That would indicate to me that if the query goes out and
returns on an ip not being listened to, no timeout is occurring. I have tried
turning off iptables, and that doesn't fix the problem. This would lead me to
suspect that something in bind is not accepting returning packets on
non-listened-to ips.

But yes I am only guessing.

Comment 16 Ray Todd Stevens 2008-04-01 14:30:18 UTC
I have banged away at this REAL hard, and as long as you specify a listened-to
ip in the query-source everything works OK. I can't make these problems
occur. Interestingly enough, specifying a query-source which is not in the
listen-on list will totally break things.

Comment 17 Adam Tkac 2008-04-02 12:59:43 UTC
(In reply to comment #16)
> I have banged away at this REAL hard, and as long as you specify a listened-to
> ip in the query-source everything works OK. I can't make these problems
> occur. Interestingly enough, specifying a query-source which is not in the
> listen-on list will totally break things.

This might indicate a kernel bug. It should work as expected. Would it be possible
to try whether this happens when named listens on a bridge (virtual interface) and
query-source points to a real interface, and also what happens if named listens on
a real interface and query-source uses a virtual one, please? Thanks

Comment 18 Ray Todd Stevens 2008-04-02 13:56:41 UTC
If I understand your question correctly I already did this. As I said I did a
WHOLE BUNCH of banging away at the system.

As I understand your question, you are referring to the first interface
(i.e. eth0) as the "real interface" and to the eth0:0 and eth0:1
interfaces as the "bridge". In this case I did move the ips around to see if this would fix
things. Now one thing: my queries are udp queries. I have not played with
forcing tcp, as I figure this would break too many other things.

Anyway my primary configuration is that eth0 is not in the listen-on list.
However, for testing I did add this interface to the list and remove the others.
It did better, primarily because of the random selection process discussed
above. My packet analyzer shows some of the query packets being sent on the
non-listened-to interface, and these are apparently ignored. But the underlying
problem still existed. If you then set the query-source to a non-listened-to
address (listening on eth0, querying on eth0:0) things broke.

I would think some form of kernel issue might be involved. But I would think
that this would also in some manner have to involve how bind does things. For
one thing, I can't see how the kernel could act on the configuration options in
the named.conf file to cause the failure. To a layman in bind/kernel
interactions who has been around a while, it sure looks like
bind is using the processes doing the listen-on-ing to also receive and process
the incoming responses to query packets. Or it appears that the process doing
the query dies a non-logged death right after sending out the query packet if the
address is not being listened to.

I also get these same results if no virtual interfaces are involved. That is,
where the query-source is on eth0 and the listen-on is on eth1. I tend to
think that this is a definite problem because this configuration might be
needed for a firewalled system.

Comment 19 Ray Todd Stevens 2008-04-23 20:35:01 UTC
OK I banged at this one some more.   I found something very interesting.  
Doesn't help me, but sure is interesting.

It appears to be an interaction between listen-on and query-source. If you
eliminate either of these statements totally then everything is back to
functionality. The packet analyzer shows the packets going out and being
properly handled just perfectly.

It is the connection of the two that breaks things.

Hope this helps.

Comment 20 Bug Zapper 2008-05-14 05:48:47 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 21 Adam Tkac 2008-05-28 13:33:06 UTC
I did some further investigation and I reproduced this problem.
It looks like the problem is somewhere in the bridge configuration and/or routing
configuration. I didn't figure out where exactly, but in this case I recommend using
the query-source option with a specific address. By the way, when I set up a machine with
two physical interfaces and bind listens on one and sends queries through the other, it
works as expected. Closing as notabug (if you don't agree, please reopen).