Bug 436254
Summary: | Updated bind falls in love with ip6
---|---
Product: | [Fedora] Fedora
Component: | bind
Version: | 9
Hardware: | All
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | low
Priority: | low
Reporter: | Ray Todd Stevens <raytodd>
Assignee: | Adam Tkac <atkac>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | ovasik
Doc Type: | Bug Fix
Last Closed: | 2008-05-28 13:33:06 UTC
Description
Ray Todd Stevens
2008-03-06 05:06:36 UTC
I'm not sure I understand correctly. Do you mean that named doesn't ask for A glue records and asks only for AAAA glue? Would it be possible to attach named.conf and the named system log, please?

It sounds like you understand the issue exactly. This is an intermittent problem. I find it interesting that it is even attempting AAAA lookups, as I am using the -4 switch in the /etc/sysconfig/named file. It seems to be triggered by a name server for a search domain (in resolv.conf) disappearing for a few minutes, but that is only a guess. Once the server is back, named still only tries to find it using AAAA requests. Since none of these servers have IPv6 addresses, the only way to fix the problem is to reset the name server. A HUP and an init.d named restart both fix the problem if the offending name server is back up.

It isn't doing it right now. I will try to find an old log, but these are pretty big logs.

Created attachment 297463 [details]
my /etc/sysconfig/named file
With this file it would seem to me that named should never be trying AAAA
at all, rather than switching to only ever trying them.
Created attachment 297464 [details]
my named.conf
As I understand it, this configuration should also make named try
harder on A records.
Now you can see my confusion and why I filed this report.

The -4 option doesn't tell named not to ask for AAAA records. You can ask for AAAA records over an IPv4 interface and for A records over an IPv6 interface. The -4 option tells named to listen only on IPv4 interfaces. Also, I'm not sure how you found that only AAAA glue is requested. You can try "rndc dumpdb" and look in %{chroot}/var/named/data/cache_dump.db to see that both kinds of glue are requested and cached. It doesn't look like a problem to me, does it?

Good point on what is being asked for; I had not thought of that. On the other hand, it should actually be asking for A records, and it is not. We have a Linux (fc8) router and ran Wireshark on it. According to it, the only outgoing name server packets were asking for AAAA records. Part of the problem is that none of them were answered, which is a whole separate issue. That is where the problem is.

I tried to reproduce the problem with bind-9.5.0-29.b2.fc9 (i386 platform) and your configuration, and I wasn't successful. Would it be possible to try this, please?

$ named -4gunamed -d99 > log 2>&1

Then, in a different terminal, run

$ dig @127.0.0.1 <some_name>

Then simply terminate named and attach the resulting log file to Bugzilla, please.

Part of the problem here is described in comment 2: this is an annoyingly intermittent problem. The servers all run fine, then one or more will switch into this really annoying mode. I have been trying to find a common denominator. If I can get one to fail I will do this, and I will also try to send you a Wireshark trace if I can grab one. I should mention that this happens not only on the fc9 unit but also on two fc8 ones. In the meantime I will work on finding a way to make it fail.

OK, I can't make this fail this way right now. I will keep working on it. However, as I mentioned, I am experiencing a very similar problem on FC8.

One of the FC8 machines went into failure just now.
With the sales weenies out of the office I had time to leave it in failure mode while I did some testing. Testing is good, wireshark is even better. ;-)

I found the problem with these machines. Now it may have been me, but I am not sure. It is also possible that something wrote a bad named.conf. If you tell me I did it I will believe it. However, I was able to confirm that the named.conf did not change between when it worked and when these intermittent errors started showing up. I was also able to bring an old, unused, non-updated fc7 machine out of storage and put the problem config on it. It ran with no problem. I upgraded to fc8: no problem. I upgraded to current rpm status and I can get the failure.

Now I will admit that I am doing something weird, or at least semi-weird. Most of our servers either are routers or can be configured as routers. We also use virtual IP addresses, that is, eth0, eth0:1, eth0:2. In named.conf I use a listen-on statement that includes only one of the virtual addresses. The purpose of this is to make sure I am not accidentally using the wrong address, and so that the name server can move to a different machine just by moving the IP address. Let's use an example of eth0: 63.250.67.65, eth0:1: 63.250.67.66, eth0:2: 63.250.67.67, with a listen-on of only 63.250.67.66. I have been using a query-source of * port 53 in order to support our firewalling.

Now to success and failure. In the successes (fc7, fc8 pre-update) the packets sent to query the outside name servers are sent from port 53, IP 63.250.67.66. In the failures (i.e. fc8 post-update) the source port of the request packets is still 53, BUT the IP address is apparently a random selection of the IPs available on the server. We use 10.x.x.x addresses, and some of the packets go out that port with a 10.x.x.x address randomly selected from the ones assigned to this net. The packets go out ports on which we don't even have any IPs being listened to.
They also go out the 63.250.67.x port. Unfortunately they seem to use a semi-random IP, with 63.250.67.65 (an address not being listened on) being by far the favorite. So return packets come in to that same IP address, which is not in the listen-on list, and guess what, they are apparently ignored. Changing the query-source to 63.250.67.66 port 53 makes the world good. Never would have found this without wireshark. I just found another reason to love the program. ;-)

Now just my two cents' worth here. I fully admit that what I am doing is a bit esoteric. Frankly, if I were you or the bind guys I would not waste much programming time fixing this. However, it does occur to me that some kind of FAQ/knowledge base entry would help, saying something along the lines of: if you use listen-on statements which do not include all of the interfaces and IPs assigned to the machine, you must use a query-source statement which lists the IPs which you listen on. Just my two cents' worth on that.

I am good! --- I am good! --- I am good! and wireshark is even better!!!! I found the problem on the fc9 machine. It is a different and really annoying manifestation of the same darned problem that was on the fc8 machines. It is fixed the same way.

Here is the deal: it is indeed sending out "A" requests and "AAAA" requests, and both are being answered appropriately. However, if you are using a restrictive listen-on statement then the two are handled differently. The A requests are sent out as described above. That is, the system seems to fall in love with sending them out on the first IP of the interface even if it is not being listened to. Once in a while they will go out on the other ones. If they go out on an IP that is being listened to then all is well and good. If you get a response you are hunky-dory. If you don't get a response you get a log entry. But if the packet goes out on an IP not being listened to then no log entry is made, as it appears the attempt is not even being made.
Since this is normally the case, it looks in the logs as if the A entries are not being tried.

Now for the AAAA entries. Why they would be handled differently is well beyond me, but they appear to be. They go out on a truly random basis. Like the A entries, if they go out on a non-listened-to IP then no log entry is made. None of the servers I am querying handles AAAA at this point, so I don't have any "yep, this is it" responses. But if it goes out on a listened-to IP then it is logged. Now for the fun part: since these are random, they are in fact showing up in the log. So it looks in the log file like these are the only attempts. Of course using an IP in the query-source line solves the problem.

I guess this is ready for a close, with hopefully some kind of a note to people like me on how to set this up.

Hm. It looks like you're right. The ARM says that named uses INADDR_ANY if you don't specify the query-source option. I think the query-source option should inherit addresses from listen-on if query-source is not specified. Let me discuss this with upstream.

I would say that at a bare minimum it would be nice if the documentation was changed to say that if you use a listen-on statement with IPs specified you must also use a query-source with IPs listed. By the way, what is ARM?

(In reply to comment #13)
> By the way what is ARM?

It is the Administrator Reference Manual (http://www.isc.org/index.pl?/sw/bind/arm95/ for the bind 9.5 series).

I'm still thinking about the correct behavior and I have to take back my previous decision. It doesn't matter that named doesn't listen on the interface which is used as the source IP, because when you send something the outgoing port/interface stays open until the response comes. I'm not sure, but this might be the problem: named sends a query which uses one virtual interface (I expect this is a network bridge), so sending works as expected. But when the response comes back it is not delivered to named, not sure why.
It might point to a kernel issue, because the response should be delivered, or to an iptables misconfiguration (are you using a stateful firewall? You should use a "-m state --state ESTABLISHED,RELATED -j ACCEPT" rule). My experience is that this works (for example, many people use named listening only on 127.0.0.1 while outgoing queries go through a network interface, and responses are accepted fine).

Certainly, if it was working this way everything would be hunky-dory. But it is not, and that kind of is the point. I am using this rule for iptables. I find it interesting, and a clue that the problem is in bind, because of the log messages. If you query on an IP which is in the listen-on list and receive no response, there is a timeout log message. If the queries go out on an interface not in the listen-on list and you don't get a response, nothing happens in the logs. That would indicate to me that if the query goes out and returns on an IP not being listened to, no timeout is occurring. I have tried turning off iptables, and that doesn't fix the problem. This would lead me to suspect that something in bind is not accepting returning packets on non-listened-to IPs. But yes, I am only guessing.

I have banged away at this REAL hard, and as long as you specify a listened-to IP in the query-source everything works OK. I can't make these problems occur. Interestingly enough, specifying a query-source which is not in the listen-on list will totally break things.

(In reply to comment #16)
> I have banged away at this REAL hard, and as long as you specify a listened to
> ip in the query-source that everything works OK. I can't make these problems
> occur. Interestingly enough specifying a query-source which is not in the
> listen-on list will totally break things.

This might indicate a kernel bug. It should work as expected.
Would it be possible to try whether this happens when named listens on the bridge (virtual interface) and query-source points to the real interface, and also what happens if named listens on the real interface and query-source uses the virtual one, please? Thanks

If I understand your question correctly, I already did this. As I said, I did a WHOLE BUNCH of banging away at the system. As I understand your question, you are referring to the first interface (i.e. eth0) as the "real interface" and to the eth0:0 and eth0:1 interfaces as the "bridge". In this case I did move the IPs around to see if this would fix things. One thing to note: my queries are UDP queries. I have not played with forcing TCP, as I figure this would break too many other things.

Anyway, in my primary configuration eth0 is not in the listen-on list. For testing, however, I added this interface to the list and removed the others. It did better, primarily because of the random selection process discussed above. My packet analyzer shows some of the query packets being sent on the non-listened-to interface, and these are apparently ignored. But the underlying problem still existed. If you then set the query-source to a non-listened-to address (listening on eth0, querying on eth0:0), things broke.

I would think some form of kernel issue might be involved, but I would think that this would also in some manner have to involve how bind does things. For one thing, I can't see the kernel being able to act on the configuration options in the named.conf file to cause the failure. To a layman in bind/kernel interactions who has been around a while, it sure looks like bind is using the processes that do the listen-on listening to also receive and process the incoming responses to query packets. Or it appears that the process doing the query dies a non-logged death right after sending out the query packet if the address is not being listened to.

I also get these same results if no virtual interfaces are involved.
That is, where the query-source is on eth0 and the listen-on is on eth1. I tend to think that this is a definite problem, because this configuration might be needed for a firewalled system.

OK, I banged at this one some more. I found something very interesting. Doesn't help me, but sure is interesting. It appears to be an interaction between listen-on and query-source. If you eliminate either of these statements totally then everything is back to functionality. The packet analyzer shows the packets going out and being handled just perfectly. It is the combination of the two that breaks things. Hope this helps.

Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I did some further investigation and reproduced this problem. It looks like the problem is somewhere in the bridge configuration and/or routing configuration. I didn't figure out where exactly, but in this case I recommend using the query-source option with a specific address. Btw, when I set up a machine with two physical interfaces, with bind listening on one and sending queries through the other, it works as expected. Closing as notabug (if you don't agree, please reopen)
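
For anyone landing here with the same setup, the workaround discussed in this report can be sketched as a named.conf fragment. This is only an illustration: the addresses are the example ones used in the comments (63.250.67.66 being the single listened-to virtual address), the attached named.conf is not reproduced here, and every other option is left out.

```
options {
    // Answer queries only on the one virtual address
    // (eth0:1 in the example from the comments above).
    listen-on { 63.250.67.66; };

    // Pin the source address of outgoing queries to that same address.
    // The failing setup in this report used a wildcard instead:
    //   query-source address * port 53;
    query-source address 63.250.67.66 port 53;
};
```

With the wildcard form the source address of each outgoing query is not fixed by named, which matches the varying source IPs seen in the Wireshark traces. Naming a listened-to address explicitly is what made the failures stop in the reporter's tests.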