Bug 975940 - 9.9.3-3.P1 upgrade fails remote lookups
Summary: 9.9.3-3.P1 upgrade fails remote lookups
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: bind
Version: 19
Hardware: x86_64
OS: Linux
Priority: low
Severity: unspecified
Target Milestone: ---
Assignee: Tomáš Hozza
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-06-19 16:05 UTC by William H. Haller
Modified: 2015-02-18 13:57 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-18 13:57:11 UTC
Type: Bug
Embargoed:


Attachments
named.conf as working with previous release (4.88 KB, text/x-csrc)
2013-06-21 14:04 UTC, William H. Haller
First bind wireshark capture (1.68 KB, application/octet-stream)
2013-08-27 02:10 UTC, William H. Haller
Second bind wireshark capture (38.12 KB, application/pdf)
2013-08-27 02:11 UTC, William H. Haller
second attachment (9.99 KB, application/octet-stream)
2013-08-29 13:31 UTC, William H. Haller

Description William H. Haller 2013-06-19 16:05:18 UTC
Description of problem: After upgrade to 9.9.3-3 remote lookups fail. Sometimes the first one or two work, but after that all fail.


Version-Release number of selected component (if applicable): 9.9.3-3.P1


How reproducible: Always


Steps to Reproduce:
1. Upgrade bind to 9.9.3-3.P1.
2. Look up an external name with nslookup or dig.
3. The lookup returns SERVFAIL.

Actual results: SERVFAIL on DNS lookup. All locally defined zones work fine and consistently, but external lookups fail. named continues to run. There is nothing in the log, and there are no blocked packets in the DENY entries.


Expected results: Smooth and uneventful upgrade from previous release of bind.


Additional info: Downgrading bind and its associated packages to the previous release makes things work again.

Doesn't seem to matter if running as a full server or forwarding to an external resolver like 8.8.8.8. Doesn't seem to matter if dnssec is on or off, although I've found with the previous release I can't reliably run with both dnssec and chroot. dnssec works fine without the chroot. Even when 9.9.3-3.P1 is first started and does resolve a couple of names, it isn't fast.

Lookups are running from a fixed query port and address.
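
The reproduction steps above (repeat an external lookup until SERVFAIL appears) could be scripted roughly as follows. This is a minimal sketch, not something from the original report: the function names `build_query`, `parse_rcode`, and `probe`, the query-ID scheme, and the timing defaults are all assumptions. It uses only the Python standard library to send plain UDP A-record queries to the resolver under test, and it does not check that the reply's transaction ID matches the query.

```python
import socket
import struct
import time

# Human-readable names for the response codes of interest here.
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, query_id):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1)
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode(response):
    """Return the RCODE (low 4 bits of the flags word) of a DNS response."""
    if len(response) < 4:
        raise ValueError("response too short to contain DNS flags")
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def probe(server, name, count=10, interval=5.0, timeout=3.0):
    """Query `server` for `name` `count` times; return one outcome per try."""
    results = []
    for i in range(count):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(name, 0x1000 + i), (server, 53))
                data, _ = sock.recvfrom(512)
                rcode = parse_rcode(data)
                results.append(RCODE_NAMES.get(rcode, "RCODE%d" % rcode))
            except socket.timeout:
                results.append("TIMEOUT")
            except (OSError, ValueError):
                results.append("ERROR")
        if i + 1 < count:
            time.sleep(interval)
    return results
```

Calling `probe("127.0.0.1", "example.com")` right after restarting named, and again a few minutes later, would show whether outcomes drift from NOERROR to SERVFAIL or TIMEOUT; the address and name are placeholders. Since a later comment notes that names already looked up keep resolving, using a different external name for each run is likely to give a more meaningful result.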

Comment 1 Tomáš Hozza 2013-06-21 08:04:51 UTC
Can you please provide your named.conf, with which you are experiencing this
issue?

Thank you.

Comment 2 William H. Haller 2013-06-21 14:04:08 UTC
Created attachment 763868
named.conf as working with previous release

I've tried turning dnssec options to no and commenting the key out and using the forwarders, but to no avail with the new version. The internally managed zones work. External references fail.

This is the slave DNS server's configuration file which manages DNS for the rest of our network. It's also the one that I had to stop running bind-chroot for in order to get dnssec to work at all.

Comment 3 Fedora End Of Life 2013-07-04 04:02:17 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 4 Fedora End Of Life 2013-08-01 12:29:32 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 5 William H. Haller 2013-08-11 00:09:01 UTC
Delayed upgrade to F18 for as long as I could, but finally did the upgrade. The bind release that is distributed with F18 acts the same as F17. When you first fire up the process it will answer queries for a while and then it stops. Setting resolv.conf to match the forwarders listed in the named.conf file allows other programs to resolve names immediately, but if set to just 127.0.0.1 it stops forwarding after a while. It still answers any DNS name it already knows - local zones or DNS names it has looked up before it started forwarding.

Please keep looking. There are several servers I can't upgrade until this is resolved.

The EOL policy really needs to be re-examined.

Comment 6 William H. Haller 2013-08-12 19:07:41 UTC
FWIW, a wireshark capture shows the 9.9.3-4 requesting the DNS lookup properly and the previous F17 downgraded server replying with the address. This happens multiple times for each DNS request made.

The only difference is that what is being sent back in response to the query isn't the same format (or at least doesn't appear to be the same format) as when the request is made from a downgraded F18 package (bind-9.9.2-5.p1).

If being sent from a downgraded F18 package (similar to a downgraded F17 package), then wireshark reports the response and decodes it in the display as having a A xxx.xxx.xxx.xxx record which is right and only one request is made and responded to. The F18 package (latest version) gets a reply back that has the right information if you open up the packet, but which doesn't display right in wireshark in the live packet capture making me wonder if the new DNS package is asking for something differently and not parsing the reply correctly.

Anyway, please reopen.

Comment 7 Tomáš Hozza 2013-08-26 08:36:26 UTC
(In reply to William H. Haller from comment #6)
> FWIW, a wireshark capture shows the 9.9.3-4 requesting the DNS lookup
> properly and the previous F17 downgraded server replying with the address.
> This happens multiple times for each DNS request made.
> 
> The only difference is that what is being sent back in response to the query
> isn't the same format (or at least doesn't appear to be the same format) as
> when the request is made from a downgraded F18 package (bind-9.9.2-5.p1).
> 
> If being sent from a downgraded F18 package (similar to a downgraded F17
> package), then wireshark reports the response and decodes it in the display
> as having a A xxx.xxx.xxx.xxx record which is right and only one request is
> made and responded to. The F18 package (latest version) gets a reply back
> that has the right information if you open up the packet, but which doesn't
> display right in wireshark in the live packet capture making me wonder if
> the new DNS package is asking for something differently and not parsing the
> reply correctly.
> 
> Anyway, please reopen.

I'm trying to reproduce your issue but with no luck so far. I tried
bind-9.9.2-12.P2.fc18.x86_64 and bind-9.9.3-4.P2.fc18.x86_64 and
both work fine and resolve external names even after some time.

Can you please attach communication dumps from wireshark for bind
9.9.2 (that works) and bind 9.9.3 (not working)?

Thanks!

Comment 8 William H. Haller 2013-08-26 14:12:17 UTC
If you'd like to see a particular address queried, let me know. Otherwise, no reply is needed and I'll get you the information tonight, if possible.

Comment 9 William H. Haller 2013-08-27 02:10:40 UTC
Created attachment 790731
First bind wireshark capture

Comment 10 William H. Haller 2013-08-27 02:11:54 UTC
Created attachment 790733
Second bind wireshark capture

Comment 11 William H. Haller 2013-08-27 02:14:01 UTC
In both cases, the same Fedora 17 server release of bind is running on the machines being forwarded to. Both are 64 bit.

The first dump shows a working sequence. This combination of bind-9.9.2-5.P1.fc18.x86_64 on the client and bind-9.9.2-7.P2.fc17.x86_64 on the server works without fail. The second dump shows the effects of switching to bind-9.9.3-4.P2.fc18.x86_64 on the client. The first few queries work as expected, but after just about a minute, the client stops correctly analyzing the server responses and repeats the queries until it gives up and will not answer any more queries that it has to forward. Note that it also doesn't appear to cache the responses properly, or they have a weird lifetime: lookups that should be answered directly from cache are instead queried again.
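
The pattern described here (the client re-sending the same forwarded query several times before giving up) can be checked mechanically once a capture has been reduced to a list of packets. A rough sketch, assuming each packet has been exported as a hypothetical `(timestamp, is_response, qname)` tuple; the tuple layout, the `find_retries` name, and the 10-second window are illustrative choices, not anything taken from the attached captures:

```python
from collections import defaultdict


def find_retries(packets, window=10.0):
    """Return {qname: n} for names queried n > 1 times within `window` seconds.

    `packets` is an iterable of (timestamp, is_response, qname) tuples.
    Responses are ignored; names are compared case-insensitively and
    without a trailing dot.
    """
    times = defaultdict(list)
    for timestamp, is_response, qname in packets:
        if not is_response:
            times[qname.lower().rstrip(".")].append(timestamp)

    retries = {}
    for qname, stamps in times.items():
        stamps.sort()
        # Sliding window: largest number of queries that fit in one window.
        best = 1
        start = 0
        for end in range(len(stamps)):
            while stamps[end] - stamps[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best > 1:
            retries[qname] = best
    return retries
```

A name that shows up in the result with a count well above 1 was re-queried in quick succession, which is what a resolver that is not accepting the replies it receives would be expected to do; a name queried once, or queried again only much later, is left out.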

Comment 12 Tomáš Hozza 2013-08-29 08:57:33 UTC
(In reply to William H. Haller from comment #10)
> Created attachment 790733
> Second bind wireshark capture

I'm not able to open the file. Can you please re-upload it?

Thanks!

Comment 13 William H. Haller 2013-08-29 13:31:39 UTC
Created attachment 791811
second attachment

Previous version wasn't actually gzip'd. Sorry. Double checked to make sure it opened properly here before compressing.

Comment 14 Fedora End Of Life 2013-12-21 15:33:47 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 15 William H. Haller 2014-01-24 14:44:44 UTC
Did an upgrade to F19 - still no joy with FC19's latest.

Additionally, the link to dnsperf in koji is broken for the version of dnsperf that worked in FC18. Although the version of bind and its associated files are there for FC19, there is no bind-license file for that version of FC19, so even though you can download the rest, the downgrade fails when trying to back up to the FC19 equivalent of that version.

I did try turning forwarding off completely with the FC19 latest, and that still gets just a single query processed correctly. After that it can't process replies any further - I still see multiple requests and multiple replies for each domain I query, but nslookup never returns an answer - it just fails.

Fortunately the FC18 packages still seem to work.

Didn't see an FC20 version listed. If anyone wonders why more people don't upgrade when you'd like them to it's because too much stuff still isn't working in the previous version that has been broken for quite a while. I guess unless I want to join the intrepid people ripping up their DNS architecture to try Bind 10, I may be stuck on FC19 for a long time unless FC18 packages will still work on FC20.

Comment 16 William H. Haller 2014-09-30 21:19:21 UTC
F20 version 9.9.4 also seems to be affected at a different company in some perhaps similar way.

There I have two zones (one internal checking over one VLAN and one external checking over a second VLAN). The external zone always seems to update. The internal zone has at least updated occasionally since there is a zone transfer that eventually is populated. But looking on the master server, the packets start getting blocked by iptables (both tcp and udp, but only the one internal zone transfer - not both).

Other CentOS 7 and 6.5 based servers running the latest bind instance for each (9.9.4-14.el7.x86_64 and 9.8.2-0.23.rc1.el6_5.1.x86_64) have no problem connecting to the same CentOS 7 master for zone transfers either internal or external over the same VLANs. It is just the F20 box's version of bind (9.9.4-15.P2.fc20.x86_64) that seems to have the issue. I'm running the fc20 (and the el7 clients) in a virtualized test environment, but I'm not seeing any other issues with the networking.

The only difference I see is the packets getting blocked don't have a Response In: value field in wireshark's display of the Domain Name System (response).

The actual named.conf files are of course different between the earlier and this report, but the principle of how they are constructed and how they query their respective zones are the same. Are there any plans to move a 9.9.2 version forward for something that works in fc20 (along with all the associated dhcp and other files)? I can always just use resolv.conf to forward the requests without running bind, but it is nice in the virtualized environment to be able to do some caching in case the virtuals have to go down for some reason.

Comment 17 Tomáš Hozza 2014-10-01 11:09:50 UTC
(In reply to William H. Haller from comment #16)
> F20 version 9.9.4 also seems to be affected at a different company in some
> perhaps similar way.
> 
> There I have two zones (one internal checking over one VLAN and one external
> checking over a second VLAN). The external zone always seems to update. The
> internal zone has at least updated occasionally since there is a zone
> transfer that eventually is populated. But looking on the master server, the
> packets start getting blocked by iptables (both tcp and udp, but only the
> one internal zone transfer - not both).
> 
> Other CentOS 7 and 6.5 based servers running the latest bind instance for
> each (9.9.4-14.el7.x86_64 and 9.8.2-0.23.rc1.el6_5.1.x86_64) have no problem
> connecting to the same CentOS 7 master for zone transfers either internal or
> external over the same VLANs. It is just the F20 box's version of bind
> (9.9.4-15.P2.fc20.x86_64) that seems to have the issue. I'm running the fc20
> (and the el7 clients) in a virtualized test environment, but I'm not seeing
> any other issues with the networking.
> 
> The only difference I see is the packets getting blocked don't have a
> Response In: value field in wireshark's display of the Domain Name System
> (response).
> 
> The actual named.conf files are of course different between the earlier and
> this report, but the principle of how they are constructed and how they
> query their respective zones are the same. Any plans to move a 9.9.2 version
> forward for something that works in fc20 (along with all the associated dhcp
> and other files). I can always just use resolv.conf to forward the requests
> without running bind, but it is nice in the virtualized environment to be
> able to do some caching in case the virtuals have to go down for some reason.

Hi.

The F20 version of BIND is pretty much the same as in CentOS 7 (and RHEL 7). I'm not planning to move to 9.9.2. However I plan to update BIND to 9.9.6 in F21 and to 9.10.1 in rawhide.

Comment 18 William H. Haller 2014-10-01 16:37:12 UTC
It appears that the problem with the F20 client was the network adapter for the virtual machine being set to virtio. Changing it to e1000 seems to have made it happy. It is odd that the virtio driver appears to ship packets (as the queries were seen on the master side) but there is something subtly wrong with them that causes them to be blocked by iptables (and even with iptables shut off, ignored by the master DNS server). Anyway, apologies for the false start.

Comment 19 Tomáš Hozza 2014-10-01 17:02:25 UTC
(In reply to William H. Haller from comment #18)
> It appears that the problem with the F20 client was the network adapter for
> the virtual machine being set to virtio. Changing it to e1000 seems to have
> made it happy. It is odd that the virtio driver appears to ship packets (as
> the queries were seen on the master side) but there is something subtly
> wrong with them that causes them to be blocked by iptables (and even with
> iptables shut off, ignored by the master DNS server). Anyway, apologies for
> the false start.

Well, that's good news since you've found the cause. Does this mean you are not experiencing issues any more? Can I close this bug?

Anyway, I think it might be worth filing a libvirt bug about the strange NIC behavior.

Comment 20 William H. Haller 2014-10-01 18:29:32 UTC
Completely different install environment. I had originally just thought it might be related.

The only real way for me to know if it could be closed is to actually upgrade from F19 (which is actually running the last F18 bind/dhcp stack) to F20 (which has no F18 era bind/dhcp rpms available). I haven't done that yet, hoping you'd find something. If it has the same problems in F20 that it has with 9.9.3 in F19, I'm in a tough spot since I'm not sure how well the older version would compile up under F20.

I'll eventually have to do something when F19 goes EOL, but this is an infrastructure bug I'd really like to see resolved before I have to roll the dice on the F20 release. I'm not even remotely considering F21 at this time. This F20 virtual here at work is the first try at F20, and if the 389 directory server on Centos had a graphical client that I could get to work, I wouldn't even be trying it there. If I have a minute, I might fire up a virtual F20 in the network with the original bug and see what happens.

Comment 21 William H. Haller 2015-01-02 22:37:36 UTC
I did the upgrade from FC19 to FC20 over the last couple weeks. It was an upgrade so no configuration files changed. In addition, I had the various bind packages excluded via yum.conf, so the FC18 packages were still running and all was well. Last week I decided to try the FC20 packages. They fail in the same way - a few nslookups go through and then nothing. Restart the named-chroot service and you get a few more.

Local zone files for local domains resolve correctly. Going back to FC18 packages 9.9.2-5.P1 returns correct rock solid operation.

I only updated the clients that were forwarding their requests to the main DNS servers, which were still running FC18's packages, but taking out the forward lines in the configuration changed nothing other than producing time-exceeded ICMP errors.

I reached my limit with the bind packages and am currently trying unbound/nsd. If they work out, I'll move on and this will have to remain unresolved in terms of testing on my end. I'm not happy with some of their limitations (losing split horizon, having to run both packages with interface choosing for each, and the incredibly stupid inability to configure an outbound interface to use for forwarding/stub), but having something that has applied bug fixes is worth something as well.

Comment 22 Fedora End Of Life 2015-01-09 22:09:49 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 19 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 23 Fedora End Of Life 2015-02-18 13:57:11 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

