Bug 62291 - Latest bind 8.2.3 erratum for 6.2 and 7.0 is totally unusable
Summary: Latest bind 8.2.3 erratum for 6.2 and 7.0 is totally unusable
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: bind
Version: 7.0
Hardware: i386
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Daniel Walsh
QA Contact: Ben Levenson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2002-03-29 08:24 UTC by Mike A. Harris
Modified: 2007-03-27 03:52 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2002-12-18 20:21:58 UTC
Embargoed:



Description Mike A. Harris 2002-03-29 08:24:25 UTC
Description of Problem:
The latest bind 8.2.3 erratum for both RHL 6.2 and 7.0 is totally broken.
Once installed/updated, even with just the default configuration supplied by
bind plus the caching-nameserver package, name resolution is completely
unreliable and starts failing within 5 minutes.

Version-Release number of selected component (if applicable):
RHL 6.2 bind-8.2.3-0.6.x
RHL 7.0 bind-8.2.3-1

How Reproducible:
100%

Steps to Reproduce:
1. Install a clean RHL 6.2 or 7.0 system
2. up2date -u to all currently released errata
3. be sure bind and caching-nameserver are installed and are current
4. Edit /etc/resolv.conf and add "nameserver 127.0.0.1" (sketch below)
5. Whether or not ISP nameservers are also listed, the server will easily fail
   within 5-10 minutes tops.  Usually reproducible in less than a minute.
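
Sketch of the /etc/resolv.conf from step 4 (the search domain is just the one
from my LAN, adjust as needed; the ISP's servers can optionally follow as
fallbacks):

  # /etc/resolv.conf -- point the resolver at the local caching named
  search capslock.lan
  nameserver 127.0.0.1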

Actual Results:
Random failures, examples to follow.

Expected Results:
Proper DNS resolution.

Additional Information:
This problem has been occurring on and off for a while now, maybe a few
months; it is hard to pinpoint exactly.  However, I recently up2date'd my
machines, and since then I've had nothing but nonstop total DNS failure.  I
have to restart named 2, 3, 6, 10 times in a row: click on a URL, it fails..
again, it fails, again, and finally DNS resolves the page.  Within a minute,
that URL won't load again.  Other sites work fine.  Then they disappear and
one that wouldn't work before now works again.

I first noticed this problem strongly when trying to use Google after the
update.  When I go to www.google.com, it redirects me to www.google.ca, which
then says no such host can be found.  Doing a manual lookup of the IP of
www.google.com and pointing www.google.ca at that IP acted as a workaround.
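
In practice the workaround looked roughly like this (sketch;
<ip-of-www.google.com> stands in for whatever address the manual lookup
returned, and <working-nameserver> for any server that still resolved it):

  nslookup www.google.com <working-nameserver>
  echo "<ip-of-www.google.com>  www.google.ca" >> /etc/hosts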

The sites I consistently use to reproduce this are the ones I most often use.

www.google.com, www.google.ca (won't work from outside .ca), irc.lame.org,
irc.openprojects.net, www.redhat.com, bugzilla.redhat.com

Comment 1 Mike A. Harris 2002-03-29 08:31:45 UTC
pts/0 root@gw:/# nslookup irc.redhat.com
Server:  gw.capslock.lan
Address:  192.168.1.1
 
*** gw.capslock.lan can't find irc.redhat.com: Server failed
pts/0 root@gw:/# service named restart
Shutting down named:                                       [  OK  ]
Starting named:                                            [  OK  ]
pts/0 root@gw:/# nslookup irc.redhat.com
Server:  gw.capslock.lan
Address:  192.168.1.1
 
Non-authoritative answer:
Name:    irc.openprojects.net
Addresses:  64.28.67.98, 207.106.22.229, 216.53.71.65, 198.186.203.27
Aliases:  irc.redhat.com
pts/0 root@gw:/# nslookup irc.redhat.com
Server:  gw.capslock.lan
Address:  192.168.1.1
 
*** gw.capslock.lan can't find irc.redhat.com: Server failed
pts/0 root@gw:/# nslookup www.redhat.com
Server:  gw.capslock.lan
Address:  192.168.1.1
 
Non-authoritative answer:
Name:    www.redhat.com
Addresses:  216.148.218.197, 216.148.218.195
 
pts/0 root@gw:/# nslookup irc.lame.org
Server:  gw.capslock.lan
Address:  192.168.1.1
 
*** gw.capslock.lan can't find irc.lame.org: Server failed

Comment 2 Mike A. Harris 2002-03-29 08:40:39 UTC
Sometimes I get lookups working for 2 to 3 minutes, maybe as many as 5 to
10 minutes.  Leaving the machine completely idle, waiting 2-3 minutes, and
then hitting up-arrow-enter to repeat the last lookup is all that needs to be
done.  If it works one time, it will fail at some point.  If I can't get one
to fail, I try a different host, and generally one fails right away.
Restarting bind does not guarantee it will work right away either.  It might
work, or might fail immediately.  Sometimes 2-3 restarts are needed.
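
A crude loop like this is enough to catch the failures (sketch; it just
repeats a lookup against the local named every 30 seconds):

  while true; do
      date
      nslookup www.google.com 127.0.0.1 2>&1 | tail -2
      sleep 30
  done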



Comment 3 Mike A. Harris 2002-03-29 09:39:31 UTC
I just rebuilt bind 9.1.0 from RHL 7.1 on RHL 7.0 after removing the
dependency on tar, and replacing tar -j with bzcat piped to tar (sketch below).
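
The spec-file change amounted to roughly this (sketch; the exact tarball name
in the RHL 7.1 src.rpm may differ):

  # before (needs a tar with bzip2 support):
  #   tar xjf bind-9.1.0.tar.bz2
  # after (works with the older tar shipped in RHL 7.0):
  bzcat bind-9.1.0.tar.bz2 | tar xf -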

Results.... same thing.  So it appears this might be more than a bind issue,
perhaps a library issue or some such.  I don't know enough about bind to
debug the issue further, but I've discussed it with a few other people now
too, and they're having similar problems.  :o/


Comment 4 Mike A. Harris 2002-03-29 09:43:45 UTC
That didn't make much sense.. considering I am debugging the issue further...
I'll try to find out more tomorrow.

Comment 5 Karsten Hopp 2002-07-22 14:12:46 UTC
Any news on this one?

Comment 6 Daniel Walsh 2002-12-18 18:29:50 UTC
Is this still an open bug, or did you figure out the problem?

 It was just assigned to me.

Dan

Comment 7 Mike A. Harris 2002-12-18 20:21:58 UTC
This problem drove me completely nuts, to the point where 3 DNS experts (one
of whom was Bryce) couldn't fix it and couldn't determine what the problem
could be - complete bafflement.

As such, I just stopped using bind entirely, disabled local DNS, and started
using /etc/hosts on all machines, mirrored via cron, the good old-fashioned
1970s way.  I pointed DNS to my ISP's servers, and all problems went away for
quite some time.
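
The mirroring was nothing fancy, roughly this on each client (sketch; the
cron.hourly path is just an example, and 192.168.1.1 is the gateway shown in
the lookups above):

  #!/bin/sh
  # /etc/cron.hourly/sync-hosts: pull the master hosts file from the gateway
  scp -q root@192.168.1.1:/etc/hosts /etc/hosts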

Many many months later, I began having new DNS problems, in particular in
mozilla, and oddly - only from certain machines on my network.  Frustration
once again, and with many of the same symptoms as the problem described
here.  I was essentially unable to use the Internet properly while my whole
LAN seemed to work fine.  I began to suspect that something was wrong on my
firewall.

I investigated the configuration of pretty much everything on my firewall
and tested many things, all to no avail.  Couldn't find any problems.  Then
I checked /var/log/messages, and scanned it for anything even remotely
possible to be the culprit of the trouble I was having.
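
Something along these lines turns the relevant entries up (sketch; it is also
what produced the filename prefixes below):

  zgrep IP_MASQ /var/log/messages*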

Lo and behold.....

messages.2.gz:Oct 16 10:28:34 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:29:39 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:33:37 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34844).
messages.2.gz:Oct 16 10:34:12 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:37:39 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:37:48 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34846).
messages.2.gz:Oct 16 10:39:53 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:39:55 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:39:58 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34849).
messages.2.gz:Oct 16 10:40:00 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34848).
messages.2.gz:Oct 16 10:45:36 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34849).
messages.2.gz:Oct 16 10:45:43 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34857).
messages.2.gz:Oct 16 10:46:17 gw kernel: IP_MASQ:ip_masq_new(proto=UDP): could
not get free masq entry (free=34860).

IP Masquerading was failing for UDP due to a filled masquerade table.  But
for some odd reason, *only* on *certain* machines.  ARRRRGHHHHH!!  In other
words, a 2.2.x kernel bug (IMHO).  The solution was to reboot the machine.
The problem went away for a couple of months, then returned, and another
reboot solved it again.
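
For the record, the state of the masquerade table on a 2.2.x/ipchains box can
be inspected with something like this (sketch, from memory of the old tools):

  ipchains -M -L -n            # list the active masquerade entries
  cat /proc/net/ip_masquerade  # same information straight from the kernel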

I do not know for certain whether this kernel bug/issue was responsible for
the bind problem originally reported here, but it is entirely likely that it
was the cause at that point in time as well.

Since nobody else seems to have experienced this problem, I am considering it
a local issue now, due to the specifics of my own kernel (which is *cough*
homebrew *cough* from stock kernel.org sources).  I have retired my trusty
486 DX2/66 now, and plan on putting a newer, RHL 8.0-capable machine in its
place with iptables and a stock Red Hat kernel rather than the minimal kernel
I had no choice but to use on the 12MB 486.  ;o)

In short, I consider this issue closed due to kernel funkification.
Closing as WORKSFORME now.

