Bug 146600 - dhclient allows reverting to invalid subnet ip address
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: dhcp
Version: 3
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: David Cantrell
Reported: 2005-01-30 07:17 EST by Trevor Cordes
Modified: 2007-11-30 17:10 EST
Fixed In Version: FC5
Doc Type: Bug Fix
Last Closed: 2006-11-27 15:02:38 EST

Attachments
/var/log/messages output from dhclient (1.19 KB, text/plain)
2005-01-30 07:18 EST, Trevor Cordes
tcpdump packet trace of bootps/pc traffic (2.11 KB, text/plain)
2005-02-05 00:57 EST, Trevor Cordes
log showing dhclient traffic after which there was no connectivity (856 bytes, text/plain)
2005-02-10 14:34 EST, Trevor Cordes
corrupted leases file (6.46 KB, text/plain)
2005-02-23 15:25 EST, Trevor Cordes

Description Trevor Cordes 2005-01-30 07:17:02 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041107 Firefox/1.0

Description of problem:
If dhclient cannot reach a DHCP server it will go through previous
leases and pick one to revert to.  In certain circumstances it can
pick an address that is currently being used on another subnet.  This
causes loss of connectivity and other nasty problems.

Here's what happens:

1. I set up FC3 server boxes (call them "kids") in advance and yum
update them on my own private network (192.168.*) during which time
the kids will get a dhcp address of 192.168.* from my dhcp server for
eth0.

2. I take the kids on-site to put them in production.  There they plug
into cable modems and get dhcp addresses from the ISP dhcp server on
eth0 (24.76.*)

3. The kids in "live mode" also now have a live intranet interface,
eth1, on subnet 192.168.*.

4. If the ISP's DHCP server doesn't respond to a future request,
dhclient reverts to the original 192.168.* address from step 1.  Now
the box has 2 interfaces on the same subnet!  Chaos ensues and the box
loses all internet connectivity yet sits there thinking everything is
100% ok.  A reboot usually fixes it because by then the ISP's DHCP
server is responding again.

This sequence of events has occurred at 2 different locations I
administer in the past 2 weeks.  This has never occurred before on
over a dozen systems I administer so I *think* that something has
changed between FC1 and FC3's dhclient.  Alternatively, my ISP's DHCP
server simply has become more unreliable.

If this problem occurs, a reboot or an ifdown/ifup eth0 will fix it
(usually).

The problem can be mitigated completely by
rm /var/lib/dhcp/dhclient-*.leases; ifdown eth0; ifup eth0
which will clear the 192.168.* from dhclient's "memory".

The real solution is to have dhclient check any possible revert
addresses against the current subnets used on other interfaces and
ignore those.
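
As an illustration only, the proposed check could be sketched in POSIX
shell as below.  The helper names ip_to_int and same_subnet are
hypothetical and exist in neither dhclient nor dhclient-script.  The
addresses are example values from this report, the 255.255.255.0 mask
is an assumption, and a real fix would have to read the other
interfaces' addresses at run time.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of dhclient.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# same_subnet ADDR1 ADDR2 MASK: succeed if both addresses share a subnet.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( $a & $m )) -eq $(( $b & $m )) ]
}

# Candidate revert address for eth0 vs. the address configured on eth1
# (example values; the mask is assumed).
if same_subnet 192.168.100.54 192.168.100.1 255.255.255.0; then
    echo "skip lease: subnet already in use on another interface"
fi
```

With these example values the candidate lease would be skipped, which
is the behaviour asked for above.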


Version-Release number of selected component (if applicable):
dhclient-3.0.1-30_FC3

How reproducible:
Sometimes

Steps to Reproduce:
1. see above

Actual Results:  dhclient chose an invalid ip address to revert to

Expected Results:  dhclient should have chosen a valid ip address to
revert to or failed with an error leaving the interface down

Additional info:
Comment 1 Trevor Cordes 2005-01-30 07:18:02 EST
Created attachment 110410
/var/log/messages output from dhclient
Comment 2 Jason Vas Dias 2005-01-30 14:12:06 EST
dhclient operation is based on the premise that the subnet it is on
and the DHCP server it uses do not change between writing
/var/lib/dhcp/dhclient-$IF.leases and potentially using its contents
to revert to a previously valid UNEXPIRED lease if the DHCP server
cannot be contacted.
When moving the box to a different subnet / DHCP server, remove
/var/lib/dhcp/dhclient-$IF.leases.
When the dhclient attempts to gain a new lease for an interface,
the first thing it does is check if there are any unexpired leases
for that interface in the dhclient-$IF.leases file - if so, it
issues a RENEW request to the server that issued the lease.
Is there any way that your 192.168.* DHCP server can be reached
from the interface on which you expect the 24.76.* DHCP server
to respond? If so, this is the source of the problem, because the
old DHCP server is still reachable, and the unexpired leases on the
192.168.* subnet will be renewed.
Nothing has changed in dhclient's behaviour in this regard. Yes,
it must be that the new DHCP servers are not responding reliably.
I will work on a fix for this that will not consider leases valid
for reversion if the dhcp-server-identifier has changed, or if 
another interface is configured on the same subnet.

 
Comment 3 Trevor Cordes 2005-01-31 12:42:41 EST
There is no way the box once deployed can contact the DHCP server that
it got the original 192.168.* lease from.  Besides, the box (my
iptables and routing rules) (and the ISP's routers, I believe) don't
allow 192.168.* packets to be sent out that external/internet interface.

However, there is a DHCP server running on the box once deployed, and
it listens only on 192.168.* (eth1).  Perhaps dhclient is somehow
contacting the DHCP server on the same box on the wrong interface? 
But the log shows no dhcp _server_ output that would be associated
with issuing a lease.

The addresses of the DHCP servers used to get 192.168 leases are
always 192.168.100.1 because my DHCP server when setting up the box is
192.168.100.1 and then once deployed the new box is assigned the eth1
address of 192.168.100.1.

I understand that this is a rare scenario but the fix logic is so
simple.  I was going to try to fix it myself but I see dhclient is a
binary and not a shell script, and I'm pretty hopeless at C.
Comment 4 Jason Vas Dias 2005-01-31 16:28:03 EST
I really think the simplest solution here is to remove the 
/var/lib/dhcp/dhclient*.leases file before shipping the "kid"
machines.
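
Before shipping a box, it may help to see what is still recorded in
the leases file rather than deleting it blindly.  A minimal sketch,
assuming the usual leases format in which each lease stanza carries a
"fixed-address" line; the function names are hypothetical, and the
192.168.* pattern matches the setup network described in this report:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not shipped with dhclient.

# list_lease_addrs FILE: print every fixed-address recorded in FILE.
list_lease_addrs() {
    awk '$1 == "fixed-address" { sub(/;$/, "", $2); print $2 }' "$1"
}

# has_private_lease FILE: succeed if any recorded address is 192.168.*.
has_private_lease() {
    list_lease_addrs "$1" | grep -q '^192\.168\.'
}

# Report leases files that still remember the setup network.
for f in /var/lib/dhcp/dhclient-*.leases; do
    [ -f "$f" ] || continue
    if has_private_lease "$f"; then
        echo "stale 192.168.* lease recorded in $f"
    fi
done
```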

The problem is that the configuration scripts will always bring
up eth0 before eth1, so it is not actually dhclient that brings
up the interface on a duplicate subnet; it is the ifup* scripts
for eth1, which has a static IP of 192.168.100.1?

I've just been trying to reproduce this problem, but cannot:
I get a dhclient lease on eth0 on 172.16.80/22, then I bring 
down the interfaces, bring up a dhcp server for 172.16.80/22
on eth1 using the same lease address that eth0 had, and then
bring up dhclient on eth0, which is not connected to a dhcp
server - eth0 is never configured with any IP address.

dhclient will NEVER configure a lease IP address on an interface
unless it gets a positive DHCPACK response from a DHCP server.
dhclient ONLY sends packets on the interface it was invoked to
configure, not on any other interface.

The problem, I think, is that the cable modem DHCP server will always
agree to renew the lease for the 192.168.100 subnet (at least mine
at home does). This is one of the ways in which they attempt to
discourage more than one host from getting a usable real IP address
lease.

Also there is nothing in the ip subsystem, configuration scripts
or tools (eg. ifconfig) that prevents two interfaces being on the
same subnet, or even having the same address.

I think if you do a tcpdump on a kid machine when it is first 
attached to the router, eg. with the command
  'tcpdump -nl -i eth0 -vvv -s 2048 port bootpc or port bootps'
you will see that it is performing a DHCPREQUEST for 192.168.100.1
and receives a DHCPACK for this address from a DHCP server. 




 
Comment 5 Trevor Cordes 2005-02-01 10:02:19 EST
You are correct: I forgot eth1 will come up after eth0 and so it would
not be easy for eth0 dhclient to discern that 192.168.100 is used on
another interface.  I guess it could look at the config file, but that
won't work if eth1 also uses dhcp.

Yes, eth1 uses a static IP of 192.168.100.1

As for reproducibility, when it happened the first time on one kid I
thought it was a weird fluke.  I run over a dozen of these systems and
it never happened before in 3 years.  Then a week later another system
did the same thing!  When I lose internet connectivity to these
off-site headless boxes it is really a big disaster so I started
taking the issue seriously.

You could be correct in that the kid got a DHCPACK from the ISP. 
Perhaps that is the change (a policy at the ISP) that is triggering
the issue.  However, the ISP in question allows (in the contract) 2
DHCP IPs to be assigned (and we only ever use 1).  They used to
technically allow unlimited IPs though it was against the contract.

A thought: how could dhclient accept a DHCPACK from the ISP
(24.something) for the stale IP addr when the dhcp-server-identifier
was a mismatch (the old one being 192.168.100.1, the ISP one being
24.something)?

Since the workaround, deleting the file, is easy and since detecting
duplicate subnet interfaces is more difficult than I had first
thought, perhaps there is nothing to be fixed in dhclient.  However,
until someone hits this bug they will not know to delete the file and
could get bitten by it with loss of connectivity.

What about the idea that it should be dhclient's job to delete that
file?  You said:

"dhclient operation is based on the premise that the subnet it is on
and the DHCP server it uses do not change between writing
/var/lib/dhcp/dhclient-$IF.leases and potentially using its contents
to revert to a previously valid UNEXPIRED lease if the dhcp server
cannot be contacted."

Perhaps that premise should be enforced by removing old subnet leases
when it finds itself on a new subnet?  But wouldn't that break a
2-site laptop scenario where it is perfectly legit for the subnet to
alternate between, say, home & office?

Lastly, I was incorrect when I said I am blocking outgoing 192.168.*
traffic with IP tables -- it turns out I'm only blocking _incoming_
192.168.* traffic.  I will probably add in rules to block outgoing and
perhaps that would have avoided this issue.
Comment 6 Jason Vas Dias 2005-02-01 15:04:45 EST
The simplest fix for this is still to remove the
/var/lib/dhcp/dhclient-*.leases file before moving 
the boxes to connect to new DHCP server.

RE: dhcp-server-identifier: these are not used by the
client to verify that a DHCPACK is OK and configure the
lease - if they were, it would break failover.

RE: "dhclient operation is based on the premise..."
I should have said that it is based on the premise that
if it has an unexpired lease in its leases file, then
any servers it may contact regarding extending / renewing
that lease will be part of the same system that granted
the lease. 

RE: 
"
Perhaps that premise should be enforced by removing old subnet leases
when it finds itself on a new subnet?  But wouldn't that break a
2-site laptop scenario where it is perfectly legit for the subnet to
alternate between, say, home & office?
" : Yes - dhcp servers are allowed to change a client's subnet.

If you want more of a fix than just removing the dhclient*.leases
file, it would be great if you could get a tcpdump of the dhcp
traffic that causes the problem - ie. before bringing up the interface
that connects to the cable router, do:
  # ifconfig eth0 0.0.0.0 up
  # tcpdump -i eth0 -nl -vvv -s 2048 port bootps or port bootpc 2>&1 |
    tee /tmp/tcpdump.log &
  # ifup eth0
  ( wait for bad IP address to be configured on eth0 )
  # pkill tcpdump
and attach the /tmp/tcpdump.log to this bug.

Comment 7 Trevor Cordes 2005-02-05 00:56:04 EST
Hi again.  I successfully recreated the issue on my own site's
production router.  I planted a bogus entry in
/var/lib/dhcp/dhclient-eth0.leases with an IP of 192.168.100.202
(which I know is not taken but valid for the DHCP range).

I did exactly the commands you specified above.

I'll attach the tcpdump file.  I changed my real IP in the dump to
24.76.2.2 to give me some privacy.  Strange, there is no indication of
response from ANY dhcp server yet the ifup assigned me the bogus .202
address.

I got some different and interesting syslog output this time including
this strange line:

Feb  4 23:40:42 pog kernel: MASQUERADE: Route sent us somewhere else.

I do run netfilter MASQUERADE on this box.  I think the above line is
a side effect of tcpdump being run during the process.
Comment 8 Trevor Cordes 2005-02-05 00:57:05 EST
Created attachment 110689
tcpdump packet trace of bootps/pc traffic
Comment 9 Trevor Cordes 2005-02-10 14:32:53 EST
I should mention that during the test I physically unplugged the eth0 LAN cable
(from the cable modem) to simulate my ISP's DHCP server not responding.

Also, I just had another hit on this problem (or one suspiciously like it) on
the same box that had the first hint of this problem.  This time the system
didn't get a response from the DHCP server and then reverted to its *proper* IP
address (not some 192.* one).  However from that point on it had lost all
internet connectivity and my normal recovery scripts couldn't revive it (took a
reboot by someone onsite).  See the attached log.

This has got to be indicative of a bigger problem: why would I lose
connectivity when it reverted to the proper previous IP?

Note: I have an extremely complex iptables script that I use but I log all drops
and none are showing up when this DHCP stuff is occurring so I don't think
that's the issue.
Comment 10 Trevor Cordes 2005-02-10 14:34:43 EST
Created attachment 110931
log showing dhclient traffic after which there was no connectivity
Comment 11 Trevor Cordes 2005-02-10 14:38:09 EST
I should also mention I figured out why this started occurring recently: my ISP
appears to have changed the lease renewal time from 1-2 weeks to 4 hours!  I bet
their DHCP server is way overloaded and that's why it often doesn't respond.
Comment 12 Jason Vas Dias 2005-02-11 18:24:03 EST
Sorry for the delay in getting back to you.

I think I may have a fix for you:
Yes, after dhclient has determined that no servers can be contacted,
it will revert to the last unexpired lease, which in your case is
from the wrong DHCP server, by invoking the dhclient-script in 
timeout mode.

But dhclient does check that the router is still contactable, and that
the address to be configured is not in use. The problem was that
dhclient-script used a 'ping' to verify that the router is still
contactable, but did not supply the '-I' option to make ping send
only on one interface; the ping used the current routes, which now
direct the packet to the wrong interface. If the ping had been
sent only on the interface being configured, the router would not be
contactable and the bad address would not be configured.

dhcp-3.0.1-33_FC3 will now ping routers with the -I option so that
the pings are only sent on the interface being configured. Please
test it out - you can download it from:
  http://people.redhat.com/~jvdias/DHCP/FC3/
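
The idea behind the change can be sketched as follows.  This is not
the actual dhclient-script code: router_reachable and the PING
override variable are hypothetical names, and the count and deadline
values are assumptions.  Only the -I option is the point.

```shell
#!/bin/sh
# Hypothetical sketch; not the real dhclient-script.

# PING can be overridden so the check can be exercised without a network.
PING=${PING:-ping}

# router_reachable IFACE ROUTER: succeed only if ROUTER answers a single
# ping sent out through IFACE, regardless of the current routing table.
router_reachable() {
    "$PING" -q -c 1 -w 10 -I "$1" "$2" >/dev/null 2>&1
}

# Intended use while testing a recorded lease on eth0 (not run here):
#   if router_reachable eth0 "$router"; then
#       ... configure the recorded lease ...
#   fi
```

Without -I, the ping for a 192.168.100.* router can be answered via
eth1, so the stale lease looks valid even though nothing on eth0
responds.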

You might also investigate putting the 'PERSISTENT_DHCLIENT' option
in /etc/sysconfig/network-scripts/ifcfg-$IF. By default, dhclient
is run in 'one-shot' mode with the -1 option, so after it determines
no DHCP server is contactable, it tries to revert to an unexpired
lease, and if this fails, it gives up and exits. The
PERSISTENT_DHCLIENT option makes it start again from the beginning
until a DHCP server is contacted.

Also see 'man dhclient.conf' and consider adjusting the timeout
parameters in /etc/dhclient.conf. By default, the whole sequence
of deciding a DHCP server is not contactable takes 5 minutes - by
setting the timeout parameters, this can be much quicker - I use:
   timeout 10;
in /etc/dhclient.conf.

Please let me know if dhcp-3.0.1-33_FC3 fixes your problem - thanks!

Comment 13 Trevor Cordes 2005-02-23 03:23:02 EST
Sorry, it doesn't seem to have helped.  I installed the dhcp and
dhclient bin rpms from your link.  I then repeated my bogus-entry test
and it did the same thing as before.

This time I got some debug output I never saw before.  Perhaps you
didn't redirect the output from the ping?

#ifup eth0

Determining IP information for eth0...PING 24.76.180.1 (24.76.180.1) 56(84) bytes of data.

--- 24.76.180.1 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms
, pipe 4
PING 192.168.100.1 (192.168.100.1) 56(84) bytes of data.

--- 192.168.100.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.068/0.068/0.068/0.000 ms, pipe 2
 done.

The syslog output:

Feb 23 02:02:17 pog dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Feb 23 02:02:25 pog dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Feb 23 02:02:42 pog dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
Feb 23 02:02:49 pog dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Feb 23 02:03:04 pog dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 24
Feb 23 02:03:28 pog dhclient: No DHCPOFFERS received.
Feb 23 02:03:28 pog dhclient: Trying recorded lease 24.76.123.456
Feb 23 02:03:31 pog dhclient: receive_packet failed on eth0: Network is down
Feb 23 02:03:31 pog dhclient: Trying recorded lease 192.168.100.54
Feb 23 02:03:31 pog dhclient: bound: renewal in 238191 seconds.
Feb 23 02:03:31 pog dhclient: receive_packet failed on eth0: Network is down
Feb 23 02:10:17 pog dhclient: receive_packet failed on eth0: Network is down

I noticed that there is still an old dhclient process running on my
system.  Should I have killed that first, or perhaps rebooted?

#ps -ef | grep dhcli
root      5462     1  0 Feb15 ?        00:00:00 /sbin/dhclient -1 -q -lf /var/lib/dhcp/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root     18387     1  0 02:03 ?        00:00:00 /sbin/dhclient -q -cf /etc/dhclient-eth0.conf -lf /var/lib/dhcp/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0

Also interesting: I now have in my ifcfg-eth0:

PERSISTENT_DHCLIENT=yes

yet dhclient just tried the once then quit.  I left it for 5 mins with
no activity then plugged the LAN cable back in.  5 more mins and no
auto-retry (and the ifup had exited already).  I had to run
'service network restart' to get it back up (these tests seem to
hose lo).

I also tweaked the timeouts as you suggested.  Thanks for the RTFM
pointer, I didn't know about that manpage.  From the section on how it
deals with old leases, it sounds like exactly how we think it should run.

I just thought of something else: if we set dhclient up to be
persistent and not return until it succeeds... then this is great for
when a system is already up.  But what about when the system is
booting?  Might it get stuck on bringing up eth0 and never go on if
the DHCP server cannot be reached?  In that case I would *definitely*
*need* to have the system come up even without an eth0 so that the
system could still be accessible on the internal network.  I can't
have a system non-bootable because of a downed ISP DHCP server or a
flaky NIC/cable/modem/iptables/whatever on eth0!  If this is the case
then I cannot use the persistent option... but that's ok because I
have watchdog scripts that would keep retrying the ifup until it
succeeds (albeit with a much longer retry interval).
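
For illustration, the retry part of such a watchdog could look roughly
like this.  It is a sketch, not my actual scripts: retry_until is a
hypothetical helper, the counts and delays are made-up values, and it
assumes the wrapped command returns non-zero on failure.

```shell
#!/bin/sh
# Hypothetical helper for illustration.

# retry_until TRIES DELAY CMD...: run CMD until it succeeds, at most
# TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
    tries=$1
    delay=$2
    shift 2
    n=0
    while [ "$n" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        n=$(($n + 1))
        if [ "$n" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Intended use from cron or a daemon loop (not run here):
#   retry_until 12 300 /sbin/ifup eth0
```

Because the retries happen outside the boot sequence, the box still
comes up on eth1 even when eth0 cannot get an address.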

Thanks for the help!  I feel confident we are now on the right track.
I'm glad to help to make sure no one else hits this obscure bug.
Comment 14 Trevor Cordes 2005-02-23 15:24:47 EST
Something very strange just happened after my testing last night.
Today I noticed that /var/lib/dhcp/dhclient-eth0.leases seems
corrupted.  There are weird control chars (nulls?) right before one of
the "leases" stanzas.  I have never seen this before.  See attached.
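
A quick way to check a leases file for embedded NUL bytes, as a
minimal sketch (has_nul_bytes is a hypothetical helper name; it only
detects NULs, not other control characters):

```shell
#!/bin/sh
# Hypothetical helper for illustration.

# has_nul_bytes FILE: succeed if FILE contains at least one NUL byte.
# Strips NULs and compares with the original; any difference means
# NULs were present.
has_nul_bytes() {
    ! tr -d '\000' < "$1" | cmp -s - "$1"
}

# Check every dhclient leases file that exists on this box.
for f in /var/lib/dhcp/dhclient-*.leases; do
    [ -f "$f" ] || continue
    if has_nul_bytes "$f"; then
        echo "NUL bytes found in $f"
    fi
done
```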
Comment 15 Trevor Cordes 2005-02-23 15:25:47 EST
Created attachment 111347
corrupted leases file
Comment 16 Matthew Miller 2006-07-10 18:06:38 EDT
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!
Comment 17 Trevor Cordes 2006-11-22 12:44:52 EST
I have not seen this behaviour since my posts in 2005.  I have since upgraded
all boxes to FC5.  If I see it again I will reopen here.
Comment 18 David Cantrell 2006-11-27 15:02:38 EST
(In reply to comment #17)
> I have not seen this behaviour since my posts in 2005.  I have since upgraded
> all boxes to FC5.  If I see it again I will reopen here.
> 

Thanks, closing this bug.
