Bug 1132396 - [i686] dhcpd doesn't send infinite lease times
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: dhcp
Version: 20
Hardware: i686
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jiri Popelka
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-08-21 09:38 UTC by Jiri Popelka
Modified: 2015-06-30 01:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1097523
Environment:
Last Closed: 2015-06-30 01:07:50 UTC
Type: Bug
Embargoed:


Description Jiri Popelka 2014-08-21 09:38:44 UTC
+++ This bug was initially created as a clone of Bug #1097523 +++

Description of problem: Diskless clients with an NFS root lose their IP address after DHCP lease expiry. The clients then become unresponsive because they have lost their root filesystem. A reboot fixes this until the next expiry.


Version-Release number of selected component (if applicable):
Version dracut-network-034-64.git20131205.fc20.1.i686 sets lifetime to forever
dracut-network-037-11.git20140402.fc20.i686 sets scope dynamic and therefore expiry as per dhcp server

How reproducible:
The default DHCP lease expired after 4 hours; clients would lose their IP address at random after that.

Steps to Reproduce:
1. Boot NFS root client
2. ip addr
3. check if scope is dynamic

Actual results:
ip addr show dev enp4s0
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:25:22:9c:04:80 brd ff:ff:ff:ff:ff:ff
    inet 10.4.11.6/16 brd 10.4.255.255 scope global dynamic enp4s0
       valid_lft 2147430sec preferred_lft 2147430sec


Expected results:
ip addr show dev enp4s0
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:25:22:9c:04:80 brd ff:ff:ff:ff:ff:ff
    inet 10.4.11.6/16 brd 10.4.255.255 scope global enp4s0
       valid_lft forever preferred_lft forever


Additional info:
Symptoms were relieved by setting the following in dhcpd.conf:
    max-lease-time -1;
    min-lease-time -1;
This gives 2,000,000+ seconds before the IP address is lost.

--- Additional comment from Harald Hoyer on 2014-05-14 16:36:24 CEST ---

Wouldn't running a DHCP client on the machine refresh that IP? Isn't that what the DHCP protocol mandates?

--- Additional comment from Darryl Bond on 2014-05-14 23:19:59 CEST ---

Tried that, it doesn't prevent the IP disappearing on all occasions.
I assume they both time out together and there is a race between dhclient getting a new lease and the interface dropping the IP address. I originally had a 4-hour timeout on the DHCP lease and, of 250 clients, only 10% would lose the race, but this still generated plenty of complaints.

Note that my clients have static addresses, I don't ever change the addresses. I don't actually need DHCP except for initial boot.

Running NetworkManager for DHCP is out of the question anyway, as it drops the IP address when the client suspends, which is also not good. I want the thin clients to suspend when not in use, and that works very well when NetworkManager is not running.

--- Additional comment from Harald Hoyer on 2014-05-16 12:02:09 CEST ---

So you are suggesting that dracut should be fixed to set "forever" for -1.

well... 

new_dhcp_lease_time=4294967295
new_expiry=1400233859

are the values I get with
    max-lease-time -1;
    min-lease-time -1;

and with:

ip addr add 192.168.50.101/255.255.255.0 broadcast 192.168.50.255 valid_lft 4294967295 preferred_lft 4294967295 dev ens3

the result is:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.101/24 brd 192.168.50.255 scope global ens3
       valid_lft forever preferred_lft forever

which fits "forever"

at least on an x86_64 machine with:
dhcp-4.3.0-10.fc21.x86_64
iproute-3.14.0-2.fc21.x86_64
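
The mapping observed above, where a lifetime of 4294967295 is displayed as "forever", can be sketched as follows. This is only an illustrative model of the output shown in this report, not iproute2's code; `format_lifetime` and `INFINITE_LIFETIME` are hypothetical names.

```python
INFINITE_LIFETIME = 0xFFFFFFFF  # 4294967295, the value displayed as "forever"


def format_lifetime(seconds: int) -> str:
    """Render a lifetime the way it appears in the `ip addr` output in this report."""
    if seconds == INFINITE_LIFETIME:
        return "forever"
    return f"{seconds}sec"


print("valid_lft", format_lifetime(4294967295))  # infinite lifetime
print("valid_lft", format_lifetime(2147430))     # value seen on the i686 clients
```

Any value other than the all-ones pattern is treated as a finite countdown, which is why a truncated lifetime shows up as a number of seconds rather than "forever".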

--- Additional comment from Darryl Bond on 2014-05-17 09:46:18 CEST ---

The NFS root clients are i686. Maybe that's why they are only set to 2147430 seconds, or about 25 days.

I have been running Fedora ltsp thin clients since F10. The behaviour I have described where 10% of the clients would drop their IP address while people are using them running remote sessions is new. I only got a hint as to what the problem was when looking at ip addr rather than ifconfig.

Only started with kernel 3.13.8 built with the new dracut.

I have forced them to forever so I am right from here. It's going to catch other people though.

--- Additional comment from Harald Hoyer on 2014-05-20 10:24:36 CEST ---

So, the bug is the i686 dhclient reporting 2147430 instead of 4294967295.

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-20 10:49:29 CEST ---

This resembles bug #662254 that has been fixed some time ago.

--- Additional comment from Darryl Bond on 2014-05-20 23:45:09 CEST ---

No, I think the bug is that the behavior has changed. 

In the past the boot (dracut) ran dhclient once and set the interface valid_lft to forever. I assume that if you chose to run a dhclient after boot, then it would renew leases as required. If not, at least the machine would not lock up when the interface enforced the expiry.

I know that LTSP is not really a supported package in the standard repositories (it is there though) but there are people working on the package and using it.

I can assure you that there is nothing quite as useless as a NFS root client with no network connection. So 4hrs, 24 days or 10 years between lockups, it is still a lockup that didn't used to happen.

--- Additional comment from Jiri Popelka on 2014-05-21 09:37:17 CEST ---

(In reply to Darryl Bond from comment #7)
> In the past the boot (dracut) ran dhclient once and set the interface
> valid_lft to forever.

That was probably dhclient itself. Previously dhclient (dhclient-script specifically) hadn't specified valid_lft/preferred_lft when adding address, which means they were set to 'forever'.
This has changed with
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?id=e3ee5b17e9ffb46b1476ef97f532d6b81f2615dc
and
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?h=f20&id=566cad88e7ef11d5fae31b7934a3e6cbe58b5800

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-21 11:55:07 CEST ---

(In reply to Darryl Bond from comment #7)
> I can assure you that there is nothing quite as useless as a NFS root client
> with no network connection. So 4hrs, 24 days or 10 years between lockups, it
> is still a lockup that didn't used to happen.

There have been many discussions on this topic and there will certainly be more. Just added the topic to the network management microconference[1].

But to the specific issue...

1) It is possible to specify the infinite lease time on the server when I expect to boot NFS root filesystems.

default-lease-time infinite;
max-lease-time infinite;

2) You are specifying the lease time to -1 which is a value I don't see in the documentation at all. Could you please point me to the relevant information?

3) If you specify a finite lease time on the server, the client is generally expected to stop using the lease when it's expired. Root on NFS is a bit special here and Harald always advocated a workaround of using the DHCP lease as if it was static configuration (i.e. using infinite lifetime).

As dracut knows which connection/address is configured to connect to NFS, it could instruct the dhclient-script (a modified version of it) to use infinite lifetime or it could run the script just to get the lease and then override the lifetime in kernel. Any thoughts?

Also, Tom Gundersen is AFAIK going to address this in systemd-networkd which won't be using the dhclient-script nor dhclient itself.

[1] http://wiki.linuxplumbersconf.org/2014:network-management#topics

--- Additional comment from Darryl Bond on 2014-05-21 23:25:44 CEST ---

1) My dhcpd is an ancient V3.0.7 on Solaris. The infinite option is not available in that version. I'll get it updated.

2) The reference to -1
http://serverfault.com/questions/505300/isc-dhcp-infinite-lease-time
The Internet is the holder of all truth :)
When I tested it, it set the lease time to the same as if I put in a very large number (24 days), so I used it.

3) I understand the dilemma, but I believe that NFS Root should always set it to infinite. The DHCP server can refuse to hand out addresses that are still being used by using the ping test. 
I suppose the real problem is that there is a race between the interface expiry and the dhclient renewal. I hope that is addressed in systemd-networkd. I imagine that simply setting the interface to max-lease-time +10 would be the simple fix.

As I mentioned earlier, I couldn't use NetworkManager because of its behavior when sleeping. I never found out how to stop NetworkManager from dropping the interface when the machine started going to sleep.

I wrote a script and a systemd service file that ran ifup <interface> on boot. Note that this did not fix my problem due to the race between dhclient and the timeout on the interface.

--- Additional comment from Darryl Bond on 2014-05-22 07:53:22 CEST ---

I arranged for the upgrade to dhcpd 4.3.0
I set the lease times to infinite
    infinite-is-reserved on;
    max-lease-time infinite;
    default-lease-time infinite;

I commented out the line in the script that forced forever and rebooted.
The client still looks like 
valid_lft 2147430sec preferred_lft 2147430sec

Might it be an i686 thing?

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-22 09:34:19 CEST ---

(In reply to Darryl Bond from comment #10)
> I suppose the real problem is that there is a race between the interface
> expiry and the dhclient renewal.

I'm not aware of such a race condition. DHCP clients are expected to renew the configuration data *long* before it's expired. No padding should be needed to ensure DHCP renewal works correctly.
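
For reference, RFC 2131 gives default renewal (T1) and rebinding (T2) times of 0.5 and 0.875 of the lease duration, which is why a client normally attempts renewal well before expiry. A minimal sketch, assuming those defaults, no server-supplied overrides, and no randomization; the function name is illustrative:

```python
def default_timers(lease_seconds: int):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = lease_seconds // 2          # renewal: 50% of the lease
    t2 = lease_seconds * 7 // 8      # rebinding: 87.5% of the lease
    return t1, t2


lease = 4 * 60 * 60                  # the 4-hour lease mentioned in this report
t1, t2 = default_timers(lease)
print(f"lease={lease}s renew at T1={t1}s rebind at T2={t2}s")
```

With a 4-hour lease this leaves two hours between the first renewal attempt and expiry, provided a DHCP client is actually running and renewing.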

> As I mentioned earlier, I couldn't use Network Manager because of its
> behavior when sleeping.

Just curious. Did you test that NFS survives the sleep when not using NetworkManager? I haven't tested such and to be honest I wouldn't expect it to work.

> I wrote a script and a systemd service file that ran ifup <interface> on
> boot. Note that this did not fix my problem due to the race between dhclient
> and the timeout on the interface.

If you suspect a race condition there, could you please describe it carefully? In my opinion, if you get a lease that survives the boot time and dhclient requests the lease again (not sure whether it can renew properly), where is the race condition then?

--- Additional comment from Tom Gundersen on 2014-05-22 12:37:35 CEST ---

I agree that we should treat DHCP leases as infinite if they are required for the root fs (and never, ever drop the addresses, not when sleeping, not for any reason). In networkd we introduced "CriticalConnection=yes" for this purpose.

We should probably put up big fat scary warnings when we get a non-infinite lease for a critical connection, and an even scarier warning in the extremely unlikely case that this lease expires without being renewed (the latter we do, the former we don't).

Btw, the reason '-1' means 'infinity', is that the DHCP spec says so (or rather its unsigned equivalent '0xffffffff').
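
The equivalence between -1 and 0xffffffff is a property of 32-bit two's-complement encoding. A short sketch using only the standard `struct` module; the variable names are illustrative:

```python
import struct

# A signed 32-bit -1 and an unsigned 32-bit 0xffffffff share one bit pattern.
signed_bytes = struct.pack("!i", -1)
unsigned_bytes = struct.pack("!I", 0xFFFFFFFF)
print(signed_bytes.hex(), unsigned_bytes.hex())  # both are ffffffff

# Reading those four bytes back as unsigned gives the "infinite" lease value.
(lease_time,) = struct.unpack("!I", signed_bytes)
print(lease_time)  # 4294967295
```

So a configuration parser that stores -1 in a 32-bit field ends up with the same bytes as the infinite lease value, whether or not -1 is a documented setting.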

--- Additional comment from Harald Hoyer on 2014-05-22 12:48:08 CEST ---

dracut now sets the lifetimes, because it was requested in bug 1058519

--- Additional comment from Darryl Bond on 2014-05-22 13:13:19 CEST ---

NFS root clients reliably suspend and wake if NetworkManager isn't used. I tested using dhclient via ifup and, of course, setting the interface permanently. They will even wake-on-lan.

I am certain that setting the interface to forever solved the lockups while running dhclient did not. The fact that only 10% locked up each day indicates some kind of race.

I shall try to set up a test to get some data.

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-22 14:10:23 CEST ---

(In reply to Harald Hoyer from comment #14)
> dracut now sets the lifetimes, because it was requested in bug 1058519

Thank you for the information. With this in mind, if Darryl doesn't run dhclient at all, nothing will break, is that correct? Darryl, please confirm that the short (or finite) lifetime comes from your manually created dhclient service. In fact, you should probably avoid running dhclient if you rely on dracut to set the infinite lifetime.

(In reply to Tom Gundersen from comment #13)
> Btw, the reason '-1' means 'infinity', is that the DHCP spec says so (or
> rather its unsigned equivalent '0xffffffff').

Using -1 for 0xffffffff can only cause confusion, especially if it's not a documented value in the documentation for the configuration format.

--- Additional comment from Jiri Popelka on 2014-05-22 22:46:02 CEST ---

(In reply to Darryl Bond from comment #11)
> The client still looks like 
> valid_lft 2147430sec preferred_lft 2147430sec
> 
> Might it be an i686 thing?

I'm seeing this with iproute-3.14.0-2.fc20.i686:

# ip addr add 1.2.3.4/24 dev dummy0 valid_lft 12345678 preferred_lft 12345678 && ip addr show dummy0

inet 1.2.3.4/24 scope global dynamic dummy0
   valid_lft 2147483sec preferred_lft 2147483sec
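
One observation on the number itself: 2147483 is the largest signed 32-bit integer divided by 1000, rounded down. That would be consistent with a seconds value being capped so that a later multiplication by 1000 cannot overflow a signed 32-bit long. This is an inference from the arithmetic, not a confirmed diagnosis, and the scale factor of 1000 is an assumption; the sketch below only checks the numbers.

```python
INT32_MAX = 2**31 - 1            # 2147483647
SCALE = 1000                     # assumed scale factor; not confirmed by this report

cap = INT32_MAX // SCALE
print(cap)                       # 2147483

# The value from the original report is slightly below the cap, as expected
# if the lifetime had already been counting down for a short while.
reported = 2147430
print(cap - reported)            # seconds elapsed under that assumption
print(cap / 86400)               # the cap expressed in days (just under 25)
```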

--- Additional comment from Darryl Bond on 2014-05-22 23:31:42 CEST ---

> Thank you for the information. With this in mind, if Darryl doesn't 
> run dhclient at all, nothing will break, is that correct? Darryl, 
> please confirm that the short (or finite) lifetime comes from your 
> manually created dhclient service. In fact you should probably 
> avoid running dhclient if you rely on dracut to set the infinite lifetime.
1. The default LTSP build runs dhclient once in dracut
2. As of dracut-network-037-11.git20140402.fc20.i686 the valid_lft is set to that supplied by dhclient; prior to that it was set to forever.
3. The clients now randomly lock up as the interface is lost. 
4. Only 10% per day with 4hr dhcp lease times on the server. So it is not that every client locks up at the expiry of the lease time.
5. At this point, I had no idea what was causing the lockups, only that they had started after a particular kernel update. Upgraded kernel several more times to see if there was some kind of regression.
6. Spent some days trying to bisect the lockups and found that ip addr gave me a hint.
7. Ran dhclient, did not stop the lockups
8. Set dhcp expiry to -1, clients stopped locking up. 
9. Saw that the expiry would be 24 days, so the problem was delayed, not solved.
10. Added early boot script to force valid_lft to forever. Problem for my case is solved, others not so much.

--- Additional comment from Fedora Admin XMLRPC Client on 2014-08-20 10:54:10 CEST ---

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

--- Additional comment from Pavel Šimerda (pavlix) on 2014-08-20 21:39:43 CEST ---

Jiri, Darryl, can you confirm that the issue is in iproute? What is the exact way to reproduce that an infinite lifetime doesn't get set?

--- Additional comment from Darryl Bond on 2014-08-21 01:00:40 CEST ---

See description at top for how to reproduce.
I have no idea if the issue is in iproute. As far as I can tell it is in dracut dhclient config.

The problem is resolved for me by manually forcing valid_lft via a systemd service. I have not had a lockup since doing this.

--- Additional comment from Jiri Popelka on 2014-08-21 09:34:25 CEST ---

Back when I was investigating this, it seemed that there were actually two bugs, one in dhcpd and a second in ip, both on i686 only.

(1) i686 dhcpd:
 - set lease_time to infinite (0xffffffff, 4 294 967 295)
 - run dhcpd & dhclient and check (tcpdump) what lease time is sent
 - from tcpdump:  Lease-Time Option 51, length 4: 738 878 298

(2) i686 ip: (see comment #17)
 - try to add an address with value of valid_lft/preferred_lft between 2147483 and 4294967295
 - valid_lft/preferred_lft is always set to 2147483

I can clone this bug to further investigate (1) and leave this one to you for (2).
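
For step (1), the lease time travels in DHCP option 51 as a 4-byte unsigned integer in network byte order. A minimal sketch of decoding that option from raw bytes, to compare the infinite value with the one seen in tcpdump; the helper name and the hand-built byte strings are illustrative, not captured traffic:

```python
import struct

OPTION_LEASE_TIME = 51


def parse_lease_time(option: bytes) -> int:
    """Decode a single DHCP option 51 given as code, length, value bytes."""
    code, length = option[0], option[1]
    if code != OPTION_LEASE_TIME or length != 4:
        raise ValueError("not a 4-byte lease time option")
    (seconds,) = struct.unpack("!I", option[2:6])
    return seconds


# Hand-built examples (not captured traffic).
infinite = bytes([51, 4]) + struct.pack("!I", 0xFFFFFFFF)
observed = bytes([51, 4]) + struct.pack("!I", 738878298)

print(parse_lease_time(infinite))   # what an infinite lease should carry
print(parse_lease_time(observed))   # the value reported from the i686 dhcpd
```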

Comment 1 Fridolín Pokorný 2014-09-04 13:39:10 UTC
This issue is caused in server/dhcp.c:2110:

TIME lease_time;
...
/* Enforce the maximum lease length. */
if (lease_time < 0 /* XXX */
    || lease_time > max_lease_time)
       lease_time = max_lease_time;

TIME is time_t. On x86_64, sizeof(time_t) == 8 (lease_time == 4294967295), but on i686 Fedora, sizeof(time_t) == 4 (lease_time == -1). The lease_time < 0 check is therefore true on i686, and lease_time is reduced to max_lease_time.

Possible bugfix is to use more suitable type for lease_time (e.g. uint32_t or a cast).
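
The width dependence can be modelled outside of dhcpd. The sketch below uses Python's `ctypes` fixed-width integers to stand in for a 32-bit and a 64-bit signed time_t; it only models the sign check from the excerpt and is not the dhcpd code. The helper names are hypothetical.

```python
import ctypes

INFINITE_LEASE = 0xFFFFFFFF  # infinite lease time, as a 32-bit unsigned value


def as_signed(value: int, bits: int) -> int:
    """Reinterpret value as a signed integer of the given width (32 or 64 bits)."""
    ctype = {32: ctypes.c_int32, 64: ctypes.c_int64}[bits]
    return ctype(value).value


def negative_check_fires(bits: int) -> bool:
    """True if `lease_time < 0` holds when time_t has the given width."""
    return as_signed(INFINITE_LEASE, bits) < 0


for bits in (32, 64):
    print(bits, as_signed(INFINITE_LEASE, bits), negative_check_fires(bits))

# With an unsigned 32-bit type, the suggested fix, the value is never negative.
print(ctypes.c_uint32(INFINITE_LEASE).value)
```

With a 32-bit signed type the infinite value reads as -1 and the check fires; with a 64-bit signed type or a 32-bit unsigned type it stays 4294967295.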

Comment 2 Pavel Šimerda (pavlix) 2014-09-05 06:33:35 UTC
(In reply to Fridolín Pokorný from comment #1)
> Possible bugfix is to use more suitable type for lease_time (e.g. uint32_t
> or a cast).

+1, time_t is not the right type for 32bit unsigned lease times. On the other hand, if you allow 0xffffffff, there's no point in checking whether the value of type uint32_t is in the interval from zero to 0xffffffff.

Comment 3 Pavel Šimerda (pavlix) 2014-09-08 07:55:35 UTC
Fridolín contacted upstream.

Comment 4 Fedora End Of Life 2015-05-29 12:41:39 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Fedora End Of Life 2015-06-30 01:07:50 UTC
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

