Bug 1097523 - iproute i686 can't set large finite ip address lifetimes
Summary: iproute i686 can't set large finite ip address lifetimes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: iproute
Version: rawhide
Hardware: i686
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pavel Šimerda (pavlix)
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-05-14 04:24 UTC by Darryl Bond
Modified: 2015-03-17 15:01 UTC
CC List: 11 users

Fixed In Version:
Clone Of:
Clones: 1132396
Environment:
Last Closed: 2015-03-17 15:01:52 UTC
Type: Bug
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 662254 0 low CLOSED dhclient fails to renew lease; lease-time of 0xffffffff (infinity) causes exit 1 2021-02-22 00:41:40 UTC

Internal Links: 662254

Description Darryl Bond 2014-05-14 04:24:02 UTC
Description of problem: Diskless clients with NFS root lose their IP address after DHCP expiry. The clients then become unresponsive because they have lost their root filesystem. A reboot fixes it until the next expiry.


Version-Release number of selected component (if applicable):
Version dracut-network-034-64.git20131205.fc20.1.i686 sets lifetime to forever
dracut-network-037-11.git20140402.fc20.i686 sets scope dynamic and therefore expiry as per dhcp server

How reproducible:
The default DHCP lease expired after 4 hours; clients would lose their IP address at random times after that.

Steps to Reproduce:
1. Boot NFS root client
2. ip addr
3. check if scope is dynamic

Actual results:
ip addr show dev enp4s0
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:25:22:9c:04:80 brd ff:ff:ff:ff:ff:ff
    inet 10.4.11.6/16 brd 10.4.255.255 scope global dynamic enp4s0
       valid_lft 2147430sec preferred_lft 2147430sec


Expected results:
ip addr show dev enp4s0
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:25:22:9c:04:80 brd ff:ff:ff:ff:ff:ff
    inet 10.4.11.6/16 brd 10.4.255.255 scope global enp4s0
       valid_lft forever preferred_lft forever


Additional info:
Symptoms are relieved by setting the following in dhcpd.conf:
    max-lease-time -1;
    min-lease-time -1;
This gives 2,000,000+ seconds before the IP address is lost.

Comment 1 Harald Hoyer 2014-05-14 14:36:24 UTC
Wouldn't running a DHCP client on the machine refresh that IP? Isn't that what the DHCP protocol mandates?

Comment 2 Darryl Bond 2014-05-14 21:19:59 UTC
Tried that, it doesn't prevent the IP disappearing on all occasions.
I assume they both time out together and there is a race between dhclient getting a new address and the interface dropping the IP address. I originally had a 4-hour timeout on the DHCP address and, of 250 clients, only 10% would lose the race, but this still generated plenty of complaints.

Note that my clients have static addresses; I never change them. I don't actually need DHCP except for the initial boot.

Running NetworkManager for DHCP is out of the question anyway, as it drops the IP address when the client suspends, which is also not good. I want the thin clients to suspend when not in use, and that works very well when NetworkManager is not running.

Comment 3 Harald Hoyer 2014-05-16 10:02:09 UTC
So you are suggesting that dracut should be fixed to set "forever" for -1?

well... 

new_dhcp_lease_time=4294967295
new_expiry=1400233859

are the values I get with
    max-lease-time -1;
    min-lease-time -1;

and with:

ip addr add 192.168.50.101/255.255.255.0 broadcast 192.168.50.255 valid_lft 4294967295 preferred_lft 4294967295 dev ens3

the result is:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:00 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.101/24 brd 192.168.50.255 scope global ens3
       valid_lft forever preferred_lft forever

which fits "forever"

at least on an x86_64 machine with:
dhcp-4.3.0-10.fc21.x86_64
iproute-3.14.0-2.fc21.x86_64

Comment 4 Darryl Bond 2014-05-17 07:46:18 UTC
The NFS root clients are i686. Maybe that's why the lifetime is only set to 2147430 seconds, or about 25 days.
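
As a rough check of that figure, the conversion below turns the reported lifetime into days and compares it with the original 4-hour lease. It is only illustrative arithmetic, and the variable names are made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

reported_lft = 2147430          # valid_lft shown by "ip addr" on the i686 client
original_lease = 4 * 60 * 60    # the earlier 4-hour DHCP lease, in seconds

days = reported_lft / SECONDS_PER_DAY
ratio = reported_lft / original_lease

print(f"{reported_lft} s is about {days:.2f} days")
print(f"that is roughly {ratio:.0f} times the 4-hour lease")
```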

I have been running Fedora ltsp thin clients since F10. The behaviour I have described where 10% of the clients would drop their IP address while people are using them running remote sessions is new. I only got a hint as to what the problem was when looking at ip addr rather than ifconfig.

It only started with kernel 3.13.8, built with the new dracut.

I have forced them to forever, so I am all right from here. It's going to catch other people though.

Comment 5 Harald Hoyer 2014-05-20 08:24:36 UTC
So, the bug is the i686 dhclient reporting 2147430 instead of 4294967295.

Comment 6 Pavel Šimerda (pavlix) 2014-05-20 08:49:29 UTC
This resembles bug #662254 that has been fixed some time ago.

Comment 7 Darryl Bond 2014-05-20 21:45:09 UTC
No, I think the bug is that the behavior has changed. 

In the past the boot (dracut) ran dhclient once and set the interface valid_lft to forever. I assume that if you chose to run a dhclient after boot, then it would renew leases as required. If not, at least the machine would not lock up when the interface enforced the expiry.

I know that LTSP is not really a supported package in the standard repositories (it is there though) but there are people working on the package and using it.

I can assure you that there is nothing quite as useless as an NFS root client with no network connection. So 4hrs, 24 days or 10 years between lockups, it is still a lockup that didn't use to happen.

Comment 8 Jiri Popelka 2014-05-21 07:37:17 UTC
(In reply to Darryl Bond from comment #7)
> In the past the boot (dracut) ran dhclient once and set the interface
> valid_lft to forever.

That was probably dhclient itself. Previously dhclient (dhclient-script specifically) hadn't specified valid_lft/preferred_lft when adding address, which means they were set to 'forever'.
This has changed with
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?id=e3ee5b17e9ffb46b1476ef97f532d6b81f2615dc
and
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/?h=f20&id=566cad88e7ef11d5fae31b7934a3e6cbe58b5800

Comment 9 Pavel Šimerda (pavlix) 2014-05-21 09:55:07 UTC
(In reply to Darryl Bond from comment #7)
> I can assure you that there is nothing quite as useless as an NFS root client
> with no network connection. So 4hrs, 24 days or 10 years between lockups, it
> is still a lockup that didn't use to happen.

There have been many discussions on this topic and there will certainly be more. Just added the topic to the network management microconference[1].

But to the specific issue...

1) It is possible to specify an infinite lease time on the server when clients are expected to boot from NFS root filesystems.

default-lease-time infinite;
max-lease-time infinite;

2) You are setting the lease time to -1, which is a value I don't see in the documentation at all. Could you please point me to the relevant information?

3) If you specify a finite lease time on the server, the client is generally expected to stop using the lease when it's expired. Root on NFS is a bit special here and Harald always advocated a workaround of using the DHCP lease as if it was static configuration (i.e. using infinite lifetime).

As dracut knows which connection/address is configured to connect to NFS, it could instruct the dhclient-script (a modified version of it) to use infinite lifetime or it could run the script just to get the lease and then override the lifetime in kernel. Any thoughts?

Also, Tom Gundersen is AFAIK going to address this in systemd-networkd which won't be using the dhclient-script nor dhclient itself.

[1] http://wiki.linuxplumbersconf.org/2014:network-management#topics

Comment 10 Darryl Bond 2014-05-21 21:25:44 UTC
1) My dhcpd is an ancient V3.0.7 on Solaris. The infinite option is not available in that version. I'll get it updated.

2) The reference to -1
http://serverfault.com/questions/505300/isc-dhcp-infinite-lease-time
The Internet is the holder of all truth :)
When I tested it, it set the lease time to the same as if I put in a very large number (24 days), so I used it.

3) I understand the dilemma, but I believe that NFS Root should always set it to infinite. The DHCP server can refuse to hand out addresses that are still being used by using the ping test. 
I suppose the real problem is that there is a race between the interface expiry and the dhclient renewal. I hope that is addressed in systemd-networkd. I imagine that simply setting the interface to max-lease-time +10 would be the simple fix.

As I mentioned earlier, I couldn't use NetworkManager because of its behavior when sleeping. I never found out how to stop NetworkManager from dropping the interface when the machine started going to sleep.

I wrote a script and a systemd service file that ran ifup <interface> on boot. Note that this did not fix my problem due to the race between dhclient and the timeout on the interface.

Comment 11 Darryl Bond 2014-05-22 05:53:22 UTC
I arranged for the upgrade to dhcpd 4.3.0
I set the lease times to infinite
    infinite-is-reserved on;
    max-lease-time infinite;
    default-lease-time infinite;

I commented out the line in the script that forced forever and rebooted.
The client still shows:
valid_lft 2147430sec preferred_lft 2147430sec

Might it be an i686 thing?

Comment 12 Pavel Šimerda (pavlix) 2014-05-22 07:34:19 UTC
(In reply to Darryl Bond from comment #10)
> I suppose the real problem is that there is a race between the interface
> expiry and the dhclient renewal.

I'm not aware of such a race condition. DHCP clients are expected to renew the configuration data *long* before it's expired. No padding should be needed to ensure DHCP renewal works correctly.
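
For reference, RFC 2131 sets the default renewal (T1) and rebinding (T2) timers at 0.5 and 0.875 of the lease duration, unless the server supplies other values. A minimal sketch of that arithmetic follows. The function name is hypothetical, the random fuzz the RFC recommends is left out, and this is not a description of how dhclient itself is implemented.

```python
def default_dhcp_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    t1 = lease_seconds // 2        # renewal starts at 50% of the lease
    t2 = lease_seconds * 7 // 8    # rebinding starts at 87.5% of the lease
    return t1, t2


# The 4-hour lease mentioned earlier in this report.
t1, t2 = default_dhcp_timers(4 * 60 * 60)
print(t1, t2)  # 7200 12600
```

With a 4-hour lease, a client following these defaults would first try to renew after 2 hours, well before the address lifetime runs out.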

> As I mentioned earlier, I couldn't use Network Manager because of its
> behavior when sleeping.

Just curious. Did you test that NFS survives the sleep when not using NetworkManager? I haven't tested that and, to be honest, I wouldn't expect it to work.

> I wrote a script and a systemd service file that ran ifup <interface> on
> boot. Note that this did not fix my problem due to the race between dhclient
> and the timeout on the interface.

If you suspect a race condition there, could you please describe it carefully? In my opinion, if you get a lease that survives the boot time and dhclient requests the lease again (not sure whether it can renew properly), where is the race condition then?

Comment 13 Tom Gundersen 2014-05-22 10:37:35 UTC
I agree that we should treat DHCP leases as infinite if they are required for the root fs (and never, ever drop the addresses, not when sleeping, not for any reason). In networkd we introduced "CriticalConnection=yes" for this purpose.

We should probably put up big fat scary warnings when we get a non-infinite lease for a critical connection, and an even scarier warning in the extremely unlikely case that this lease expires without being renewed (the latter we do, the former we don't).

Btw, the reason '-1' means 'infinity', is that the DHCP spec says so (or rather its unsigned equivalent '0xffffffff').
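
The correspondence between -1 and 0xffffffff can be checked with a few lines of Python. This is only an illustration of 32-bit two's-complement wrap-around, and the helper name is hypothetical.

```python
def as_u32(value):
    """Reinterpret a (possibly negative) integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


print(as_u32(-1))        # 4294967295
print(hex(as_u32(-1)))   # 0xffffffff
print(as_u32(-2))        # 4294967294, one below the "infinite" value
```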

Comment 14 Harald Hoyer 2014-05-22 10:48:08 UTC
dracut now sets the lifetimes, because it was requested in bug 1058519

Comment 15 Darryl Bond 2014-05-22 11:13:19 UTC
Nfs root clients reliably suspend and wake if NetworkManager isn't used. I tested using dhclient via ifup and, of course, setting the interface permanently. They will even wake-on-lan.

I am certain that setting the interface to forever solved the lockups while running dhclient did not. The fact that only 10% locked up each day indicates some kind of race.

I shall try to set up a test to get some data.

Comment 16 Pavel Šimerda (pavlix) 2014-05-22 12:10:23 UTC
(In reply to Harald Hoyer from comment #14)
> dracut now sets the lifetimes, because it was requested in bug 1058519

Thank you for the information. With this in mind, if Darryl doesn't run dhclient at all, nothing will break, is that correct? Darryl, please confirm that the short (or finite) lifetime comes from your manually created dhclient service. In fact you should probably avoid running dhclient if you rely on dracut to set the infinite lifetime.

(In reply to Tom Gundersen from comment #13)
> Btw, the reason '-1' means 'infinity', is that the DHCP spec says so (or
> rather its unsigned equivalent '0xffffffff').

Using -1 for 0xffffffff can only cause confusion, especially if it's not a documented value in the documentation for the configuration format.

Comment 17 Jiri Popelka 2014-05-22 20:46:02 UTC
(In reply to Darryl Bond from comment #11)
> The client still looks like 
> valid_lft 2147430sec preferred_lft 2147430sec
> 
> Might it be an i686 thing?

I'm seeing this with iproute-3.14.0-2.fc20.i686:

# ip addr add 1.2.3.4/24 dev dummy0 valid_lft 12345678 preferred_lft 12345678 && ip addr show dummy0

inet 1.2.3.4/24 scope global dynamic dummy0
   valid_lft 2147483sec preferred_lft 2147483sec

Comment 18 Darryl Bond 2014-05-22 21:31:42 UTC
> Thank you for the information. With this in mind, if Darryl doesn't 
> run dhclient at all, nothing will break, is that correct? Darryl, 
> please confirm that the short (or finite) lifetime comes from your 
> manually created dhclient service. In fact you should probably 
> avoid running dhclient if you rely on dracut to set the infinite lifetime.
1. The default LTSP build runs dhclient once in dracut
2. As of dracut-network-037-11.git20140402.fc20.i686 the valid_lft is set to that supplied by dhclient; prior to that it was set to forever.
3. The clients now randomly lock up as the interface is lost. 
4. Only 10% per day with 4hr dhcp lease times on the server. So it is not that every client locks up at the expiry of the lease time.
5. At this point, I had no idea what was causing the lockups, only that they had started after a particular kernel update. Upgraded kernel several more times to see if there was some kind of regression.
6. Spent some days trying to bisect the lockups and found that ip addr gave me a hint.
7. Ran dhclient, did not stop the lockups
8. Set dhcp expiry to -1, clients stopped locking up. 
9. Saw that the expiry would be 24 days, so the problem was delayed, not solved.
10. Added early boot script to force valid_lft to forever. Problem for my case is solved, others not so much.

Comment 19 Fedora Admin XMLRPC Client 2014-08-20 08:54:10 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 20 Pavel Šimerda (pavlix) 2014-08-20 19:39:43 UTC
Jiri, Darryl, can you confirm that the issue is in iproute? What is the exact way to reproduce that an infinite lifetime doesn't get set?

Comment 21 Darryl Bond 2014-08-20 23:00:40 UTC
See description at top for how to reproduce.
I have no idea if the issue is in iproute. As far as I can tell it is in dracut dhclient config.

The problem is resolved for me by manually forcing valid_lft via a systemd service. I have not had a lockup since doing this.

Comment 22 Jiri Popelka 2014-08-21 07:34:25 UTC
Back then when I was investigating it, it seemed that there were actually two bugs, one in dhcpd and a second in ip. Both on i686 only.

(1) i686 dhcpd:
 - set lease_time to infinite (0xffffffff, 4 294 967 295)
 - run dhcpd & dhclient and check (tcpdump) what lease time is sent
 - from tcpdump:  Lease-Time Option 51, length 4: 738 878 298

(2) i686 ip: (see comment #17)
 - try to add an address with value of valid_lft/preferred_lft between 2147483 and 4294967295
 - valid_lft/preferred_lft is always set to 2147483

I can clone this bug to further investigate (1) and leave this one to you for (2).

Comment 23 Pavel Šimerda (pavlix) 2014-08-21 09:54:38 UTC
(In reply to Jiri Popelka from comment #22)
> I can clone this bug to further investigate (1) and leave this one to you
> for (2).

Please do so.

Cheers,

Pavel

Comment 24 Jiri Popelka 2014-08-21 10:48:58 UTC
(In reply to Pavel Šimerda (pavlix) from comment #23)
> (In reply to Jiri Popelka from comment #22)
> > I can clone this bug to further investigate (1) and leave this one to you
> > for (2).
> 
> Please do so.

Bug #1132396

The summary 'iproute i686 can't set large ip address lifetimes including the infinite lifetime' is not completely correct.
You can set infinite with value of 4294967295, but if you try a lower value, it's set to 2147483.
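
The behaviour reported so far for i686 can be summarised as a small model. This sketch only restates the observations from comments 17, 22 and 24. It is not iproute or kernel code, the function and constant names are hypothetical, and it ignores the fact that the displayed lifetime counts down after the address is added.

```python
INFINITE = 0xFFFFFFFF   # 4294967295, displayed as "forever"
OBSERVED_CAP = 2147483  # largest finite lifetime seen on i686 in this report


def observed_lifetime_i686(requested):
    """Model what "ip addr show" reports right after adding an address."""
    if requested == INFINITE:
        return "forever"
    if requested > OBSERVED_CAP:
        # Large finite values all come back as the same capped lifetime.
        return f"{OBSERVED_CAP}sec"
    return f"{requested}sec"


for value in (3600, 12345678, 4294967294, 4294967295):
    print(value, "->", observed_lifetime_i686(value))
```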

Comment 25 Pavel Šimerda (pavlix) 2015-03-14 00:01:54 UTC
Confirmed also with iproute-3.16.0-3.fc21 including the fact that *infinite* lifetime works as suggested by jpopelka.

But that means the bug only affects *finite* large lifetimes. Therefore it should be perfectly possible to configure infinite lifetimes with static IP addresses and the importance of the bug was overestimated.

Just tested a couple of related things:

1) "-1" and "forever" work just as well as 4294967295

2) "-2" exhibits the same wrong behavior as 4294967294 (i.e. forever - 1)

3) Hexadecimal is accepted but the result is bogus. This is apparently fixed in rawhide's iproute-3.19.0-1.fc23.

The latest version (3.19.0) still has the problem with finite large lifetimes and therefore there's a large chance that the problem is present upstream. As the issue doesn't occur with small lifetimes and doesn't occur with infinite lifetime, I think we can safely postpone it to rawhide.

If you encounter issues with infinite lifetimes (aka forever or 0xffffffff), please file a new bug report. If you have a good reason to fix the large finite lifetime issue also in a released version of Fedora, please let me know in this bug report.

Comment 26 Pavel Šimerda (pavlix) 2015-03-17 15:01:52 UTC
Closing because of...

1) The iproute package sends the very 32-bit unsigned integer you specify on the command line and prints the very 32-bit unsigned integer it gets from the kernel. There's nothing to do in the iproute package.

2) The infinite/forever/0xffffffff value works on i686 just as well as on x86_64. Therefore there are no issues with the permanent addresses.

3) Kernel, when built for i686, uses a magical constant[1] to clamp down all lifetime values before accepting them. Comments suggest that the reason is to prevent arithmetic overflow.

4) The kernel magical constant is high enough to allow for a lifetime that lasts more than three weeks.

So this behavior is apparently both expected and harmless.

[1] LONG_MAX / HZ = (2^31 - 1) / 1000 = 2147483 (integer division)
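
The footnote's arithmetic can be reproduced as follows. The sketch assumes HZ = 1000, as in the footnote, and a 32-bit signed long on i686 versus a 64-bit one on x86_64. The variable names are illustrative only.

```python
HZ = 1000                      # assumed timer frequency, as in the footnote
U32_MAX = 0xFFFFFFFF           # largest lifetime value, also meaning "forever"

LONG_MAX_I686 = 2**31 - 1      # 32-bit signed long
LONG_MAX_X86_64 = 2**63 - 1    # 64-bit signed long

cap_i686 = LONG_MAX_I686 // HZ
cap_x86_64 = LONG_MAX_X86_64 // HZ

print(cap_i686)                    # 2147483
print(cap_x86_64 > U32_MAX)        # True
print(round(cap_i686 / 86400, 1))  # 24.9 (days)
```

Under these assumptions the cap on x86_64 exceeds any 32-bit lifetime, which is consistent with the clamp being seen only on i686. The 2147430sec in the original report would also fit a clamped 2147483 seconds that had already counted down for 53 seconds, though that is an inference rather than something confirmed in this bug.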

