Bug 1109949

Summary: dhclient-script does not handle option 121 properly
Product: Red Hat Enterprise Linux 7 Reporter: Jiri Popelka <jpopelka>
Component: dhcpAssignee: Jiri Popelka <jpopelka>
Status: CLOSED ERRATA QA Contact: Release Test Team <release-test-team-automation>
Severity: high Docs Contact:
Priority: medium    
Version: 7.0CC: bconnor, ljozsa, ovasik, pneedle, praiskup, shobbs
Target Milestone: rcKeywords: Patch
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: dhcp-4.2.5-29.el7 Doc Type: Bug Fix
Doc Text:
Cause Installing RHEL-7 on Google Compute Engine (GCE). Consequence dhclient failed to set up proper static routes provided via classless-static-routes option (option 121) on the installer image. Fix A function in dhclient-script was fixed. Result dhclient properly installs static routes provided via classless-static-routes option.
Story Points: ---
Clone Of: 1102830 Environment:
Last Closed: 2015-03-05 10:35:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1113520    
Attachments:
Description Flags
dhclient-script patch none

Description Jiri Popelka 2014-06-16 17:37:18 UTC
+++ This bug was initially created as a clone of Bug #1102830 +++

Description of problem:
dhclient-script is unable to configure static routes given by DHCP option 121. It fails to setup routes properly if DHCP server gives out router which is not on local subnet as well as sends static routes via option 121. See sample lease file which you can use to be able to reproduce that:

#------------------

lease {
  interface "eth0";
  fixed-address 10.240.173.41;
  option subnet-mask 255.255.255.255;
  option routers 10.240.0.1;
  option dhcp-lease-time 4294967295;
  option dhcp-message-type 5;
  option domain-name-servers 169.254.169.254,10.240.0.1;
  option dhcp-server-identifier 169.254.169.254;
  option interface-mtu 1460;
  option domain-search "c.bar.internal.", "656424467045.google.internal.", "google.internal.";
  option ntp-servers 169.254.169.254,10.240.0.1;
  option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;
  option host-name "foo.c.bar.internal";
  option domain-name "c.bar.internal.";
  renew 2 2082/06/16 18:16:21;
  rebind 3 2133/07/01 02:41:50;
  expire 0 2150/07/05 21:30:28;
}
--------------------

This bug was partially introduced by me via #1055181.

This happens on google compute engine (GCE). They recently changed how they provide static routes by sending them via dhcp option 121 (classless-static-routes), which is the right way to do.

Line 280 returns 1 if adding a route fails. It always fails because static route has been provided via classless-static-routes already.

I am not sure how it is best to handle this situation. To me it seems that dhclient-script logic/flow is partially wrong, so I didn't want to submit any patches, but instead start conversations. 

Version-Release number of selected component (if applicable):
dhclient-4.2.6-1.fc20, dhclient-4.3.0-1.fc21

How reproducible:
Always


Steps to Reproduce:

1. Configure dhcp server to send router which is not on the same subnet and classless-static-routes option.

2. Run fedora cloud image in the same network so it can get its network configuration from DHCP.
3.

Actual results:
Network config is left in broken state.


Expected results:
Network properly configured. Routes properly set.


Additional info:

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-30 10:32:23 CEST ---

DHCP server configuration, especially the classless-static-routes part, would be great.

(In reply to Vaidas Jablonskis from comment #0)
>   option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;

This is internal format for dhclient lease files.According to the RFC, the option carries destination network (including prefix length) and the router address. I don't know what the middle value means, just guessing that this record roughly means:

ip route add 10.240.0.1/32 via 10.240.0.1

> 1. Configure dhcp server to send router which is not on the same subnet and
> classless-static-routes option.

In the submitted lease, there seems to be a route in the format (gateway/32 via gateway) where gateway is also the default gateway per the respective DHCP option. What's the purpose of that?

> Actual results:
> Network config is left in broken state.

That's not very descriptive.

> Expected results:
> Network properly configured. Routes properly set.

Which exact routes are expected? An expected routing table would be best.

--- Additional comment from Jiri Popelka on 2014-05-30 15:12:48 CEST ---

(In reply to Pavel Šimerda (pavlix) from comment #1)
> DHCP server configuration, especially the classless-static-routes part,
> would be great.

Or at least a packet dump with a DHCPOFFER and/or DHCPACK with the classless-static-routes option filled by server.
 
> (In reply to Vaidas Jablonskis from comment #0)
> >   option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;
 
The option is described in dhcp-options(5).
This looks like 2 routes
10.240.0.1/32 via 0
0/0 via 10.240.0.1

FYI bug #516325, bug #769463

--- Additional comment from Pavel Šimerda (pavlix) on 2014-05-30 16:44:12 CEST ---

(In reply to Jiri Popelka from comment #2)
> > (In reply to Vaidas Jablonskis from comment #0)
> > >   option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;
>  
> The option is described in dhcp-options(5).
> This looks like 2 routes
> 10.240.0.1/32 via 0
> 0/0 via 10.240.0.1

Ah, that's true. The first route seems to be redundant as it's a route to a gateway, the second route seems to be redundant as well as it's a default route through a gateway.

--- Additional comment from Vaidas Jablonskis on 2014-06-02 13:49:52 CEST ---

Hi guys. Sorry for not coming back earlier.

I've created a couple of VMs and was able reproduce it locally.

Server:
--------------- /etc/dhcp/dhcpd.conf -----------------

option domain-name "local";
option domain-name-servers 8.8.8.8;

default-lease-time 600;
max-lease-time 7200;

log-facility local7;
subnet 10.240.0.0 netmask 255.255.0.0 {
}

host dhcp-client {
  hardware ethernet 08:00:27:fd:1e:6a;
  fixed-address 10.240.173.41;
  option subnet-mask 255.255.255.255;
  option routers 10.240.0.1;
  option dhcp-message-type 5;
  option domain-name-servers 169.254.169.254;
  option interface-mtu 1460;
  option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;
}
-----------------------------------


Client (broken state)
--------------- ip addr show / ip route show ---------------

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:fd:1e:6a brd ff:ff:ff:ff:ff:ff
    inet 10.240.173.41/32 brd 10.240.173.41 scope global dynamic enp0s3
       valid_lft 556sec preferred_lft 556sec
    inet6 fe80::a00:27ff:fefd:1e6a/64 scope link
       valid_lft forever preferred_lft forever


$ ip route show 
10.240.0.1 dev enp0s3  proto static  scope link

-----------------------------------------------------------



Client (desired state)
--------------- ip addr show / ip route show ---------------
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:fd:1e:6a brd ff:ff:ff:ff:ff:ff
    inet 10.240.173.41/32 brd 10.240.173.41 scope global dynamic enp0s3
       valid_lft 555sec preferred_lft 555sec
    inet6 fe80::a00:27ff:fefd:1e6a/64 scope link
       valid_lft forever preferred_lft forever


$ ip route show
default via 10.240.0.1 dev enp0s3  proto static
10.240.0.1 dev enp0s3  proto static  scope link

--------------------------------------------------------

I get dhclient-script to work if I comment out line 280, but this it not a solution. It seems to fail because line 277 tries to add a route which is already added by option classless-static-routes.


Let me know if I can provide you with more details.

--- Additional comment from Jiri Popelka on 2014-06-02 14:00:49 CEST ---

I'm afraid I don't have time to look at it this week.

It's possible that the logic there is bad,
I wasn't much experienced when I was writing the code back then.
I'll be glad for any suggested patch/fix,
you perhaps understand the problem more then me.
Thanks

--- Additional comment from Vaidas Jablonskis on 2014-06-02 14:56:03 CEST ---

Thanks Jiri.

I will look into this for a possible solution. Will update with my findings.

--- Additional comment from Pavel Šimerda (pavlix) on 2014-06-02 16:44:12 CEST ---

(In reply to Vaidas Jablonskis from comment #4)
>   fixed-address 10.240.173.41;
>   option subnet-mask 255.255.255.255;

The netmask looks wrong. This may well be the root cause of the problem. Did you try with 255.255.255.0. That would get you:

10.240.0.0/24 dev enp0s3

>   option routers 10.240.0.1;

And this should be enough to get:

default via 10.240.0.1 [dev enp0s3]

>   option dhcp-message-type 5;

What's the purpose of this?

>   option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;

This looks like an attempt to work around the potentially wrong configuration above.

It would be good to know the motivation for such a nonstandard configuration. And if you could please quickly check that fixing the netmask removes the issue.

--- Additional comment from Vaidas Jablonskis on 2014-06-02 17:02:24 CEST ---

(In reply to Pavel Šimerda (pavlix) from comment #7)
> (In reply to Vaidas Jablonskis from comment #4)
> >   fixed-address 10.240.173.41;
> >   option subnet-mask 255.255.255.255;
> 
> The netmask looks wrong. This may well be the root cause of the problem. Did
> you try with 255.255.255.0. That would get you:
> 
> 10.240.0.0/24 dev enp0s3
> 
Netmask is not wrong. /32 is a valid netmask. I don't disagree that this is slightly unusual network setup, but this is how networking is done in Google Cloud. I have tried CentOS, RHEL, SuSe on GCE and it works just fine.

> >   option routers 10.240.0.1;
> 
> And this should be enough to get:
> 
> default via 10.240.0.1 [dev enp0s3]
> 
> >   option dhcp-message-type 5;
> 
> What's the purpose of this?
Can be removed. It does not make any difference, so just ignore that.
> 
> >   option classless-static-routes 32.10.240.0.1 0.0.0.0,0 10.240.0.1;
> 
> This looks like an attempt to work around the potentially wrong
> configuration above.
> 
> It would be good to know the motivation for such a nonstandard
> configuration. And if you could please quickly check that fixing the netmask
> removes the issue.

I cannot tell you exactly what motivation behind this kind of setup is, but I guess this way GCE can easily control traffic flows and do it in most efficient way, so that all traffic between instances in the same subnet goes via default gateway.

The logic of dhclient-script is clearly wrong, but I need some help to figure out how it should be done.

CentOS uses an older version of dhclient-script, which has no issues configuring network on google cloud.

As stated above, this problem was partially introduced by me. GCE didn't use option 121 previously, but previous fedora dhclient-script used to try to ping default router, while ICMP ping is disabled in GCE network, so network config used to fail, because default gateway could not be set. So we decided that checking whether router is alive by pinging it is not a great idea.

--- Additional comment from Pavel Šimerda (pavlix) on 2014-06-02 17:45:46 CEST ---

(In reply to Vaidas Jablonskis from comment #8)
> I cannot tell you exactly what motivation behind this kind of setup is, but
> I guess this way GCE can easily control traffic flows and do it in most
> efficient way, so that all traffic between instances in the same subnet goes
> via default gateway.

Now I see. It looks like they're trying to mimic the IPv6 autoconf AdvOnLink=no feature. That's a valid use case. But it's true that we've seen a lot of issues with that in NetworkManager, I don't remember the details, though.

> The logic of dhclient-script is clearly wrong, but I need some help to
> figure out how it should be done.

The order of the routes is good, so it should be possible to handle it. And as you say, it works in some versions of some distributions, at the least. The script should definitely be altered to cope with this situation.

> As stated above, this problem was partially introduced by me.

OK, that's fair.

> GCE didn't use
> option 121 previously, but previous fedora dhclient-script used to try to
> ping default router, while ICMP ping is disabled in GCE network, so network
> config used to fail, because default gateway could not be set. So we decided
> that checking whether router is alive by pinging it is not a great idea.

I see. I'll have more comments later.

--- Additional comment from Filipe Brandenburger on 2014-06-12 08:14:50 CEST ---

One possible solution that worked for me is to use "ip route replace" instead of "ip route add" in is_router_reachable. I think that would make some sense considering that's the only place where ip route add (instead of replace) is used throughout the file...

The other thought I had was to use ip route get to see if that router is reachable, but I'm not 100% sure that just because it's reachable it makes sense since in practice I think it needs to be directly connected, so I'm not sure what would be the best way to check for that...

Pavel, Jiri, do you have a suggestion of what would be the best way to solve this issue?

--- Additional comment from Pavel Šimerda (pavlix) on 2014-06-12 10:59:23 CEST ---

(In reply to Filipe Brandenburger from comment #10)
> One possible solution that worked for me is to use "ip route replace"
> instead of "ip route add" in is_router_reachable.

Sounds reasonable, if that fixes the issue for you. Another way would be to gracefully recover when "ip route add" exits with an error. Not that I'm speaking out of my mind, not looking at the actual code.

--- Additional comment from Vaidas Jablonskis on 2014-06-12 13:01:09 CEST ---

What is the purpose of the function `is_router_reachable()`?

It seems redundant to me apart from that it tries to handle router route which is not on the same subnet. Normally this would be passed in via option 121 (like google does now), but as we know there are some broken DHCP configurations where this function can be useful.

First of all I think it should be renamed to something else, because it does not really tell whether router is alive or not anymore. Secondly, as Pavel suggested, it should not return 1 or gracefully recover.

Would there be a way to check whether DHCP server sent option 121?

--- Additional comment from Jiri Popelka on 2014-06-12 13:21:09 CEST ---

(In reply to Vaidas Jablonskis from comment #12)
> Would there be a way to check whether DHCP server sent option 121?

if [ -n "${new_classless_static_routes}" ]

--- Additional comment from Vaidas Jablonskis on 2014-06-12 15:48:08 CEST ---

Great. Thanks Jiri.

I will try to incorporate some changes to the script and do some testing (I have a test dhcp environment which I can use to simulate various scenarios).

Is anyone else on this list working on a fix? Just want to make sure we're not duplicating the effort.

--- Additional comment from Filipe Brandenburger on 2014-06-12 16:29:06 CEST ---

I noticed that dhclient 4.1.1 works fine with the GCE setup.

Digging deeper a little bit, it seems to me (unconfirmed though) that this regression was introduced in commit 0acc1b8c2c6de469e25850a703b54f30827560d1, particularly this part of it:

dhclient-script: allow static route with a 0.0.0.0 next-hop addressi (#769463

> -                if is_router_reachable ${gateway}; then
> +                # special case 0.0.0.0 to allow static routing for link-local addresses
> +                # (including IPv4 multicast) which will not have a next-hop (#769463)
> +                if [ "${gateway}" = "0.0.0.0" ] ||
> +                   is_router_reachable ${gateway}; then

What was happening on 4.1.1 was that adding the route for 32.10.240.0.1 0.0.0.0 would fail because is_router_reachable 0.0.0.0 returns false so the static route to 10.240.0.1 is never added... Then, when trying to process the second route of 0 10.240.0.1 it would call is_router_reachable 10.240.0.1 which would add the route to that router *FOR THE FIRST TIME* and so the ip route add would not fail at that point...

I think using ip route replace instead is a possible fix, but I echo Vaidas in comment 12 asking what the purpose of is_router_reachable really is... From its name, it sounds like it should be only "checking" something so I'm not sure that having side effects like ip route add (or even ip route replace!) would be wise there...

Having said that, I think s/ip route add/ip route replace/ is a start, if you think doing a larger change might have unintended side effects elsewhere...

Cheers,
Filipe

--- Additional comment from Filipe Brandenburger on 2014-06-12 17:41:10 CEST ---

Here's a suggested patch.

I left a comment saying that I think is_router_reachable should not have side effects such as adding or replacing routes, but if changing that would be too risky then at least using ip route replace would fix this bug and I don't see it introducing any unexpected side effects.

If possible, can I also ask for a backport of it to both f20 and f21 branches? It should apply cleanly on both.

Cheers,
Filipe

--- Additional comment from Vaidas Jablonskis on 2014-06-12 18:05:22 CEST ---

What happens if there is no such route and it runs "ip route replace <non existing route>" ?

This is how GCE network worked before, it never used to use option 121, which is why we ran into issues in the first place.

--- Additional comment from Filipe Brandenburger on 2014-06-12 18:18:45 CEST ---

Hi Vaidas,

"ip route replace" will add a route if none exists. The main difference is that it won't fail if a route already exists, in that case it will simply "update" it.

See my comment 15 for an analysis of why this worked in 4.1.1, it seems to me that in that case we had two bugs producing the correct end result :-) As the first bug was fixed (namely handling static routes with gateway 0.0.0.0 as being scope link routes) the second bug stayed (the one where is_router_reachable has a side effect and calling it twice on the same router will fail on the second time) and that's what we're seeing here.

Fix for the first bug is here:
http://pkgs.fedoraproject.org/cgit/dhcp.git/commit/dhclient-script?id=0acc1b8c2c6de469e25850a703b54f30827560d1

And it was trying to fix this bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=769463

I really agree with your remarks that is_router_available trying to add routes is a little weird, that's why I added the #TODO comments on my suggested patch.

Cheers,
Filipe

--- Additional comment from Vaidas Jablonskis on 2014-06-12 18:37:51 CEST ---

Filipe,

I agree with your solution. It is not the best, but this should have the least amount of impact on the overall logic of that script.

Thanks!

--- Additional comment from Filipe Brandenburger on 2014-06-16 19:18:26 CEST ---

Hi Jiri,

I just built a Fedora 20 image for GCE with your dhclient RPM and it worked just fine.

[root@fedora20gce ~]# ip route
default via 10.240.0.1 dev eth0  proto static 
10.240.0.1 dev eth0  scope link 

I hope that's useful information to you!

BTW, I haven't tested that personally and I can't confirm or deny it, but I imagine the same issue might affect RHEL 7 on GCE, considering the version of dhclient shipped in it seems to include commit 0acc1b8c2c6de469e25850a703b54f30827560d1 which without this second fix seems to cause this bug.

Comment 4 Jiri Popelka 2014-06-25 14:23:25 UTC
Created attachment 912132 [details]
dhclient-script patch

Comment 10 Ladislav Jozsa 2014-11-28 09:08:10 UTC
I'm not able to trigger bug in RHEL-7.0, dhclient-4.2.5-27.el7, it works for me there. However I can confirm that patch is present both in specfile and in src package in dhcp-4.2.5-29.el7. Verified as SanityOnly.

Comment 12 errata-xmlrpc 2015-03-05 10:35:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0450.html