Bug 2162315

Summary: nexthop weight for IPv4 routes is not exposed on netlinks
Product: Red Hat Enterprise Linux 9 Reporter: Thomas Haller <thaller>
Component: kernelAssignee: Hangbin Liu <haliu>
kernel sub component: Networking QA Contact: Jianlin Shi <jishi>
Status: NEW --- Docs Contact:
Severity: unspecified    
Priority: unspecified CC: haliu, jiji, kzhang, sukulkar
Version: 9.2Keywords: FutureFeature
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Thomas Haller 2023-01-19 10:51:12 UTC
IPv4 route can have no next-hop, one nexthop or multiple nexthops (ECMP).

For ECMP routes, each next hop can have a weight. The list of nexthops (including their weight!) is a identifying property of those routes. I mean, that you can add two routes that are otherwise identical but only differ by their nexthops (or even only differ by weight):


  # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1
  # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1
  # ip -4 -d route show
  ...
  unicast 1.2.3.1 proto boot scope global 
        nexthop via 192.168.5.1 dev net0 weight 1 
        nexthop via 192.168.5.2 dev net0 weight 1 
  unicast 1.2.3.1 proto boot scope global 
        nexthop via 192.168.5.1 dev net0 weight 2 
        nexthop via 192.168.5.2 dev net0 weight 1 



On netlink, the nexthop is either expressed with the RTA_MULTIPATH or the RTA_OIF,RTA_GATEWAY attributes.

For singlehop routes, kernel will use RTA_OIF,RTA_GATEWAY to dump the route. You'll see that with `ip -d route show`.



Kernel allows to configure single hop routes with a weight (using RTA_MULTIPATH). This means, you can add two otherwise identical routes that only differ by white. However, in the dump the weight is not present, so they look identical:

  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2
  # ip -4 -d route show
  ...
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 


That is a problem, because NetworkManager puts routes in a cache (dictionary), and those route would look identical to user-space, thus being combined in one. It also means, if you then delete one of those routes, NetworkManager will get a  RTM_DELROUTE message and delete the single instance of them, without knowing that there is yet another route.



Tested on f37 (6.0.18-300.fc37.x86_64 and rhel-9.2 (5.14.0-234.el9.x86_64).


Reproducer:
>>>>>>


#!/bin/bash

set -xe

ip netns del testnetns &>/dev/null || :
ip netns add testnetns

ip -netns testnetns link add net0 type dummy
ip -netns testnetns link set net0 up

ip -netns testnetns addr add 192.168.5.5/24 dev net0

ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1
ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1

ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2

ip -netns testnetns -4 -d route show

# ip netns exec testnetns strace -s 10000 ip -4 route show 2>&1

Comment 1 Hangbin Liu 2023-06-29 09:38:34 UTC
Hi Thomas,

Before I found the root issue for this bug(I have discussed with David Ahern. The fix looks not straight forward). Would you like to try the `ip nexthop` cmd? e.g.

# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop add id 2 via 192.168.1.1 dev eth1
# ip nexthop add id 101 group 1,1/2,2
# ip route add 192.168.2.0/24 nhid 101
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
        nexthop via 192.168.1.1 dev eth1 weight 1
        nexthop via 192.168.1.1 dev eth1 weight 2

# ip nexthop del id 1
# ip nexthop show
id 2 via 192.168.1.1 dev eth1 scope host
id 101 group 2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101 via 192.168.1.1 dev eth1

# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop replace id 101 group 1,1/2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
        nexthop via 192.168.1.1 dev eth1 weight 1
        nexthop via 192.168.1.1 dev eth1 weight 2

`ip nexthop` has nh id, which could be used to select each nexthop.

Thanks
Hangbin

Comment 2 Thomas Haller 2023-07-03 07:51:54 UTC
hi Hangbin.

I don't understand what is there to try. I am aware of nexthop objects, thank you.

Do you mean, NetworkManager should use nexthop objects? That may well be, however it would be a larger change and require some more analysis.

But this bug report isn't really about NetworkManager. Yes, NetworkManager is mentioned as an example for how it affects a user. But the problem is not NetworkManager specific.

If the issues (this and others) of the netlink API for route objects can be fixed, then there seems less reason to change NetworkManager to nexthop objects.
If it cannot (won't) be fixed, then would be another argument for using nexthop objects...

Comment 3 Hangbin Liu 2023-07-04 09:19:21 UTC
(In reply to Thomas Haller from comment #2)
> Do you mean, NetworkManager should use nexthop objects? 

Yes. This is the new nexthop api which should avoid most conflicts.

> That may well be, however it would be a larger change and require some more analysis.

Got it.

> 
> But this bug report isn't really about NetworkManager. Yes, NetworkManager
> is mentioned as an example for how it affects a user. But the problem is not
> NetworkManager specific.
> 
> If the issues (this and others) of the netlink API for route objects can be
> fixed, then there seems less reason to change NetworkManager to nexthop
> objects.
> If it cannot (won't) be fixed, then would be another argument for using
> nexthop objects...

OK. I will try to fix the nexthop bugs you opened. The IPv4 and IPv6 route codes
have a lot of differences. I'm not sure if upstream will accept all.

For this bug. The reason that IPv4 could add 2 same routes with only different weights is
IPv4 checks all the fields of 2 routes.

```
int fib_table_insert(struct net *net, struct fib_table *tb,
                     struct fib_config *cfg, struct netlink_ext_ack *extack)
{
                ...
                fa_match = NULL;
                fa_first = fa;
                hlist_for_each_entry_from(fa, fa_list) {
                        if ((fa->fa_slen != slen) ||
                            (fa->tb_id != tb->tb_id) ||
                            (fa->fa_dscp != dscp))
                                break;
                        if (fa->fa_info->fib_priority != fi->fib_priority)
                                break;
                        if (fa->fa_type == cfg->fc_type &&
                            fa->fa_info == fi) {                   <-- here it checks the total struct fib_info
                                fa_match = fa;
                                break;
                        }
               ...
```

While IPv6 only checks the dev and gw address.

```
static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *b)
{
        struct fib6_nh *nha, *nhb;

        if (a->nh || b->nh)
                return nexthop_cmp(a->nh, b->nh);

        nha = a->fib6_nh;
        nhb = b->fib6_nh;
        return nha->fib_nh_dev == nhb->fib_nh_dev &&
               ipv6_addr_equal(&nha->fib_nh_gw6, &nhb->fib_nh_gw6) &&
               !lwtunnel_cmp_encap(nha->fib_nh_lws, nhb->fib_nh_lws);
}
```

Comment 4 Hangbin Liu 2023-07-06 08:42:39 UTC
(In reply to Hangbin Liu from comment #3)
> For this bug. The reason that IPv4 could add 2 same routes with only
> different weights is IPv4 checks all the fields of 2 routes.

Update:

  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2
  # ip -4 -d route show
  ...
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global

These 2 routes are totally 2 different routes, not same route (to 1.2.3.4) with different nexthops.

Here is why they tread them as different routes.

- fib_create_info
  - fib_find_info()
    - nh_comp()
      - nh->fib_nh_weight != onh->fib_nh_weight

As these 2 routes are just regular routes instead of multipath routes. So kernel
only dump the normal route info instead of weight info. e.g.

  - fib_dump_info
    - if (nhs == 1): fib_nexthop_info
      - just add RTA_GATEWAY
    - else: fib_add_multipath
      - Start with RTA_MULTIPATH flag
      - fib_add_nexthop(with fib_nh_weight)
        - add struct rtnexthop rtnh to store the weight info
        - fib_nexthop_info