RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 2162315 - nexthop weight for IPv4 routes is not exposed on netlinks
Summary: nexthop weight for IPv4 routes is not exposed on netlinks
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.2
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Hangbin Liu
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-01-19 10:51 UTC by Thomas Haller
Modified: 2023-09-11 08:22 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-11 08:17:48 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-145760 0 None None None 2023-01-19 10:53:30 UTC

Description Thomas Haller 2023-01-19 10:51:12 UTC
IPv4 route can have no next-hop, one nexthop or multiple nexthops (ECMP).

For ECMP routes, each next hop can have a weight. The list of nexthops (including their weight!) is a identifying property of those routes. I mean, that you can add two routes that are otherwise identical but only differ by their nexthops (or even only differ by weight):


  # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1
  # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1
  # ip -4 -d route show
  ...
  unicast 1.2.3.1 proto boot scope global 
        nexthop via 192.168.5.1 dev net0 weight 1 
        nexthop via 192.168.5.2 dev net0 weight 1 
  unicast 1.2.3.1 proto boot scope global 
        nexthop via 192.168.5.1 dev net0 weight 2 
        nexthop via 192.168.5.2 dev net0 weight 1 



On netlink, the nexthop is either expressed with the RTA_MULTIPATH or the RTA_OIF,RTA_GATEWAY attributes.

For singlehop routes, kernel will use RTA_OIF,RTA_GATEWAY to dump the route. You'll see that with `ip -d route show`.



Kernel allows to configure single hop routes with a weight (using RTA_MULTIPATH). This means, you can add two otherwise identical routes that only differ by white. However, in the dump the weight is not present, so they look identical:

  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2
  # ip -4 -d route show
  ...
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 


That is a problem, because NetworkManager puts routes in a cache (dictionary), and those route would look identical to user-space, thus being combined in one. It also means, if you then delete one of those routes, NetworkManager will get a  RTM_DELROUTE message and delete the single instance of them, without knowing that there is yet another route.



Tested on f37 (6.0.18-300.fc37.x86_64 and rhel-9.2 (5.14.0-234.el9.x86_64).


Reproducer:
>>>>>>


#!/bin/bash

set -xe

ip netns del testnetns &>/dev/null || :
ip netns add testnetns

ip -netns testnetns link add net0 type dummy
ip -netns testnetns link set net0 up

ip -netns testnetns addr add 192.168.5.5/24 dev net0

ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1
ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1

ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2

ip -netns testnetns -4 -d route show

# ip netns exec testnetns strace -s 10000 ip -4 route show 2>&1

Comment 1 Hangbin Liu 2023-06-29 09:38:34 UTC
Hi Thomas,

Before I found the root issue for this bug(I have discussed with David Ahern. The fix looks not straight forward). Would you like to try the `ip nexthop` cmd? e.g.

# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop add id 2 via 192.168.1.1 dev eth1
# ip nexthop add id 101 group 1,1/2,2
# ip route add 192.168.2.0/24 nhid 101
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
        nexthop via 192.168.1.1 dev eth1 weight 1
        nexthop via 192.168.1.1 dev eth1 weight 2

# ip nexthop del id 1
# ip nexthop show
id 2 via 192.168.1.1 dev eth1 scope host
id 101 group 2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101 via 192.168.1.1 dev eth1

# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop replace id 101 group 1,1/2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
        nexthop via 192.168.1.1 dev eth1 weight 1
        nexthop via 192.168.1.1 dev eth1 weight 2

`ip nexthop` has nh id, which could be used to select each nexthop.

Thanks
Hangbin

Comment 2 Thomas Haller 2023-07-03 07:51:54 UTC
hi Hangbin.

I don't understand what is there to try. I am aware of nexthop objects, thank you.

Do you mean, NetworkManager should use nexthop objects? That may well be, however it would be a larger change and require some more analysis.

But this bug report isn't really about NetworkManager. Yes, NetworkManager is mentioned as an example for how it affects a user. But the problem is not NetworkManager specific.

If the issues (this and others) of the netlink API for route objects can be fixed, then there seems less reason to change NetworkManager to nexthop objects.
If it cannot (won't) be fixed, then would be another argument for using nexthop objects...

Comment 3 Hangbin Liu 2023-07-04 09:19:21 UTC
(In reply to Thomas Haller from comment #2)
> Do you mean, NetworkManager should use nexthop objects? 

Yes. This is the new nexthop api which should avoid most conflicts.

> That may well be, however it would be a larger change and require some more analysis.

Got it.

> 
> But this bug report isn't really about NetworkManager. Yes, NetworkManager
> is mentioned as an example for how it affects a user. But the problem is not
> NetworkManager specific.
> 
> If the issues (this and others) of the netlink API for route objects can be
> fixed, then there seems less reason to change NetworkManager to nexthop
> objects.
> If it cannot (won't) be fixed, then would be another argument for using
> nexthop objects...

OK. I will try to fix the nexthop bugs you opened. The IPv4 and IPv6 route codes
have a lot of differences. I'm not sure if upstream will accept all.

For this bug. The reason that IPv4 could add 2 same routes with only different weights is
IPv4 checks all the fields of 2 routes.

```
int fib_table_insert(struct net *net, struct fib_table *tb,
                     struct fib_config *cfg, struct netlink_ext_ack *extack)
{
                ...
                fa_match = NULL;
                fa_first = fa;
                hlist_for_each_entry_from(fa, fa_list) {
                        if ((fa->fa_slen != slen) ||
                            (fa->tb_id != tb->tb_id) ||
                            (fa->fa_dscp != dscp))
                                break;
                        if (fa->fa_info->fib_priority != fi->fib_priority)
                                break;
                        if (fa->fa_type == cfg->fc_type &&
                            fa->fa_info == fi) {                   <-- here it checks the total struct fib_info
                                fa_match = fa;
                                break;
                        }
               ...
```

While IPv6 only checks the dev and gw address.

```
static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *b)
{
        struct fib6_nh *nha, *nhb;

        if (a->nh || b->nh)
                return nexthop_cmp(a->nh, b->nh);

        nha = a->fib6_nh;
        nhb = b->fib6_nh;
        return nha->fib_nh_dev == nhb->fib_nh_dev &&
               ipv6_addr_equal(&nha->fib_nh_gw6, &nhb->fib_nh_gw6) &&
               !lwtunnel_cmp_encap(nha->fib_nh_lws, nhb->fib_nh_lws);
}
```

Comment 4 Hangbin Liu 2023-07-06 08:42:39 UTC
(In reply to Hangbin Liu from comment #3)
> For this bug. The reason that IPv4 could add 2 same routes with only
> different weights is IPv4 checks all the fields of 2 routes.

Update:

  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1
  # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2
  # ip -4 -d route show
  ...
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global 
  unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global

These 2 routes are totally 2 different routes, not same route (to 1.2.3.4) with different nexthops.

Here is why they tread them as different routes.

- fib_create_info
  - fib_find_info()
    - nh_comp()
      - nh->fib_nh_weight != onh->fib_nh_weight

As these 2 routes are just regular routes instead of multipath routes. So kernel
only dump the normal route info instead of weight info. e.g.

  - fib_dump_info
    - if (nhs == 1): fib_nexthop_info
      - just add RTA_GATEWAY
    - else: fib_add_multipath
      - Start with RTA_MULTIPATH flag
      - fib_add_nexthop(with fib_nh_weight)
        - add struct rtnexthop rtnh to store the weight info
        - fib_nexthop_info

Comment 5 Hangbin Liu 2023-09-11 08:17:48 UTC
Confirmed with David, This will not be fixed as it will change the uAPI.


Note You need to log in before you can comment on or make changes to this bug.