Bug 2162315
| Summary: | nexthop weight for IPv4 routes is not exposed on netlinks | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Thomas Haller <thaller> |
| Component: | kernel | Assignee: | Hangbin Liu <haliu> |
| kernel sub component: | Networking | QA Contact: | Jianlin Shi <jishi> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | haliu, jiji, kzhang, sukulkar |
| Version: | 9.2 | Keywords: | FutureFeature |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-09-11 08:17:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Hi Thomas,
Before I found the root issue for this bug(I have discussed with David Ahern. The fix looks not straight forward). Would you like to try the `ip nexthop` cmd? e.g.
# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop add id 2 via 192.168.1.1 dev eth1
# ip nexthop add id 101 group 1,1/2,2
# ip route add 192.168.2.0/24 nhid 101
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
nexthop via 192.168.1.1 dev eth1 weight 1
nexthop via 192.168.1.1 dev eth1 weight 2
# ip nexthop del id 1
# ip nexthop show
id 2 via 192.168.1.1 dev eth1 scope host
id 101 group 2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101 via 192.168.1.1 dev eth1
# ip nexthop add id 1 via 192.168.1.1 dev eth1
# ip nexthop replace id 101 group 1,1/2,2
# ip route show 192.168.2.0/24
192.168.2.0/24 nhid 101
nexthop via 192.168.1.1 dev eth1 weight 1
nexthop via 192.168.1.1 dev eth1 weight 2
`ip nexthop` has nh id, which could be used to select each nexthop.
Thanks
Hangbin
hi Hangbin. I don't understand what is there to try. I am aware of nexthop objects, thank you. Do you mean, NetworkManager should use nexthop objects? That may well be, however it would be a larger change and require some more analysis. But this bug report isn't really about NetworkManager. Yes, NetworkManager is mentioned as an example for how it affects a user. But the problem is not NetworkManager specific. If the issues (this and others) of the netlink API for route objects can be fixed, then there seems less reason to change NetworkManager to nexthop objects. If it cannot (won't) be fixed, then would be another argument for using nexthop objects... (In reply to Thomas Haller from comment #2) > Do you mean, NetworkManager should use nexthop objects? Yes. This is the new nexthop api which should avoid most conflicts. > That may well be, however it would be a larger change and require some more analysis. Got it. > > But this bug report isn't really about NetworkManager. Yes, NetworkManager > is mentioned as an example for how it affects a user. But the problem is not > NetworkManager specific. > > If the issues (this and others) of the netlink API for route objects can be > fixed, then there seems less reason to change NetworkManager to nexthop > objects. > If it cannot (won't) be fixed, then would be another argument for using > nexthop objects... OK. I will try to fix the nexthop bugs you opened. The IPv4 and IPv6 route codes have a lot of differences. I'm not sure if upstream will accept all. For this bug. The reason that IPv4 could add 2 same routes with only different weights is IPv4 checks all the fields of 2 routes. ``` int fib_table_insert(struct net *net, struct fib_table *tb, struct fib_config *cfg, struct netlink_ext_ack *extack) { ... fa_match = NULL; fa_first = fa; hlist_for_each_entry_from(fa, fa_list) { if ((fa->fa_slen != slen) || (fa->tb_id != tb->tb_id) || (fa->fa_dscp != dscp)) break; if (fa->fa_info->fib_priority != fi->fib_priority) break; if (fa->fa_type == cfg->fc_type && fa->fa_info == fi) { <-- here it checks the total struct fib_info fa_match = fa; break; } ... ``` While IPv6 only checks the dev and gw address. ``` static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *b) { struct fib6_nh *nha, *nhb; if (a->nh || b->nh) return nexthop_cmp(a->nh, b->nh); nha = a->fib6_nh; nhb = b->fib6_nh; return nha->fib_nh_dev == nhb->fib_nh_dev && ipv6_addr_equal(&nha->fib_nh_gw6, &nhb->fib_nh_gw6) && !lwtunnel_cmp_encap(nha->fib_nh_lws, nhb->fib_nh_lws); } ``` (In reply to Hangbin Liu from comment #3) > For this bug. The reason that IPv4 could add 2 same routes with only > different weights is IPv4 checks all the fields of 2 routes. Update: # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1 # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2 # ip -4 -d route show ... unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global These 2 routes are totally 2 different routes, not same route (to 1.2.3.4) with different nexthops. Here is why they tread them as different routes. - fib_create_info - fib_find_info() - nh_comp() - nh->fib_nh_weight != onh->fib_nh_weight As these 2 routes are just regular routes instead of multipath routes. So kernel only dump the normal route info instead of weight info. e.g. - fib_dump_info - if (nhs == 1): fib_nexthop_info - just add RTA_GATEWAY - else: fib_add_multipath - Start with RTA_MULTIPATH flag - fib_add_nexthop(with fib_nh_weight) - add struct rtnexthop rtnh to store the weight info - fib_nexthop_info Confirmed with David, This will not be fixed as it will change the uAPI. |
IPv4 route can have no next-hop, one nexthop or multiple nexthops (ECMP). For ECMP routes, each next hop can have a weight. The list of nexthops (including their weight!) is a identifying property of those routes. I mean, that you can add two routes that are otherwise identical but only differ by their nexthops (or even only differ by weight): # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1 # ip route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1 # ip -4 -d route show ... unicast 1.2.3.1 proto boot scope global nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1 unicast 1.2.3.1 proto boot scope global nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1 On netlink, the nexthop is either expressed with the RTA_MULTIPATH or the RTA_OIF,RTA_GATEWAY attributes. For singlehop routes, kernel will use RTA_OIF,RTA_GATEWAY to dump the route. You'll see that with `ip -d route show`. Kernel allows to configure single hop routes with a weight (using RTA_MULTIPATH). This means, you can add two otherwise identical routes that only differ by white. However, in the dump the weight is not present, so they look identical: # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1 # ip route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2 # ip -4 -d route show ... unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global unicast 1.2.3.4 via 192.168.5.1 dev net0 proto boot scope global That is a problem, because NetworkManager puts routes in a cache (dictionary), and those route would look identical to user-space, thus being combined in one. It also means, if you then delete one of those routes, NetworkManager will get a RTM_DELROUTE message and delete the single instance of them, without knowing that there is yet another route. Tested on f37 (6.0.18-300.fc37.x86_64 and rhel-9.2 (5.14.0-234.el9.x86_64). Reproducer: >>>>>> #!/bin/bash set -xe ip netns del testnetns &>/dev/null || : ip netns add testnetns ip -netns testnetns link add net0 type dummy ip -netns testnetns link set net0 up ip -netns testnetns addr add 192.168.5.5/24 dev net0 ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 1 nexthop via 192.168.5.2 dev net0 weight 1 ip -netns testnetns route append 1.2.3.1 nexthop via 192.168.5.1 dev net0 weight 2 nexthop via 192.168.5.2 dev net0 weight 1 ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 1 ip -netns testnetns route append 1.2.3.4 nexthop via 192.168.5.1 dev net0 weight 2 ip -netns testnetns -4 -d route show # ip netns exec testnetns strace -s 10000 ip -4 route show 2>&1