Bug 2218866
| Summary: | 9.3 regression: NetworkManager restores src route after ip route replace/change | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Oyvind Albrigtsen <oalbrigt> |
| Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani> |
| Status: | CLOSED NOTABUG | QA Contact: | Desktop QE <desktop-qa-list> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.3 | CC: | bgalvani, cfeist, fge, lrintel, mjuricek, rkhan, sfaye, sukulkar, thaller, till |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 9.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Last Closed: | 2023-08-03 16:14:21 UTC | Type: | Bug |

Description
Oyvind Albrigtsen
2023-06-30 10:23:28 UTC
In the past, NetworkManager made an effort not to restore a route when it was removed externally. However, that caused problems and was "fixed" (well, it's not a fix from the reporter's point of view):

- When a route disappears, it's not clear why it's gone and whether that is what the user desired. For example, if you have a route with "prefsrc $IPADDR" and you delete the IP address "$IPADDR", the kernel will automatically remove the route too. So far so good: at first NetworkManager cannot re-add the route unless somebody (or NetworkManager itself) re-adds the $IPADDR address. However, later when the address gets re-added (either by the user or by NM), then NM probably(?) should restore that route. As we don't know whether the route was removed intentionally by the user or as a side effect of something else, we don't know whether we should restore the route. NM now consistently restores it.
- Routes depend on each other. For example, a route with a next hop can only be configured in the kernel if there is also a route that makes the next hop directly reachable. If the user externally removes such a direct route, NM is unable to configure other routes. It's difficult to keep track of what NM should configure vs. what it can currently configure.
- During a restart of NetworkManager, an interface might only be half configured (if it was in the process of activating). Upon restart, NetworkManager doesn't know why some IP addresses/routes are missing, and it needs to re-configure them.

NetworkManager does not configure the interface all the time, but whenever something happens that triggers a reconfiguration (e.g. a DHCP lease update, an IPv6 Router Advertisement, a `nmcli device reapply`), NetworkManager will aim to configure all IP addresses and routes that it planned to configure. That is, it no longer tries to keep track of addresses/routes which it previously configured and which somehow got externally lost (and blacklist them from being configured).

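The reconfiguration behavior described above can be pictured with a toy model. This is only a sketch: `ToyNetworkManager`, its `desired` set, and the 192.0.2.x route strings are hypothetical names made up for illustration, not NetworkManager's real data structures or the addresses from this bug. It models just one point, namely that a reconfiguration trigger re-adds every route NM planned to configure, and it does not attempt to model what `nmcli device reapply` does with externally added routes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNetworkManager:
    """Toy model only; hypothetical, not NetworkManager's real internals."""

    # Routes that NM planned to configure (e.g. from a DHCP lease).
    desired: set[str] = field(default_factory=set)

    def reconfigure(self, kernel_routes: set[str]) -> set[str]:
        """Return the kernel route set after a reconfiguration trigger.

        Every desired route is (re-)added; routes that were added
        externally are left in place by this toy model.
        """
        return kernel_routes | self.desired


# Hypothetical routes using documentation addresses (192.0.2.0/24).
nm_route = "default via 192.0.2.1 metric 100"
agent_route = "default via 192.0.2.1 metric 100 src 192.0.2.50"

nm = ToyNetworkManager(desired={nm_route})
kernel = set(nm.desired)

# An external tool replaces NM's route with its own variant.
kernel.discard(nm_route)
kernel.add(agent_route)

# Nothing changes until a trigger arrives (DHCP lease update, IPv6 RA,
# `nmcli device reapply`); then the missing desired route comes back.
after_trigger = nm.reconfigure(kernel)
print(sorted(after_trigger))
```

In this model both routes are present after the trigger, which corresponds to the route duplication discussed in this report.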
What NetworkManager still does is ignore all externally present/configured addresses/routes until `nmcli device reapply|modify` gets called. So you can add them without NetworkManager's knowledge or interference. But you cannot remove addresses that NetworkManager wants to configure.

Your application should keep working when somebody calls `nmcli device reapply` or `nmcli connection up`. Both would restore the IP configuration that NetworkManager intends to be present. If that breaks your application, then you have a problem. If your application can handle that, then that is no different from NetworkManager reconfiguring what it wants at undefined times.

I don't think this worked reliably in the past either, because the behavior was ill-defined. You probably had to be ready to handle the case that NetworkManager would restore an address/route that you just deleted. For example, imagine the route comes from DHCP and NM configures a route. You delete the route externally and NM on rhel-9.2 would not restore it. Then the DHCP server crashes, the lease times out and NM forgets about the route. Later DHCP works again, and NM would restore the route again.

Note that NetworkManager will not fight hard to restore the route. If you delete it, NM will only restore it when it thinks it should configure the route again. So you could just delete it again then. That's like in the past, where this also didn't reliably work, and you had to be ready to take that action over and over.

I think a much better solution is to not delete routes that NetworkManager adds, but instead only `ip route append` routes with a lower metric.

related commit https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/7ca95cee15b32af2452aaf4a165eb5c634fba132

Hi Oyvind,

From comment 1, it looks like restoring routes when they were removed externally is expected in RHEL-9.3. Does the proposal of not deleting routes that NetworkManager adds, but instead using "ip route append", work for you?

Thanks

The resource agent logic depends too heavily on the replace/change logic, and that's not something we want to change, as it might lead to all kinds of other issues with or without NM. Also, we shouldn't change how these things work in the middle of a release.

Also, my concern is that we are making a significant change in the default behavior of NetworkManager in a minor release. This may have the potential to break customer scripts.

Hi Oyvind, Chris,

I understand your concerns about the changes to NetworkManager's default behavior in the minor release and how it might affect the resource agent logic. To address this, a member of our team is going to investigate and propose a solution. We are also considering your suggestions and the potential to revert the changes. I will let you know in this bz about the progress. Do you have a specific timeline you are working with where you would expect a resolution to this issue?

Thanks

Hi Oyvind,

in the bug description I see that the agent is replacing a route received via DHCP: `ip route change to default via 10.37.167.254 proto dhcp metric 100 src 10.37.165.110`

My understanding is that even on RHEL 9.2, NetworkManager re-adds the old route again on the next DHCP lease renewal, and the result is the same route duplication reported in comment 0. How does the agent handle that scenario? Also, can you attach a full log of NetworkManager from startup?

The IP was added by IPaddr2 agent (which uses ip addr add), and the issue is new in 9.3. It has worked fine in RHEL7 (and probably earlier) up to 9.2.

(In reply to Oyvind Albrigtsen from comment #8)
> The IP was added by IPaddr2 agent (which uses ip addr add), and the issue is new in 9.3. It has worked fine in RHEL7 (and probably earlier) up to 9.2.

The change done in 9.3 is to restore routes configured in NM if they are removed externally.
From what I understand, the agent is adding addresses and routes via 'ip' and not through NM, so I am struggling to understand how the change is affecting your scenario. Please attach a full log since NM startup (at trace level) so that the issue can be reproduced and investigated.

Created attachment 1976653
first try of reproducer
According to the previous NM logs, the reporter is trying to add another static IP in the same CIDR as the one retrieved from DHCP, and to replace both the direct route and the default gateway route so that they use this new static address as the source address.
Using attached reproducer `bug_2218866.sh`:
NetworkManager-1.43.11-32486.copr.13d4d4c35c.el9.x86_64 adds the DHCP route back when it receives an IPv6 router advertisement.
NetworkManager-1.42.2-6.el9_2 only adds the DHCP route back on the next DHCP request. In the reproducer, we set the DHCP lease to 120 seconds, so we see the DHCP route back after 60 seconds.
The reason the reporter thinks RHEL 9.2 does not have this problem is that they have a long DHCP lease time of 86400 seconds (24 hours).
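The two lease times can be compared with a quick calculation. This is a minimal sketch that assumes the client renews halfway through the lease, which is the RFC 2131 default for the T1 timer. The helper name `renewal_delay_seconds` is made up for illustration, and the sketch says nothing about how NetworkManager's DHCP client actually schedules renewals.

```python
def renewal_delay_seconds(lease_seconds: int, t1_fraction: float = 0.5) -> float:
    """Seconds until the first renewal attempt.

    Assumes the RFC 2131 default of T1 = 0.5 * lease time; a DHCP
    server may override T1, which this sketch ignores.
    """
    return lease_seconds * t1_fraction


# Lease times taken from this report: the reproducer (120 s) and the
# reporter's environment (86400 s).
for lease in (120, 86400):
    delay = renewal_delay_seconds(lease)
    print(f"lease {lease} s -> first renewal after {delay:.0f} s")
```

Under that assumption the 120-second lease renews after 60 seconds, matching the reproducer, while the 86400-second lease renews after 43200 seconds (12 hours), so the restored route shows up much later on RHEL 9.2.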
> The reason the reporter thinks RHEL 9.2 does not have this problem is that they have a long DHCP lease time of 86400 seconds (24 hours).

Right, that also matches my analysis and what I said in comment 7. On RHEL 9.2 the DHCP route is being restored after 12 hours (on lease renewal), while on 9.3 it's restored when a new IPv6 router advertisement is received, after a few seconds.

In any case, I think we should restore the old behavior because it's risky to introduce such changes in a minor RHEL release. At the same time, it seems to me that the issue that the IPsrcaddr resource agent is having on 9.3 is also present on 9.2, albeit with a different frequency.

Oyvind, do you know how the agent handles the fact that NM restores the route on DHCP renewal (also in 9.2)?

(In reply to Beniamino Galvani from comment #12)
> > The reason the reporter thinks RHEL 9.2 does not have this problem is that they have a long DHCP lease time of 86400 seconds (24 hours).

You are right, we reproduced the problem also in RHEL-9.2 after 12+ hours. Also, the ocf:heartbeat:IPsrcaddr resource agent man page recommends not using the agent on a DHCP-enabled interface. So we ran the tests with DHCP disabled for 20+ hours without any problem.

(In reply to Martin Juricek from comment #13)
> You are right, we reproduced the problem also in RHEL-9.2 after 12+ hours.
> Also the ocf:heartbeat:IPsrcaddr resource agent man page recommends not to
> use the agent on dhcp enabled interface. So we ran the tests with dhcp
> disabled for 20+ hours without any problem.

In the light of this, do you consider the bz still valid? Is the scenario from comment 0 taken from a test case, or was the problem found in a real use case?