Bug 2048988

| Summary: | NNCP deployment fails on applying ipv6 routes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Adi Zavalkovsky <azavalko> |
| Component: | nmstate | Assignee: | Gris Ge <fge> |
| Status: | CLOSED ERRATA | QA Contact: | Mingyu Shi <mshi> |
| Severity: | unspecified | Priority: | urgent |
| Version: | 8.4 | CC: | amalykhi, ellorent, ferferna, fge, jiji, jishi, mshi, network-qe, phoracek, rnetser, thaller, till |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | 8.6 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | nmstate-1.2.1-1.el8 | Doc Type: | No Doc Update |
| Clones: | 2053027, 2054053, 2054054 (view as bug list) | | |
| Last Closed: | 2022-05-10 13:34:48 UTC | Type: | Bug |
| Bug Blocks: | 2054053, 2054054 | | |
Update - On a new run, the node which previously had the changes applied successfully is also failing. It seems that deploying bridges behind a node's primary iface fails on BM nodes when applying IPv6 routes.

Scrubbing: We can drop IPv6 from our current test plans; it is not important for the use case of the epic. Could you try it again without the IPv6 portion? Then we should follow up with development to see whether the issue described here should be solved in nmpolicy or nmstate.

Hi Adi, I failed to reproduce this in my VM, and the log does not show the root cause. Could you ping me when you have time so that I can do a live debug? Thank you!

Sure, @fge, taking this privately.

After applying the desired state via nmstatectl with the `--no-commit --no-verify` arguments, I got:
```
default proto static metric 102 pref medium
nexthop via fe80::c242:d000:645f:92a0 dev eno1 weight 1
nexthop via fe80::c242:d000:645f:92a0 dev capture-br1 weight 1
[root@cnv-qe-10 /]# ip addr show eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master capture-br1 state UP group default qlen 1000
link/ether e4:43:4b:57:47:50 brd ff:ff:ff:ff:ff:ff
```
So the root causes are:
1. NetworkManager did not remove the route entry of eno1 when attaching eno1 to the bridge.
2. Nmstate is ignoring multipath routes.
Let me try to work around this in nmstate before asking NetworkManager to fix it.
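The nmstate-side workaround amounts to flattening each multipath route into one single-hop route entry per nexthop, so that the route is no longer silently ignored. A minimal sketch of that idea (the dictionary shape and function name are illustrative, not nmstate's actual API):

```python
# Sketch: expand a multipath route into one single-hop route entry per
# nexthop. This mirrors the idea of the nmstate workaround; all names
# here are illustrative, not nmstate's real code.

def expand_multipath(route):
    """Split a route carrying a 'nexthops' list into per-nexthop routes."""
    nexthops = route.get("nexthops")
    if not nexthops:
        return [route]  # ordinary single-hop route, keep as-is
    expanded = []
    for hop in nexthops:
        single = {k: v for k, v in route.items() if k != "nexthops"}
        single["next-hop-address"] = hop["via"]
        single["next-hop-interface"] = hop["dev"]
        expanded.append(single)
    return expanded


# The problematic default route from the `ip route` output above:
multipath = {
    "destination": "::/0",
    "metric": 102,
    "table-id": 254,
    "nexthops": [
        {"via": "fe80::c242:d000:645f:92a0", "dev": "eno1"},
        {"via": "fe80::c242:d000:645f:92a0", "dev": "capture-br1"},
    ],
}

for r in expand_multipath(multipath):
    print(r["destination"], "via", r["next-hop-address"],
          "dev", r["next-hop-interface"])
```

Each nexthop becomes a normal route entry that nmstate's existing route handling can represent.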
Problem reproduced on a ppc64le server with an i40e NIC using the yamls below. On the same server, if I use a veth NIC, no problem is found. In a VM, an e1000e NIC has no such problem. I will leave the remaining root-cause debugging to the NetworkManager team.

Nmstate workaround fixes:
* RHEL 8.6: https://github.com/nmstate/nmstate/pull/1800
* RHEL 8.5: https://github.com/nmstate/nmstate/pull/1802
* RHEL 8.4: https://github.com/nmstate/nmstate/pull/1801

first.yml
```
---
routes:
  config:
  - destination: ::/0
    metric: 102
    next-hop-address: fe80::c242:d000:645f:92a0
    next-hop-interface: enP2p1s0f1
    table-id: 254
interfaces:
- name: enP2p1s0f1
  type: ethernet
  state: up
  ipv4:
    enabled: true
    address:
    - ip: 10.9.96.49
      prefix-length: 24
    dhcp: false
  ipv6:
    enabled: true
    address:
    - ip: 2620:52:0:960:e643:4bff:fe57:4750
      prefix-length: 64
    - ip: fe80::e643:4bff:fe57:4750
      prefix-length: 64
    autoconf: false
    dhcp: false
```

second.yml
```
---
routes:
  config:
  - destination: ::/0
    metric: 102
    next-hop-address: fe80::c242:d000:645f:92a0
    next-hop-interface: capture-br1
    table-id: 254
interfaces:
- name: capture-br1
  type: linux-bridge
  state: up
  bridge:
    options:
      stp:
        enabled: false
    port:
    - name: enP2p1s0f1
      vlan: {}
  ipv4:
    enabled: true
    address:
    - ip: 10.9.96.49
      prefix-length: 24
    dhcp: false
  ipv6:
    enabled: true
    address:
    - ip: 2620:52:0:960:e643:4bff:fe57:4750
      prefix-length: 64
    - ip: fe80::e643:4bff:fe57:4750
      prefix-length: 64
    autoconf: false
    dhcp: false
```

The reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=2048988#c9 is incorrect. Still digging.

Created attachment 1859781 [details]
step1.yml
Created attachment 1859782 [details]
step2.yml
To reproduce this problem in a VM:
1. Use e1000e as the driver of the VM NIC interface. Assume it is named enp9s0.
2. Download the two yaml files above.
3. dnf remove NetworkManager-config-server -y
4. systemctl restart NetworkManager
5. nmstatectl apply step1.yml
6. nmstatectl apply step2.yml

Hi Petr, The root cause of this issue is that CNV does not install `NetworkManager-config-server` on the host. Placing this package in a container does not help. Please contact the responsible team to include that rpm in the host environment where the NM daemon runs. This rpm only contains a config file for NetworkManager. If including an extra rpm is too much for the host environment, placing the file below in /etc/NetworkManager/conf.d also helps:

    [main]
    no-auto-default=*
    ignore-carrier=*

All nmstate CI assumes NetworkManager-config-server is installed; without it, NetworkManager automatically creates a profile for each newly discovered interface, which has proved to cause many problems in RHV. For the current bug, nmstate can work around it by supporting multipath routes, but including the above rpm can solve other potential problems.

Gris, thanks a lot for digging into this. This is very helpful. @ellorent would you please follow up with a bug on RHCOS, asking for the config-server RPM, explaining the motivation that Gris described above?

The original problem could be solved by multiple approaches:
A. nmstate supports multipath routes.
B. A default NetworkManager installation should remove multipath routes on an
interface which is attached to a bridge.
C. RHCOS includes the `NetworkManager-config-server` rpm in the host environment.

We will use this bug to track the effort for A): nmstate multipath route support
for RHEL 8.7. (RHEL 8.6 is in a late phase, and there is no valid use case
requiring multipath route support.)

The CNV team will work with RHCOS off-thread on option C). For RHEL 8.4, if the
nmstate workaround is still required, please leave a comment requesting a
zstream review.
Acceptance criteria of this bug:
* Given a freshly installed RHEL 8 system with multipath routes configured by:
  sudo ip route add 198.51.100.0/24 proto static scope global \
      nexthop via 192.0.2.254 dev eth1 weight 1 onlink \
      nexthop via 192.0.2.253 dev eth1 weight 256 onlink
  sudo ip -6 route add 2001:db8:e::/64 proto static scope global \
      nexthop via 2001:db8:f::254 dev eth1 weight 1 onlink \
      nexthop via 2001:db8:f::253 dev eth1 weight 256 onlink
* When the user installs nmstate and invokes `nmstatectl show`.
* Then nmstate should convert these multipath routes into several normal
  route entries.
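As a rough illustration of the expected conversion (a sketch only, not nmstate's implementation): the kernel reports a multipath route as one header line followed by indented `nexthop` lines, and each nexthop should surface as its own route entry in the `nmstatectl show` output:

```python
# Sketch: flatten `ip route show` multipath output into one route entry
# per nexthop, mirroring the acceptance criteria above.
# The parsing is deliberately simplified and illustrative only.

SAMPLE = """\
198.51.100.0/24 proto static scope global
\tnexthop via 192.0.2.254 dev eth1 weight 1 onlink
\tnexthop via 192.0.2.253 dev eth1 weight 256 onlink"""

def flatten_multipath(output):
    lines = output.splitlines()
    destination = lines[0].split()[0]  # destination prefix from the header line
    routes = []
    for line in lines[1:]:
        fields = line.split()
        routes.append({
            "destination": destination,
            "next-hop-address": fields[fields.index("via") + 1],
            "next-hop-interface": fields[fields.index("dev") + 1],
        })
    return routes

for route in flatten_multipath(SAMPLE):
    print(route)
```

Applied to the IPv4 command in the criteria, this yields two normal route entries for 198.51.100.0/24, one per nexthop; the IPv6 case is analogous.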
(In reply to Petr Horáček from comment #16)
> Gris, thanks a lot for digging into this. This is very helpful.
>
> @ellorent would you please follow-up with a bug on RHCOS, asking
> for the config-server RPM, explaining the motivation that Gris described
> above?

Let's see how it rolls: https://github.com/openshift/os/pull/705

Looks like we have to ask for it first at FCOS; the process is different now: https://github.com/coreos/fedora-coreos-tracker/issues/1094

Looks like we are going to do this with Ignition/MachineConfig: https://github.com/openshift/os/pull/705#issuecomment-1032714634

Let's try to add a MachineConfig to the kubernetes-nmstate operator: https://github.com/coreos/fedora-coreos-tracker/issues/1094#issuecomment-1032758020

Looks like multipath support is going to land in nmstate soon (https://github.com/nmstate/nmstate/pull/1800); maybe we should wait for it instead of configuring NetworkManager. What do you think @gris?

I agree with Gris on the issue. The NetworkManager part is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1837254. Yes, you can (and maybe should) configure `[main].no-auto-default=*` (NetworkManager-config-server package). That might avoid the conflict in this case, as NetworkManager then possibly does not activate the ethernet device (which ends up getting a conflicting IPv6 route). But it's not a fix.

@fge we need a backport of the fixes at https://errata.devel.redhat.com/advisory/86674 for 8.4.0.z.

Since we already have https://bugzilla.redhat.com/show_bug.cgi?id=1837254 tracking this issue, would it make sense to re-purpose this BZ for CNV? We are not done with the investigation yet, but currently my understanding is the following:
* This is not a regression. The issue is present in production today. We have hit it only now because our existing CI was not testing a bridge created on the default NIC.
* This is not related to nmpolicy.
I would suggest we move this to CNV, target it to 4.11 and document it as a known issue. What do you think @rnetser?

Hi Petr, Nmstate needs this bug for the 8.4.0 zstream approval review and the 8.6 efforts. Could you clone this bug to CNV?

Hi Mingyu, For the zstream review, please pre-test these scratch builds:
* RHEL 8.4: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42949638
* RHEL 8.5: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42949845

Sorry for cancelling the needinfo request for rnetser by mistake. @rnetser Please see #comment28

Thanks Mingyu. We met with Ruth and discussed this BZ offline. We are now trying to come up with a workaround that would allow us to continue with our feature at least on a tech-preview level.

(In reply to Gris Ge from comment #30)
> Hi Mingyu,
>
> For zstream review, please do pre-test for these scratch build:
>
> * RHEL 8.4:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42949638
> * RHEL 8.5:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=42949845

These two builds work well. But currently I fail to reproduce the problem with
(RHEL 8.4)
nmstate-1.0.2-16.el8_4.noarch
nispor-1.0.1-5.el8_4.x86_64
NetworkManager-1.30.0-13.el8_4.x86_64
or
(RHEL 8.5)
nmstate-1.1.0-5.el8_5.noarch
nispor-1.1.1-2.el8_5.x86_64
NetworkManager-1.32.10-4.el8.x86_64
though I was taking the same steps and using the same type of NIC as in https://bugzilla.redhat.com/show_bug.cgi?id=2048988#c31

Created attachment 1860913 [details]
Reproducer script
New reproducer script. It does not matter whether you have `NetworkManager-config-server` installed or not.
Created attachment 1860916 [details]
NetworkManager trace log
Trace log of NetworkManager, in case the NetworkManager developers would like to investigate further.
Hi Petr, Could your team test the 8.4.0.z official build nmstate-1.0.2-17.el8_4? Thank you!

(In reply to Gris Ge from comment #38)
> Created attachment 1860916 [details]
> NetworkManager trace log
>
> Trace log of NetworkManager in case NM dev would like to investigate more.

Yes, the problem is bug 1837254. Thanks for the log.

Hey Gris, we will soon be checking nmstate-1.0.2-18.el8.noarch.rpm, I suppose that would be enough? Thanks a lot for helping with this.

Verified with versions:
nmstate-1.2.1-1.el8.x86_64
nispor-1.2.3-1.el8.x86_64
NetworkManager-1.36.0-0.8.el8.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (nmstate bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:1772
Created attachment 1858288 [details] nns before bridge deployment

Description of problem:
When trying to deploy a bridge behind a primary interface on a BM node, it fails during route setting, which is odd since it does not fail on another node in the same cluster.

Version-Release number of selected component (if applicable):
nmcli tool, version 1.30.0-13.el8_4

How reproducible:
NNCE -

    desiredState:
      interfaces:
      - bridge:
          options:
            stp:
              enabled: false
          port:
          - name: eno1
            vlan: {}
        ipv4:
          address:
          - ip: 10.9.96.50
            prefix-length: 24
          dhcp: false
          enabled: true
        ipv6:
          address:
          - ip: 2620:52:0:960:e643:4bff:fe57:4a90
            prefix-length: 64
          - ip: fe80::e643:4bff:fe57:4a90
            prefix-length: 64
          autoconf: false
          dhcp: false
          enabled: true
        name: capture-br1
        state: up
        type: linux-bridge
      routes:
        config:
        - destination: 2620:52:0:960::/64
          metric: 104
          next-hop-address: '::'
          next-hop-interface: capture-br1
          table-id: 254
        - destination: fe80::/64
          metric: 104
          next-hop-address: '::'
          next-hop-interface: capture-br1
          table-id: 254
        - destination: ::/0
          metric: 104
          next-hop-address: fe80::c242:d000:645f:92a0
          next-hop-interface: capture-br1
          table-id: 254
        - destination: 0.0.0.0/0
          metric: 104
          next-hop-address: 10.9.96.254
          next-hop-interface: capture-br1
          table-id: 254
        - destination: 10.9.96.0/24
          metric: 104
          next-hop-address: 0.0.0.0
          next-hop-interface: capture-br1
          table-id: 254
        - destination: 10.9.96.0/24
          metric: 104
          next-hop-address: 0.0.0.0
          next-hop-interface: capture-br1
          table-id: 254

Actual results:

    error reconciling NodeNetworkConfigurationPolicy at desired state apply: , failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1'
    Unhandled AF_SPEC_BRIDGE_INFO 0 [2, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 1 [1, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 0 [2, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 1 [1, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 0 [2, 0]
    .........
    .........
    Unhandled AF_SPEC_BRIDGE_INFO 1 [1, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 0 [2, 0]
    Unhandled AF_SPEC_BRIDGE_INFO 1 [1, 0]
    libnmstate.error.NmstateVerificationError:
    desired
    =======
    ---
    routes:
      config:
      - destination: 0.0.0.0/0
        metric: 104
        next-hop-address: 10.9.96.254
        next-hop-interface: capture-br1
        table-id: 254
      - destination: 10.9.96.0/24
        metric: 104
        next-hop-address: 0.0.0.0
        next-hop-interface: capture-br1
        table-id: 254
      - destination: 2620:52:0:960::/64
        metric: 104
        next-hop-address: '::'
        next-hop-interface: capture-br1
        table-id: 254
      - destination: ::/0
        metric: 104
        next-hop-address: fe80::c242:d000:645f:92a0
        next-hop-interface: capture-br1
        table-id: 254
      - destination: fe80::/64
        metric: 104
        next-hop-address: '::'
        next-hop-interface: capture-br1
        table-id: 254
    current
    =======
    ---
    routes:
      config:
      - destination: 0.0.0.0/0
        metric: 104
        next-hop-address: 10.9.96.254
        next-hop-interface: capture-br1
        table-id: 254
      - destination: 10.9.96.0/24
        metric: 104
        next-hop-address: 0.0.0.0
        next-hop-interface: capture-br1
        table-id: 254
      - destination: 2620:52:0:960::/64
        metric: 104
        next-hop-address: '::'
        next-hop-interface: capture-br1
        table-id: 254
      - destination: fe80::/64
        metric: 104
        next-hop-address: '::'
        next-hop-interface: capture-br1
        table-id: 254
    difference
    ==========
    --- desired
    +++ current
    @@ -16,11 +16,6 @@
           next-hop-address: '::'
           next-hop-interface: capture-br1
           table-id: 254
    -    - destination: ::/0
    -      metric: 104
    -      next-hop-address: fe80::c242:d000:645f:92a0
    -      next-hop-interface: capture-br1
    -      table-id: 254
         - destination: fe80::/64
           metric: 104
           next-hop-address: '::'

Expected results:
NNCP applied successfully, like in

Additional info:
Attaching NetworkManager logs for successful and unsuccessful nodes, nns before deployment.
cnv-qe-10 - successful node
cnv-qe-11 - failing node