Bug 2037411

Summary: nmstatectl removes routes of unmanaged interfaces (NM process hanging/blocked?)
Product: Red Hat Enterprise Linux 8
Component: NetworkManager
Version: 8.4
Target Release: 8.6
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Radim Hrazdil <rhrazdil>
Assignee: Thomas Haller <thaller>
QA Contact: Matej Berezny <mberezny>
CC: bgalvani, ellorent, ferferna, fpokryvk, jiji, jishi, lrintel, mberezny, network-qe, phoracek, rkhan, sukulkar, thaller, till
Target Milestone: rc
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: NetworkManager-1.36.0-0.6.el8
Doc Type: No Doc Update
Last Closed: 2022-05-10 14:55:01 UTC
Type: Bug
Attachments:
- logs and config
- eth0 info + logs
- logfile showing hanging of NM
- strace of NetworkManager, hanging to wait for RTM_GETROUTE netlink response

Description Radim Hrazdil 2022-01-05 15:26:35 UTC
Created attachment 1849046 [details]
logs and config


Description of problem:
Applying a desired state:
interfaces:
- name: eth0
  state: up
  type: ethernet

causes routes whose next hop is an unmanaged interface to be discarded.
This makes the node unreachable.

Rollback fails on timeout:
2022-01-05 15:16:52,343 root         ERROR    Rollback failed with error Checkpoint rollback failed: error=g-io-error-quark: Timeout was reached (24)

Version-Release number of selected component (if applicable):
nmstate 1.2.0
CentOS Stream release 8

How reproducible:
100%

See the attachment for:
routes before and after applying the desired state
nmstatectl show output before applying the desired state
nmstatectl output of desired state configuration
NetworkManager config file for unmanaged interfaces

How to reproduce it with kubevirtci:

# First start it up and log in via the serial console
git clone https://github.com/kubevirt/kubevirtci
cd kubevirtci
export KUBEVIRTCI_TAG=2201130628-f543542
export KUBEVIRT_PROVIDER=k8s-1.22
export KUBEVIRTCI_RUNTIME=podman
podman system service -t 0 unix:///${HOME}/podman.sock
make cluster-up
docker exec -it k8s-1.22-node01 screen /dev/pts/2 # pts number can be different
# User and password is vagrant:vagrant
dnf install -y nmstate
cat <<EOF | sudo nmstatectl apply
interfaces:
  - name: eth0
    type: ethernet
    state: up 
    ipv4: 
      enabled: true
      dhcp: true
EOF

Comment 1 Radim Hrazdil 2022-01-05 15:38:52 UTC
Created attachment 1849048 [details]
nmstatectl_show_after

Also adding the output of nmstatectl show after applying the desired state,
to highlight that the eth0 interface has IPv4 and IPv6 disabled after the config was applied.
The IP of eth0 was lost, probably as a consequence of that.

Comment 2 Gris Ge 2022-01-07 10:03:05 UTC
Acceptance criteria: Nmstate should not change existing routes beyond what the desired state specifies.

Comment 3 Gris Ge 2022-01-07 13:05:45 UTC
Hi Radim,

The first log shows you enabled DHCP in NetworkManager on eth0, but still have a static IP and routes on it.
When you apply the empty state, nmstate just uses DHCP and removes all static routes and IPs.

The second log shows the IP is disabled.

Could you provide a server access or more background on this eth0 interface?

Comment 4 Radim Hrazdil 2022-01-07 14:06:38 UTC
Created attachment 1849460 [details]
eth0 info + logs

Hello Gris,

We have tried multiple other nmstate versions and we're observing similar behaviour with older versions as well, so it's possible that this is not caused by a change in nmstate.
However, we've been using this desired state in our test teardown for a while, to ensure eth0 is up and other interfaces are down, and only now are we seeing issues.

Also, it doesn't only happen with empty state, the same behaviour is observed when applying:

interfaces:
- name: eth0
  type: ethernet
  state: up
  ipv4:
    dhcp: true
    enabled: true
which should be a no-op,

or 
interfaces:
- name: eth0
  type: ethernet
  state: up
  ipv4:
    enabled: true
    addresses:
    - ip: 192.168.66.102
      prefix-length: 25



It seems that nmstate drops the routes before applying the desiredState:
2022-01-07 13:41:19,937 root         DEBUG    The current route {'table-id': 254, 'destination': 'fd10:244::7200/128', 'next-hop-interface': 'cali201d44a2450', 'next-hop-address': '::', 'metric': 1024} has been discarded due to Route {'table-id': 254, 'metric': 1024, 'destination': 'fd10:244::7200/128', 'next-hop-address': '::', 'next-hop-interface': 'cali201d44a2450'} next hop to down/absent interface
2022-01-07 13:41:19,937 root         DEBUG    The current route {'table-id': 254, 'destination': 'fd10:244::bac0/122', 'next-hop-interface': 'eth0', 'next-hop-address': 'fd00::103', 'metric': 1024} has been discarded due to Route {'table-id': 254, 'metric': 1024, 'destination': 'fd10:244::bac0/122', 'next-hop-address': 'fd00::103', 'next-hop-interface': 'eth0'} next hop to interface with dynamic IP
2022-01-07 13:41:19,937 root         DEBUG    The current route {'table-id': 254, 'destination': 'fd10:244::c480/122', 'next-hop-interface': 'eth0', 'next-hop-address': 'fd00::101', 'metric': 1024} has been discarded due to Route {'table-id': 254, 'metric': 1024, 'destination': 'fd10:244::c480/122', 'next-hop-address': 'fd00::101', 'next-hop-interface': 'eth0'} next hop to interface with dynamic IP
2022-01-07 13:41:19,937 root         DEBUG    The current route {'table-id': 254, 'destination': 'fd10:244::d840/122', 'next-hop-interface': 'eth0', 'next-hop-address': 'fd00::106', 'metric': 1024} has been discarded due to Route {'table-id': 254, 'metric': 1024, 'destination': 'fd10:244::d840/122', 'next-hop-address': 'fd00::106', 'next-hop-interface': 'eth0'} next hop to interface with dynamic IP
2022-01-07 13:41:19,938 root         DEBUG    The current route {'table-id': 254, 'destination': '10.244.114.0/26', 'next-hop-interface': '', 'next-hop-address': '0.0.0.0', 'metric': 0} has been discarded due to Route entry does not have next hop interface
2022-01-07 13:41:19,938 root         DEBUG    The current route {'table-id': 254, 'destination': '10.244.114.1/32', 'next-hop-interface': 'cali201d44a2450', 'next-hop-address': '0.0.0.0', 'metric': 0} has been discarded due to Route {'table-id': 254, 'metric': 0, 'destination': '10.244.114.1/32', 'next-hop-address': '0.0.0.0', 'next-hop-interface': 'cali201d44a2450'} next hop to down/absent interface
2022-01-07 13:41:19,938 root         DEBUG    The current route {'table-id': 254, 'destination': '10.244.196.128/26', 'next-hop-interface': 'tunl0', 'next-hop-address': '192.168.66.101', 'metric': 0} has been discarded due to Route {'table-id': 254, 'metric': 0, 'destination': '10.244.196.128/26', 'next-hop-address': '192.168.66.101', 'next-hop-interface': 'tunl0'} next hop to unknown interface
2022-01-07 13:41:19,939 root         DEBUG    The current route {'table-id': 254, 'destination': '10.244.216.64/26', 'next-hop-interface': 'tunl0', 'next-hop-address': '192.168.66.106', 'metric': 0} has been discarded due to Route {'table-id': 254, 'metric': 0, 'destination': '10.244.216.64/26', 'next-hop-address': '192.168.66.106', 'next-hop-interface': 'tunl0'} next hop to unknown interface



Notice that some of the routes are dropped due to 'next hop to unknown interface'; these are the cali and tunl0 interfaces, which are configured to be unmanaged.



Additionally, we've noticed that after applying the desired state, `nmcli con show` is empty, as if the eth0 profile was removed. But after manually assigning the eth0 device an IP address (with ip addr add), the profile is restored.
I'm providing artifacts collected applying the first desiredState, along with complete info of eth0 before applying the state and NetworkManager trace logs.


We don't have an accessible environment, but the environment can be easily spun up using
$ make cluster-up
in the nmstate/kubernetes-nmstate repo.
Then nmstate can be installed on the nodes by running
$ ./cluster/ssh.sh node01
$ sudo dnf install nmstate

Comment 7 Quique Llorente 2022-01-10 09:49:20 UTC
Looks like downgrading from NetworkManager 1.36 to NetworkManager 1.34 fixes the issue.

Comment 8 Gris Ge 2022-01-10 12:56:10 UTC
Hi Thomas,

Could you take a look at the debug log of NetworkManager in comment #5?
It sounds like an NM regression to me.

Comment 9 Thomas Haller 2022-01-10 14:11:08 UTC
The NetworkManager log from comment 4 is without `level=TRACE`. Please always collect logs with debug logging enabled.

From the log of comment 5 it's not clear to me what might be wrong there. The log shows no changes to *other* devices ("unmanaged interfaces") once nmstate does its actions. Could you give some help reading this?

Which exactly are the "unmanaged interfaces"?
Which routes are we talking about? Was it:

  10.244.196.128/26 via 192.168.66.101 dev tunl0 proto bird onlink

Comment 10 Gris Ge 2022-01-10 14:16:52 UTC
During the pair debugging, we noticed `eth0` has DHCP with DNS and gateway. The state for nmstatectl to apply says just the same: eth0 with DHCP. Somehow NetworkManager gets stuck and nmstate fails to find out whether DHCP is enabled on eth0, hence the rollback, but the rollback also hangs with a timeout.

Comment 11 Quique Llorente 2022-01-10 14:47:23 UTC
(In reply to Thomas Haller from comment #9)
> The NetworkManager log from comment 4 is without `level=TRACE`. Please
> always collect logs with debug logging enabled.

The debug logs are here https://bugzilla.redhat.com/attachment.cgi?id=1849805

> 
> From the log of comment 5 it's not clear to me what might be wrong there.
> The log shows no changes to *other* devices ("unmanaged interfaces") once
> nmstate does its actions. Could you give some help reading this?

There are some routes configured by calico over the unmanaged interfaces "cali*" and "tunl*", so nmstate ignores those routes.

> 
> Which exactly are the "unmanaged interfaces"?

That's on the config https://bugzilla.redhat.com/attachment.cgi?id=1849046

> Which routes are we talking about? Was it:
> 
>   10.244.196.128/26 via 192.168.66.101 dev tunl0 proto bird onlink

Comment 12 Thomas Haller 2022-01-10 16:14:23 UTC
Created attachment 1849890 [details]
logfile showing hanging of NM

The problem seems entirely different from what is described here. At least, the issue I encountered was different, and until that issue is solved there is no point in investigating the one described here (probably they are the same anyway).

It seems, `nmcli` hangs (and any D-Bus requests hang).

Attached is a logfile:


Jan 10 15:52:10 node01 NetworkManager[8331]: <trace> [1641829930.4111] platform-linux: event-notification: RTM_NEWROUTE, flags multi, seq 25: ignore
Jan 10 15:52:10 node01 NetworkManager[8331]: <trace> [1641829930.4111] platform-linux: event-notification: RTM_NEWROUTE, flags multi, seq 25: ignore
Jan 10 15:52:10 node01 NetworkManager[8331]: <trace> [1641829930.4111] platform-linux: netlink: read: wait for ACK for sequence number 26...
...
Jan 10 15:54:10 node01 NetworkManager[8331]: <trace> [1641830050.9556] platform-linux: netlink: read: wait for ACK for sequence number 26...


Note that the service hangs here for 120 seconds.

Comment 13 Thomas Haller 2022-01-12 20:36:05 UTC
(In reply to Thomas Haller from comment #12)
> Created attachment 1849890 [details]
> logfile showing hanging of NM

> Note that the service hangs here for 120 seconds.

The process hangs for 120 seconds, 3 times.
Incidentally, that is also how long the process had been running when it started waiting for the first time.

Looking at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.35.3-dev/src/libnm-platform/nm-linux-platform.c#L9295 ,
it would seem that "next.now_ns" is zero (or close to zero) at that point (note that this is a release build, so above `nm_assert(next.now_ns > 0)` is disabled).

If you look at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/1.35.3-dev/src/libnm-platform/nm-linux-platform.c#L5960, that happens because `now_ns` is int (32bit), and assigning a nanoseconds timestamp overflows every two seconds. That problem is fixed by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/65cdbd355fdce5867182e89a389082e3f788bf2f


But that was only responsible for calculating much too long timeouts (and for not correctly detecting when the timeout hit).

The other (still unfixed) problem is that the netlink message that we expect (with a certain sequence number) is not coming...

Comment 14 Quique Llorente 2022-01-13 08:49:11 UTC
@thaller added some steps to reproduce it with kubevirtci, let me know if it "works" for you.

Comment 15 Thomas Haller 2022-01-13 13:46:44 UTC
investigating...

Comment 16 Thomas Haller 2022-01-13 16:02:12 UTC
With version 1.35.4 (on RHEL 8, version 1.36.0-0.4), the miscalculation of the timeout got fixed ([1]).
That makes the problem seemingly go away, by reducing the hang to an (unnoticeable) 200 msec.


There is still a problem that the netlink message got lost.

I am pretty sure that is related to the BPF filter, because I could reproduce it (sometimes) with the BPF filter enabled but could never reproduce it with the BPF filter disabled.
(Technically, after disabling the BPF filter, I re-enabled it again and could no longer reproduce it before reprovisioning the machine. So there is a small chance that it's unrelated.)

[1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/65cdbd355fdce5867182e89a389082e3f788bf2f
[2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/e9ca5583e5f37f922e0b000a8cff2c2813ce2e69

Comment 17 Thomas Haller 2022-01-13 20:53:56 UTC
Created attachment 1850706 [details]
strace of NetworkManager, hanging to wait for RTM_GETROUTE netlink response

You can see we send an RTM_GETROUTE request with seq=191, but don't get a response (and time out during poll()).

It's most likely caused by the BPF socket filter.

Comment 18 Thomas Haller 2022-01-14 08:27:04 UTC
I don't think that the BPF filter is gonna work.

To my understanding, on the netlink socket, the BPF filter is called for each message (as we recvmsg() it).
One RTM_NEWROUTE message usually contains many routes ...


for example:

`make src/core/platform/tests/monitor && strace -s 10000 -tt src/core/platform/tests/monitor --no-persist | grep RTM_NEWROUTE`

gives:

09:21:04.538924 recvmsg(8, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 600], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("wlan0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 20100], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_BIRD, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("1.2.3.4")], 
[{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=8, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_BOOT, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT
........
t_addr("192.168.23.138")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_BROADCAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL], [{nla_len=8, nla_type=RTA_DST}, inet_addr("192.168.23.255")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("192.168.23.107")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("wlan0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL], [{nla_len=8, nla_type=RTA_DST}, inet_addr("192.168.122.1")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("192.168.122.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("virbr0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=101343}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_BROADCAST, rtm_flags=0x10 /* RTM_F_??? */}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL], [{nla_len=8, nla_type=RTA_DST}, inet_addr("192.168.122.255")], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("192.168.122.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("virbr0")]]]], iov_len=32768}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=0, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, 0) = 1688




I also confirmed that with a patch:


diff --git i/src/libnm-platform/nm-netlink.c w/src/libnm-platform/nm-netlink.c
index 09369ef8d00a..f6bdaaaec255 100644
--- i/src/libnm-platform/nm-netlink.c
+++ w/src/libnm-platform/nm-netlink.c
@@ -1059,6 +1059,14 @@ nl_socket_set_rtprot_filter(struct nl_sock *sk)
         BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, htons(RTM_DELROUTE), 1, 0),
         BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
 
+        /* A <- len */
+        BPF_STMT(BPF_LD | BPF_W | BPF_LEN, 0),
+
+        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, 1000, 0, 1),
+        BPF_STMT(BPF_RET | BPF_K, 1000),
+
+        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
+
         /*
          * if (rtm->rtm_protocol <= RTPROT_STATIC)
          *     return KEEP;


to truncate messages to 1000 bytes... and the messages get truncated, showing that the message as seen by the BPF filter is 1688 bytes (in the above example).



This means the BPF filter only looks at the very first `nlmsghdr` (`nlmsg_type=RTM_NEWROUTE`) and makes the decision to drop the entire message based on the protocol of the first route.


I don't think that we can use BPF for this kind of filtering. We would have to check that *all* routes in the skb have the unwanted protocol, and as we cannot have loops in BPF, that seems almost impossible (well, we could have a bounded, unrolled loop, iterate over a certain number of nlmsghdrs, and if we encounter more nlmsghdrs or if any of them contains a route we want to keep, let the message through). But that seems rather complicated.


Opinions?

Comment 19 Thomas Haller 2022-01-14 08:34:58 UTC
I also tested:


diff --git i/src/libnm-platform/nm-netlink.c w/src/libnm-platform/nm-netlink.c
index 09369ef8d00a..faba4186d059 100644
--- i/src/libnm-platform/nm-netlink.c
+++ w/src/libnm-platform/nm-netlink.c
@@ -1059,25 +1059,17 @@ nl_socket_set_rtprot_filter(struct nl_sock *sk)
         BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, htons(RTM_DELROUTE), 1, 0),
         BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
 
-        /*
-         * if (rtm->rtm_protocol <= RTPROT_STATIC)
-         *     return KEEP;
-         */
         BPF_STMT(BPF_LD | BPF_ABS | BPF_B,
                  sizeof(struct nlmsghdr) + offsetof(struct rtmsg, rtm_protocol)),
-        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, RTPROT_STATIC, 1, 0),
-        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
 
         /*
-         * if (rtm->rtm_protocol == RTPROT_DHCP || rtm->rtm_protocol == RTPROT_RA)
-         *     return KEEP;
+         * if (rtm->rtm_protocol == RTPROT_DHCP)
+         *     return DROP;
          */
-        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, RTPROT_DHCP, 1, 0),
-        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, RTPROT_RA, 0, 1),
-        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
-
-        /* return DROP; */
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, RTPROT_KERNEL, 0, 1),
         BPF_STMT(BPF_RET | BPF_K, 0),
+
+        BPF_STMT(BPF_RET | BPF_K, 0xFFFF),
     };
     struct sock_fprog prog = {
         .len    = sizeof(filter) / sizeof(filter[0]),




which drops the message if protocol is KERNEL. In that case, 

`make src/core/platform/tests/monitor && strace -s 10000 -tt src/core/platform/tests/monitor --no-persist`

receives the message with rtm_protocol=RTPROT_KERNEL, because the first route is RTPROT_DHCP:


09:31:41.830589 recvmsg(8, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 600], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("wlan0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 20100], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("192.168.23.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_BIRD, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("1.2.3.4")], 
[{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=8, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_BOOT, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.0.0.0")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.40.192.1")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("tun9")]]], [{nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=4, nlmsg_pid=105031}, {rtm_family=AF_INET, rtm_dst_len=20, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SC...



If I change the patch to reject RTPROT_DHCP, then the entire message is rejected (including the routes of type RTPROT_KERNEL).

Comment 20 Beniamino Galvani 2022-01-14 08:44:06 UTC
In the BPF filter, can we check whether the message is multipart and accept it in that case, while keeping the filtering for single messages?

I expect that when routes are added externally one by one, the messages will be non-multipart, so there is still an advantage in having the BPF filter. Is this correct?

For netlink dumps, we get a multipart message as reply and it will have to be processed by NM.

Comment 21 Beniamino Galvani 2022-01-24 15:19:16 UTC
I verified that NM no longer hangs waiting for the route dump. I used these commands:

    # activate one
    nmcli connection add type dummy con-name dummy1+ ifname dummy1 ip4 172.26.1.1/24 autoconnect no
    nmcli connection up dummy1+

    # add route
    ip route add blackhole 172.25.1.0/24 proto bird

    # deactivate all
    nmcli -g uuid connection show --active | xargs nmcli -w 10 connection down

    # inspect logs
    journalctl -u NetworkManager -e -n 10 | cat
    echo --------------
journalctl -u NetworkManager -e -n 1000 | grep "wait for ACK"

NM 1.35.3 hangs forever waiting for the dump:

-- Journal begins at Tue 2022-01-18 21:51:38 CET, ends at Mon 2022-01-24 16:10:44 CET. --
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3654] platform-linux: netlink: recvmsg: new message RTM_DELROUTE, flags 0, seq 0
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3654] platform-linux: event-notification: RTM_DELROUTE, flags 0, seq 0: ignore
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3654] platform-linux: netlink: recvmsg: new message NLMSG_ERROR, flags 0x0100, seq 54
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3654] platform-linux: delayed-action: complete wait-for-nl-response (seq 54, timeout in 0.199819060, response-type 0, success)
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3654] platform-linux: delayed-action: handle refresh-all-ip4-routes
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3655] platform-linux: delayed-action: handle (do-request-all) refresh-all-ip4-routes
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3655] platform-linux: delayed-action: schedule wait-for-nl-response (seq 55, timeout in 0.199998418, response-type 1)
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3655] platform-linux: delayed-action: handle wait-for-nl-response (any)
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3655] platform-linux: netlink: read: wait for ACK for sequence number 55...
Jan 24 16:10:39 localhost NetworkManager[3702586]: <trace> [1643037039.4628] platform-linux: netlink: read: wait for ACK for sequence number 55...
--------------
Jan 24 16:10:26 localhost NetworkManager[3702586]: <trace> [1643037026.3655] platform-linux: netlink: read: wait for ACK for sequence number 55...
Jan 24 16:10:39 localhost NetworkManager[3702586]: <trace> [1643037039.4628] platform-linux: netlink: read: wait for ACK for sequence number 55...


NM 1.35.5 hangs for 200ms:

-- Journal begins at Tue 2022-01-18 21:51:38 CET, ends at Mon 2022-01-24 16:12:30 CET. --
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9610] platform-linux: netlink: recvmsg: new message NLMSG_ERROR, flags 0, seq 61
Jan 24 16:12:30 localhost NetworkManager[3703763]: <debug> [1643037150.9610] platform-linux: netlink: recvmsg: error message from kernel: Device or resource busy (16) for request 61
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9610] platform-linux: netlink: recvmsg: new message NLMSG_DONE, flags multi, seq 60
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9610] platform-linux: delayed-action: complete wait-for-nl-response (seq 60, timeout in 0.199722987, response-type 1, success)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9611] platform-linux: delayed-action: complete wait-for-nl-response (seq 61, timeout in 0.199844073, response-type 1, failure 16 (Device or resource busy))
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9611] platform-linux: delayed-action: schedule wait-for-nl-response (seq 62, timeout in 0.199995862, response-type 1)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9611] platform-linux: delayed-action: handle (do-request-all) refresh-all-routing-rules-ip6
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9611] platform-linux: netlink: recvmsg: new message RTM_NEWRULE, flags multi, seq 62
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9612] platform-linux: event-notification: RTM_NEWRULE, flags multi, seq 62, in-dump: [4] 0: from all lookup 255 protocol 2
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9612] platform-linux: netlink: recvmsg: new message RTM_NEWRULE, flags multi, seq 62
--------------
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.2763] platform-linux: netlink: read: wait for ACK for sequence number 52... (199 msec)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.4757] platform-linux: netlink: read: wait for ACK for sequence number 52... (1 msec)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.4817] platform-linux: netlink: read: wait for ACK for sequence number 54... (199 msec)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.6810] platform-linux: netlink: read: wait for ACK for sequence number 54... (1 msec)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.7391] platform-linux: netlink: read: wait for ACK for sequence number 56... (199 msec)
Jan 24 16:12:30 localhost NetworkManager[3703763]: <trace> [1643037150.9387] platform-linux: netlink: read: wait for ACK for sequence number 56... (1 msec)


On git main there is no such delay:

-- Journal begins at Wed 2022-01-19 09:23:29 CET, ends at Mon 2022-01-24 16:13:48 CET. --
Jan 24 16:13:48 localhost NetworkManager[3704928]: <trace> [1643037228.3666] l3cfg[88c4234bf456a872,ifindex=47165]: emit signal (platform-change-on-idle, obj-type-flags=0x2)
Jan 24 16:13:48 localhost NetworkManager[3704928]: <trace> [1643037228.3666] dbus-object[c87b674d487d4606]: unexport: "/org/freedesktop/NetworkManager/IP6Config/9"
Jan 24 16:13:48 localhost NetworkManager[3704928]: <trace> [1643037228.3666] dbus-object[595a4d9338f5d323]: unexport: "/org/freedesktop/NetworkManager/IP4Config/9"
Jan 24 16:13:48 localhost NetworkManager[3704928]: <trace> [1643037228.3667] l3cfg[88c4234bf456a872,ifindex=47165]: finalized
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3667] device[117914a30c7be164] (dummy1): remove_pending_action (0): 'autoactivate'
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3751] dispatcher: (24) /etc/NetworkManager/dispatcher.d/10-save-env.sh succeeded
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3752] dispatcher: (24) /etc/NetworkManager/dispatcher.d/10-update-rc.sh succeeded
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3752] dispatcher: (24) /usr/lib/NetworkManager/dispatcher.d/11-dhclient succeeded
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3752] dispatcher: (24) /usr/lib/NetworkManager/dispatcher.d/20-chrony-dhcp succeeded
Jan 24 16:13:48 localhost NetworkManager[3704928]: <debug> [1643037228.3752] dispatcher: (24) /usr/lib/NetworkManager/dispatcher.d/20-chrony-onoffline succeeded
--------------

Comment 26 errata-xmlrpc 2022-05-10 14:55:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:1985