Bug 2173485

Summary: Bridge CNI ADD with spoof check takes a very long time when there are many Services in the cluster
Product: Container Native Virtualization (CNV)
Reporter: Jonathan Maxwell <jmaxwell>
Component: Networking
Assignee: Miguel Duarte Barroso <mduarted>
Status: CLOSED ERRATA
QA Contact: Yossi Segev <ysegev>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.10.0
CC: edwardh, egarver, germano, gveitmic, mduarted, phoracek, psutter, rgertzbe, todoleza, ysegev
Target Milestone: ---
Target Release: 4.13.1
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: cnv-containernetworking-plugins-rhel9 v4.13.1-2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-20 13:41:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jonathan Maxwell 2023-02-27 04:50:56 UTC
Description of problem:

A customer in an OpenShift Virtualization environment reported that a VM is very slow to start when there are 25K nftables chains.

Version-Release number of selected component (if applicable):

RHEL 8.6

How reproducible:

We can reproduce a case where an nft command takes ~48 seconds to complete.

Steps to Reproduce:
1. Run this script a few times:

# cat nft_add.sh
for a in {1..30000}
do
    nft add table inet filter$a
    nft add chain inet filter$a input { type filter hook input priority 0 \; }
done

2. There should now be ~36K lines in the ruleset listing:

# nft list ruleset|wc -l
36420

3. Then try to add a rule:

# time nft add rule ip filter output ip daddr 192.168.1.0/24 counter
Error: Could not process rule: No such file or directory
add rule ip filter output ip daddr 192.168.1.0/24 counter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

real	0m49.454s
user	0m8.915s
sys	0m40.407s

Actual results:

nft commands take 48 seconds to complete.

Expected results:

nft commands should return within a second.

Additional info:

It's spending a lot of time in the netlink routines:

nft 45862 [003]  2940.260199:     844862   cycles:
        ffffffffc1429105 nf_tables_dump_chains+0x65 (/lib/modules/4.18.0-372.26.1.el8_6.x86_64/kernel/net/netfilter/nf_tables.ko.xz)
        ffffffffa4c5221a netlink_dump+0x18a (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa4c52637 netlink_recvmsg+0x227 (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa4bb70b1 ____sys_recvmsg+0x91 (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa4bbaf7b ___sys_recvmsg+0x7b (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa4bbb044 __sys_recvmsg+0x54 (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa440430b do_syscall_64+0x5b (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
        ffffffffa4e000ad entry_SYSCALL_64_after_hwframe+0x65 (/usr/lib/debug/lib/modules/4.18.0-372.26.1.el8_6.x86_64/vmlinux)
            7f0dfc28d198 __libc_recvmsg+0x18 (/usr/lib64/libc-2.28.so)

Comment 3 Eric Garver 2023-02-27 14:04:08 UTC
This commit looks relevant:

  17297d1acbbf ("cache: Filter chain list on kernel side")

First upstream in nftables v1.0.2.

Comment 4 Phil Sutter 2023-02-28 14:40:03 UTC
(In reply to Eric Garver from comment #3)
> This commit looks relevant:
> 
>   17297d1acbbf ("cache: Filter chain list on kernel side")
> 
> First upstream in nftables v1.0.2.

This commit is part of a series resolving "overcaching" in different forms. The others are:

a37212f2fd907 ("cache: Filter tables on kernel side")
95781fcbddcd6 ("cache: Filter rule list on kernel side")

All these commits depend on a larger caching code refactoring though, not good with RHEL8.6.z in perspective. I'll see if reimplementing the feature in the old code-base is feasible.

Comment 5 Jonathan Maxwell 2023-03-03 04:26:30 UTC
(In reply to Phil Sutter from comment #4)
> (In reply to Eric Garver from comment #3)
> > This commit looks relevant:
> > 
> >   17297d1acbbf ("cache: Filter chain list on kernel side")
> > 
> > First upstream in nftables v1.0.2.
> 
> This commit is part of a series resolving "overcaching" in different forms.
> The others are:
> 
> a37212f2fd907 ("cache: Filter tables on kernel side")
> 95781fcbddcd6 ("cache: Filter rule list on kernel side")
> 
> All these commits depend on a larger caching code refactoring though, not
> good with RHEL8.6.z in perspective. I'll see if reimplementing the feature
> in the old code-base is feasible.

Hi Phil,

Thanks.

I take these commits are in RHEL 9.2 nft? 

Which has:

nftables-1.0.4-10.el9_1.x86_64

I tried testing on RHEL 9.2 but ran into:

https://bugzilla.redhat.com/show_bug.cgi?id=2173801
https://bugzilla.redhat.com/show_bug.cgi?id=2173764

Regards

Jon

Comment 7 Phil Sutter 2023-03-03 16:41:04 UTC
Hi Jon,

(In reply to Jonathan Maxwell from comment #5)
> (In reply to Phil Sutter from comment #4)
> > (In reply to Eric Garver from comment #3)
> > > This commit looks relevant:
> > > 
> > >   17297d1acbbf ("cache: Filter chain list on kernel side")
> > > 
> > > First upstream in nftables v1.0.2.
> > 
> > This commit is part of a series resolving "overcaching" in different forms.
> > The others are:
> > 
> > a37212f2fd907 ("cache: Filter tables on kernel side")
> > 95781fcbddcd6 ("cache: Filter rule list on kernel side")
> > 
> > All these commits depend on a larger caching code refactoring though, not
> > good with RHEL8.6.z in perspective. I'll see if reimplementing the feature
> > in the old code-base is feasible.
> 
> Hi Phil,
> 
> Thanks.
> 
> I take these commits are in RHEL 9.2 nft? 

Yes, they are.

> Which has:
> 
> nftables-1.0.4-10.el9_1.x86_64
> 
> I tried testing on RHEL 9.2 but ran into:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2173801
> https://bugzilla.redhat.com/show_bug.cgi?id=2173764

I tried on a beefy machine (128GB RAM); the failing 'add rule' command
completes in about half a second.

With the ruleset still in place, 'free' shows about 2GB of memory usage. This
doesn't change after an 'nft flush ruleset' though, so I'm not sure how much of
it can be attributed to nftables in the kernel.

Your OOM issue in bug 2173801 is with the nft tool though. Adding 35k tables is
possible on the above machine; the 'nft list ruleset' command eats at most ~15GB
of memory according to top. This would mean about 450KB per table in cache,
indeed a bit much.

The test case in comment 1 above is bogus, though: There is a limit on the
maximum number of chains attaching to the same hook (1024 IIRC), so the 'add
chain' commands start failing at some point. This is also reflected by the
number of lines returned by 'nft list ruleset': each table with a chain uses
five lines, so it should return 150k lines instead of the ~36k. But this is not
relevant here.
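
To visualize the five-lines-per-table estimate, this is roughly what a single
table with one such chain looks like in the 'nft list ruleset' output (exact
formatting depends on the nft version):

table inet filter1 {
	chain input {
		type filter hook input priority 0; policy accept;
	}
}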

I had a look at the customer case, and it seems you missed a crucial aspect: the
large ruleset slowing down the nft command was probably created by
OpenShift/Kubernetes. At least the chain names indicate that. If so, they were
likely added using iptables-nft. This in turn is indeed optimized for use with
excessively large rulesets such as the one at hand.

So please ask the customer if they can use iptables-nft instead of nft. The
latter may produce wrong results anyway for iptables-nft-created rules.

Comment 8 Edward Haas 2023-03-09 14:22:27 UTC
(In reply to Phil Sutter from comment #7)
> I had a look at the customer case, and it seems you missed a crucial aspect: the
> large ruleset slowing down the nft command was probably created by
> OpenShift/Kubernetes. At least the chain names indicate that. If so, they were
> likely added using iptables-nft. This in turn is indeed optimized for use with
> excessively large rulesets such as the one at hand.

Could you please elaborate on what optimization is done exactly?
I was under the impression that iptables-nft just translates iptables config to
nftables rules, with existing tables & chains pre-defined.
Are there other things going on behind the scenes?
 
> So please ask the customer if they can use iptables-nft instead of nft. The
> latter may produce wrong results anyway for iptables-nft-created rules.

There are a large number of components in a Kubernetes deployment; some use
iptables-nft and some use nft.
In this specific case, the one using nft is a component called CNI, which is
responsible for setting up the pod/container network configuration (e.g. creating
a veth, placing one peer into the pod and the other in the root netns, and
possibly connecting it to a bridge).

This CNI is using one base chain [1] and multiple rules to match per iifname
and later to filter per mac [2].
Can a single base chain cause this trouble of slowness?
If it is, is there something we can improve to solve it?

To clarify, the customer has no control over which client is used.
The only thing we can try is to recreate it with iptables-nft instead.

[1] https://github.com/containernetworking/plugins/blob/86e39cfe3c324c4e95cf12167c722173be4495c4/pkg/link/spoofcheck_test.go#L200-L221
(the other custom chains are defined per interface)
[2] https://github.com/containernetworking/plugins/blob/86e39cfe3c324c4e95cf12167c722173be4495c4/pkg/link/spoofcheck_test.go#L231-L259
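
For readers unfamiliar with the plugin, the pattern is roughly the following,
sketched here with made-up table/chain/interface names and MAC address (the
plugin's exact ruleset lives in the spoofcheck code linked above): a base chain
matches on iifname and jumps to a per-interface chain, which only lets frames
from the expected source MAC through:

nft add table bridge nat
nft add chain bridge nat prerouting { type filter hook prerouting priority -300 \; }
nft add chain bridge nat cnv_net1_mac
nft add rule bridge nat prerouting iifname veth123 jump cnv_net1_mac
nft add rule bridge nat cnv_net1_mac ether saddr 02:d9:17:00:00:03 return
nft add rule bridge nat cnv_net1_mac drop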

Comment 14 Miguel Duarte Barroso 2023-04-10 13:58:01 UTC
By removing the index from the MAC address matching rule, we get a very noticeable performance increase when executing CNI ADDs.
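
To illustrate the kind of change (table/chain names below are hypothetical; the
cost of the old form presumably comes from nft having to resolve the index
against the chain's existing rules):

# before: rule positioned explicitly via an index
nft insert rule bridge nat cnv_net1_mac index 0 drop
# after: plain insert, no index needed
nft insert rule bridge nat cnv_net1_mac drop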

This works on both RHEL 8.6 and 9.2 beta.

Thanks @psutter for your help so far.

Should I "take" the bug from you ?

Comment 15 Miguel Duarte Barroso 2023-04-17 15:18:27 UTC
The bridge-cni fix was merged upstream.

Should I clone this bug, update the component to CNI, and treat it separately? (allowing me to port the fix downstream)

@psutter

Comment 16 Phil Sutter 2023-04-20 12:11:03 UTC
Sorry for the delay, I was off for the last two weeks.

Miguel, feel free to move this BZ into your realm. I guess your fix solves the attached customer case, right?

If teardown slowness persists to be problematic, I guess we should attempt the suggested rewrite involving a set as described via email. Just dropping the needless 'index' parameter is a much safer change, though.
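
Not the emailed proposal, just a generic illustration of what "involving a set"
could look like in nft terms, with made-up names: a single lookup of interface
name and source MAC against a named set could replace the per-interface chains
and their individual rules:

nft add set bridge nat allowed_macs { type ifname . ether_addr \; }
nft add element bridge nat allowed_macs { veth123 . 02:d9:17:00:00:03 }
nft add rule bridge nat prerouting iifname . ether saddr @allowed_macs accept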

Comment 17 Petr Horáček 2023-04-24 07:53:30 UTC
Thanks Phil o/

Moving this BZ to CNV and setting the target release to 4.13.1. We have the needed patch posted midstream, but it is too late to release it in 4.13.0, so we will ship it in the first z-stream.

Comment 18 Miguel Duarte Barroso 2023-04-27 07:11:14 UTC
(In reply to Phil Sutter from comment #16)
> Sorry for the delay, I was off for the last two weeks.
> 
> Miguel, feel free to move this BZ into your realm. I guess your fix solves
> the attached customer case, right?

It should. 

> 
> If teardown slowness persists to be problematic, I guess we should attempt
> the suggested rewrite involving a set as described via email. Just dropping
> the needless 'index' parameter is a much safer change, though.

Comment 27 Yossi Segev 2023-05-29 19:34:35 UTC
@psutter @jmaxwell @mduarted 
I tried reproducing the scenario in an attempt to verify the bug, using the scenario that is provided in the bug description.
The problem is that, as Phil says in comment #7 ("The test case in comment 1 above is bogus, though: There is a limit on the maximum number of chains attaching to the same hook (1024 IIRC)"), on the 1025th iteration of nft_add.sh I started getting these errors:

Error: Could not process rule: Argument list too long
add chain inet filter1025 input { type filter hook input priority 0 ; }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: Could not process rule: Argument list too long
add chain inet filter1026 input { type filter hook input priority 0 ; }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: Could not process rule: Argument list too long
add chain inet filter1027 input { type filter hook input priority 0 ; }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...

So either I cannot verify this BZ, or I am doing something wrong (although I tried to follow the description), or there is another bug here which blocks this BZ.

Few questions:
1. Is the scenario in the BZ description valid? My guess is that it is not (because of the 1024 chains limitation), and in that case - I need a valid scenario for verification, please.
2. The cluster nodes run RHCOS 4.13, which means RHEL 9.2 under the hood.
What I want to verify is that it is OK that the *guest VM* I am running is a Fedora VM (I guess it is, as long as the hosting nodes are RHEL 9.2, but I want to be 100% sure).
Thank you.

Comment 28 Germano Veit Michel 2023-05-29 23:07:38 UTC
(In reply to Yossi Segev from comment #27)
> Few questions:
> 1. Is the scenario in the BZ description valid? My guess is that it is not
> (because of the 1024 chains limitation), and in that case - I need a valid
> scenario for verification, please.

I tested this a few times, usually doing something like 10-20 tables, 500 chains and 5-10 rules per chain.
Try to get to about 20-40K entries total (tables * chains * rules).

> 2. The cluster nodes run RHCOS 4.13, which means RHEL 9.2 under the hood.
> What I want to verify is that it is OK that the *guest VM* I am running is a
> Fedora VM (I guess it is, as long as the hosting nodes are RHEL 9.2, but I
> want to be 100% sure).
> Thank you.

You don't even need an OS: you can use a blank disk and leave it stuck on a SeaBIOS/OVMF failed boot. The whole bug happens before qemu starts (add) and after qemu shuts down (del).

Comment 30 Phil Sutter 2023-05-30 11:03:29 UTC
(In reply to Yossi Segev from comment #27)
> @psutter @jmaxwell @mduarted 
> I tried reproducing the scenario in an attempt to verify the bug, using the
> scenario that is provided in the bug description.

It is outdated: the bug was meanwhile transferred to CNV, and a fix (or rather,
a workaround?) was deployed there. It is not possible to test it using nft
alone, since nft itself wasn't changed.

Comment 31 Yossi Segev 2023-05-30 11:28:42 UTC
(In reply to Phil Sutter from comment #30)
> (In reply to Yossi Segev from comment #27)
> > @psutter @jmaxwell @mduarted 
> > I tried reproducing the scenario in an attempt to verify the bug, using the
> > scenario that is provided in the bug description.
> 
> It is outdated: the bug was meanwhile transferred to CNV, and a fix (or rather,
> a workaround?) was deployed there. It is not possible to test it using nft
> alone, since nft itself wasn't changed.

Thank you Phil.
So either @mduarted, @germano or @jmaxwell - can you please provide me an *exact* reproducible scenario to run, in order to verify this bug?
Thank you.

Comment 33 Germano Veit Michel 2023-05-30 23:08:26 UTC
(In reply to Yossi Segev from comment #31)
> (In reply to Phil Sutter from comment #30)
> > (In reply to Yossi Segev from comment #27)
> > > @psutter @jmaxwell @mduarted 
> > > I tried reproducing the scenario in an attempt to verify the bug, using the
> > > scenario that is provided in the bug description.
> > 
> > It is outdated: the bug was meanwhile transferred to CNV, and a fix (or rather,
> > a workaround?) was deployed there. It is not possible to test it using nft
> > alone, since nft itself wasn't changed.
> 
> Thank you Phil.
> So either @mduarted, @germano or @jmaxwell
> - can you please provide me an *exact* reproducible scenario to run, in
> order to verify this bug?
> Thank you.

From the OCP/CNV side, the proper real-life scenario for testing is to create hundreds (or thousands) of Services in the cluster, using multi-port ranges (so one Service has many TCP/UDP ports), so that each node has ~30-50K nft rules/chains.
Then try to start a VM on it, and stop it. Note down the time it takes for CNI add and delete; it should be almost instant on the latest version, and take many minutes on 4.12.
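
For reference, a minimal sketch of such a multi-port Service (name, selector and
port numbers are arbitrary; scale the ports list and the number of Services
until the node reaches the desired rule count):

apiVersion: v1
kind: Service
metadata:
  name: nft-load-test-001
spec:
  selector:
    app: nft-load-test
  ports:
  - name: tcp-30001
    protocol: TCP
    port: 30001
    targetPort: 30001
  - name: tcp-30002
    protocol: TCP
    port: 30002
    targetPort: 30002
  - name: udp-30001
    protocol: UDP
    port: 30001
    targetPort: 30001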

Comment 34 Germano Veit Michel 2023-05-30 23:38:48 UTC
And if you want to "fake" the rules like I've been testing on a small-scale env, you can check step 3 in BZ2175041. But I think it would be good to do a real-life test on this.

Comment 35 Yossi Segev 2023-05-31 14:30:41 UTC
Thank you Germano.
Unfortunately, I don't have the luxury of running a real-life scenario like the one you suggested, because my resources are limited and the PSI cluster I am using is pretty lean.
Therefore, I went for the scenario you used for BZ2175041:

CNV 4.13.1
cnv-containernetworking-plugins-rhel9:v4.13.1-2
OS: Red Hat Enterprise Linux CoreOS 413.92.202305231734-0 (RHEL 9.2 based)

1. Create a linux bridge interface on a single node using this policy:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: linux-bridge-ens11
spec:
  desiredState:
    interfaces:
    - name: test-br
      type: linux-bridge
      state: up
      ipv4:
        dhcp: true
        enabled: true
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens11
  nodeSelector:
    kubernetes.io/hostname: c01-n-ys-4131o-k59g2-worker-0-lc5wc

2. Create NetworkAttachmentDefinition that utilizes the bridge interface:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/test-br
  name: test-br-nad
  namespace: yoss-ns
spec:
  config: '{"cniVersion": "0.3.1", "name": "test-br", "type":
    "cnv-bridge", "bridge": "test-br", "macspoofchk":true,"ipam":{}}'

3. On the selected node - run the script from BZ2175041 description (step 3) in the background (. nft.sh &):
for a in {1..500}
do
nft add table ip table$a
for b in {1..500}
do
nft add chain ip table$a chain$b
done
done

4. I let the script run for a while, until there were about 300K rule entries:
sh-5.1# nft list ruleset|wc -l
302251

5. I ran the command from step 3 in the description above:
sh-5.1# time nft add rule ip filter output ip daddr 192.168.1.0/24 counter
Error: Could not process rule: No such file or directory
add rule ip filter output ip daddr 192.168.1.0/24 counter
                   ^^^^^^
real	0m0.841s
user	0m0.418s
sys	0m0.411s

As can be seen, it took well under a second to complete (rather than ~40 seconds before the bug was fixed)

6. I started a VM (with a secondary interface, which is backed by the NetworkAttachmentDefinition created above), and it took ~8 seconds to reach the running state:
$ virtctl start vm1
VM vm1 was scheduled to start
$
$ oc get vmi -w
NAME   AGE   PHASE        IP    NODENAME   READY
vm1    3s    Scheduling                    False
vm1    6s    Scheduled          c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Scheduled          c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Running      10.129.2.99   c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Running      10.129.2.99   c01-n-ys-4131o-k59g2-worker-0-lc5wc   True
$
$ virtctl console vm1
Successfully connected to vm1 console. The escape sequence is ^]
                                                                      
vm1 login: fedora
Password: 
Last login: Tue Feb 21 09:44:15 on ttyS0
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[fedora@vm1 ~]$

Comment 37 Phil Sutter 2023-06-01 17:03:56 UTC
(In reply to Yossi Segev from comment #35)
[...]
> 3. On the selected node - run the script from BZ2175041 description (step 3)
> in the background (. nft.sh &):
> for a in {1..500}
> do
> nft add table ip table$a
> for b in {1..500}
> do
> nft add chain ip table$a chain$b
> done
> done
> 
> 4. I let the script run for a while, until there were about 300K rule
> entries:
> sh-5.1# nft list ruleset|wc -l
> 302251
> 
> 5. I ran the command from step 3 in the description above:
> sh-5.1# time nft add rule ip filter output ip daddr 192.168.1.0/24 counter
> Error: Could not process rule: No such file or directory
> add rule ip filter output ip daddr 192.168.1.0/24 counter
>                    ^^^^^^
> real	0m0.841s
> user	0m0.418s
> sys	0m0.411s

I might be wrong, but to me this looks like you're adding a bunch of tables and
chains, then measuring how long it takes for nft to notice you're trying to add
a rule to a non-existent table.

Also, the 'add rule' command you're pasting above doesn't involve caching at all.
It just sends the data to the kernel, which returns ENOENT.

I wonder how this is supposed to test a change in CNI.

Comment 39 Germano Veit Michel 2023-06-01 21:32:44 UTC
(In reply to Phil Sutter from comment #37)
> (In reply to Yossi Segev from comment #35)
> [...]
> > 3. On the selected node - run the script from BZ2175041 description (step 3)
> > in the background (. nft.sh &):
> > for a in {1..500}
> > do
> > nft add table ip table$a
> > for b in {1..500}
> > do
> > nft add chain ip table$a chain$b
> > done
> > done
> > 
> > 4. I let the script run for a while, until there were about 300K rule
> > entries:
> > sh-5.1# nft list ruleset|wc -l
> > 302251
> > 
> > 5. I ran the command from step 3 in the description above:
> > sh-5.1# time nft add rule ip filter output ip daddr 192.168.1.0/24 counter
> > Error: Could not process rule: No such file or directory
> > add rule ip filter output ip daddr 192.168.1.0/24 counter
> >                    ^^^^^^
> > real	0m0.841s
> > user	0m0.418s
> > sys	0m0.411s
> 
> I might be wrong, but to me this looks like you're adding a bunch of tables and
> chains, then measuring how long it takes for nft to notice you're trying to add
> a rule to a non-existent table.
> 
> Also, the 'add rule' command you're pasting above doesn't involve caching at all.
> It just sends the data to the kernel, which returns ENOENT.
> 
> I wonder how this is supposed to test a change in CNI.

Yes, this is not a valid test: mixing my script to create tables and chains with the command from this BZ to add a rule to a random table/chain name does not work; he would need to add the rule to one of the chains the script created.

But step 6, the VM start he did, adds a rule to an existing chain and is fine. 8s from start to running is OK (the nft add is inside those 8s somewhere, probably a tiny fraction).

Comment 41 Phil Sutter 2023-06-02 16:41:00 UTC
My changes to go-nft have been merged, so I submitted a MR for cni-plugins:

https://github.com/containernetworking/plugins/pull/902

Any review/testing highly appreciated, of course!

Comment 42 Yossi Segev 2023-06-04 07:18:04 UTC
@

Comment 43 Yossi Segev 2023-06-04 08:55:30 UTC
@psutter @germano 
Thank you both for reviewing my reproduction scenario and providing the feedback.
After realizing that the rule I added (`add rule ip filter output ip daddr 192.168.1.0/24 counter`) is actually a dummy rule, which references a table and chain that do not exist, I changed the scenario accordingly.
For the convenience and clarity of all of us, I'm specifying the modified reproduction scenario here in full, although most of its steps are similar to those I specified in comment #35 (actually, the only step that was changed is step 5, where the rule was added).
Happily, the results were the same, so I'm keeping this BZ as verified.


CNV 4.13.1
cnv-containernetworking-plugins-rhel9:v4.13.1-2
OS: Red Hat Enterprise Linux CoreOS 413.92.202305231734-0 (RHEL 9.2 based)

1. Create a linux bridge interface on a single node using this policy:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: linux-bridge-ens11
spec:
  desiredState:
    interfaces:
    - name: test-br
      type: linux-bridge
      state: up
      ipv4:
        dhcp: true
        enabled: true
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens11
  nodeSelector:
    kubernetes.io/hostname: c01-n-ys-4131o-k59g2-worker-0-lc5wc

2. Create NetworkAttachmentDefinition that utilizes the bridge interface:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/test-br
  name: test-br-nad
  namespace: yoss-ns
spec:
  config: '{"cniVersion": "0.3.1", "name": "test-br", "type":
    "cnv-bridge", "bridge": "test-br", "macspoofchk":true,"ipam":{}}'

3. On the selected node - run the script from BZ2175041 description (step 3) in the background (. nft.sh &):
for a in {1..500}
do
nft add table ip table$a
for b in {1..500}
do
nft add chain ip table$a chain$b
done
done

4. I let the script run for a while, until there were about 300K rule entries:
sh-5.1# nft list ruleset|wc -l
302251

5. I ran the command from step 3 in the description above, on an actual chain in one of the tables added by the script:
sh-5.1# time nft add rule ip table200 chain400 ip daddr 192.168.1.0/24 counter

real	0m0.766s
user	0m0.410s
sys	0m0.355s
sh-5.1#
sh-5.1# nft list chain table200 chain400
table ip table200 {
	chain chain400 {
		ip daddr 192.168.1.0/24 counter packets 0 bytes 0
	}
}

As can be seen, it took well under a second to complete (rather than ~40 seconds before the bug was fixed)

6. I started a VM (with a secondary interface, which is backed by the NetworkAttachmentDefinition created above; VM spec yaml is attached), and it took ~8 seconds to reach the running state:
$ virtctl start vm1
VM vm1 was scheduled to start
$
$ oc get vmi -w
NAME   AGE   PHASE        IP    NODENAME   READY
vm1    1s    Scheduling                    False
vm1    5s    Scheduled          c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Scheduled          c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Running      10.129.2.109   c01-n-ys-4131o-k59g2-worker-0-lc5wc   False
vm1    8s    Running      10.129.2.109   c01-n-ys-4131o-k59g2-worker-0-lc5wc   True
vm1    8s    Running      10.129.2.109   c01-n-ys-4131o-k59g2-worker-0-lc5wc   True
$
$ virtctl console vm1
Successfully connected to vm1 console. The escape sequence is ^]

vm1 login: fedora
Password: 
Last login: Tue Feb 21 09:44:15 on ttyS0
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service
[fedora@vm1 ~]$

Comment 45 Germano Veit Michel 2023-06-04 21:20:44 UTC
Hi Yossi,

Thanks, it looks perfect to me.

Just one thing here:

(In reply to Yossi Segev from comment #43)
> 6. I started a VM (with a secondary interface, which is backed by the
> NetworkAttachmentDefinition created above; VM spec yaml is attached), and it
> took ~8 second to get to running state:

The 8s includes a lot more things than just the specific step we want. I think it is fine, as 8s for everything is quite OK, and your manual rule was quick too.

However, if you want to really see how long it took to go over the slow step we are discussing here, you can enable debug logs and then look for this:

To enable debug, edit /etc/kubernetes/cni/net.d/00-multus.conf, set logLevel to debug and add logFile to log where you want it.
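
For reference, a minimal sketch of the two keys to add (shown here in a
stripped-down conf; keep all other existing fields of 00-multus.conf as they
are, and the logFile path is an arbitrary choice):

{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "logLevel": "debug",
  "logFile": "/var/log/multus.log"
}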

1. start log
2023-03-24T00:21:56Z [debug] confAdd: &{f1de1b8888acaff8dffbd9473170816871e58aae6cba2bcfa15ce70863061943 /var/run/netns/37fea639-da07-4dcf-b367-d3a695f1dc43 net1 [[IgnoreUnknown true] [K8S_POD_NAMESPACE openshift-cnv] [K8S_POD_NAME virt-launcher-rhel9-08yug1jeokq0tqty-ndzgl] [K8S_POD_INFRA_CONTAINER_ID f1de1b8888acaff8dffbd9473170816871e58aae6cba2bcfa15ce70863061943] [K8S_POD_UID 12db1f36-9cfa-4dae-bf5b-bc1aa755ea68] [IgnoreUnknown 1] [K8S_POD_NAMESPACE openshift-cnv] [K8S_POD_NAME virt-launcher-rhel9-08yug1jeokq0tqty-ndzgl] [K8S_POD_INFRA_CONTAINER_ID f1de1b8888acaff8dffbd9473170816871e58aae6cba2bcfa15ce70863061943] [K8S_POD_UID 12db1f36-9cfa-4dae-bf5b-bc1aa755ea68] [MAC 02:d9:17:00:00:03]] map[mac:02:d9:17:00:00:03] }, {"name":"virt.toca","type":"cnv-bridge","cniVersion":"0.4.0","bridge":"virt.toca","macspoofchk":true,"ipam":{}}

2. finish log
2023-03-24T00:22:10Z [verbose] Add: openshift-cnv:virt-launcher-rhel9-08yug1jeokq0tqty-ndzgl:12db1f36-9cfa-4dae-bf5b-bc1aa755ea68:openshift-cnv/virt-toca(virt.toca):net1 {"cniVersion":"0.4.0","interfaces":[{"name":"virt.toca","mac":"02:52:59:00:11:04"},{"name":"veth05f348bf","mac":"82:0b:f3:3f:df:f9"},{"name":"net1","mac":"02:d9:17:00:00:03","sandbox":"/var/run/netns/37fea639-da07-4dcf-b367-d3a695f1dc43"}],"dns":{}}

So the above took 14s to do a few netns things and insert the rule. Yours should be a few ms, but if you want to measure it when fired from the code, that's how to get a better idea of how long the more specific step took.

Comment 47 Miguel Duarte Barroso 2023-06-08 09:54:11 UTC
Seems the reproducer has been provided, and the bug verified. Clearing the needinfo request from me.

Comment 52 errata-xmlrpc 2023-06-20 13:41:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.13.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:3686

Comment 53 Red Hat Bugzilla 2023-10-19 04:25:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days