Bug 1887568 - Avoiding network soft IRQs on isolated CPUs
Summary: Avoiding network soft IRQs on isolated CPUs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Paolo Abeni
QA Contact: mhou
URL:
Whiteboard:
Depends On:
Blocks: 1825271
 
Reported: 2020-10-12 19:49 UTC by Daniel Bristot de Oliveira
Modified: 2020-11-16 21:05 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-16 21:05:17 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
console log (2.87 MB, image/jpeg), 2020-10-21 12:54 UTC, Andrew Theurer

Description Daniel Bristot de Oliveira 2020-10-12 19:49:32 UTC
Description of problem:

A latency spike of ~150 us was noticed on the OCP setup, caused by the handling of the network softirq on isolated CPUs.

In discussions with Francesco Romani, Martin Sivak, and Paolo Abeni, the current understanding is that this is a side effect of existing optimizations for maximizing resource usage on both sides.

From the Kubernetes side, the isolated CPUs are used by non-exclusive threads when no workload is assigned to them. That explains why we were seeing regular threads running on the isolated CPUs, and it makes sense with the goal of maximizing CPU time.

From the network side, the network softirq of a CPU whose threads used the network stack ends up entering the game to improve system performance. Because threads were running on that CPU before the "exclusive" workload got in, the softirq is still in the game when oslat enters. The softirq's main job is to process packets via OVS and push them to user space. Without HRTICK working, ksoftirqd was waking up many threads; with HRTICK, ksoftirqd wakes up one thread, is then throttled, and in the next period another one is awakened, and so on. Still, some of the processing is done with preemption disabled, pushing the throttling far from the expected time.

How can we avoid this latency spike?
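
For context, the latency figures in this BZ come from oslat running on the isolated CPUs. A minimal sketch of such a run, assuming the rt-tests oslat binary; the CPU list and duration are chosen only for illustration (the --rtprio usage is discussed later in this BZ):

--------- %< ---------------
# Illustrative only: pin the measurement threads to the isolated CPUs and run
# for ten minutes at SCHED_FIFO priority 1; adjust --cpu-list to the actual
# isolated set on the node under test.
oslat --cpu-list 4-23 --rtprio 1 --duration 10m
--------- >% ---------------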

Comment 1 Daniel Bristot de Oliveira 2020-10-13 08:40:51 UTC
Tracing the irq:softirq_raise event, I see this:


--------- %< ---------------
  mdns-publisher-9106  [004] d...113 338655.473930: softirq_raise: vec=3 [action=NET_RX]
  mdns-publisher-9106  [004] d...213 338655.473934: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => netif_rx_ni
 => dev_loopback_xmit
 => ip_mc_output
 => ip_send_skb
 => udp_send_skb.isra.46
 => udp_sendmsg
 => sock_sendmsg
 => ___sys_sendmsg
 => __sys_sendmsg
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------


and this:

--------- %< ---------------
           <...>-42518 [004] d...114 338653.188346: softirq_raise: vec=3 [action=NET_RX]
           <...>-42518 [004] d...214 338653.188351: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => loopback_xmit
 => dev_hard_start_xmit
 => __dev_queue_xmit
 => ip_finish_output2
 => ip_output
 => __ip_queue_xmit
 => __tcp_transmit_skb
 => tcp_write_xmit
 => __tcp_push_pending_frames
 => tcp_close
 => inet_release
 => __sock_release
 => sock_close
 => __fput
 => task_work_run
 => exit_to_usermode_loop
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------

and these point to this code:

--------- %< ---------------
static int netif_rx_internal(struct sk_buff *skb)
{
        int ret;

        net_timestamp_check(netdev_tstamp_prequeue, skb);

        trace_netif_rx(skb);

#ifdef CONFIG_RPS
        if (static_branch_unlikely(&rps_needed)) {
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu;

                preempt_disable();
                rcu_read_lock();

                cpu = get_rps_cpu(skb->dev, skb, &rflow);
                if (cpu < 0)
                        cpu = smp_processor_id();

                ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

                rcu_read_unlock();
                preempt_enable();
        } else
#endif
        {
                unsigned int qtail;

                ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
                put_cpu();
        }
--------- >% ---------------

By default, the RPS cpumask is not set, so it makes sense that the packet is handled on the local CPU.

I tried to set the RPS mask to CPUs 1 and 2 (mask 3) for all NICs,

find /sys -name rps_cpus | while read mask; do echo 3 > $mask; done

but it did not work. I wonder if that is because the NICs on which I am seeing the softirq raised are the virtual ones?
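
As a sanity check (an illustrative sketch, not taken from the original report), the masks visible in the current network namespace can be read back directly; note that this does not see devices living in other namespaces:

--------- %< ---------------
# Print every per-queue RPS mask for devices in the current network namespace.
for f in /sys/class/net/*/queues/rx-*/rps_cpus; do
    echo "$f: $(cat "$f")"
done
--------- >% ---------------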


Here is one example of veth_xmit:

--------- %< ---------------
           <...>-1191724 [004] d...114 338990.764907: softirq_raise: vec=3 [action=NET_RX]
           <...>-1191724 [004] d...214 338990.764913: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => veth_xmit
 => dev_hard_start_xmit
 => __dev_queue_xmit
 => ip_finish_output2
 => ip_output
 => __ip_queue_xmit
 => __tcp_transmit_skb
 => tcp_write_xmit
 => __tcp_push_pending_frames
 => tcp_sendmsg_locked
 => tcp_sendmsg
 => sock_sendmsg
 => sock_write_iter
 => new_sync_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------
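
For reference, stack traces like the ones above can be captured with trace-cmd; a sketch assuming trace-cmd is available on the node (the exact commands used for this BZ are not recorded here):

--------- %< ---------------
# Record softirq_raise events filtered on vec == 3 (NET_RX), with a kernel
# stack trace attached to each event, then inspect the result.
trace-cmd record -e irq:softirq_raise -f 'vec == 3' -T sleep 10
trace-cmd report
--------- >% ---------------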

Comment 2 Marcelo Ricardo Leitner 2020-10-13 17:01:54 UTC
(In reply to Daniel Bristot de Oliveira from comment #0)
> Description of problem:
> 
> A latency spike of ~150 us was noticed on the OCP setup, caused by the
> handling of the network softirq on isolated CPUs.

There was https://bugzilla.redhat.com/show_bug.cgi?id=1844520
Did you test with a kernel with these patches in?

Comment 3 Daniel Bristot de Oliveira 2020-10-13 18:22:08 UTC
Hi Marcelo,

I am aware of those patches, but no network interrupts are taking place on those CPUs.

-- Daniel

Comment 4 Paolo Abeni 2020-10-14 14:13:52 UTC
(In reply to Daniel Bristot de Oliveira from comment #1)
> I tried to set the RPS mask to CPUs 1 and 2 (mask 3) for all NICs,
> 
> find /sys -name rps_cpus | while read mask; do echo 3 > $mask; done
> 
> but it did not work. I wonder if that is because the NICs on which I am
> seeing the softirq raised are the virtual ones?

The above command has a couple of caveats.

If OpenShift is in play, you may have additional relevant devices in some network namespace that will not be reached. Additionally, OpenShift may delete or create new devices after the above command completes. Newly created devices will carry an empty RPS mask.

How do you conclude it did not work? Did you still see the network receive path triggered on CPUs other than 1 and 2?
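
Concretely, covering the namespaces mentioned above means repeating the sysfs write inside each one; an illustrative sketch (essentially what the scripts in comment 6 end up automating), which still leaves devices created afterwards with an empty mask:

--------- %< ---------------
# Apply the mask inside every currently existing network namespace (ip netns exec
# remounts sysfs, so /sys/class/net shows that namespace's devices). On hosts where
# sysfs is read-only a remount is needed first, as run_paolo.sh in comment 6 does.
for ns in $(ip netns list | awk '{print $1}'); do
    ip netns exec "$ns" sh -c \
        'find /sys -name rps_cpus | while read f; do echo 3 > "$f"; done'
done
--------- >% ---------------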

Comment 5 Daniel Bristot de Oliveira 2020-10-14 14:31:35 UTC
I did not conclude anything, but I observed that there was still network softirq running on CPUs different from 1 and 2.

This seems to be a good point:

   "If there is openshift into play, you may have additional/relevant devices in some network namespace, that will not be reached. Additional openshift may delete or create new devices after that the above command completes. Newly create device will carry empty RPS mask."

So, is there a way to set an "initial" or "global" RPS mask so that newly created devices are set with the restricted mask?
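
As an aside not taken from this BZ: later upstream kernels added a per-network-namespace sysctl, net.core.rps_default_mask, which seeds newly created devices with a default RPS mask; whether the kernel under test here provides it is not established in this report. A sketch assuming such a kernel:

--------- %< ---------------
# Assumption: the kernel provides net.core.rps_default_mask; devices created in
# this namespace afterwards inherit the mask (value 3 matches the mask used
# earlier in this BZ).
sysctl -w net.core.rps_default_mask=3
--------- >% ---------------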

Comment 6 Daniel Bristot de Oliveira 2020-10-15 07:22:06 UTC
Paolo and I worked on this problem yesterday: 1) we confirmed that RPS was the cause of (part of) the latency spikes; 2) Paolo created a tuning script that correctly sets the RPS mask for all the (existing) namespaces.

With that script, the results improved dramatically, from around 180 us on all measurements to below 100 us in the common case, with some sporadic spikes. The overnight results are [each test is a ten-minute run]:

     Maximum:	 16 23 12 12 13 12 15 16 (us)
     Maximum:	 25 22 15 16 14 14 11 13 (us)
     Maximum:	 17 18 13 12 11 11 19 13 (us)
     Maximum:	 22 46 23 12 11 16 12 12 (us)
     Maximum:	 22 17 11 12 11 10 11 13 (us)
     Maximum:	 17 11 12 17 15 13 11 11 (us)
     Maximum:	 16 10 10 9 10 10 9 14 (us)
     Maximum:	 16 11 13 11 11 10 11 14 (us)
     Maximum:	 17 13 10 11 16 13 10 10 (us)
     Maximum:	 54 12 11 11 12 12 16 17 (us)
     Maximum:	 17 11 13 11 13 11 17 11 (us)
     Maximum:	 16 12 10 11 12 12 11 10 (us)
     Maximum:	 68 11 172 14 10 11 11 11 (us)
     Maximum:	 34 16 11 13 12 17 13 12 (us)
     Maximum:	 16 10 15 10 11 11 11 12 (us)
     Maximum:	 19 12 203 10 13 10 9 12 (us)
     Maximum:	 30 13 25 13 14 12 10 9 (us)
     Maximum:	 16 10 10 10 11 10 12 10 (us)
     Maximum:	 92 10 13 10 12 14 11 10 (us)
     Maximum:	 18 10 10 9 9 15 11 13 (us)
     Maximum:	 33 15 16 16 121 12 17 12 (us)
     Maximum:	 16 11 11 12 11 12 13 11 (us)
     Maximum:	 22 12 12 17 11 12 18 12 (us)
     Maximum:	 17 10 11 15 11 11 14 11 (us)
     Maximum:	 17 15 18 10 11 10 11 17 (us)
     Maximum:	 17 13 11 13 10 10 10 10 (us)
     Maximum:	 11 11 12 11 10 10 11 17 (us)
     Maximum:	 10 10 9 14 15 11 10 102 (us)
     Maximum:	 11 15 10 10 12 13 10 11 (us)
     Maximum:	 19 10 10 12 10 10 12 45 (us)
     Maximum:	 12 14 11 12 20 15 11 11 (us)
     Maximum:	 12 11 14 11 11 12 11 11 (us)
     Maximum:	 11 12 12 11 16 18 16 11 (us)
     Maximum:	 18 11 10 10 17 11 15 11 (us)
     Maximum:	 17 11 11 12 11 11 17 12 (us)
     Maximum:	 17 12 12 13 16 21 11 13 (us)
     Maximum:	 14 13 12 13 12 16 12 13 (us)
     Maximum:	 20 16 19 21 13 22 13 11 (us)
     Maximum:	 15 18 10 11 12 15 17 20 (us)
     Maximum:	 15 10 14 11 11 10 12 11 (us)
     Maximum:	 11 10 16 17 13 12 17 11 (us)
     Maximum:	 17 12 10 12 11 11 10 12 (us)
     Maximum:	 45 15 13 10 10 14 10 12 (us)
     Maximum:	 15 11 15 15 11 11 11 14 (us)
     Maximum:	 14 11 13 18 17 30 20 10 (us)
     Maximum:	 13 12 16 11 11 16 10 10 (us)
     Maximum:	 11 11 11 10 13 15 11 12 (us)
     Maximum:	 10 11 11 12 16 17 10 12 (us)
     Maximum:	 14 12 11 10 10 12 15 141 (us)
     Maximum:	 11 11 15 11 11 18 10 11 (us)
     Maximum:	 11 13 12 18 11 12 14 17 (us)
     Maximum:	 16 10 10 10 17 12 12 12 (us)
     Maximum:	 10 10 11 17 12 17 10 10 (us)
     Maximum:	 11 11 16 11 10 10 10 13 (us)
     Maximum:	 10 11 12 11 15 11 11 11 (us)
     Maximum:	 9 10 15 13 10 12 11 14 (us)

Note that the spikes we still see can be a side effect of temporary namespaces created by OpenShift that the current tuning does not cover. We will now discuss ways to make the script more "bulletproof."

The tune script content is:
--------- %< run_paolo.sh %< ----------------
#!/bin/sh
mount -o remount,rw /sys
find /sys -name rps_cpus | while read mask; do echo $mask; echo 3 > $mask; done
mount -o remount,ro /sys
--------- >% run_paolo.sh >% ----------------

and we run it with the following commands:

---------- %< network_tune.sh %< ----------
for N in `ip  netns list | awk '{print $1}'`; do ip netns exec $N ./run_paolo.sh; done
./run_paolo.sh
---------- >% network_tune.sh >% ----------

Good progress!

Comment 8 Andrew Theurer 2020-10-21 01:49:45 UTC
We are trying this on OpenShift, but the worker loses its network connection after running oslat for close to an hour.  The console has a lot of messages about ksoftirqd being stalled on many CPUs, which leads me to believe that setting the rps_cpus mask is not working, or there is a constant influx of new containers with a default rps_cpus mask of all cpus (or if it's empty, does that mean all cpus?).

However, even if rps_mask is not being set, wouldn't stalld give the ksoftirqds priority to run, so the host network should not become unresponsive?

Comment 9 Paolo Abeni 2020-10-21 08:05:53 UTC
(In reply to Andrew Theurer from comment #8)
> The console has a lot of messages
> about ksoftirqd being stalled on many CPUs, 

Can you please attach the console log?

> which leads me to believe that
> setting the rps_cpus mask is not working, or there is a constant influx of
> new containers with a default rps_cpus mask of all cpus 

This is likely the case.

> (or if it's empty, does that mean all cpus?).

Some clarification is needed. In the original problem reported by Daniel, the system observed unexpectedly high latency due to softirqs being scheduled on isolated CPUs. The scheduling happened because of packets looped back by OVS. By default, looped-back packets are processed, via the NET_RX softirq, on the same CPU that originally transmitted the packet. RPS comes into play to force the looped-back packets away from such CPUs by specifying a CPU mask = !(isolated CPUs).

An empty RPS mask means the default behavior: looped-back packets are processed on the originating CPU, whatever that is, including isolated CPUs.
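
As a concrete illustration of "CPU mask = !(isolated CPUs)", with values assumed from the worker cmdline shown in comment 13 (CPUs 4-23 isolated, CPUs 0-3 housekeeping):

--------- %< ---------------
# Housekeeping CPUs 0-3 -> RPS mask 0xf; write it to every receive queue of the
# devices visible in the current network namespace.
MASK=f
for q in /sys/class/net/*/queues/rx-*/rps_cpus; do
    echo "$MASK" > "$q"
done
--------- >% ---------------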

Please let me know if the above clarifies a bit.

> However, even if rps_mask is not being set, wouldn't stalld give the
> ksoftirqds priority to run, so the host network should not become unresponsive?

@Daniel, do you know better?

Comment 10 Andrew Theurer 2020-10-21 12:54:45 UTC
Created attachment 1723197 [details]
console log

Comment 11 Andrew Theurer 2020-10-21 18:31:04 UTC
We are trying this on OpenShift, but the worker loses its network connection after running oslat for close to an hour.  The console has a lot of messages about ksoftirqd being stalled on many CPUs, which leads me to believe that setting the rps_cpus mask is not working, or there is a constant influx of new containers with a default rps_cpus mask of all cpus (or if it's empty, does that mean all cpus?).

However, even if rps_mask is not being set, wouldn't stalld give the ksoftirqds priority to run, so the host network should not become unresponsive?

Comment 12 Daniel Bristot de Oliveira 2020-10-21 19:01:52 UTC
Andrew, could you check whether stalld is dying on your setup?

We found a bug that was causing it to die on systems where sched_debug gets too large.

If so, could you update the stalld binary to the latest upstream version?

https://gitlab.com/rt-linux-tools/stalld

Regarding the rps_cpus mask not working, did you trace the system? (I did, and I saw it working.)

Thanks!
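
One illustrative way to do such a check with trace-cmd (assuming it is installed; these commands are not taken from the original thread) is to count which CPUs still raise NET_RX after the mask is applied:

--------- %< ---------------
# Record NET_RX softirq_raise events for a minute, then count them per CPU;
# the isolated CPUs should ideally no longer appear in the output.
trace-cmd record -e irq:softirq_raise -f 'vec == 3' sleep 60
trace-cmd report | awk '{print $2}' | sort | uniq -c
--------- >% ---------------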

Comment 13 Troy Wilson 2020-10-22 13:53:42 UTC
When running Paolo's script on my worker node in a loop every five seconds...

====================
#!/bin/bash
cpuset=`cat /proc/self/status | grep Cpus_allowed_list: | cut -f 2`
echo "running on processors " ${cpuset}
while true; do
  for N in `ip  netns list | awk '{print $1}'`; do ip netns exec $N  ./run_paolo.sh; done
  sleep 5;
done
====================

... I noticed that the script stops producing output when I have oslat running with rtprio 1, but it continues to run if I do not use --rtprio with oslat.  When running oslat with rtprio 1, the script resumes as soon as the oslat run completes.  The Cpus_allowed output indicates the script is running on the housekeeping CPUs.

I am running OCP 4.6.0-rc.2 with a test kernel from brew 4.18.0-193.27.1.rt13.76.el8_2.rtperf20201015.x86_64.  I am running without stalld.

[root@worker00 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5fa94aa804ce1821c09dec865c57662f23be27c2b4ac5e0d54da4992861ecf87/vmlinuz-4.18.0-193.27.1.rt13.76.el8_2.rtperf20201015.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/5fa94aa804ce1821c09dec865c57662f23be27c2b4ac5e0d54da4992861ecf87/0 skew_tick=1 nohz=on rcu_nocbs=4-23 tuned.non_isolcpus=0000000f intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,4-23 systemd.cpu_affinity=0,1,2,3 default_hugepagesz=1G hugepagesz=1G hugepages=10 nosmt tsc=nowatchdog
[root@worker00 ~]#


If there is interest in the oslat output, these are from very short runs used to diagnose the issue above:
(these runs are so short that this data should not be treated as valid latency data)


oslat with rtprio 1, paolo script started but stalled during oslat run:

   Minimum:	 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
   Average:	 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
   Maximum:	 17459 26 21 17 17 15 14 15 15 15 14 15 14 14 14 14 14 14 (us)
   Max-Min:	 17458 25 20 16 16 14 13 14 14 14 13 14 13 13 13 13 13 13 (us)


oslat run without rtprio option, paolo script running:

   Minimum:	 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
   Average:	 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
   Maximum:	 62 11 65 12 34 47 39 45 50 13 48 45 44 31 44 61 61 60 (us)
   Max-Min:	 61 10 64 11 33 46 38 44 49 12 47 44 43 30 43 60 60 59 (us)

oslat run without rtprio option, paolo script not started at all:

   Minimum:	 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
   Average:	 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
   Maximum:	 50 11 52 11 41 39 33 48 59 13 39 49 43 40 43 55 58 52 (us)
   Max-Min:	 49 10 51 10 40 38 32 47 58 12 38 48 42 39 42 54 57 51 (us)

Comment 14 Daniel Bristot de Oliveira 2020-10-22 14:25:36 UTC
For how long are you running oslat? Is the system complaining about RCU stalls?

Comment 15 Troy Wilson 2020-10-22 16:07:43 UTC
(In reply to Daniel Bristot de Oliveira from comment #14)
> For how long are you running oslat? Is the system complaining about RCU
> stalls?

These runs are just 5-minute runs done while verifying the issue above.  I am unable to make runs longer than maybe 30 or 40 minutes; one-hour runs consistently fail.  Either the pod disappears entirely and the event logs show nothing, or the entire worker node becomes unresponsive, even from the console, and requires a manual reboot to recover.  When running without stalld I see lots of "rcu: INFO: rcu_preempt self-detected stall on CPU <whatever>", and when running with stalld I see lots of rcuc and ksoftirqd starvation.

Comment 16 Daniel Bristot de Oliveira 2020-10-22 17:05:18 UTC
Question: which stalld are you using? Is it the latest version from here?

https://gitlab.com/rt-linux-tools/stalld

If not, please use it; it fixes a known issue.

Running without stalld will cause RCU stalls, and those cause the > 1 ms latency because the stall dispatches NMIs to all CPUs. This was already investigated at the origin of this BUG (note, this BUG is the consequence of that analysis).

The rcuc and ksoftirqd starvation messages are a side effect of oslat in this scenario. The stalld messages are not the problem or the error; they are the evidence that we need to understand why those threads are running there. That said, I do not think it makes sense to test without stalld.

It is easy to understand which softirq is running on the CPUs (and causing the wakeups) using tracing.

This BZ specifically cares about the dispatch of the NET_RX softirq. It is known that some timer softirqs are also being raised there, but that is not the problem in this BZ.

When the system crashes, can you try to get the dmesg output from /sys/fs/pstore ? here: https://access.redhat.com/solutions/4581331

check if it is something like:

"kernel BUG at kernel/sched/deadline.c:1462!"

Comment 17 Troy Wilson 2020-10-22 17:52:11 UTC
(In reply to Daniel Bristot de Oliveira from comment #16)
> Question: which stalld are you using? Is it the latest version from here?
> 
> https://gitlab.com/rt-linux-tools/stalld
> 
> If not, please use it; it fixes a known issue.

I'm currently working on rpm-ostree installing the build prerequisites on my worker node.  If there is a better way to get the new stalld onto my worker, please let me know, because this is unpleasant so far.


> When the system crashes, can you try to get the dmesg output from
> /sys/fs/pstore ? here: https://access.redhat.com/solutions/4581331
> 
> check if it is something like:
> 
> "kernel BUG at kernel/sched/deadline.c:1462!"

I have nothing in /sys/fs/pstore now, but I can start checking it after a crash.  Thanks for the tip.

Comment 22 Daniel Bristot de Oliveira 2020-10-26 11:15:58 UTC
Hi Paolo,

I am testing a kernel-rt with your patches (and all the others that openshift needs).

So far so good.

More updates by the end of the day.

Thanks!

Comment 23 Troy Wilson 2020-10-26 13:55:18 UTC
(In reply to Daniel Bristot de Oliveira from comment #22)
> Hi Paolo,
> 
> I am testing a kernel-rt with your patches (and all the others that
> openshift needs).

I'd be happy to test the rt version as well if you have a link to the build.  Thanks!

Comment 33 Daniel Bristot de Oliveira 2020-10-28 10:42:24 UTC
Note:

While running the experiments, stalld is pointing to executions of kworkers on the isolated CPUs.

I filed the following BZ to track it: https://bugzilla.redhat.com/show_bug.cgi?id=1892264

