Bug 1855447 - FIFO priority causes hung tasks
Summary: FIFO priority causes hung tasks
Keywords:
Status: CLOSED DUPLICATE of bug 1878602
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: Daniel Bristot de Oliveira
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On: 1866909 1869991
Blocks: 1861094
 
Reported: 2020-07-09 21:23 UTC by Karl Rister
Modified: 2020-10-10 06:42 UTC
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1861094 (view as bug list)
Environment:
Last Closed: 2020-10-08 20:40:03 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
hung task trace (280.27 KB, image/png)
2020-07-09 21:23 UTC, Karl Rister
no flags Details
top output (239.23 KB, image/png)
2020-07-10 15:55 UTC, Karl Rister
no flags Details
dmesg output (220.54 KB, text/plain)
2020-07-10 19:12 UTC, Karl Rister
no flags Details
dmesg output with debug kernel (268.49 KB, text/plain)
2020-07-13 18:12 UTC, Karl Rister
no flags Details
Instructions about how to reproduce the problem (and how to workaround it) (2.23 KB, text/plain)
2020-07-21 16:04 UTC, Daniel Bristot de Oliveira
no flags Details
raggiro v2 script (2.24 KB, application/x-shellscript)
2020-07-28 13:24 UTC, Karl Rister
no flags Details

Description Karl Rister 2020-07-09 21:23:04 UTC
Created attachment 1700491 [details]
hung task trace

Description of problem:

Using the 8.1 RT kernel on an OCP worker node, creating a Pod with a DPDK application (testpmd) that has some threads configured with SCHED_FIFO priority 1 (chrt -f -p 1 <pid>) results in kernel stack traces for hung tasks. Eventually OCP tries to terminate the Pod, but cannot, because the system has become unresponsive.
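
For reference, a minimal sketch of that priority change, assuming the PMD threads can be picked out by the "lcore-slave" thread names that show up in the top output later in this bug (the name match is an assumption; adjust it to the actual testpmd thread names):

# Hedged sketch: give testpmd's PMD ("lcore-slave-*") threads SCHED_FIFO priority 1.
for tid in $(ps -eLo tid,comm | awk '$2 ~ /^lcore-slave/ { print $1 }'); do
    chrt -f -p 1 "${tid}"
done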


Version-Release number of selected component (if applicable):

4.18.0-147.8.1.rt24.101.el8_1.x86_64


How reproducible:

In two attempts to leave the Pod running overnight, this has happened both times.


Steps to Reproduce:
1.  Launch Pod with testpmd running in a container
2.  Change testpmd thread priority using chrt
3.  Wait until things start to break


Actual results:

Kernel dumps stack traces and OCP attempts to cleanup but fails.


Expected results:

Testpmd runs just fine until the user interrupts it.


Additional info:

Comment 2 Clark Williams 2020-07-10 13:56:15 UTC
Do we see a behavior change between running this on a RHEL kernel vs an RT kernel?

Also, did an operator turn off FIFO throttling?

If you can snag them, would be nice to see the backtraces here.

Comment 3 Karl Rister 2020-07-10 15:54:29 UTC
We have not seen this exact problem on RHEL, but it does have the classic characteristics of a potential problem from running any SCHED_FIFO task: some random task initiates a kernel action that requires some information from, or an activity to be performed on, a CPU where the SCHED_FIFO task is running.  I suspect that we are just getting unlucky in the OCP environment and that something OCP is doing is causing such an activity to occur.

One thing that I have observed is that after a while with testpmd running, a kubelet task consumes 100% of a housekeeping core and then many node_exporter tasks (I'm told this is related to Prometheus data logging) consume small amounts of CPU (I'll attach a top screen grab).  These appear to be blocked from making progress by the SCHED_FIFO task(s).  Previously, this always occurred when the node was essentially lost and it was hard to debug.

Yesterday (after our discussions) I came up with some shell magic that re-prioritizes all SCHED_FIFO:1 tasks other than the testpmd tasks to SCHED_FIFO:2:

# Note: this relies on being run from /proc, so that ${tid}/sched and
# ${tid}/comm resolve to /proc/<tid>/sched and /proc/<tid>/comm.
for tid in $(ps -L -eo tid | sort -n); do
    if [ -e "${tid}/sched" ]; then
        policy=$(grep policy ${tid}/sched | awk '{ print $3 }')
        if [ "${policy}" == "1" ]; then
            comm=$(cat ${tid}/comm)
            priority=$(grep prio ${tid}/sched | awk '{ print $3 }')
            if [ "${priority}" == "98" ]; then
                echo "${tid} ${priority} ${comm}"
                if echo "${comm}" | grep -qv lcore; then
                    chrt -v -f -p 2 ${tid}
                fi
            fi
        fi
    fi
done

This moved quite a few tasks to SCHED_FIFO:2 and I then let the system run overnight.  This morning the system was still alive and OCP was not perturbed like it had been, but attempts to delete the Pod were unsuccessful until I manually intervened -- using a pre-created bash console (ssh core@node; sudo bash) I was able to use kill to terminate testpmd, and the Pod deletion then succeeded.  This also unblocked whatever was holding back the kubelet/node_exporter tasks and they were then able to complete and exit/subside.  This would seem to imply that those tasks not progressing is not what causes the catastrophic problem, though, since the system was still alive while that was occurring.  It did take me a little while to get that shell command working to fix up the priorities, so it is possible that some bad state was reached before it took effect.  I'm going to try and reproduce this with the priorities fixed before ever launching testpmd.

I have explicitly tested deleting a Pod with SCHED_FIFO:1 testpmd tasks right after creating it and bringing it up, and that worked.  So that is not a combination that is 100% broken; something happening via a timer and/or aging is getting us into the problematic state.

It should be noted that when running testpmd with SCHED_FIFO:1 and the other tasks elevated to SCHED_FIFO:2, the measured packet latency did increase substantially, from a max of 16.5usec to 91.6usec.

--

If this is the throttling that you are referring to, I guess the answer is no:

[root@perf150 core]# cat /proc/sys/kernel/sched_rt_runtime_us 
-1
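
For reference, the throttling knobs and their usual upstream defaults can be checked like this; -1, as shown above, disables RT throttling entirely:

sysctl kernel.sched_rt_period_us     # usually 1000000 (1s period)
sysctl kernel.sched_rt_runtime_us    # usually 950000 (95% of the period); -1 = unlimited RT runtime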

--

Do you mean the backtrace that is in the attached image file?  We don't have a serial console hooked up so I just had to do a screen grab of the iDRAC's remote console.

Comment 4 Karl Rister 2020-07-10 15:55:35 UTC
Created attachment 1700620 [details]
top output

Comment 5 Karl Rister 2020-07-10 16:51:53 UTC
After a fresh boot of the worker node I moved all the existing SCHED_FIFO:1 tasks to SCHED_FIFO:2, created a new Pod running testpmd, and then applied SCHED_FIFO:1 to the PMD threads spawned by testpmd.  At this point everything looked good.  I was watching the system via top for a while and things were continuing to look good; I was targeting 30 minutes of testpmd runtime and then recording some information.  Well, I got a bit distracted around 25 minutes and didn't get to look at the system again until about 31 minutes, and at that point I could see some funny stuff going on with kubelet/node_exporter (this grab is from ~46.5 minutes as I have been watching it to see if anything changed):

top - 16:49:51 up  1:57,  3 users,  load average: 100.24, 88.97, 58.62
Threads: 2071 total,  13 running, 2051 sleeping,   0 stopped,   7 zombie
%Cpu(s): 11.1 us,  8.8 sy,  0.0 ni, 80.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 192056.2 total, 172286.8 free,  17504.8 used,   2264.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 173483.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P COMMAND                                                                                                                                  
   2357 root      20   0 4742792 144136  62940 R  99.7   0.1  10:50.52  0 kubelet                                                                                                                                  
 291703 root      -2   0  256.2g  65208  11232 R  97.8   0.0  46:25.88  7 lcore-slave-7                                                                                                                            
 291704 root      -2   0  256.2g  65208  11232 R  97.8   0.0  46:26.01  9 lcore-slave-9                                                                                                                            
   7857 nobody    20   0  121272  49536   9684 D   8.8   0.0   1:16.26 14 node_exporter                                                                                                                            
  13703 nobody    20   0  121272  49536   9684 D   8.4   0.0   0:26.62 14 node_exporter                                                                                                                            
 165079 nobody    20   0  121272  49536   9684 D   8.1   0.0   0:23.32 14 node_exporter                                                                                                                            
 439006 nobody    20   0  121272  49536   9684 D   8.1   0.0   0:46.23 14 node_exporter                                                                                                                            
   7859 nobody    20   0  121272  49536   9684 D   7.8   0.0   0:32.73 14 node_exporter                                                                                                                            
   4789 nobody    20   0  121272  49536   9684 D   6.2   0.0   0:53.47 22 node_exporter                                                                                                                            
   7916 nobody    20   0  121272  49536   9684 D   6.2   0.0   0:40.02 22 node_exporter                                                                                                                            
 165076 nobody    20   0  121272  49536   9684 D   6.2   0.0   0:48.25 22 node_exporter                                                                                                                            
 439219 nobody    20   0  121272  49536   9684 D   5.6   0.0   0:54.51 22 node_exporter                                                                                                                            
   7872 nobody    20   0  121272  49536   9684 D   4.4   0.0   1:02.26  2 node_exporter                                                                                                                            
   7920 nobody    20   0  121272  49536   9684 D   4.4   0.0   0:56.53  2 node_exporter                                                                                                                            
 439100 nobody    20   0  121272  49536   9684 D   4.4   0.0   0:52.05  2 node_exporter                                                                                                                            
   7869 nobody    20   0  121272  49536   9684 D   4.1   0.0   0:38.95 10 node_exporter                                                                                                                            
   4790 nobody    20   0  121272  49536   9684 D   3.4   0.0   0:37.07 10 node_exporter                                                                                                                            
   7852 nobody    20   0  121272  49536   9684 D   3.4   0.0   0:23.85 21 node_exporter                                                                                                                            
   7873 nobody    20   0  121272  49536   9684 D   3.4   0.0   0:20.14 21 node_exporter

When I go look at the iDRAC console I see "task blocked for more than 600 seconds" backtrace messages for kworkers.  So I'm currently inclined to think that whatever is happening occurs/starts around 20 minutes after starting testpmd, and then, due to the 600-second timeout, we start seeing more evidence at around 30 minutes.
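
For reference, the 600-second figure comes from the hung-task watchdog; its timeout can be inspected with:

sysctl kernel.hung_task_timeout_secs    # 600 on this system, judging by the messages above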

Comment 6 Karl Rister 2020-07-10 16:56:25 UTC
This may be obvious, but I should note that testpmd is continuing to function just fine while this is happening -- it is executing nearly uninterrupted after all.  So if observing the system remotely (say from our traffic generator) everything appears fine because packets continue to flow and performance is decent if not exceptional (depending on configuration).  It is only once taking a closer look at OCP and/or the worker node itself that the problems that are manifesting become apparent.

Comment 7 Karl Rister 2020-07-10 19:12:09 UTC
Created attachment 1700644 [details]
dmesg output

I recreated the scenario mentioned in comment #5 from a fresh boot and captured dmesg from the initial boot through the problem occurring.  The guest was booted right around the "[Fri Jul 10 17:30:08 2020]" timestamp.

Comment 8 Juri Lelli 2020-07-13 09:28:02 UTC
Hi Karl,

I'm thinking that it might be interesting to try running this once
with the -debug kernel.

I noticed from the dmesg you shared that the blocked tasks seem to be
either blocked on the netlink_sock cb_mutex (e.g., NetworkManager, agetty)
or on css_set_lock, which is also a mutex on RT (e.g., crio, kubelet).

When running with LOCKDEP enabled the hung task warning should
hopefully also print a list of all locks held in the system, so
we could try to use that additional info to figure out who is
holding the lock(s) the blocked tasks are waiting for.
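
As a quick (hedged) sanity check, lockdep availability in the running kernel can be confirmed from the kernel config; the /boot/config-* path is the usual layout and may differ on RHCOS:

grep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING)=' "/boot/config-$(uname -r)"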

Comment 9 Karl Rister 2020-07-13 18:12:00 UTC
Created attachment 1700879 [details]
dmesg output with debug kernel

Eventually got the RHCOS node to boot with the kernel-rt debug kernel and then fired off a Pod.  It took quite a while for testpmd to come online, presumably due to debug overhead (it is pretty slow on RT as is, due to memory mapping).

The problem seems to have taken much longer to manifest than previously seen but I imagine that has something to do with the debug overhead as well.

Attached is a new dmesg with the lockdep output.

Comment 17 Andrew Theurer 2020-07-20 14:25:55 UTC
Changing priority to urgent as this blocks OCP 4.6 release

Comment 18 Daniel Bristot de Oliveira 2020-07-21 16:03:43 UTC
Hi

Juri and I have been working the entire day on this problem, and here are our findings.

We found a way to reproduce the problem: it consists of sending signal -1 (SIGHUP) to the ovs-vswitchd daemon after setting the real-time priority on the testpmd threads. The signal makes ovs-vswitchd restart, and as a consequence it runs the i40e functions that "cause" the problem (ovs-vswitchd waiting for a workqueue).

We also found that this action dispatches work items to the CPUs where the testpmd threads are running as a busy loop with real-time priority. If no action is taken, the system goes into a faulty state.

However, if we increase the priority of these kworkers (the WQ_HIGHPRI kworkers) to a real-time priority higher than that of the testpmd threads, the kworker runs and ovs-vswitchd makes progress, avoiding the faulty state.
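
A minimal sketch of the reproducer and workaround described above (this is not the attached instructions; the kworker name pattern and the choice of CPUs 7 and 9 come from the top output earlier in this bug and are assumptions for any other setup):

pkill -1 ovs-vswitchd    # signal -1 (SIGHUP) makes ovs-vswitchd restart and exercise the flush path
# Workaround: boost the per-CPU high-priority kworkers above the testpmd FIFO:1 threads.
for tid in $(ps -eLo tid,comm | awk '$2 ~ /^kworker\/(7|9):[0-9]+H$/ { print $1 }'); do
    chrt -f -p 2 "${tid}"
done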

With this in mind, we (mostly) discard our previous theory of a locking problem in i40e. At that point, we thought that ovs-vswitchd was (indirectly) waiting for the kworker running the i40e_service_task() work, but that is not the case. So it seems to be a "classical" starvation bug.

Juri and I will continue debugging this problem.

I am attaching instructions on how to reproduce and work around the problem.

Thanks
Daniel Oliveira and Juri Lelli

Comment 19 Daniel Bristot de Oliveira 2020-07-21 16:04:45 UTC
Created attachment 1701936 [details]
Instructions about how to reproduce the problem (and how to workaround it)

Comment 20 Luiz Capitulino 2020-07-21 18:33:50 UTC
Daniel,

What's the function that's causing the work items to be dispatched
to isolated CPUs? Is it in the workqueue subsystem, or is it some
specific function the i40e driver is calling?

We set the workqueue cpumask, but I remember this doesn't affect
all workqueues.

Comment 21 Daniel Bristot de Oliveira 2020-07-22 07:34:36 UTC
Hi Luiz

It is dispatched by net/core/dev.c:flush_all_backlogs(), inside a for_each_online_cpu(cpu) loop, using "system_highpri_wq" (in other words, the high-priority workqueue), which is per-CPU.

IIRC, we are setting the mask for the unbound wq, which, as the name suggests, is not bound to a specific CPU.
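
For illustration, the per-CPU WQ_HIGHPRI worker pools are visible from userspace by their "H" suffix (a hedged one-liner, not part of the analysis above):

ps -eLo tid,psr,comm | awk '$3 ~ /^kworker\/[0-9]+:[0-9]+H$/'    # one high-priority kworker pool per CPU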

Comment 22 Phil Auld 2020-07-22 13:26:38 UTC
Hi Daniel,

That makes me think of bz1745111. In that bug we found that an UNBOUND wq basically meant "use the CPU of the submitting task", which in that case was causing the submitter to get migrated away by the load balancer since there would be 2 running tasks.  There was an upstream list discussion about it but I can't find it right now. Ming Lei posted a patch which started it, I think.  Juri may remember; it was part of the SAP thing.  One of the attempted fixes which did work was to explicitly bind the work to a different CPU. It sounds like the same situation here if the task on the isolated CPU is doing the submitting.

Fwiw,

Phil

Comment 23 Daniel Bristot de Oliveira 2020-07-22 16:32:51 UTC
Hi Phil, thanks for the heads up, but unfortunately this is not the same problem we are seeing :-(

Comment 26 Juri Lelli 2020-07-23 17:07:25 UTC
Hi,

While userspace workaround (comment25) is tested and numbers gathered (as agreed during today's
sync call), we'd need to discuss possible (maybe longer term) fixes for the kernel.

In my understanding, the highprio workqueues that hang are scheduled onto the isolated cores
running the testpmd FIFO threads by the net/core/dev.c flush_all_backlogs() function, which uses
for_each_online_cpu() to queue flushing of (potentially) remaining packets on all online CPUs.

We wonder if it might make sense to keep isolated cores out of the packet-handling business
(unless strictly required), so that there will be no need to trigger flushing on them when an
interface is removed/reconfigured, and consequently no need to queue highprio workqueues.

I discussed the issue and provided a bit of background to Paolo (now cc-ed) already and he had
interesting considerations that I kindly ask him to repeat in this thread as well so that everybody
can follow the discussion. Paolo, feel free, of course, to come back with questions if something is
not clear to you.

Comment 27 Paolo Abeni 2020-07-24 12:04:18 UTC
> I discussed the issue and provided a bit of background to Paolo (now cc-ed)
> already and he had interesting considerations that I kindly ask him to repeat in this thread as
> well so that everybody can follow the discussion.

First thing, the commit introducing the issue:

commit 145dd5f9c88f6ee645662df0be003e8f04bdae93
Author: Paolo Abeni <pabeni>
Date:   Thu Aug 25 15:58:44 2016 +0200

    net: flush the softnet backlog in process context

can't be easily reverted because it is leveraged by a later commit:

commit 41852497a9205964b958a245a9526040b980926f
Author: Eric Dumazet <edumazet>
Date:   Fri Aug 26 12:50:39 2016 -0700

    net: batch calls to flush_all_backlogs()

to speed up netns removal time when many network devices are present, which is considered a quite critical metric.

flush_all_backlogs() needs to act on each online CPU due to RPS (receive packet steering): each NIC can be configured via a sysfs entry to redirect incoming packets to specific CPUs, using the ingress packet RX hash for the demux operation. The target CPU can be selected even if it is isolated.

So a possible solution could be to update RPS to be aware of isolated CPUs: store_rps_map() in net/core/net-sysfs.c must prevent isolated CPUs from being inserted into the map, and flush_all_backlogs() could then skip isolated CPUs.

The above assumes that the isolated CPU list does not change after boot. Otherwise, at isolated-CPU-mask update time, the softnet_data.input_pkt_queue and softnet_data.process_queue of the newly isolated CPUs should be moved to other cores.

Note that the above will introduce a change of established behavior and may see some resistance upstream. Possibly doing it under some specific boot param and/or Kconfig knob (CONFIG_CPU_ISOLATION???) could help. Does CPU isolation already prevent e.g. NICs from scheduling IRQs on the isolated cores? I'm asking to try to understand how much resistance the mentioned change could find upstream.
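
For context, RPS is configured per RX queue through sysfs, so keeping isolated CPUs out of the steering mask can also be done by hand today. A hedged sketch, where the interface name "ens1f0", a 24-CPU box, and CPUs 7/9 being the isolated cores are all assumptions:

# 0xfffd7f is a 24-CPU mask with bits 7 and 9 cleared; adjust to the real topology.
for q in /sys/class/net/ens1f0/queues/rx-*/rps_cpus; do
    echo fffd7f > "$q"
done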

Comment 28 Nitesh Narayan Lal 2020-07-24 12:31:43 UTC
(In reply to Paolo Abeni from comment #27)
> > I discussed the issue and provided a bit of background to Paolo (now cc-ed)
> > already and he had interesting considerations that I kindly ask him to repeat in this thread as
> > well so that everybody can follow the discussion.
> 
> First thing, the commit introducing the issue:
> 
> commit 145dd5f9c88f6ee645662df0be003e8f04bdae93
> Author: Paolo Abeni <pabeni>
> Date:   Thu Aug 25 15:58:44 2016 +0200
> 
>     net: flush the softnet backlog in process context
> 
> can't be easily reverted because is leveraged by later commit:
> 
> commit 41852497a9205964b958a245a9526040b980926f
> Author: Eric Dumazet <edumazet>
> Date:   Fri Aug 26 12:50:39 2016 -0700
> 
>     net: batch calls to flush_all_backlogs()
> 
> to speed-up netns removal time in case of many network device present, which
> is considered a quite critical metric.
> 
> flush_all_backlogs() needs to act on each online CPUs due to RPS - receive
> packet steering: each NIC can be configured via sysfs entry to redirect
> incoming packet to specific CPUs using the ingress packet RX hash for the
> demux operation. The target CPU can be selected even if isolated.
> 
> So a possible solution could be update RPS to be aware of isolated CPUs:
> store_rps_map() in net/core/net-sysfs.c must prevent isolated CPUs from
> being inserted into the map and flush_all_backlogs() could then skip
> isolated CPUs.

I have recently worked on this change and it is already sitting in linux-next [1].
I missed including the netdev ML, which I should have.
However, this particular change was reviewed and Acked by David S. Miller [2].

[1] https://lore.kernel.org/lkml/20200625223443.2684-4-nitesh@redhat.com/
[2] https://lore.kernel.org/lkml/20200626.102038.1298513542272393362.davem@davemloft.net/

> 
> The above assuming that the isolated CPU lists did not change after the
> boot. Otherwise, at isolated CPU mask update time, the
> sofrnet_data.input_pkt_queue and softnet_data.process_queue of the newly
> isolated CPUs should be moved to others cores.
> 
> Note that the above will introduce a change of established behavior and may
> see some resistance upstream. Possibly doing the above under some specific
> boot params and/or Kconfig knob (CONFIG_CPU_ISOLATION???) could help. Does
> CPU isolation already prevent e.g. NICS from scheduling IRQs on the isolated
> cores? I'm asking to try to understand how much resistance could find
> upstream the mentioned change.

Yes, at least for the NIC drivers that use cpumask_local_spread to pick CPUs at the time
of vector allocation. This change is also made in the same patch-set [1].

[1] https://lore.kernel.org/lkml/20200625223443.2684-2-nitesh@redhat.com/

Comment 29 Paolo Abeni 2020-07-24 14:59:41 UTC
(In reply to Nitesh Narayan Lal from comment #28)
> (In reply to Paolo Abeni from comment #27)
> > > I discussed the issue and provided a bit of background to Paolo (now cc-ed)
> > > already and he had interesting considerations that I kindly ask him to repeat in this thread as
> > > well so that everybody can follow the discussion.
> > 
> > First thing, the commit introducing the issue:
> > 
> > commit 145dd5f9c88f6ee645662df0be003e8f04bdae93
> > Author: Paolo Abeni <pabeni>
> > Date:   Thu Aug 25 15:58:44 2016 +0200
> > 
> >     net: flush the softnet backlog in process context
> > 
> > can't be easily reverted because is leveraged by later commit:
> > 
> > commit 41852497a9205964b958a245a9526040b980926f
> > Author: Eric Dumazet <edumazet>
> > Date:   Fri Aug 26 12:50:39 2016 -0700
> > 
> >     net: batch calls to flush_all_backlogs()
> > 
> > to speed-up netns removal time in case of many network device present, which
> > is considered a quite critical metric.
> > 
> > flush_all_backlogs() needs to act on each online CPUs due to RPS - receive
> > packet steering: each NIC can be configured via sysfs entry to redirect
> > incoming packet to specific CPUs using the ingress packet RX hash for the
> > demux operation. The target CPU can be selected even if isolated.
> > 
> > So a possible solution could be update RPS to be aware of isolated CPUs:
> > store_rps_map() in net/core/net-sysfs.c must prevent isolated CPUs from
> > being inserted into the map and flush_all_backlogs() could then skip
> > isolated CPUs.
> 
> I have recently worked on this change and this is already sitting in
> linux-next [1].
> I did miss to include the netdev ml, which I should have. 
> However, this particular change was reviewed Acked by David S. Miller [2].
> 
> [1] https://lore.kernel.org/lkml/20200625223443.2684-4-nitesh@redhat.com/
> [2] https://lore.kernel.org/lkml/20200626.102038.1298513542272393362.davem@davemloft.net/

Excellent! looks like we are on the same page! The mentioned commit was also discussed on an internal ML, but I did not follow closely.

If I read correctly, to completely address this issue we additionally need to mask out isolated CPUs in 
flush_backlog() loops. Am I missing something? Or are you planning such change, too?

Thanks,

Paolo

Comment 30 Nitesh Narayan Lal 2020-07-24 15:41:32 UTC
(In reply to Paolo Abeni from comment #29)
> (In reply to Nitesh Narayan Lal from comment #28)
> > (In reply to Paolo Abeni from comment #27)
> > > > I discussed the issue and provided a bit of background to Paolo (now cc-ed)
> > > > already and he had interesting considerations that I kindly ask him to repeat in this thread as
> > > > well so that everybody can follow the discussion.
> > > 
> > > First thing, the commit introducing the issue:
> > > 
> > > commit 145dd5f9c88f6ee645662df0be003e8f04bdae93
> > > Author: Paolo Abeni <pabeni>
> > > Date:   Thu Aug 25 15:58:44 2016 +0200
> > > 
> > >     net: flush the softnet backlog in process context
> > > 
> > > can't be easily reverted because is leveraged by later commit:
> > > 
> > > commit 41852497a9205964b958a245a9526040b980926f
> > > Author: Eric Dumazet <edumazet>
> > > Date:   Fri Aug 26 12:50:39 2016 -0700
> > > 
> > >     net: batch calls to flush_all_backlogs()
> > > 
> > > to speed-up netns removal time in case of many network device present, which
> > > is considered a quite critical metric.
> > > 
> > > flush_all_backlogs() needs to act on each online CPUs due to RPS - receive
> > > packet steering: each NIC can be configured via sysfs entry to redirect
> > > incoming packet to specific CPUs using the ingress packet RX hash for the
> > > demux operation. The target CPU can be selected even if isolated.
> > > 
> > > So a possible solution could be update RPS to be aware of isolated CPUs:
> > > store_rps_map() in net/core/net-sysfs.c must prevent isolated CPUs from
> > > being inserted into the map and flush_all_backlogs() could then skip
> > > isolated CPUs.
> > 
> > I have recently worked on this change and this is already sitting in
> > linux-next [1].
> > I did miss to include the netdev ml, which I should have. 
> > However, this particular change was reviewed Acked by David S. Miller [2].
> > 
> > [1] https://lore.kernel.org/lkml/20200625223443.2684-4-nitesh@redhat.com/
> > [2] https://lore.kernel.org/lkml/20200626.102038.1298513542272393362.davem@davemloft.net/
> 
> Excellent! looks like we are on the same page! The mentioned commit was also
> discussed on an internal ML, but I did not follow closely.
> 
> If I read correctly, to completely address this issue we additionally need
> to mask out isolated CPUs in 
> flush_backlog() loops. Am I missing something? Or are you planning such
> change, too?

Yeah, that's also my impression.
Juri has told me that you will be taking care of the flush_backlog() change.
So once that is done we should be fine.

Thanks
Nitesh

> 
> Thanks,
> 
> Paolo

Comment 31 Andrew Theurer 2020-07-28 01:50:07 UTC
> If I read correctly, to completely address this issue we additionally need
> to mask out isolated CPUs in 
> flush_backlog() loops. Am I missing something? Or are you planning such
> change, too?


But we are not using isolated CPUs (isolcpus=domain=).  And over time the CPUs being used could change.  If this is a problem with RPS, why can't we check which CPUs are targets for RPS and, if a CPU is not a target, just skip it?

Comment 32 Paolo Abeni 2020-07-28 08:37:21 UTC
(In reply to Andrew Theurer from comment #31)
> > If I read correctly, to completely address this issue we additionally need
> > to mask out isolated CPUs in 
> > flush_backlog() loops. Am I missing something? Or are you planning such
> > change, too?
> 
> 
> But we are not using isolated CPUs (isolcpus=domain=).  And over time these
> CPUs being used could change.  If this is a problem with RPS, why can't we
> check what CPUs are targets for RPS, and a CPU is not a target, can we just
> skip them?

RPS settings are per-device and can be modified dynamically at run-time. flush_backlog processes several devices at once. It's fairly non-trivial to compute the minimal CPU set flush_device should run on. Additionally, a valid, useful and common configuration is spreading RPS across all the available CPUs. Without isolcpus it looks impossible to me to avoid this issue.
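
For illustration, the per-device, per-queue nature of the setting is visible directly in sysfs (a hedged example; the output depends entirely on the local configuration):

grep . /sys/class/net/*/queues/rx-*/rps_cpus    # one runtime-changeable RPS mask per device per RX queue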

Comment 33 Juri Lelli 2020-07-28 09:13:00 UTC
(In reply to Andrew Theurer from comment #31)
> > If I read correctly, to completely address this issue we additionally need
> > to mask out isolated CPUs in 
> > flush_backlog() loops. Am I missing something? Or are you planning such
> > change, too?
> 
> 
> But we are not using isolated CPUs (isolcpus=domain=).  And over time these
> CPUs being used could change.  If this is a problem with RPS, why can't we
> check what CPUs are targets for RPS, and a CPU is not a target, can we just
> skip them?

But I believe rcu_nocbs is used, though, or am I wrong?

If that's the case, maybe we can think of an additional kernel/sysfs parameter
(i.e., something similar to standard workqueues) to configure RPS queues and
flushing?

Comment 34 Karl Rister 2020-07-28 13:22:46 UTC
I fired up raggiro and just let the system run for a long time (overnight) and I still saw problems.  It looked like it was waiting indefinitely for a task that had been given deadline scheduling to complete.  When I looked at the list of runnable tasks on the isolated CPUs I saw quite a few.  This made me think that, the way the script was written, it was finding one task to give deadline scheduling to (and then blocking until that task was no longer runnable) and never actually giving the deadline scheduling boost to the task that was actually blocking forward progress for the other tasks.  So I have made some changes to raggiro and will post that as raggiro v2.  It is no longer filtering just on kworker threads, and I had to tweak the logic some to allow it to handle multiple tasks per CPU simultaneously.
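
For illustration only, a hedged sketch of what handing one starving task a small SCHED_DEADLINE reservation from the shell can look like (the 20us-per-1s figures mirror the values mentioned later in comment 39; ${tid} stands for whatever task gets boosted, and this is not the actual raggiro code):

chrt -d --sched-runtime 20000 --sched-deadline 1000000000 \
     --sched-period 1000000000 -p 0 "${tid}"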

Unfortunately, raggiro v2 is still not working.  It does appear to let the tasks make forward progress, albeit extremely slowly.  I saw a lot of messages like this from running it overnight:

[Tue Jul 28 01:42:57 2020] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[Tue Jul 28 01:42:57 2020] rcu:         Tasks blocked on level-1 rcu_node (CPUs 0-11): P85
[Tue Jul 28 01:42:57 2020] rcu:         (detected by 0, t=60002 jiffies, g=9520401, q=432287)
[Tue Jul 28 01:42:57 2020] ksoftirqd/7     R  running task        0    85      2 0x80000000
[Tue Jul 28 01:42:57 2020] Call Trace:
[Tue Jul 28 01:42:57 2020]  ? __schedule+0x25a/0x6a0
[Tue Jul 28 01:42:57 2020]  preempt_schedule_irq+0x43/0xa0
[Tue Jul 28 01:42:57 2020]  do_preempt_schedule_irq+0x5/0x7
[Tue Jul 28 01:42:57 2020] RIP: 0010:find_busiest_group+0x20b/0xac0
[Tue Jul 28 01:42:57 2020] Code: 08 48 8b 74 ce 08 48 39 f0 48 0f 42 c6 eb 84 8b 40 38 89 44 24 04 e9 7f fe ff ff 8b 4c 24 04 48 8d 34 10 48 8b 86 28 01 00 00 <85> c9 0f 84 61 ff ff ff 0f 1f 44 00 00 e9 57 ff ff ff 48 8b 4c 24
[Tue Jul 28 01:42:57 2020] RSP: 0000:ffffb006d49bfbb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[Tue Jul 28 01:42:57 2020] RAX: 0000000000000000 RBX: 0000000000025d00 RCX: 0000000000000002
[Tue Jul 28 01:42:57 2020] RDX: ffff9f339fa80000 RSI: ffff9f339faa5d00 RDI: ffff9f1c0d137030
[Tue Jul 28 01:42:57 2020] RBP: ffff9f339faa5d00 R08: 0000000000000015 R09: 0000000000000000
[Tue Jul 28 01:42:57 2020] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000015
[Tue Jul 28 01:42:57 2020] R13: 0000000000000000 R14: ffffb006d49bfbe0 R15: ffffb006d49bfd70
[Tue Jul 28 01:42:57 2020]  load_balance+0x16e/0xa70
[Tue Jul 28 01:42:57 2020]  rebalance_domains+0x25d/0x2e0
[Tue Jul 28 01:42:57 2020]  do_current_softirqs+0x223/0x3e0
[Tue Jul 28 01:42:57 2020]  ? smpboot_register_percpu_thread_cpumask+0x130/0x130
[Tue Jul 28 01:42:57 2020]  run_ksoftirqd+0x22/0x40
[Tue Jul 28 01:42:57 2020]  smpboot_thread_fn+0x1d6/0x2c0
[Tue Jul 28 01:42:57 2020]  kthread+0x112/0x130
[Tue Jul 28 01:42:57 2020]  ? kthread_flush_work_fn+0x10/0x10
[Tue Jul 28 01:42:57 2020]  ret_from_fork+0x35/0x40

I also still eventually saw a hung task timeout message:

[Tue Jul 28 02:00:27 2020] INFO: task rcub/1:15 blocked for more than 600 seconds.
[Tue Jul 28 02:00:27 2020]       Not tainted 4.18.0-147.8.1.rt24.101.el8_1.x86_64 #1
[Tue Jul 28 02:00:27 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Jul 28 02:00:27 2020] rcub/1          D    0    15      2 0x80000000
[Tue Jul 28 02:00:27 2020] Call Trace:
[Tue Jul 28 02:00:27 2020]  ? __schedule+0x25a/0x6a0
[Tue Jul 28 02:00:27 2020]  schedule+0x39/0xd0
[Tue Jul 28 02:00:27 2020]  __rt_mutex_slowlock+0xbe/0x130
[Tue Jul 28 02:00:27 2020]  ? __perf_event_task_sched_in+0x21/0x250
[Tue Jul 28 02:00:27 2020]  rt_mutex_slowlock_locked+0xbc/0x270
[Tue Jul 28 02:00:27 2020]  rt_mutex_slowlock+0x6d/0xc0
[Tue Jul 28 02:00:27 2020]  ? trace_rcu_utilization+0x80/0x80
[Tue Jul 28 02:00:27 2020]  rcu_boost_kthread+0xf7/0x370
[Tue Jul 28 02:00:27 2020]  ? trace_rcu_utilization+0x80/0x80
[Tue Jul 28 02:00:27 2020]  kthread+0x112/0x130
[Tue Jul 28 02:00:27 2020]  ? kthread_flush_work_fn+0x10/0x10
[Tue Jul 28 02:00:27 2020]  ret_from_fork+0x35/0x40

Comment 35 Karl Rister 2020-07-28 13:24:52 UTC
Created attachment 1702661 [details]
raggiro v2 script

Comment 36 Daniel Bristot de Oliveira 2020-07-28 13:45:02 UTC
Karl,

Can we have access to the system?

-- Daniel

Comment 37 Karl Rister 2020-07-28 13:50:57 UTC
Daniel

Sure.  The worker node is freshly booted; it was completely inaccessible this morning.

Karl

Comment 38 Daniel Bristot de Oliveira 2020-07-28 14:55:18 UTC
Warning while using the C version of raggiro: https://bugzilla.redhat.com/show_bug.cgi?id=1861414

Comment 39 Daniel Bristot de Oliveira 2020-07-28 15:51:45 UTC
Fast report:

I was able to reproduce the RCU stall, and yeah, it happened because there was not enough time for RCU to make progress.

Now I am giving 20us/second, during a three-second window, to each starving task that has been waiting as runnable for more than 10 seconds [all tasks will receive at least three periods].

These parameters seem to be enough to allow RCU to stop complaining.

The SCHED_DEADLINE messages that are showing up there can be ignored, see comment #38.

[we are using a new version of raggiro, made in C...]

Let's see if the system stays up until tomorrow.

Thoughts?

-- Daniel

Comment 40 Karl Rister 2020-07-28 16:00:06 UTC
Sounds good.  I'm afraid we are going to have to let it run for long periods to see if it is actually avoiding the problems.

So the 20us/second is interesting.  We moved to 10us/second because our latency requirement is 20us.  So allowing 20us/second of interruption virtually guarantees that our latency requirement will be exceeded :/

Karl

Comment 41 Daniel Bristot de Oliveira 2020-07-28 17:29:12 UTC
Guaranteeing 20us latency is something more complex than that... mainly because we are doing very... let's say... optimistic tests (see https://bristot.me/demystifying-the-real-time-linux-latency/ for a contrast).

But now I reduced the runtime to 10us, let's see!

Comment 42 Daniel Bristot de Oliveira 2020-07-29 13:10:04 UTC
Karl

The system survived without RCU messages using the 10us/sec setup.

Would you mind installing this kernel: http://brew-task-repos.usersys.redhat.com/repos/scratch/jlelli/kernel-rt/4.18.0/147.23.1.rt24.117.el8_1.dlboost.test.cki.kt0/x86_64/

[it fixes a warning from SCHED_DEADLINE... kudos to Juri]

and running the raggiro binary I left in /var/home/core/raggiro

with the following arguments:

/var/home/core/raggiro -c 7,9 -k -t 10 -r 10000

to see if the system survives?

Thanks!

Comment 43 Karl Rister 2020-07-29 13:25:26 UTC
Daniel

Will do.  Thanks!

Karl

Comment 44 Paolo Abeni 2020-07-29 14:43:37 UTC
(In reply to Andrew Theurer from comment #31)
> > If I read correctly, to completely address this issue we additionally need
> > to mask out isolated CPUs in 
> > flush_backlog() loops. Am I missing something? Or are you planning such
> > change, too?
> 
> 
> But we are not using isolated CPUs (isolcpus=domain=).  

Due to the above comment and fact I will not submit the suggested upstream patch: it will not apply here.

It could still be useful in other scenarios. I'll try to follow the discussion here and will act accordingly.

/P

Comment 45 Karl Rister 2020-07-29 22:15:01 UTC
So far this is looking a lot better, but I probably need to let it run a lot longer (i.e., overnight) to be sure.

Daniel/Juri

Can you guys explain what the C version is doing differently than the scripted version was doing?  I did not actually see the code lurking anywhere but I also did not look very hard yet.

Karl

Comment 46 Daniel Bristot de Oliveira 2020-07-30 07:39:11 UTC
Hi Karl,

The C version is here:

https://github.com/bristot/starvation_monitor

The idea is basically the same. However, it differs from the bash version in that, instead of running external commands (which are more likely to block on locks affected by the starvation), it gets the information straight from the kernel files and issues the system calls that control the prioritization directly.

It is also easier to use, maintain and optimize as C code, and we will work to make it a production-ready daemon.
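
As an illustration (not the starvation_monitor code itself), this is the kind of kernel file such a tool can parse directly to spot runnable-but-starving tasks on a CPU, without forking external commands that could themselves block (assumes CONFIG_SCHED_DEBUG, which the kernels used here provide):

sed -n '/^cpu#7/,/^cpu#8/p' /proc/sched_debug | sed -n '/runnable tasks/,$p'    # runnable tasks queued on CPU 7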

Comment 54 Yaniv Joseph 2020-10-01 12:04:49 UTC
Should this BZ state be ON_QA, as stalld was delivered?

Comment 55 Daniel Bristot de Oliveira 2020-10-05 11:56:37 UTC
I would say yes. I wonder what management thinks.

I will raise this topic on today's meeting.

Comment 58 Luiz Capitulino 2020-10-08 20:40:03 UTC
Since bug 1878602 resolves this issue, I'm closing this as
a dupe. If you disagree, feel free to reopen at any time.

*** This bug has been marked as a duplicate of bug 1878602 ***

