Bug 1887568
Summary: Avoiding network soft IRQs on isolated CPUs
Product: Red Hat Enterprise Linux 8
Version: 8.4
Component: kernel-rt
Sub Component: Networking
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Daniel Bristot de Oliveira <daolivei>
Assignee: Paolo Abeni <pabeni>
QA Contact: mhou <mhou>
CC: atheurer, atragler, bhu, dblack, fbaudin, fromani, jmencak, kzhang, lcapitulino, mapfelba, mark.d.gray, mleitner, mtosatti, nilal, pabeni, rkhan, trwilson, william.caban, williams
Target Milestone: rc
Target Release: 8.0
Type: Bug
Bug Blocks: 1825271
Last Closed: 2020-11-16 21:05:17 UTC
Description - Daniel Bristot de Oliveira - 2020-10-12 19:49:32 UTC
Tracing the irq:softirq_raise event, I see this:

--------- %< ---------------
mdns-publisher-9106 [004] d...113 338655.473930: softirq_raise: vec=3 [action=NET_RX]
mdns-publisher-9106 [004] d...213 338655.473934: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => netif_rx_ni
 => dev_loopback_xmit
 => ip_mc_output
 => ip_send_skb
 => udp_send_skb.isra.46
 => udp_sendmsg
 => sock_sendmsg
 => ___sys_sendmsg
 => __sys_sendmsg
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------

and this:

--------- %< ---------------
<...>-42518 [004] d...114 338653.188346: softirq_raise: vec=3 [action=NET_RX]
<...>-42518 [004] d...214 338653.188351: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => loopback_xmit
 => dev_hard_start_xmit
 => __dev_queue_xmit
 => ip_finish_output2
 => ip_output
 => __ip_queue_xmit
 => __tcp_transmit_skb
 => tcp_write_xmit
 => __tcp_push_pending_frames
 => tcp_close
 => inet_release
 => __sock_release
 => sock_close
 => __fput
 => task_work_run
 => exit_to_usermode_loop
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------

and these point to this code:

--------- %< ---------------
static int netif_rx_internal(struct sk_buff *skb)
{
        int ret;

        net_timestamp_check(netdev_tstamp_prequeue, skb);

        trace_netif_rx(skb);

#ifdef CONFIG_RPS
        if (static_branch_unlikely(&rps_needed)) {
                struct rps_dev_flow voidflow, *rflow = &voidflow;
                int cpu;

                preempt_disable();
                rcu_read_lock();

                cpu = get_rps_cpu(skb->dev, skb, &rflow);
                if (cpu < 0)
                        cpu = smp_processor_id();

                ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

                rcu_read_unlock();
                preempt_enable();
        } else
#endif
        {
                unsigned int qtail;

                ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
                put_cpu();
        }
--------- >% ---------------

By default, the RPS cpumask is not set, so it makes sense that the packet is handled on the local CPU.

I tried to set the RPS mask to CPUs 1 and 2 (mask 3) for all NICs:

find /sys -name rps_cpus | while read mask; do echo 3 > $mask; done

but it did not work. I wonder if that is because the NICs where I am seeing the softirq raised are the virtual ones?

Here is one example of veth_xmit:

--------- %< ---------------
<...>-1191724 [004] d...114 338990.764907: softirq_raise: vec=3 [action=NET_RX]
<...>-1191724 [004] d...214 338990.764913: <stack trace>
 => trace_event_raw_event_softirq
 => __raise_softirq_irqoff
 => enqueue_to_backlog
 => netif_rx_internal
 => veth_xmit
 => dev_hard_start_xmit
 => __dev_queue_xmit
 => ip_finish_output2
 => ip_output
 => __ip_queue_xmit
 => __tcp_transmit_skb
 => tcp_write_xmit
 => __tcp_push_pending_frames
 => tcp_sendmsg_locked
 => tcp_sendmsg
 => sock_sendmsg
 => sock_write_iter
 => new_sync_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
--------- >% ---------------
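[Editor's note: captures like the ones above can be reproduced with plain ftrace. The following is only a minimal sketch, not the exact commands used in this BZ; it assumes debugfs is mounted at /sys/kernel/debug and relies on vec == 3 being NET_RX.]

--------- %< ---------------
#!/bin/sh
# Sketch: record a stack trace for every NET_RX softirq_raise, similar to
# the traces above (assumes debugfs mounted at /sys/kernel/debug).
cd /sys/kernel/debug/tracing || exit 1

echo 'stacktrace if vec == 3' > events/irq/softirq_raise/trigger  # 3 == NET_RX
echo 'vec == 3' > events/irq/softirq_raise/filter
echo 1 > events/irq/softirq_raise/enable

sleep 10
head -n 80 trace

# clean up
echo 0 > events/irq/softirq_raise/enable
echo '!stacktrace' > events/irq/softirq_raise/trigger
echo 0 > events/irq/softirq_raise/filter
--------- >% ---------------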
(In reply to Daniel Bristot de Oliveira from comment #0)
> Description of problem:
>
> A latency spike of ~150 us was noticed on the OCP setup, caused by the
> handling of network softirqs on isolated CPUs.

There was https://bugzilla.redhat.com/show_bug.cgi?id=1844520

Did you test with a kernel with these patches in?

Hi Marcelo,

I am aware of those patches, but no network interrupts are taking place on those CPUs.

-- Daniel

(In reply to Daniel Bristot de Oliveira from comment #1)
> I tried to set the RPS mask to CPUs 1 and 2 (mask 3) for all NICs:
>
> find /sys -name rps_cpus | while read mask; do echo 3 > $mask; done
>
> but it did not work. I wonder if that is because the NICs where I am seeing
> the softirq raised are the virtual ones?

The above command has a couple of caveats. If OpenShift is in play, you may have additional relevant devices in some network namespace that will not be reached. Additionally, OpenShift may delete or create new devices after the above command completes, and newly created devices will carry an empty RPS mask.

How do you conclude it did not work? Did you still see the network receive path triggered on CPUs other than 1 and 2?
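[Editor's note: the first caveat can be checked directly by auditing the RPS mask of every RX queue in every network namespace, not just the root one. A minimal sketch, assuming the namespaces are visible via `ip netns list`; the helper below is illustrative only.]

--------- %< ---------------
#!/bin/sh
# Sketch: print the rps_cpus mask of every RX queue, in the root namespace
# and in every named network namespace, so devices created after the tuning
# ran (and thus carrying an empty mask) stand out.
list_masks='for f in /sys/class/net/*/queues/rx-*/rps_cpus; do
                [ -e "$f" ] && printf "%s %s\n" "$f" "$(cat $f)"
            done'

echo "== root namespace =="
sh -c "$list_masks"

for ns in $(ip netns list | awk '{print $1}'); do
    echo "== netns $ns =="
    ip netns exec "$ns" sh -c "$list_masks"
done
--------- >% ---------------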
I did not conclude anything, but I observed that there was still network softirq activity on CPUs other than 1 and 2.

This seems to be a good point:

"If OpenShift is in play, you may have additional relevant devices in some network namespace that will not be reached. Additionally, OpenShift may delete or create new devices after the above command completes, and newly created devices will carry an empty RPS mask."

So, is there a way to set an "initial" or "global" RPS mask, so that newly created devices start with the restricted mask?

Paolo and I worked on this problem yesterday and:

1) We confirmed that RPS was the cause of (part of) the latency spikes;
2) Paolo created a tuning script that correctly sets the RPS mask for all the (existing) namespaces.

With that script, the results improved dramatically, from around 180 us on all measurements to below 100 us as the common case, with some sporadic spikes. The overnight results are [each test is a ten-minute run]:

Maximum: 16 23 12 12 13 12 15 16 (us)
Maximum: 25 22 15 16 14 14 11 13 (us)
Maximum: 17 18 13 12 11 11 19 13 (us)
Maximum: 22 46 23 12 11 16 12 12 (us)
Maximum: 22 17 11 12 11 10 11 13 (us)
Maximum: 17 11 12 17 15 13 11 11 (us)
Maximum: 16 10 10 9 10 10 9 14 (us)
Maximum: 16 11 13 11 11 10 11 14 (us)
Maximum: 17 13 10 11 16 13 10 10 (us)
Maximum: 54 12 11 11 12 12 16 17 (us)
Maximum: 17 11 13 11 13 11 17 11 (us)
Maximum: 16 12 10 11 12 12 11 10 (us)
Maximum: 68 11 172 14 10 11 11 11 (us)
Maximum: 34 16 11 13 12 17 13 12 (us)
Maximum: 16 10 15 10 11 11 11 12 (us)
Maximum: 19 12 203 10 13 10 9 12 (us)
Maximum: 30 13 25 13 14 12 10 9 (us)
Maximum: 16 10 10 10 11 10 12 10 (us)
Maximum: 92 10 13 10 12 14 11 10 (us)
Maximum: 18 10 10 9 9 15 11 13 (us)
Maximum: 33 15 16 16 121 12 17 12 (us)
Maximum: 16 11 11 12 11 12 13 11 (us)
Maximum: 22 12 12 17 11 12 18 12 (us)
Maximum: 17 10 11 15 11 11 14 11 (us)
Maximum: 17 15 18 10 11 10 11 17 (us)
Maximum: 17 13 11 13 10 10 10 10 (us)
Maximum: 11 11 12 11 10 10 11 17 (us)
Maximum: 10 10 9 14 15 11 10 102 (us)
Maximum: 11 15 10 10 12 13 10 11 (us)
Maximum: 19 10 10 12 10 10 12 45 (us)
Maximum: 12 14 11 12 20 15 11 11 (us)
Maximum: 12 11 14 11 11 12 11 11 (us)
Maximum: 11 12 12 11 16 18 16 11 (us)
Maximum: 18 11 10 10 17 11 15 11 (us)
Maximum: 17 11 11 12 11 11 17 12 (us)
Maximum: 17 12 12 13 16 21 11 13 (us)
Maximum: 14 13 12 13 12 16 12 13 (us)
Maximum: 20 16 19 21 13 22 13 11 (us)
Maximum: 15 18 10 11 12 15 17 20 (us)
Maximum: 15 10 14 11 11 10 12 11 (us)
Maximum: 11 10 16 17 13 12 17 11 (us)
Maximum: 17 12 10 12 11 11 10 12 (us)
Maximum: 45 15 13 10 10 14 10 12 (us)
Maximum: 15 11 15 15 11 11 11 14 (us)
Maximum: 14 11 13 18 17 30 20 10 (us)
Maximum: 13 12 16 11 11 16 10 10 (us)
Maximum: 11 11 11 10 13 15 11 12 (us)
Maximum: 10 11 11 12 16 17 10 12 (us)
Maximum: 14 12 11 10 10 12 15 141 (us)
Maximum: 11 11 15 11 11 18 10 11 (us)
Maximum: 11 13 12 18 11 12 14 17 (us)
Maximum: 16 10 10 10 17 12 12 12 (us)
Maximum: 10 10 11 17 12 17 10 10 (us)
Maximum: 11 11 16 11 10 10 10 13 (us)
Maximum: 10 11 12 11 15 11 11 11 (us)
Maximum: 9 10 15 13 10 12 11 14 (us)

Note that the spikes we see now can be the side effect of temporary namespaces created by OpenShift that the current tuning does not cover. We will now discuss ways to transform the script into something more "bulletproof."

The tune script content is:

--------- %< run_paolo.sh %< ----------------
#!/bin/sh
mount -o remount,rw /sys
find /sys -name rps_cpus | while read mask; do echo $mask; echo 3 > $mask; done
mount -o remount,ro /sys
--------- >% run_paolo.sh >% ----------------

and we run it with the following commands:

---------- %< network_tune.sh %< ----------
for N in `ip netns list | awk '{print $1}'`; do ip netns exec $N ./run_paolo.sh; done
./run_paolo.sh
---------- >% network_tune.sh >% ----------
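[Editor's note: one possible direction for the "bulletproof" version mentioned above is to reapply the mask whenever a link appears, instead of relying on a one-shot pass. A rough, untested sketch built around `ip monitor link`; the mask value (3) and the per-namespace handling mirror run_paolo.sh and network_tune.sh, everything else is an assumption.]

--------- %< ---------------
#!/bin/sh
# Sketch: keep the RPS mask applied as links come and go in the current
# namespace. Namespaces created later would still need the same treatment
# via "ip netns exec", as in network_tune.sh above.
RPS_MASK=3

apply_mask() {
    # /sys may be mounted read-only on the node; run_paolo.sh above remounts
    # it read-write first, and the same would be needed here.
    for f in /sys/class/net/*/queues/rx-*/rps_cpus; do
        [ -e "$f" ] && echo "$RPS_MASK" > "$f"
    done
}

apply_mask    # cover the devices that already exist

# "ip monitor link" prints one line per link change; reapply the mask on
# every event so freshly created veth devices do not keep an empty rps_cpus.
ip monitor link | while read -r _event; do
    apply_mask
done
--------- >% ---------------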
Good progress!

We are trying this on OpenShift, but the worker loses its network connection after running oslat for close to an hour. The console has a lot of messages about ksoftirqd being stalled on many CPUs, which leads me to believe that setting the rps_cpus mask is not working, or there is a constant influx of new containers with a default rps_cpus mask of all cpus (or if it's empty, does that mean all cpus?). However, even if rps_mask is not being set, wouldn't stalld give the ksoftirqds prio to run, and the host network should not be unresponsive?

(In reply to Andrew Theurer from comment #8)
> The console has a lot of messages
> about ksoftirqd being stalled on many CPUs,

Can you please attach the console log?

> which leads me to believe that
> setting the rps_cpus mask is not working, or there is a constant influx of
> new containers with a default rps_cpus mask of all cpus

This is likely the case.

> (or if it's empty, does that mean all cpus?).

Some clarification is needed. In the original problem reported by Daniel, the system observed unexpectedly high latency due to softirqs being scheduled on isolated CPUs. The scheduling happened due to packets looped back by OVS. By default, looped-back packets are processed - via the NET_RX softirq - on the same CPU that originally transmitted the packet. RPS comes into play to force the looped-back packets away from such CPUs, by specifying a CPU mask = !(isolated CPUs).

An empty RPS mask means the default behavior, that is, looped-back packets are processed on the originating CPU - whatever that is, including isolated CPUs.

Please let me know if the above clarifies things a bit.

> However, even if rps_mask is not being set, wouldn't stalld give the
> ksoftirqds prio to run, and the host network should not be unresponsive?

@Daniel, do you know better?
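[Editor's note: the "CPU mask = !(isolated CPUs)" rule can also be computed on the node rather than hard-coded. A small sketch, assuming the isolated set is exported in /sys/devices/system/cpu/isolated and the machine has fewer than 64 CPUs, so the mask fits in one word; with 24 CPUs and CPUs 4-23 isolated it prints f.]

--------- %< ---------------
#!/bin/bash
# Sketch: derive the housekeeping RPS mask as the complement of the
# isolated CPUs (assumes fewer than 64 CPUs).
isolated=$(cat /sys/devices/system/cpu/isolated)    # e.g. "4-23"
nr_cpus=$(getconf _NPROCESSORS_CONF)

iso_mask=0
IFS=, read -ra ranges <<< "$isolated"
for r in "${ranges[@]}"; do
    first=${r%-*}
    last=${r#*-}
    for ((cpu = first; cpu <= last; cpu++)); do
        iso_mask=$(( iso_mask | (1 << cpu) ))
    done
done

all_mask=$(( (1 << nr_cpus) - 1 ))
printf 'housekeeping rps_cpus mask: %x\n' $(( all_mask & ~iso_mask ))
--------- >% ---------------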
Created attachment 1723197 [details]
console log

Andrew, could you check whether stalld is dying on your setup? We found a bug that was causing it to die on systems where sched_debug gets too large. If so, could you update the stalld binary to the latest upstream version?

https://gitlab.com/rt-linux-tools/stalld

Regarding the rps_cpus mask not working, did you trace the system? (I did, and saw it working.)

Thanks!

When running Paolo's script on my worker node in a loop every five seconds...

====================
#!/bin/bash
cpuset=`cat /proc/self/status | grep Cpus_allowed_list: | cut -f 2`
echo "running on processors " ${cpuset}
while true; do
    for N in `ip netns list | awk '{print $1}'`; do ip netns exec $N ./run_paolo.sh; done
    sleep 5
done
====================

... I noticed that the script stops producing output when I have oslat running with rtprio 1, but it will continue to run if I do not use --rtprio with oslat. In the case of running oslat with rtprio 1, the script resumes as soon as the oslat run completes. The Cpus_allowed output indicates the script is running on housekeeping CPUs.

I am running OCP 4.6.0-rc.2 with a test kernel from brew, 4.18.0-193.27.1.rt13.76.el8_2.rtperf20201015.x86_64. I am running without stalld.

[root@worker00 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5fa94aa804ce1821c09dec865c57662f23be27c2b4ac5e0d54da4992861ecf87/vmlinuz-4.18.0-193.27.1.rt13.76.el8_2.rtperf20201015.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/5fa94aa804ce1821c09dec865c57662f23be27c2b4ac5e0d54da4992861ecf87/0 skew_tick=1 nohz=on rcu_nocbs=4-23 tuned.non_isolcpus=0000000f intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,4-23 systemd.cpu_affinity=0,1,2,3 default_hugepagesz=1G hugepagesz=1G hugepages=10 nosmt tsc=nowatchdog
[root@worker00 ~]#

If there is interest in the oslat output, these are from very short runs used to diagnose the issue above (these runs are so short this data should not be treated as valid latency data):

oslat with rtprio 1, Paolo's script started but stalled during the oslat run:

Minimum: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
Average: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
Maximum: 17459 26 21 17 17 15 14 15 15 15 14 15 14 14 14 14 14 14 (us)
Max-Min: 17458 25 20 16 16 14 13 14 14 14 13 14 13 13 13 13 13 13 (us)

oslat run without the rtprio option, Paolo's script running:

Minimum: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
Average: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
Maximum: 62 11 65 12 34 47 39 45 50 13 48 45 44 31 44 61 61 60 (us)
Max-Min: 61 10 64 11 33 46 38 44 49 12 47 44 43 30 43 60 60 59 (us)

oslat run without the rtprio option, Paolo's script not started at all:

Minimum: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (us)
Average: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 (us)
Maximum: 50 11 52 11 41 39 33 48 59 13 39 49 43 40 43 55 58 52 (us)
Max-Min: 49 10 51 10 40 38 32 47 58 12 38 48 42 39 42 54 57 51 (us)

For how long are you running oslat? Is the system complaining about an RCU stall?

(In reply to Daniel Bristot de Oliveira from comment #14)
> For how long are you running oslat? Is the system complaining about an RCU
> stall?

These runs are just 5-minute runs used to verify the issue above. I am unable to make runs longer than maybe 30 or 40 minutes. One-hour runs consistently fail: either the pod disappears entirely and the event logs show nothing, or the entire worker node becomes unresponsive, even from the console, and requires a manual reboot to recover.

When running without stalld I see lots of "rcu: INFO: rcu_preempt self-detected stall on CPU <whatever>", and when running with stalld I see lots of rcuc and ksoftirqd starvation.

Question: which stalld are you using, is it the latest version from here?

https://gitlab.com/rt-linux-tools/stalld

If not, please use it, it fixes a known issue.

Running without stalld will cause the RCU stall, and that causes the > 1 ms latency, because the stall dispatches an NMI to all CPUs. This was already investigated at the origin of this bug (note: this bug is a consequence of that analysis).

The rcuc and ksoftirqd starvation messages are a side effect of oslat in that scenario. The stalld messages are not the problem or the error; they are the evidence we need in order to understand why those threads are running there. That said, I do not think it makes sense to test without stalld.

It is easy to understand which softirq is running on those CPUs (and causing the wakeup) using tracing. This BZ specifically cares about the dispatch of the NET_RX softirq; it is known that some timer softirqs are being raised there, but that is not the problem in this BZ.

When the system crashes, can you try to get the dmesg output from /sys/fs/pstore? See here: https://access.redhat.com/solutions/4581331

Check if it is something like:

"kernel BUG at kernel/sched/deadline.c:1462!"
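[Editor's note: as a concrete way to do the trace check suggested above, tracing can be restricted to the isolated CPUs with tracing_cpumask and the softirq_entry event. A sketch, assuming the isolcpus=4-23 layout from the kernel command line earlier in this BZ (mask fffff0); adjust the mask for other layouts.]

--------- %< ---------------
#!/bin/sh
# Sketch: count which softirq vectors actually run on the isolated CPUs.
cd /sys/kernel/debug/tracing || exit 1

saved_mask=$(cat tracing_cpumask)
echo fffff0 > tracing_cpumask               # CPUs 4-23, per the cmdline above
echo 1 > events/irq/softirq_entry/enable

# The last field of each event line is the [action=...] tag.
timeout 30 cat trace_pipe | awk '{print $NF}' | sort | uniq -c

echo 0 > events/irq/softirq_entry/enable
echo "$saved_mask" > tracing_cpumask
--------- >% ---------------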
(In reply to Daniel Bristot de Oliveira from comment #16)
> Question: which stalld are you using, is it the latest version from here?
>
> https://gitlab.com/rt-linux-tools/stalld
>
> If not, please use it, it fixes a known issue.

I'm currently working on installing the build prerequisites on my worker node with rpm-ostree. If there is a better way to get the new stalld onto my worker, please let me know, because this is unpleasant so far.

> When the system crashes, can you try to get the dmesg output from
> /sys/fs/pstore? See here: https://access.redhat.com/solutions/4581331
>
> Check if it is something like:
>
> "kernel BUG at kernel/sched/deadline.c:1462!"

I have nothing in /sys/fs/pstore now, but I can start checking it after a crash. Thanks for the tip.

Hi Paolo,

I am testing a kernel-rt with your patches (and all the others that OpenShift needs). So far so good.

More updates by the end of the day.

Thanks!

(In reply to Daniel Bristot de Oliveira from comment #22)
> Hi Paolo,
>
> I am testing a kernel-rt with your patches (and all the others that
> OpenShift needs).

I'd be happy to test the rt version as well if you have a link to the build. Thanks!

Note: while running the experiments, stalld is pointing to executions of kworkers on the isolated CPUs. I filed the following BZ to track it:

https://bugzilla.redhat.com/show_bug.cgi?id=1892264
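[Editor's note: a similar trace-based check can show what those kworkers are executing on the isolated CPUs; the workqueue_execute_start tracepoint names the work function each time a kworker runs an item. Another sketch under the same isolcpus=4-23 assumption.]

--------- %< ---------------
#!/bin/sh
# Sketch: log which workqueue items the kworkers execute on the isolated CPUs.
cd /sys/kernel/debug/tracing || exit 1

saved_mask=$(cat tracing_cpumask)
echo fffff0 > tracing_cpumask                   # isolated CPUs 4-23
echo 1 > events/workqueue/workqueue_execute_start/enable

timeout 30 cat trace_pipe                       # each line names the work function

echo 0 > events/workqueue/workqueue_execute_start/enable
echo "$saved_mask" > tracing_cpumask
--------- >% ---------------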