From Flavio Leitner:
On many-core SMP machines (16 cores or more), a soft lockup can occur when heavy network load is generated concurrently.
The lockup happens in __qdisc_run() (net/sched/sch_generic.c, line 84). Because the driver keeps accepting packets and returning NETDEV_TX_OK, __qdisc_run() cannot exit its qdisc_restart() loop.
This behavior may improve throughput, but some applications can be stuck for more than 10 seconds.
This issue has been fixed in the vanilla kernel.
Version-Release number of selected component:
kernel version: 2.6.18-92.el5 (RHEL5.2GA)
It can be reproduced within dozens of seconds on a 16-core SMP box. The issue occurs easily when the UDP workload is very high.
Steps to Reproduce:
On a 16-core SMP machine, run more than 16 netperf instances in parallel with the following options; the lockup then occurs on the client side.
# netperf -H <netserver_address> -l 60 -t UDP_STREAM -- -s 262144 -r 262144 -m
Many soft lockup messages are logged to syslog, and some applications show performance problems.
In the kernel, no CPU should stay dedicated to one piece of work for a long time without calling schedule().
It makes the customer's applications unresponsive for too long and makes it impossible to deploy RHEL5.2 on performance- or latency-sensitive systems.
git patch: [NET]: Add preemption point in qdisc_run
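To illustrate what a preemption point in this loop buys, here is a minimal user-space sketch in C. It is not the patch itself and not kernel code: every name (toy_qdisc, toy_restart, toy_run_bounded, TOY_PKTS_PER_TICK) is hypothetical, and the "clock" is simply a packet counter standing in for jiffies. It only models the situation described above, where the driver always accepts the packet and other CPUs keep refilling the queue, and shows a loop that gives up after one tick or when a reschedule is requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; all names are hypothetical, not kernel APIs. */
struct toy_qdisc {
    unsigned long backlog; /* packets currently queued */
    unsigned long refill;  /* packets other CPUs enqueue per transmit */
    unsigned long sent;    /* packets handed to the "driver" so far */
};

/* Assumption: one clock tick passes every 1000 transmitted packets. */
#define TOY_PKTS_PER_TICK 1000UL

/* Stand-in for jiffies, derived from the number of packets sent. */
static unsigned long toy_jiffies(const struct toy_qdisc *q)
{
    return q->sent / TOY_PKTS_PER_TICK;
}

/*
 * Models one qdisc_restart() step: transmit one packet, which the
 * driver always accepts (the NETDEV_TX_OK case from the report).
 * Returns true while packets remain queued.
 */
static bool toy_restart(struct toy_qdisc *q)
{
    if (q->backlog == 0)
        return false;
    q->backlog--;
    q->sent++;
    q->backlog += q->refill; /* concurrent senders keep adding packets */
    return q->backlog > 0;
}

/*
 * Transmit loop with a preemption point: stop when a reschedule is
 * requested or the clock has moved on since the loop started.
 * Returns true if work was deferred, false if the queue was drained.
 */
static bool toy_run_bounded(struct toy_qdisc *q, const bool *need_resched)
{
    assert(q != NULL && need_resched != NULL);
    unsigned long start = toy_jiffies(q);

    while (toy_restart(q)) {
        if (*need_resched || toy_jiffies(q) != start)
            return true; /* leave the rest for a later run */
    }
    return false;
}
```

With refill set to 1 the backlog never shrinks, so a loop without the check inside the while body would spin forever; with the check, toy_run_bounded() returns after at most TOY_PKTS_PER_TICK packets and the remaining work can be picked up later.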
Created attachment 327745 [details]
(In reply to comment #1)
> Created an attachment (id=327745) [details]
Note that netperf must be installed. The reproducer reliably triggers the lockup for me with 40 tasks (the second command-line parameter) on a 16-CPU machine. The first parameter is the hostname of the computer where "netserver" (part of the netperf package) is running.
This was addressed via:
Red Hat Enterprise Linux version 5 (RHSA-2009:0264)