Description of problem:
(Here's the RHEL 3.0 2.4.21-4.EL kernel version of our deepstack
softirq deferral patch. This RHEL 3.0 one further excludes s390 and
s390x from the scheme, since they gained an interrupt stack between
2.4.9 and 2.4.21. The same arguments apply, for completeness
Some VERITAS users on RHAS2.1 have seen do_IRQ's stack overflow
messages, printed when fewer than 2048 bytes of stack remain. The threshold for that message is
at a more reasonable level (1024) in RHEL3.0, but we're still worried
by the prospect of overflow, and are working to reduce our stack
There's a small change to Linux which would reduce the likelihood of
stack overflow for all. It's far from being a complete solution,
but a small enough change to be worth making.
In many (but not all) drivers, the complex and stack-hungry part of
interrupt processing is done in the softirq rather than the hardirq.
do_softirq() already defers softirq work to its daemon when swamped
by more softirqs while it's working. This patch adds a stack check,
deferring all softirq work to the daemon when the stack is too deep.
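For illustration only, here is a rough user-space model of the two deferral paths described above. It is not the patch itself: struct softirq_model, run_pending(), do_softirq_model() and the rearm argument are hypothetical names made up for this sketch, and the free stack space is passed in as a plain number rather than read from the stack pointer.

```c
#include <stdbool.h>

/* Deferral threshold quoted in this report (bytes of stack still free). */
#define SOFTIRQ_STACK_MIN 2560u

/* Hypothetical stand-in for per-CPU softirq state. */
struct softirq_model {
    unsigned pending;       /* bitmask of raised softirqs */
    unsigned handled;       /* bitmask of softirqs that were run inline */
    bool     daemon_woken;  /* true once work was handed to the daemon */
};

/* Hypothetical handler loop: runs everything pending, then "rearm"
 * simulates softirqs raised again while the handlers were running. */
static void run_pending(struct softirq_model *s, unsigned rearm)
{
    s->handled |= s->pending;
    s->pending = rearm;
}

static void do_softirq_model(struct softirq_model *s,
                             unsigned stack_free, unsigned rearm)
{
    if (!s->pending)
        return;

    /* Proposed check: stack already too deep, so run nothing inline
     * and leave all pending work for the daemon. */
    if (stack_free < SOFTIRQ_STACK_MIN) {
        s->daemon_woken = true;
        return;
    }

    run_pending(s, rearm);

    /* Existing behaviour (simplified): if more softirqs arrived while
     * we were working, hand them to the daemon instead of looping. */
    if (s->pending)
        s->daemon_woken = true;
}
```

With plenty of stack and no re-raised softirqs the work is done inline, as today. With less than 2560 bytes free nothing runs inline and the pending mask is left untouched for the daemon.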
How deep is too deep? Given the hardirq warning at 1k, we estimate
the threshold for softirq deferral should be between 2k and 3k, and
have set 2560 here. Much lower than that would make it ineffective,
much higher than that would impact performance.
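To make the relationship between the two thresholds concrete, a small sketch follows. The numbers 1024 and 2560 come from this report; the enum, the function name classify_stack() and the use of a strict less-than comparison are assumptions made for the example.

```c
/* Bytes of stack still free at which each mechanism triggers. */
#define HARDIRQ_WARN_BYTES   1024u  /* RHEL3.0 hardirq overflow message */
#define SOFTIRQ_DEFER_BYTES  2560u  /* softirq deferral proposed here */

enum stack_state {
    STACK_OK,               /* run softirqs inline as usual */
    STACK_DEFER_SOFTIRQ,    /* too deep: leave softirqs to the daemon */
    STACK_HARDIRQ_WARNING   /* deep enough to trip the overflow message */
};

/* Hypothetical helper: classify a given amount of free stack. */
static enum stack_state classify_stack(unsigned stack_free)
{
    if (stack_free < HARDIRQ_WARN_BYTES)
        return STACK_HARDIRQ_WARNING;
    if (stack_free < SOFTIRQ_DEFER_BYTES)
        return STACK_DEFER_SOFTIRQ;
    return STACK_OK;
}
```

The deferral band sits 1536 bytes above the hardirq warning level, so softirq work is pushed to the daemon well before the warning would fire.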
This patch differs slightly from the patch we offered earlier for
RHEL3.0: extending it from i386 to other architectures (excepting
parisc, s390, s390x and
Sheryl, I agree to this patch. All we need to do now is get it past
the other developers at Red Hat, by proving that it doesn't have a
noticeable performance penalty in the kind of setup that shows stack
overflows with normal users.
In short, we would need a setup that:
1) sometimes results in stack overflows when the benchmark is run normally
2) works fine with the patch
If this setup produces pretty much identical performance with and
without the stack overflow patch, it'll be hard for other engineers
inside Red Hat to object to the patch.
Could Veritas come back with a proposed testing/characterization
scenario as Rik described above? Would be useful for Veritas to first
describe how they can reproduce the scenario (hardware used, tests
run, etc), then bounce that off us. The whole point of this exercise
is to allay concerns of potential negative performance implications.
The more convincing the scenario, the better chance of acceptance.
Once you have proposed the testing scenarios, we can discuss whether
they cover the concerns. If Veritas provides us this testing proposal
before investing the time to do it, we can provide feedback. That way
you won't end up wasting time on tests that we later determine are insufficient.
(Separately, we are trying to drum up testing recommendations, but
since Veritas is most familiar with how to reproduce the problem, it
might be beneficial for you to pose the initial test descriptions.
Then we can cooperatively kick it around from there.)
This got applied to one of the RHEL3 updates. Probably U2 ;)
The fix was committed to the RHEL3 U2 patch pool in kernel version
2.4.21-9.15.EL, and it was released in errata RHSA-2004:188.