From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

Description of problem:
(Here's the RHEL 3.0 2.4.21-4.EL kernel version of our deepstack softirq deferral patch. This RHEL 3.0 one further excludes s390 and s390x from the scheme, since they gained an interrupt stack between 2.4.9 and 2.4.21. The same arguments apply, for completeness prepended again....)

Some VERITAS users on RHAS2.1 have suffered from do_irq's (2048 bytes away from) stack overflow messages. The threshold for that message is at a more reasonable level (1024) in RHEL3.0, but we're still worried by the prospect of overflow, and are working to reduce our stack usage.

There's a small change to Linux which would reduce the likelihood of stack overflow for all. It's far from being a complete solution, but a small enough change to be worth making.

In many (but not all) drivers, the complex and stack-hungry part of interrupt processing is done in the softirq rather than the hardirq. do_softirq() already defers softirq work to its daemon when swamped by more softirqs while it's working. This patch adds a stack check, deferring all softirq work to the daemon when the stack is too deep.

How deep is too deep? Given the hardirq warning at 1k, we estimate the threshold for softirq deferral should be between 2k and 3k, and have set 2560 here. Much lower than that would make it ineffective, much higher than that would impact performance.

This patch differs slightly from the patch we offered earlier for RHEL3.0: extending it from i386 to other architectures (excepting parisc, s390, s390x and
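
For readers who want to see the shape of the check, here is a minimal user-space sketch of the deferral decision described above. It is not the patch itself. It assumes an i386-style layout in which each task has an 8 KB stack that grows downward toward a task structure stored at the bottom of the same region. The names `MODEL_THREAD_SIZE`, `MODEL_TASK_STRUCT_SIZE`, `stack_bytes_free()` and `should_defer_softirq()` are hypothetical, and the task structure size is a placeholder. Only the 2560-byte threshold comes from the report.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8 KB per-task kernel stack, aligned on its own size. */
#define MODEL_THREAD_SIZE        8192u
/* Hypothetical placeholder for the task structure at the stack's low end. */
#define MODEL_TASK_STRUCT_SIZE   1600u
/* Threshold quoted in the report for deferring softirq work. */
#define SOFTIRQ_DEFER_THRESHOLD  2560u

/*
 * Bytes left between the stack pointer and the task structure.
 * The stack grows downward, so the offset of sp within its aligned
 * region, minus the task structure size, is the remaining headroom.
 */
static size_t stack_bytes_free(uintptr_t sp)
{
    size_t offset = (size_t)(sp & (uintptr_t)(MODEL_THREAD_SIZE - 1));

    if (offset <= MODEL_TASK_STRUCT_SIZE)
        return 0; /* already at or past the task structure */
    return offset - MODEL_TASK_STRUCT_SIZE;
}

/*
 * Decision modelled on the description: when headroom drops below the
 * threshold, leave pending softirq work to the daemon, which runs on a
 * fresh stack, instead of processing it on the current deep stack.
 */
static bool should_defer_softirq(uintptr_t sp)
{
    return stack_bytes_free(sp) < SOFTIRQ_DEFER_THRESHOLD;
}
```

With these placeholder sizes, a stack pointer 4000 bytes above the task structure would process softirqs inline, while one only 2000 bytes above it would hand the work to the daemon. The check costs one mask, one subtraction and one compare per call, which is the basis for expecting little performance impact when the stack is shallow.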
Sheryl, I agree to this patch. All we need to do now is get it past the other developers at Red Hat, by proving that it doesn't have a noticeable performance penalty in the kind of setup that shows stack overflows with normal users.

In short, we would need a setup that:
1) sometimes results in stack overflows when the benchmark is run normally
2) works fine with the patch

If this setup produces pretty much identical performance with and without the stack overflow patch, it'll be hard for other engineers inside Red Hat to object to the patch.
Could Veritas come back with a proposed testing/characterization scenario as Rik described above? It would be useful for Veritas to first describe how they can reproduce the scenario (hardware used, tests run, etc), then bounce that off us. The whole point of this exercise is to allay concerns of potential negative performance implications. The more convincing the scenario, the better the chance of acceptance.

Once you have proposed the testing scenarios, we can discuss whether they cover the concerns. If Veritas provides us this testing proposal before investing the time to do it, we can provide feedback. That way you won't end up spending time on testing that we later determine is insufficient.

(Separately, we are trying to drum up testing recommendations, but since Veritas is most familiar with how to reproduce the problem, it might be beneficial for you to pose the initial test descriptions. Then we can cooperatively kick it around from there.)
This got applied to one of the RHEL3 updates. Probably U2 ;)
The fix was committed to the RHEL3 U2 patch pool in kernel version 2.4.21-9.15.EL, and it was released in errata RHSA-2004:188.