The following has been reported by IBM LTC:

Up to 6-second latencies in some CPU-bound transactions

Hardware Environment: Egenera blades
Software Environment: RH 2.4.9-e25

Steps to Reproduce:
Can only be done in CSFB's production environment.
1. Run CSFB's AMM application (stock quoting system).
2. Observe the 6-second maximum logged transaction time (happens every 15 minutes).

Additional Information:
CSFB's stock quoting application has some CPU-bound transactions that take up to 6 seconds, while the average time is 24 ms. Kernel 2.4.7-10 did not experience this, but kernel 2.4.9-e25 does.

We have instrumented try_to_free_pages to see whether we were stuck in a long quest for memory allocation. None of the threads called try_to_free_pages during the test (the 6-second transactions were still there). Kswapd also did not call try_to_free_pages.

We have instrumented task timeslice assignments (min, max) per task to look for scheduler starvation. All timeslices were 15 ms. All tasks had the same priority.

We are preparing for an A/B test with a "whitebox" system to compare against Egenera's blade systems.

We would like Red Hat's help with this problem:
1) Have there been any situations like this, a multi-second latency for CPU-bound operations?
2) What else can we instrument to better identify the problem?

Glen/Greg - this is a performance bug against a Red Hat errata kernel. Per Andrew's request, please submit this to Red Hat. Thanks.
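On the question of what else could be instrumented: one possible user-space check is a CPU-bound loop that records the largest gap between consecutive clock reads, which would show whether the process itself is being kept off the CPU. This is only a sketch and not something used in this report; the names now_ns and max_stall_ns are made up for illustration, and it assumes a system where clock_gettime(CLOCK_MONOTONIC) is available, which may not hold on kernels as old as the ones discussed here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical probe: spin on the CPU for duration_ns and return the
 * largest gap (in ns) seen between two consecutive clock reads.
 * A gap far above the timeslice suggests the task was not running.
 */
int64_t max_stall_ns(int64_t duration_ns)
{
    int64_t start = now_ns();
    int64_t prev = start;
    int64_t worst = 0;

    for (;;) {
        int64_t cur = now_ns();
        int64_t gap = cur - prev;

        if (gap > worst)
            worst = gap;
        prev = cur;
        if (cur - start >= duration_ns)
            break;
    }
    return worst;
}
```

Running max_stall_ns over a window longer than the 15-minute period noted above, and logging any result well beyond the 15 ms timeslice, would be one way to correlate stalls with the slow transactions.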
Is this our kernel or the Egenera recompiled-kernel-with-changes-and-hooks kernel?
Egenera kernel. Whenever I need to patch for instrumentation, testing, etc., I send the patch to them, they rebuild, and then they send the kernel to the customer. FYI, they had also been experiencing a significant drop in average performance compared to a 2.4.7-10 kernel (at least 40%). Applying Ingo's aggressive idle steal (add "idle ||" to CAN_MIGRATE) to 2.4.9-e25 brought average performance to slightly better than 2.4.7-10 and reduced worst-case latency from 6 seconds to about 3 seconds. Worst-case latency on 2.4.7-10 is 500 ms.
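For readers unfamiliar with the "idle ||" change, the following is a simplified user-space model of the idea, not the actual macro from 2.4.9-e25 (which is not quoted in this report). The struct and function names are hypothetical, and the assumption is that the baseline check refuses to migrate a task that ran recently (still cache-hot), while the modified check lets an idle CPU pull it anyway.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a task as seen by the load balancer. */
struct task_model {
    unsigned long ticks_since_run;  /* ticks since the task last ran */
    bool running;                   /* currently executing on its CPU */
    bool allowed_on_this_cpu;       /* affinity permits the pulling CPU */
};

/* Assumed baseline: only migrate tasks that are no longer cache-hot. */
bool can_migrate_baseline(const struct task_model *t,
                          unsigned long cache_decay_ticks)
{
    return t->ticks_since_run > cache_decay_ticks &&
           !t->running &&
           t->allowed_on_this_cpu;
}

/* With "idle ||": an idle CPU ignores the cache-hot test. */
bool can_migrate_idle_steal(const struct task_model *t,
                            unsigned long cache_decay_ticks,
                            bool idle)
{
    return (idle || t->ticks_since_run > cache_decay_ticks) &&
           !t->running &&
           t->allowed_on_this_cpu;
}
```

In this model, a runnable task that ran one tick ago is rejected by the baseline check but accepted when the pulling CPU is idle, which is consistent with the shorter worst-case waits reported above.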
Please report back if this also shows up on an actual supported Red Hat kernel.
------ Additional Comments From khoa.com 2003-25-09 23:23 ------- Andrew - Red Hat has refused to look at this problem if it does not happen on their kernel. I had thought about this when I first screened this bug, but I thought Red Hat would answer your two questions above. As it turned out, they would not. So I'd like to assign this bug back to you for more analysis. Thanks.
----- Additional Comments From atheurer.com(prefers email via habanero.com) 2004-04-21 15:18 ------- Changes to the scheduler resolved this bug. The changes are already in RHEL3.
RHEL2.1 is currently accepting only critical security fixes. This issue is outside the current scope of support.