A 32-bit application (the High Availability daemon (had) of Veritas Cluster Server) runs fine on an earlier version of the ia32el package [ia32el-1.2.4]. However, the version of ia32el shipped with rhel4u4/u5 [ia32el-1.6-14.EL4] causes a system hang when the application is run. The problem is only seen when the number of cpus = 1. Also, when the application is run under strace or gdb, it starts without any problems.
Root cause: A lock in IA-32 EL is implemented with atomic cmpxchg and sched_yield(). HAD is set as a real-time thread and spins on an internal lock used by IA-32 EL (because IA-32 EL executes code on behalf of HAD), while the lock is held by another thread with low priority (the so-called Translation Thread created by IA-32 EL). As long as the Translation Thread does not release the lock, the real-time thread keeps running endlessly and the system appears to freeze. For this specific application (HAD), the Translation Thread is the feature that exposes the issue: because it turns a single-threaded program into a multi-threaded one, the spin lock used internally by IA32EL becomes a problem. For genuinely multi-threaded applications, however, this kind of lock can be a problem even if there is no Translation Thread within IA32EL, so we plan to provide a definitive fix for this problem in the version of IA32EL currently under development. We have disabled the Translation Thread in the IA32EL shipped with RHEL5.1. So as a temporary workaround, we recommend that customers use IA-32 EL on RHEL5.1.
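The locking pattern described in the root cause can be sketched roughly as follows. This is a minimal illustration in C11, not IA-32 EL's actual code: the names demo_lock, demo_lock_acquire, demo_lock_release, worker and run_demo are all hypothetical. The sketch runs its threads at normal priority so that it terminates. The comments mark where a real-time spinner would get stuck on a single CPU.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock word: 0 = free, 1 = held. Illustration only. */
static atomic_int demo_lock = 0;
static long shared_counter = 0;

/* Spin with compare-and-exchange, yielding the CPU between attempts. */
static void demo_lock_acquire(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&demo_lock, &expected, 1)) {
        expected = 0;   /* a failed exchange overwrote 'expected'; reset it */
        /*
         * Under SCHED_FIFO or SCHED_RR, sched_yield() only moves the caller
         * behind other runnable threads of the SAME static priority. If the
         * lock holder is a lower-priority SCHED_OTHER thread and there is
         * only one CPU, the holder is never scheduled, the lock is never
         * released, and this loop spins forever.
         */
        sched_yield();
    }
}

static void demo_lock_release(void)
{
    atomic_store(&demo_lock, 0);
}

struct worker_arg {
    long iters;
};

/* Each worker increments the shared counter under the lock. */
static void *worker(void *p)
{
    long iters = ((struct worker_arg *)p)->iters;
    for (long i = 0; i < iters; i++) {
        demo_lock_acquire();
        shared_counter++;
        demo_lock_release();
    }
    return NULL;
}

/*
 * Start 'nthreads' normal-priority workers (1..16) and return the final
 * counter value, or -1 on error. Real-time priority is deliberately NOT
 * set here, so the demo cannot reproduce the hang.
 */
long run_demo(int nthreads, long iters)
{
    pthread_t tids[16];
    struct worker_arg arg = { iters };
    int started = 0;

    if (nthreads < 1 || nthreads > 16)
        return -1;

    shared_counter = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &arg) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return (started == nthreads) ? shared_counter : -1;
}
```

With normal-priority threads, run_demo(2, 100000) returns 200000, because every waiter eventually lets the holder run. The hang described above needs the spinner to hold a real-time priority that the lock holder can never preempt on a single CPU.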
Eric, The way I read this is that one can work around this problem by using the ia32el package that is shipped with RHEL5. Assuming so, then perhaps the easiest way to resolve this bug is to document this in a knowledge base article. I am also bearing in mind that Intel is shipping dual cores across the entire product line, so the cpus=1 case is rather rare. Please confirm.
Ronald, One correction for you: the workaround should be to use the ia32el package shipped with RHEL 5 U1. And yes, we'd like to document it in a knowledge base article. Is there a process for that?
Product Management has reviewed and declined this request. You may appeal this decision by reopening this request.
The issue is waiting to be verified by customers
Hi, the RHEL4.7 release notes deadline is on June 17, 2008 (Tuesday). They will undergo a final proofread before being dropped to translation, at which point no further additions or revisions will be entertained. A mockup of the RHEL4.7 release notes can be viewed here: http://intranet.corp.redhat.com/ic/intranet/RHEL4u7relnotesmockup.html Please use the aforementioned link to verify whether your bugzilla is already in the release notes (if it needs to be). Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don
Eric/Keve, Can you provide us with an update?
We already have the workaround: using the ia32-el shipped in RHEL5 instead of the one in RHEL 4.7. Gary Case (gcase) is currently verifying the workaround with the customers. We could close this bug now.
Don (and others), do you think we need to document the workaround in Comment #14 in the release notes before closing this out?
Yes, please. Please note that users need this workaround only if threads of their application use real-time priority.
Eric, Can you please make a specific suggestion of how we should word the release note?
How about the following? An x86 application with one or more SCHED_PR threads may hang due to a bug in IA-32 EL V6 shipped with this OS release. The workaround is to use IA-32 EL V6 Update 1 shipped with RHEL 5.
Partners, I would like to thank you all for your participation in assuring the quality of this RHEL 4.7 Update Release. My hat's off to you all. Thanks.
Intel will fix this regression in the latest IA-32 EL release, targeting RHEL5.4.