From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
I have installed Red Hat AS 2.1 on a Fujitsu Siemens Computers Primergy T850 with 8 CPUs and 32 GB of main memory. The Primergy T850 uses the same hardware as the IBM x440. After installing RH AS 2.1 out of the box, I installed kernel 2.4.9-e.12:

rpm -Uvh kernel-doc-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-headers-2.4.9-e.12.i386.rpm
rpm -Uvh kernel-source-2.4.9-e.12.i386.rpm
rpm -ivh kernel-summit-2.4.9-e.12.i686.rpm

Additionally, I use the boot option "notsc".

The system hangs approximately 30 minutes after starting our test suite "Platform Load Test". The graphical console is frozen, the keyboard and mouse do not work, and the machine does not respond to ping. I cannot find any error messages.

I also reported this error against the beta kernel versions in bugzilla #83715. The test suite and some additional information are attached to bugzilla #83715. I think you can reproduce the error on an IBM x440 system because it is the same hardware.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. See bugzilla #83715

Additional info:
Please read http://www.redhat.com/support/techsupport/production/GSS_caveat.html
I did the same test with 16 GB main memory and XFree86 4.1.0-29 and the system hangs again.
I did some tests with XFree86 4.1.0-44 and the same error happened. Sometimes I find this message in /var/log/messages:

Unable to handle kernel NULL pointer dereference at virtual address 00000007

Please see also bugzilla #85341.
Created attachment 98982
bug fix for race in IPI code

I am not sure whether this bug is fixed in the latest summit kernel. We ran into this problem as well, and it turned out to be a bug in the summit kernel. The following code is quoted from smp_call_function in the summit kernel:

    #else
            /* Weirdly, the cpu_relax below breaks NUMA-Q boxes. */
            if (!clustered_apic_logical)
                    cpu_relax();
    #endif

No, cpu_relax() is just the pause instruction; it should not break NUMA boxes. A comment like this is a pretty good indication that there is a timing race elsewhere.

It turns out that smp_call_function_interrupt() sets the finished bit whether the call is sync or async. Assume an async call is followed by a very slow sync call. The sync call starts before the async call has finished, and then the async call finishes before the sync call does. When the async call sets the finished bit, the caller of the sync call mistakenly thinks the sync call has finished. So the caller continues and releases the context data while the sync call is still running on it! do_ccupdate_local is the case I am talking about.

BTW, this bug does not exist in the official kernel. The official kernel does smp_call_function correctly. It is something introduced by the summit kernel.

After applying the attached patch, we have been able to abuse the IBM x440 with a big-memory VMware stress test for weeks without a kernel OOPS. It has some other problems recovering from a swap storm, but that is to be expected given how much memory is overcommitted.
Looks like this is a dup of #111219. Will double check.
We can't really make changes to the AS2.1 kernel at this point. Please let me know if this is OK on RHEL3. Larry Woodman
There are no problems with RHEL3.