Description of problem:
System locks up in panic() in kernel 2.4.9-e.3. Checking 2.4.9-e.16 also shows
Version-Release number of selected component (if applicable):
Problem reproduced in 2.4.9-e.3. Code is same in 2.4.9-e.16.
Steps to Reproduce:
1. Heavily stress SMP system until panic received. (Still diagnosing)
Endless loop in panic.c when doing CHECK_EMERGENCY_SYNC
Using an ITP we obtained the following code of a locked up processor:
0x0148:0010:00000000c011b989 e872cffeff call $-0x00013089 ;a=c01
0x0148:0010:00000000c011b98e a1f4d53dc0 mov eax, dword ptr
0x0148:0010:00000000c011b993 8db600000000 lea esi, dword ptr
0x0148:0010:00000000c011b999 8dbc2700000000 lea edi, dword ptr
0x0148:0010:00000000c011b9a0 85c0 test eax, eax
0x0148:0010:00000000c011b9a2 74fc jz $-0x02 ;a=c011b9a0
0x0148:0010:00000000c011b9a4 e8878d0700 call
Please note that the "test eax, eax" followed by the jz is not going anywhere:
the jz branches straight back to the test (a=c011b9a0) and eax is never
reloaded inside the loop, so the processor spins forever. eax was loaded once,
before the loop, with the value of emergency_sync_scheduled. This code is at
the end of panic() in linux/kernel/panic.c, in the macro CHECK_EMERGENCY_SYNC,
which is defined in linux/include/linux/sysrq.h. At first glance it looks like
emergency_sync_scheduled should be marked volatile.
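To make the disassembly easier to follow, here is a rough user-space sketch in C of what the generated code effectively does: the flag is read once into a local copy and only that copy is tested. This is an illustration, not the kernel source. simulate_interrupt(), hoisted_wait() and the max_spins bound are hypothetical names added so the demo terminates.

```c
#include <assert.h>

/* Non-volatile flag, as in the kernel version from the report. */
static int emergency_sync_scheduled;

/* Hypothetical stand-in for an interrupt that sets the flag
 * while the loop is already running. */
static void simulate_interrupt(long spin)
{
    if (spin == 2)
        emergency_sync_scheduled = 1;
}

/* Mirrors the disassembly: one load into a local ("mov eax, ..."),
 * then "test eax, eax / jz" on that stale copy.
 * Bounded by max_spins so the demo terminates. Returns 1 if the loop
 * would have fallen through to the call, 0 if it was still spinning. */
static int hoisted_wait(long max_spins)
{
    assert(max_spins > 0);

    int cached = emergency_sync_scheduled;   /* loaded once, before the loop */

    for (long spin = 0; spin < max_spins; spin++) {
        simulate_interrupt(spin);            /* flag changes in memory... */
        if (cached != 0)                     /* ...but only the copy is tested */
            return 1;
    }
    return 0;
}
```

Calling hoisted_wait() with any positive bound returns 0 even though emergency_sync_scheduled has been set to 1 in memory by then, which matches the lockup seen on the ITP.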
Hmmm... I just checked 2.4.20 from kernel.org and sysrq.h has been patched to
define emergency_sync_scheduled as "volatile int".
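For comparison, a minimal user-space sketch of the loop with the flag declared as "volatile int". The macro body below is an assumed simplification of CHECK_EMERGENCY_SYNC, do_emergency_sync() is a stand-in for the kernel function, and simulate_interrupt(), panic_tail(), sync_count and the pass bound are hypothetical additions so the demo terminates. The real loop at the end of panic() does not return.

```c
#include <assert.h>

/* volatile forces a fresh load from memory on every test of the flag. */
static volatile int emergency_sync_scheduled;

static int sync_count;   /* hypothetical counter, for the demo only */

/* Stand-in for the kernel's do_emergency_sync(). */
static void do_emergency_sync(void)
{
    sync_count++;
    emergency_sync_scheduled = 0;
}

/* Assumed shape of the macro from linux/include/linux/sysrq.h. */
#define CHECK_EMERGENCY_SYNC            \
    if (emergency_sync_scheduled)       \
        do_emergency_sync();

/* Hypothetical stand-in for an interrupt handler that sets the flag
 * asynchronously; here it fires on the third pass. */
static void simulate_interrupt(int pass)
{
    if (pass == 2)
        emergency_sync_scheduled = 1;
}

/* Bounded version of the polling loop. Returns how many syncs ran. */
static int panic_tail(int max_passes)
{
    assert(max_passes > 0);

    for (int pass = 0; pass < max_passes; pass++) {
        simulate_interrupt(pass);
        CHECK_EMERGENCY_SYNC
    }
    return sync_count;
}
```

Here panic_tail(10) returns 1: the flag is re-read on each pass, so the change made on the third pass is seen and do_emergency_sync() is called once.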
panic() isn't actually supposed to return... it is, after all, a panic.
That is understandable, but it would be nice if it were coded that way
explicitly instead of depending on a compiler optimization.
This code looks like it is waiting for the variable emergency_sync_scheduled to
become non-zero and then calling do_emergency_sync. Coded as is, it will never call