Description of problem:
System locks up in panic() in kernel 2.4.9-e.3. Checking 2.4.9-e.16 shows the same code.

Version-Release number of selected component (if applicable):
Problem reproduced in 2.4.9-e.3. The code is the same in 2.4.9-e.16.

How reproducible:
Difficult.

Steps to Reproduce:
1. Heavily stress an SMP system until a panic is received. (Still diagnosing)

Actual results:
Endless loop in panic.c when doing CHECK_EMERGENCY_SYNC.

Expected results:
No hang.

Additional info:
Using an ITP we obtained the following code from a locked-up processor:

0x0148:0010:00000000c011b989 e872cffeff      call $-0x00013089 ;a=c0108900
0x0148:0010:00000000c011b98e a1f4d53dc0      mov eax, dword ptr [0xc03dd5f4]
0x0148:0010:00000000c011b993 8db600000000    lea esi, dword ptr [esi+0x00000000]
0x0148:0010:00000000c011b999 8dbc2700000000  lea edi, dword ptr [edi+0x00000000]
0x0148:0010:00000000c011b9a0 85c0            test eax, eax
0x0148:0010:00000000c011b9a2 74fc            jz $-0x02 ;a=c011b9a0
0x0148:0010:00000000c011b9a4 e8878d0700      call $+0x00078d8c ;a=c0194730

Please note that the "test eax, eax" followed by the jz is not going anywhere: the jz branches straight back to the test, and eax is never reloaded inside the loop. eax was previously loaded with the value of emergency_sync_scheduled. This appears at the end of panic() in linux/kernel/panic.c, in the macro CHECK_EMERGENCY_SYNC, which is defined in linux/include/linux/sysrq.h.

At first glance, it looks like emergency_sync_scheduled should be marked volatile.
Hmmm... I just checked 2.4.20 from kernel.org and sysrq.h has been patched to define emergency_sync_scheduled as "volatile int".
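For illustration, the effect of the "volatile int" declaration can be sketched in user-space C. This is not the kernel source: `do_emergency_sync_stub`, `simulate_async_event`, and `wait_for_emergency_sync` are hypothetical names, and the single-threaded harness only stands in for a flag that an interrupt handler or another CPU would set. It cannot reproduce the hang itself; it only shows the shape of a wait loop whose flag is reloaded on every pass.

```c
/* Hypothetical stand-in for the kernel flag. "volatile" tells the compiler
 * that every read written in the source must be a real load from memory. */
static volatile int emergency_sync_scheduled;

static int sync_calls;        /* how many times the stub ran */
static int polls_until_set;   /* countdown used by the simulated event */

/* Hypothetical stub; it only counts calls and does no real work. */
static void do_emergency_sync_stub(void)
{
    sync_calls++;
}

/* Stands in for an interrupt or another CPU setting the flag later on. */
static void simulate_async_event(void)
{
    if (polls_until_set > 0)
        polls_until_set--;
    if (polls_until_set == 0)
        emergency_sync_scheduled = 1;
}

/* Spin until the flag is set, then run the sync stub once.
 * Returns how many times the flag was polled and found clear. */
static int wait_for_emergency_sync(int set_after)
{
    int polls = 0;

    polls_until_set = set_after;
    emergency_sync_scheduled = 0;

    while (!emergency_sync_scheduled) {   /* reloaded on every pass */
        simulate_async_event();
        polls++;
    }

    emergency_sync_scheduled = 0;
    do_emergency_sync_stub();
    return polls;
}
```

Calling `wait_for_emergency_sync(3)` returns 3 and leaves `sync_calls` at 1. Without the qualifier, a compiler that can prove nothing in the loop body writes the flag is free to read it once and spin on the register copy, which matches the disassembly in the report.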
Ehm, panic() isn't actually supposed to return... it's, like, a panic.
That is understandable. But it would be nice if it were coded that way instead of depending on a compiler optimization. This code looks like it waits for the variable emergency_sync_scheduled to become non-zero and then calls do_emergency_sync. Coded as is, it will never call do_emergency_sync, yes?
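To make the question concrete, here is a rough C model of what the disassembled loop does. It is a sketch under assumptions, not the kernel code: `flag_in_memory`, `sync_path_reached`, and `spin_with_hoisted_load` are hypothetical names, and the `max_spins` bound exists only so the model terminates (the real loop has no bound).

```c
/* Hypothetical stand-in for the flag as it lives in memory. */
static int flag_in_memory;

/* Counts how often the sync path was reached. */
static int sync_path_reached;

/* Models the compiled loop: one load before the loop, after which only
 * the register copy is tested. Returns 1 if the sync path ran, else 0. */
static int spin_with_hoisted_load(int max_spins)
{
    flag_in_memory = 0;
    sync_path_reached = 0;

    int cached = flag_in_memory;       /* mov eax, [flag], done once */

    for (int i = 0; i < max_spins; i++) {
        if (i == 1)
            flag_in_memory = 1;        /* the flag gets set "elsewhere" */

        if (cached) {                  /* test eax, eax / jz */
            sync_path_reached++;       /* the call that follows the jz */
            return 1;
        }
    }
    return 0;
}
```

In this model the flag in memory does change, but the loop only ever tests the stale copy, so the branch to the sync path is never taken, however large `max_spins` is.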