97138 – (SCHEDULER)NULL pointer dereference in try_to_wake_up

Summary: (SCHEDULER)NULL pointer dereference in try_to_wake_up

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-06-10 19:52 UTC by Petr Vandrovec
Modified:	2007-04-18 16:54 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-07-05 08:02:47 UTC
Embargoed:

Description Petr Vandrovec 2003-06-10 19:52:23 UTC

Description of problem:

*pde = 00000000                      
Oops: 0000                      

CPU:    1                      
EIP:    0060:[<c011d601>]    Not tainted                      
EFLAGS: 00000046                      

EIP is at try_to_wake_up [kernel] 0x21 (2.4.20-8smp)                      
eax: 00000000   ebx: c183dfa8   ecx: c183df9c   edx: c183df98                   
esi: 00000000   edi: 00000001   ebp: c183df7c   esp: c183df60                   
ds: 0068   es: 0068   ss: 0068                                             
Process swapper (pid: 5, stackpage=c183d000)                      
Stack: c183df8c 00000000 00000000 00000246 c183dfa8 c03ba280 00000001 c183df90  
c011d81e 00000000 00000007 00000000 c183dfc8 c011f870 00000282 c03bac1c   
c03bac1c c183c000 00000000 00000001 c183dfb0 c183dfb0 00000246 c183c000   
Call Trace:   [<c011d81e>] wake_up_process [kernel] 0x1e (0xc183df80))          
[<c011f870>] set_cpus_allowed [kernel] 0xd0 (0xc183df94))                      
[<c0129181>] ksoftirqd [kernel] 0x51 (0xc183dfcc))                           
[<c0129130>] ksoftirqd [kernel] 0x0 (0xc183dfe0))                      
[<c010759d>] kernel_thread_helper [kernel] 0x5 (0xc183dff0))

Looking at the kernel code, migration_init does:

migration_call(smp_processor_id());  <<<<< 1
for (cpu = 0; cpu < smp_num_cpus; cpu++)
   if (cpu != smp_processor_id())    <<<<< 2
      migration_call(cpu);

Is there any code which ensures that we run on the same CPU in steps 1 and 2? I
did not find anything ensuring that, and it could explain why migration_thread
of some runqueue was NULL when we were starting ksoftirqd threads:
migration_init started two migration threads on one of the CPUs, and no thread
on another...
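
To make the suspected race concrete, here is a small user-space sketch of the loop above. It is not kernel code: every fake_* name, NCPUS, and the cpu_after_step1 parameter are made up for illustration, and the "migration" of the caller is simulated by changing a variable between step 1 and step 2.

```c
#include <assert.h>

#define NCPUS 2

/* Hypothetical stand-ins, not the kernel's real data structures. */
int threads_started[NCPUS];   /* how many threads each CPU was given */
int fake_current_cpu;         /* what fake_smp_processor_id() reports */

int fake_smp_processor_id(void)
{
    return fake_current_cpu;
}

void fake_migration_call(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    threads_started[cpu]++;
}

/*
 * Same shape as the loop quoted in the report.  cpu_after_step1 is the
 * CPU the caller finds itself on once step 1 is done; pass the same
 * value as start_cpu to model "never moved".
 */
void fake_migration_init(int start_cpu, int cpu_after_step1)
{
    int cpu;

    for (cpu = 0; cpu < NCPUS; cpu++)
        threads_started[cpu] = 0;

    fake_current_cpu = start_cpu;
    fake_migration_call(fake_smp_processor_id());      /* step 1 */

    fake_current_cpu = cpu_after_step1;                /* caller moved */

    for (cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != fake_smp_processor_id())            /* step 2 */
            fake_migration_call(cpu);
}

int threads_on(int cpu)
{
    return threads_started[cpu];
}
```

Calling fake_migration_init(0, 0) leaves one thread on each CPU. Calling fake_migration_init(0, 1) leaves two on CPU 0 and none on CPU 1, which is the state that would leave one runqueue without a migration thread.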

Unfortunately the problem is not easily reproducible, but as I see that nothing
changed in this area in 2.4.20-20.1.2007, I'm reporting it anyway, although
there are newer kernels available. I believe that 2.5.x is safe due to its use
of cpu notifiers.


Version-Release number of selected component (if applicable):
kernel-smp-2.4.20-8, but I believe that 2.4.20-20.1.2007.nptl is affected too.

How reproducible:
Occasionally.

Steps to Reproduce:
1. Get an SMP machine
2. Install kernel-smp-2.4.20-8
3. Reboot it again and again
    
Actual results:
Sometimes it crashes during bootup with the oops above.

Expected results:
No crash.

Comment 1 Ingo Molnar 2003-07-05 08:02:47 UTC

agreed, this is a bug. The solution is to create an explicit sleep/wake/sleep
cycle using completions, between the init thread and the migration threads -
this fixed the problem here. The patch will probably be in the next erratum.
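
For readers unfamiliar with completions, the following is a rough user-space analogy of the handshake idea, written with pthreads. It is not the actual patch. The completion functions are named after the kernel API but reimplemented here, and worker, worker_arg, registered[] and start_workers_with_handshake are hypothetical. It only shows the "init waits until each thread has announced itself" half of the cycle.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCPUS 4

/* User-space stand-in for a completion: a flag guarded by a mutex and condvar. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
};

void init_completion(struct completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

void complete(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

void wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)                      /* guards against spurious wakeups */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

struct worker_arg {
    int cpu;
    struct completion started;
};

/* Hypothetical per-CPU slot; models a runqueue's migration_thread being set. */
int registered[NCPUS];

void *worker(void *p)
{
    struct worker_arg *arg = p;

    assert(arg->cpu >= 0 && arg->cpu < NCPUS);
    registered[arg->cpu] = 1;             /* announce ourselves first ... */
    complete(&arg->started);              /* ... then wake the init thread */
    return NULL;
}

/* Returns how many slots were populated before any join, or -1 on error. */
int start_workers_with_handshake(void)
{
    pthread_t tid[NCPUS];
    struct worker_arg arg[NCPUS];
    int cpu, created = 0, populated = 0;

    for (cpu = 0; cpu < NCPUS; cpu++) {
        registered[cpu] = 0;
        arg[cpu].cpu = cpu;
        init_completion(&arg[cpu].started);
        if (pthread_create(&tid[cpu], NULL, worker, &arg[cpu]) != 0)
            break;
        created++;
        /* Do not move on until this worker has registered itself. */
        wait_for_completion(&arg[cpu].started);
    }

    for (cpu = 0; cpu < created; cpu++)
        populated += registered[cpu];
    for (cpu = 0; cpu < created; cpu++)
        pthread_join(tid[cpu], NULL);

    return created == NCPUS ? populated : -1;
}
```

Because the starter blocks on each completion, every slot is already populated when the loop finishes, regardless of how the threads are scheduled. The slots are counted before the joins to show that the guarantee comes from the handshake and not from pthread_join.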
