Description of problem:

*pde = 00000000
Oops: 0000
CPU:    1
EIP:    0060:[<c011d601>]    Not tainted
EFLAGS: 00000046

EIP is at try_to_wake_up [kernel] 0x21 (2.4.20-8smp)
eax: 00000000   ebx: c183dfa8   ecx: c183df9c   edx: c183df98
esi: 00000000   edi: 00000001   ebp: c183df7c   esp: c183df60
ds: 0068   es: 0068   ss: 0068
Process swapper (pid: 5, stackpage=c183d000)
Stack: c183df8c 00000000 00000000 00000246 c183dfa8 c03ba280 00000001 c183df90
       c011d81e 00000000 00000007 00000000 c183dfc8 c011f870 00000282 c03bac1c
       c03bac1c c183c000 00000000 00000001 c183dfb0 c183dfb0 00000246 c183c000
Call Trace:
 [<c011d81e>] wake_up_process [kernel] 0x1e (0xc183df80))
 [<c011f870>] set_cpus_allowed [kernel] 0xd0 (0xc183df94))
 [<c0129181>] ksoftirqd [kernel] 0x51 (0xc183dfcc))
 [<c0129130>] ksoftirqd [kernel] 0x0 (0xc183dfe0))
 [<c010759d>] kernel_thread_helper [kernel] 0x5 (0xc183dff0))

Looking at the kernel code, migration_init does:

        migration_call(smp_processor_id());             <<<<< 1
        for (cpu = 0; cpu < smp_num_cpus; cpu++)
                if (cpu != smp_processor_id())          <<<<< 2
                        migration_call(cpu);

Is there any code which ensures that we run on the same CPU in steps 1 and 2? I did not find anything ensuring that, and it could explain why migration_thread of some runqueue was NULL when we were starting the ksoftirqd threads: if smp_processor_id() returns different values in steps 1 and 2, migration_init starts two migration threads on one CPU and none on another.

Unfortunately the problem is not easily reproducible, but as I see that nothing changed in this area in 2.4.20-20.1.2007, I'm reporting it anyway, although there are newer kernels available. I believe that 2.5.x is safe due to its use of cpu notifiers.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.20-8, but I believe that 2.4.20-20.1.2007.nptl is affected too.

How reproducible:
Occasionally.

Steps to Reproduce:
1. Get an SMP machine
2. Install kernel-smp-2.4.20-8
3. Reboot it again and again

Actual results:
Sometimes it crashes during bootup with the oops above.

Expected results:
No crash.
Agreed, this is a bug. The solution is to create an explicit sleep/wake/sleep cycle using completions between the init thread and the migration threads; this fixed the problem here. The patch will probably be in the next erratum.
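
For illustration only, the handshake idea can be sketched in userspace C with pthreads. This is not the erratum patch and not kernel code: completion_sketch, slot, NR_SLOTS and start_workers are hypothetical names, and a mutex plus condition variable stands in for the kernel's completion. It only shows the starter sleeping until each worker has confirmed its own registration, so no slot can be left unset when later code relies on it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel completion. */
struct completion_sketch {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static void init_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Mark the event as done and wake the waiter. */
static void complete_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Sleep until complete_sketch() has been called. */
static void wait_for_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

#define NR_SLOTS 4 /* assumed number of per-CPU slots, illustration only */

/* Hypothetical analogue of a runqueue with its migration thread pointer. */
struct slot {
    pthread_t thread;
    int registered;
    struct completion_sketch started;
};

static struct slot slots[NR_SLOTS];

static void *worker(void *arg)
{
    struct slot *s = arg;

    s->registered = 1;            /* the worker registers itself in its slot */
    complete_sketch(&s->started); /* then wakes the sleeping starter */
    return NULL;
}

/*
 * Start one worker per slot and sleep until that worker has confirmed
 * its registration before moving on. Returns the number of registered
 * slots, or -1 if a thread could not be created.
 */
int start_workers(void)
{
    int count = 0;

    for (int i = 0; i < NR_SLOTS; i++) {
        slots[i].registered = 0;
        init_completion_sketch(&slots[i].started);
        if (pthread_create(&slots[i].thread, NULL, worker, &slots[i]) != 0)
            return -1;
        wait_for_completion_sketch(&slots[i].started);
        assert(slots[i].registered); /* guaranteed by the handshake */
        count += slots[i].registered;
    }
    for (int i = 0; i < NR_SLOTS; i++) {
        pthread_join(slots[i].thread, NULL);
        pthread_cond_destroy(&slots[i].started.cond);
        pthread_mutex_destroy(&slots[i].started.lock);
    }
    return count;
}
```

Under these assumptions, start_workers() returns NR_SLOTS whenever every thread could be created, because the starter never proceeds past a slot that has not been registered.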