Description of problem: When booting the kernel-debug package, NMI lockups or system hangs are seen after enabling cpuspeed. Version-Release number of selected component (if applicable): 147.el5 How reproducible: 100% Steps to Reproduce: 1. boot kernel-debug for 148.el5 or 147.el5 Actual results: System hangs or sometimes I see <jburke> Starting mcstransd: BUG: NMI Watchdog detected LOCKUP on CPU2, eip c042bf71, registers: <jburke> CPU: 2 <jburke> EIP is at getnstimeofday+0x13/0xb6 <jburke> eax: 00000001 ebx: ff6d93e0 ecx: 00000006 edx: 01bfdb64 <jburke> esi: 00000006 edi: 00000000 ebp: 000144ed esp: f7cdaee4 <jburke> ds: 007b es: 007b ss: 0068 <jburke> Process swapper (pid: 0, ti=f7cda000 task=f7c8ece0 task.ti=f7cda000) <jburke> Stack: f7cdaf08 000144ed f7cdaf08 f7cdaf80 f7cdaf10 c043800f c30d41a4 00000002 <jburke> c0420c8c 4a0d944b 17f3f9fd f7cdaf24 c0420d15 02943608 00000002 f7cdaf80 <jburke> f7cdaf58 c0420d80 00000003 fffc0e92 c0621d73 00000004 f7c8ece0 f7c8ece0 <jburke> Call Trace: <jburke> [<c043800f>] ktime_get_ts+0x16/0x44 <jburke> [<c0420c8c>] get_monotonic_time+0xe/0x1b <jburke> [<c0420d15>] sched_clock_tick+0x7c/0xb6 <jburke> [<c0420d80>] scheduler_tick+0x31/0x340 <jburke> [<c0621d73>] schedule+0xa1f/0xaab <jburke> [<c042d118>] update_process_times+0x57/0x62 <jburke> [<c041788d>] smp_apic_timer_interrupt+0x66/0x75 <jburke> [<c0405aee>] apic_timer_interrupt+0x2a/0x30 <jburke> [<c0403bf3>] default_idle+0x3e/0x6d <jburke> [<c0403bf5>] default_idle+0x40/0x6d <jburke> [<c0403cc6>] cpu_idle+0xa4/0xbe <jburke> ======================= <jburke> Code: c3 a1 40 54 85 c0 99 0f a4 c2 14 c1 e0 14 01 c1 11 d3 89 c8 89 da 5b c3 55 57 56 53 83 ec 04 89 04 24 8b 2d 80 9c 74 c0 0f ae e8 <66> 90 a1 10 43 85 c0 8b 0c 24 8b 15 14 43 85 c0 89 01 89 51 04 Expected results: No hang. Additional info: Git bisect points the at pzilstr's commits between 6bbc8c76e0d31d43687e6af6157a2ecafe098cb3 and 0d1af16a1628460ad295607ce90c9f24a99a5391. The time.c code is finicky -- it's loaded with int/long overflows, etc.. Seen on sun-v40z-01.rhts.bos.redhat.com, which is currently checked out by jburke, and can be extended by executing 'extendtesttime.sh'.
Can be replicated using kernel-PAE...-148.el5 & manually starting the cpuspeed service. P.
pzijlstr -- if you decide to use the system to debug, I have *disabled* cpuspeed in chkconfig. You have to start it manually. FYI, P.
Can be reproduced using kernel-2.6.18-148.el5, and manually restarting cpuspeed service or loading the powernow-k8 module. P.
There are reports of slow-downs during boot (see comment #10 in BZ 499999). I don't know if this is related to this BZ, but it definitely could be. P.
Backing out cca0fdba55f43de5eed0d60e257c4c17d3e33396 (and making minor changes to set_cyc2ns_scale() ) fix the issue -- ie) the above change is responsible for the strange behavior. ... this feels like a type-casting issue ... P.
From the previous patch: @@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) idle->cpus_allowed = cpumask_of_cpu(cpu); set_task_cpu(idle, cpu); - spin_lock_irqsave(&rq->lock, flags); + spin_lock(&rq->lock); + rq->curr = rq->idle = idle; #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) idle->oncpu = 1; Not 100% sure if this is it but this looks wrong. This changes the lock acquisition to spin_lock ... but the "free'ing" of the lock is still spin_lock_irqrestore, which would use a bogus flags value. (Unless I'm reading the patch wrong) P.
Prarit, Your observation in comment #6 might be correct, but the only real difference between spin_lock and spin_lock_irqsave is the call to local_irq_save that is added. Since a call to that is done at the beginning of init_idle it might not be needed. It still seems a bit odd that there is a gap between when interrupts are saved and when the lock is taken (and I don't really see that anywhere upstream). The closest thing out there that is trying to emulate this might be here: commit 5cbd54ef470d880fc37fbe4b21eb514806d51e0d Author: Ingo Molnar <mingo> Date: Wed Nov 12 20:05:50 2008 +0100 sched: fix init_idle()'s use of sched_clock() I'm still not sure this is critical, but it could be interesting to replace this hunk in cca0fdba55f43de5eed0d60e257c4c17d3e33396 @@ -5096,7 +5328,9 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) struct rq *rq = cpu_rq(cpu); unsigned long flags; - idle->timestamp = sched_clock(); + local_irq_save(flags); + + idle->timestamp = sched_clock_local(); idle->sleep_avg = 0; idle->array = NULL; idle->prio = idle->normal_prio = MAX_PRIO; @@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) idle->cpus_allowed = cpumask_of_cpu(cpu); set_task_cpu(idle, cpu); - spin_lock_irqsave(&rq->lock, flags); + spin_lock(&rq->lock); + rq->curr = rq->idle = idle; #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) idle->oncpu = 1; with something like this: @@ -5096,7 +5328,9 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) struct rq *rq = cpu_rq(cpu); unsigned long flags; - idle->timestamp = sched_clock(); + spin_lock_irqsave(&rq->lock, flags); + + idle->timestamp = sched_clock_local(); idle->sleep_avg = 0; idle->array = NULL; idle->prio = idle->normal_prio = MAX_PRIO; @@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu) idle->cpus_allowed = cpumask_of_cpu(cpu); set_task_cpu(idle, cpu); - spin_lock_irqsave(&rq->lock, flags); - rq->curr = rq->idle = idle; #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) idle->oncpu = 1; Just a thought from a casual observer....
Its as Andy writes, I simply moved the irq_save() up a few lines, since sched_clock_local() wants IRQs disabled, but assigning stuff to idle doesn't require rq->lock. That said, reverting this patch removes the whole call path observed in the NMI watchdog: [<c043800f>] ktime_get_ts+0x16/0x44 [<c0420c8c>] get_monotonic_time+0xe/0x1b [<c0420d15>] sched_clock_tick+0x7c/0xb6 [<c0420d80>] scheduler_tick+0x31/0x340 [<c0621d73>] schedule+0xa1f/0xaab [<c042d118>] update_process_times+0x57/0x62 [<c041788d>] smp_apic_timer_interrupt+0x66/0x75 [<c0405aee>] apic_timer_interrupt+0x2a/0x30 [<c0403bf3>] default_idle+0x3e/0x6d [<c0403bf5>] default_idle+0x40/0x6d [<c0403cc6>] cpu_idle+0xa4/0xbe As it includes adding sched_clock_tick() to scheduler_tick(). Now it would be interesting to see where: c043800f is in the code, would anybody be having the debuginfo of that particular kernel build handy? I'm betting that is right smack in the middle of that xtime_lock loop. Which would mean I missed a path where we're holding xtime_lock over update_process_times()... i386-SMP, right? /me goes look.
Peter, yup -- i386 smp (well ... i686.config ;) ). P.
Peter, From self-compiled 148.el5: crash> dis c043800f 0xc043800f <ktime_get_ts+0x16>: mov 0xc0854320,%edx crash> dis -l ktime_get_ts /home/kernel/kernel/hrtimer.c: 109 0xc0437ff9 <ktime_get_ts>: push %esi 0xc0437ffa <ktime_get_ts+0x1>: mov %eax,%esi 0xc0437ffc <ktime_get_ts+0x3>: push %ebx include/linux/seqlock.h: 88 0xc0437ffd <ktime_get_ts+0x4>: mov 0xc0749c80,%ebx include/linux/seqlock.h: 89 0xc0438003 <ktime_get_ts+0xa>: lfence 0xc0438006 <ktime_get_ts+0xd>: data16 0xc0438007 <ktime_get_ts+0xe>: nop /home/kernel/kernel/hrtimer.c: 115 0xc0438008 <ktime_get_ts+0xf>: mov %esi,%eax 0xc043800a <ktime_get_ts+0x11>: call 0xc042bf5e <getnstimeofday> /home/kernel/kernel/hrtimer.c: 116 0xc043800f <ktime_get_ts+0x16>: mov 0xc0854320,%edx 0xc0438015 <ktime_get_ts+0x1c>: mov 0xc0854324,%ecx include/linux/seqlock.h: 103 0xc043801b <ktime_get_ts+0x22>: lfence 0xc043801e <ktime_get_ts+0x25>: data16 0xc043801f <ktime_get_ts+0x26>: nop /home/kernel/kernel/hrtimer.c: 118 0xc0438020 <ktime_get_ts+0x27>: mov %ebx,%eax 0xc0438022 <ktime_get_ts+0x29>: and $0x1,%ebx 0xc0438025 <ktime_get_ts+0x2c>: xor 0xc0749c80,%eax 0xc043802b <ktime_get_ts+0x32>: or %ebx,%eax 0xc043802d <ktime_get_ts+0x34>: jne 0xc0437ffd <ktime_get_ts+0x4> /home/kernel/kernel/hrtimer.c: 120 0xc043802f <ktime_get_ts+0x36>: add 0x4(%esi),%ecx 0xc0438032 <ktime_get_ts+0x39>: mov %esi,%eax 0xc0438034 <ktime_get_ts+0x3b>: add (%esi),%edx /home/kernel/kernel/hrtimer.c: 122 0xc0438036 <ktime_get_ts+0x3d>: pop %ebx 0xc0438037 <ktime_get_ts+0x3e>: pop %esi /home/kernel/kernel/hrtimer.c: 120 0xc0438038 <ktime_get_ts+0x3f>: jmp 0xc0428b04 <set_normalized_timespec>
From IRC chat with Peter: <prarit> Okay -- I think I may have found it. Hold on a few minutes and let me re-review. <prarit> time_cpufreq_notifier <prarit> -> acquires xtime_lock via write_seqlock_irq(&xtime_lock); <prarit> -> set_cyc2ns_scale <prarit> -> sched_clock_idle_wakeup_event <prarit> -> sched_clock_tick <prarit> -> get_monotonic_time <prarit> -> ktime_get_ts <prarit> which attempts a <prarit> seq = read_seqbegin(&xtime_lock) DEADLOCK. The write_seqlock_irq() actually *acquires* the lock via a spin_lock() call. read_seqbegin() only checks to see if the lock is free and barrels on through using a sequence # to determine if any changes to xtime have been made. P.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-150.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html