Bug 501178

Summary:	RHEL5: NMI lockups seen after enabling cpuspeed on -147.el5 & -148.el5
Product:	Red Hat Enterprise Linux 5	Reporter:	Prarit Bhargava <prarit>
Component:	kernel	Assignee:	Prarit Bhargava <prarit>
Status:	CLOSED ERRATA	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.4	CC:	agospoda, dzickus, emcnabb, jburke, mgahagan, prarit, pzijlstr
Target Milestone:	beta	Keywords:	Regression
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-02 08:34:04 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Prarit Bhargava 2009-05-17 15:07:53 UTC

Description of problem:

When booting the kernel-debug package, NMI lockups or system hangs are seen after enabling cpuspeed.

Version-Release number of selected component (if applicable): 147.el5


How reproducible: 100%


Steps to Reproduce:
1. boot kernel-debug for 148.el5 or 147.el5
  
Actual results:

System hangs or sometimes I see

<jburke> Starting mcstransd: BUG: NMI Watchdog detected LOCKUP on CPU2, eip c042bf71, registers:
<jburke> CPU:    2
<jburke> EIP is at getnstimeofday+0x13/0xb6
<jburke> eax: 00000001   ebx: ff6d93e0   ecx: 00000006   edx: 01bfdb64
<jburke> esi: 00000006   edi: 00000000   ebp: 000144ed   esp: f7cdaee4
<jburke> ds: 007b   es: 007b   ss: 0068
<jburke> Process swapper (pid: 0, ti=f7cda000 task=f7c8ece0 task.ti=f7cda000)
<jburke> Stack: f7cdaf08 000144ed f7cdaf08 f7cdaf80 f7cdaf10 c043800f c30d41a4 00000002 
<jburke>        c0420c8c 4a0d944b 17f3f9fd f7cdaf24 c0420d15 02943608 00000002 f7cdaf80 
<jburke>        f7cdaf58 c0420d80 00000003 fffc0e92 c0621d73 00000004 f7c8ece0 f7c8ece0 
<jburke> Call Trace:
<jburke>  [<c043800f>] ktime_get_ts+0x16/0x44
<jburke>  [<c0420c8c>] get_monotonic_time+0xe/0x1b
<jburke>  [<c0420d15>] sched_clock_tick+0x7c/0xb6
<jburke>  [<c0420d80>] scheduler_tick+0x31/0x340
<jburke>  [<c0621d73>] schedule+0xa1f/0xaab
<jburke>  [<c042d118>] update_process_times+0x57/0x62
<jburke>  [<c041788d>] smp_apic_timer_interrupt+0x66/0x75
<jburke>  [<c0405aee>] apic_timer_interrupt+0x2a/0x30
<jburke>  [<c0403bf3>] default_idle+0x3e/0x6d
<jburke>  [<c0403bf5>] default_idle+0x40/0x6d
<jburke>  [<c0403cc6>] cpu_idle+0xa4/0xbe
<jburke>  =======================
<jburke> Code: c3 a1 40 54 85 c0 99 0f a4 c2 14 c1 e0 14 01 c1 11 d3 89 c8 89 da 5b c3 55 57 56 53 83 ec 04 89 04 24 8b 2d 80 9c 74 c0 0f ae e8 <66> 90 a1 10 43 85 c0 8b 0c 24 8b 15 14 43 85 c0 89 01 89 51 04 

Expected results: No hang.


Additional info: Git bisect points the at pzilstr's commits between 6bbc8c76e0d31d43687e6af6157a2ecafe098cb3 and 0d1af16a1628460ad295607ce90c9f24a99a5391.

The time.c code is finicky -- it's loaded with int/long overflows, etc.. 

Seen on sun-v40z-01.rhts.bos.redhat.com, which is currently checked out by jburke, and can be extended by executing 'extendtesttime.sh'.

Comment 1 Prarit Bhargava 2009-05-17 15:53:01 UTC

Can be replicated using kernel-PAE...-148.el5 & manually starting the cpuspeed service.

P.

Comment 2 Prarit Bhargava 2009-05-17 16:03:35 UTC

pzijlstr -- if you decide to use the system to debug, I have *disabled* cpuspeed in chkconfig.  You have to start it manually.

FYI,

P.

Comment 3 Prarit Bhargava 2009-05-17 16:05:36 UTC

Can be reproduced using kernel-2.6.18-148.el5, and manually restarting cpuspeed service or loading the powernow-k8 module.

P.

Comment 4 Prarit Bhargava 2009-05-18 19:17:48 UTC

There are reports of slow-downs during boot (see comment #10 in BZ 499999).  I don't know if this is related to this BZ, but it definitely could be.

P.

Comment 5 Prarit Bhargava 2009-05-18 22:23:22 UTC

Backing out cca0fdba55f43de5eed0d60e257c4c17d3e33396 (and making minor changes
to  set_cyc2ns_scale() ) fix the issue -- ie) the above change is responsible
for the strange behavior.

... this feels like a type-casting issue ...

P.

Comment 6 Prarit Bhargava 2009-05-18 22:29:41 UTC

From the previous patch:

@@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
        idle->cpus_allowed = cpumask_of_cpu(cpu);
        set_task_cpu(idle, cpu);
 
-       spin_lock_irqsave(&rq->lock, flags);
+       spin_lock(&rq->lock);
+
        rq->curr = rq->idle = idle;
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
        idle->oncpu = 1;

Not 100% sure if this is it but this looks wrong.  This changes the lock acquisition to spin_lock ... but the "free'ing" of the lock is still spin_lock_irqrestore, which would use a bogus flags value.

(Unless I'm reading the patch wrong)

P.

Comment 7 Andy Gospodarek 2009-05-19 01:56:41 UTC

Prarit,

Your observation in comment #6 might be correct, but the only real difference between spin_lock and spin_lock_irqsave is the call to local_irq_save that is added.  Since a call to that is done at the beginning of init_idle it might not be needed.  It still seems a bit odd that there is a gap between when interrupts are saved and when the lock is taken (and I don't really see that anywhere upstream).  The closest thing out there that is trying to emulate this might be here:

commit 5cbd54ef470d880fc37fbe4b21eb514806d51e0d
Author: Ingo Molnar <mingo>
Date:   Wed Nov 12 20:05:50 2008 +0100

    sched: fix init_idle()'s use of sched_clock()

I'm still not sure this is critical, but it could be interesting to replace this hunk in cca0fdba55f43de5eed0d60e257c4c17d3e33396

@@ -5096,7 +5328,9 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
        struct rq *rq = cpu_rq(cpu);
        unsigned long flags;
 
-       idle->timestamp = sched_clock();
+       local_irq_save(flags);
+
+       idle->timestamp = sched_clock_local();
        idle->sleep_avg = 0;
        idle->array = NULL;
        idle->prio = idle->normal_prio = MAX_PRIO;
@@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
        idle->cpus_allowed = cpumask_of_cpu(cpu);
        set_task_cpu(idle, cpu);
 
-       spin_lock_irqsave(&rq->lock, flags);
+       spin_lock(&rq->lock);
+
        rq->curr = rq->idle = idle;
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
        idle->oncpu = 1;


with something like this:

@@ -5096,7 +5328,9 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
        struct rq *rq = cpu_rq(cpu);
        unsigned long flags;
 
-       idle->timestamp = sched_clock();
+       spin_lock_irqsave(&rq->lock, flags);
+
+       idle->timestamp = sched_clock_local();
        idle->sleep_avg = 0;
        idle->array = NULL;
        idle->prio = idle->normal_prio = MAX_PRIO;
@@ -5104,7 +5338,8 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
        idle->cpus_allowed = cpumask_of_cpu(cpu);
        set_task_cpu(idle, cpu);
 
-       spin_lock_irqsave(&rq->lock, flags);
- 
        rq->curr = rq->idle = idle;
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
        idle->oncpu = 1;


Just a thought from a casual observer....

Comment 8 Peter Zijlstra 2009-05-19 10:22:05 UTC

Its as Andy writes, I simply moved the irq_save() up a few lines, since sched_clock_local() wants IRQs disabled, but assigning stuff to idle doesn't require rq->lock.

That said, reverting this patch removes the whole call path observed in the NMI watchdog:

  [<c043800f>] ktime_get_ts+0x16/0x44
  [<c0420c8c>] get_monotonic_time+0xe/0x1b
  [<c0420d15>] sched_clock_tick+0x7c/0xb6
  [<c0420d80>] scheduler_tick+0x31/0x340
  [<c0621d73>] schedule+0xa1f/0xaab
  [<c042d118>] update_process_times+0x57/0x62
  [<c041788d>] smp_apic_timer_interrupt+0x66/0x75
  [<c0405aee>] apic_timer_interrupt+0x2a/0x30
  [<c0403bf3>] default_idle+0x3e/0x6d
  [<c0403bf5>] default_idle+0x40/0x6d
  [<c0403cc6>] cpu_idle+0xa4/0xbe

As it includes adding sched_clock_tick() to scheduler_tick().

Now it would be interesting to see where: c043800f is in the code, would anybody be having the debuginfo of that particular kernel build handy?

I'm betting that is right smack in the middle of that xtime_lock loop.

Which would mean I missed a path where we're holding xtime_lock over update_process_times()... i386-SMP, right?  /me goes look.

Comment 9 Prarit Bhargava 2009-05-19 12:35:05 UTC

Peter, yup -- i386 smp (well ... i686.config ;) ).

P.

Comment 10 Prarit Bhargava 2009-05-19 13:01:12 UTC

Peter,

From self-compiled 148.el5:

crash> dis c043800f
0xc043800f <ktime_get_ts+0x16>: mov    0xc0854320,%edx
crash> dis -l ktime_get_ts
/home/kernel/kernel/hrtimer.c: 109
0xc0437ff9 <ktime_get_ts>:      push   %esi
0xc0437ffa <ktime_get_ts+0x1>:  mov    %eax,%esi
0xc0437ffc <ktime_get_ts+0x3>:  push   %ebx
include/linux/seqlock.h: 88
0xc0437ffd <ktime_get_ts+0x4>:  mov    0xc0749c80,%ebx
include/linux/seqlock.h: 89
0xc0438003 <ktime_get_ts+0xa>:  lfence 
0xc0438006 <ktime_get_ts+0xd>:  data16
0xc0438007 <ktime_get_ts+0xe>:  nop    
/home/kernel/kernel/hrtimer.c: 115
0xc0438008 <ktime_get_ts+0xf>:  mov    %esi,%eax
0xc043800a <ktime_get_ts+0x11>: call   0xc042bf5e <getnstimeofday>
/home/kernel/kernel/hrtimer.c: 116
0xc043800f <ktime_get_ts+0x16>: mov    0xc0854320,%edx
0xc0438015 <ktime_get_ts+0x1c>: mov    0xc0854324,%ecx
include/linux/seqlock.h: 103
0xc043801b <ktime_get_ts+0x22>: lfence 
0xc043801e <ktime_get_ts+0x25>: data16
0xc043801f <ktime_get_ts+0x26>: nop    
/home/kernel/kernel/hrtimer.c: 118
0xc0438020 <ktime_get_ts+0x27>: mov    %ebx,%eax
0xc0438022 <ktime_get_ts+0x29>: and    $0x1,%ebx
0xc0438025 <ktime_get_ts+0x2c>: xor    0xc0749c80,%eax
0xc043802b <ktime_get_ts+0x32>: or     %ebx,%eax
0xc043802d <ktime_get_ts+0x34>: jne    0xc0437ffd <ktime_get_ts+0x4>
/home/kernel/kernel/hrtimer.c: 120
0xc043802f <ktime_get_ts+0x36>: add    0x4(%esi),%ecx
0xc0438032 <ktime_get_ts+0x39>: mov    %esi,%eax
0xc0438034 <ktime_get_ts+0x3b>: add    (%esi),%edx
/home/kernel/kernel/hrtimer.c: 122
0xc0438036 <ktime_get_ts+0x3d>: pop    %ebx
0xc0438037 <ktime_get_ts+0x3e>: pop    %esi
/home/kernel/kernel/hrtimer.c: 120
0xc0438038 <ktime_get_ts+0x3f>: jmp    0xc0428b04 <set_normalized_timespec>

Comment 11 Prarit Bhargava 2009-05-19 13:40:20 UTC

From IRC chat with Peter:

<prarit> Okay -- I think I may  have found it.  Hold on a few minutes and let me re-review.
<prarit> time_cpufreq_notifier
<prarit>         -> acquires xtime_lock via  write_seqlock_irq(&xtime_lock);
<prarit>         -> set_cyc2ns_scale
<prarit>                 -> sched_clock_idle_wakeup_event
<prarit>                         -> sched_clock_tick
<prarit>                                 -> get_monotonic_time
<prarit>                                         -> ktime_get_ts
<prarit>                                                 which attempts a
<prarit>                                                 seq = read_seqbegin(&xtime_lock)

DEADLOCK.

The write_seqlock_irq() actually *acquires* the lock via a spin_lock() call.  read_seqbegin() only checks to see if the lock is free and barrels on through using a sequence # to determine if any changes to xtime have been made.

P.

Comment 13 RHEL Program Management 2009-05-19 21:10:57 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Don Zickus 2009-05-21 15:37:14 UTC

in kernel-2.6.18-150.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 18 errata-xmlrpc 2009-09-02 08:34:04 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html