Bug 214832

Summary: getting unlock_cpu_hotplug warning at boot on rhel5-b2
Product: Red Hat Enterprise Linux 5 Reporter: Matthew Coffey <mcoffey>
Component: kernel Assignee: Prarit Bhargava <prarit>
Status: CLOSED DUPLICATE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 5.0 CC: jason_mack, jfeeney, rhentosh
Target Milestone: ---   
Target Release: ---   
Hardware: athlon   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-11-29 20:12:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 200812    

Description Matthew Coffey 2006-11-09 17:43:21 UTC
Description of problem:
When rhel5-b2 boots, it writes the following error message and trace to the console.

BUG: warning at kernel/cpu.c:56/unlock_cpu_hotplug() (Not tainted)

Call Trace:
 [<ffffffff80069620>] show_trace+0x34/0x47
 [<ffffffff80069645>] dump_stack+0x12/0x17
 [<ffffffff800a0cab>] unlock_cpu_hotplug+0x47/0x74
 [<ffffffff882942aa>] :cpufreq_ondemand:do_dbs_timer+0x11c/0x174
 [<ffffffff8004b5c9>] run_workqueue+0x94/0xe5
 [<ffffffff80048009>] worker_thread+0xf0/0x122
 [<ffffffff800322d0>] kthread+0xf6/0x12a
 [<ffffffff8005c365>] child_rip+0xa/0x11
DWARF2 unwinder stuck at child_rip+0xa/0x11
Leftover inexact backtrace:
 [<ffffffff8009c3c5>] keventd_create_kthread+0x0/0x61
 [<ffffffff800321da>] kthread+0x0/0x12a
 [<ffffffff8005c35b>] child_rip+0x0/0x11
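
Each frame in the trace above has the form `[<address>] symbol+0xoffset/0xsize`, where the offset is how far into the function the address lies and the size is the function's length in bytes. When comparing traces from different machines, it can help to pull those fields apart. The following is a minimal sketch in C; `struct frame` and `parse_frame` are hypothetical names made up for illustration, and the parser assumes any syslog prefix has already been stripped and that the frame sits on a single line.

```c
#include <stdio.h>

/* Hypothetical helper type; not part of any kernel or library API. */
struct frame {
    unsigned long long addr;   /* address printed inside [<...>] */
    char symbol[128];          /* symbol name, any :module: prefix kept */
    unsigned int offset;       /* bytes into the function */
    unsigned int size;         /* total function size in bytes */
};

/*
 * Parses one "[<addr>] symbol+0xoff/0xsize" frame.
 * Returns 1 if all four fields were read, 0 otherwise.
 */
int parse_frame(const char *line, struct frame *out)
{
    return sscanf(line, " [<%llx>] %127[^+]+0x%x/0x%x",
                  &out->addr, out->symbol, &out->offset, &out->size) == 4;
}
```

Subtracting `offset` from `addr` gives the start address of the function in that particular kernel build, which is why the raw addresses differ between builds while the `symbol+offset` part stays comparable.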

To date we've only installed this on two AMD Opteron systems, with Rev E and Rev
F CPUs, and it happens on both of them. The first is an HP DL385 and the second
is a DL365.

Version-Release number of selected component (if applicable):
RHEL5-b2

How reproducible:
Occurs every reboot.

Steps to Reproduce:
1. Reboot
2. The warning generally appears about the time that sendmail is starting
  
Actual results:


Expected results:


Additional info:

Comment 1 Jason Mack 2006-11-22 17:18:17 UTC
RHEL5-Beta2-x86_64:  BUG: warning at kernel/cpu.c:56/unlock_cpu_hotplug()  

Description:
During ad hoc overnight stress testing, a PE1950 failed the Newburn test on
RHEL5-Beta2-x86_64.  The same issue was later observed on a PE840.

Dell's CTCS Newburn SYSLOG:

>> Nov 16 06:39:34 pe1950-r5-b2-rc kernel: BUG: warning at kernel/cpu.c:56/unlock_cpu_hotplug() (Not tainted)
Thu Nov 16 06:40:18 CST 2006: SYSLOG FAILED: on 4/0 after 10h28m1s
4 fail 0 succeed 4 count

The PE1950 was very sluggish until Newburn exited (cleanly), but the system
never locked up.  It rebooted fine, and all lights are blue.  BIOS is at v1.3.0.
Kernel is 2.6.18-1.2747.el5.  Memory is 2.5 GB, using 2x 256MB and 2x 1024MB DIMMs.

Steps to Re-Create:
1:  On a PE1950 with 2 quad-core Intel CPUs, install and run RHEL5-B2 x86_64.
2:  Using Newburn from Dell's RHEL(4) CTCS, run stress overnight.
3:  Next morning, the system shows flashing fail messages from Newburn on screen.

I also see this issue on a PE840, another Xeon system.  It also appears to be
the same on every arch; I can see at least 5 reports here on RH's Bugzilla.  In
all these cases the warning is reported at line 56 of kernel/cpu.c, in the
function unlock_cpu_hotplug.

Since the issue always comes down to cpu.c's line 56, a common name for it could
be the 'line 56 bug.'

During subsequent testing, I noted that this bug is not specific to the Newburn
test.

The trace looks like this:

Nov 16 06:39:34 pe1950-r5-b2-rc kernel: BUG: warning at kernel/cpu.c:56/unlock_cpu_hotplug() (Not tainted)
Nov 16 06:39:34 pe1950-r5-b2-rc kernel: 
Nov 16 06:39:34 pe1950-r5-b2-rc kernel: Call Trace:
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff80069632>] show_trace+0x34/0x47
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff80069657>] dump_stack+0x12/0x17
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff800a0c60>] unlock_cpu_hotplug+0x47/0x74
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff882e52aa>] :cpufreq_ondemand:do_dbs_timer+0x11c/0x174
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff8004b5cc>] run_workqueue+0x94/0xe5
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff80048018>] worker_thread+0xf0/0x122
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff800322e7>] kthread+0xf6/0x12a
Nov 16 06:39:34 pe1950-r5-b2-rc kernel:  [<ffffffff8005c365>] child_rip+0xa/0x11
Nov 16 06:39:34 pe1950-r5-b2-rc kernel: DWARF2 unwinder stuck at child_rip+0xa/0x11
Nov 16 06:39:35 pe1950-r5-b2-rc kernel: Leftover inexact backtrace:
Nov 16 06:39:36 pe1950-r5-b2-rc kernel:  [<ffffffff8009c368>] keventd_create_kthread+0x0/0x61
Nov 16 06:39:36 pe1950-r5-b2-rc kernel:  [<ffffffff800321f1>] kthread+0x0/0x12a
Nov 16 06:39:37 pe1950-r5-b2-rc kernel:  [<ffffffff8005c35b>] child_rip+0x0/0x11

(Note: b2-rc=b2)

In cpu.c the function is:

void unlock_cpu_hotplug(void)
{
  WARN_ON(recursive != current);
  if (recursive_depth) {
    recursive_depth--;
    return;
  }
  mutex_unlock(&cpu_bitmask_lock);
  recursive = NULL;
}
EXPORT_SYMBOL_GPL(unlock_cpu_hotplug);

It is the second function in cpu.c.
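
Judging only from the function quoted above, one way the `WARN_ON(recursive != current)` check could trip is that `recursive` is cleared after `mutex_unlock()`, so another task might take the mutex and record itself as owner in between, then have that record wiped. The sketch below replays that interleaving in plain user-space C. It is a model under assumptions, not kernel code: the lock side is not quoted in this report, so `model_lock` assumes it takes the mutex and then sets `recursive = current`, and every `model_*` name and `simulate_interleaving` is hypothetical. Whether this is what actually happens under `do_dbs_timer` is not established here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel definitions. */
struct task { const char *name; };

static struct task *recursive;   /* owner recorded by the lock side */
static int recursive_depth;      /* nesting count for the owner */
static int mutex_held;           /* models cpu_bitmask_lock */
static int warnings;             /* times the line 56 check would fire */

/* ASSUMED lock side (not quoted in the report): take mutex, then record owner. */
static void model_lock(struct task *cur)
{
    if (cur == recursive) {
        recursive_depth++;
        return;
    }
    assert(!mutex_held);         /* this replay never models blocking */
    mutex_held = 1;
    recursive = cur;
}

/*
 * The quoted unlock, up to and including mutex_unlock().
 * Returns 1 if the trailing "recursive = NULL" store is still pending.
 */
static int model_unlock_release(struct task *cur)
{
    if (recursive != cur)
        warnings++;              /* WARN_ON(recursive != current) */
    if (recursive_depth) {
        recursive_depth--;
        return 0;
    }
    mutex_held = 0;
    return 1;
}

/* The final store of the quoted unlock, run separately to model a delay. */
static void model_unlock_clear(void)
{
    recursive = NULL;
}

/* Replays one interleaving of tasks A and B; returns the warning count. */
int simulate_interleaving(void)
{
    struct task a = { "A" }, b = { "B" };

    recursive = NULL;
    recursive_depth = 0;
    mutex_held = 0;
    warnings = 0;

    model_lock(&a);
    int pending = model_unlock_release(&a);  /* A drops the mutex */
    model_lock(&b);                          /* B takes it, records itself */
    if (pending)
        model_unlock_clear();                /* A's late store wipes B's record */
    if (model_unlock_release(&b))            /* B sees recursive != current */
        model_unlock_clear();
    return warnings;
}
```

Calling `simulate_interleaving()` from a small `main()` and printing the result is enough to try it. In this model only, running the clear before the release in task A's unlock leaves B's ownership record intact, which says nothing about what the proper kernel fix is.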

Some proposed fixes for this bug can be found in Bug 211301, for the ia64 platform.

Comment 2 Jason Mack 2006-11-22 17:42:38 UTC
Oh.  Line 56 is:

WARN_ON(recursive != current);


Comment 3 Jason Mack 2006-11-27 17:41:37 UTC
Well, OK, "the cpu.c line 56 bug".  So what else can I do to help?

Comment 5 Peter Martuccelli 2006-11-29 20:11:06 UTC
Prarit, please review this issue and bring in Konrad as required.  If possible,
resolve this for R4.5.

Comment 6 Prarit Bhargava 2006-11-29 20:12:23 UTC
Dup of 213455.

*** This bug has been marked as a duplicate of 213455 ***

Comment 7 Prarit Bhargava 2006-11-29 20:15:09 UTC
Sorry, duped to wrong bug # ....

*** This bug has been marked as a duplicate of 211301 ***