Bug 696794

Summary: CPU init: "not responding - cannot use it"
Product: Red Hat Enterprise Linux 5 Reporter: Steve Snyder <swsnyder>
Component: kernel Assignee: Prarit Bhargava <prarit>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 5.6 CC: jarod
Target Milestone: rc   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2011-08-08 17:45:08 UTC Type: ---

Description Steve Snyder 2011-04-14 20:32:58 UTC
Description of problem:

Kernel fails to initialize a CPU at boot time.

Version-Release number of selected component (if applicable):

kernel-2.6.18-238.9.1.el5

How reproducible:

Unknown.

Steps to Reproduce:
1. Examine system log after booting new kernel
2. Note log entries of failed initialization
  
Actual results:

Only 3 of 4 CPUs were successfully initialized

Expected results:

All CPUs should be initialized.

Additional info:

This problem is not readily reproducible, so I'm not expecting a fix.  I just want to get this on the record in case others report the same problem.

I've been running Red Hat kernels on the same hardware for 9 years (same BIOS for the last 5 years) and I have never seen this failure before.  Maybe it's just a fluke that it is seen on the first boot of a new kernel.  After seeing this failure I rebooted the machine again and no failure was seen.

My hardware: a Supermicro P4DC6+ motherboard with 2 x Prestonia (Pentium4-era) Xeon CPUs.  The BIOS on this machine is configured to enable Hyper-threading, which is how I get (or *should* get) 4 CPUs seen at boot time.

First boot of kernel 2.6.18-238.9.1.el5 :

Apr 14 07:35:31 kernel: Intel machine check architecture supported.
Apr 14 07:35:31 kernel: Intel machine check reporting enabled on CPU#0.
Apr 14 07:35:31 kernel: CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
Apr 14 07:35:31 kernel: CPU0: Thermal monitoring enabled
Apr 14 07:35:31 kernel: Checking 'hlt' instruction... OK.
Apr 14 07:35:31 kernel: SMP alternatives: switching to UP code
Apr 14 07:35:31 kernel: ACPI: Core revision 20060707
Apr 14 07:35:31 kernel: CPU0: Intel(R) Xeon(TM) CPU 2.40GHz stepping 07
Apr 14 07:35:31 kernel: SMP alternatives: switching to SMP code
Apr 14 07:35:31 kernel: Booting processor 1/1 eip 11000
Apr 14 07:35:31 kernel: CPU 1 irqstacks, hard=c0833000 soft=c082f000
Apr 14 07:35:31 kernel: Not responding.
Apr 14 07:35:31 kernel: Inquiring remote APIC #1...
Apr 14 07:35:31 kernel: ... APIC #1 ID: failed
Apr 14 07:35:31 kernel: ... APIC #1 VERSION: failed
Apr 14 07:35:31 kernel: ... APIC #1 SPIV: failed
Apr 14 07:35:31 kernel: CPU #1 not responding - cannot use it.
Apr 14 07:35:31 kernel: SMP alternatives: switching to SMP code
Apr 14 07:35:31 kernel: Booting processor 1/2 eip 11000
Apr 14 07:35:31 kernel: Initializing CPU#1
Apr 14 07:35:31 kernel: Calibrating delay using timer specific routine.. 4757.34 BogoMIPS (lpj=2378671)
Apr 14 07:35:31 kernel: CPU: Trace cache: 12K uops, L1 D cache: 8K
Apr 14 07:35:31 kernel: CPU: L2 cache: 512K
Apr 14 07:35:31 kernel: CPU: Physical Processor ID: 0


Second boot of same kernel:

Apr 14 07:57:09 kernel: Intel machine check architecture supported.
Apr 14 07:57:09 kernel: Intel machine check reporting enabled on CPU#0.
Apr 14 07:57:09 kernel: CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
Apr 14 07:57:09 kernel: CPU0: Thermal monitoring enabled
Apr 14 07:57:09 kernel: Checking 'hlt' instruction... OK.
Apr 14 07:57:09 kernel: SMP alternatives: switching to UP code
Apr 14 07:57:09 kernel: ACPI: Core revision 20060707
Apr 14 07:57:09 kernel: CPU0: Intel(R) Xeon(TM) CPU 2.40GHz stepping 07
Apr 14 07:57:09 kernel: SMP alternatives: switching to SMP code
Apr 14 07:57:09 kernel: Booting processor 1/1 eip 11000
Apr 14 07:57:09 kernel: CPU 1 irqstacks, hard=c0833000 soft=c082f000
Apr 14 07:57:09 kernel: Initializing CPU#1
Apr 14 07:57:09 kernel: Calibrating delay using timer specific routine.. 4757.43 BogoMIPS (lpj=2378715)
Apr 14 07:57:09 kernel: CPU: Trace cache: 12K uops, L1 D cache: 8K
Apr 14 07:57:09 kernel: CPU: L2 cache: 512K
Apr 14 07:57:09 kernel: CPU: Physical Processor ID: 3
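
For anyone who wants to check their own boot logs for the same failure, here is a minimal sketch in Python. It only looks for the "CPU #N not responding - cannot use it." line shown in the first boot log above. The function name failed_cpus and the sample list are illustrative assumptions, not part of any existing tool.

```python
import re

# Pattern for the failure line printed in the first boot log above.
NOT_RESPONDING = re.compile(r"CPU #(\d+) not responding - cannot use it")


def failed_cpus(log_lines):
    """Return the CPU numbers reported as not responding, in log order.

    failed_cpus is a hypothetical helper name used for illustration.
    """
    failed = []
    for line in log_lines:
        match = NOT_RESPONDING.search(line)
        if match:
            failed.append(int(match.group(1)))
    return failed


# Sample lines copied from the two boot logs in this report.
sample = [
    "Apr 14 07:35:31 kernel: ... APIC #1 SPIV: failed",
    "Apr 14 07:35:31 kernel: CPU #1 not responding - cannot use it.",
    "Apr 14 07:57:09 kernel: Initializing CPU#1",
]
print(failed_cpus(sample))  # prints [1]
```

To scan a real log, pass an open file object (for example the system log file) instead of the sample list; the function accepts any iterable of lines.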

Comment 1 Prarit Bhargava 2011-04-20 13:53:49 UTC
Apr 14 07:35:31 kernel: Not responding.
Apr 14 07:35:31 kernel: Inquiring remote APIC #1...
Apr 14 07:35:31 kernel: ... APIC #1 ID: failed
Apr 14 07:35:31 kernel: ... APIC #1 VERSION: failed
Apr 14 07:35:31 kernel: ... APIC #1 SPIV: failed
Apr 14 07:35:31 kernel: CPU #1 not responding - cannot use it.

I haven't seen anything like this reported, but the "Not responding." error indicates that the 2nd socket's 1st processor is not initializing for some reason.

Do you see the same issues with upstream/RHEL6/newer kernels?

P.

Comment 2 Steve Snyder 2011-04-20 14:13:53 UTC
(In reply to comment #1)
[snip]
> Do you see the same issues with upstream/RHEL6/newer kernels?

I haven't tried it.  Can I run a RHEL6 kernel on a RHEL 5.6 system?  I mean, without also installing a lot of dependent RPMs?

Comment 3 Jarod Wilson 2011-04-20 21:30:28 UTC
(In reply to comment #2)
> (In reply to comment #1)
> [snip]
> > Do you see the same issues with upstream/RHEL6/newer kernels?
> 
> I haven't tried it.  Can I run a RHEL6 kernel on a RHEL 5.6 system?  I mean,
> without also installing a lot of dependent RPMs?

No, a RHEL6 kernel will require a fair bit of updated userspace too.

Comment 4 Prarit Bhargava 2011-05-02 12:48:39 UTC
(In reply to comment #2)
> (In reply to comment #1)
> [snip]
> > Do you see the same issues with upstream/RHEL6/newer kernels?
> 
> I haven't tried it.  Can I run a RHEL6 kernel on a RHEL 5.6 system?  I mean,
> without also installing a lot of dependent RPMs?

Well, you *can* just take the vmlinuz from RHEL6 and try to boot it.  Obviously you won't be able to mount filesystems, etc.

But :(, as you said, the problem is intermittent.  Any idea of how often this happens?  1/10 boots?  1/100?

P.

Comment 5 Steve Snyder 2011-05-02 13:13:08 UTC
(In reply to comment #4)
[snip]
> But :(, as you said, the problem is intermittent.  Any idea of how often this
> happens?  1/10 boots?  1/100?

I've only seen it that one time, though I've rebooted the system several times since then.

Could have been a cosmic ray, I guess. :-)

Comment 6 Prarit Bhargava 2011-05-02 13:16:58 UTC
(In reply to comment #5)
> (In reply to comment #4)
> [snip]
> > But :(, as you said, the problem is intermittent.  Any idea of how often this
> > happens?  1/10 boots?  1/100?
> 
> I've only seen it that one time, though I've rebooted the system several times
> since then.
> 
> Could have been a cosmic ray, I guess. :-)

:)  If it happens again, please ping in this BZ.  Just for kicks I'll try finding a similar system within Red Hat and do an overnight reboot test (on each reboot I will confirm that the # of CPUs is what it should be).
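
The per-reboot CPU count check could be sketched roughly as below. This is only an illustration, not the actual test used within Red Hat: it assumes the count is taken from the "processor" entries in /proc/cpuinfo, and the names count_processors, cpus_missing, read_cpuinfo and EXPECTED_CPUS are hypothetical. The expected value of 4 comes from the reporter's hardware (2 Xeons with Hyper-threading).

```python
# Expected logical CPU count for the reporter's system (2 Xeons with
# Hyper-threading enabled). Adjust for other hardware.
EXPECTED_CPUS = 4


def count_processors(cpuinfo_text):
    """Count logical CPUs by counting 'processor : N' entries."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


def cpus_missing(expected, cpuinfo_text):
    """Return how many CPUs are missing relative to the expected count."""
    return max(expected - count_processors(cpuinfo_text), 0)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the CPU description exposed by a Linux kernel."""
    with open(path) as cpuinfo:
        return cpuinfo.read()
```

A boot-time script could call cpus_missing(EXPECTED_CPUS, read_cpuinfo()) after each reboot and stop the reboot loop, keeping the logs, whenever the result is non-zero.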

I have not seen any other reports of this issue FWIW.

P.

Comment 7 Prarit Bhargava 2011-08-08 17:45:08 UTC
Steve, closing this out as there haven't been any updates in months.

P.