Bug 431882

Summary: One of 2 AMD CPUs getting shut down
Product: [Fedora] Fedora
Reporter: Bruno Wolff III <bruno>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CANTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Docs Contact:
Priority: low
Version: rawhide
CC: bruno, gthaker, mingo, tglx
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-01-25 15:20:13 UTC
Attachments:
Description | Flags
Contents of /proc/cpuinfo when both cpus appear to be working | none
/var/log/messages extract from when both CPUs used. | none
/var/log/messages extract when only cpu 0 is used | none
Test patch which increases the boot delays | none
Extract from /var/log/messages | none

Description Bruno Wolff III 2008-02-07 16:38:44 UTC
Description of problem:
I have noticed a few instances in the last month or so of my machine running only one of its two cpus. In one case I had to power off the power supply to clear things up. A reboot didn't work.
Currently I am running the 2.6.24-17.fc9 kernel, but it happened with a 2.6.23 kernel as well. I can't rule out there being some sort of hardware problem. I don't overclock the cpus but they are getting a bit old now. I have 2 AMD Athlon(tm) MP 2400+ (stepping 1). I doubt this is enough to go on, but I am hoping you can suggest some information to capture to help diagnose the problem or at least determine if it is a software or hardware problem.
Version-Release number of selected component (if applicable):
2.6.24-17.fc9

How reproducible:
I don't know. I know it has happened at least 3 times, but it doesn't happen all of the time. It may have been happening for a long time and I just didn't notice.

Steps to Reproduce:
1. At this point I don't have a reliable way to reproduce this problem.
Actual results:
/proc/cpuinfo shows info only about cpu 0 (sometimes).

Expected results:
/proc/cpuinfo should always show info about cpu 0 and cpu 1.
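
A quick way to tell which state a given boot ended up in is to count the "processor" entries in /proc/cpuinfo. The following is a minimal sketch and not part of the original report; the function names count_cpus and read_cpu_count are hypothetical, chosen only for illustration.

```python
def count_cpus(cpuinfo_text):
    """Count CPUs by counting 'processor' entries in /proc/cpuinfo text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        # Each CPU block starts with a line such as "processor\t: 0".
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


def read_cpu_count(path="/proc/cpuinfo"):
    """Read the given file (assumed to be /proc/cpuinfo on Linux) and count CPUs."""
    with open(path) as f:
        return count_cpus(f.read())
```

On the affected machine, read_cpu_count() would be expected to return 2 after a good boot and 1 after a bad one.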

Additional info:

Comment 1 Bruno Wolff III 2008-02-07 18:07:01 UTC
Created attachment 294241 [details]
Contents of /proc/cpuinfo when both cpus appear to be working

Comment 2 Bruno Wolff III 2008-02-07 18:16:08 UTC
Created attachment 294245 [details]
/var/log/messages extract from when both CPUs used.

Comment 3 Bruno Wolff III 2008-02-07 18:17:08 UTC
Created attachment 294246 [details]
/var/log/messages extract when only cpu 0 is used

Comment 4 Bruno Wolff III 2008-02-20 08:00:02 UTC
As another data point to this, I seem to be consistently seeing a several second
pause during the boot with the line of text:
CPU 1 irqstacks, hard=c0820000 soft=c0800000
on the screen when cpu1 is not used and no pause when it is used.

Comment 5 Bruno Wolff III 2008-03-08 16:13:01 UTC
This is still happening in 2.6.25-0.95.rc4.fc9.
At this point I think the occurrence rate is roughly between 10% and 50% of
boots. I no longer believe that removing power had any effect, as I have had
success just rebooting until it didn't happen.

Comment 6 Gautam Thaker 2008-03-27 21:09:32 UTC
I have very similar problems. I see:

Mar 27 11:20:02 zadapi-L kernel: SMP alternatives: switching to UP code
Mar 27 11:20:02 zadapi-L kernel: ACPI: Core revision 20070126
Mar 27 11:20:02 zadapi-L kernel: CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ stepping 01
Mar 27 11:20:02 zadapi-L kernel: SMP alternatives: switching to SMP code
Mar 27 11:20:02 zadapi-L kernel: Booting processor 1/1 eip 3000
Mar 27 11:20:02 zadapi-L kernel: CPU 1 irqstacks, hard=c07aa000 soft=c078a000
Mar 27 11:20:02 zadapi-L kernel: Not responding.
Mar 27 11:20:02 zadapi-L kernel: Inquiring remote APIC #1...
Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 ID: 1000000
Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 VERSION: 80050010
Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 SPIV: ff
Mar 27 11:20:02 zadapi-L kernel: CPU #1 not responding - cannot use it.
Mar 27 11:20:02 zadapi-L kernel: Total of 1 processors activated (4222.40 BogoMIPS).
Mar 27 11:20:02 zadapi-L kernel: ENABLING IO-APIC IRQs
Mar 27 11:20:02 zadapi-L kernel: ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Mar 27 11:20:02 zadapi-L kernel: Brought up 1 CPUs

I have rebooted about a dozen times w/ 2.6.23 and 2.6.24 kernels, and it has
NEVER come back as a dual core system for me. (So maybe my CPU has somehow
permanently lost a core, though this could still be a SW issue.) I can find no
settings in the BIOS that I can change that may impact this.

Gautam
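
The two telltale lines in the log extract above ("CPU #1 not responding" and "Brought up 1 CPUs") can be picked out of a /var/log/messages extract mechanically. The following is a minimal sketch, assuming the message wording matches the lines quoted in this comment; the function name summarize_boot is hypothetical, and other kernel versions may word these messages differently.

```python
import re

# Patterns taken from the log lines quoted in this comment (assumed wording).
NOT_RESPONDING = re.compile(r"CPU #(\d+) not responding")
BROUGHT_UP = re.compile(r"Brought up (\d+) CPUs")


def summarize_boot(log_text):
    """Return (cpus_brought_up or None, list of CPU numbers that did not respond)."""
    failed = [int(n) for n in NOT_RESPONDING.findall(log_text)]
    match = BROUGHT_UP.search(log_text)
    brought_up = int(match.group(1)) if match else None
    return brought_up, failed


sample = """\
Mar 27 11:20:02 zadapi-L kernel: CPU #1 not responding - cannot use it.
Mar 27 11:20:02 zadapi-L kernel: Brought up 1 CPUs
"""
print(summarize_boot(sample))  # prints (1, [1])
```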


Comment 7 Thomas Gleixner 2008-03-28 08:09:38 UTC
Can you make sure that you have the latest BIOS installed ?



Comment 8 Thomas Gleixner 2008-03-28 08:10:50 UTC
Created attachment 299439 [details]
Test patch which increases the boot delays

Can you please test the attached patch ?

Comment 9 Gautam Thaker 2008-03-28 19:44:42 UTC
(In reply to comment #7)
> Can you make sure that you have the latest BIOS installed ?
> 
> 

My machine is a Dell Inspiron 530. Its BIOS version is 1.0.7; Dell has version
1.0.12 at its support site. I will reflash and try it this weekend. One
question though: I know for a fact that before (or ~2 months) I ran
happily w/ 2 cores always detected (I believe this was w/ the 2.6.23 SMP kernel).
Now neither 2.6.24 nor 2.6.23 detects more than 1 core.

Gautam


Comment 10 Bruno Wolff III 2008-03-29 07:29:34 UTC
In my case the motherboard is about 7 years old. I haven't seen a new update in
several years for it and I doubt it is supported any more. I believe I have the
last rev that was published installed.

Comment 11 Gautam Thaker 2008-03-30 20:45:59 UTC
Here is my update, it may not be very valuable but this is where I am at.

I tried to update the BIOS by making a FreeDOS cdrom, booting from it, and
executing Dell's BIOS update pgm I531_109.exe. This program asked for a file to
use in the upgrade, and since I had no file other than this .exe I managed to
exit out of the program. I was glad that no harm was done and I was able to
reboot. The BIOS version number did not change, so I assume no upgrade was done.
I thought nothing of it and started to use the machine. However, a few hours
later I noticed that I was running w/ both the cores! Since this is my production
machine and I have some long running things already going I have not rebooted
again to see if I continue to see 2 cores, but I will later tonight and post an
update.

Gautam

Comment 12 Gautam Thaker 2008-03-31 13:03:13 UTC
When I rebooted my machine it reverted back to just a single core. So finally I
have seen what was prev. observed by Bruno, that at some times it can come up w/
more than 1 core.

I will now more seriously try to upgrade my BIOS and after that will try the
suggested patch. However, for the next 48 hours I am on travel so it will be mid week.

Gautam


Comment 13 Gautam Thaker 2008-03-31 13:04:49 UTC
I should note that I am working w/  2.6.23.15-137.fc8 #1 SMP. One of my apps is
a bit of work to get to work w/ 2.6.24 so I just boot w/ 2.6.23 for now.
(Anyway, 2.6.23 had worked for me w/ 2 cores properly)

Gautam


Comment 14 Bruno Wolff III 2008-04-06 04:58:08 UTC
I have now seen this happen with 2.6.25-0.195.rc8.git1.fc9.i686.

Comment 15 Thomas Gleixner 2008-04-06 14:22:31 UTC
Can you please try the patch with the increased boot delays ?
https://bugzilla.redhat.com/attachment.cgi?id=299439

Thanks,
       tglx


Comment 16 Bruno Wolff III 2008-04-06 15:28:15 UTC
I'll try to test it during the week. I don't have the kernel source at home
(where I have dial up and the problem machine is), so I won't get to try this
until Monday evening at the earliest.
I haven't built a modified kernel for a while so it may take a little playing
with to get it figured out. I'll start with the kernel src rpm and go from there.

Comment 17 Thomas Gleixner 2008-04-06 16:51:30 UTC
Dave, can you just put this patch into the next kernel rpm please ?

Thanks,
        tglx


Comment 18 Chuck Ebbert 2008-04-07 05:10:15 UTC
In rawhide but will not be in today's build.

Comment 19 Bruno Wolff III 2008-04-07 13:44:48 UTC
Will there be a Koji build of it during the day? I can grab that as easily as
one from rawhide. (It looks like the current koji build is the same as this
morning's rawhide version and there wasn't a comment about the above change, so
I expect I need the next build.)

Comment 20 Bruno Wolff III 2008-04-07 14:43:37 UTC
I see that -204 has started building and barring build problems I should be able
to bring home a testable update tonight.
I'll try at least a few reboots tonight, though it will probably take a lot to
confirm a fix on this system since the failure rate is fairly low.
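
How many reboots "a lot" is can be estimated roughly. The sketch below assumes each boot fails independently with a fixed probability, which is an assumption and not something established in this report. It computes the smallest number of consecutive good boots n for which (1 - p)^n drops to 5% or less, using the 10% and 50% bounds from comment 5 plus an illustrative 30%. The function name clean_boots_needed is hypothetical.

```python
import math


def clean_boots_needed(failure_rate, confidence=0.95):
    """Smallest n such that (1 - failure_rate)**n <= 1 - confidence.

    Assumes boots are independent with a constant failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


for rate in (0.10, 0.30, 0.50):
    print(f"failure rate {rate:.0%}: {clean_boots_needed(rate)} clean boots")
```

Under these assumptions a 10% failure rate needs 29 clean boots in a row, while a 50% rate needs only 5, so the lower the true rate, the longer a fix takes to confirm.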

Comment 21 Bruno Wolff III 2008-04-08 13:37:29 UTC
I did 10 reboots this morning and the short answer is the delay only lengthens
the time of the pause before continuing with one cpu.
In the list below, the pause/no pause status has always been 100% correlated
with 1 cpu or 2 cpus, respectively, when I have checked. Some of the tests ended
in raid failures (a different bug) and I forgot until late in the series that I
would still be able to look at /proc/cpuinfo in those cases.
1: No pause, Raid failure
2: Pause, Raid failure
3: No pause, 2 cpus
4: Pause, Raid failure
5: No pause, 2 cpus
6: No pause, Raid failure
7: No pause, 2 cpus
8: Pause, 1 cpu
9: No pause, Raid failure, 2 cpus
10: No pause, 2 cpus
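
The list above can be tallied as follows. This is a minimal sketch with the results transcribed by hand from the ten entries; a cpu count of None marks boots where /proc/cpuinfo was not checked.

```python
# Results transcribed from the list above: (boot, paused, cpus or None if unchecked)
results = [
    (1, False, None),
    (2, True, None),
    (3, False, 2),
    (4, True, None),
    (5, False, 2),
    (6, False, None),
    (7, False, 2),
    (8, True, 1),
    (9, False, 2),
    (10, False, 2),
]

paused = [boot for boot, p, _ in results if p]
checked = [(p, cpus) for _, p, cpus in results if cpus is not None]
# The reported correlation: a pause means 1 cpu, no pause means 2 cpus.
consistent = all(cpus == (1 if p else 2) for p, cpus in checked)

print(f"boots with pause: {len(paused)} of {len(results)}")
print(f"boots where cpu count was checked: {len(checked)}")
print(f"pause/cpu-count correlation holds: {consistent}")
```

This gives 3 paused boots out of 10, with the cpu count checked on 6 boots and matching the pause status on all of them.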

Comment 22 Bruno Wolff III 2008-04-17 16:22:32 UTC
I was testing a fix in another bug this morning and saw 3 reboots out of 10 come
up with just one of the two cpus functional. This was with the
2.6.25-0.218.rc8.git7.fc9.i686 kernel.

Comment 23 Bruno Wolff III 2008-04-23 20:23:48 UTC
The e100 driver was still having this issue with 2.6.25-1.fc9.i686. I am now
using a different card using a different driver for the connection that was
causing problems. Since I couldn't reliably get the problem to occur it may take
a bit for it to happen again or to have some confidence that the network hang
part of the issue is driver specific.

Comment 24 Bruno Wolff III 2008-04-23 20:26:57 UTC
Please ignore the last comment as I accidentally added it to the wrong bug.

Comment 25 Bug Zapper 2008-05-14 05:03:56 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 26 Bruno Wolff III 2008-05-24 21:28:12 UTC
I saw this happen again with 2.6.25.4-30.fc9.i686.

Comment 27 Bruno Wolff III 2008-08-01 21:31:32 UTC
I saw it again with kernel-2.6.25.11-97.fc9.i686. I have rebooted with this kernel
several times on the machine but only had the problem happen one of those times.

Comment 28 Gautam Thaker 2008-08-02 01:28:20 UTC
I am currently running the 2.6.25.11-60.fc8 #1 SMP kernel. Currently I am observing
both my cores. I have been doing some experimentation and I have observed a
strong correlation w/ this: if I have my USB jumpstick in when I reboot, then
only one core is detected. If I take it out and reboot (in the 3-4 times I have
tried so far), both cores are detected. W/ the USB stick in, I have almost never
gotten both my cores since my troubles started.

Gautam


Comment 29 Bruno Wolff III 2008-08-26 06:59:59 UTC
I have now seen this happen a couple of times with 2.6.26.3-17.fc9.i686.
I also notice the wait is back down to a couple of seconds again. That should be OK though, as the longer wait wasn't helping anyway.

Comment 30 Bruno Wolff III 2008-09-07 05:24:43 UTC
I have now seen this happen a couple of times with 2.6.27-0.305.rc5.git6.fc10.i686.

Comment 31 Bruno Wolff III 2008-09-20 15:08:16 UTC
I have now seen this happen with the 2.6.27-0.337.rc6.git5.fc10.i686 kernel.

Comment 32 Bruno Wolff III 2008-10-07 06:08:44 UTC
I am still seeing this with kernel-2.6.27-0.393.rc8.git7.fc10.i686.

Comment 33 Bruno Wolff III 2008-10-13 18:04:04 UTC
I have now seen this happen a couple of times with kernel 2.6.27-3.fc10.i686.

Comment 34 Bruno Wolff III 2008-10-25 14:32:15 UTC
I have now seen this happen a couple of times with kernel 2.6.27.3-44.fc10.i686.

Comment 35 Bruno Wolff III 2008-12-03 05:13:39 UTC
I have now seen this happen running 2.6.27.7-135.fc10.i686.
I switched the bug from F9 to F10 since I am tracking F10 now on the machine I am having the problem on.

Comment 36 Bruno Wolff III 2008-12-09 06:03:59 UTC
I have now seen this happen with 2.6.27.7-137.fc10.i686.

Comment 37 Bruno Wolff III 2008-12-13 17:03:57 UTC
I have now seen this with 2.6.27.9-152.rc2.fc10.i686.

Comment 38 Bruno Wolff III 2009-03-22 14:32:30 UTC
I have now seen this with 2.6.29-0.267.rc8.git4.fc11.i686.PAE.

Comment 39 Bruno Wolff III 2009-04-22 01:23:24 UTC
I have now seen this with kernel 2.6.29.1-103.fc11.i686.PAE.

Comment 40 Bruno Wolff III 2009-10-10 18:54:46 UTC
I have now seen this on 2.6.31 kernels.

Comment 41 Bug Zapper 2009-11-18 12:25:30 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 42 Bruno Wolff III 2009-11-18 14:02:31 UTC
I have now seen this in F13/rawhide's 2.6.32 kernel so it's still an ongoing issue.
I'm guessing it is pretty specific to my system (maybe motherboard hardware) as no one else seems to be reporting this.

Comment 43 Bruno Wolff III 2010-03-13 22:23:04 UTC
Created attachment 399913 [details]
Extract from /var/log/messages

This is from a boot of 2.6.33-8.fc13.i686.PAE. This doesn't seem to happen often in 2.6.33 kernels.

Comment 44 Bruno Wolff III 2010-11-15 14:32:22 UTC
Still present with kernel-PAE-2.6.35.8-55.fc14.i686.

Comment 45 Bruno Wolff III 2012-01-23 08:17:21 UTC
This is still happening with kernel-PAE-3.3.0-0.rc1.git0.3.fc17.i686.

Comment 46 Fedora End Of Life 2012-08-16 18:38:43 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 47 Fedora End Of Life 2013-07-04 06:48:44 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 48 Fedora End Of Life 2013-08-01 18:27:19 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 49 Bruno Wolff III 2013-08-05 20:38:45 UTC
This is still happening with 3.11.0-0.rc2.git3.2.fc20.i686+PAE.

Comment 50 Bruno Wolff III 2016-01-25 15:20:13 UTC
The machine that had this problem is mostly dead now and I don't think I am likely to try to get it going again given its age. So I am closing this.