Bug 431882
Summary: | One of 2 AMD CPUs getting shutdown | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Bruno Wolff III <bruno> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED CANTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | rawhide | CC: | bruno, gthaker, mingo, tglx |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-01-25 15:20:13 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Bruno Wolff III
2008-02-07 16:38:44 UTC
Created attachment 294241
Contents of /proc/cpuinfo when both cpus appear to be working
Created attachment 294245
/var/log/messages extract from when both CPUs used.
Created attachment 294246
/var/log/messages extract when only cpu 0 is used
As another data point, I consistently see a pause of several seconds during boot, with the line

    CPU 1 irqstacks, hard=c0820000 soft=c0800000

on the screen, when CPU 1 is not used, and no pause when it is used.

This is still happening in 2.6.25-0.95.rc4.fc9. At this point I think the occurrence rate is roughly between 10% and 50% of boots. I no longer believe that removing power had any effect, as I have had success just rebooting until it didn't happen.

I have very similar problems. I see:

    Mar 27 11:20:02 zadapi-L kernel: SMP alternatives: switching to UP code
    Mar 27 11:20:02 zadapi-L kernel: ACPI: Core revision 20070126
    Mar 27 11:20:02 zadapi-L kernel: CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ stepping 01
    Mar 27 11:20:02 zadapi-L kernel: SMP alternatives: switching to SMP code
    Mar 27 11:20:02 zadapi-L kernel: Booting processor 1/1 eip 3000
    Mar 27 11:20:02 zadapi-L kernel: CPU 1 irqstacks, hard=c07aa000 soft=c078a000
    Mar 27 11:20:02 zadapi-L kernel: Not responding.
    Mar 27 11:20:02 zadapi-L kernel: Inquiring remote APIC #1...
    Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 ID: 1000000
    Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 VERSION: 80050010
    Mar 27 11:20:02 zadapi-L kernel: ... APIC #1 SPIV: ff
    Mar 27 11:20:02 zadapi-L kernel: CPU #1 not responding - cannot use it.
    Mar 27 11:20:02 zadapi-L kernel: Total of 1 processors activated (4222.40 BogoMIPS).
    Mar 27 11:20:02 zadapi-L kernel: ENABLING IO-APIC IRQs
    Mar 27 11:20:02 zadapi-L kernel: ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
    Mar 27 11:20:02 zadapi-L kernel: Brought up 1 CPUs

I have rebooted about a dozen times with 2.6.23 and 2.6.24 kernels, and it has NEVER come back as a dual-core system for me. (So maybe my CPU has somehow permanently lost a core, though this could still be a software issue.) I can find no BIOS settings that might affect this.
Gautam

Can you make sure that you have the latest BIOS installed?

Created attachment 299439
Test patch which increases the boot delays
Can you please test the attached patch?
(In reply to comment #7)
> Can you make sure that you have the latest BIOS installed?

My machine is a Dell Inspiron 530. Its BIOS version is 1.0.7; Dell's support site has version 1.0.12. I will reflash and try it this weekend. One note, though: I know for a fact that about two months ago I ran happily with 2 cores always detected (I believe this was with the 2.6.23 SMP kernel). Now neither 2.6.24 nor 2.6.23 detects more than 1 core.
Gautam

In my case the motherboard is about 7 years old. I haven't seen a BIOS update for it in several years, and I doubt it is supported any more. I believe I have the last published revision installed.

Here is my update; it may not be very valuable, but this is where I am. I tried to update the BIOS by making a FreeDOS CD-ROM, booting from it, and running Dell's BIOS update program I531_109.exe. The program asked for a file to use for the upgrade, and since I had no file other than this .exe, I exited the program. I was glad that no harm was done and that I was able to reboot. The BIOS version number did not change, so I assume no upgrade was done. I thought nothing of it and started using the machine. However, a few hours later I noticed that I was running with both cores! Since this is my production machine and I have some long-running jobs already going, I have not rebooted again to see whether I continue to see 2 cores, but I will later tonight and post an update.
Gautam

When I rebooted my machine, it reverted to just a single core. So I have finally seen what Bruno observed previously: sometimes it can come up with more than 1 core. I will now try more seriously to upgrade my BIOS, and after that I will try the suggested patch. However, I am traveling for the next 48 hours, so it will be midweek.
Gautam

I should note that I am working with 2.6.23.15-137.fc8 #1 SMP. One of my apps takes a bit of work to get working with 2.6.24, so I just boot with 2.6.23 for now.
(Anyway, 2.6.23 had worked properly for me with 2 cores.)
Gautam

I have now seen this happen with 2.6.25-0.195.rc8.git1.fc9.i686.

Can you please try the patch with the increased boot delays? https://bugzilla.redhat.com/attachment.cgi?id=299439
Thanks, tglx

I'll try to test it during the week. I don't have the kernel source at home (where I have dial-up and where the problem machine is), so I won't get to try this until Monday evening at the earliest. I haven't built a modified kernel for a while, so it may take some experimenting to figure it out. I'll start with the kernel source RPM and go from there.

Dave, can you just put this patch into the next kernel RPM, please?
Thanks, tglx

It's in rawhide but will not be in today's build.

Will there be a Koji build of it during the day? I can grab that as easily as one from rawhide. (It looks like the current Koji build is the same as this morning's rawhide version, and there wasn't a comment about the above change, so I expect I need the next build.)

I see that -204 has started building, and barring build problems I should be able to bring home a testable update tonight. I'll try at least a few reboots tonight, though it will probably take a lot of reboots to confirm a fix on this system since the failure rate is fairly low.

I did 10 reboots this morning, and the short answer is that the delay only lengthens the pause before the boot continues with one CPU. In the list below, the pause/no-pause status has always been 100% correlated with 1 CPU or 2 CPUs, respectively, whenever I checked. Some of the tests ended in RAID failures (a different bug), and I forgot until late in the series that I could still look at /proc/cpuinfo in those cases.
1: No pause, RAID failure
2: Pause, RAID failure
3: No pause, 2 CPUs
4: Pause, RAID failure
5: No pause, 2 CPUs
6: No pause, RAID failure
7: No pause, 2 CPUs
8: Pause, 1 CPU
9: No pause, RAID failure, 2 CPUs
10: No pause, 2 CPUs

I was testing a fix in another bug this morning and saw 3 reboots out of 10 come up with just one of the two CPUs functional. This was with the 2.6.25-0.218.rc8.git7.fc9.i686 kernel.

The e100 driver was still having this issue with 2.6.25-1.fc9.i686. I am now using a different card, with a different driver, for the connection that was causing problems. Since I couldn't reliably get the problem to occur, it may take a while for it to happen again, or to have some confidence that the network-hang part of the issue is driver-specific.

Please ignore the last comment, as I accidentally added it to the wrong bug.

Changing version to '9' as part of the upcoming Fedora 9 GA. More information and the reason for this action are here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I saw this happen again with 2.6.25.4-30.fc9.i686.

I saw it again with kernel-2.6.25.11-97.fc9.i686. I have rebooted with this kernel several times on the machine, but the problem happened only once.

I am currently running the 2.6.25.11-60.fc8 #1 SMP kernel, and both my cores are currently detected. I have been experimenting and have observed a strong correlation: if my USB jump stick is plugged in when I reboot, only one core is detected. If I take it out and reboot (3-4 times tried so far), both cores are detected. With the USB stick plugged in, I have almost never gotten both cores since my troubles started.
Gautam

I have now seen this happen a couple of times with 2.6.26.3-17.fc9.i686. I also notice the wait is back down to a couple of seconds. That should be OK, though, as the longer wait wasn't helping anyway.

I have now seen this happen a couple of times with 2.6.27-0.305.rc5.git6.fc10.i686.
I have now seen this happen with the 2.6.27-0.337.rc6.git5.fc10.i686 kernel.

I am still seeing this with kernel-2.6.27-0.393.rc8.git7.fc10.i686.

I have now seen this happen a couple of times with kernel 2.6.27-3.fc10.i686.

I have now seen this happen a couple of times with kernel 2.6.27.3-44.fc10.i686.

I have now seen this happen running 2.6.27.7-135.fc10.i686. I switched the bug from F9 to F10, since I am now tracking F10 on the machine that has the problem.

I have now seen this happen with 2.6.27.7-137.fc10.i686.

I have now seen this with 2.6.27.9-152.rc2.fc10.i686.

I have now seen this with 2.6.29-0.267.rc8.git4.fc11.i686.PAE.

I have now seen this with kernel 2.6.29.1-103.fc11.i686.PAE.

I have now seen this on 2.6.31 kernels.

This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I have now seen this in the F13/rawhide 2.6.32 kernel, so it's still an ongoing issue. I'm guessing it is pretty specific to my system (maybe the motherboard hardware), as no one else seems to be reporting this.

Created attachment 399913
Extract from /var/log/messages
This is from a boot of 2.6.33-8.fc13.i686.PAE. This doesn't seem to happen often in 2.6.33 kernels.
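A minimal sketch, not part of the original thread, of how a /var/log/messages extract like the attached ones, or the contents of /proc/cpuinfo, could be checked for the symptom reported here. It assumes Python 3 and matches only the exact kernel messages quoted earlier in this bug ("CPU #1 not responding - cannot use it." and "Brought up 1 CPUs"); the function names are hypothetical, and the message wording may differ on other kernel versions.

```python
import re
from typing import List, Optional, Tuple

# Kernel messages as they appear in the log excerpt quoted in this bug
# (assumption: other kernel versions may word these differently).
NOT_RESPONDING = re.compile(r"CPU #(\d+) not responding - cannot use it\.")
BROUGHT_UP = re.compile(r"Brought up (\d+) CPUs?")


def scan_boot_log(text: str) -> Tuple[List[int], Optional[int]]:
    """Return the IDs of CPUs reported as not responding and the last
    'Brought up N CPUs' count found in the text (None if absent)."""
    dead = [int(m.group(1)) for m in NOT_RESPONDING.finditer(text)]
    counts = BROUGHT_UP.findall(text)
    brought_up = int(counts[-1]) if counts else None
    return dead, brought_up


def count_processors(cpuinfo: str) -> int:
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    return sum(1 for line in cpuinfo.splitlines()
               if re.match(r"processor\s*:", line))
```

On a running system, `count_processors(open("/proc/cpuinfo").read())` returning 1 on one of these dual-core machines would correspond to a failed boot; scanning the log as well distinguishes a CPU that never answered from one that was simply not configured.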
Still present with kernel-PAE-2.6.35.8-55.fc14.i686.

This is still happening with kernel-PAE-3.3.0-0.rc1.git0.3.fc17.i686.

This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX. (Please note: Our normal process is to give advance warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.

This is still happening with 3.11.0-0.rc2.git3.2.fc20.i686+PAE.

The machine that had this problem is mostly dead now, and given its age I don't think I am likely to try to get it going again. So I am closing this.