Bug 417671
Summary: | [pata_amd] AMD desktop is unstable after installing F8: libata is suspected as the cause | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | James Scott Jr. <skoona> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED NEXTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | low | ||
Version: | 8 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Fedora V9, Kernel 2.6.25.14-108.fc9.i686 #1 SMP | Doc Type: | Bug Fix |
Last Closed: | 2008-11-26 23:50:40 UTC | Type: | --- |
Description
James Scott Jr.
2007-12-10 05:42:04 UTC
Please try the workarounds: https://fedoraproject.org/wiki/KernelCommonProblems and report back on which, if any, fix the problem. Also, can you get a copy of /var/log/dmesg after a successful boot and attach it to this bug report?

I was able to get a dmesg and the entire /var/log/messages; it's attached as jscott.dmesg.log.

From the link I tried apic=off, nohz=off, pci=nomsi,nommconf, pnpacpi=off, and pci=noacpi clock_source=acpi_pm; all in the correct order. With 'apic=off', X failed to start and the machine locked up when I tried to log in via the text console. The other options allowed the boot to proceed to the gdm login screen, where the machine locked up before I could type my userid. I can normally make it to GNOME's desktop if I start from a cold boot (i.e. not the hardware reset button, but power off, then on).

Created attachment 283611 [details]
dmesg log and the /var/log/messages system log files
This is the dmesg from a boot using the xen kernel. Xen is not related to the
bug report; a xen kernel fails in exactly the same way as the regular F8 kernel.
The boot option was 'combined_mode=ide'.
James
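
Editorial note: when several boot parameters are being tried one after another, it helps to confirm which ones the running kernel actually received. The following is a minimal sketch, not part of the original report, that parses a kernel command line such as the contents of /proc/cmdline. The function name `parse_cmdline` is hypothetical, and the parser assumes simple whitespace-separated tokens (quoted values containing spaces are not handled).

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'noapic') map to None. If a name is
    repeated (for example two console= entries), the values are collected
    in a list in the order given.
    Assumption: tokens are separated by whitespace and contain no quotes.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        value = value if sep else None
        if name in params:
            existing = params[name]
            if not isinstance(existing, list):
                existing = [existing]
            existing.append(value)
            params[name] = existing
        else:
            params[name] = value
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Return the parsed command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

For example, checking `"noapic" in read_running_cmdline()` after a reboot would show whether the option was really passed, and it makes visible that 'apic=off' and 'noapic' are stored under different names.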
While waiting on feedback I re-installed using the default LVM storage model for F8. The machine failed in exactly the same way as with RAID5. I then re-installed with individual drives (i.e. /dev/sda=/, /dev/sdc=/home, /dev/sdd=/var/local), with no boot params, just a fresh install. I then restored my video library (part of my backup from F6) to each volume to see if it would fail. It failed about 15gb into a 174gb restore (actually: '# cp -a /source /target/'). The failure had the same signature as before: disk light on solid, keyboard/mouse non-responsive, screen frozen.

I repeated each of the actions that caused the failures with LVM, RAID5, and standalone disks. Because the standalone attempt singled out a particular disk (sda), I replaced it with another one and tried again; same failure. This got me thinking about what was different between sda and sdd or sdc. sdc/sdd are on ide channel-1 as primary/secondary, while sda is primary/master on ide channel-0 with a dvd drive. sda is using UDMA-100 and the dvd drive (Lite-On) is using UDMA-33. I uncabled the dvd drive, changed the master-with-slave setting of sda to master-with-no-slave and retried the 174gb restore. It made it to 55gb before failing, but it did fail in exactly the same way.

Now I will go buy or make a null serial cable while I await some assistance.

Note: the copy/delete cycle of 174gb passed repeatedly for sdc and sdd during the individual/standalone disk tests. Only sda failed, at 15 to 55gb of a 174gb copy.

(In reply to comment #2)
> I was able to get a dmesg and the entire /var/log/messages; it's attached as
> jscott.dmesg.log.
>
> From the link I tried, apic=off, nohz=off, pci=nomsi,nommconf, pnpacpi=off,
> pci=noacpi clock_source=acpi_pm; all in the correct order.

"apic=off" won't do anything; try "noapic"

Tried 'noapic' alone with no change in result; i.e. it still locks up under high disk activity. Again, trying to restore my video backup of 174gb locks up the machine after 25gb is transferred.
F8 is still not in a stable state on this platform, and I am at a loss (short of a null cable and kernel debugging) as to why.
James

(In reply to comment #4)
> This got me thinking of what was different between sda and sdd or sdc. sdc/sdd
> are on ide channel-1 as primary/secondary; while sda is primary/master on ide
> channel-0 with a dvd drive. sda is using UDMA-100 and the dvd drive (lite On)
> is using UDMA-33. I uncabled the dvd drive, changed the master-with-slave
> setting of sda to master-with-no-slave and retried the 174gb restore - it made
> it to 55gb before failing - but it did fail the exact same way.

Silly question, but is the master drive at the end of the cable? Can you try swapping one of the drives from the other connector to where sda is connected to see if the error follows the controller port or the drive?

Yes, the master is at the very end of the cable, with the dvd in the middle of the cable. I'm hoping my partitions are labeled so I can do that without re-installing. Before posting this bug I replaced the motherboard with a new one (as in new), then purchased a new 250gb drive and attempted to re-install using raid5. After each failure I moved the new drive from port to port, replacing an existing drive, with no change in the failure. To me this ruled out the drives themselves. Now I am banging on the port and made this observation: only /dev/sda fails.

I will swap primary/secondary on the same IDE channel, then move the boot disk to the other channel to see if the failure follows. I am using kernel 2.6.23.9-85.fc8 with no boot params.

I will place my boot disk in ide0.0, ide0.1, ide1.0, and ide1.1 to see if it continues to fail. Also, I will re-verify ide0.0 once I move the current disk off it. This will take me until Saturday morning.

First I switched the ide channel cables: the failing disk was master on channel 0. Swapping the cable made it master on channel 1.
The result was that all drives failed when trying to restore the 174gb backup; they each failed near 50gb of progress. I swapped every ide drive across every ide port, again with no positive result. In frustration, I attempted to re-install CentOS and then Fedora 6, both of which did work at one point. The installs themselves failed with the disk light on and the keyboard/mouse locked, during grub-install or earlier while loading rpms.

I then purchased another 250gb ide drive and replaced the prior sda (now sdc) drive. I swapped every ide drive across every ide port, again with no positive result, and the lockup still happens randomly no matter which drive is being driven. However, I can now install F8 without boot params and sign on before it locks. Adding the boot param 'noapic' gives me the opportunity to run 'yum -y update'; it has not finished yet.

If 'noapic' does not resolve the problem I will upgrade my MSI K7D bios from V190 to V191. From there, if the failure still persists, I'm tempted to unplug it and drag it into the back room.

Status:
2 of the three original IDE drives were replaced
1 motherboard was replaced
1 DVD drive was replaced
All drive cables have been replaced
All drive configurations contain the failure in some fashion
No error message has been recorded, except that the Ubuntu install presented 'Disk Failure: 80, AX=4280 Drive 9F'
No boot param makes a significant difference
No Linux distribution tried is failure free: CentOS5, Ubuntu, Fedora6, and Fedora8

Comment: I have no clue why this is happening, nor do I have a clue as to what to try next. The absence of an error message is frustrating. I have enough skill to try anything, and I've tried almost every reasonable thing. This platform has run for three years without issue and was running fine when I chose to upgrade it in Oct. The only things I have not changed are the power supply, the 2gb of registered memory, and the two AMD MP2400 CPUs.
The CPUs don't seem a reasonable suspect, and the power supply voltages seem normal (Antec 450). Memtest has run for several days recording no errors (and I did remove a module during testing).

12-26-2007 Update:
Set up a null cable to have a serial console available. No error messages recorded.
Updating the BIOS failed; I tried two amdflash.exe routines, and the keyboard locks.
With F8 installed and /dev/md0 in raid5 mode over three 250gb ide disks, the 2nd boot requires the array to be resynced. It's this resync that fails and locks the machine. From the remote serial console I watched a reboot, without signing on via gdm, and whenever the resync got past 68% complete it would lock the keyboard/mouse and the whole machine, without an error message, before logging resync complete.
When the BIOS has PNP enabled, raid5 init would show it using '2 or 3 drives, 1 as spare'. With PNP disabled, F8 boots with '3 of 3 drives, 0 spares', the way I configured it.

Since I have no error messages and I have replaced most things, I removed CPU1 from the mainboard. Boot params are: noapic console=tty0 console=ttyS0. The system boots, completes a 2 hour resync (250gb drives) and is now transferring my 174gb backup via an nfs mount from my backup server, with no issues.

I plan to flash/upgrade the bios and swap the CPUs when the current restore completes. If things continue to work after swapping CPUs, I will reinstall both of them and see what's up.

Created attachment 290435 [details]
dmesg.log from 1-CPU and ide raid5 - good run
140gb restore of 174gb complete.
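
Editorial note: the resync lockups in this report were observed by watching the serial console. As a minimal sketch, not part of the original report, the resync percentage can also be read from the text of /proc/mdstat and logged to another machine, so the last value reached before a freeze is known. The function name `resync_progress` is hypothetical, and the regular expression assumes the usual progress-line layout that md prints during a resync or recovery.

```python
import re

# Assumed layout of the md progress line (illustrative, not from the report):
#   [=====>...............]  resync = 68.2% (blocks_done/blocks_total) ...
_PROGRESS = re.compile(r"(resync|recovery|reshape|check)\s*=\s*([0-9]+(?:\.[0-9]+)?)%")
_DEVICE = re.compile(r"^(md\d+)\s*:")


def resync_progress(mdstat_text):
    """Return {array_name: (operation, percent)} for arrays that are syncing.

    Arrays that are idle, or whose resync is only pending or delayed,
    have no percentage on their progress line and are left out.
    """
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        device = _DEVICE.match(line)
        if device:
            current = device.group(1)
            continue
        hit = _PROGRESS.search(line)
        if hit and current:
            progress[current] = (hit.group(1), float(hit.group(2)))
    return progress
```

Calling `resync_progress(open("/proc/mdstat").read())` in a loop and sending each result to the backup server would record how far the resync got before a lockup.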
OK, with one CPU installed, I was finally able to update the bios from version 190 to 191. A fresh install went forward without incident, using only the console=ttyx boot params. I restored the full backup of 230gb and misc other saved directories and cvs repos, using nfs, ssh-sftp, and cifs all at the same time. I.e. I beat up on the disk as hard as I could without incident or error.

Next I swapped the CPUs to be sure that each one was working. Again, a range of actions which normally produce a lockup passed without issue.

Next I re-installed both CPUs (AMD MP 2400+ @ 2.0GHz) and the machine locked up during the IPL, somewhere near the sshd or cupsd initialization point. I added 'noapic' as a boot param and the machine booted to the gdm sign-on, where I was able to sign on. The act of using 'vim' to add noapic to the current grub.conf file caused a lockup on save of that file. At this moment I can't get a dmesg, lsmod, or any other file, because it's not stable long enough to scp one out.

I have carefully reviewed my BIOS config settings and can verify they all have the same values I would have used last year. The serial console records no errors, and there are none recorded in the messages file. However, I did notice once, in an 'lsmod' listing, that both libata and pata_amd are loaded.

This uniprocessor experience demonstrates for me that F8 can be installed on this platform and will operate normally. My SMP experience highlights that there is some hardware or driver issue that interferes with F8's normal operation. I have done everything I can think of to prove out the hardware, and I'm left with a moderate degree of assurance that the hardware is fine.

'noapic' is the one boot param that I know of that provides a measure of stability. However, I don't understand which driver is receiving this benefit.

I now have the debug kernel available, but I don't yet know how to turn on its debugging features.
Is there anything you can suggest that I explore to resolve this problem or at least generate an error message?

Created attachment 290527 [details]
dmesg log using boot parm of 'nosmp'
The only boot param I find that avoids the lockup or hang error is 'nosmp'.
When I physically removed one of the two CPUs in the platform, I experienced no
errors or lockups. With both CPUs back in their sockets, the lockup happens
immediately during IPL; with 'noapic', IPL completes and then the machine fails
within seconds of a sign-on through GDM.
I tried the debug kernel with the 'debug' boot option; no errors were recorded
and the lockup occurred. I tried boot param 'debug 1' to boot into single user
mode, and the platform was unstable, locking up after two or three hours, or
immediately, for no repeatable reason. In single user mode I can reliably
allow the raid5 driver to resync the array after a lockup, preventing a forced
re-install of F8.
Question: Libata/pata_amd seems to work fine in non-SMP mode; should I
reclassify this problem as 'SMP' related? As long as I'm in non-SMP mode,
everything works fine!
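
Editorial note: since the behaviour differs between 'nosmp' and full SMP boots, it can help to record, on each boot, how many CPUs the kernel brought online and whether libata and pata_amd are loaded. The following is a minimal sketch, not part of the original report. It assumes a kernel that exposes /sys/devices/system/cpu/online (which may not exist on the F8-era kernel discussed here) and the standard /proc/modules layout, where the module name is the first field on each line. Both function names are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1' or '0,2-3' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        if sep:
            # A range like '0-1' is inclusive at both ends.
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(first))
    return cpus


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
```

With 'nosmp', `parse_cpu_list` applied to the online file would be expected to return a single CPU, and `{"libata", "pata_amd"} <= loaded_modules(...)` would confirm that the same drivers are loaded in both modes.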
Created attachment 290567 [details]
Successful F8 SMP boot - dmesg.log
The boot options 'noresume maxcpus=2' seem to have resolved the stability
problems I was having. (maxcpus=4 caused a lockup.)
The platform has been stable for 1 hour. I will post again after a few more
days of validation.
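
Editorial note: the validation runs in this report rely on a large copy that locks the machine partway through, with the point of failure estimated afterwards. As a minimal sketch, not part of the original report, a copy can record its own progress and force each record to disk, so the last byte count reached survives a hard freeze. The function name `copy_with_progress` and its parameters are hypothetical. The progress log would ideally live on a disk or network mount other than the one under test, since a log on the failing disk may itself be lost.

```python
import os


def copy_with_progress(src, dst, log_path, chunk_size=1024 * 1024, report_every=64):
    """Copy src to dst, recording the byte count reached in log_path.

    After every `report_every` chunks the running total is appended to the
    log, flushed, and fsync'ed, so the last recorded position has been
    handed to the storage device before the copy continues.
    Returns the total number of bytes copied.
    """
    copied = 0
    chunks = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout, open(log_path, "a") as log:
        while True:
            block = fin.read(chunk_size)
            if not block:
                break
            fout.write(block)
            copied += len(block)
            chunks += 1
            if chunks % report_every == 0:
                log.write(f"{copied}\n")
                log.flush()
                os.fsync(log.fileno())
        # Mark a clean finish so an interrupted run is easy to tell apart.
        log.write(f"done {copied}\n")
        log.flush()
        os.fsync(log.fileno())
    return copied
```

A log that ends without a "done" line shows roughly where the copy stopped, to within `report_every` chunks.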
OK, I spoke too quickly. With the tvtime TV viewer playing, I removed the existing video backup and started to restore it. The keyboard/mouse and screen are frozen, but the audio is still playing. I'm going to let it run and see if it comes back after a few hours (restore time).
James

Created attachment 291030 [details]
dmesg after changing video card from ATI to NV.
Googling these SMP failures, I found some mention of X or the screen driver being
responsible for the lockups. I then swapped graphics cards to test this;
not fixed. The NV card performed worse than the ATI regarding lockups when
attempting 2-CPU or 1-CPU operation.
This now means I have swapped everything except the PSU in the box. These
lockups under full SMP mode continue.
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora V9, Kernel 2.6.25.14-108.fc9.i686 #1 SMP -- resolved this issue.