Bug 417671

Summary: [pata_amd] AMD desktop is unstable after installing F8: libata is suspected as the cause
Product: [Fedora] Fedora
Reporter: James Scott Jr. <skoona>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NEXTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 8
Hardware: i686   
OS: Linux   
Fixed In Version: Fedora V9, Kernel 2.6.25.14-108.fc9.i686 #1 SMP
Doc Type: Bug Fix
Last Closed: 2008-11-26 23:50:40 UTC
Attachments (description; flags):
- dmesg log and the /var/log/messages system log files; none
- dmesg.log from 1-CPU and ide raid5 - good run; none
- dmesg log using boot parm of 'nosmp'; none
- Successful F8 SMP boot - dmesg.log; none
- dmsg after changing video card from ATI to NV.; none

Description James Scott Jr. 2007-12-10 05:42:04 UTC
Description of problem:
I did a fresh install to upgrade from F6 to F8.  libata changed the names of my
internal software raid 5 ide configuration and the machine is now unstable.  I
can gather no real info to help debug this issue, as I can't keep it running
long enough to even print the screen.  Installation was a real bear; I must have
tried it 40 times, only to have the machine lock up at random points.  The lockup
is hard: no keyboard, mouse, or ssh response, and the disk light is on solid (no
blinks).  A lockup normally corrupts 'md0' and forces a reinstall to correct.  I
would consider this bug fixed if there were a kernel boot param that would
resolve my problem.

Version-Release number of selected component (if applicable):
Fresh install from the original F8-DVD.iso

How reproducible:
For me it is very reproducible; any attempt to boot and sign on is met with a
lockup 95% of the time.  If I can log in, then any attempt to do significant I/O
will lock it up, such as restoring a 200gb backup or 'yum -y update'.

Steps to Reproduce:
1. Boot and log in
2. Move 1gb of files on or off 'md0'
  
Actual results:
The machine locks up, leaving the disk activity lights lit and the keyboard/mouse frozen.

Expected results:
Normal operations of F8

Additional info:
My machine was hand made in 2003 using:
MSI K7D Dual AMD SocketA, with amd-760 Chipset
Dual AMD 2400 MP processors
Three Western Digital 250GB IDE drives (2 partitions: 1gb=/boot/swap2,
249gb=software-raid-5 (md0))
** The typical uptime has been 200 days.  I am fairly certain it's not a
hardware problem; I changed the registered memory, mainboard, and the disks.

This problem stopped me from installing F8; the kernel boot parm
'combined_mode=ide' seems to let me install and sometimes log in before it locks.
However, I am not convinced it's working for me.

Comment 1 Chuck Ebbert 2007-12-10 19:28:34 UTC
Please try the workarounds:

https://fedoraproject.org/wiki/KernelCommonProblems

And report back on which, if any, fix the problem.


Also, can you manage to get a copy of /var/log/dmesg after successful boot and
attach it to this bug report?



Comment 2 James Scott Jr. 2007-12-11 03:34:52 UTC
I was able to get a dmesg and the entire /var/log/messages; it is attached as
jscott.dmesg.log.

From the link I tried apic=off, nohz=off, pci=nomsi,nommconf, pnpacpi=off,
pci=noacpi clock_source=acpi_pm; all in the correct order.  'apic' failed to
start X and locked up when I tried to log in via the text console.  The other
options allowed boot to proceed to the gdm-login screen and locked up before I
could type my userid.  I can normally make it to gnome's desktop if I start from
a cold boot (i.e. not a hardware reset button, but power off, then on).
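When trying many boot parameters in turn, it can help to confirm what the running kernel actually received. The following is a minimal sketch, assuming a Linux system that exposes /proc/cmdline; the function names are hypothetical and not part of any Fedora tool. It only reports what was passed on the command line and cannot tell whether the kernel recognizes a given parameter.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as 'noapic' map to None; 'name=value' pairs keep the value.
    If a name repeats (e.g. two console= entries), the last one wins.
    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_booted_params(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

For example, `"noapic" in read_booted_params()` is true only if the bare flag was really on the boot line.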

Comment 3 James Scott Jr. 2007-12-11 03:38:03 UTC
Created attachment 283611 [details]
dmesg log and the /var/log/messages system log files

This is the dmesg from a boot using the xen kernel.  xen is not related to the
bug report; a xen kernel fails in the exact same way as the regular F8 kernel.
The boot option was 'combined_mode=ide'.

James

Comment 4 James Scott Jr. 2007-12-17 19:31:03 UTC
While waiting on feedback I re-installed using the default LVM storage model for
F8.  The machine failed in the exact same way as with Raid5.   

I then re-installed with individual drives (i.e. /dev/sda=/, /dev/sdc=/home,
/dev/sdd=/var/local).  No boot params, just a fresh install.  I then restored my
video library (part of my backup from F6) to each volume to see if it would
fail.  It failed about 15gb into a 174gb restore (actually: '# cp -a /source
/target/').  The failure had the same signature as before: disk light on solid,
keyboard/mouse non-responsive, screen frozen.

I repeated each of the actions that caused the failures with LVM, RAID5, and
standalone disk.  Because the standalone attempt singled out a particular disk
(sda), I replaced it with another one and tried again; same failure result.

This got me thinking about what was different between sda and sdd or sdc.  sdc/sdd
are on ide channel-1 as primary/secondary, while sda is primary/master on ide
channel-0 with a dvd drive.  sda is using UDMA-100 and the dvd drive (Lite-On)
is using UDMA-33.  I uncabled the dvd drive, changed the master-with-slave
setting of sda to master-with-no-slave and retried the 174gb restore - it made
it to 55gb before failing - but it did fail the exact same way.

Now: I will go buy or make a null serial cable while I await some assistance.

Comment 5 James Scott Jr. 2007-12-17 19:37:07 UTC
Note: the copy/delete cycle of 174GB passed repeatedly for sdc & sdd during the
individual/standalone disk tests.  Only sda failed, at 15 to 55gb of a 174gb copy.
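Because the lockups leave no error message, a copy test that reports its own progress makes it easier to see how far a transfer got before the hang. Below is a minimal sketch of such a test; it is an illustration, not the 'cp -a' procedure actually used above. The function name and default sizes are assumptions.

```python
import os
import sys


def copy_with_progress(src, dst, chunk_size=1 << 20, report_every=64 << 20,
                       log=sys.stderr):
    """Copy src to dst in chunks, logging a byte count at regular intervals.

    Before each report the data is flushed and fsync'ed, so the last line
    seen on a console reflects data that was actually pushed to the disk.
    Returns the total number of bytes copied.
    """
    copied = 0
    next_report = report_every
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            if copied >= next_report:
                fout.flush()
                os.fsync(fout.fileno())
                print(f"copied {copied} bytes", file=log, flush=True)
                next_report += report_every
        # Final flush so the whole file is on disk before returning.
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

Run against one drive at a time, with output sent to a serial console, the last reported byte count gives a rough failure point for each disk.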

Comment 6 Chuck Ebbert 2007-12-18 00:09:46 UTC
(In reply to comment #2)
> I was able to get a dmesg and the entire /var/log/messages; it is attached as
> jscott.dmesg.log.
> 
> From the link I tried, apic=off, nohz=off, pci=nomsi,nommconf, pnpacpi=off,
> pci=noacpi clock_source=acpi_pm; all in the correct order.

"apic=off" won't do anything; try "noapic"

Comment 7 James Scott Jr. 2007-12-20 02:23:34 UTC
Tried 'noapic' alone with no change in result; i.e. still locks up with high
disk activity.  Again, trying to restore my video backup of 174gb locks up the
machine after 25gb is transferred.  F8 is still not in a stable state on this
platform, and I am at a loss (short of a null cable and kernel debugging) as to why.

James

Comment 8 Chuck Ebbert 2007-12-21 20:32:47 UTC
(In reply to comment #4)
> This got me thinking about what was different between sda and sdd or sdc.  sdc/sdd
> are on ide channel-1 as primary/secondary, while sda is primary/master on ide
> channel-0 with a dvd drive.  sda is using UDMA-100 and the dvd drive (Lite-On)
> is using UDMA-33.  I uncabled the dvd drive, changed the master-with-slave
> setting of sda to master-with-no-slave and retried the 174gb restore - it made
> it to 55gb before failing - but it did fail the exact same way.
> 

Silly question, but is the master drive at the end of the cable?

Can you try swapping one of the drives from the other connector to where sda is
connected to see if the error follows the controller port or the drive?


Comment 9 James Scott Jr. 2007-12-21 21:28:57 UTC
Yes, the Master is at the very end of the cable, with the dvd in the middle of
the cable.

I'm hoping my partitions are labeled so I can do that without re-installing.
Otherwise, before posting this bug I replaced the motherboard with a new one (as
in new), then I purchased a new 250gb drive and attempted to re-install using
raid5 - after each fail I'd move the new drive from port to port, replacing an
existing drive, with no change in the failure.  To me this ruled out the drives
themselves.  Now I am banging on the port and made this observation, that only
/dev/sda fails.  I will swap primary/secondary on the same IDE channel, then
move the boot disk to the other channel to see if the failure follows.

I am using Kernel 2.6.23.9-85.fc8 with no boot params.
I will place my boot disk in ide0.0, ide0.1, ide1.0, and ide1.1 to see if it
continues to fail.  Also, I will reverify ide0.0 once I move the current disk
off it.

This will take me until Saturday Morning.

Comment 10 James Scott Jr. 2007-12-25 17:59:09 UTC
First I switched ide channel cables: the failing disk was master on channel 0.
Swapping the cable made it master on channel 1.  The result was that all drives
failed when trying to restore the 174gb backup - they each failed near 50gb of
progress.  I swapped every ide drive across every ide port, again with no
positive result.

In frustration, I attempted to re-install CentOS, then Fedora 6, both of which
did work at one point - the actual installs failed with the disk light lit
and the keyboard/mouse locked, during grub-install or earlier while loading rpms.

I then purchased another 250gb ide drive and replaced the prior sda (now sdc)
drive.  I swapped every ide drive across every ide port, again with no positive
result, and the lockup failure still happens randomly no matter which drive is
being driven.  However, I can now install F8 without boot parms and sign on
before it locks.  Adding the boot parm 'noapic' allows me the opportunity to run
'yum -y update' - it has not finished yet.

If 'noapic' does not resolve the problem I will upgrade my MSI K7D bios from
V190 to V191.  From there, if the failure still persists, I'm tempted to unplug it
and drag it into the back room.

Status:
2 of the three original IDE drives were replaced
1 motherboard was replaced
1 DVD drive was replaced
All drive cables have been replaced
All drive configurations show the failure in some fashion
No error message has been recorded
- except Ubuntu install presented 'Disk Failure: 80, AX=4280 Drive 9F'
No boot param makes a significant difference
No Linux distribution tried is failure free
- CentOS5, Ubuntu, Fedora6, and Fedora8

Comment:
I have no clue why this is happening, nor do I have a clue as to what to try
next.  The absence of an error message is frustrating.  I have a significant
amount of skill to try anything and I've tried almost every reasonable thing.
This platform has run for three years without issue and was running fine when I
chose to upgrade it in Oct.  The only things I have not changed are the power
supply, 2gb registered memory, and the two AMD MP2400 CPUs.  The CPUs don't seem
a likely cause, and the power supply voltages seem normal (antec 450).  Memtest has
run for several days recording no errors (and I did remove a module during testing).


Comment 11 James Scott Jr. 2007-12-27 03:39:42 UTC
12-26-2007 Update:
Set up a null cable to have a serial console available.  No error messages recorded.
Updating the BIOS failed - tried two amdflash.exe routines; the keyboard locks.

With F8 installed and /dev/md0 in raid5 mode over three 250gb ide disks, the 2nd
boot requires the array to be resync'ed.  It's this resync that fails and locks
the machine.  From the remote serial console I watched a reboot, without signing
on via gdm, and whenever the resync got past 68% complete it would lock the
keyboard/mouse and the whole machine, without an error message, before logging
resync complete.
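Resync progress of a software raid array can be read from /proc/mdstat, which makes it possible to log the percentage to the serial console as the resync runs. The following is a minimal sketch of a parser for that progress line; the function name is hypothetical and the sample text in the usage note is illustrative, not taken from this machine.

```python
import re

# An array stanza starts with a line such as "md0 : active raid5 ...".
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# The progress line looks like "... resync = 68.2% (done/total) finish=...".
PROGRESS_RE = re.compile(r"(resync|recovery)\s*=\s*(\d+(?:\.\d+)?)%")


def resync_progress(mdstat_text):
    """Return {array: (action, percent)} for arrays that are resyncing.

    Arrays that are idle do not appear in the result.
    """
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        found = PROGRESS_RE.search(line)
        if found and current:
            progress[current] = (found.group(1), float(found.group(2)))
    return progress
```

Calling `resync_progress(open("/proc/mdstat").read())` in a loop with a short sleep, and printing the result, would leave the last percentage reached visible on the console if the machine hangs.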

When BIOS has PNP enabled, raid5 init would show using '2 of 3 drives, 1 as
spare'.  With PNP changed to disabled, F8 boots with '3 of 3 drives, 0 spares',
the way I configured it.

Since I have no error messages and I have replaced most things, I removed CPU1
from the mainboard.  Boot parms are: noapic console=tty0 console=ttyS0.  System
boots, completes a 2 hour resync (250gb drives) and is now transferring my 174gb
backup via a nfs mount from my backup server, with no issues.

I plan to flash/upgrade the bios, and swap the CPUs when the current restore
completes.  If things continue to work after swapping CPUs, I will reinstall
both of them and see what's up.

Comment 12 James Scott Jr. 2007-12-27 04:51:37 UTC
Created attachment 290435 [details]
dmesg.log from 1-CPU and ide raid5 - good run 

140gb of the 174gb restore is complete.

Comment 13 James Scott Jr. 2007-12-28 19:04:53 UTC
Ok,
With one CPU installed, I was finally able to update the bios to version 191
from 190.  A new fresh install went forward without incident; using only the
console=ttyx boot params.  I restored the full backup of 230gb and misc other
saved directories and cvs repos, using nfs, ssh-sftp, and cifs all at the same
time; i.e. I beat up on the disk as hard as I could without incident or error.

Next I swapped the CPUs to be sure that each one was working.  Again, a range of
actions which normally produce a lockup passed without issue.

Next I re-installed both CPUs (AMD MP 2400+ @ 2.0GHz) and the machine locked up
during the ipl, somewhere near the sshd or cupsd initialization point.  I added
'noapic' as a boot parm and the machine booted to the gdm signon, where I
was able to sign on.  The act of using 'vim' to add noapic to the current
grub.conf file caused a lockup on save of that file.  At this moment I can't
get a dmesg, lsmod, or any other file, because it's not stable long enough to
scp one out.

I have carefully reviewed my BIOS config settings and can verify they have all
the same values I would have used last year.

The serial console records no errors, and there are none recorded in the
messages file.  However, I did notice once that both libata and pata_amd are
loaded via an 'lsmod' command listing.
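The information 'lsmod' prints comes from /proc/modules, so the same check can be scripted and its result echoed to the serial console early in boot. This is a minimal sketch, assuming the usual field layout of that file (name, size, reference count, comma-separated users); the function name is hypothetical and the sizes and addresses in the usage note are made up.

```python
def parse_modules(text):
    """Map each loaded module name to the list of modules that use it.

    Expects the /proc/modules layout: name, size, refcount, users, state,
    address.  A users field of '-' means nothing depends on the module.
    """
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, used_by = fields[0], fields[3]
        if used_by == "-":
            users = []
        else:
            # The users list ends with a trailing comma, hence the filter.
            users = [u for u in used_by.split(",") if u]
        modules[name] = users
    return modules
```

Given a line such as `libata 116208 2 pata_amd,ata_generic, Live 0xf8a4e000`, the result maps `libata` to `["pata_amd", "ata_generic"]`, which shows that pata_amd sits on top of libata rather than competing with it.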

This uniprocessor experience demonstrates for me that F8 can be installed on
this platform and will operate normally.  My SMP experience highlights that
there is some hardware issue or driver issue that interferes with F8's normal
operation.  I have done everything I can think of to vet the hardware and I'm
left with a moderate degree of assurance that the hardware is fine.

'noapic' is the one boot param that I know of that provides a measure of
stability.  However, I don't understand which driver is receiving this benefit.

I now have the debug kernel available - but I don't know yet how to turn on its
debugging info features.

Is there anything you can suggest that I explore to resolve this problem or at
least generate an error message?

Comment 14 James Scott Jr. 2007-12-29 15:51:24 UTC
Created attachment 290527 [details]
dmesg log using boot parm of 'nosmp'

The only boot param I find that avoids the lockup or hang error is 'nosmp'.
When I physically removed one of the two cpus in the platform, I experienced no
errors or lockups.  With both cpus back in their sockets, the lockup happens
immediately during ipl, and with 'noapic' the ipl completes, then fails within
seconds of a signon through GDM.

I tried the debug kernel with the 'debug' boot option; no errors were recorded and
the lockup occurred.  I tried boot param 'debug 1' to boot into single user
mode, and the platform was unstable - locking up after two or three hours, or
immediately, for no repeatable reason.  In single user mode I can reliably
allow the raid5 driver to resync the array after a lockup - preventing a forced
re-install of F8.

Question: Libata/pata_amd seems to work fine in non-smp mode; should I
reclassify this problem as 'SMP' related?  As long as I'm in non-smp mode,
everything works fine!

Comment 15 James Scott Jr. 2007-12-31 06:10:05 UTC
Created attachment 290567 [details]
Successful F8 SMP boot - dmesg.log

The boot options 'noresume maxcpus=2' seem to have resolved the stability
problems I was having.  (maxcpus=4 caused a lockup)

The platform has been stable for 1 hour.  I will post again after a few more
days of validation.

Comment 16 James Scott Jr. 2007-12-31 06:38:48 UTC
Ok, I spoke too quickly.  With the tvtime tv viewer playing, I removed the
existing video backup and started to restore it.  The keyboard/mouse and screen
are frozen - but the audio is still playing.  I'm going to let it run and see if
it comes back after a few hours (restore time).

James

Comment 17 James Scott Jr. 2008-01-08 01:45:55 UTC
Created attachment 291030 [details]
dmsg after changing video card from ATI to NV.

Googling these SMP failures, I found some mention of X or the screen driver being
responsible for the lockups.  I then swapped graphics controllers to test this;
not fixed.  The NV card performed worse than the ATI, regarding lockups when
attempting 2-cpus or 1-cpu.

This now means I have swapped everything except the PSU in the box.  These
lockups under full SMP mode continue.

Comment 18 Bug Zapper 2008-11-26 08:55:55 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 19 James Scott Jr. 2008-11-26 23:50:40 UTC
Fedora V9, Kernel 2.6.25.14-108.fc9.i686 #1 SMP -- resolved this issue.