Bug 491554

Summary: nVidia MCP55 exception Emask frozen hardreset device not ready
Product: Red Hat Enterprise Linux 5
Reporter: Gerrit Slomma <gerrit.slomma>
Component: kernel
Assignee: David Milburn <dmilburn>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: medium
Version: 5.3
CC: dirk.leinenbach, dzickus, james.brown, mikej, tao
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-03 23:02:18 UTC
Attachments:
  lspci -xxvvv output
  /var/log/message of kernel 2.6.18-128.el5

Description Gerrit Slomma 2009-03-22 23:10:12 UTC
Description of problem:

Since updating from kernel 2.6.18-92.1.22.el5 to 2.6.18-128.el5 or 2.6.18-128.1.1.el5 I get plenty of:

(...)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
ata1: SError: { Handshk }
ata1.00: cmd 35/00:00:70:07:8b/00:02:01:00:00/e0 tag 0 dma 262144 out
         res 40/00:00:00:4f:c2/04:00:01:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
(...)
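
For readers following the log, the SErr value and the cmd line can be decoded by hand. The sketch below is illustrative and not part of the original report. The helper names decode_serr and decode_taskfile are hypothetical, the bit-name table is an assumption to check against the kernel source, and the cmd field order is assumed to be command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, as libata prints it.

```python
# Bit positions follow the SATA SError register; the short names mirror what
# libata prints. Assumption: verify this table against the kernel source.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all known bits set in an SErr value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def decode_taskfile(field):
    """Decode a libata 'cmd' field such as '35/00:00:70:07:8b/00:02:01:00:00/e0'."""
    cmd, low, high, dev = field.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit commands split the sector count and LBA across low and "hob" bytes.
    # A count of 0 (meaning 65536 sectors) is not special-cased in this sketch.
    sectors = (hob_nsect << 8) | nsect
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": int(cmd, 16),
        "sectors": sectors,
        "lba": lba,
        "device": int(dev, 16),
    }
```

Applied to the excerpt above, SErr 0x400000 has only bit 22 set (Handshk), and the command is 0x35 (WRITE DMA EXT) for 0x200 sectors, i.e. 512 * 512 = 262144 bytes, which matches "dma 262144 out".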

When this happens, nothing can be done on the system.

Version-Release number of selected component (if applicable):

2.6.18-128.el5 and 2.6.18-128.1.1.el5

How reproducible:

update from kernel 2.6.18-92.1.22.el5 to 2.6.18-128.el5 or 2.6.18-128.1.1.el5

Steps to Reproduce:
1. yum install kernel-2.6.18-128.el5 or yum install kernel-2.6.18-128.1.1.el5
2. reboot into new kernel
3. watch the messages scroll by
  
Actual results:

system hangs occasionally

Expected results:

system runs fine, as it did on kernel 2.6.18-92.1.22.el5

Additional info:

[root@rr019 ~]# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

The motherboard is a Tyan Thunder n3600R S2912 with the newest BIOS version. The SATA ports run as AHCI. Upgrading to a newer kernel sub-version is not a real option at the moment, as the base release is our company's in-house standard and our kvm kernel modules are compiled against this version (this is only a test host, though). I tested the newest 2.6.18-128.1.1.el5, but to no avail. The problem seems to be known, albeit on the Tyan Thunder n3600B and Supermicro H8QMI-2:
> http://lists.debian.org/debian-kernel/2007/03/msg00258.html
> http://forums.fedoraforum.org/showthread.php?t=208698&page=2
but those reports concern another distro and kernel sub-version.

In the changelog of 2.6.18-128.el5 I found the following:

(...)
2008-12-08 22:00:00
Don Zickus <dzickus> [2.6.18-126.el5]:
(...)
- [ata] libata: sata_nv hard reset mcp55 (David Milburn ) [473152]
(...)

Shouldn't that mean this issue is/was fixed?
I will give the 2.6.18-128.1.1.el5 a try and report back.

Comment 1 Gerrit Slomma 2009-03-23 11:24:40 UTC
As stated in the initial post, I tried 2.6.18-128.1.1.el5. I had written the comment first and then decided to give the new kernel sub-version a try, but it did not pay off.

Comment 2 David Milburn 2009-03-31 23:14:51 UTC
Hello,

Would you please test kernel-2.6.18-137.el5.bz491554.1?

http://people.redhat.com/dmilburn/

It includes some upstream hardreset fixes subsequent to the patch above
including:

commit 2d775708bc6613f1be47f1e720781343341ecc94
Author: Tejun Heo <tj>
Date:   Sun Jan 25 11:29:38 2009 +0900

    sata_nv: fix MCP5x reset

Comment 3 Gerrit Slomma 2009-04-01 16:50:01 UTC
This loops on the console for some time:

ata1: SATA link down (SStatus 100 SControl 300)
ata1: EH complete
ata1: hard resetting link

and finally results in:

Scanning logical volumes
  Reading all physical volumes. This may take a while...
Activating logical volumes
  Volume group "rootdg" not found
Trying to resume from /dev/rootdg/swapvol
Unable to access resume device (/dev/rootdg/swapvol)
Creating root device.
Mounting root filesystem.
mount: could not find filesystem '/dev/root'
Setting up other filesystems.
Setting up new root fs
setuproot: moving /dev failed: No such file or directory
no fstab.sys, mounting internal defaults
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
Switching to new root and running init
unmounting old /dev
unmounting old /proc
unmounting old /sys
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!

Do I have to build a proper initrd with lvm and mdadm built in, or is this a direct consequence of the hard reset of the link?
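
The SStatus values seen in this bug (123 with the link up in the description, 100 with the link down in this comment) are hexadecimal and split into three 4-bit fields. The sketch below is illustrative only. The function name decode_sstatus is hypothetical, and the field meanings are taken from the SATA SStatus register definition as I understand it.

```python
# Field meanings per the SATA SStatus register (subset; assumption to verify
# against the specification).
DET = {
    0x0: "no device detected, PHY communication not established",
    0x1: "device detected, PHY communication not established",
    0x3: "device detected, PHY communication established",
    0x4: "PHY offline",
}
SPD = {0x0: "no negotiated speed", 0x1: "1.5 Gbps", 0x2: "3.0 Gbps"}
IPM = {0x0: "not present", 0x1: "active", 0x2: "partial", 0x6: "slumber"}


def decode_sstatus(value):
    """Split an SStatus value into its DET, SPD and IPM fields."""
    det = value & 0xF          # bits 3:0, device detection
    spd = (value >> 4) & 0xF   # bits 7:4, negotiated speed
    ipm = (value >> 8) & 0xF   # bits 11:8, interface power management
    return {
        "det": DET.get(det, "reserved (0x%x)" % det),
        "spd": SPD.get(spd, "reserved (0x%x)" % spd),
        "ipm": IPM.get(ipm, "reserved (0x%x)" % ipm),
    }
```

With these tables, decode_sstatus(0x123) reports an established link at 3.0 Gbps, while decode_sstatus(0x100) reports that no device was detected at all, which fits the volume group not being found afterwards.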

Comment 4 David Milburn 2009-04-01 23:38:35 UTC
No, you shouldn't have to manually build an initrd; installing the test rpm
should put a new initrd in /boot. I will need to look over the source
again. Would you attach the output of "lspci -xxvvv"?

Comment 5 Gerrit Slomma 2009-04-01 23:59:09 UTC
Done

Comment 6 Gerrit Slomma 2009-04-01 23:59:47 UTC
Created attachment 337692
lspci -xxvvv output

Comment 7 Gerrit Slomma 2009-04-02 04:49:54 UTC
Hm.
It seems the problem is gone with the 2.6.18-128.1.6.el5.x86_64 kernel.

# grep "soft resetting" /var/log/messages

Mar 22 23:25:41 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
Apr  2 02:18:00 rr019 kernel: ata1: soft resetting link

The lone message is from yesterday, after the reboot for the kernel installation; the other errors are from the 2.6.18-128.1.1.el5.x86_64 kernel.
I can't put load on the machine right now; I will post back as soon as I manage to.
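
To compare kernels over a longer period, the grep above can be turned into a per-day count. This is a minimal sketch, assuming the classic syslog line format shown in the output above; the function name count_soft_resets is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   "Apr  2 02:18:00 rr019 kernel: ata1: soft resetting link"
SOFT_RESET = re.compile(
    r"^(\w{3}\s+\d+) (\d\d:\d\d:\d\d) \S+ kernel: (ata\d+): soft resetting link"
)


def count_soft_resets(lines):
    """Count 'soft resetting link' messages per (day, port)."""
    counts = Counter()
    for line in lines:
        match = SOFT_RESET.match(line)
        if match:
            day, _time, port = match.groups()
            # Collapse the padding syslog uses for single-digit days.
            counts[(" ".join(day.split()), port)] += 1
    return counts
```

Feeding it the lines of /var/log/messages (for example via open("/var/log/messages")) gives one counter entry per day and port, so a burst like the one on Mar 22 stands out from the single reset on Apr 2.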

Comment 8 Gerrit Slomma 2009-04-02 15:24:18 UTC
No, I was celebrating too early.
A few minutes into uptime, the soft resets are back.

Comment 9 David Milburn 2009-04-07 22:10:08 UTC
Still looking this over. I have an HP system with the same SATA controller,
but I am not seeing the failures.

00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3) 
  .
  . 
00: de 10 7f 03 07 00 b0 00 a3 85 01 01 00 00 80 00
10: e1 30 00 00 11 31 00 00 e9 30 00 00 15 31 00 00
20: b1 30 00 00 00 70 24 ee 00 00 00 00 3c 10 fe 12
30: 00 00 00 00 44 00 00 00 00 00 00 00 0a 03 03 01
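
The hex dump above is the start of the PCI configuration space, so the IDs can be read straight out of it. A minimal sketch follows, assuming the dump lines run contiguously from offset 0x00 in the "NN: xx xx ..." layout that lspci -x prints; the function name parse_pci_header is hypothetical.

```python
import struct


def parse_pci_header(hexdump_lines):
    """Extract a few standard header fields from 'lspci -x' style dump lines."""
    data = bytearray()
    for line in hexdump_lines:
        _offset, _, rest = line.partition(":")
        data.extend(int(byte, 16) for byte in rest.split())
    # Standard type 0 header layout, little-endian.
    vendor, device, command, status = struct.unpack_from("<HHHH", data, 0x00)
    revision, prog_if, subclass, base_class = struct.unpack_from("<BBBB", data, 0x08)
    subsys_vendor, subsys_id = struct.unpack_from("<HH", data, 0x2C)
    return {
        "vendor": vendor,
        "device": device,
        "command": command,
        "status": status,
        "revision": revision,
        "prog_if": prog_if,
        "class": (base_class << 8) | subclass,
        "subsys_vendor": subsys_vendor,
        "subsys_id": subsys_id,
    }
```

For the four lines above this yields vendor 0x10de (nVidia), device 0x037f, class 0x0101 (IDE interface), revision 0xa3, and subsystem vendor 0x103c (Hewlett-Packard), which agrees with the lspci summary line.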

Would you please boot the -128.el5 kernel with "log_buf_len=1000000" and
attach the output of dmesg after booting?

Comment 10 Gerrit Slomma 2009-04-09 07:28:17 UTC
I tried, but with the -128.el5 kernel I couldn't even get the system up far enough to log in.
I have attached the /var/log/messages from this morning's boot attempt.

Comment 11 Gerrit Slomma 2009-04-09 07:29:26 UTC
Created attachment 338870
/var/log/message of kernel 2.6.18-128.el5

Comment 12 David Milburn 2009-04-20 19:49:39 UTC
Sorry for the delay. It looks like the drive doesn't support the Write FPDMA
Queued command (0x61) and aborts it, setting the ABRT bit in the error
register; this leads to the error messages.

Can we see how your system behaves if we turn SWNCQ off? You will need to
add the following line to /etc/modprobe.conf:

options sata_nv swncq_enabled=0

And then rebuild your initrd, boot the -128.el5 kernel, and attach your log.

cp /boot/initrd-2.6.18-128.el5.img /boot/initrd-2.6.18-128.el5.img.bak
mkinitrd -f /boot/initrd-2.6.18-128.el5.img 2.6.18-128.el5
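
If the modprobe.conf change has to be applied on several hosts, the edit can be made repeatable. The following is only a sketch, not a Red Hat tool: the function name ensure_module_option is hypothetical, it works on the file contents as a string, and writing the result back (and rebuilding the initrd as shown above) is left to the caller.

```python
def ensure_module_option(conf_text, module, option):
    """Return conf_text with 'option' set on an 'options <module>' line.

    Any earlier setting of the same parameter for that module is replaced,
    so running the function twice gives the same result as running it once.
    """
    key = option.split("=", 1)[0]
    lines, placed = [], False
    for line in conf_text.splitlines():
        parts = line.split()
        if parts[:2] == ["options", module]:
            # Keep unrelated parameters, drop the old value of this one.
            params = [p for p in parts[2:] if p.split("=", 1)[0] != key]
            if not placed:
                params.append(option)
                placed = True
            if not params:
                continue  # the line only carried the old value
            line = " ".join(["options", module] + params)
        lines.append(line)
    if not placed:
        lines.append("options %s %s" % (module, option))
    return "\n".join(lines) + "\n"
```

For example, ensure_module_option(text, "sata_nv", "swncq_enabled=0") appends the options line when it is missing and rewrites an existing swncq_enabled=1 in place, leaving all other lines untouched.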

Comment 13 Gerrit Slomma 2009-04-21 21:50:00 UTC
Sorry, I can't: we scrapped the host and returned the mainboard, including the CPUs, to the distributor in exchange for a Xeon system.
Too many problems with this one...

Comment 26 Mike Jones 2009-06-09 16:04:06 UTC
I was having the same problem, and swncq_enabled=0 did fix the problem.

The system was stable through 5.2 releases and had trouble after upgrading to a 5.3 kernel.  Is this a problem with the controller or the drives?

Comment 37 David Milburn 2009-09-03 23:02:18 UTC
Since the controller is reporting handshake errors with the drive,
and since swncq can be turned off with a module param, closing
this BZ.