Bug 491554
Summary: nVidia MCP55 exception Emask frozen hardreset device not ready
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Target Milestone: rc
Reporter: Gerrit Slomma <gerrit.slomma>
Assignee: David Milburn <dmilburn>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: dirk.leinenbach, dzickus, james.brown, mikej, tao
Doc Type: Bug Fix
Last Closed: 2009-09-03 23:02:18 UTC
As stated in the initial post, I tried 2.6.18-128.1.1.el5. I wrote the comment and then decided to give the new kernel-subversion a try, but it did not pay off.

Hello,

Would you please test kernel-2.6.18-137.el5.bz491554.1?

http://people.redhat.com/dmilburn/

It includes some upstream hardreset fixes subsequent to the patch above, including:

    commit 2d775708bc6613f1be47f1e720781343341ecc94
    Author: Tejun Heo <tj>
    Date: Sun Jan 25 11:29:38 2009 +0900

    sata_nv: fix MCP5x reset

This kernel polls on the console for some time with:

    ata1: SATA link down (SStatus 100 SControl 300)
    ata1: EH complete
    ata1: hard resetting link

and finally results in:

    Scanning logical volumes
    Reading all physical volumes. This may take a while...
    Activating logical volumes
    Volume group "rootdg" not found
    Trying to resume from /dev/rootdg/swapvol
    Unable to access resume device (/dev/rootdg/swapvol)
    Creating root device.
    Mounting root filesystem.
    mount: could not find filesystem '/dev/root'
    Setting up other filesystems.
    Setting up new root fs
    setuproot: moving /dev failed: No such file or directory
    no fstab.sys, mounting internal defaults
    setuproot: error mounting /proc: No such file or directory
    setuproot: error mounting /sys: No such file or directory
    Switching to new root and running init
    unmounting old /dev
    unmounting old /proc
    unmounting old /sys
    switchroot: mount failed: No such file or directory
    Kernel panic - not syncing: Attempted to kill init!

Do I have to build a proper initrd with lvm and mdadm built in, or is this a direct consequence of the hard reset of the link?

No, you shouldn't have to build an initrd manually; installing the test rpm should have put a new initrd in /boot. I will need to look over the source again. Would you attach the output of "lspci -xxvvv"?

Done.

Created attachment 337692 [details]
lspci -xxvvv output
Hm. It seems the problem is gone with the 2.6.18-128.1.6.el5.x86_64 kernel.

    # grep "soft resetting" /var/log/messages
    Mar 22 23:25:41 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link
    Apr 2 02:18:00 rr019 kernel: ata1: soft resetting link

The lone message is from yesterday, after the reboot that followed the kernel installation; the other errors are from the 2.6.18-128.1.1.el5.x86_64 kernel. I can't put load on the machine right now and will post back as soon as I manage to.

No, I was celebrating too early. A few minutes into uptime the soft resets are back.

Still looking this over. I have an HP system with the same SATA controller, but I am not seeing the failures.

    00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
    . .
    00: de 10 7f 03 07 00 b0 00 a3 85 01 01 00 00 80 00
    10: e1 30 00 00 11 31 00 00 e9 30 00 00 15 31 00 00
    20: b1 30 00 00 00 70 24 ee 00 00 00 00 3c 10 fe 12
    30: 00 00 00 00 44 00 00 00 00 00 00 00 0a 03 03 01

Would you please boot the -128.el5 kernel with "log_buf_len=1000000" and attach the output of dmesg after booting?

I tried to, but with the -128.el5 kernel I couldn't even get the system up far enough to log in. I have attached the /var/log/messages from this morning's boot attempt.

Created attachment 338870 [details]
/var/log/message of kernel 2.6.18-128.el5
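The grep output above is easier to compare across kernels when the reset messages are tallied per day and port. The following is a minimal sketch, assuming the classic syslog line layout shown in this report; the function name count_soft_resets and the sample lines are illustrative and not part of any tool mentioned in this bug.

```python
import re
from collections import Counter

# Matches classic syslog lines such as:
#   "Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link"
# The layout is assumed from the lines quoted in this report.
RESET_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"\S+\s+kernel:\s+(?P<port>ata\d+): soft resetting link"
)


def count_soft_resets(lines):
    """Return a Counter keyed by (month, day, port) for soft reset messages."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            key = (match.group("month"), int(match.group("day")), match.group("port"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Abbreviated, hypothetical sample in the same format as /var/log/messages.
    sample = [
        "Mar 22 23:25:41 rr019 kernel: ata1: soft resetting link",
        "Mar 22 23:31:42 rr019 kernel: ata1: soft resetting link",
        "Apr  2 02:18:00 rr019 kernel: ata1: soft resetting link",
    ]
    for (month, day, port), n in count_soft_resets(sample).items():
        print(f"{month} {day:2d} {port}: {n} soft reset(s)")
```

To run it against a real log, pass an open file object, for example count_soft_resets(open("/var/log/messages")). Classic syslog timestamps carry no year, so counts from different years would be merged under the same key.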
Sorry for the delay. It looks like the drive doesn't support the WRITE FPDMA QUEUED command (0x61) and aborts it, setting the ABRT bit in the error register; this leads to the error messages. Can we see how your system behaves if we turn SWNCQ off? You will need to add this line to /etc/modprobe.conf:

    options sata_nv swncq_enabled=0

Then rebuild your initrd, boot the -128.el5 kernel, and attach your log:

    cp /boot/initrd-2.6.18-128.el5.img /boot/initrd-2.6.18-128.el5.img.bak
    mkinitrd -f /boot/initrd-2.6.18-128.el5.img 2.6.18-128.el5

Sorry, I can't: we scrapped the host and returned the mainboard, including the CPUs, to the distributor in exchange for a Xeon system. Too many problems with this one...

I was having the same problem, and swncq_enabled=0 did fix it. The system was stable through the 5.2 releases and had trouble after upgrading to a 5.3 kernel. Is this a problem with the controller or the drives?

Since the controller is reporting handshake errors with the drive, and since swncq can be turned off with a module param, closing this BZ.
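The ABRT diagnosis above rests on the status and error bytes that libata prints at the start of the "res" field (res STATUS/ERROR:...). The following is a minimal sketch of decoding those two bytes, assuming the standard ATA status and error register bit assignments; the function name decode_res and the second sample line are hypothetical, and only a subset of bits is covered.

```python
import re

# ATA status register bits (subset), listed from high bit to low bit.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
# ATA error register bits (subset), listed from high bit to low bit.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# In libata's report the first two hex bytes after "res" are status and error.
RES_RE = re.compile(r"res ([0-9a-fA-F]{2})/([0-9a-fA-F]{2}):")


def decode_res(line):
    """Return the names of the status and error bits set in a libata 'res' line."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return {
        "status": [name for bit, name in STATUS_BITS.items() if status & bit],
        "error": [name for bit, name in ERROR_BITS.items() if error & bit],
    }


if __name__ == "__main__":
    # Line taken from the description of this bug (timeout, no error bits set).
    print(decode_res("res 40/00:00:00:4f:c2/04:00:01:00:00/00 Emask 0x4 (timeout)"))
    # Hypothetical line showing how an aborted command would appear.
    print(decode_res("res 41/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)"))
```

For the line quoted in the description this yields only DRDY in the status byte and no error bits, which matches the "status: { DRDY }" message logged alongside it.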
Description of problem:
Since updating from kernel 2.6.18-92.1.22.el5 to 2.6.18-128.el5 or 2.6.18-128.1.1.el5 I get plenty of:

    (...)
    ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
    ata1: SError: { Handshk }
    ata1.00: cmd 35/00:00:70:07:8b/00:02:01:00:00/e0 tag 0 dma 262144 out
             res 40/00:00:00:4f:c2/04:00:01:00:00/00 Emask 0x4 (timeout)
    ata1.00: status: { DRDY }
    ata1: link is slow to respond, please be patient (ready=0)
    ata1: device not ready (errno=-16), forcing hardreset
    ata1: soft resetting link
    ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    ata1.00: configured for UDMA/100
    ata1: EH complete
    SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
    sda: Write Protect is off
    sda: Mode Sense: 00 3a 00 00
    SCSI device sda: drive cache: write back
    (...)

While this happens the system is unusable.

Version-Release number of selected component (if applicable):
2.6.18-128.el5 and 2.6.18-128.1.1.el5

How reproducible:
Update from kernel 2.6.18-92.1.22.el5 to 2.6.18-128.el5 or 2.6.18-128.1.1.el5.

Steps to Reproduce:
1. yum install kernel-2.6.18-128.el5 or yum install kernel-2.6.18-128.1.1.el5
2. reboot into new kernel
3. watch the messages scroll by

Actual results:
The system hangs occasionally.

Expected results:
The system runs fine, as it did on kernel 2.6.18-92.1.22.el5.

Additional info:

    [root@rr019 ~]# lspci
    00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
    00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
    00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
    00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
    00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
    00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
    00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
    00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
    00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
    00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
    00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
    00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
    00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
    00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
    00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
    00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
    00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
    00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
    00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
    00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
    00:19.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
    00:19.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
    00:19.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
    00:19.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
    00:19.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
    01:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

The motherboard is a Tyan Thunder n3600R S2912 with the newest BIOS version. The SATA ports run as AHCI.

Upgrading to a newer kernel-subversion is not really an option at the moment, as the base release is our company's in-house standard and our kvm kernel modules are compiled against this version (this is only a test host, though). I tested the newest 2.6.18-128.1.1.el5, but to no avail.

The problem seems to be known, albeit on Tyan Thunder n3600B and Supermicro H8QMI-2 boards:
> http://lists.debian.org/debian-kernel/2007/03/msg00258.html
> http://forums.fedoraforum.org/showthread.php?t=208698&page=2
but those reports concern another distro and kernel-subversion.

In the changelog of 2.6.18-128.el5 I found the following:

    (...)
    2008-12-08 22:00:00 Don Zickus <dzickus> [2.6.18-126.el5]:
    (...)
    - [ata] libata: sata_nv hard reset mcp55 (David Milburn ) [473152]
    (...)

Shouldn't that mean this issue is/was fixed? I will give 2.6.18-128.1.1.el5 a try and report back.
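The SErr and SStatus values in the messages above are raw SATA register contents printed in hexadecimal. The following is a minimal sketch of how they break down, assuming the SError bit positions and short names used by the Linux libata driver and the SStatus field layout from the SATA specification; the function names are illustrative and the bit table is a subset.

```python
# SError bit positions with the short names libata prints (subset).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
}

# SStatus SPD field values defined for the link generations relevant here.
SPD_NAMES = {0: "no speed negotiated", 1: "1.5 Gbps", 2: "3.0 Gbps"}


def decode_serror(value):
    """Return the names of all known SError bits set in value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def decode_sstatus(value):
    """Split SStatus into DET (bits 0-3), SPD (bits 4-7) and IPM (bits 8-11)."""
    return {"det": value & 0xF, "spd": (value >> 4) & 0xF, "ipm": (value >> 8) & 0xF}


if __name__ == "__main__":
    print(decode_serror(0x400000))   # SErr value from the exception line
    up = decode_sstatus(0x123)       # "SStatus 123" from the link-up message
    print(up, SPD_NAMES.get(up["spd"], "unknown"))
```

Under these assumptions 0x400000 is bit 22 alone, which libata reports as Handshk, and SStatus 123 decodes to DET=3 (device present, PHY communication established), SPD=2 (3.0 Gbps) and IPM=1 (active), consistent with the "SATA link up 3.0 Gbps" message.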