Bug 484672

Summary:	Kernel 2.6.27.12-170.2.5.fc10.x86_64 frequent panics
Product:	[Fedora] Fedora	Reporter:	josip@icase.edu <jl-icase>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	low
Version:	10	CC:	kernel-maint, quintela
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	484749		Environment:
Last Closed:	2009-02-16 22:20:13 UTC	Type:	---

Description josip@icase.edu 2009-02-09 14:00:11 UTC

Description of problem:
Since installing kernel-2.6.27.12-170.2.5.fc10.x86_64, the system panics frequently, leaving nothing in the logs.  The console trace was not captured, but it points to a problem in the file system area of the kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.27.12-170.2.5.fc10.x86_64

How reproducible:
Upgrade to kernel-2.6.27.12-170.2.5.fc10.x86_64 then use the system for a week.


Actual results:
The kernel panics and prints a lot of detail, but only to the console screen.  The trace suggests that the trouble originates in the file system area of the kernel.  Sorry, the console screen contents were not captured.

Expected results:
Stable system.

Additional info:
I am backing off to kernel-2.6.27.9-159.fc10.x86_64, which ran stably.  The machine is a server with 6 SATA drives in a RAID6 configuration and must be reliable.

Comment 1 Chuck Ebbert 2009-02-10 07:27:20 UTC

If we don't have the text of the error messages there's not much that can be done about the bug...

Comment 2 josip@icase.edu 2009-02-10 13:00:17 UTC

OK, two problems:

(1) How do you capture screen text?  There is nothing in the logs.

(2) This machine must be reliable and has been downgraded to the earlier kernel which worked fine.
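
One possible answer to question (1) is to send kernel messages over the network with the kernel's netconsole module and record them on a second machine. The sketch below covers only the receiving side and rests on assumptions that are not part of this report: netconsole on the failing server is already configured to send to this host, it uses UDP port 6666, and the log goes to a file named console.log. The names format_record and listen are illustrative.

```python
import socket


def format_record(payload: bytes, sender: str) -> str:
    """Decode one UDP datagram and tag it with the sending host."""
    # Kernel messages may be cut off mid-character, so replace bad bytes.
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"[{sender}] {text}"


def listen(port: int = 6666, logfile: str = "console.log") -> None:
    """Append every datagram received on the port to the log file (assumed names)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        with open(logfile, "a", encoding="utf-8") as out:
            while True:
                payload, (host, _src_port) = sock.recvfrom(65535)
                out.write(format_record(payload, host) + "\n")
                # Flush at once so the last lines before a lockup are on disk.
                out.flush()


if __name__ == "__main__":
    listen()
```

Run on the second machine, this keeps a copy of the backtrace even when the server itself can no longer write to its own disks.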

Comment 3 josip@icase.edu 2009-02-15 13:34:10 UTC

It happened again, but this time with kernel-2.6.27.9-159.fc10.x86_64.rpm (the previous version -- my next step is to try the even older kernel-2.6.27.7-134.fc10.x86_64).

The machine locks up and the kernel prints a backtrace from _spin_lock about once per minute, so this isn't quite a kernel panic, more like periodic detection of deadlocks related to the file system.  There are two backtrace patterns:

_spin_lock
d_alloc
do_lookup
__link_path_walk
path_walk
do_path_lookup
user_path_at
ext3_dirty_inode
_spin_lock
mnt_drop_writes
vfs_lstat_fd
sys_newfstatat
path_put
audit_syscall_entry
system_call_fastpath

I copied this by hand from the console, so there may be typos.  The other pattern is:

_spin_lock
_atomic_dec_and_lock
dput
path_put
__link_path_walk
generic_file
...

Both of the above look like some kind of deadlock related to the file system.  There are no indications of problems in system logs, and "smartctl -a ..." shows that none of the drives had any errors.

The machine has the following parts:

# lspci
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation Device 2833 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation GeForce 7100 GS (rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
03:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
06:01.0 Ethernet controller: ADMtek NC100 Network Everywhere Fast Ethernet 10/100 (rev 11)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
06:04.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 14)

Each of the 6 ports of the Intel SATA controller has a 500 GB drive attached.  The JMicron SATA controller has a DVD-ROM attached.  The processor is an Intel Core Duo 6700 @ 2.66 GHz.  The BIOS is AMI version 1226 rev. 8.12 released 11/23/2007.

Comment 4 josip@icase.edu 2009-02-16 13:59:03 UTC

Please close this bug: Memory on the system is going bad.

Memtest86+ originally confirmed that memory was good (about 2 years ago), but re-testing shows frequent errors of the leading bit.  Re-seating DIMMs didn't help.  New memory is on order.
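
To illustrate what an error "of the leading bit" means, the sketch below compares the value a memory test wrote with the value it read back and lists the bit positions that differ. It assumes 32-bit test words and reads "leading bit" as the most significant bit (bit 31). The function name flipped_bits and the sample values are hypothetical, not taken from the actual test run.

```python
from typing import List


def flipped_bits(expected: int, actual: int, width: int = 32) -> List[int]:
    """Return the positions (0 = least significant) where the two words differ."""
    # XOR leaves a 1 in every position where the two values disagree.
    diff = (expected ^ actual) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


if __name__ == "__main__":
    # Hypothetical pair: an all-zero pattern read back with the top bit set.
    print(flipped_bits(0x00000000, 0x80000000))
```

If the same single position shows up across many failing addresses, that fits a fault in the memory hardware better than a software deadlock.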