Bug 163738 - Kernel PANIC - not syncing: fatal exception
Summary: Kernel PANIC - not syncing: fatal exception
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL: RHEL4 PANICs at clear_inode
Whiteboard:
Depends On:
Blocks: 168429
Reported: 2005-07-20 18:43 UTC by GV Govindasamy
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 19:21:43 UTC
Target Upstream Version:
Embargoed:


Attachments
stack trace (1.29 MB, image/jpeg)
2005-07-20 18:49 UTC, GV Govindasamy
don't manipulate data_updates around async unlinks (979 bytes, patch)
2005-07-23 15:07 UTC, Chuck Lever


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:808 0 normal SHIPPED_LIVE Important: kernel security update 2005-10-27 04:00:00 UTC
Red Hat Product Errata RHSA-2006:0132 0 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description GV Govindasamy 2005-07-20 18:43:29 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
While running our tests, RHEL4 panics intermittently with the same stack trace. I am attaching an image of the stack trace to this bug.

Version-Release number of selected component (if applicable):
kernel-2.6.9-11.EL

How reproducible:
Didn't try


Additional info:

Comment 1 GV Govindasamy 2005-07-20 18:47:06 UTC
# uname -a
Linux xxx.xxx.xxx.com 2.6.9-11.ELsmp #1 SMP Fri May 20 18:26:27 EDT 2005 i686
i686 i386 GNU/Linux

# dmesg
 2005
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009cc00 (usable)
 BIOS-e820: 000000000009cc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000ea070 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffc0000 (usable)
 BIOS-e820: 00000000dffc0000 - 00000000dffcf000 (ACPI data)
 BIOS-e820: 00000000dffcf000 - 00000000dfff0000 (ACPI NVS)
 BIOS-e820: 00000000dfff0000 - 00000000e0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec86000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
3712MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000ff780
On node 0 totalpages: 1179648
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 225280 pages, LIFO batch:16
  HighMem zone: 950272 pages, LIFO batch:16
DMI 2.3 present.
Using APIC driver default
ACPI: RSDP (v000 ACPIAM                                ) @ 0x000f7710
ACPI: RSDT (v001 A M I  OEMRSDT  0x09000424 MSFT 0x00000097) @ 0xdffc0000
ACPI: FADT (v002 A M I  OEMFACP  0x09000424 MSFT 0x00000097) @ 0xdffc0200
ACPI: MADT (v001 A M I  OEMAPIC  0x09000424 MSFT 0x00000097) @ 0xdffc0390
ACPI: OEMB (v001 A M I  AMI_OEM  0x09000424 MSFT 0x00000097) @ 0xdffcf040
ACPI: DSDT (v001  0ABDI 0ABDI007 0x00000007 INTL 0x02002026) @ 0x00000000
ACPI: PM-Timer IO Port: 0x408
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 20
Enabling APIC mode:  Flat.  Using 0 I/O APICs
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec10000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec10000, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Using ACPI (MADT) for SMP configuration information
Built 1 zonelists
Kernel command line: auto BOOT_IMAGE=2.6.9-11.ELsmp ro
BOOT_FILE=/boot/vmlinuz-2.6.9-11.ELsmp rhgb quiet root=LABEL=/
Initializing CPU#0
CPU 0 irqstacks, hard=c03db000 soft=c03bb000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2801.374 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 4150216k/4718592k available (1824k kernel code, 42940k reserved, 744k
data, 176k init, 3276544k highmem)
Calibrating delay loop... 5521.40 BogoMIPS (lpj=2760704)
Security Scaffold v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
There is already a security framework initialized, register_security failed.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20000000 00000000 00000000
monitor/mwait feature present.
using mwait in idle threads.
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: After all inits, caps:        bfebf3ff 20000000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
CPU0: Thermal monitoring enabled
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04
per-CPU timeslice cutoff: 2925.33 usecs.
task migration cache decay timeout: 3 msecs.
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c03dc000 soft=c03bc000
Initializing CPU#1
Calibrating delay loop... 5586.94 BogoMIPS (lpj=2793472)
CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20000000 00000000 00000000
monitor/mwait feature present.
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: After all inits, caps:        bfebf3ff 20000000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
CPU1: Thermal monitoring enabled
CPU1: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04
Booting processor 2/6 eip 3000
CPU 2 irqstacks, hard=c03dd000 soft=c03bd000
Initializing CPU#2
Calibrating delay loop... 5586.94 BogoMIPS (lpj=2793472)
CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20000000 00000000 00000000
monitor/mwait feature present.
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 3
CPU: After all inits, caps:        bfebf3ff 20000000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#2.
CPU2: Intel P4/Xeon Extended MCE MSRs (24) available
CPU2: Thermal monitoring enabled
CPU2: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04
Booting processor 3/7 eip 3000
CPU 3 irqstacks, hard=c03de000 soft=c03be000
Initializing CPU#3
Calibrating delay loop... 5586.94 BogoMIPS (lpj=2793472)
CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000
CPU: After vendor identify, caps:  bfebfbff 20000000 00000000 00000000
monitor/mwait feature present.
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 3
CPU: After all inits, caps:        bfebf3ff 20000000 00000000 00000080
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#3.
CPU3: Intel P4/Xeon Extended MCE MSRs (24) available
CPU3: Thermal monitoring enabled
CPU3: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04
Total of 4 processors activated (22282.24 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
checking TSC synchronization across 4 CPUs: passed.
Brought up 4 CPUs
zapping low mappings.
checking if image is initramfs... it is
Freeing initrd memory: 472k freed
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xf0031, last bus=3
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
ACPI: Subsystem revision 20040816
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.2
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EPA0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 *7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 *5 6 7 10 11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 169
ACPI: PCI interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 177
ACPI: PCI interrupt 0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 185
ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI interrupt 0000:00:1f.3[B] -> GSI 17 (level, low) -> IRQ 201
ACPI: PCI interrupt 0000:03:04.0[A] -> GSI 18 (level, low) -> IRQ 193
ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 17 (level, low) -> IRQ 201
apm: BIOS not found.
audit: initializing netlink socket (disabled)
audit(1121858240.453:0): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key D67B3E6B1ED6FEC7
- User ID: Red Hat, Inc. (Kernel Module GPG key)
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: Processor [CPU1] (supports C1, 8 throttling states)
ACPI: Processor [CPU2] (supports C1)
ACPI: Processor [CPU3] (supports C1)
ACPI: Processor [CPU4] (supports C1)
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing enabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
divert: not allocating divert_blk for non-ethernet device lo
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Probing IDE interface ide0...
hda: CD-224E, ATAPI CD/DVD-ROM drive
ide1: I/O resource 0x170-0x177 not free.
ide1: ports already in use, skipping probe
Probing IDE interface ide2...
Probing IDE interface ide3...
Probing IDE interface ide4...
Probing IDE interface ide5...
Using cfq io scheduler
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP: routing cache hash table of 32768 buckets, 512Kbytes
TCP: Hash tables configured (established 262144 bind 43690)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
ACPI: (supports S0 S1 S3 S4 S4bios S5)
ACPI wakeup devices:
EPA0 EPA1 EPB0 EPB1 EPC0 P0P1 MC97 USB1 USB2 EUSB P0PC SLPB
Freeing unused kernel memory: 176k freed
SCSI subsystem initialized
libata version 1.10 loaded.
ata_piix version 1.03
ata_piix: combined mode detected
ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 193
ata: 0x1f0 IDE port busy
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xFC08 irq 15
ata1: dev 1 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f
ata1: dev 1 ATA, max UDMA/133, 156301488 sectors: lba48
ata1: dev 1 configured for UDMA/133
scsi0 : ata_piix
  Vendor: ATA       Model: ST380013AS        Rev: 3.19
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2
Attached scsi disk sda at scsi0, channel 0, id 1, lun 0
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: sda1: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 40747
ext3_orphan_cleanup: deleting unreferenced inode 43476
ext3_orphan_cleanup: deleting unreferenced inode 42801
ext3_orphan_cleanup: deleting unreferenced inode 854049
ext3_orphan_cleanup: deleting unreferenced inode 164605
ext3_orphan_cleanup: deleting unreferenced inode 494818
ext3_orphan_cleanup: deleting unreferenced inode 493756
ext3_orphan_cleanup: deleting unreferenced inode 494793
ext3_orphan_cleanup: deleting unreferenced inode 7504011
ext3_orphan_cleanup: deleting unreferenced inode 7504003
ext3_orphan_cleanup: deleting unreferenced inode 43402
ext3_orphan_cleanup: deleting unreferenced inode 42787
ext3_orphan_cleanup: deleting unreferenced inode 494781
ext3_orphan_cleanup: deleting unreferenced inode 42051
ext3_orphan_cleanup: deleting unreferenced inode 494784
EXT3-fs: sda1: 15 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
SELinux:  Disabled at runtime.
SELinux:  Unregistering netfilter hooks
inserting floppy driver for 2.6.9-11.ELsmp
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
Intel(R) PRO/1000 Network Driver - version 5.6.10.1-k2-NAPI
Copyright (c) 1999-2004 Intel Corporation.
ACPI: PCI interrupt 0000:03:04.0[A] -> GSI 18 (level, low) -> IRQ 193
divert: allocating divert_blk for eth0
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
hw_random: RNG not detected
ACPI: PCI interrupt 0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 185
ehci_hcd 0000:00:1d.7: EHCI Host Controller
PCI: Setting latency timer of device 0000:00:1d.7 to 64
ehci_hcd 0000:00:1d.7: irq 185, pci mem f8806c00
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1
PCI: cache line size of 128 is not supported by device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: USB 2.0 enabled, EHCI 1.00, driver 2004-May-10
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
USB Universal Host Controller Interface driver v2.2
ACPI: PCI interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 169
uhci_hcd 0000:00:1d.0: UHCI Host Controller
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: irq 169, io base 0000e800
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 177
uhci_hcd 0000:00:1d.1: UHCI Host Controller
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: irq 177, io base 0000ec00
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 3
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ip_tables: (C) 2000-2002 Netfilter core team
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
ACPI: Power Button (FF) [PWRF]
ACPI: Sleep Button (CM) [SLPB]
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
EXT3 FS on sda1, internal journal
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm.com
cdrom: open failed.
Adding 16707592k swap on /dev/sda2.  Priority:-1 extents:1
ip_tables: (C) 2000-2002 Netfilter core team
ip_tables: (C) 2000-2002 Netfilter core team
i2c /dev entries driver
NET: Registered protocol family 10
Disabled Privacy Extensions on device c03356c0(lo)
IPv6 over IPv4 tunneling driver
divert: not allocating divert_blk for non-ethernet device sit0
ip_tables: (C) 2000-2002 Netfilter core team
eth0: no IPv6 routers present

# /sbin/lspci -v
00:00.0 Host bridge: Intel Corporation E7320 Memory Controller Hub (rev 0a)
        Subsystem: Intel Corporation: Unknown device 0000
        Flags: bus master, fast devsel, latency 0
        Capabilities: <available only to root>

00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev
0a) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        Capabilities: <available only to root>

00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
(prog-if 00 [Normal decode])
        Flags: bus master, 66Mhz, fast devsel, latency 32
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=32
        Capabilities: <available only to root>

00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller
(rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24d0
        Flags: bus master, medium devsel, latency 0, IRQ 169
        I/O ports at e800 [size=32]

00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller
(rev 02) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation: Unknown device 24d0
        Flags: bus master, medium devsel, latency 0, IRQ 177
        I/O ports at ec00 [size=32]

00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
        Flags: medium devsel
        Memory at febff800 (32-bit, non-prefetchable) [size=16]

00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt
Controller (rev 02) (prog-if 20 [IO(X)-APIC])
        Flags: bus master, fast devsel, latency 0
        Capabilities: <available only to root>

00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller
(rev 02) (prog-if 20 [EHCI])
        Subsystem: Intel Corporation: Unknown device 24d0
        Flags: bus master, medium devsel, latency 0, IRQ 185
        Memory at febffc00 (32-bit, non-prefetchable) [size=1K]
        Capabilities: <available only to root>

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a) (prog-if 00
[Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=03, subordinate=03, sec-latency=32
        I/O behind bridge: 0000d000-0000dfff
        Memory behind bridge: fca00000-feafffff
        Prefetchable memory behind bridge: fc800000-fc8fffff

00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
        Flags: bus master, medium devsel, latency 0

00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev
02) (prog-if 8a [Master SecP PriP])
        Subsystem: Intel Corporation 6300ESB SATA Storage Controller
        Flags: bus master, 66Mhz, medium devsel, latency 0, IRQ 193
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at <unassigned>
        I/O ports at fc00 [size=16]

00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02)
        Subsystem: Intel Corporation: Unknown device 24d0
        Flags: medium devsel, IRQ 201
        I/O ports at 0540 [size=32]

03:04.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet
Controller
        Subsystem: Super Micro Computer Inc: Unknown device 1076
        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 193
        Memory at feaa0000 (64-bit, non-prefetchable) [size=128K]
        I/O ports at dc00 [size=64]
        Capabilities: <available only to root>

03:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
(prog-if 00 [VGA])
        Subsystem: ATI Technologies Inc Rage XL
        Flags: bus master, stepping, medium devsel, latency 32, IRQ 201
        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
        I/O ports at d800 [size=256]
        Memory at feafe000 (32-bit, non-prefetchable) [size=4K]
        Expansion ROM at fea60000 [disabled] [size=128K]
        Capabilities: <available only to root>



Comment 2 GV Govindasamy 2005-07-20 18:49:04 UTC
Created attachment 116991
stack trace

Comment 3 Steve Dickson 2005-07-21 12:13:28 UTC
hmm.. what test were you running? Also have you been
able to reproduce this?

Comment 4 GV Govindasamy 2005-07-21 17:37:21 UTC
We are running around 600 tests, at least 6 at a time in parallel, so it is not
straightforward to determine which test actually caused this. I haven't tried
reproducing it.

Comment 5 GV Govindasamy 2005-07-22 00:11:22 UTC
Clear text from /var/log/messages:

Jul 17 20:13:59 xxx kernel: ------------[ cut here ]------------
Jul 17 20:13:59 xxx kernel: kernel BUG at fs/nfs/inode.c:156!
Jul 17 20:13:59 xxx kernel: invalid operand: 0000 [#1]
Jul 17 20:13:59 xxx kernel: SMP
Jul 17 20:13:59 xxx kernel: Modules linked in: autofs4 i2c_dev i2c_core nfs
lockd sunrpc dm_mod button battery ac md5 ipv6 uhci_hcd ehci_hcd e1000 floppy
ext3 jbd ata_piix libata sd_mod scsi_mod
Jul 17 20:13:59 xxx kernel: CPU:    2
Jul 17 20:13:59 xxx kernel: EIP:    0060:[<f91febcd>]    Not tainted VLI
Jul 17 20:13:59 xxx kernel: EFLAGS: 00010286   (2.6.9-11.ELsmp)
Jul 17 20:13:59 xxx kernel: EIP is at nfs_clear_inode+0x3f/0x4a [nfs]
Jul 17 20:13:59 xxx kernel: eax: ffffffd8   ebx: f2dc9dfc   ecx: 00000000   edx:
f91b9c80
Jul 17 20:13:59 xxx kernel: esi: f2dc9ccc   edi: 00000001   ebp: c031fc78   esp:
ef669eec
Jul 17 20:13:59 xxx kernel: ds: 007b   es: 007b   ss: 0068
Jul 17 20:13:59 xxx kernel: Process umount (pid: 27723, threadinfo=ef669000
task=f7d8e230)
Jul 17 20:13:59 xxx kernel: Stack: f2dc9dfc ef669f10 c016b829 f2dc9dfc c016b89c
00000000 f5e9be00 f5e9be00
Jul 17 20:13:59 xxx kernel:        c016ba05 ef669f10 ef669f10 ef669000 f5e9be00
00000000 f922ae40 c015b452
Jul 17 20:13:59 xxx kernel:        00000064 f5e9be00 f922af80 ef669000 c015bccf
f5dd2400 f9200e37 f5e9be40
Jul 17 20:13:59 xxx kernel: Call Trace:
Jul 17 20:13:59 xxx kernel:  [<c016b829>] clear_inode+0xcc/0xf8
Jul 17 20:13:59 xxx kernel:  [<c016b89c>] dispose_list+0x47/0x6d
Jul 17 20:13:59 xxx kernel:  [<c016ba05>] invalidate_inodes+0x96/0xae
Jul 17 20:13:59 xxx kernel:  [<c015b452>] generic_shutdown_super+0x8c/0x154
Jul 17 20:13:59 xxx kernel:  [<c015bccf>] kill_anon_super+0x9/0x30
Jul 17 20:13:59 xxx kernel:  [<f9200e37>] nfs_kill_super+0xc/0x63 [nfs]
Jul 17 20:13:59 xxx kernel:  [<c015b314>] deactivate_super+0x5b/0x70
Jul 17 20:13:59 xxx kernel:  [<c016e2cb>] sys_umount+0x65/0x6c
Jul 17 20:13:59 xxx kernel:  [<c0169cfc>] dput+0x34/0x19b
Jul 17 20:13:59 xxx kernel:  [<c0156d33>] __fput+0xda/0x100
Jul 17 20:13:59 xxx kernel:  [<c0155962>] sys_close+0x67/0x71
Jul 17 20:13:59 xxx kernel:  [<c016e2dd>] sys_oldumount+0xb/0xe
Jul 17 20:13:59 xxx kernel:  [<c02c7377>] syscall_call+0x7/0xb
Jul 17 20:13:59 xxx kernel: Code: 00 58 8d 43 c4 39 43 c4 74 08 0f 0b 98 00 1c
c9 21 f9 8b 86 b4 00 00 00 85 c0 74 05 e8 3b 37 fa ff 8b 86 ac 00 00 00 85 c0 74
08 <0f> 0b 9c 00 1c c9 21 f9 5b 5e c3 8b 80 80 01 00 00 8b 00 85 c0
Jul 17 20:13:59 xxx kernel:  <0>Fatal exception: panic in 5 seconds


Comment 6 GV Govindasamy 2005-07-22 00:30:37 UTC
Another note: I am seeing a lot of messages like the following on the console:

kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice
day...

and a few like this:

kernel: nfs_proc_symlink: TEST/SYMLINK_EEXIST_link already exists??


Comment 7 Chuck Lever 2005-07-23 14:25:26 UTC
from /var/log/messages, it looks like there's an oops first, then the panic follows.

Jul 18 22:25:12 sdgsim-c12 kernel: kernel BUG at fs/nfs/inode.c:156!
Jul 18 22:25:12 sdgsim-c12 kernel: invalid operand: 0000 [#1]
Jul 18 22:25:12 sdgsim-c12 kernel: SMP
Jul 18 22:25:12 sdgsim-c12 kernel: Modules linked in: md5 ipv6 autofs4 i2c_dev i2c_core nfs lockd sunrpc dm_mod button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd ata_piix libata sd_mod scsi_mod
Jul 18 22:25:12 sdgsim-c12 kernel: CPU:    0
Jul 18 22:25:12 sdgsim-c12 kernel: EIP:    0060:[<f91a1bcd>]    Not tainted VLI
Jul 18 22:25:12 sdgsim-c12 kernel: EFLAGS: 00010286   (2.6.9-11.ELsmp)
Jul 18 22:25:12 sdgsim-c12 kernel: EIP is at nfs_clear_inode+0x3f/0x4a [nfs]
Jul 18 22:25:12 sdgsim-c12 kernel: eax: ffffffd8   ebx: e7841b74   ecx: 00000000   edx: f915cc80
Jul 18 22:25:12 sdgsim-c12 kernel: esi: e7841a44   edi: 00000001   ebp: c031fc78   esp: e60a2eec
Jul 18 22:25:12 sdgsim-c12 kernel: ds: 007b   es: 007b   ss: 0068
Jul 18 22:25:12 sdgsim-c12 kernel: Process umount (pid: 9414, threadinfo=e60a2000 task=f733ecb0)
Jul 18 22:25:12 sdgsim-c12 kernel: Stack: e7841b74 e60a2f10 c016b829 e7841b74 c016b89c 00000000 c365f400 c365f400
Jul 18 22:25:12 sdgsim-c12 kernel:        c016ba05 e60a2f10 e60a2f10 e60a2000 c365f400 00000000 f91cde40 c015b452
Jul 18 22:25:12 sdgsim-c12 kernel:        0000003a c365f400 f91cdf80 e60a2000 c015bccf f7c04400 f91a3e37 c365f440
Jul 18 22:25:12 sdgsim-c12 kernel: Call Trace:
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c016b829>] clear_inode+0xcc/0xf8
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c016b89c>] dispose_list+0x47/0x6d
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c016ba05>] invalidate_inodes+0x96/0xae
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c015b452>] generic_shutdown_super+0x8c/0x154
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c015bccf>] kill_anon_super+0x9/0x30
Jul 18 22:25:12 sdgsim-c12 kernel:  [<f91a3e37>] nfs_kill_super+0xc/0x63 [nfs]
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c015b314>] deactivate_super+0x5b/0x70
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c016e2cb>] sys_umount+0x65/0x6c
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c0169cfc>] dput+0x34/0x19b
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c0156d33>] __fput+0xda/0x100
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c0155962>] sys_close+0x67/0x71
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c016e2dd>] sys_oldumount+0xb/0xe
Jul 18 22:25:12 sdgsim-c12 kernel:  [<c02c7377>] syscall_call+0x7/0xb
Jul 18 22:25:12 sdgsim-c12 kernel: Code: 00 58 8d 43 c4 39 43 c4 74 08 0f 0b 98 00 1c f9 1b f9 8b 86 b4 00 00 00 85 c0 74 05 e8 3b 37 fa ff 8b 86 ac 00 00 00 85 c0 74 08 <0f> 0b 9c 00 1c f9 1b f9 5b 5e c3 8b 80 80 01 00 00 8b 00 85 c0
Jul 18 22:25:12 sdgsim-c12 kernel:  <0>Fatal exception: panic in 5 seconds

fs/nfs/inode.c:156 is this:

static void
nfs_clear_inode(struct inode *inode)
{
        struct nfs_inode *nfsi = NFS_I(inode);
        struct rpc_cred *cred;

        nfs_wb_all(inode);
        BUG_ON (!list_empty(&nfsi->open_files));
        cred = nfsi->cache_access.cred;
        if (cred)
                put_rpccred(cred);
        BUG_ON(atomic_read(&nfsi->data_updates) != 0);
}

the second BUG_ON -- data_updates is not zero.  trond suspects asynchronous unlink.
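
The claim that the trap is the second BUG_ON can be cross-checked against the "Code:" bytes of the oops. What follows is a minimal sketch, assuming the i386 BUG() encoding used by 2.6-era kernels built with verbose BUG reporting: a ud2 opcode (0f 0b), then a 16-bit little-endian line number, then a 32-bit pointer to the file name. bug_line_at() is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the source line encoded after the ud2 opcode (0f 0b) that starts
 * at bytes[off], or -1 if there is no ud2 there or too few bytes follow.
 * Assumed layout: ud2, 16-bit little-endian line, 32-bit file-name pointer.
 */
static int bug_line_at(const uint8_t *bytes, size_t len, size_t off)
{
    assert(bytes != NULL);
    if (off + 4 > len)
        return -1;
    if (bytes[off] != 0x0f || bytes[off + 1] != 0x0b)
        return -1;
    return bytes[off + 2] | (bytes[off + 3] << 8);
}

/*
 * First 51 bytes of the "Code:" line from the oops in this comment.
 * The byte the kernel printed as <0f> (the faulting EIP) is at index 43.
 */
static const uint8_t oops_code[] = {
    0x00, 0x58, 0x8d, 0x43, 0xc4, 0x39, 0x43, 0xc4, 0x74, 0x08,
    0x0f, 0x0b, 0x98, 0x00, 0x1c, 0xf9, 0x1b, 0xf9,
    0x8b, 0x86, 0xb4, 0x00, 0x00, 0x00, 0x85, 0xc0, 0x74, 0x05,
    0xe8, 0x3b, 0x37, 0xfa, 0xff,
    0x8b, 0x86, 0xac, 0x00, 0x00, 0x00, 0x85, 0xc0, 0x74, 0x08,
    0x0f, 0x0b, 0x9c, 0x00, 0x1c, 0xf9, 0x1b, 0xf9
};
```

Calling bug_line_at(oops_code, sizeof oops_code, 43) decodes 0x009c = 156, matching "kernel BUG at fs/nfs/inode.c:156". The earlier ud2 at index 10 decodes to 0x0098 = 152, which fits the first BUG_ON sitting four lines above the second in the listing.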


Comment 8 Chuck Lever 2005-07-23 15:07:20 UTC
Created attachment 117101
don't manipulate data_updates around async unlinks

since this BUG_ON is hit only at umount time, trond believes the problem is a
race between an async unlink and nfs_clear_inode.  removing the
nfs_{begin,end}_data_update around the async unlink should prevent tripping the
BUG_ON in nfs_clear_inode.
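
The ordering described above can be replayed in a small userspace model. This is only a sketch under stated assumptions: it is single-threaded, it replays one interleaving (umount tears the inode down before the async unlink completes), and struct model_inode plus every function name here are hypothetical stand-ins, not the kernel's code or the attached patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one nfs_inode field that matters here. */
struct model_inode {
    int data_updates;   /* models atomic_t nfsi->data_updates */
};

static void begin_data_update(struct model_inode *i) { i->data_updates++; }
static void end_data_update(struct model_inode *i)   { i->data_updates--; }

/* Models the last check in nfs_clear_inode(): true means BUG_ON fires. */
static bool clear_inode_would_bug(const struct model_inode *i)
{
    return i->data_updates != 0;
}

/*
 * Replay one ordering: the async unlink is issued, umount tears the inode
 * down, and only afterwards does the unlink's completion run.
 * bracket_unlink = true models the code before the patch (counter bumped
 * around the async unlink); false models the code with the patch applied.
 * Returns whether the teardown check would have fired.
 */
static bool umount_before_unlink_completes(bool bracket_unlink)
{
    struct model_inode inode = { 0 };
    bool bugged;

    if (bracket_unlink)
        begin_data_update(&inode);          /* issued with the async unlink */

    bugged = clear_inode_would_bug(&inode); /* umount wins the race */

    if (bracket_unlink)
        end_data_update(&inode);            /* completion arrives too late */

    assert(inode.data_updates == 0);        /* the counter does balance eventually */
    return bugged;
}
```

In this model umount_before_unlink_completes(true) reports that the check fires, while umount_before_unlink_completes(false) does not. The counter is balanced in both cases; it is only the moment at which teardown samples it that differs.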

Comment 9 Steve Dickson 2005-07-25 07:08:31 UTC
So this patch did stop the BUG_ON from popping? Also was there
any particular test you were running that caused this problem?

Comment 10 GV Govindasamy 2005-07-25 07:43:51 UTC
The patch has stopped the BUG_ON from popping so far (for one day), but I should
mention two other changes I made to the same system just before applying
the patch.
1. Number of tests running in parallel reduced from 6 to 4.
2. Set cron to reboot the system early every morning.

I will keep an eye on the results for the next couple of days and gradually
return the system to its original settings.

No, no specific test has been identified for this bug yet.

Comment 11 Steve Dickson 2005-07-26 15:20:10 UTC
First of all... thanks for the time and effort you are putting into
this... it's definitely appreciated!

I'll go ahead and try to reproduce this here with some
stress testing... but a definitive yea or nay that this
patch (and only this patch) does fix the problem
would be good....

Comment 15 Steve Dickson 2005-07-26 21:37:48 UTC
In http://people.redhat.com/steved/bz163738 are SMP and UP kernels
that have just the above patch. Unfortunately, I'm still not able to reproduce
this problem, so would it be possible to pop one of these kernels on
to your test machines to see if it takes care of the problem?

Please note, the window on U2 patches closed as of last week.
So for us to get this in we will need a definitive answer, very soon,
on whether this patch does or does not fix the problem.

Again, thanks for all your help!!!

Comment 16 Chuck Lever 2005-07-27 13:01:10 UTC
hi steve, don't worry about U2 on our account.  i understand where you guys are
with your release cycle.  at this point we just want a hotfix kernel that will
allow us to continue running.  let's consider the fix for U3, though.

we are confident that the patch is a successful workaround for the problem, but
i think we should put the fix through a full series of testing to make sure it
doesn't expose other issues.

the real issue is a race between async unlink and umount, which leaves
nfsi->data_updates positive in nfs_clear_inode, and that triggers the BUG_ON. 
trond and olaf had a conversation at OLS about some architectural changes that
would be needed upstream to address the root cause.

in the meantime, that gives us a clue about what might be a good reproducer.  a
lot of silly renames followed immediately by a umount will probably do it.
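
A sketch of the first half of that reproducer idea, assuming a POSIX userspace and a directory on the NFS mount under test. On an NFS client, unlinking a file that is still open causes it to be renamed to a .nfsXXXX name (a silly rename) and removed once the last reference is dropped. silly_rename_burst() and the sillytest.N file names are hypothetical, and this has not been run against the affected kernel. The umount that should follow immediately needs root and is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create, unlink-while-open, then close `count` files under `dir`.
 * On a local filesystem this is an ordinary unlink of an open file;
 * on an NFS mount each unlink is expected to become a silly rename.
 * Returns 0 on success, -1 on the first failure.
 */
static int silly_rename_burst(const char *dir, int count)
{
    char path[4096];

    assert(dir != NULL);
    for (int i = 0; i < count; i++) {
        int n = snprintf(path, sizeof path, "%s/sillytest.%d", dir, i);
        if (n < 0 || (size_t)n >= sizeof path)
            return -1;

        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0)
            return -1;
        if (unlink(path) != 0) {    /* file is still open here */
            close(fd);
            return -1;
        }
        if (close(fd) != 0)         /* last reference dropped */
            return -1;
    }
    return 0;
}
```

To approximate the scenario, call silly_rename_burst() with a directory on the NFS mount and a large count, then unmount that filesystem right away. Whether this actually hits the race window will depend on timing and load.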

Comment 17 Steve Dickson 2005-07-27 16:01:04 UTC
OK...

Do the kernels in http://people.redhat.com/steved/bz163738 keep you going?

Comment 18 Chuck Lever 2005-07-27 16:08:04 UTC
i've got a 2.6.9-11.EL kernel build going here with all the fixes we need so far.

the only thing i'm concerned about now is what will happen if we hit other
issues with my custom-built kernel -- will you guys still support it, or should
we switch over to a hotfix kernel you built to continue getting help?

Comment 19 Steve Dickson 2005-07-27 18:08:35 UTC
Well it will be easier for me to get things in if I know exactly
what you're running... Meaning if you run a kernel from my
people page and it fixes the problem, I can go back to our
review process and definitely say the proposed patch fixes
the problem. But if you're running a kernel you built, I don't
have the same confidence there is not something "extra"
in your kernel...

So I would suggest you do both... once a problem is identified
and has a proposed patch, I can build a "people page" kernel
with just that patch. Then, assuming the problem is reproducible,
that kernel can be tested to see if it truly does fix the problem.

But in the meantime, while I'm building the "people page" kernel,
I would strongly suggest you throw the proposed patch in
your local kernel and keep testing... The last thing I want to do
is get in your way or slow down your process.

Please note, the only _officially_ supported kernels are released and
"hotfix" kernels. But with you being a partner and all, I really think
if you find a problem with a kernel that went through our build
process, people would be very interested... I know I will if it has
anything to do with NFS.... ;-)



Comment 20 Chuck Lever 2005-08-08 15:03:30 UTC
running a 2.6.9-11.EL kernel with the attached patch (comment #8), we have not
seen any more panics.  we are confident this patch addresses the panic.



Comment 28 Issue Tracker 2006-01-09 15:08:46 UTC
From User-Agent: XML-RPC

Bugzillas 163738 and 164298 have been committed to the latest RHEL4 U3
beta. 

Action for NetApp: Please test and provide feedback ASAP.


This event sent from IssueTracker by andriusb
 issue 76661

Comment 30 Red Hat Bugzilla 2006-03-07 19:21:44 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html


