From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
While running our tests, RHEL4 panics intermittently with the same stack trace. I am attaching an image of the trace to this bug.

Version-Release number of selected component (if applicable):
kernel-2.6.9-11.EL

How reproducible:
Didn't try

Additional info:
# uname -a Linux xxx.xxx.xxx.com 2.6.9-11.ELsmp #1 SMP Fri May 20 18:26:27 EDT 2005 i686 i686 i386 GNU/Linux # dmesg 2005 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009cc00 (usable) BIOS-e820: 000000000009cc00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000ea070 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 00000000dffc0000 (usable) BIOS-e820: 00000000dffc0000 - 00000000dffcf000 (ACPI data) BIOS-e820: 00000000dffcf000 - 00000000dfff0000 (ACPI NVS) BIOS-e820: 00000000dfff0000 - 00000000e0000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec86000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved) BIOS-e820: 0000000100000000 - 0000000120000000 (usable) 3712MB HIGHMEM available. 896MB LOWMEM available. found SMP MP-table at 000ff780 On node 0 totalpages: 1179648 DMA zone: 4096 pages, LIFO batch:1 Normal zone: 225280 pages, LIFO batch:16 HighMem zone: 950272 pages, LIFO batch:16 DMI 2.3 present. Using APIC driver default ACPI: RSDP (v000 ACPIAM ) @ 0x000f7710 ACPI: RSDT (v001 A M I OEMRSDT 0x09000424 MSFT 0x00000097) @ 0xdffc0000 ACPI: FADT (v002 A M I OEMFACP 0x09000424 MSFT 0x00000097) @ 0xdffc0200 ACPI: MADT (v001 A M I OEMAPIC 0x09000424 MSFT 0x00000097) @ 0xdffc0390 ACPI: OEMB (v001 A M I AMI_OEM 0x09000424 MSFT 0x00000097) @ 0xdffcf040 ACPI: DSDT (v001 0ABDI 0ABDI007 0x00000007 INTL 0x02002026) @ 0x00000000 ACPI: PM-Timer IO Port: 0x408 ACPI: Local APIC address 0xfee00000 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) Processor #0 15:3 APIC version 20 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled) Processor #6 15:3 APIC version 20 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled) Processor #1 15:3 APIC version 20 ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled) Processor #7 15:3 APIC version 20 Enabling APIC mode: Flat. 
Using 0 I/O APICs ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 ACPI: IOAPIC (id[0x09] address[0xfec10000] gsi_base[24]) IOAPIC[1]: apic_id 9, version 32, address 0xfec10000, GSI 24-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. Using ACPI (MADT) for SMP configuration information Built 1 zonelists Kernel command line: auto BOOT_IMAGE=2.6.9-11.ELsmp ro BOOT_FILE=/boot/vmlinuz-2.6.9-11.ELsmp rhgb quiet root=LABEL=/ Initializing CPU#0 CPU 0 irqstacks, hard=c03db000 soft=c03bb000 PID hash table entries: 4096 (order: 12, 65536 bytes) Detected 2801.374 MHz processor. Using tsc for high-res timesource Console: colour VGA+ 80x25 Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) Memory: 4150216k/4718592k available (1824k kernel code, 42940k reserved, 744k data, 176k init, 3276544k highmem) Calibrating delay loop... 5521.40 BogoMIPS (lpj=2760704) Security Scaffold v1.0.0 initialized SELinux: Initializing. SELinux: Starting in permissive mode There is already a security framework initialized, register_security failed. selinux_register_security: Registering secondary module capability Capability LSM initialized as secondary Mount-cache hash table entries: 512 (order: 0, 4096 bytes) CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000 CPU: After vendor identify, caps: bfebfbff 20000000 00000000 00000000 monitor/mwait feature present. using mwait in idle threads. CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 0 CPU: After all inits, caps: bfebf3ff 20000000 00000000 00000080 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0. 
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available CPU0: Thermal monitoring enabled Enabling fast FPU save and restore... done. Enabling unmasked SIMD FPU exception support... done. Checking 'hlt' instruction... OK. CPU0: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04 per-CPU timeslice cutoff: 2925.33 usecs. task migration cache decay timeout: 3 msecs. Booting processor 1/1 eip 3000 CPU 1 irqstacks, hard=c03dc000 soft=c03bc000 Initializing CPU#1 Calibrating delay loop... 5586.94 BogoMIPS (lpj=2793472) CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000 CPU: After vendor identify, caps: bfebfbff 20000000 00000000 00000000 monitor/mwait feature present. CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 0 CPU: After all inits, caps: bfebf3ff 20000000 00000000 00000080 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#1. CPU1: Intel P4/Xeon Extended MCE MSRs (24) available CPU1: Thermal monitoring enabled CPU1: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04 Booting processor 2/6 eip 3000 CPU 2 irqstacks, hard=c03dd000 soft=c03bd000 Initializing CPU#2 Calibrating delay loop... 5586.94 BogoMIPS (lpj=2793472) CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000 CPU: After vendor identify, caps: bfebfbff 20000000 00000000 00000000 monitor/mwait feature present. CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 3 CPU: After all inits, caps: bfebf3ff 20000000 00000000 00000080 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#2. CPU2: Intel P4/Xeon Extended MCE MSRs (24) available CPU2: Thermal monitoring enabled CPU2: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04 Booting processor 3/7 eip 3000 CPU 3 irqstacks, hard=c03de000 soft=c03be000 Initializing CPU#3 Calibrating delay loop... 
5586.94 BogoMIPS (lpj=2793472) CPU: After generic identify, caps: bfebfbff 20000000 00000000 00000000 CPU: After vendor identify, caps: bfebfbff 20000000 00000000 00000000 monitor/mwait feature present. CPU: Trace cache: 12K uops, L1 D cache: 16K CPU: L2 cache: 1024K CPU: Physical Processor ID: 3 CPU: After all inits, caps: bfebf3ff 20000000 00000000 00000080 Intel machine check architecture supported. Intel machine check reporting enabled on CPU#3. CPU3: Intel P4/Xeon Extended MCE MSRs (24) available CPU3: Thermal monitoring enabled CPU3: Intel(R) Xeon(TM) CPU 3.00GHz stepping 04 Total of 4 processors activated (22282.24 BogoMIPS). ENABLING IO-APIC IRQs ..TIMER: vector=0x31 pin1=2 pin2=-1 checking TSC synchronization across 4 CPUs: passed. Brought up 4 CPUs zapping low mappings. checking if image is initramfs... it is Freeing initrd memory: 472k freed NET: Registered protocol family 16 PCI: PCI BIOS revision 2.10 entry at 0xf0031, last bus=3 PCI: Using configuration type 1 mtrr: v2.0 (20020519) ACPI: Subsystem revision 20040816 ACPI: Interpreter enabled ACPI: Using IOAPIC for interrupt routing ACPI: PCI Root Bridge [PCI0] (00:00) PCI: Probing PCI hardware (bus 00) PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.2 PCI: Transparent bridge - 0000:00:1e.0 ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EPA0._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 *10 11 12 14 15) ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 10 *11 12 14 15) ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 *10 11 12 14 15) ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 *7 10 11 12 14 15) ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled. ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled. ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled. 
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 *5 6 7 10 11 12 14 15) Linux Plug and Play Support v0.97 (c) Adam Belay usbcore: registered new driver usbfs usbcore: registered new driver hub PCI: Using ACPI for IRQ routing ACPI: PCI interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 169 ACPI: PCI interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 169 ACPI: PCI interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 177 ACPI: PCI interrupt 0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 185 ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 193 ACPI: PCI interrupt 0000:00:1f.3[B] -> GSI 17 (level, low) -> IRQ 201 ACPI: PCI interrupt 0000:03:04.0[A] -> GSI 18 (level, low) -> IRQ 193 ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 17 (level, low) -> IRQ 201 apm: BIOS not found. audit: initializing netlink socket (disabled) audit(1121858240.453:0): initialized highmem bounce pool size: 64 pages Total HugeTLB memory allocated, 0 VFS: Disk quotas dquot_6.5.1 Dquot-cache hash table entries: 1024 (order 0, 4096 bytes) SELinux: Registering netfilter hooks Initializing Cryptographic API ksign: Installing public key data Loading keyring - Added public key D67B3E6B1ED6FEC7 - User ID: Red Hat, Inc. 
(Kernel Module GPG key) pci_hotplug: PCI Hot Plug PCI Core version: 0.5 ACPI: Processor [CPU1] (supports C1, 8 throttling states) ACPI: Processor [CPU2] (supports C1) ACPI: Processor [CPU3] (supports C1) ACPI: Processor [CPU4] (supports C1) Real Time Clock Driver v1.12 Linux agpgart interface v0.100 (c) Dave Jones serio: i8042 AUX port at 0x60,0x64 irq 12 serio: i8042 KBD port at 0x60,0x64 irq 1 Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing enabled ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize divert: not allocating divert_blk for non-ethernet device lo Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx Probing IDE interface ide0... hda: CD-224E, ATAPI CD/DVD-ROM drive ide1: I/O resource 0x170-0x177 not free. ide1: ports already in use, skipping probe Probing IDE interface ide2... Probing IDE interface ide3... Probing IDE interface ide4... Probing IDE interface ide5... Using cfq io scheduler ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: ATAPI 24X CD-ROM drive, 128kB Cache Uniform CD-ROM driver Revision: 3.20 ide-floppy driver 0.99.newide usbcore: registered new driver hiddev usbcore: registered new driver usbhid drivers/usb/input/hid-core.c: v2.0:USB HID core driver mice: PS/2 mouse device common for all mice md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27 NET: Registered protocol family 2 IP: routing cache hash table of 32768 buckets, 512Kbytes TCP: Hash tables configured (established 262144 bind 43690) Initializing IPsec netlink socket NET: Registered protocol family 1 NET: Registered protocol family 17 ACPI: (supports S0 S1 S3 S4 S4bios S5) ACPI wakeup devices: EPA0 EPA1 EPB0 EPB1 EPC0 P0P1 MC97 USB1 USB2 EUSB P0PC SLPB Freeing unused kernel memory: 176k freed SCSI subsystem initialized libata version 1.10 loaded. 
ata_piix version 1.03 ata_piix: combined mode detected ACPI: PCI interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 193 ata: 0x1f0 IDE port busy PCI: Setting latency timer of device 0000:00:1f.2 to 64 ata1: SATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xFC08 irq 15 ata1: dev 1 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3469 86:3c01 87:4003 88:207f ata1: dev 1 ATA, max UDMA/133, 156301488 sectors: lba48 ata1: dev 1 configured for UDMA/133 scsi0 : ata_piix Vendor: ATA Model: ST380013AS Rev: 3.19 Type: Direct-Access ANSI SCSI revision: 05 SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB) SCSI device sda: drive cache: write back sda: sda1 sda2 Attached scsi disk sda at scsi0, channel 0, id 1, lun 0 EXT3-fs: INFO: recovery required on readonly filesystem. EXT3-fs: write access will be enabled during recovery. kjournald starting. Commit interval 5 seconds EXT3-fs: sda1: orphan cleanup on readonly fs ext3_orphan_cleanup: deleting unreferenced inode 40747 ext3_orphan_cleanup: deleting unreferenced inode 43476 ext3_orphan_cleanup: deleting unreferenced inode 42801 ext3_orphan_cleanup: deleting unreferenced inode 854049 ext3_orphan_cleanup: deleting unreferenced inode 164605 ext3_orphan_cleanup: deleting unreferenced inode 494818 ext3_orphan_cleanup: deleting unreferenced inode 493756 ext3_orphan_cleanup: deleting unreferenced inode 494793 ext3_orphan_cleanup: deleting unreferenced inode 7504011 ext3_orphan_cleanup: deleting unreferenced inode 7504003 ext3_orphan_cleanup: deleting unreferenced inode 43402 ext3_orphan_cleanup: deleting unreferenced inode 42787 ext3_orphan_cleanup: deleting unreferenced inode 494781 ext3_orphan_cleanup: deleting unreferenced inode 42051 ext3_orphan_cleanup: deleting unreferenced inode 494784 EXT3-fs: sda1: 15 orphan inodes deleted EXT3-fs: recovery complete. EXT3-fs: mounted filesystem with ordered data mode. SELinux: Disabled at runtime. 
SELinux: Unregistering netfilter hooks inserting floppy driver for 2.6.9-11.ELsmp Floppy drive(s): fd0 is 1.44M FDC 0 is a post-1991 82077 Intel(R) PRO/1000 Network Driver - version 5.6.10.1-k2-NAPI Copyright (c) 1999-2004 Intel Corporation. ACPI: PCI interrupt 0000:03:04.0[A] -> GSI 18 (level, low) -> IRQ 193 divert: allocating divert_blk for eth0 e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection hw_random: RNG not detected ACPI: PCI interrupt 0000:00:1d.7[D] -> GSI 23 (level, low) -> IRQ 185 ehci_hcd 0000:00:1d.7: EHCI Host Controller PCI: Setting latency timer of device 0000:00:1d.7 to 64 ehci_hcd 0000:00:1d.7: irq 185, pci mem f8806c00 ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1 PCI: cache line size of 128 is not supported by device 0000:00:1d.7 ehci_hcd 0000:00:1d.7: USB 2.0 enabled, EHCI 1.00, driver 2004-May-10 hub 1-0:1.0: USB hub found hub 1-0:1.0: 4 ports detected USB Universal Host Controller Interface driver v2.2 ACPI: PCI interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 169 uhci_hcd 0000:00:1d.0: UHCI Host Controller PCI: Setting latency timer of device 0000:00:1d.0 to 64 uhci_hcd 0000:00:1d.0: irq 169, io base 0000e800 uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2 hub 2-0:1.0: USB hub found hub 2-0:1.0: 2 ports detected ACPI: PCI interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 177 uhci_hcd 0000:00:1d.1: UHCI Host Controller PCI: Setting latency timer of device 0000:00:1d.1 to 64 uhci_hcd 0000:00:1d.1: irq 177, io base 0000ec00 uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 3 hub 3-0:1.0: USB hub found hub 3-0:1.0: 2 ports detected ip_tables: (C) 2000-2002 Netfilter core team md: Autodetecting RAID arrays. md: autorun ... md: ... autorun DONE. 
ACPI: Power Button (FF) [PWRF] ACPI: Sleep Button (CM) [SLPB] e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex e1000: eth0: e1000_watchdog: NIC Link is Down e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex EXT3 FS on sda1, internal journal device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm.com cdrom: open failed. Adding 16707592k swap on /dev/sda2. Priority:-1 extents:1 ip_tables: (C) 2000-2002 Netfilter core team ip_tables: (C) 2000-2002 Netfilter core team i2c /dev entries driver NET: Registered protocol family 10 Disabled Privacy Extensions on device c03356c0(lo) IPv6 over IPv4 tunneling driver divert: not allocating divert_blk for non-ethernet device sit0 ip_tables: (C) 2000-2002 Netfilter core team eth0: no IPv6 routers present # /sbin/lspci -v 00:00.0 Host bridge: Intel Corporation E7320 Memory Controller Hub (rev 0a) Subsystem: Intel Corporation: Unknown device 0000 Flags: bus master, fast devsel, latency 0 Capabilities: <available only to root> 00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 0a) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 Capabilities: <available only to root> 00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02) (prog-if 00 [Normal decode]) Flags: bus master, 66Mhz, fast devsel, latency 32 Bus: primary=00, secondary=02, subordinate=02, sec-latency=32 Capabilities: <available only to root> 00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) (prog-if 00 [UHCI]) Subsystem: Intel Corporation: Unknown device 24d0 Flags: bus master, medium devsel, latency 0, IRQ 169 I/O ports at e800 [size=32] 00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02) (prog-if 00 [UHCI]) Subsystem: Intel Corporation: Unknown device 24d0 Flags: bus master, medium devsel, latency 0, IRQ 177 I/O ports at ec00 [size=32] 
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02) Flags: medium devsel Memory at febff800 (32-bit, non-prefetchable) [size=16] 00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02) (prog-if 20 [IO(X)-APIC]) Flags: bus master, fast devsel, latency 0 Capabilities: <available only to root> 00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02) (prog-if 20 [EHCI]) Subsystem: Intel Corporation: Unknown device 24d0 Flags: bus master, medium devsel, latency 0, IRQ 185 Memory at febffc00 (32-bit, non-prefetchable) [size=1K] Capabilities: <available only to root> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a) (prog-if 00 [Normal decode]) Flags: bus master, fast devsel, latency 0 Bus: primary=00, secondary=03, subordinate=03, sec-latency=32 I/O behind bridge: 0000d000-0000dfff Memory behind bridge: fca00000-feafffff Prefetchable memory behind bridge: fc800000-fc8fffff 00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02) Flags: bus master, medium devsel, latency 0 00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02) (prog-if 8a [Master SecP PriP]) Subsystem: Intel Corporation 6300ESB SATA Storage Controller Flags: bus master, 66Mhz, medium devsel, latency 0, IRQ 193 I/O ports at <unassigned> I/O ports at <unassigned> I/O ports at <unassigned> I/O ports at <unassigned> I/O ports at fc00 [size=16] 00:1f.3 SMBus: Intel Corporation 6300ESB SMBus Controller (rev 02) Subsystem: Intel Corporation: Unknown device 24d0 Flags: medium devsel, IRQ 201 I/O ports at 0540 [size=32] 03:04.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller Subsystem: Super Micro Computer Inc: Unknown device 1076 Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 193 Memory at feaa0000 (64-bit, non-prefetchable) [size=128K] I/O ports at dc00 [size=64] Capabilities: <available only to root> 03:05.0 VGA 
compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA]) Subsystem: ATI Technologies Inc Rage XL Flags: bus master, stepping, medium devsel, latency 32, IRQ 201 Memory at fd000000 (32-bit, non-prefetchable) [size=16M] I/O ports at d800 [size=256] Memory at feafe000 (32-bit, non-prefetchable) [size=4K] Expansion ROM at fea60000 [disabled] [size=128K] Capabilities: <available only to root>
Created attachment 116991 [details] stack trace
hmm.. what test were you running? Also have you been able to reproduce this?
We are running around 600 tests, at least 6 at a time in parallel, so it is not straightforward to determine which test actually caused this. I haven't tried reproducing it.
Clear text from /var/log/messages:

Jul 17 20:13:59 xxx kernel: ------------[ cut here ]------------
Jul 17 20:13:59 xxx kernel: kernel BUG at fs/nfs/inode.c:156!
Jul 17 20:13:59 xxx kernel: invalid operand: 0000 [#1]
Jul 17 20:13:59 xxx kernel: SMP
Jul 17 20:13:59 xxx kernel: Modules linked in: autofs4 i2c_dev i2c_core nfs lockd sunrpc dm_mod button battery ac md5 ipv6 uhci_hcd ehci_hcd e1000 floppy ext3 jbd ata_piix libata sd_mod scsi_mod
Jul 17 20:13:59 xxx kernel: CPU: 2
Jul 17 20:13:59 xxx kernel: EIP: 0060:[<f91febcd>] Not tainted VLI
Jul 17 20:13:59 xxx kernel: EFLAGS: 00010286 (2.6.9-11.ELsmp)
Jul 17 20:13:59 xxx kernel: EIP is at nfs_clear_inode+0x3f/0x4a [nfs]
Jul 17 20:13:59 xxx kernel: eax: ffffffd8 ebx: f2dc9dfc ecx: 00000000 edx: f91b9c80
Jul 17 20:13:59 xxx kernel: esi: f2dc9ccc edi: 00000001 ebp: c031fc78 esp: ef669eec
Jul 17 20:13:59 xxx kernel: ds: 007b es: 007b ss: 0068
Jul 17 20:13:59 xxx kernel: Process umount (pid: 27723, threadinfo=ef669000 task=f7d8e230)
Jul 17 20:13:59 xxx kernel: Stack: f2dc9dfc ef669f10 c016b829 f2dc9dfc c016b89c 00000000 f5e9be00 f5e9be00
Jul 17 20:13:59 xxx kernel: c016ba05 ef669f10 ef669f10 ef669000 f5e9be00 00000000 f922ae40 c015b452
Jul 17 20:13:59 xxx kernel: 00000064 f5e9be00 f922af80 ef669000 c015bccf f5dd2400 f9200e37 f5e9be40
Jul 17 20:13:59 xxx kernel: Call Trace:
Jul 17 20:13:59 xxx kernel: [<c016b829>] clear_inode+0xcc/0xf8
Jul 17 20:13:59 xxx kernel: [<c016b89c>] dispose_list+0x47/0x6d
Jul 17 20:13:59 xxx kernel: [<c016ba05>] invalidate_inodes+0x96/0xae
Jul 17 20:13:59 xxx kernel: [<c015b452>] generic_shutdown_super+0x8c/0x154
Jul 17 20:13:59 xxx kernel: [<c015bccf>] kill_anon_super+0x9/0x30
Jul 17 20:13:59 xxx kernel: [<f9200e37>] nfs_kill_super+0xc/0x63 [nfs]
Jul 17 20:13:59 xxx kernel: [<c015b314>] deactivate_super+0x5b/0x70
Jul 17 20:13:59 xxx kernel: [<c016e2cb>] sys_umount+0x65/0x6c
Jul 17 20:13:59 xxx kernel: [<c0169cfc>] dput+0x34/0x19b
Jul 17 20:13:59 xxx kernel: [<c0156d33>] __fput+0xda/0x100
Jul 17 20:13:59 xxx kernel: [<c0155962>] sys_close+0x67/0x71
Jul 17 20:13:59 xxx kernel: [<c016e2dd>] sys_oldumount+0xb/0xe
Jul 17 20:13:59 xxx kernel: [<c02c7377>] syscall_call+0x7/0xb
Jul 17 20:13:59 xxx kernel: Code: 00 58 8d 43 c4 39 43 c4 74 08 0f 0b 98 00 1c c9 21 f9 8b 86 b4 00 00 00 85 c0 74 05 e8 3b 37 fa ff 8b 86 ac 00 00 00 85 c0 74 08 <0f> 0b 9c 00 1c c9 21 f9 5b 5e c3 8b 80 80 01 00 00 8b 00 85 c0
Jul 17 20:13:59 xxx kernel: <0>Fatal exception: panic in 5 seconds
As another note: I am seeing a lot of messages like the following on the console:

kernel: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

and a few like this:

kernel: nfs_proc_symlink: TEST/SYMLINK_EEXIST_link already exists??
from /var/log/messages, it looks like there's an oops first, then the panic follows.

Jul 18 22:25:12 sdgsim-c12 kernel: kernel BUG at fs/nfs/inode.c:156!
Jul 18 22:25:12 sdgsim-c12 kernel: invalid operand: 0000 [#1]
Jul 18 22:25:12 sdgsim-c12 kernel: SMP
Jul 18 22:25:12 sdgsim-c12 kernel: Modules linked in: md5 ipv6 autofs4 i2c_dev i2c_core nfs lockd sunrpc dm_mod button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd ata_piix libata sd_mod scsi_mod
Jul 18 22:25:12 sdgsim-c12 kernel: CPU: 0
Jul 18 22:25:12 sdgsim-c12 kernel: EIP: 0060:[<f91a1bcd>] Not tainted VLI
Jul 18 22:25:12 sdgsim-c12 kernel: EFLAGS: 00010286 (2.6.9-11.ELsmp)
Jul 18 22:25:12 sdgsim-c12 kernel: EIP is at nfs_clear_inode+0x3f/0x4a [nfs]
Jul 18 22:25:12 sdgsim-c12 kernel: eax: ffffffd8 ebx: e7841b74 ecx: 00000000 edx: f915cc80
Jul 18 22:25:12 sdgsim-c12 kernel: esi: e7841a44 edi: 00000001 ebp: c031fc78 esp: e60a2eec
Jul 18 22:25:12 sdgsim-c12 kernel: ds: 007b es: 007b ss: 0068
Jul 18 22:25:12 sdgsim-c12 kernel: Process umount (pid: 9414, threadinfo=e60a2000 task=f733ecb0)
Jul 18 22:25:12 sdgsim-c12 kernel: Stack: e7841b74 e60a2f10 c016b829 e7841b74 c016b89c 00000000 c365f400 c365f400
Jul 18 22:25:12 sdgsim-c12 kernel: c016ba05 e60a2f10 e60a2f10 e60a2000 c365f400 00000000 f91cde40 c015b452
Jul 18 22:25:12 sdgsim-c12 kernel: 0000003a c365f400 f91cdf80 e60a2000 c015bccf f7c04400 f91a3e37 c365f440
Jul 18 22:25:12 sdgsim-c12 kernel: Call Trace:
Jul 18 22:25:12 sdgsim-c12 kernel: [<c016b829>] clear_inode+0xcc/0xf8
Jul 18 22:25:12 sdgsim-c12 kernel: [<c016b89c>] dispose_list+0x47/0x6d
Jul 18 22:25:12 sdgsim-c12 kernel: [<c016ba05>] invalidate_inodes+0x96/0xae
Jul 18 22:25:12 sdgsim-c12 kernel: [<c015b452>] generic_shutdown_super+0x8c/0x154
Jul 18 22:25:12 sdgsim-c12 kernel: [<c015bccf>] kill_anon_super+0x9/0x30
Jul 18 22:25:12 sdgsim-c12 kernel: [<f91a3e37>] nfs_kill_super+0xc/0x63 [nfs]
Jul 18 22:25:12 sdgsim-c12 kernel: [<c015b314>] deactivate_super+0x5b/0x70
Jul 18 22:25:12 sdgsim-c12 kernel: [<c016e2cb>] sys_umount+0x65/0x6c
Jul 18 22:25:12 sdgsim-c12 kernel: [<c0169cfc>] dput+0x34/0x19b
Jul 18 22:25:12 sdgsim-c12 kernel: [<c0156d33>] __fput+0xda/0x100
Jul 18 22:25:12 sdgsim-c12 kernel: [<c0155962>] sys_close+0x67/0x71
Jul 18 22:25:12 sdgsim-c12 kernel: [<c016e2dd>] sys_oldumount+0xb/0xe
Jul 18 22:25:12 sdgsim-c12 kernel: [<c02c7377>] syscall_call+0x7/0xb
Jul 18 22:25:12 sdgsim-c12 kernel: Code: 00 58 8d 43 c4 39 43 c4 74 08 0f 0b 98 00 1c f9 1b f9 8b 86 b4 00 00 00 85 c0 74 05 e8 3b 37 fa ff 8b 86 ac 00 00 00 85 c0 74 08 <0f> 0b 9c 00 1c f9 1b f9 5b 5e c3 8b 80 80 01 00 00 8b 00 85 c0
Jul 18 22:25:12 sdgsim-c12 kernel: <0>Fatal exception: panic in 5 seconds

fs/nfs/inode.c:156 is this:

static void nfs_clear_inode(struct inode *inode)
{
        struct nfs_inode *nfsi = NFS_I(inode);
        struct rpc_cred *cred;

        nfs_wb_all(inode);
        BUG_ON (!list_empty(&nfsi->open_files));
        cred = nfsi->cache_access.cred;
        if (cred)
                put_rpccred(cred);
        BUG_ON(atomic_read(&nfsi->data_updates) != 0);
}

the second BUG_ON -- data_updates is not zero. trond suspects asynchronous unlink.
Created attachment 117101 [details]
don't manipulate data_updates around async unlinks

since this BUG_ON is hit only at umount time, trond believes the problem is a race between an async unlink and nfs_clear_inode. removing the nfs_{begin,end}_data_update around the async unlink should prevent tripping the BUG_ON in nfs_clear_inode.
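To make the reasoning above concrete, here is a minimal userspace sketch of the counter logic. It is not the kernel code and not the attached patch: every `model_*` name is hypothetical, and the `bracket` flag is only an assumed stand-in for "kernel without the patch" (true) versus "kernel with the patch" (false). It shows why a teardown check on a nonzero counter can fire while an async unlink is still outstanding, and why not touching the counter around the async unlink avoids that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-inode state; only the counter is modeled. */
struct model_inode {
    atomic_int data_updates;
};

void model_inode_init(struct model_inode *mi)
{
    atomic_init(&mi->data_updates, 0);
}

/* Model of a begin/end pair that brackets an update to the inode. */
void model_begin_data_update(struct model_inode *mi)
{
    atomic_fetch_add(&mi->data_updates, 1);
}

void model_end_data_update(struct model_inode *mi)
{
    int prev = atomic_fetch_sub(&mi->data_updates, 1);
    assert(prev > 0); /* an end without a matching begin is a model bug */
}

/* Assumption: bracket=true models the unpatched behaviour, where the
 * counter stays raised until the async unlink completes. */
void model_async_unlink_start(struct model_inode *mi, bool bracket)
{
    if (bracket)
        model_begin_data_update(mi);
}

void model_async_unlink_done(struct model_inode *mi, bool bracket)
{
    if (bracket)
        model_end_data_update(mi);
}

/* Mirrors the condition of the second BUG_ON quoted earlier in the thread:
 * returns true when teardown would find a nonzero counter. */
bool model_clear_would_bug(struct model_inode *mi)
{
    return atomic_load(&mi->data_updates) != 0;
}
```

Calling `model_async_unlink_start(&mi, true)` and then `model_clear_would_bug(&mi)` before the matching `model_async_unlink_done` returns true, which corresponds to umount winning the race; with `bracket` set to false the check never fires. The sketch says nothing about whether dropping the bracketing is safe for other users of the counter, which is the concern raised later in the thread about wider testing.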
So this patch did stop the BUG_ON from popping? Also, was there any particular test you were running that caused this problem?
When I say the patch has stopped the BUG_ON from popping so far (for a day), I should mention two other changes I made to the same system just before applying the patch:

1. Number of tests running in parallel reduced from 6 to 4.
2. Set cron to reboot the system early every morning.

I will keep an eye on the results for the next couple of days and gradually return the system to its original settings. No, no specific test has been identified for this bug yet.
First of all... thanks for the time and effort you are putting into this... it's definitely appreciated! I'll go ahead and try to reproduce this here with some stress testing... but a definitive yea or nay that this patch (and only this patch) does fix the problem would be good....
In http://people.redhat.com/steved/bz163738 there are SMP and UP kernels that contain just the above patch. Unfortunately, I'm still not able to reproduce this problem, so would it be possible to put one of these kernels on your test machines to see if it takes care of the problem?

Please note, the window for U2 patches closed as of last week, so for us to get this in we will need a definitive answer, very soon, on whether this patch does or does not fix the problem. Again, thanks for all your help!!!
hi steve, don't worry about U2 on our account. i understand where you guys are with your release cycle. at this point we just want a hotfix kernel that will allow us to continue running. let's consider the fix for U3, though. we are confident that the patch is a successful workaround for the problem, but i think we should put the fix through a full series of testing to make sure it doesn't expose other issues. the real issue is a race between async unlink and umount, which leaves nfsi->data_updates positive in nfs_clear_inode, and that triggers the BUG_ON. trond and olaf had a conversation at OLS about some architectural changes that would be needed upstream to address the root cause. in the meantime, that gives us a clue about what might be a good reproducer. a lot of silly renames followed immediately by a umount will probably do it.
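A rough sketch of the reproducer idea above, under stated assumptions. On an NFS client, unlinking a file that is still open makes the client rename it to a hidden .nfsXXXX name (the "silly rename") and remove it after the last close. The helper below, whose name `sillyrename_burst` is made up for illustration, creates a batch of files, unlinks each while it is still open, then closes them all back to back so that any deferred removals are queued as late as possible. Whether this actually hits the race is unverified; the thread only says it "will probably do it".

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create `count` files in `dir`, unlink each one while
 * its descriptor is still open, then close them all in a tight loop.
 * Returns the number of files that completed the cycle, or -1 on bad
 * arguments or allocation failure. */
int sillyrename_burst(const char *dir, int count)
{
    char path[4096];
    int done = 0;
    int *fds;

    if (dir == NULL || count <= 0)
        return -1;
    fds = malloc((size_t)count * sizeof(*fds));
    if (fds == NULL)
        return -1;

    for (int i = 0; i < count; i++) {
        fds[i] = -1;
        int n = snprintf(path, sizeof(path), "%s/sr_%d_%d",
                         dir, (int)getpid(), i);
        if (n < 0 || (size_t)n >= sizeof(path))
            continue;
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0)
            continue;
        /* Unlink while still open; on NFS this is the silly-rename case. */
        if (unlink(path) != 0) {
            close(fd);
            continue;
        }
        fds[i] = fd;
    }

    /* Close everything at once so deferred removals pile up together. */
    for (int i = 0; i < count; i++) {
        if (fds[i] >= 0 && close(fds[i]) == 0)
            done++;
    }
    free(fds);
    assert(done <= count); /* sanity check on the tally */
    return done;
}
```

To try it, one would call `sillyrename_burst("/mnt/nfs/scratch", 500)` (the mount point and count are placeholders) from a small driver and run `umount` on that mount immediately afterwards, ideally in a loop. On a local filesystem the function still runs but exercises no silly rename, so it only tells you the harness itself works.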
OK... Do the kernels in http://people.redhat.com/steved/bz163738 keep you going?
i've got a 2.6.9-11.EL kernel build going here with all the fixes we need so far. the only thing i'm concerned about now is what will happen if we hit other issues with my custom-built kernel -- will you guys still support it, or should we switch over to a hotfix kernel you built to continue getting help?
Well, it will be easier for me to get things in if I know exactly what you're running... Meaning if you run a kernel from my people page and it fixes the problem, I can go back to our review process and definitely say the proposed patch fixes the problem. But if you're running a kernel you built, I don't have the same confidence there is not something "extra" in your kernel...

So I would suggest you do both... once a problem is identified and has a proposed patch, I can build a "people page" kernel with just that patch. Then, assuming the problem is reproducible, that kernel can be tested to see if it truly does fix the problem. But in the meantime, while I'm building the "people page" kernel, I would strongly suggest you throw the proposed patch in your local kernel and keep testing... The last thing I want to do is get in your way or slow down your process.

Please note, the only _officially_ supported kernels are released and "hotfix" kernels. But with you being a partner and all, I really think if you find a problem with a kernel that went through our build process, people would be very interested... I know I will if it has anything to do with NFS.... ;-)
running a 2.6.9-11.EL kernel with the attached patch (comment #8), we have not seen any more panics. we are confident this patch addresses the panic.
From User-Agent: XML-RPC

Bugzillas 163738 and 164298 have been committed to the latest RHEL4 U3 beta.

Action for NetApp: Please test and provide feedback ASAP.

This event sent from IssueTracker by andriusb
 issue 76661
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html