Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 458620

Summary:

Problem with aic79xx

Product:

Red Hat Enterprise Linux 5

Reporter:

Qian Cai <qcai>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

5.2

CC:

bdonahue, coughlan, duck, dzickus, jburke, jjarvis, jlarrew, jlarrew, lwang, pbog, pbunyan, riek, syeghiay, vgoyal

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2009-01-20 20:18:05 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
aic79xx module with debug options in capture Kernel	none
patch to reset HBA	none

Description Qian Cai 2008-08-11 03:45:05 UTC

Description of problem:
Kdump with Adaptec AIC79XX PCI-X SCSI HBA DRIVER can be unreliable. I have seen error messages, and capture Kernel panic or unable to find rootfs after loading the aic79xx.ko module,

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3851053
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2681873

Loading aic79xx.ko module
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

PCI: Enabling device 0000:08:07.1 (0156 -> 0157)
ACPI: PCI Interrupt 0000:08:07.1[B] -> GSI 24 (level, low) -> IRQ 185
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

scsi 1:0:0:0: Attempting to queue an ABORT message:CDB: 0x12 0x0 0x0 0x0 0x24 0x0
scsi1: At time of recovery, card was paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi1: Dumping Card State at program address 0x24 Mode 0x22
Card was paused
INTSTAT[0x0] SELOID[0x0] SELID[0x0] HS_MAILBOX[0x0] 
INTCTL[0x0] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33] 
SCSISIGI[0xb6] SCSIPHASE[0x4] SCSIBUS[0x0] LASTPHASE[0x1] 
SCSISEQ0[0x40] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x0] 
SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x0] 
KERNEL_QFREEZE_COUNT[0x0] MK_MESSAGE_SCB[0xff00] 
MK_MESSAGE_SCSIID[0xff] SSTAT0[0x42] SSTAT1[0x11] 
SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xac] 
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0] 
LQOSTAT1[0x0] LQOSTAT2[0x0] 

SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x3 NEXTSCB 0x0
qinstart = 1 qinfifonext = 1
QINFIFO:
WAITING_TID_QUEUES:
       0 ( 0x3 )
Pending list:
  3 FIFO_USE[0x0] SCB_CONTROL[0x40] SCB_SCSIID[0x7] 
Total 1
Kernel Free SCB list: 2 1 0 
Sequencer Complete DMA-inprog list: 
Sequencer Complete list: 
Sequencer DMA-Up and Complete list: 
Sequencer On QFreeze and Complete list: 


scsi1: FIFO0 Free, LONGJMP == 0x8071, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89] 
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] 
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] 

scsi1: FIFO1 Free, LONGJMP == 0x8063, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89] 
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] 
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] 
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 
scsi1: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52
scsi1: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0
scsi1: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0
SIMODE0[0xc] 
CCSCBCTL[0x4] 
scsi1: REG0 == 0x3, SINDEX = 0x102, DINDEX = 0x102
scsi1: SCBPTR == 0xff00, SCB_NEXT == 0xff00, SCB_NEXT2 == 0x0
CDB 0 1 0 0 0 0
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
scsi1:0:0:0: Cmd aborted from QINFIFO
scsi 1:0:0:0: Attempting to queue an ABORT message:CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi1: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi1: Dumping Card State at program address 0xfd Mode 0x11
Card was paused
INTSTAT[0x0] SELOID[0x0] SELID[0x0] HS_MAILBOX[0x0] 
INTCTL[0x0] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x19] 
SCSISIGI[0x86] SCSIPHASE[0x10] SCSIBUS[0x0] LASTPHASE[0x80] 
SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x0] 
SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x0] 
KERNEL_QFREEZE_COUNT[0x0] MK_MESSAGE_SCB[0xff00] 
MK_MESSAGE_SCSIID[0xff] SSTAT0[0x0] SSTAT1[0x9] 
SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xac] 
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0] 
LQOSTAT1[0x0] LQOSTAT2[0x0] 

SCB Count = 4 CMDS_PENDING = 0 LASTSCB 0xffff CURRSCB 0x0 NEXTSCB 0x0
qinstart = 1 qinfifonext = 2
QINFIFO: 0x3
WAITING_TID_QUEUES:
scsi1: SCB = 0x0 Not Active!
Kernel panic - not syncing: Waiting TID List traversal


Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-21.el5
kernel-2.6.18-92.el5

How reproducible:
sometimes on ibm-himalaya.rhts.bos.redhat.com

Steps to Reproduce:
1. configure Kdump with 128M@16M
2. SysRq-C
  
Actual results:
Capture Kernel panic or failed to find rootfs.

Expected results:
Successfully capture the vmcore.

Comment 1 Qian Cai 2008-08-11 09:15:50 UTC

OK, looks like I could reproduce this issue.

Comment 2 Neil Horman 2008-08-11 11:43:26 UTC

hmm, looks like perhaps during the reset operation when the module was installed we call ahd_pause, and then never reach the point in the module init routine where we call ahd_reset, which appears to be responsible for unpausing the driver.Cai, can you re-run the test with these module options specified on the aic79xx modules:
aic79xx=verbose.debug:0xffff

Lets see if we can figure out why we never unpause the driver.

Vivek, you've looked at a few aic79xx issues with kdump upstream, any more insight as to what is going on here>  Thanks!

Comment 3 Qian Cai 2008-08-11 14:33:34 UTC

Is it to add the following line to /etc/modprobe.conf?

options aic7xxx aic7xxx=verbose,debug:0xffff

Comment 4 Qian Cai 2008-08-11 14:35:29 UTC

I mean, options aic79xx aic79xx=verbose,debug:0xffff

Comment 5 Neil Horman 2008-08-11 14:54:55 UTC

That should be sufficient for the normal kernel(you'll have to edit the kdump initramfs to include the options by hadn there though)

Comment 6 Qian Cai 2008-08-12 05:53:21 UTC

Created attachment 314045 [details]
aic79xx module with debug options in capture Kernel

OK, I got you debug logs.

Comment 7 Qian Cai 2008-08-12 05:54:52 UTC

The machine has been reserved for the next few days.

Comment 8 Neil Horman 2008-08-12 20:13:14 UTC

I think I've got something functional.  I need to reset the HBA at the end of ahd_pci_config to simulate a shutdown (or at least a device reset like if the host stopped responding), it seems to allow the HBA to come up normally.  I'll have a patch for you to test soon.

Comment 9 Neil Horman 2008-08-13 15:14:42 UTC

Created attachment 314213 [details]
patch to reset HBA

This patch corrects the problem.  I'm re-installing the machine now, to confirm that it corrects the problem upstream as well.

Comment 10 Neil Horman 2008-08-13 19:50:58 UTC

Looks like upstream is working properly with kexec, thanks to a series of patches that implements proper detection of device resets and hangups and properly resets the card and bus in the process.  Given that we are so late in the RHEL 5.3 devel cycle, I think its best to go with the patch I attached, and backport the aic driver code properly for RHEL5.4

Comment 12 Don Zickus 2008-09-15 14:18:37 UTC

in kernel-2.6.18-115.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 17 Neil Horman 2008-10-24 00:55:35 UTC

I _know_ this was working when I wrote the patch.  I assume you set reset_devices on the kdump kernel command line?  Can you try this back with kernel 115 when it initially went in?

Comment 18 Qian Cai 2008-10-24 02:18:47 UTC

Yes, it use reset_devices as you can see it from comment #15. I am reserving the machine right now to see if it works in -115.el5 Kernel.

Comment 19 Qian Cai 2008-10-24 03:29:54 UTC

The problem is not fixed in -115.el5 Kernel. Some hints to reproduce this problem.

- the chance is higher for i386
- it does not always happen, but I always reproduce it within 5 attempts.

ibm-himalaya.rhts.bos.redhat.com login: SysRq : Trigger a crashdump^M
Linux version 2.6.18-115.el5 (brewbuilder.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Sat Sep 13 23:29:21 EDT 2008^M
BIOS-provided physical RAM map:^M
 BIOS-e820: 0000000000000100 - 000000000009dc00 (usable)^M
 BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)^M
 BIOS-e820: 0000000000000100 - 000000000009dc00 (usable)^M
 BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)^M
 BIOS-e820: 0000000000100000 - 000000003ffcc200 (usable)^M
 BIOS-e820: 000000003ffcc200 - 000000003ffcf800 (ACPI data)^M
 BIOS-e820: 000000003ffcf800 - 0000000040000000 (reserved)^M
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)^M
user-defined physical RAM map:^M
 user: 0000000000000000 - 00000000000a0000 (usable)^M
 user: 0000000001000000 - 0000000008f5b000 (usable)^M
0MB HIGHMEM available.^M
143MB LOWMEM available.^M
found SMP MP-table at 0009dd40^M
Memory for crash kernel (0x0 to 0x0) notwithin permissible range^M
disabling kdump^M
Using x86 segment limits to approximate NX protection^M
DMI 2.3 present.^M
Using APIC driver default^M
ACPI: PM-Timer IO Port: 0x588^M
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)^M
Processor #0 15:3 APIC version 20^M
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled)^M
Processor #6 15:3 APIC version 20^M
WARNING: maxcpus limit of 1 reached. Processor ignored.^M
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)^M
Processor #1 15:3 APIC version 20^M
WARNING: maxcpus limit of 1 reached. Processor ignored.^M
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled)^M
Processor #7 15:3 APIC version 20^M
WARNING: maxcpus limit of 1 reached. Processor ignored.^M
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])^M
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])^M
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])^M
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])^M
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])^M
ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])^M
ACPI: IOAPIC (id[0x0e] address[0xfec00000] gsi_base[0])^M
IOAPIC[0]: apic_id 14, version 32, address 0xfec00000, GSI 0-23^M
ACPI: IOAPIC (id[0x0d] address[0xfec84000] gsi_base[24])^M
IOAPIC[1]: apic_id 13, version 32, address 0xfec84000, GSI 24-47^M
ACPI: IOAPIC (id[0x0c] address[0xfec84400] gsi_base[48])^M
IOAPIC[2]: apic_id 12, version 32, address 0xfec84400, GSI 48-71^M
ACPI: IOAPIC (id[0x0b] address[0xfec80000] gsi_base[72])^M
IOAPIC[3]: apic_id 11, version 32, address 0xfec80000, GSI 72-95^M
ACPI: IOAPIC (id[0x0a] address[0xfec80400] gsi_base[96])^M
IOAPIC[4]: apic_id 10, version 32, address 0xfec80400, GSI 96-119^M
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)^M
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)^M
Enabling APIC mode:  Flat.  Using 5 I/O APICs^M
Using ACPI (MADT) for SMP configuration information^M
Allocating PCI resources starting at 10000000 (gap: 08f5b000:f70a5000)^M
Detected 3600.346 MHz processor.^M
Built 1 zonelists.  Total pages: 36699^M
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,19200n8  irqpoll maxcpus=1 reset_devices  hda=cdrom memmap=exactmap memmap=640K@0K memmap=130412K@16384K elfcorehdr=146796K^M
Misrouted IRQ fixup and polling support enabled^M
This may significantly impact system performance^M
ide_setup: hda=cdrom^M 
Enabling fast FPU save and restore... done.^M
Enabling unmasked SIMD FPU exception support... done.^M
Initializing CPU#0^M
CPU 0 irqstacks, hard=c135d000 soft=c133d000^M 
PID hash table entries: 1024 (order: 10, 4096 bytes)^M
irq 169, desc: c12f5300, depth: 1, count: 0, unhandled: 0^M
->handle_irq():  c104e272, handle_bad_irq+0x0/0x1a6^M
->chip(): c1288280, 0xc1288280^M
->action(): 00000000^M
  IRQ_DISABLED set^M
unexpected IRQ trap at vector a9^M
Console: colour VGA+ 80x25^M
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)^M
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)^M
Memory: 122488k/146796k available (2140k kernel code, 8412k reserved, 902k data, 228k init, 0k highmem)^M
Checking if this processor honours the WP bit even in supervisor mode... Ok.^M
Calibrating delay using timer specific routine.. 7224.19 BogoMIPS (lpj=3612098)^M
Security Framework v1.0.0 initialized^M
SELinux:  Initializing.^M
selinux_register_security:  Registering secondary module capability^M
Capability LSM initialized as secondary^M
Mount-cache hash table entries: 512^M
monitor/mwait feature present.^M
using mwait in idle threads.^M
CPU: Trace cache: 12K uops, L1 D cache: 16K^M
CPU: L2 cache: 1024K^M
CPU: Physical Processor ID: 0^M
Intel machine check architecture supported.^M
Intel machine check reporting enabled on CPU#0.^M
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available^M
CPU0: Thermal monitoring enabled^M
Checking 'hlt' instruction... OK.^M
SMP alternatives: switching to UP code^M
Freeing SMP alternatives: 14k freed^M
ACPI: Core revision 20060707^M
CPU0: Intel(R) Xeon(TM) CPU 3.60GHz stepping 04^M
Total of 1 processors activated (7224.19 BogoMIPS).^M
ENABLING IO-APIC IRQs^M
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=0 pin2=0^M
Brought up 1 CPUs^M 
checking if image is initramfs... it is^M
Freeing initrd memory: 2858k freed^M
NET: Registered protocol family 16^M
No dock devices found.^M
ACPI: bus type pci registered^M
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved^M
PCI: Not using MMCONFIG.^M
PCI: PCI BIOS revision 2.10 entry at 0xfd6de, last bus=9^M
PCI: Using configuration type 1^M
Setting up standard PCI resources^M
ACPI: Interpreter enabled^M
ACPI: Using IOAPIC for interrupt routing^M
ACPI: PCI Root Bridge [PCI0] (0000:00)^M
PCI quirk: region 0580-05ff claimed by ICH4 ACPI/GPIO/TCO^M
PCI quirk: region 0400-043f claimed by ICH4 GPIO^M
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.1^M
PCI: PXH quirk detected, disabling MSI for SHPC device^M
PCI: PXH quirk detected, disabling MSI for SHPC device^M
PCI: Transparent bridge - 0000:00:1e.0^M 
ACPI: PCI Interrupt Link [LP00] (IRQs *3)^M
ACPI: PCI Interrupt Link [LP01] (IRQs *11)^M
ACPI: Blank IRQ resource^M
ACPI: Resource is not an IRQ entry^M
ACPI: PCI Interrupt Link [LP02] (IRQs) *0, disabled.^M
ACPI: PCI Interrupt Link [LP03] (IRQs *5)^M
ACPI: PCI Interrupt Link [LP04] (IRQs *11)^M 
ACPI: Blank IRQ resource^M
ACPI: Resource is not an IRQ entry^M
ACPI: PCI Interrupt Link [LP05] (IRQs) *0, disabled.^M
ACPI: Blank IRQ resource^M
ACPI: Resource is not an IRQ entry^M
ACPI: PCI Interrupt Link [LP06] (IRQs) *0, disabled.^M
ACPI: PCI Interrupt Link [LP07] (IRQs *7)^M
Linux Plug and Play Support v0.97 (c) Adam Belay^M
pnp: PnP ACPI init^M
pnp: PnP ACPI: found 16 devices^M
xen_mem: Initialising balloon driver.^M
usbcore: registered new driver usbfs^M
usbcore: registered new driver hub^M
PCI: Using ACPI for IRQ routing^M
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report^M
NetLabel: Initializing^M
NetLabel:  domain hash size = 128^M
NetLabel:  protocols = UNLABELED CIPSOv4^M
NetLabel:  unlabeled traffic allowed by default^M
pnp: 00:01: ioport range 0x520-0x53f has been reserved^M
pnp: 00:01: ioport range 0x540-0x547 has been reserved^M
pnp: 00:0d: ioport range 0x400-0x43f has been reserved^M
PCI: Bridge: 0000:02:00.0^M
  IO window: disabled.^M
  MEM window: disabled.^M
  PREFETCH window: disabled.^M
PCI: Bridge: 0000:02:00.2^M
  IO window: disabled.^M
  MEM window: disabled.^M
  PREFETCH window: disablPCI: Bridge: 0000:00:02.0^M
  IO window: disabled.^M 
  MEM window: disabled.^M
  PREFETCH window: disabled.^M
PCI: Bridge: 0000:00:04.0^M
  IO window: disabled.^M 
  MEM window: dd000000-deffffff^M
  PREFETCH window: 10000000-100fffff^M
PCI: Bridge: 0000:00:05.0^M
  IO window: disabled.^M
  MEM window: db000000-dcffffff^M
  PREFETCH window: disabled.^M 
PCI: Bridge: 0000:07:00.0^M
  IO window: 4000-4fff^M
  MEM window: d9000000-daffffff^M
  PREFETCH window: 50000000-500fffff^M
PCI: Bridge: 0000:07:00.2^M 
  IO window: 5000-ffff^M
  MEM window: disabled.^M
  PREFETCH window: disabled.^M
PCI: Bridge: 0000:00:06.0^M
  IO window: 4000-ffff^M 
  MEM window: d9000000-daffffff^M
  PREFETCH window: 50000000-500fffff^M
PCI: Bridge: 0000:00:1e.0^M
  IO window: 3000-3fff^M
  MEM window: f8000000-f8ffffff^M
  PREFETCH window: f0000000-f7ffffff^M
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 169^M
ACPI: PCI Interrupt 0000:00:04.0[A] -> GSI 16 (level, low) -> IRQ 169^M
ACPI: PCI Interrupt 0000:00:05.0[A] -> GSI 16 (level, low) -> IRQ 169^M
ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 16 (level, low) -> IRQ 169^M
NET: Registered protocol family 2^M
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)^M
TCP established hash table entries: 8192 (order: 4, 65536 bytes)^M
TCP bind hash table entries: 4096 (order: 3, 32768 bytes)^M
TCP: Hash tables configured (established 8192 bind 4096)^M
TCP reno registered^M
IBM machine detected. Enabling interrupts during APM calls.^M
apm: BIOS not found.^M
audit: initializing netlink socket (disabled)^M
type=2000 audit(1224818335.800:1): initialized^M
Total HugeTLB memory allocated, 0^M
VFS: Disk quotas dquot_6.5.1^M
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)^M
Initializing Cryptographic API^M
alg: No test for crc32c (crc32c-generic)^M
ksign: Installing public key data^M
Loading keyring^M
- Added public key B57B7263B69CC092^M
- User ID: Red Hat, Inc. (Kernel Module GPG key)^M
io scheduler noop registered^M
io scheduler anticipatory registered^M
io scheduler deadline registered^M
io scheduler cfq registered (default)^M
Intel E7520/7320/7525 detected.<6>Disabling irq balancing and affinity^M
IRQ lockup detection disabled^M
pci_hotplug: PCI Hot Plug PCI Core version: 0.5^M
BIOS reported wrong ACPI idfor the processor^M
ACPI Exception (evxface-0545): AE_NOT_EXIST, Removing notify handler [20060707]^M
Real Time Clock Driver v1.12ac^M
Non-volatile memory driver v1.2^M
Linux agpgart interface v0.101 (c) Dave Jones^M
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled^M
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A^M
00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A^M
RAMDISK driver initialized: 16 RAM disks of 16384K size 4096 blocksize^M
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2^M
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx^M
ICH5: IDE controller at PCI slot 0000:00:1f.1^M
ACPI: Unable to derive IRQ for device 0000:00:1f.1^M
ACPI: PCI Interrupt 0000:00:1f.1[A]: no GSI^M
ICH5: chipset revision 2^M
ICH5: not 100% native mode: will probe irqs later^M
    ide0: BM-DMA at 0x0480-0x0487, BIOS settings: hda:DMA, hdb:DMA^M
hda: HL-DT-STDVD-ROM GDR8082N, ATAPI CD/DVD-ROM drive^M
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14^M
ide-floppy driver 0.99.newide^M
usbcore: registered new driver hiddev^M
usbcore: registered new driver usbhid^M
drivers/usb/input/hid-core.c: v2.6:USB HID core driver^M
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x64,0x60 irq 1,12^M
PNP: PS/2 controller has invalid data port 0x64; using default 0x60^M
PNP: PS/2 controller has invalid command port 0x60; using default 0x64^M
serio: i8042 KBD port at 0x60,0x64 irq 1^M
serio: i8042 AUX port at 0x60,0x64 irq 12^M 
mice: PS/2 mouse device common for all mice^M
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27^M
md: bitmap version 4.39^M
TCP bic registered^MInitializing IPsec netlink socket^M
NET: Registered protocol family 1^M
NET: Registered protocol family 17^M
Using IPI No-Shortcut mode^M
ACPI: (supports<6>Time: tsc clocksource has been installed.^M
 S0 S4 S5)^M
Freeing unused kernel memory: 228k freed^M
Write protecting the kernel read-only data: 4294955412k^M
Mounting proc filesystem^M
Mounting sysfs filesysSCSI subsystem initialized^M
tem^M
Creating /dev^M
Creating iniPCI: Enabling device 0000:08:07.0 (0156 -> 0157)^M
tial device nodeACPI: PCI Interrupt 0000:08:07.0[A] -> s^M
Loading scsi_GSI 27 (level, low) -> IRQ 177^M
mod.ko module^M
Loading sd_mod.ko module^M
Loading scsi_transport_spi.ko module^M
Loading aic79xx.ko module^M
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0^M
        <Adaptec AIC7902 Ultra320 SCSI adapter>^M
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs^M
^M
PCI: Enabling device 0000:08:07.1 (0156 -> 0157)^M
ACPI: PCI Interrupt 0000:08:07.1[B] -> GSI 24 (level, low) -> IRQ 185^M
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0^M
        <Adaptec AIC7902 Ultra320 SCSI adapter>^M
        aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs^M
^M
scsi 1:0:0:0: Attempting to queue an ABORT message:CDB: 0x12 0x0 0x0 0x0 0x24 0x0^Mscsi 1:0:0:0: Attempting to queue an ABORT message:CDB: 0x12 0x0 0x0 0x0 0x24 0x0^M
scsi1: At time of recovery, card was paused^M
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<^M
scsi1: Dumping Card State at program address 0x24 Mode 0x22^M
Card was paused^M
INTSTAT[0x0] SELOID[0x0] SELID[0x0] HS_MAILBOX[0x0] ^M
INTCTL[0x0] SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33] ^M
SCSISIGI[0xe6] SCSIPHASE[0x8] SCSIBUS[0x80] LASTPHASE[0x1] ^M
SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0] SEQINTCTL[0x0] ^M
SEQ_FLAGS[0xc0] SEQ_FLAGS2[0x4] QFREEZE_COUNT[0x0] ^M
KERNEL_QFREEZE_COUNT[0x0] MK_MESSAGE_SCB[0xff00] ^M
MK_MESSAGE_SCSIID[0xff] SSTAT0[0x22] SSTAT1[0x11] ^M
SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xa4] ^M
LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0] ^M
LQOSTAT1[0x0] LQOSTAT2[0x0] ^M
^M
SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x3 NEXTSCB 0x0^M
qinstart = 1 qinfifonext = 1^M
QINFIFO:^M
WAITING_TID_QUEUES:^M
Pending list:^M 
  3 FIFO_USE[0x0] SCB_CONTROL[0x44] SCB_SCSIID[0x7] ^M
Total 1^M
Kernel Free SCB list: 2 1 0 ^M
Sequencer Complete DMA-inprog list: ^M
Sequencer Complete list: ^M
Sequencer DMA-Up and Complete list: ^M 
Sequencer On QFreeze and Complete list: ^M
^M
^M
scsi1: FIFO0 Free, LONGJMP == 0x8071, SCB 0x0^M
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89] ^M
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] ^M
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 ^M
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] ^M
^M
scsi1: FIFO1 Free, LONGJMP == 0x8063, SCB 0x3^M
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x4] DFSTATUS[0x89] ^M
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] ^M
SOFFCNT[0x0] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 ^M
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10] ^M
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 ^M
scsi1: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x52^M
scsi1: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0^M
scsi1: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0^M
SIMODE0[0xc] ^M
CCSCBCTL[0x4] ^M
scsi1: REG0 == 0x3, SINDEX = 0x1b8, DINDEX = 0x1ba^M
scsi1: SCBPTR == 0xff00, SCB_NEXT == 0xff00, SCB_NEXT2 == 0x0^M
CDB 0 1 0 0 0 0^M
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0^M
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>^M
(scsi1:A:0:0): Device is disconnected, re-queuing SCB^M
scsi1: Recovery code sleeping^M
scsi1: Timer Expired (active 1)^M
Recovery code awake^M
scsi1: Command abort returning 0x2003^M
scsi 1:0:0:0: Attempting to queue a TARGET RESET message:CDB: 0x12 0x0 0x0 0x0 0x24 0x0^M
scsi1: Device reset code sleeping^M
scsi1: Device reset timer expired (active 2)^M
scsi1: Device reset returning 0x2003^M
Recovery SCB completes^M
Recovery SCB completes^M
scsi 1:0:0:0: Attempting to queue an ABORT message:CDB: 0x0 0x0 0x0 0x0 0x0 0x0^M
scsi1: At time of recovery, card was paused^M
...

Comment 20 Neil Horman 2008-10-24 20:36:25 UTC

I think I have this fixed.  An update to my patch which flushes the output queue of non-completed operations prior to resetting the HBA has allowed me 6 consecutive kdumps on i686 ibm-hymilaya.  I need to do some more agressive testing (flodding the box with disk I/O during a kdump) on monday, after which I'll post the fix.  Its running on hymilaya right now, cai if you want to jump on and play around with it.  Its the 2.6.18-prep kernel

Comment 21 Neil Horman 2008-10-28 14:18:05 UTC

Ok, bad news, good news.  

The bad news: What I thought I had working in comment #20, I don't.  I was able to get numberous kdumps successfully on an idle system without failure.  Even when I started writing random data to files to create heavy load, I was able to get about 7-8 kdumps in a row.  then unfortunately, it failed.  So back to the drawing board

Good news: So I started digging into the errors that we get on a failure.  And despite clearing out any pending requests from the sequencer, and completely resetting the HBA, It seems to me that on occasion we have drives that just stop responding to our HBA's requests.  So I started looking at our drive microcode.  It turns out the system we are testing on (ibm-himilaya) is several versions of microcode backlevel, and there are several Linux "robustness" fixes.  Its vague, but given that we've only seen this error on this system, and no other aic7xxx systems that I'm aware of, I think its worth pursuing.

The link to the update is here:
http://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-4GSU4F&brandind=5000008

Unfortunately, to preform the update, you need to burn the iso to cd and boot it, and since I'm not near the system,  I can't do that.  Cai, is it possible for you to put that iso on a cd and update the drive firmware?

Comment 22 Qian Cai 2008-10-28 14:59:52 UTC

Neil, I can't either. I have created a ticket for it.
https://engineering.redhat.com/rt3/Ticket/Display.html?id=29997

Comment 23 Neil Horman 2008-10-28 15:11:32 UTC

ahh, cool.  Thank you.  I'll test again when that ticket is closed.

Comment 24 Linda Wang 2008-10-30 03:21:09 UTC

where is the machine in question?  in Westford, or in Raleigh?

Comment 26 PaulB 2008-10-30 14:43:17 UTC

All,
Updated IBM Himalaya bios to version 1.17.
Also, updated firmware on all three hard drives to version 1.19b.

Neil is currently testing the machine.

PaulB

Comment 27 Tom Coughlan 2008-10-30 15:18:30 UTC

Neil,

My workstation in my office has aic79xx. We also have an aic79xx HBA floating around in the Westford lab (not in RHTS), I think. Let me know if you want to try a test with one of those. 

Tom

Comment 28 Neil Horman 2008-10-30 15:55:29 UTC

Thanks, tom, but I found several systems in rhts with the same card, I'm trying this on nec-em15

Comment 30 Neil Horman 2008-10-30 19:26:31 UTC

Ok so I've spent the afternoon trying to recreate this problem on a different system (specifically nec-em15.rhts.bos.redhat.com, which has an aic HBA running all its disk i/o with the same pci vendor and device id. I've managed to get 10 consecutive dumps accross that HBA under heavy i/o load without failure.  So I'm starting to wonder if this isn't a problem isolated to this specific ibm system.  Tom, Can you suggest who (either internally or at IBM) that I might talk to about this?  Thanks!

Comment 31 Tom Coughlan 2008-10-30 20:28:55 UTC

John, Jesse,

The system ibm-himalaya.rhts.bos.redhat.com in Westford fails kdump intermittently. We thought it might be related to aic79xx, but testing on another system with the same HBA model has not failed. We would like to ask for some help from IBM in diagnosing this. Can they reproduce the problem? Do they recognize this symptom? 

Neil, CAI,

Please try a re-test on ibm-himalaya, now that the bios and drive firwmare have been updated.

Thanks,

Tom

Comment 32 Tom Coughlan 2008-10-30 20:48:37 UTC

(In reply to comment #21)

> So I started digging into the errors that we get on a failure.  And
> despite clearing out any pending requests from the sequencer, and completely
> resetting the HBA, It seems to me that on occasion we have drives that just
> stop responding to our HBA's requests.  So I started looking at our drive
> microcode.  

When drives stop responding after a reset, the problem may be related to the relatively complex set of steps involved in negotiating the data transfer rate and bus width on the parallel SCSI bus. After a full bus reset, the initiator and targets start out doing asynchronous, 8 bit transfers. Then they negotiate to higher speeds, 16-bit width, and possibly packet mode. Later, if the initiator somehow resets, but the drives do not, then they will not talk to eachother. The firmware update on the drives could address the problem. Or maybe the aic79xxx is not getting the reset and re-negotiation done right. Or it could be something else entirely. A bus analyzer would be the most direct way to know for sure.

Comment 33 Jesse Larrew 2008-10-30 22:38:29 UTC

(In reply to comment #31)
> John, Jesse,
> 
> The system ibm-himalaya.rhts.bos.redhat.com in Westford fails kdump
> intermittently. We thought it might be related to aic79xx, but testing on
> another system with the same HBA model has not failed. We would like to ask for
> some help from IBM in diagnosing this. Can they reproduce the problem? Do they
> recognize this symptom? 
> 

I've added Peter Bogdanovic from the System X team at IBM to the CC list for this bug. He has agreed to help diagnose the problem. I'm also adding myself to the list so I can help out in any way. I'm here in Westford, so just let me know if you need some tinkering done to ibm-himalaya.

Sincerely,

   Jesse

Comment 34 Neil Horman 2008-10-31 00:50:21 UTC

Jesse,  Peter, john, thanks for your help!

Tom, I've retested on ibm-hymialya since the drive firmware update, and the problem persists.  Its as you said intermittent, and the patch I added a few weeks back reduced the frequency from consistent to infrequently,  but the system never seems to survive more than 6 kdump iterations before we get a stall during hba initialization resulting in the documented results above.  Conducting the same test on nec-em15.rhts.bos.redhat.com (which contains the same aic card has now successfully executed 12 kdumps.  clearly thats not conclusive evidence, but it put me at rather a loss for what might be going on here.  the extra insight is most appreciated!

Comment 35 Peter Bogdanovic 2008-10-31 16:15:35 UTC

I have installed RHEL5.3 beta on x346 in our Beaverton lab and configured kdump.  I crashed it a few times last night before heading home and didn't encounter a problem.  I will continue iterating the crashes in an attempt to trigger the problem.  

Is the system idle when you are invoking the crash or are you creating some I/O first?  I am going to try it with background I/O and see if that brings out the problem.

Comment 36 Neil Horman 2008-10-31 16:55:45 UTC

Its entirely possible at this point that this problem is isolated just to the ibm xserver that this was reported on.

I never was able to go more than 6 kdump without the HBA throwing errors and preventing kdump from continuing.  It would happen both if I was doing i/o or keeping the system idle.  I had a test script that did this:

service kdump restart
rm -f /tmp/rand_data
dd if=/dev/urandom of=/tmp/rand_data bs=4096 &
sleep 5
echo c> /proc/sysrq-trigger

That could ususally recreate the problem within 3 kdump attempts

Comment 37 Peter Bogdanovic 2008-10-31 22:27:42 UTC

I have been running the script from the comment above all day on a IBM x346 model 8840 22U.  It just keeps on ticking.  It looks like it's going to just keep going until I fill the root filesystem with core files. 

Is there only one system that is having this problem?  Maybe you have a bad disk?

Comment 38 Neil Horman 2008-11-01 11:59:47 UTC

Thats the same result I've been getting on my other system with an aic card.  Clearly it seems that this is isolated to the one system in question.  It could be a bad drive, but smartd isn't reporting any errors, and the system runs fine prior to the kdump.  I'm starting to wonder if the aic controller has backlevel firmware, or is a slightly differnt spin of the silicon (which is hard to believe since the pci ids are identical between the non-working and working systems).  The failing system in question here is an eserver xSeries 346 -[8840D2Z]-.  do you have one of those you can try this on there?  If that works for you perhaps we can chalk this up as a hardware anomaly.

Comment 39 Jesse Larrew 2008-11-03 19:45:14 UTC

I found an open ticket from RHTS to replace a failing fan on ibm-himalaya about a year ago. It doesn't look like this ever got serviced. Is it possible that all the increased I/O from the kdumps is causing the cpu to overheat and enter thermal shutdown?

Comment 40 Neil Horman 2008-11-03 19:59:30 UTC

It would certainly fit with the apparent isolation of this issue to just the ibm-himalaya system.  I'm not qute sure why it only occurs during a kdump operation rather than during normal operation, but honestly I've never left the system running for an extended amount of time without crashing it to test kdump.  Is it possible to service the fan now?

Comment 41 Linda Wang 2008-11-04 03:35:44 UTC

If we can't service the fan in time,  can Neil provide a test kernel,
and you all can test on the same systems at your site?  (hopefully identitical to ibm-hamalya) and see if it works for you?

Will this work?

Comment 42 Neil Horman 2008-11-04 12:16:44 UTC

Linda, if this truly turns out to be a hw problem, given that the currently release candidate for RHEL5.2 runs kdump fine on other system with an aic7902 card, we shouldn't need a specific test kernel.  Anything in don's tree after -114.el5 (the kernel with my origional fix in place) should work fine.  Jesse if  you don't already have that kernel or later, let me know and I'll put it on my people page..  Thanks!

Comment 43 Neil Horman 2008-11-04 16:36:55 UTC

Ok, so, based on what we've been discussing here, I think we are about ready to close this bug as a hardware issue.  Before we do that however, I would feel much better about it if we could have someone at IBM test on the exact same model of hardware to confirm that the problem does not reproduce.  The system in question is an IBM eserver xSeries 346 -[8840D2Z]-.  The test reproducer I've been using is as follows:

install kernel -114.el5 or later on the system
install kexec-tools on the system (RHEL5 beta kexec tools is what I used).
configure kdump to dump to the root fs from the initramfs:
   echo ext3 /dev/mapper/VolGroup00-LogVol00 >> /etc/kdump.conf
   service kdump restart
note that you may need to change the physicall location of your root partition. 
run my script from comment #36.

In about 5 seconds the system will crash, upon kdump reboot the failing system randomly experienced HBA bus communication failures from which it was unable to recover.  Sometimes it will capture a core, but it never went more than 6 kdumps in a row before reproducing the problem.

I would say if IBM can issue 10 consecutive kdumps on one of these systems in their lab without failure, we can close this as a hardware issue.

Comment 44 Peter Bogdanovic 2008-11-04 16:49:43 UTC

To add evidence to the "it's hardware problem" theory. I ran crash dump script above with rhel5.3 beta1, 2.6.18-120 i386, on a x346, model 8840  type 22U, same model but different type, all weekend.  The result was 97 consecutive successful crash dumps.  If it is an intermittent problem it appears to be a pretty rare event.

Comment 45 Peter Bogdanovic 2008-11-04 17:03:41 UTC

I didn't notice in the bug that use of logical disk volumes.  I installed my system on a primary partition.  Do you think this difference is significant?  Should I re-install on a logical volume and run the kdumps again?

Comment 46 Neil Horman 2008-11-04 17:34:39 UTC

I can guarantee that the use of LVM is inconsequential to the reproduction of this bug.  When the kdump kernel boots  and the problem reproduces, the HBA fails to communicate with any disks, before we even start querying them for LVM labeling.  

I think based on peters test, we can close this as  a hardware issue.  Is there general consensus on that?

Comment 52 errata-xmlrpc 2009-01-20 20:18:05 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html