Bug 179272 - 2.6.16 x86_64 Adaptec RAID-1 phantom errors
Summary: 2.6.16 x86_64 Adaptec RAID-1 phantom errors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dave Jones
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: FCMETA_SCSI
 
Reported: 2006-01-29 14:33 UTC by Sam Varshavchik
Modified: 2015-01-04 22:24 UTC

Fixed In Version: 2.6.17_2139
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-07-01 13:39:23 UTC
Type: ---
Embargoed:


Attachments
2.6.15 boot log. (30.57 KB, text/plain)
2006-02-04 17:07 UTC, Sam Varshavchik
Output of lspci (17.38 KB, text/plain)
2006-02-04 17:10 UTC, Sam Varshavchik
Output of dmidecode (13.69 KB, text/plain)
2006-02-04 17:11 UTC, Sam Varshavchik
2.6.16 boot log. (20.54 KB, text/plain)
2006-04-01 17:49 UTC, Sam Varshavchik

Description Sam Varshavchik 2006-01-29 14:33:12 UTC
At first I thought I was reproducing bug 174973, but I think there might be
another issue here.  The last FC4 errata kernel that appears to work reliably
for me is also 2.6.13-1.1532_FC4smp.  None of the 2.6.14-based kernels are
stable.  2.6.14-1.1656smp fails to boot completely. Unlike bug 174973, my
2.6.14-1.1644smp survives RAID initialization, and begins booting, but late in
the boot there are a bunch of disk errors and one of the disks drops off the
RAID.  The system continues to run off a single disk (I have two disks in a
RAID-1 configuration).

After rebuilding my arrays, I then attempted to boot 2.6.14-1.1656smp with a
serial console.  This time 2.6.14-1.1656smp survived RAID initialization, but
then one of the disks dropped out due to phantom errors again.
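
A minimal sketch of how a dropped mirror could be confirmed after boot, assuming the usual /proc/mdstat layout for raid1 (an array header line followed by a status line such as "[2/1] [U_]"). The sample text, block counts, and the degraded_arrays helper are illustrative assumptions, not output captured from this machine:

```python
import re

# Array header line, e.g. "md1 : active raid1 sdb2[1](F) sda2[0]"
ARRAY_RE = re.compile(r"^(md\d+) : active raid1 (.+)$")
# Status fragment, e.g. "[2/1] [U_]" = 2 mirrors expected, 1 active
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

def degraded_arrays(mdstat_text):
    """Map each degraded array name to (expected, active) mirror counts."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active)
            current = None
    return result

# Hypothetical mdstat contents with one degraded and one healthy array.
sample = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1](F) sda2[0]
      10241344 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(sample))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string.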

My understanding of bug 174973 is that it's a timeout issue that prevents the
kernel from seeing the RAID devices.  Here, it appears that I do see the RAID
devices, but then one of the disks fails later in the boot process:


Bootdata ok (command line is ro root=/dev/md1 console=ttyS0,9600 console=tty0)
Linux version 2.6.14-1.1656_FC4smp (bhcompile.redhat.com) (gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)) #1 SMP Thu Jan 5 22:26:33 EST 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009b400 (usable)
 BIOS-e820: 000000000009b400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d6000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
 BIOS-e820: 000000007ff70000 - 000000007ff76000 (ACPI data)
 BIOS-e820: 000000007ff76000 - 000000007ff80000 (ACPI NVS)
 BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
Scanning NUMA topology in Northbridge 24
Number of nodes 2
Node 0 using interleaving mode 1/0
No NUMA configuration found
Faking a node at 0000000000000000-000000007ff70000
Bootmem setup node 0 0000000000000000-000000007ff70000
ACPI: PM-Timer IO Port: 0x8008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfe500000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xfe500000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xfe501000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xfe501000, GSI 28-31
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:7ec00000)
Checking aperture...
CPU 0: aperture @ 0 size 32 MB
No AGP bridge found
Built 1 zonelists
Kernel command line: ro root=/dev/md1 console=ttyS0,9600 console=tty0
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 1403.212 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2055212k/2096576k available (2335k kernel code, 40960k reserved, 1389k data, 236k init)
Calibrating delay using timer specific routine.. 2812.38 BogoMIPS (lpj=5624776)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0(1) -> Node 0 -> Core 0
mtrr: v2.0 (20020519)
Using local APIC timer interrupts.
Detected 12.528 MHz APIC timer.
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 2806.58 BogoMIPS (lpj=5613176)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 1(1) -> Node 0 -> Core 0
AMD Opteron(tm) Processor 240 stepping 01
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff -3 cycles, maxerr 928 cycles)
Brought up 2 CPUs
Disabling vsyscall due to use of PM timer
time.c: Using PM based timekeeping.
testing NMI watchdog ... OK.
checking if image is initramfs... it is
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Subsystem revision 20050916
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 5 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 *5 10 11)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 5 *10 11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 5 10 *11)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI-DMA: Disabling IOMMU.
pnp: 00:04: ioport range 0x4d0-0x4d1 has been reserved
pnp: 00:04: ioport range 0x1100-0x117f has been reserved
pnp: 00:04: ioport range 0x1180-0x11ff has been reserved
PCI: Bridge: 0000:00:06.0
  IO window: 2000-2fff
  MEM window: fd000000-fe0fffff
  PREFETCH window: 88000000-880fffff
PCI: Bridge: 0000:00:0a.0
  IO window: disabled.
  MEM window: fe100000-fe1fffff
  PREFETCH window: 88100000-881fffff
PCI: Bridge: 0000:00:0b.0
  IO window: 3000-3fff
  MEM window: fe200000-fe2fffff
  PREFETCH window: 88200000-882fffff
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
audit: initializing netlink socket (disabled)
audit(1138525891.648:1): initialized
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key A42C93F7DA961759
- User ID: Red Hat, Inc. (Kernel Module GPG key)
PCI: MSI quirk detected. pci_msi_quirk set.
PCI: MSI quirk detected. pci_msi_quirk set.
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
ACPI: CPU0 (power states: C1[C1])
ACPI: CPU1 (power states: C1[C1])
Real Time Clock Driver v1.12
Linux agpgart interface v0.101 (c) Dave Jones
PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:07.1
AMD8111: chipset revision 3
AMD8111: not 100% native mode: will probe irqs later
AMD8111: 0000:00:07.1 (rev 03) UDMA133 controller
    ide0: BM-DMA at 0x1020-0x1027, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0x1028-0x102f, BIOS settings: hdc:DMA, hdd:pio
hda: SAMSUNG SP1213N, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: LITE-ON DVDRW LDW-811S, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 1024KiB
hda: 234493056 sectors (120060 MB) w/8192KiB Cache, CHS=16383/255/63, UDMA(100)
hda: cache flushes supported
 hda: hda1
hdc: ATAPI 40X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.2 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 3.39
NET: Registered protocol family 2
input: AT Translated Set 2 keyboard on isa0060/serio0
IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
TCP established hash table entries: 131072 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
powernow-k8: Power state transitions not supported
powernow-k8: Power state transitions not supported
Freeing unused kernel memory: 236k freed
Write protecting the kernel read-only data: 824k
SCSI subsystem initialized
ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 29 (level, low) -> IRQ 169
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11
        <Adaptec 29320 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

ACPI: PCI Interrupt 0000:03:01.1[B] -> GSI 30 (level, low) -> IRQ 177
scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11
        <Adaptec 29320 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 101-133Mhz, 512 SCBs

  Vendor: SEAGATE   Model: ST336607LW        Rev: 0007
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target1:0:0: asynchronous.
scsi1:A:0:0: Tagged Queuing enabled.  Depth 4
 target1:0:0: Beginning Domain Validation
 target1:0:0: wide asynchronous.
 target1:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM WRFLOW PCOMP (6.25 ns, offset 63)
 target1:0:0: Ending Domain Validation
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3
Attached scsi disk sda at scsi1, channel 0, id 0, lun 0
  Vendor: SEAGATE   Model: ST336607LW        Rev: 0007
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target1:0:1: asynchronous.
scsi1:A:1:0: Tagged Queuing enabled.  Depth 4
 target1:0:1: Beginning Domain Validation
 target1:0:1: wide asynchronous.
 target1:0:1: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM WRFLOW PCOMP (6.25 ns, offset 63)
 target1:0:1: Ending Domain Validation
SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 71687372 512-byte hdwr sectors (36704 MB)
SCSI device sdb: drive cache: write back
 sdb: sdb1 sdb2 sdb3
Attached scsi disk sdb at scsi1, channel 0, id 1, lun 0
md: raid1 personality registered as nr 3
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdb3 ...
md:  adding sdb3 ...
md: sdb2 has different UUID to sdb3
md: sdb1 has different UUID to sdb3
md:  adding sda3 ...
md: sda2 has different UUID to sdb3
md: sda1 has different UUID to sdb3
md: created md2
md: bind<sda3>
md: bind<sdb3>
md: running: <sdb3><sda3>
raid1: raid set md2 active with 2 out of 2 mirrors
md: considering sdb2 ...
md:  adding sdb2 ...
md: sdb1 has different UUID to sdb2
md:  adding sda2 ...
md: sda1 has different UUID to sdb2
md: created md1
md: bind<sda2>
md: bind<sdb2>
md: running: <sdb2><sda2>
raid1: raid set md1 active with 2 out of 2 mirrors
md: considering sdb1 ...
md:  adding sdb1 ...
md:  adding sda1 ...
md: created md0
md: bind<sda1>
md: bind<sdb1>
md: running: <sdb1><sda1>
raid1: raid set md0 active with 2 out of 2 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
raid1: Disk failure on sdb2, disabling device. 
raid1: Disk failure on sdb1, disabling device. 
raid1: Disk failure on sdb3, disabling device.
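
The failure lines at the end of the log can be pulled out mechanically. The following is a rough sketch, assuming the message format shown above ("raid1: Disk failure on <member>, disabling device."); the helper names failed_members and whole_disk are made up for illustration:

```python
import re

# Matches md raid1 failure lines as they appear in the log above.
FAILURE_RE = re.compile(r"raid1: Disk failure on (\w+), disabling device")

def failed_members(log_text):
    """Return member device names reported as failed, in log order."""
    return FAILURE_RE.findall(log_text)

def whole_disk(partition):
    """Strip the trailing partition number: 'sdb2' -> 'sdb'."""
    return partition.rstrip("0123456789")

# Excerpt copied from the end of the boot log.
sample = """\
raid1: raid set md0 active with 2 out of 2 mirrors
raid1: Disk failure on sdb2, disabling device.
raid1: Disk failure on sdb1, disabling device.
raid1: Disk failure on sdb3, disabling device.
"""

members = failed_members(sample)
disks = sorted({whole_disk(m) for m in members})
print(members)
print(disks)
```

Grouping by whole disk makes it clear that all three failed members sit on the same physical drive (sdb), which fits one disk dropping out of every array at once.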

Comment 1 Dave Jones 2006-02-03 05:35:06 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 2 Sam Varshavchik 2006-02-04 17:02:17 UTC
This bug still exists in the kernel update.

However, I managed to capture some good diagnostics this time, where "good
diagnostics" means "the ugliest boot log I've seen in some time."

The disks are initially detected and raid gets initialized.  But soon thereafter
aic7xxx begins reporting "scsi1: transmission error".  The kernel valiantly
tries to continue booting, with aic7xxx complaining every couple of seconds, and
it manages to make quite a bit of progress before giving up and panicking.

The last kernel that works reliably on this hardware is still 2.6.13.

Attaching boot log.


Comment 3 Sam Varshavchik 2006-02-04 17:07:25 UTC
Created attachment 124170
2.6.15 boot log.

The final panic should probably also be looked at, but the root cause is all
the preceding SCSI exceptions.

This hardware works fine on all kernel revs up until the 2.6.13 update.

Comment 4 Sam Varshavchik 2006-02-04 17:10:22 UTC
Created attachment 124172
Output of lspci

Comment 5 Sam Varshavchik 2006-02-04 17:11:19 UTC
Created attachment 124173
Output of dmidecode

Comment 6 Sam Varshavchik 2006-04-01 17:49:17 UTC
Created attachment 127181
2.6.16 boot log.

This bug still exists in 2.6.16-1.2069_FC4.

Serial console log attached.

NOTE: I first tried booting with only "noacpi pci=noacpi".  The kernel
initially continued to boot past the failure point, but then crashed with a
"BUG: spinlock recursion".

Thinking that "noacpi pci=noacpi" would suppress this bug, I attached a serial
console to capture what appeared to be a different bug.  However, with the
serial console at 9600bps, the kernel crashed with SCSI errors again.  So
"noacpi pci=noacpi" helps, but the bug still exists.

Comment 7 Sam Varshavchik 2006-07-01 13:39:23 UTC
This bug appears to be fixed in 2.6.17.


