Bug 125885

Summary:	aic79xx badblocks and filesystem corruption
Product:	[Fedora] Fedora	Reporter:	Joe Cooper <joe>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED WONTFIX	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	1
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-29 20:29:47 UTC	Type:	---

Description Joe Cooper 2004-06-13 07:54:43 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
Gecko/20040207 Firefox/0.8

Description of problem:
Running badblocks on a partition of a disk attached to this controller
reports two bad blocks at the end of the partition and logs "attempt to
access beyond end of device" in the kernel log:


attempt to access beyond end of device
08:03: rw=0, want=24394704, limit=24394702
attempt to access beyond end of device
08:03: rw=0, want=24394704, limit=24394702
attempt to access beyond end of device
08:03: rw=0, want=24394704, limit=24394702
attempt to access beyond end of device
08:03: rw=0, want=24394704, limit=24394702
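The numbers in these messages can be checked with a little arithmetic. The sketch below is a hypothetical helper, not part of any kernel tooling: it parses the `want`/`limit` pair and reports how far the request overruns the device. It assumes that both values are in the same unit and that a filesystem block spans four of those units; neither is stated in the report.

```python
import re

# One of the (identical) lines from the kernel log above.
LOG_LINE = "08:03: rw=0, want=24394704, limit=24394702"

# Assumption: one filesystem block covers four of the units used in the log.
UNITS_PER_BLOCK = 4


def parse_overrun(line):
    """Return (want, limit, overrun) parsed from a want/limit log line."""
    match = re.search(r"want=(\d+), limit=(\d+)", line)
    if match is None:
        raise ValueError("not a want/limit line: %r" % line)
    want, limit = int(match.group(1)), int(match.group(2))
    return want, limit, want - limit


want, limit, overrun = parse_overrun(LOG_LINE)
print("overrun in log units:", overrun)
print("limit remainder:", limit % UNITS_PER_BLOCK)
print("want remainder:", want % UNITS_PER_BLOCK)
```

The request ends on a multiple of four while the partition size does not, which would be consistent with the final block being rounded up past the end of a partition whose size is not a whole number of blocks. That is an inference from the numbers, not a confirmed diagnosis.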


It also results in filesystem corruption and errors similar to the following:

scsi0:0:0:0: Command not found
scsi0:0:0:0: Attempting to abort cmd f7118800: 0x2a 0x0 0x3 0xeb 0xc
0xa8 0x0 0x0 0x8 0x0
scsi0:0:0:0: Command not found
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 8000002
Info fld=0x13ed20, Current sd08:02: sense key Hardware Error
 I/O error: dev 08:02, sector 1048848
raid1: sda2: rescheduling block 1048848
raid1: sda2: unrecoverable I/O read error for block 1048848
EXT3-fs error (device md(9,2)): ext3_get_inode_loc: unable to read
inode block - inode=66177, block=131106
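Two details in this log can be decoded by hand. The sketch below unpacks the ten command bytes from the "Attempting to abort cmd" line using the standard 10-byte READ/WRITE CDB layout, and checks how the ext3 block number relates to the failing sector. The function name is hypothetical, and the 4096-byte filesystem block and 512-byte sector sizes are assumptions; the report does not state them.

```python
import struct

# Command bytes copied from the "Attempting to abort cmd" line above.
CDB = bytes([0x2A, 0x00, 0x03, 0xEB, 0x0C, 0xA8, 0x00, 0x00, 0x08, 0x00])

# Assumption: 4096-byte ext3 blocks on 512-byte sectors.
SECTORS_PER_FS_BLOCK = 4096 // 512


def decode_rw10(cdb):
    """Decode a 10-byte READ(10)/WRITE(10) CDB into (name, lba, blocks)."""
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian: opcode, flags, 32-bit LBA, group, 16-bit length, control.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if opcode not in names:
        raise ValueError("not a READ(10)/WRITE(10) opcode: 0x%02x" % opcode)
    return names[opcode], lba, blocks


name, lba, blocks = decode_rw10(CDB)
print(name, "lba=0x%08x" % lba, "blocks=%d" % blocks)

# Values from the ext3 and I/O error lines above.
fs_block = 131106
failing_sector = 1048848
print("block maps to sector:", fs_block * SECTORS_PER_FS_BLOCK == failing_sector)
```

Under these assumptions the aborted command is an 8-sector write, and the unreadable ext3 block 131106 corresponds exactly to sector 1048848 from the I/O error line.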


These errors appear in large numbers.  The system eventually panics.  I
don't have access to the screen on this system, as it is in a data
center, so I don't have details of the panic, but a hard reboot is
required to bring it back to life.


lspci output:

00:00.0 Host bridge: Intel Corp. E7501 Memory Controller Hub (rev 01)
00:04.0 PCI bridge: Intel Corp. E7000 Series Hub Interface D
PCI-to-PCI Bridge (rev 01)
00:1d.0 USB Controller: Intel Corp. 82801CA/CAM USB (Hub #1) (rev 02)
00:1d.1 USB Controller: Intel Corp. 82801CA/CAM USB (Hub #2) (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB/EB/ER Hub interface to
PCI Bridge (rev 42)
00:1f.0 ISA bridge: Intel Corp. 82801CA LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801CA Ultra ATA Storage
Controller (rev 02)
00:1f.3 SMBus: Intel Corp. 82801CA/CAM SMBus
Controller (rev 02)
01:09.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
02:1c.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 04)
02:1d.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 04)
02:1e.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 04)
02:1f.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 04)
03:06.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03)
03:06.1 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03)
04:02.0 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet
Controller (Copper) (rev 01)
04:02.1 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet
Controller (Copper) (rev 01)


The disks are Maxtor Atlas 10k 36GB SCA disks, and the error occurs
consistently on both disks.

Version-Release number of selected component (if applicable):
kernel-2.4.22-1.2188 and kernel-smp-2.4.22-1.2188

How reproducible:
Always

Steps to Reproduce:
1.  Run badblocks on (or write to the end of) a partition on this controller.
2.  Wait for it to finish and report errors.
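For anyone who wants to probe only the end of a partition without running a full badblocks pass, a read-only check along these lines may help. This is a minimal sketch, not what badblocks itself does: `read_tail` is a hypothetical helper, the 4096-byte block size is an assumption, and the demonstration runs against a temporary file rather than a real device.

```python
import os
import tempfile


def read_tail(path, block_size=4096):
    """Read from the last block-aligned offset to the end of path.

    Returns (start_offset, data). Opens the target read-only.
    """
    with open(path, "rb", buffering=0) as f:
        size = f.seek(0, os.SEEK_END)  # also works for block devices on Linux
        if size == 0:
            return 0, b""
        start = ((size - 1) // block_size) * block_size
        f.seek(start)
        data = f.read(size - start)
    return start, data


# Demonstration on a temporary 10000-byte file standing in for a partition
# whose size is not a multiple of the block size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 10000)
    demo_path = tmp.name

start, data = read_tail(demo_path)
os.unlink(demo_path)
print("tail starts at", start, "and holds", len(data), "bytes")
```

On a real system the path would be the partition's device node and the call would need sufficient privileges; any I/O error raised by the read, together with new kernel log entries, would indicate that the tail of the partition is affected.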


Actual Results:  Bad blocks are reported, filesystem corruption occurs
when data is written to the end of the disk, and the system becomes
unstable.

Expected Results:  No bad blocks, no filesystem corruption, no system
instability.

Additional info:

Strangely, this bug looks almost exactly like
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=125883 which I
just submitted a few minutes ago about a similarly specced server, but
with completely different hardware (this is an Intel chipset with an
Adaptec controller, the other bug is on a ServerWorks chipset with an
LSI controller).  The only things that are the same are the CPU (2.8
GHz Xeon) and the disks (Maxtor Atlas 10k).

In the case of the other problem, it didn't occur in the RHEL
2.4.21-15 kernel.  I will try this kernel on this machine, but since I
don't have physical access to it, I need to wait until someone is on
hand to reboot it in the event of problems.

Comment 1 Joe Cooper 2004-06-13 08:24:25 UTC

Problem also exists in RHEL kernel 2.4.21-15.ELsmp.

Comment 2 Joe Cooper 2004-06-14 02:33:24 UTC

I've attempted resizing the offending partition on this system (which
proved to be a seemingly successful workaround for
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=125883 ), but the
problem persists; it just moves down the disk a few blocks.

Comment 3 David Lawrence 2004-09-29 20:29:47 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/