Bug 181474

Summary: ext3 commit errors on ext3 fs mounted from usb drive on SMP box
Product: [Fedora] Fedora
Reporter: Alexandre Oliva <oliva>
Component: kernel
Assignee: Pete Zaitcev <zaitcev>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: davej, jonstanley, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: NeedsRetesting
Doc Type: Bug Fix
Last Closed: 2008-01-06 01:11:57 UTC
Bug Blocks: 165247    

Description Alexandre Oliva 2006-02-14 15:58:25 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.1) Gecko/20060210 Fedora/1.5.0.1-3 Firefox/1.5.0.1

Description of problem:
The rawhide kernels from the past two days have displayed odd behavior when running backups from my shiny new Athlon64X2 box to an external USB hard drive.  The drive worked flawlessly when booted with maxcpus=1, but when allowed to use both CPU cores, part-way through the incremental rsync backup of 130+GB it would fail to commit the journal of that filesystem:

usb 1-2: reset high speed USB device using ehci_hcd and address 3
usb 1-2: device not accepting address 3, error -110
usb 1-2: reset high speed USB device using ehci_hcd and address 3
usb 1-2: device not accepting address 3, error -110
usb 1-2: reset high speed USB device using ehci_hcd and address 3
usb 1-2: device not accepting address 3, error -110
usb 1-2: reset high speed USB device using ehci_hcd and address 3
usb 1-2: device not accepting address 3, error -110
usb 1-2: USB disconnect, address 3
sd 4:0:0:0: scsi: Device offlined - not ready after error recovery
sd 4:0:0:0: SCSI error: return code = 0x50000
end_request: I/O error, dev sdc, sector 293449207
sd 4:0:0:0: rejecting I/O to offline device
sd 4:0:0:0: rejecting I/O to device being removed
Feb 13 19:16:01 free last message repeated 2 times
EXT3-fs error (device sdc1): ext3_readdir: directory #18336136 contains a hole at offset 0
Aborting journal on device sdc1.
sd 4:0:0:0: rejecting I/O to device being removed
Buffer I/O error on device sdc1, logical block 1545
lost page write due to I/O error on sdc1
sd 4:0:0:0: rejecting I/O to device being removed
Buffer I/O error on device sdc1, logical block 0
lost page write due to I/O error on sdc1
ext3_abort called.
EXT3-fs error (device sdc1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 4:0:0:0: rejecting I/O to device being removed
EXT3-fs error (device sdc1): ext3_find_entry: reading directory #18336136 offset 0
sd 4:0:0:0: rejecting I/O to device being removed
[...]

All of the above happened in less than 10 seconds.  5 seconds later it failed to commit the journal one more time, stopped printing I/O errors, and started trying to assign a new USB address to the 'new' disk.

From that point on (I only came back a few hours later), listing the contents of the filesystem's mount point would print out an empty directory, although it still appeared to be mounted.  Unmounting the filesystem succeeded, but fsck wouldn't recognize a filesystem there.  Unplugging the disk caused messages about I/O errors on sdc to hit /var/log/messages.  After plugging it back in, the USB subsystem could no longer assign an address to the disk, reporting that the device rejected one address after another:

[...]
usb 1-2: device not accepting address 15, error -110
usb 1-2: new high speed USB device using ehci_hcd and address 16
[11 seconds]
usb 1-2: device not accepting address 16, error -110
usb 1-2: new high speed USB device using ehci_hcd and address 17
[...]

The USB mouse also got very jerky after the failure.  I'm not sure that's related.  I didn't think of trying to reconnect the mouse before rebooting, to see how it would respond.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1939_FC5

How reproducible:
Sometimes

Steps to Reproduce:
1. Boot without maxcpus=1 on an Athlon64X2 box, Asus A8V Deluxe MoBo
2. Write a lot of stuff to an ext3 filesystem on an external USB disk that works fine with maxcpus=1
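
For anyone trying to reproduce step 2 without a full rsync backup, a rough write-load sketch in Python follows. It is not the workload from this report (that was an incremental rsync of 130+GB); the function name `write_load`, the file naming, and the sizes are all made up for illustration. The target directory would be the mount point of the USB disk; the demo below writes to a temporary directory instead.

```python
import os
import tempfile


def write_load(target_dir, total_bytes, chunk_size=1 << 20, file_size=16 << 20):
    """Write total_bytes of zeros under target_dir, fsyncing every file.

    Hypothetical helper: returns the number of bytes written and lets
    OSError propagate on the first I/O failure.
    """
    chunk = b"\0" * chunk_size
    written = 0
    index = 0
    while written < total_bytes:
        path = os.path.join(target_dir, f"load-{index:06d}.bin")
        with open(path, "wb") as f:
            in_file = 0
            while in_file < file_size and written < total_bytes:
                n = min(chunk_size, total_bytes - written, file_size - in_file)
                f.write(chunk[:n])
                in_file += n
                written += n
            # Push the data through the page cache so it reaches the device.
            f.flush()
            os.fsync(f.fileno())
        index += 1
    return written


# Demo against a temporary directory; point target_dir at the real
# mount point (an assumption, not given in this report) to load the disk.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_written = write_load(demo_dir, 4 << 20)
    print(demo_written, "bytes written")
```

The fsync after each file is there so that a device dropping off the bus surfaces as an error in the script rather than only in the kernel log.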

Actual Results:  Journal commit errors.  User panics as the disk appears to be empty :-)

Expected Results:  Flawless backups.  One can always hope :-)

Additional info:

This may be related to bug 181310 and bug 181347, since it's the same box and the same work-around.  For all I know, something may simply be corrupting kernel data structures.
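
The kernel messages quoted in this description can be tallied with a short script, which may help when comparing runs with and without maxcpus=1. The following is only a sketch: the regular expressions are written against the exact message wording shown above and may not match other kernel versions, and the function name `summarize` is invented here. Error -110 corresponds to ETIMEDOUT on Linux; the mapping is hardcoded rather than taken from the host.

```python
import re
from collections import Counter

# Linux errno 110 is ETIMEDOUT; hardcoded so the sketch does not depend
# on the platform it runs on.
ERRNO_NAMES = {110: "ETIMEDOUT"}

# Patterns follow the message wording quoted in this report (assumption:
# other kernels may phrase these lines differently).
ADDR_FAIL = re.compile(
    r"usb (?P<port>[\d.-]+): device not accepting address "
    r"(?P<addr>\d+), error (?P<err>-?\d+)"
)
IO_ERROR = re.compile(
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def summarize(lines):
    """Count address-assignment failures per (port, error) and list I/O errors."""
    addr_failures = Counter()
    io_errors = []
    for line in lines:
        # finditer copes with several messages fused onto one line.
        for m in ADDR_FAIL.finditer(line):
            err = abs(int(m.group("err")))
            addr_failures[(m.group("port"), ERRNO_NAMES.get(err, str(err)))] += 1
        m = IO_ERROR.search(line)
        if m:
            io_errors.append((m.group("dev"), int(m.group("sector"))))
    return addr_failures, io_errors


sample = [
    "usb 1-2: reset high speed USB device using ehci_hcd and address 3",
    "usb 1-2: device not accepting address 3, error -110",
    "usb 1-2: device not accepting address 3, error -110",
    "end_request: I/O error, dev sdc, sector 293449207",
]
failures, io_errors = summarize(sample)
print(failures)
print(io_errors)
```

In practice the input would be the lines of /var/log/messages rather than the inline sample.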

Comment 1 Alexandre Oliva 2006-02-17 19:26:32 UTC
Jerky USB mouse is not related; I just got it with 1955_FC5 without any other
USB device plugged in.  rmmod uhci_hcd; modprobe uhci_hcd fixed it.

Comment 2 Dave Jones 2006-10-16 19:51:05 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 3 Jon Stanley 2008-01-06 01:11:57 UTC
Closing per previous comment