From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.7.3) Gecko/20041020 Description of problem: This is closely related to bug 128602: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128602 I am opening a new one because the error message is now a bit different, and the devs asked for a new bug to be filed. With FC3, kernel 2.6.10-1.737_FC3smp, trying to write to a USB flash drive results in almost immediate filesystem corruption (on the device, not on the host computer)> Here's commented section of the messages file at the problem. Upon inserting a 'Handy Steno' 512MB USB flash drive, which had worked perfectly with Fedora1, this kernel says: Jan 11 20:57:18 planck kernel: usb 1-6: new high speed USB device using ehci_hcd and address 6 Jan 11 20:57:18 planck kernel: scsi4 : SCSI emulation for USB Mass Storage devices Jan 11 20:57:23 planck kernel: Vendor: USB Model: Flash Drive Rev: 1.12 Jan 11 20:57:23 planck kernel: Type: Direct-Access ANSI SCSI revision: 00 Jan 11 20:57:23 planck kernel: SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) Jan 11 20:57:23 planck kernel: sda: Write Protect is off Jan 11 20:57:23 planck kernel: sda: assuming drive cache: write through Jan 11 20:57:23 planck kernel: SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) Jan 11 20:57:23 planck kernel: sda: Write Protect is off Jan 11 20:57:23 planck kernel: sda: assuming drive cache: write through Jan 11 20:57:23 planck kernel: sda: sda1 sda2 Jan 11 20:57:23 planck kernel: Attached scsi removable disk sda at scsi4, channel 0, id 0, lun 0 Jan 11 20:57:23 planck scsi.agent[26069]: disk at /devices/pci0000:00/0000:00:1d.7/usb1/1-6/1-6:1.0/host4/target4:0:0/4:0:0:0 Jan 11 20:57:23 planck fstab-sync[26121]: added mount point /media/usbdsk_win for /dev/sda1 Jan 11 20:57:24 planck fstab-sync[26125]: added mount point /media/usbdsk for /dev/sda2 So far so good. I then try to delete a directory, or copy a few files, on the ext2 partition (/dev/sda2): Jan 11 21:01:27 planck kernel: usb 1-6: reset high speed USB device using ehci_hcd and address 6 Jan 11 21:01:28 planck kernel: usb 1-6: scsi_eh_4 timed out on ep0in Jan 11 21:01:28 planck kernel: usb 1-6: device descriptor read/64, error -110 Jan 11 21:01:33 planck kernel: usb 1-6: scsi_eh_4 timed out on ep0in Jan 11 21:01:33 planck kernel: usb 1-6: device descriptor read/64, error -110 Jan 11 21:01:33 planck kernel: scsi: Device offlined - not ready after error recovery: host 4 channel 0 id 0 lun 0 Jan 11 21:01:33 planck kernel: usb 1-6: USB disconnect, address 6 Jan 11 21:01:33 planck kernel: SCSI error : <4 0 0 0> return code = 0x50000 Jan 11 21:01:33 planck kernel: end_request: I/O error, dev sda, sector 816350 Jan 11 21:01:33 planck kernel: scsi4 (0:0): rejecting I/O to offline device Jan 11 21:01:33 planck last message repeated 2 times Jan 11 21:01:33 planck kernel: Buffer I/O error on device sda2, logical block 368494 Jan 11 21:01:33 planck kernel: lost page write due to I/O error on sda2 Jan 11 21:01:33 planck kernel: scsi4 (0:0): rejecting I/O to device being removed After that, the filesystem on the device is fully hosed: Jan 11 21:01:34 planck kernel: scsi4 (0:0): rejecting I/O to device being removed Jan 11 21:01:34 planck last message repeated 34 times Jan 11 21:01:34 planck kernel: EXT2-fs error (device sda2): ext2_readdir: bad page in #78945 The only solution is to plug the USB drive on a Fedora1 machine and mke2fs the filesystem again. It's not worth running fsck, since the error flood is so long that it takes forever. I know that it's hard for the kernel team to debug this, but frankly , filesystem corruption on USB devices which had worked for ages without a glitch is a really nasty problem. Version-Release number of selected component (if applicable): How reproducible: Always Steps to Reproduce: 1. Plug USB flash drive 2. Attempt write operation to it. 3. Filesystem corruption and irrecoverable data loss on flash drive. Actual Results: Data loss. Expected Results: No data loss. Additional info:
Looks like a crash of the firmware.
Do you mean, on the USB flash disk? Even if this is the case, I should note that Fedora1/RedHat9 machines currently can handle this device perfectly.
Yes, it's a regression. Thanks for spelling it out explicitly. Something in the way 2.6 kernel works confuses the device. Unfortunately, before the usbmon is deployed, they only way to find out what is happening is to rebuild with CONFIG_USB_STORAGE_DEBUG, then try to collect the trace and pray that it either fits into dmesg buffer or klogd doesn't drop vital records... It's such a pain, I'm reluctant to recommend this to users.
Thanks for acknowledging this. I truly hope that this can be resolved in some way, and I'll be glad to help testing. I have unfortunately already sunk a ton of time into this USB problem, and for a number of reasons here I can not build custom kernels, I am limited to installing things which are deployed via yum (though I can use the updates-testing repo). But if you can put out a kernel which enables extended debugging info on this problem, I'll gladly send it back. As I said in comment #47 to https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128602 the current situation with regards to USB external storage with the 2.6 kernel series is a complete disaster. And I've been bumping into this problem ever since Fedora 2 came out, all with fairly standard hardware. I am really surprised that such a nasty bug has lasted this long, since I would imagine that using USB storage devices for external backups is a pretty common practice these days. If it's something I'm doing wrong, I'll be happy to be corrected, but so far everything indicates kernel bugs. Again, my thanks for keeping an eye on this. And apologies if my tone is a bit upset, but I've wasted a lot of time rebuilding corrupted filesystems, so I'm a bit frustrated :)
Fernando, did you have a chance to try an FC4 test kernel? Something like 2.6.11-1.1275_FC4 is interesting (it's based on 2.6.12-rc3).
OK, I just ran some tests with the current FC4 test kernel (kernel-smp-2.6.11-1.1286_FC4) on my Fedora3 box, as suggested. It's the same disaster as before. In addition, the usb drive doesn't appear with auto-generated mountpoints under /media as it used to, but that's a separate issue (I just mention it in case it's of interest). Tailing /var/log/messages, I get: - On device insertion (an Apacer 512MB HandySteno USB flash disk): May 13 12:11:56 planck kernel: usb 1-5: new high speed USB device using ehci_hcd and address 5 May 13 12:11:56 planck kernel: scsi3 : SCSI emulation for USB Mass Storage devices May 13 12:12:01 planck kernel: Vendor: USB Model: Flash Drive Rev: 1.12 May 13 12:12:01 planck kernel: Type: Direct-Access ANSI SCSI revision: 00 May 13 12:12:01 planck scsi.agent[6121]: disk at /devices/pci0000:00/0000:00:1d.7/usb1/1-5/1-5:1.0/host3/target3:0:0/3:0:0:0 May 13 12:12:01 planck kernel: SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) May 13 12:12:01 planck kernel: sda: Write Protect is off May 13 12:12:01 planck kernel: sda: assuming drive cache: write through May 13 12:12:01 planck kernel: SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) May 13 12:12:01 planck kernel: sda: Write Protect is off May 13 12:12:01 planck kernel: sda: assuming drive cache: write through May 13 12:12:01 planck kernel: sda: sda1 sda2 May 13 12:12:01 planck kernel: Attached scsi removable disk sda at scsi3, channel 0, id 0, lun 0 - I then mount it (mount points manually added): May 13 12:18:51 planck kernel: kjournald starting. Commit interval 5 seconds May 13 12:18:51 planck kernel: EXT3 FS on sda2, internal journal May 13 12:18:51 planck kernel: EXT3-fs: recovery complete. May 13 12:18:51 planck kernel: EXT3-fs: mounted filesystem with ordered data mode. - A large transfer is initiated, all seems to work for a few minutes. Then, boom: May 13 12:22:05 planck kernel: usb 1-5: reset high speed USB device using ehci_hcd and address 5 May 13 12:22:10 planck kernel: usb 1-5: device descriptor read/8, error -110 May 13 12:22:15 planck kernel: usb 1-5: device descriptor read/8, error -110 May 13 12:22:15 planck kernel: usb 1-5: reset high speed USB device using ehci_hcd and address 5 ... May 13 12:23:02 planck kernel: scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 0 lun 0 May 13 12:23:02 planck kernel: SCSI error : <3 0 0 0> return code = 0x50000 May 13 12:23:02 planck kernel: end_request: I/O error, dev sda, sector 453786 May 13 12:23:02 planck kernel: Buffer I/O error on device sda2, logical block 187213 May 13 12:23:02 planck kernel: lost page write due to I/O error on sda2 May 13 12:23:02 planck kernel: scsi3 (0:0): rejecting I/O to offline device May 13 12:23:02 planck kernel: Buffer I/O error on device sda2, logical block 187214 At this point the filesystem on the device is hosed: May 13 12:23:02 planck kernel: usb 1-5: USB disconnect, address 5 May 13 12:23:02 planck kernel: scsi3 (0:0): rejecting I/O to device being removed May 13 12:23:02 planck last message repeated 285 times May 13 12:23:02 planck kernel: EXT3-fs error (device sda2): read_block_bitmap: Cannot read block bitmap - block_group = 27, block_bitmap = 221188 May 13 12:23:02 planck kernel: Aborting journal on device sda2. May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in ext3_ordered_writepage: IO failure May 13 12:23:02 planck kernel: scsi3 (0:0): rejecting I/O to device being removed May 13 12:23:02 planck last message repeated 2 times May 13 12:23:02 planck kernel: ext3_abort called. May 13 12:23:02 planck kernel: EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal<3>scsi3 (0:0): rejecting I/O to device being removed May 13 12:23:02 planck kernel: May 13 12:23:02 planck kernel: Remounting filesystem read-only May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in ext3_prepare_write: IO failure May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in start_transaction: Journal has aborted May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in start_transaction: Journal has aborted May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted May 13 12:23:02 planck kernel: EXT3-fs error (device sda2) in ext3_dirty_inode: Journal has aborted It then seems like the kernel tries to remount it: May 13 12:23:02 planck kernel: sda : READ CAPACITY failed. May 13 12:23:02 planck kernel: sda : status=0, message=00, host=1, driver=00 May 13 12:23:02 planck kernel: sda : sense not available. May 13 12:23:02 planck kernel: sda: Write Protect is off May 13 12:23:02 planck kernel: sda: assuming drive cache: write through May 13 12:23:02 planck kernel: scsi3 (0:0): rejecting I/O to device being removed May 13 12:23:02 planck last message repeated 190 times May 13 12:23:02 planck kernel: __journal_remove_journal_head: freeing b_committed_data May 13 12:23:02 planck last message repeated 2 times etc (more similar stuff follows). As a final note, the little read/write LED on the usb flashdisk stays solid ON once the problems begin. I'll gladly test again if there is reason to expect that a new kernel might improve things back to what the 2.4 series used to do. Just drop me a line here or directly.
Thanks a lot, Fernando. This is the key bit: > - A large transfer is initiated, all seems to work for a few minutes. Then, boom: > May 13 12:22:05 planck kernel: usb 1-5: reset high speed USB device using ehci_hcd and address 5 > May 13 12:22:10 planck kernel: usb 1-5: device descriptor read/8, error -110 Now we have to separate two things: running out of RAM and the infamous EHCI hangs. Please do this for me: 1. Try separately big enough transfers in two directions: read and write. Copying to device eats memory. 2. After failure, take dmesg and attach (do not drop into comments box) Preferably try to capture it right after the reset to avoid overflow of dmesg buffer. /var/log/messages doesn't have complete picture. 3. This is a little tricky... It would be great to capture SysRq-M right after the moment the EHCI collapses to verify that we have atomic memory depleted. But you need to watch dmesg like a hawk to hit it in the right moment.
Adding Riel to cc: just FYI about write throttling issues with ext3. No immediate action requested.
Better pigeonhole - corrected summary to show ext3, USB flash.
OK, I'm happy to test, but just to clarify: I sort of doubt this is a memory problem. The machine has 1GB of ram and I was copying maybe 20-40 MB in small files (it was actually an rsync script syncing 2 directories). The machine never showed any signs of memory depletion, no swapping, and I keep gkrellm running which didn't show any abnormal increase in memory usage. So let me know if this changes anything before I get into testing further.
Free atomics occupy only a small percentage of memory at any given time. If kswapd cannot keep up finding pages to refill the pool, we experience a temporary depletion. Give me that dmesg though, I need it.
Created attachment 114354 [details] requested kernel messages Here's the dmesg output right after I first saw signs of trouble. I'll attach two more I made for a few more minutes, fearing the kernel's buffer might fill up.
Created attachment 114355 [details] kernel messages (more) A few minutes later, after more stuff came out in /var/log/messages, I made another copy of dmesg in case the new info could be useful (I wanted the first one to be made immediately so as not to lose any of the early info).
Created attachment 114356 [details] kernel messages (last) One more set, just in case. There will obviously be overlap between the three, I just wanted to be safe.
Comment on attachment 114354 [details] requested kernel messages Last dmesg output has everything.
Comment on attachment 114355 [details] kernel messages (more) Sorry about the noise: I just realized the third attachment does contain the full output from 1 and 2. So just ignore the first two, the last one has it all. I thought dmesg might drop stuff quicker, sorry for not double-checking before submitting.
Just a couple more data points: - I tested again with the same USB flash disk, but formatted as ext2 instead of ext3. Same results. - When trying to burn some DVDs on an external, USB-connected burner, I started getting messages like: Jun 15 16:25:35 planck kernel: SCSI error : <2 0 0 0> return code = 0x70000 Jun 15 16:25:35 planck last message repeated 7 times Jun 15 16:25:36 planck kernel: usb 1-1: USB disconnect, address 6 Jun 15 16:25:38 planck fstab-sync[12084]: removed mount point /media/cdrecorder1 for/dev/scd0 A bit of googling pointed me towards setting echo N > /sys/module/usbcore/parameters/old_scheme_first This seems to have fixed the DVD burning issues. I got hopeful that the same thing might also help the problems with the USB flash disk, but no: I just tried (after checking that old_scheme_first is indeed set at 'N'), and I still get the same problems. These tests were done again with the 2.6.11-1.27_FC3 kernel, as I had to reinstall my box (unrelated problem). I don't know if this info is terribly useful, but I'm trying to provide what I can. If you'd want me to test with any of the newer FC4 kernels, please let me know and I'll gladly do so.
Thanks, good data. Note that the renumbering scheme only comes into play when device reconnects. I'm looking here for a reason it disconnects... That's what I think the bug is with the USB key (leaving the DVD drive aside)
Well, my DVD burn just died after about 500MB transferred with those messages, and some blind googling turned up the old_scheme suggestion. It _did_ help with the DVD problem, but I have no idea why, sorry :) Again, let me know of any info/tests I can help with and I'll be glad to comply. I'd really like to see this problem fixed, it's causing no end of headaches on our network, to the point of considering reverting to Fedora1 (the last version where usb storage devices worked reliably).
The only way the old_scheme can have a lasting effect after the enumeration is done is when it bounces devices between hi-speed EHCI and its companion full-speed controller. This points at a broken EHCI which simply cannot hold up. Is there a way to add a PCI USB 2.0 card for that box?
Unfortunately I don't think so: this is a 'small form factor' Dell GX-270. I don't think they have any open PCI slots, and the chassis is half-width, so no standard card of any kind fits in. I should note that with Fedora1 (kernel 2.4.X) this all worked just fine always (for a long time, with daily use), so somehow that kernel manages to make it work. I have next to me a nearly identical machine running redhat9.0, and that one also works just fine. That RH9 box stays just so we can move data off our network onto external USB drives, since after the upgrade to Fedora2/3 we lost that ability on the entire network.
I have taken a note that FC1 worked. But the difficulty is that FC1 was our last release with 2.4 kernel. EHCI has diverged too far between 2.4 and 2.6 now, so it's not feasible for me to make proper divide-and-test. Even for you, with local access, it will take a month, I expect.
Bummer. I'd really like to help with some tests to see this fixed, but I can't commit to a month-long deep debugging effort, unfortunately. Would it help if I mailed you physically the USB flash disk? It's nearly useless to me as is, so I could send it by mail to you for local testing. It misbehaves both on my desktop and on my Dell Inspiron 4100 laptop (usb 1.1 only), so you might be able to reproduce the problem locally and send it back later. It's a standard APACER Handy Steno 512 MB: http://www.flash-memory-store.com/512mb-apacer-steno.html (wow, I paid $140 for that a while ago :) Let me know if there's anything I can do to help track this down, and many thanks for at least paying attention to the problem. Much appreciated.
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.
No luck: I just tested that kernel under FC3. It took longer than before to die, but eventually (a large rsync-controlled transfer to a USB key) it did. Upon insertion, dmesg said: usb 1-5: new high speed USB device using ehci_hcd and address 7 scsi6 : SCSI emulation for USB Mass Storage devices usb-storage: device found at 7 usb-storage: waiting for device to settle before scanning Vendor: USB Model: Flash Drive Rev: 1.12 Type: Direct-Access ANSI SCSI revision: 00 SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) sda: Write Protect is off sda: Mode Sense: 37 00 00 00 sda: assuming drive cache: write through SCSI device sda: 1015805 512-byte hdwr sectors (520 MB) sda: Write Protect is off sda: Mode Sense: 37 00 00 00 sda: assuming drive cache: write through sda: sda1 sda2 Attached scsi removable disk sda at scsi6, channel 0, id 0, lun 0 usb-storage: device scan complete Then I started the rsync transfer (writing about 500MB to the device). I should note that it felt unusually slow, but it could just be my perception. After about 10 minutes: usb 1-5: reset high speed USB device using ehci_hcd and address 7 usb 1-5: device descriptor read/64, error -110 usb 1-5: device descriptor read/64, error -110 usb 1-5: reset high speed USB device using ehci_hcd and address 7 usb 1-5: device descriptor read/64, error -110 usb 1-5: device descriptor read/64, error -110 usb 1-5: reset high speed USB device using ehci_hcd and address 7 usb 1-5: device descriptor read/8, error -110 usb 1-5: device descriptor read/8, error -110 usb 1-5: reset high speed USB device using ehci_hcd and address 7 usb 1-5: device descriptor read/8, error -110 usb 1-5: device descriptor read/8, error -110 scsi: Device offlined - not ready after error recovery: host 6 channel 0 id 0 lun 0 SCSI error : <6 0 0 0> return code = 0x50000 end_request: I/O error, dev sda, sector 426406 scsi6 (0:0): rejecting I/O to offline device scsi6 (0:0): rejecting I/O to offline device scsi6 (0:0): rejecting I/O to offline device scsi6 (0:0): rejecting I/O to offline device Buffer I/O error on device sda2, logical block 173522 lost page write due to I/O error on sda2 And tons more of the same. At this point, a reboot is the only way to get the usb subsystem back to even recognize an inserted device.
This is a mass-update to all currently open Fedora Core 3 kernel bugs. Fedora Core 3 support has transitioned to the Fedora Legacy project. Due to the limited resources of this project, typically only updates for new security issues are released. As this bug isn't security related, it has been migrated to a Fedora Core 4 bug. Please upgrade to this newer release, and test if this bug is still present there. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. Thank you.
This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.
Closing per previous comment.