Description of problem:
OOPS i/o error, mark_buffer_dirty in fs/buffer.c

This oops occurred after the system had been up and running for 2 days, running folding@home and seeding torrents (only 30kbps up). The I/O load is low, and nothing should have been writing to or reading from this disk at the time. The FAH client and torrents are not stored on this device. The affected filesystem is ext3, originally created by F6 or F7. The attached dmesg shows many i/o failures to the offline device before this oops.

Version-Release number of selected component (if applicable):
2.6.25-0.218.rc8.git7.fc9.i686

How reproducible:
Unknown

Steps to Reproduce:
No idea, the device offlined itself.

Additional info:
 12:57:17 up 2 days, 9:45, 3 users, load average: 1.39, 1.40, 1.22

#-> df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/luks-sdb4   15G  5.3G  8.8G  38% /
/dev/mapper/luks-sdb2   20G   14G  5.4G  72% /home
/dev/sdb1              342M   54M  272M  17% /boot
tmpfs                  506M   24K  506M   1% /tmp
tmpfs                  506M   48K  506M   1% /dev/shm
/dev/sdd1               20G   12G  7.4G  61% /media/extarc
/dev/sdd3               76G   67G  8.7G  89% /media/archive
/dev/sdd2               20G   14G  5.8G  71% /media/blackhole
/dev/sda1               38G   27G   11G  71% /media/disk

#-> cat /etc/mtab
/dev/mapper/luks-sdb4 / ext4dev rw 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/mapper/luks-sdb2 /home ext4dev rw 0 0
/dev/sdb1 /boot ext3 rw 0 0
tmpfs /tmp tmpfs rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
/dev/sdd1 /media/extarc ext3 rw,noexec,nosuid,nodev 0 0
/dev/sdd3 /media/archive vfat rw,noexec,nosuid,nodev,shortname=lower,fmask=0013,dmask=0002,gid=555 0 0
/dev/sdd2 /media/blackhole fuseblk rw,noexec,nosuid,nodev,noatime,allow_other,default_permissions,blksize=1024 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
/dev/sda1 /media/disk fuseblk rw,nosuid,nodev,noatime,allow_other,blksize=4096 0 0

/var/log/messages:
Apr 15 12:44:10 localhost kernel: FAT: Directory bread(block 77245) failed
Apr 15 12:44:10 localhost kernel: sd 3:0:0:0: rejecting I/O to offline device
Apr 15 12:44:10 localhost kernel: EXT3-fs error (device sdd1): ext3_find_entry: reading directory #2 offset 0
Apr 15 12:44:10 localhost kernel: sd 3:0:0:0: rejecting I/O to offline device
Apr 15 12:44:10 localhost kernel: Buffer I/O error on device sdd1, logical block 0
Apr 15 12:44:10 localhost kernel: lost page write due to I/O error on sdd1
Apr 15 12:44:10 localhost kernel: sd 3:0:0:0: rejecting I/O to offline device
Apr 15 12:44:10 localhost kernel: EXT3-fs error (device sdd1): ext3_find_entry: reading directory #2 offset 0
Apr 15 12:44:10 localhost kernel: ------------[ cut here ]------------
Apr 15 12:44:10 localhost kernel: WARNING: at fs/buffer.c:1183 mark_buffer_dirty+0x23/0x6a() (Not tainted)
Apr 15 12:44:10 localhost kernel: Modules linked in: appletalk autofs4 smsc47m192 hwmon_vid hwmon sunrpc ipt_REJECT nf_conntrack_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp xt_limit nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 fuse vfat fat ext3 jbd dm_multipath firewire_sbp2 snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 snd_rawmidi snd_seq_dummy arc4 dcdbas ecb snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss floppy pcspkr snd_pcm serio_raw rt2500pci rt2x00pci rt2x00lib snd_seq_device rfkill snd_timer usb_storage snd_page_alloc input_polldev snd_util_mem mac80211 snd_hwdep snd 3c59x firewire_ohci tulip cfg80211 firewire_core mii soundcore crc_itu_t emu10k1_gp eeprom_93cx6 gameport button iTCO_wdt iTCO_vendor_support i2c_i801 joydev i2c_core intel_rng sg sr_mod cdrom ata_generic pata_acpi ata_piix libata sd_mod scsi_mod sha256_generic cbc aes_i586 aes_generic dm_crypt crypto_blkcipher dm
Apr 15 12:44:10 localhost kernel: _snapshot dm_zero dm_mirror dm_mod ext4dev jbd2 mbcache crc16 uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Apr 15 12:44:10 localhost kernel: Pid: 3986, comm: gvfsd-trash Not tainted 2.6.25-0.218.rc8.git7.fc9.i686 #1
Apr 15 12:44:10 localhost kernel: [warn_on_slowpath+71/115] warn_on_slowpath+0x47/0x73
Apr 15 12:44:10 localhost kernel: [vt_console_print+51/646] ? vt_console_print+0x33/0x286
Apr 15 12:44:10 localhost kernel: [default_wake_function+11/13] ? default_wake_function+0xb/0xd
Apr 15 12:44:10 localhost kernel: [__wake_up_common+53/91] ? __wake_up_common+0x35/0x5b
Apr 15 12:44:10 localhost kernel: [vt_console_print+0/646] ? vt_console_print+0x0/0x286
Apr 15 12:44:10 localhost kernel: [__call_console_drivers+86/99] ? __call_console_drivers+0x56/0x63
Apr 15 12:44:10 localhost kernel: [_call_console_drivers+87/91] ? _call_console_drivers+0x57/0x5b
Apr 15 12:44:10 localhost kernel: [release_console_sem+416/424] ? release_console_sem+0x1a0/0x1a8
Apr 15 12:44:10 localhost kernel: [mark_buffer_dirty+35/106] mark_buffer_dirty+0x23/0x6a
Apr 15 12:44:10 localhost kernel: [<f91e1c36>] ext3_commit_super+0x40/0x53 [ext3]
Apr 15 12:44:10 localhost kernel: [<f91e3076>] ext3_handle_error+0x71/0x95 [ext3]
Apr 15 12:44:10 localhost kernel: [<f91e3129>] ext3_error+0x39/0x43 [ext3]
Apr 15 12:44:10 localhost kernel: [<f91dffdc>] ext3_find_entry+0x3a1/0x52b [ext3]
Apr 15 12:44:10 localhost kernel: [inode_has_perm+91/101] ? inode_has_perm+0x5b/0x65
Apr 15 12:44:10 localhost kernel: [<f91e088d>] ext3_lookup+0x25/0xa2 [ext3]
Apr 15 12:44:10 localhost kernel: [do_lookup+161/320] do_lookup+0xa1/0x140
Apr 15 12:44:10 localhost kernel: [__link_path_walk+2204/3324] __link_path_walk+0x89c/0xcfc
Apr 15 12:44:10 localhost kernel: [rb_insert_color+86/192] ? rb_insert_color+0x56/0xc0
Apr 15 12:44:10 localhost kernel: [mntput_no_expire+22/105] ? mntput_no_expire+0x16/0x69
Apr 15 12:44:10 localhost kernel: [path_walk+76/155] path_walk+0x4c/0x9b
Apr 15 12:44:10 localhost kernel: [do_path_lookup+391/464] do_path_lookup+0x187/0x1d0
Apr 15 12:44:10 localhost kernel: [__user_walk_fd+47/67] __user_walk_fd+0x2f/0x43
Apr 15 12:44:10 localhost kernel: [vfs_lstat_fd+22/61] vfs_lstat_fd+0x16/0x3d
Apr 15 12:44:10 localhost kernel: [autoremove_wake_function+0/51] ? autoremove_wake_function+0x0/0x33
Apr 15 12:44:10 localhost kernel: [selinux_file_permission+256/262] ? selinux_file_permission+0x100/0x106
Apr 15 12:44:10 localhost kernel: [vfs_lstat+17/19] vfs_lstat+0x11/0x13
Apr 15 12:44:10 localhost kernel: [sys_lstat64+20/40] sys_lstat64+0x14/0x28
Apr 15 12:44:10 localhost kernel: [audit_syscall_entry+249/291] ? audit_syscall_entry+0xf9/0x123
Apr 15 12:44:10 localhost kernel: [do_syscall_trace+294/365] ? do_syscall_trace+0x126/0x16d
Apr 15 12:44:10 localhost kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Apr 15 12:44:10 localhost kernel: =======================
Apr 15 12:44:10 localhost kernel: ---[ end trace c85abf95dc3347d0 ]---
Apr 15 12:44:10 localhost kernel: sd 3:0:0:0: rejecting I/O to offline device
Apr 15 12:44:10 localhost kernel: Buffer I/O error on device sdd1, logical block 0
Apr 15 12:44:10 localhost kernel: lost page write due to I/O error on sdd1
Apr 15 12:44:22 localhost kerneloops: Submitted 1 kernel oopses to www.kerneloops.org
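The trace above mixes reliable frames with entries the kernel marks "?" (stack words it could not confirm as return addresses). As a rough aid for reading it, the following Python sketch pulls out only the unmarked frames. The function name `reliable_frames` and the regular expression are illustrative assumptions written for this log format, not part of any kernel tooling.

```python
import re

# Matches "symbol+0xOFF/0xLEN", optionally preceded by "? ".
# The decimal "[symbol+71/115]" prefix in these logs is deliberately not matched.
FRAME_RE = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(log_text):
    """Return the function names of frames not marked '?', innermost first."""
    frames = []
    for line in log_text.splitlines():
        if "WARNING:" in line:
            # The WARNING header repeats the faulting symbol; it is not a frame.
            continue
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames
```

Applied to the log in this report, the unmarked frames read from `sys_lstat64` down through `ext3_lookup`, `ext3_find_entry`, `ext3_error`, `ext3_handle_error` and `ext3_commit_super` into `mark_buffer_dirty`: an lstat() from gvfsd-trash hit the failed directory read, and the ext3 error path then tried to mark the superblock buffer dirty.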
Created attachment 302519 dmesg-oops-mark-buffer-dirty.txt
Created attachment 302520 var-log-messages-oops-mark-buffer-dirty.txt Shows the timestamping for the I/O failure events.
The disk is still working after journal recovery. I then ran fsck, which modified the filesystem, but I didn't capture the verbose output for that.
Your firewire hard drive went offline earlier -- that is the real error:

firewire_sbp2: fw1.0: sbp2_scsi_abort
firewire_sbp2: fw1.0: sbp2_scsi_abort
sd 3:0:0:0: Device offlined - not ready after error recovery
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK

Those are the only messages I could find.
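To make the distinction between the real error and its follow-on noise easier to see in a long log, here is a small Python sketch that partitions lines into the two groups. The patterns are taken from the messages quoted in this report; the function name `split_root_cause` is a hypothetical helper, and the pattern lists are an assumption that may well miss other related messages.

```python
import re

# Messages that indicate the device itself dropping off the bus.
ROOT_CAUSE = re.compile(r"sbp2_scsi_abort|Device offlined")
# Messages that are consequences of I/O being sent to the offline device.
FOLLOW_ON = re.compile(
    r"rejecting I/O to offline device|Buffer I/O error|lost page write"
    r"|EXT3-fs error|FAT: Directory bread"
)


def split_root_cause(lines):
    """Partition log lines into (root_cause, follow_on), preserving order."""
    root, follow = [], []
    for line in lines:
        if ROOT_CAUSE.search(line):
            root.append(line)
        elif FOLLOW_ON.search(line):
            follow.append(line)
    return root, follow
```

Feeding it the lines of the attached dmesg would be expected to leave only the sbp2_scsi_abort and "Device offlined" entries in the first list.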
Huh. Yeah, not a whole lot to go on there, the bus appears to be completely silent until the sbp2_scsi_aborts hit... Judging from the GUID of fw0, you're hooking to the FireWire port on your Audigy sound card, which has no known issues. I'm mildly concerned about the GUID of the disk though. It's reporting a vendor of Xerox, which I'm assuming is incorrect. That being the case, there may well be a firmware update available for your device that is worth applying for that and other reasons. Was there anything interesting environmentally that happened around the time of the failure? Power fluctuation? Disk decided to suspend itself? Cosmic rays?
You are correct, it is connected to the Audigy FW port, which has functioned pretty well for several years now in both Linux and Windows. This is the same device that has had prior issues though, see bug 428554 (was a powerup login failure, firewire_sbp2: orb reply timed out). The device is actually an I-Rocks external USB2/FW enclosure with a Maxtor 120GB IDE drive inside it. I'm not aware of any firmware that can be uploaded to the enclosure bits, though there may be. I do not think there were any environmental issues involved; the machine is running off an APC UPS (500VA), which didn't sound off any low power conditions. I felt no cosmic rays, but I may have been dozing off at the time. ;)
These are of course two different bugs:
 - Disk going offline for no apparent reason.
 - ext3 throwing a bug after the partition became inaccessible.

Could you have a look at the chips on the IDE bridge board in the enclosure? It would be interesting to know which chips are used.

Can you reproduce the first bug? For example, try the following while the disk is not mounted (to avoid data corruption and the ext3 bug):
# dd if=/dev/disk/by-id/ieee1394*0000 of=/dev/null
If this doesn't reproduce it, mount the disk and read files from it, e.g. with
# find /path/to/the/filesystem -type f -exec cat {} \; > /dev/null

If you find a way to reliably reproduce the offlining, try
# echo 1 > /sys/module/firewire_sbp2/parameters/workarounds
Then plug the disk back in and check if you get the IO errors again.
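The sequential read test can also be scripted so that the byte offset at which the device fails is recorded. The following is a minimal Python sketch of the same idea as the dd command, not a replacement for it; `read_until_error` is a hypothetical helper name and the 1 MiB chunk size is an arbitrary assumption.

```python
def read_until_error(path, chunk_size=1024 * 1024):
    """Read `path` sequentially and discard the data.

    Returns (bytes_read, error), where error is None if the end of the
    device or file was reached, or the OSError that stopped the read.
    """
    total = 0
    try:
        # Unbuffered binary mode, so each read() maps to one read request.
        with open(path, "rb", buffering=0) as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    return total, None
                total += len(chunk)
    except OSError as exc:
        return total, exc
```

It would be run as root against the unmounted device node (for example the same /dev/disk/by-id path as in the dd command); a non-None error together with the byte count gives a rough idea of how far the read got before the disk went away.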
> These are of course two different bugs:
>  - Disk going offline for no apparent reason.
>  - ext3 throwing a bug after the partition became inaccessible.

On the latter:
http://www.kerneloops.org/searchweek.php?search=mark_buffer_dirty
Also, if cracking open the case to look at the bridge chip is impractical, attaching /sys/bus/firewire/devices/fwX/config_rom here might give enough info to tell what's in there. (Where X is the fw device # matching the disk).
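For a first look at that config_rom file before attaching it, the GUID and vendor ID can be extracted from the bus info block. The Python sketch below assumes the standard IEEE 1394 layout (quadlet 1 is the ASCII bus name "1394", quadlets 3 and 4 form the 64-bit GUID, whose top 24 bits are the vendor ID). The byte order of the sysfs file is an assumption here, so the code tries both orders and keys on the "1394" magic; `parse_guid` is a hypothetical helper name.

```python
import struct

MAGIC = 0x31333934  # ASCII "1394", expected in quadlet 1 of the bus info block


def parse_guid(rom):
    """Return (guid, vendor_id) from the first five config ROM quadlets."""
    if len(rom) < 20:
        raise ValueError("need at least 20 bytes (5 quadlets)")
    # Byte order of the dump is assumed unknown: try big-endian, then little.
    for fmt in (">5I", "<5I"):
        quads = struct.unpack(fmt, rom[:20])
        if quads[1] == MAGIC:
            guid = (quads[3] << 32) | quads[4]
            vendor_id = quads[3] >> 8  # top 24 bits of the GUID
            return guid, vendor_id
    raise ValueError("bus name '1394' not found; not a FireWire config ROM?")
```

Usage would be along the lines of reading /sys/bus/firewire/devices/fwX/config_rom in binary mode and printing the two values in hex; the vendor ID can then be compared against the IEEE OUI registry.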
Stefan/Jarod, sorry for not getting back to this quickly. I'd be happy to open it and get that info; I'm an EE major so it shouldn't be a problem, but I've been under a heavy load this last week. I will get to this. The error hasn't shown up again yet, and the system has been up for 12 days under load, so while I do want to look into why, it's not the most pressing thing just now.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Stefan, I finally remembered this bug and decided to open up the drive. I haven't had issues with this particular problem in a while, but I have not had the drive connected to the Audigy card; instead it's been used with my MacBook's FireWire port, and seems to operate fine there with my rawhide install (so far, I rarely turn it on). The USB2/Firewire to IDE bridge is an AGERE FW802B, the full code is: AGERE FW802B 65518433 05021. That desktop system with the Audigy has been packed away for a while since I moved out of my college town in June, but I do plan to revisit this in the future and try those debugging suggestions.
Thanks for the info. Agere FW802B is only a so-called PHY though (FireWire physical layer). There is another chip on the bridge board which implements the IDE bridge, which from the FireWire point of view would be a so-called link layer controller, possibly combined with a USB-IDE bridge. The behaviour of a FireWire disk is of course influenced by the PHY, by the link (with its SBP-2 firmware), and by the IDE or SATA drive. But if there are device quirks, then they are most likely caused by the link and its firmware, because the link is more complex than the PHY and is developed and tested less rigorously than the drive, which is produced in much larger quantities.
> These are of course two different bugs:
>  - Disk going offline for no apparent reason.

If this is not reproducible, let's ignore it.

>  - ext3 throwing a bug after the partition became inaccessible.

It seems this warning still exists in upstream 2.6.27.5 and 2.6.28-rc3.
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.