Description of problem: Well, this is only a suspect, but, since some days, an external firewire harddisk cannot be accessed anymore. At first I suspect an hardware problem, then, due to other reasons, I downgraded the firewire stack to the old one (ohci1349, ieee1394, sbp2, etc.) and the drive started to work again... The only reasonable update, which happened in the meantime, was "udev" (no new kernel or other libraries, I think). That's way I'm reporting the issue here, even if I cannot be sure "udev" is guilty. Version-Release number of selected component (if applicable): udev-118-1.fc8 How reproducible: Systematically Steps to Reproduce: 1. plug-in the MD2-FW2 (Datafab) external firewire harddisk Actual results: The device is offlined and the following appears with "dmesg": ... firewire_core: created new fw device fw0 (0 config rom retries, S400) firewire_core: created new fw device fw1 (0 config rom retries, S400) firewire_core: phy config: card 0, new root=ffc2, gap_count=7 scsi6 : SBP-2 IEEE-1394 firewire_sbp2: error status: 0:4 firewire_sbp2: error status: 0:4 firewire_sbp2: logged in to fw1.0 LUN 0000 (2 retries) scsi 6:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 sd 6:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB) sd 6:0:0:0: [sdb] Write Protect is off sd 6:0:0:0: [sdb] Mode Sense: 10 00 00 00 sd 6:0:0:0: [sdb] Cache data unavailable sd 6:0:0:0: [sdb] Assuming drive cache: write through sd 6:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB) sd 6:0:0:0: [sdb] Write Protect is off sd 6:0:0:0: [sdb] Mode Sense: 10 00 00 00 sd 6:0:0:0: [sdb] Cache data unavailable sd 6:0:0:0: [sdb] Assuming drive cache: write through sdb: sdb1 sd 6:0:0:0: [sdb] Attached SCSI disk sd 6:0:0:0: Attached scsi generic sg2 type 0 end_request: I/O error, dev sdb, sector 136 Buffer I/O error on device sdb1, logical block 13 Buffer I/O error on device sdb1, logical block 14 Buffer I/O error on device sdb1, logical block 15 Buffer I/O error on device sdb1, logical block 16 Buffer I/O error on device sdb1, logical block 17 Buffer I/O error on device sdb1, logical block 18 Buffer I/O error on device sdb1, logical block 19 Buffer I/O error on device sdb1, logical block 20 Buffer I/O error on device sdb1, logical block 21 Buffer I/O error on device sdb1, logical block 22 end_request: I/O error, dev sdb, sector 136 end_request: I/O error, dev sdb, sector 136 ... ... end_request: I/O error, dev sdb, sector 0 printk: 39 messages suppressed. Buffer I/O error on device sdb, logical block 0 Buffer I/O error on device sdb, logical block 1 Buffer I/O error on device sdb, logical block 2 Buffer I/O error on device sdb, logical block 3 end_request: I/O error, dev sdb, sector 0 end_request: I/O error, dev sdb, sector 0 end_request: I/O error, dev sdb, sector 32 end_request: I/O error, dev sdb, sector 32 end_request: I/O error, dev sdb, sector 0 end_request: I/O error, dev sdb, sector 8 end_request: I/O error, dev sdb, sector 0 end_request: I/O error, dev sdb, sector 0 ... Expected results: The device should be enabled and working. Additional info: A second firewire sbp2 controller seems to be, however, working... The previous kernel, 2.6.23.8-63.fc8, shows the same problem. The old firewire stack, as mentioned above, works fine and a disk check did not report any problem. The harddisk itself is working fine even with an USB controller. Of course, I could try the old udev (and libvolume_id) if I can get it somewhere, or a new one. Thanks, bye, pg
OK, I got an enlightenment and I installed udev-116-3 from the base F8 release and, surprise surprise, the drive works fine now. So, I think I can confirm udev-118-1 creates some problems with this specific sbp2 unit and the new firewire stack. However, it can also be that this udev triggers a bug of the new firewire stack... How should we proceed? Thanks, pg
Uhm, just for the sake of completion, I add Stefan CC to this bug, since he is the sbp2 expert. Stefan, what do you think? Any idea on what's going on? Thanks, pg
Did udev or a helper library start to issue SCSI commands now? For example, some FireWire firmwares of Prolific PL3507 based FireWire/USB combo devices are known to return corrupt data after they got and responded to an INQUIRY command --- unless the INQUIRY was followed up by READ CAPACITY. So, if userland sends an INQUIRY, then a PL3507 disk will become inaccessible. A variety of USB devices is known to feature similar bugs. The root cause of such bugs is that firmware writers test a little bit with Windows, and that's it. I've got two LSI based CD-R/Ws. (No HDD though, and I can't open the enclosures without destruction to put a HDD in there.) I can update to udev 118 here and see what happens. Piergiorgio, you can help to debug this by switching on SCSI command logging. If the Linux SCSI core is built modular, then you can enable command logging by something like # echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level If it is not modular, then you need to specify it on the kernel command line in the bootloader's prompt. (But don't ask me for details, I don't know more than that.) Then you can watch in dmesg which commands succeed and which fail. Even better would be to find out from which process the commands come, but I don't know whether this could be done without meddling with the kernel sources. References: http://marc.info/?t=119980027300007
> A second firewire sbp2 controller seems to be, however, working... > The previous kernel, 2.6.23.8-63.fc8, shows the same problem. > The old firewire stack, as mentioned above, works fine I suppose you tested the old stack together with the new udev? If so, this adds to the mysteries. Note that the dmesg output does not contain any FireWire/SBP-2 transport error messages. This lets me suspect a problem at the SCSI commands level. Both the sbp2 driver of the ieee1394 stack and firewire-sbp2 don't touch SCSI commands when they are sent to the device. As far as I am aware of, they also handle the returned status the same.
> References: http://marc.info/?t=119980027300007 also http://marc.info/?l=linux-scsi&m=120076480316820
> http://marc.info/?l=linux-scsi&m=120076480316820 Here, merely reading at the (supposed) end of the block device crashed the device. The device stated its size incorrectly. Such devices need a blacklist flag in usb-storage or in (fw-)sbp2 respectively, so that the kernel corrects the size of the block device.
Created attachment 292292 [details] grep "9:0:0:0" /var/log/messages Hi, this is the grep of SCSI device 9:0:0:0: (sdb) in /var/log/messages after setting the logging level of the SCSI module to 9216. Please note that I had to grep, since the SATA SCSI subsystem was flooding the system log, so if something else is needed, please let me know. Hope this helps, pg
(In reply to comment #4) > I suppose you tested the old stack together with the new udev? Just to make it clear, I tested: udev-118 + firewire-sbp2 (new) -> not working udev-118 + sbp2 (old) -> working udev-116 + firewire-sbp2 -> working udev-116 + sbp2 -> not tested (but I would bet it works...) I'm really depressed... BTW, I also tested udev-116 with the DV camera and it is NOT working, so the camera problem belongs to something else, I guess. bye, pg
I have still no idea what the difference of udev-118 + firewire-sbp2 versus udev-118 + sbp2 could be. Regarding your logs: SCSI direct access devices like HDDs may support either the RBC command set specification or the SBC(-2,-3) command set specification. Your disk claims to support SBC ("Direct-Access" in the log in the bug description) while almost all other SBP-2 disks implement RBC ("Direct-Access-RBC" in the log). My suspicion is that the SBC claim of the few SBP-2 devices which do this is just wrong and that these only support RBC... or, to be more precise, something resembling RBC. One major difference between RBC and SBC is that the former contains only READ(10) as read command, while the latter contain READ(6), READ(10) - both mandatory -, READ(12)/(16)/(32) - these optional. These various commands all do the same (read a specified number of blocks at a specified offset) but hae different formats of the command descriptor block. Your disk chokes either on the second READ(10) command or on the first READ(6) command. The commands which you logged try to read the following: READ(10) - offset 00 00 00 48 (= LBA 72) - size 00 38 (= 56 blocks) READ(10) - offset 06 fc 77 f8 (= LBA 117209080) - size 00 08 (= 8 blocks) READ(6) - offset 00 00 (= LBA 0) - size 20 (= 32 blocks) ditto with the following READ(6) commands. Both sbp2 and firewire-sbp2 tell the HDD driver sd-mod to always use READ(10), regardless if the device claims to support RBC or SBC or MMC or whatnot. So, sd-mod did not issue these commands AFAIU. However, userspace does not know of the use_10_for_rw flag which sd-mod got from (firewire-)sbp2 and might still inject commands which the SCSI device doesn't implement, or implements incorrectly. This does not explain though why the bug doesn't happen with sbp2. Well, maybe something in the Fedora userland is fine-tuned for firewire-sbp2 so that some udev actions or actions further up --- e.g. amarok probing newly inserted devices which were brought to its attention by udev --- only happen with firewire-sbp2 driven disks. Or vice versa, there are some older scripts which know how to handle sbp2 disks but mishandle firewire-sbp2 disks. Solution if this hypothesis is correct: Convince whatever userspace program which is at work here to use READ(10), or take his userspace program out. Other hypothesis: Maybe the disk has the off-by-one bug in its READ CAPACITY implementation. Then it is dangerous to read at high offsets because they might already exceed the actual limit of the disk. The disk claims to have 117210240 blocks. (I am actually not sure if sd-mod logs the READ CAPACITY parameter data which would be the highest allowed LBA, or the number of logical blocks, which would be the highest allowed LBA + 1 because the first LBA is 0.) So the last READ(10) command should still be fine, but maybe we are missing something. For the same reasons as stated above, something about how sbp2 and firewire-sbp2 driven devices are represented to userspace might prevent userspace from sending a dangerous read command to sbp2 disks, while it freely sends those commands to sbp2 disks. Solution if this hypothesis is correct: - unplug disk - # echo 0x8 > /sys/module/firewire_sbp2/parameters/workarounds - plug in disk Now, the READ CAPACITY result should be lowered by 1. Anyway, it's more likely that it is about the READ(6) and not about the read attempts at the end of the disk. Because if a too large LBA and/or transfer size was asked for, the device should indicate a Sense Key of "Illegal Request" with an additional sense code of "Logical block address out of range". But the additional sense code in your log is "Invalid field in cdb (Command Descriptor Block)", i.e. the command format was what apparently offended the device. Alas I have only MMC devices (CD-R/W) with bridge chips here which I believe to be LSI SYM*FW* chips, so I can't test either hypothesis myself. However, you can easily check the READ CAPACITY bug hypothesis.
(In reply to comment #9) > I have still no idea what the difference of udev-118 + firewire-sbp2 versus > udev-118 + sbp2 could be. For this I guess we will have to wait until tomorrow morning (US time), that is our afternoon, when Harald Hoyer will, hopefully, read this report. > Solution if this hypothesis is correct: > - unplug disk > - # echo 0x8 > /sys/module/firewire_sbp2/parameters/workarounds > - plug in disk > Now, the READ CAPACITY result should be lowered by 1. Uhm, I could not find /sys/module/firewire_sbp2/parameters/workarounds, the directory /sys/module/firewire_sbp2/parameters/ contains only "exclusive_login", as configuration file. I guess I cannot create "workarounds" myself. BTW, "modinfo firewire-sbp2" returns only "exclusive_login" as parameter. There is an other, painful, possibility... I could try also the other two combinations (udev-116 + fw-sbp2, udev-118 + sbp2) and collect the logs, as done in the udev-118 + fw-sbp2 case, and post them here. Maybe it could be possible to spot some difference and narrow down the possibilities. What do you think? pg
> Uhm, I could not find /sys/module/firewire_sbp2/parameters/workarounds Ah, that was added only recently. Obviously your driver version doesn't have it yet. This one perhaps has it: https://bugzilla.redhat.com/show_bug.cgi?id=271801#c20 > There is an other, painful, possibility... > I could try also the other two combinations (udev-116 + fw-sbp2, udev-118 + > sbp2) and collect the logs, as done in the udev-118 + fw-sbp2 case, and post > them here. Yes, would be good. First do the combination(s) to which you can switch the easiest, then let's go from there.
Created attachment 292296 [details] grep 19:0:0 /var/log/messages OK, this is the log, collected as the other one, using udev-116. If I get the mood, I'll do the same with the old sbp2, if really necessary... pg
These are all READ(10) to offsets somewhere in the middle of the disk: LBA 58'604'809...73'148'793, all well in the range of LBA 0...117'210'240 or 117'210'239. You probably lost a number of log lines, because there still should be read attempts at least to the beginning of the disk, perhaps also to the end. Anyway, I rather believe it's the READ(10) vs. READ(6) issue, not the READ CAPACITY issue. > If I get the mood, I'll do the same with the old sbp2 udev-118 plus sbp2 might be interesting, i.e. are there any READ(6) then? However, more interesting would be to find out - who issues READ(6) in the udev-118/firewire-sbp2 combo, - how to switch that to READ(10), or easier: how to block it entirely, - whether the prevention of READ(6) lets your disk survive.
Short update. Until now I was upgrading/downgrading udev AND libvolume_id together. So, I tried to upgrade only udev to 118, keeping libvolume_id to the old 116 (it seems udev does not depend on libvolume_id). In this setup, udev-118 and libvolume_id-116, the SBP2 device is recognized without errors and it works fine, as usual. So, I guess, there is something wrong with libvolume_id, whatever this lib does. pg
Hi again! I tried the kernel 2.6.14-111.fc8, as per suggestion, with workaround 0x8, but, as expected, it did not improve the situation. My humble guess is that libvolume_id tries to read the volume ID and, the new release (0.82.0, from udev-118) seems to use these indigestibles READ(6) commands, while the old one (0.80.0, from udev-116) does not. I'll try the old firewire stack with the new libvolume_id and attach the logs here. pg
Created attachment 292399 [details] grep "sd 6:0:0" /var/log/messages OK, this is the grep of the logs using the old FW stack and the new libvolume_id. What do you think? pg
Does this line with firewire-sbp2 > scsi 6:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 look exactly the same with sbp2? (Probably, the ever-changing SCSI host number aside.)
> Created an attachment (id=292399) [edit] > grep "sd 6:0:0" /var/log/messages Are there any firewire driver messages in between the SCSI messages? ---------------- I now tested with a disk with Texas Instruments StorageLynx TSB42AA9 chip with an old firmware which also poses as "Direct-Access" instead of "Direct-Access-RBC" (inquiry result log message "Direct-Access "DViCO" MOMOBAY CX-1 REV2 PQ: 0 ANSI: 0"). Kernel: 2.4.26-rc8 plus firewire updates Userland: Gentoo, udev-115-r1 (/lib/libvolume_id.so.0.80.0) and udev-118 (/lib/libvolume_id.so.0.82.0) With both udev versions I always only see READ(10) in the log, never READ(6).
> 2.4.26-rc8 2.6.24-rc8 actually.
(In reply to comment #17) > Does this line with firewire-sbp2 > > > scsi 6:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 > > look exactly the same with sbp2? (Probably, the ever-changing SCSI host number > aside.) This is the output of: grep -C3 SYM /var/log/messages Jan 20 10:48:37 lazy kernel: firewire_sbp2: status write for unknown orb Jan 20 10:48:37 lazy kernel: firewire_sbp2: orb reply timed out, rcode=0x11 Jan 20 10:48:37 lazy kernel: firewire_sbp2: logged in to fw1.0 LUN 0000 (2 retries) Jan 20 10:48:37 lazy kernel: scsi 6:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 Jan 20 10:48:37 lazy kernel: sd 6:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB) Jan 20 10:48:37 lazy kernel: sd 6:0:0:0: [sdb] Write Protect is off Jan 20 10:48:37 lazy kernel: sd 6:0:0:0: [sdb] Cache data unavailable By chance the SCSI number is the same. Other case, with different SCSI number: Jan 20 17:25:33 lazy kernel: firewire_sbp2: status write for unknown orb Jan 20 17:25:34 lazy kernel: firewire_sbp2: orb reply timed out, rcode=0x11 Jan 20 17:25:35 lazy kernel: firewire_sbp2: logged in to fw1.0 LUN 0000 (2 retries) Jan 20 17:25:35 lazy kernel: scsi 10:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 Jan 20 17:25:35 lazy kernel: sd 10:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB) Jan 20 17:25:35 lazy kernel: sd 10:0:0:0: [sdb] Write Protect is off Jan 20 17:25:35 lazy kernel: sd 10:0:0:0: [sdb] Cache data unavailable As you can see, it seems there always are 2 retries. The line you asked for seems to me identical to the old sbp2 case. This looks like the same for old and new libvolume_id. pg
(In reply to comment #18) > > Created an attachment (id=292399) [edit] [edit] > > grep "sd 6:0:0" /var/log/messages > > Are there any firewire driver messages in between the SCSI messages? Uhm, I can only see SCSI messages for the SATA HD, [sda] or sd 0:0:0:0, in between. > Kernel: 2.4.26-rc8 plus firewire updates > Userland: Gentoo, udev-115-r1 (/lib/libvolume_id.so.0.80.0) and udev-118 > (/lib/libvolume_id.so.0.82.0) > > With both udev versions I always only see READ(10) in the log, never READ(6). Uhm, does this mean that there is something new in the 2.6.24-rc8 kernel? Something that actually fixes the issue? One, maybe minor, thing, I have /lib64/libvolume_id.so.0.82.0, i.e. a x86_64 case... Hope this is not a problem... pg
(In reply to comment #17) > Does this line with firewire-sbp2 > > > scsi 6:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 > > look exactly the same with sbp2? (Probably, the ever-changing SCSI host number > aside.) Well, sorry I understood your request the other way round. Anyway, here is an extract from /var/log/messages, with the sbp2 log: Jan 21 20:27:58 lazy kernel: ieee1394: sbp2: Logged into SBP-2 device Jan 21 20:27:58 lazy kernel: ieee1394: sbp2: Node 0-00:1023: Max speed [S400] - Max payload [2048] Jan 21 20:27:58 lazy kernel: scsi 7:0:0:0: Direct-Access LSILogic SYM13FW500-Disk 1.00 PQ: 0 ANSI: 0 Jan 21 20:27:58 lazy kernel: sd 7:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB) So, it's the same, I suppose. pg
> does this mean that there is something new in the 2.6.24-rc8 kernel? > Something that actually fixes the issue? No, I don't think so. > One, maybe minor, thing, I have /lib64/libvolume_id.so.0.82.0, i.e. a > x86_64 case... Hope this is not a problem... Shouldn't be. I have an x86_64 box here too which I could switch to udev-118 as well... but not today anymore. I also did a diff between the sources of udev-{116,118}/extras/{scsi,volume}_id/ now but didn't spot anything obvious. > Anyway, here is an extract from /var/log/messages, with the sbp2 log: ... > So, it's the same, I suppose. Yes.
hmm, any news to this?
(In reply to comment #24) > hmm, any news to this? Uh? We were hoping you could give some info on what was changed, in respect of the SCSI commands sent, between libvolume_id 0.80 and 0.82. ATM I've a version-lock in yum, in order to prevent the update of the library, but if you have anything to test, I'll be glad to try. pg
See also comments from Douglas Gilbert, Feb 19, and Hannes Reinecke, Mar 4, http://thread.gmane.org/gmane.linux.scsi/37948/focus=12491
(In reply to comment #26) > See also comments from Douglas Gilbert, Feb 19, and Hannes Reinecke, Mar 4, > http://thread.gmane.org/gmane.linux.scsi/37948/focus=12491 About Gilbert reply: <<< Special case: removable media >>> Since response[1] of a standard INQUIRY has RMB (removable medium bit) set in its top bit and the rest of the byte reserved, 0x80 is a possibility. That could give a false positive when the serial number VPD page is requested. So try looking for VPD page 0x83 (with alloc_len=36) first. I was wondering, if VPD is requested, repsonse[1-127] are supposed to be ASCII codes, so the MSB should be always zero, hence if RMB is set (which is bit 7, i.e. MSB), it cannot be ASCII. Well, but I'm not an expert of the field, just loud thinking... pg
Bug 434830 could be of relevance here, would be curious to see results with kernel-2.6.24.3-23.fc8 (or later). http://koji.fedoraproject.org/packages/kernel/2.6.24.3/23.fc8/
(In reply to comment #28) > Bug 434830 could be of relevance here, would be curious to see results with > kernel-2.6.24.3-23.fc8 (or later). > > http://koji.fedoraproject.org/packages/kernel/2.6.24.3/23.fc8/ Actually, it worked, but not because of the new kernel, I guess, but thanks to the workaround suggested by Stefan for bug #436879, that is: echo 1 > /sys/module/firewire_sbp2/parameters/workarounds So, it seems that, also here the problem was reading too much altogether. Maybe also this could be closed or wait a bit... Thanks! pg
Hi all, I took the freedom to mark this one depending on #436879, since it seems the two issues are connected. pg
I close this too, due to the dependency. pg