436879 – Read SBP2 raw block device results in errors

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	8
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jarod Wilson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	429430

Reported:	2008-03-10 21:33 UTC by Piergiorgio Sartor
Modified:	2008-08-02 23:40 UTC
Fixed In Version:	2.6.24.3-50.fc8
Clone Of:
Environment:
Last Closed:	2008-03-26 17:15:21 UTC
Type:	---
Embargoed:
Dependent Products:

Description Piergiorgio Sartor 2008-03-10 21:33:39 UTC

Description of problem:
For the enjoyment of Jarod and Stefan...
I know, I'm a pain in the neck...

OK, the SBP2 device is always the same, the Datafab MD2-FW2.
Accessing the /dev/sdX device causes read errors and device off-lining.

Version-Release number of selected component (if applicable):
kernel-2.6.24.3-17.fc8

How reproducible:
Always...

Steps to Reproduce:
1. Connect the SBP2 bay and check which device is assigned
2. Let's say the SBP2 device is /dev/sdb, then type:

dd if=/dev/sdb of=/dev/null bs=1024k count=1k

or

hdparm -tT /dev/sdb

Actual results:
Read errors are returned

Expected results:
Data is read without errors and the disk performance is reported.

Additional info:
A device reset (off-on cycle) restores the device to working order.
Note that the device, formatted as NTFS and accessed using ntfs-3g (fuse), works
fine for files...

This could be the same issue as bug #429430

/var/log/messages says:

...
Mar 10 22:24:19 lazy kernel: end_request: I/O error, dev sdb, sector 0
Mar 10 22:24:19 lazy kernel: printk: 62 messages suppressed.
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 0
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 1
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 2
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 3
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 4
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 5
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 6
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 7
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 8
Mar 10 22:24:19 lazy kernel: Buffer I/O error on device sdb, logical block 9
Mar 10 22:24:19 lazy kernel: end_request: I/O error, dev sdb, sector 0
...

Comment 1 Stefan Richter 2008-03-10 22:17:37 UTC

Would
# echo 1 > /sys/module/firewire_sbp2/parameters/workarounds
# <plug device in>
improve it?  This reduces the maximum data size per SCSI request and is said to
be necessary for some bridges from Symbios (bought by LSI).  The Datafab
enclosures contain a Symbios bridge AFAIK.
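
The suggestion above could be wrapped in a small helper along these lines. This is only a sketch: the function name set_sbp2_workarounds and the optional second argument are made up for illustration, while the sysfs path and the value 1 come from the comment itself. It would need to run as root, before the device is plugged in.

```shell
# Hypothetical helper: write a workarounds bitmask to the firewire_sbp2
# module parameter and print the value read back.
# $1 = bitmask (e.g. 1)
# $2 = parameter file (defaults to the sysfs path quoted in comment 1)
set_sbp2_workarounds() {
    mask=$1
    param=${2:-/sys/module/firewire_sbp2/parameters/workarounds}
    if [ ! -w "$param" ]; then
        echo "cannot write $param (module not loaded, or not root?)" >&2
        return 1
    fi
    echo "$mask" > "$param" || return 1
    cat "$param"
}

# Typical use, then plug the device in:
#   set_sbp2_workarounds 1
```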

Comment 2 Piergiorgio Sartor 2008-03-10 22:43:17 UTC

(In reply to comment #1)
> Would
> # echo 1 > /sys/module/firewire_sbp2/parameters/workarounds
> # <plug device in>
> improve it?  This reduces the maximum data size per SCSI request and is said to

Uhm, no, same as before.

> be necessary for some bridges from Symbios (bought by LSI).  The Datafab
> enclosures contain a Symbios bridge AFAIK.

The chipset has LSI on it, so it could be Symbios.

OK, I tried to enable the SCSI debug with:

echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level

Apart from flooding the logfile, in this condition the device seems to work, with
or without the workaround, and it continues to work after disabling the SCSI debug
logging.
BTW, it seems there are only "Read(10)" in the logfile.
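
For reference, the value 9216 can be decoded roughly as follows. The sketch assumes that scsi_logging_level packs one 3-bit level per logging facility, in the order listed in the kernel's drivers/scsi/scsi_logging.h; that layout is an assumption here and is worth checking against the kernel in use. The function name is hypothetical.

```shell
# Sketch: decode a scsi_logging_level value into per-facility levels.
# Assumption: 3 bits per facility, lowest bits first, in the order below
# (as read from drivers/scsi/scsi_logging.h; verify for your kernel).
decode_scsi_logging_level() {
    value=$1
    shift_bits=0
    for facility in error timeout scan mlqueue mlcomplete \
                    llqueue llcomplete hlqueue hlcomplete ioctl; do
        level=$(( (value >> shift_bits) & 7 ))
        if [ "$level" -ne 0 ]; then
            echo "$facility=$level"
        fi
        shift_bits=$(( shift_bits + 3 ))
    done
}

decode_scsi_logging_level 9216
```

Under that assumption 9216 selects level 2 for mid-layer queueing and mid-layer completion, which would be consistent with the log being full of Read(10) commands.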

If I cycle off-on, then the problem (on the next test) reappears.

Does this help?

pg

Comment 3 Jarod Wilson 2008-03-11 02:20:00 UTC

This smells like a duplicate of bug 434830, which should be resolved by patches
added to rawhide and f8 kernel 2.6.24.3-23.fc8, which just finished building in
koji earlier today. Give that a spin, if you would...

http://koji.fedoraproject.org/packages/kernel/2.6.24.3/23.fc8/

Comment 4 Piergiorgio Sartor 2008-03-11 18:47:40 UTC

(In reply to comment #3)
> This smells like a duplicate of bug 434830, which should be resolved by patches
> added to rawhide and f8 kernel 2.6.24.3-23.fc8, which just finished building in
> koji earlier today. Give that a spin, if you would...
> 
> http://koji.fedoraproject.org/packages/kernel/2.6.24.3/23.fc8/

Uhm, I installed the -28, which should have the same patches (I need also WiFi
updates).
A first test with "hdparm" and "dd" had no visible improvements.

One thing I forgot to mention is that also removing and reloading the
firewire-sbp.ko module brings the device back to life.

Other ideas?

pg

Comment 5 Piergiorgio Sartor 2008-03-11 18:54:32 UTC

(In reply to comment #1)
> Would
> # echo 1 > /sys/module/firewire_sbp2/parameters/workarounds
> # <plug device in>
> improve it?  This reduces the maximum data size per SCSI request and is said to
> be necessary for some bridges from Symbios (bought by LSI).  The Datafab
> enclosures contain a Symbios bridge AFAIK.

OK, I repeated the suggested operation, starting with the device off (I was reading
too fast and I missed the <plug device in> step, sorry).

A first test, with the new kernel, seems to be successful, "hdparm" and "dd"
were working fine, without causing errors.
So it seems this is the real issue.

Should we close this one or wait a little bit?

Thanks a lot!

pg

Comment 6 Piergiorgio Sartor 2008-03-11 19:19:21 UTC

(In reply to comment #1)
> Would
> # echo 1 > /sys/module/firewire_sbp2/parameters/workarounds
> # <plug device in>
> improve it?  This reduces the maximum data size per SCSI request and is said to
> be necessary for some bridges from Symbios (bought by LSI).  The Datafab
> enclosures contain a Symbios bridge AFAIK.

I forgot to thank you: so thank you very much!

pg

Comment 7 Stefan Richter 2008-03-11 19:53:17 UTC

Re comment #5:
If you are reasonably sure that the workaround fixes the heavy transfers, then
check dmesg for a message from firewire_sbp2 about the activated workaround +
firmware revision + model ID of the drive.  We can then use these values to
permanently enable the request size limit for this device, so that it will work
out of the box.

(BTW, I have two devices which appear to be LSI based too, but these are both
CD-RWs and they are sealed devices so that I can't swap in HDDs for test
purposes.  I occasionally use them for tests like CDDA extraction or CD burning,
but they have never shown a problem like this.  Apparently those applications
never reach the dangerous request sizes or the devices have other revisions of
the bridge or of the firmware.)

Comment 8 Piergiorgio Sartor 2008-03-11 20:13:50 UTC

(In reply to comment #7)
> If you are reasonably sure that the workaround fixes the heavy transfers, then
> check dmesg for a message from firewire_sbp2 about the activated workaround +
> firmware revision + model ID of the drive.  We can then use these values to
> permanently enable the request size limit for this device, so that it will work
> out of the box.

I guess these are the lines you might need:

firewire_sbp2: Workarounds for fw1.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 17:0:0:0: Direct-Access     LSILogic SYM13FW500-Disk  1.00 PQ: 0 ANSI: 0
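
The two values asked for can be pulled out of such a message with something like the sketch below. The function name sbp2_quirk_ids is made up; the pattern is based only on the message format pasted above, including the wrap after "model_id", and it prints only the most recent match.

```shell
# Sketch: extract firmware_revision and model_id from firewire_sbp2's
# "Workarounds for ..." message. Reads log text on stdin. Newlines are
# joined first because the pasted message is wrapped after "model_id".
sbp2_quirk_ids() {
    tr '\n' ' ' |
    sed -n 's/.*firmware_revision \(0x[0-9a-fA-F]*\), *model_id *\(0x[0-9a-fA-F]*\).*/firmware_revision=\1 model_id=\2/p'
}

# Typical use (with the device plugged in and the workaround active):
#   dmesg | sbp2_quirk_ids
```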

I'm going to use this device, so if something new happens I'll report it.
I guess this bug could be closed in a few days.

> (BTW, I have two devices which appear to be LSI based too, but these are both
> CD-RWs and they are sealed devices so that I can't swap in HDDs for test
> purposes.  I occasionally use them for tests like CDDA extraction or CD burning,
> but they have never shown a problem like this.  Apparently those applications
> never reach the dangerous request sizes or the devices have other revisions of
> the bridge or of the firmware.)

For the casual observer: why the old stack does not show the problem (I mean the
udev one, bug #429430)? Is it automatically limiting the transfer size?

pg

Comment 9 Stefan Richter 2008-03-11 20:51:32 UTC

> firewire_sbp2: Workarounds for fw1.0: 0x1 (firmware_revision 0x002600,
> model_id 0x000000)
> firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
> scsi 17:0:0:0: Direct-Access     LSILogic SYM13FW500-Disk  1.00 PQ: 0 ANSI: 0

Thanks, we will add these markers to fw-sbp2's internal quirks list.

> For the casual observer: why the old stack does not show the problem
> (I mean the udev one, bug #429430)? Is it automatically limiting the
> transfer size?

The old sbp2 driver always had the limit set to the low level of that
workaround.  Therefore we never noticed whether there may be more devices out
there which need the transfer size limitation.

However, I sent a patch in to Linux 2.6.25-rc1 which adjusts sbp2 to use the
same default limit which firewire-sbp2 does (i.e. the Linux SCSI stack's
defaults).  So, if it weren't for your report, all users with HDDs with
SYM13FW500 bridge would encounter that problem from Linux 2.6.25-rc1 onwards
with the old drivers too.

(Note to self:  My two presumably LSI based CD-RWs have firmware_revision
0x000038, model_id 0x000000, and firmware_revision 0x000035, model_id 0x000000
respectively.)

Comment 10 Stefan Richter 2008-03-11 21:36:14 UTC

Patch posted: http://lkml.org/lkml/2008/3/11/372

Comment 11 Stefan Richter 2008-03-14 01:52:57 UTC

Side note:
LSILogic (Symbios) bridges with the 128k request size limit should all have the
firmware revision 0x0a27?? according to this document:
http://ftp2.de.freebsd.org/pub/misc/specs/symbios/symchips/1394/UPDATED_configuration_ROMs/ReadMe.doc
(But some don't, as we learned now.)
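
One way to see whether the 128k limit is in effect for a disk might be to look at the block queue limits in sysfs. The sketch below assumes that the limit is reflected in queue/max_sectors_kb, which is an assumption rather than something stated in this report; the function name and the overridable sysfs directory are hypothetical.

```shell
# Sketch: report whether a block device's per-request size is capped at
# 128 KiB (256 sectors of 512 bytes) or less.
# $1 = device name (e.g. sdb)
# $2 = sysfs block directory (defaults to /sys/block)
request_limit_ok() {
    dev=$1
    sysblock=${2:-/sys/block}
    limit_kb=$(cat "$sysblock/$dev/queue/max_sectors_kb") || return 2
    echo "$dev: max_sectors_kb=$limit_kb"
    [ "$limit_kb" -le 128 ]
}

# Typical use:
#   request_limit_ok sdb && echo "request size limit active"
```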

Comment 12 Jarod Wilson 2008-03-14 19:21:44 UTC

Patch merged into rawhide, will merge into F8 shortly as well. Should show up in
rawhide kernel-2.6.25-0.119.rc5.git3.fc9 and f8 kernel-2.6.24.3-37.fc8 and
later, respectively.

Comment 13 Piergiorgio Sartor 2008-03-15 09:36:11 UTC

Hi Jarod, I'm just trying the kernel-2.6.26.3-38, and I see that, without
explicitly using the option "workarounds=0x1", the problem seems to be still there.

Any ideas?

Thanks a lot!

pg

Comment 14 Stefan Richter 2008-03-15 10:19:55 UTC

> kernel-2.6.26.3-38

2.6.24.3-38?

Is there a message from firewire_sbp2 that it automatically switched the
workaround on for this device?

Comment 15 Piergiorgio Sartor 2008-03-15 12:27:15 UTC

(In reply to comment #14)
> > kernel-2.6.26.3-38
> 
> 2.6.24.3-38?

Yeah, a typo... It would be amazing (and sad) to have a .26 kernel
with the problem still open... :-)

> Is there a message from firewire_sbp2 that it automatically switched the
> workaround on for this device?

Without explicit workaround I can see this in the logs:

...
firewire_core: phy config: card 0, new root=ffc2, gap_count=7
firewire_core: created device fw1: GUID 0030ffa046010076, S400
scsi11 : SBP-2 IEEE-1394
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 11:0:0:0: Direct-Access     LSILogic SYM13FW500-Disk  1.00 PQ: 0 ANSI: 0
sd 11:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB)
...

With the workaround:

...
firewire_core: phy config: card 0, new root=ffc2, gap_count=7
firewire_core: created device fw1: GUID 0030ffa046010076, S400
scsi12 : SBP-2 IEEE-1394
firewire_sbp2: Please notify linux1394-devel.net if you need
the workarounds parameter for fw1.0
firewire_sbp2: Workarounds for fw1.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 12:0:0:0: Direct-Access     LSILogic SYM13FW500-Disk  1.00 PQ: 0 ANSI: 0
sd 12:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB)
...

Which is the same situation as the previous kernel, I think.

pg

Comment 16 Stefan Richter 2008-03-15 13:29:51 UTC

> firewire_sbp2: Workarounds for fw1.0: 0x1 (firmware_revision 0x002600,
> model_id 0x000000)

This message should automatically appear (without specifying a module parameter)
if the kernel contains the patch from comment #10.

Comment 17 Piergiorgio Sartor 2008-03-15 14:26:49 UTC

(In reply to comment #16)

> This message should automatically appear (without specifying a module parameter)
> if the kernel contains the patch from comment #10.

Well, it does not, so I see the following possibilities:

1) I got the wrong kernel
2) The patch is not there or not applied properly
3) The patch is wrong
4) Somewhere else: for example the firmware revision is not propagated properly
and what we see in the logs is not what the workarounds section receives.

I can double check 1) :-)...
In the rpm changelog I see (note the -37, I've the -38, but this includes the
previous one, I hope):

* Fri Mar 14 2008 Jarod Wilson <jwilson> 2.6.24.3-37
- Resync firewire patches w/linux1394-2.6.git
- Add firewire selfID/AT/AR debug support via optional
  module parameters
- firewire: fix DMA coherence on x86_64 systems w/memory mapped
  over the 4GB boundry (#434830)

One more thing, in order to re-read the option "workarounds", it is not enough
to detach/attach the sbp2 bay, I've also to remove the firewire-sbp module.
If I remember correctly, previously it was enough to reset the 1394 bus (power-cycle
the sbp2 bay or detach/attach it).

Is there any way I can check/confirm 2) and or 3)?
Some debugging option or similar?
I think 4) is beyond my possibilities.

Thanks.

pg

Comment 18 Jarod Wilson 2008-03-15 18:11:20 UTC

D'oh, my mistake. I didn't check closely enough. The patch with the workaround
for your device didn't actually make it into the f8 kernel yet. Its definitely
there for rawhide, and I'll get it into F8 properly soonish.

Comment 19 Stefan Richter 2008-03-15 18:18:46 UTC

> One more thing, in order to re-read the option "workarounds",
> it is not enough to detach/attach the sbp2 bay, I've also to
> remove the firewire-sbp module.

The echo command from comment #1 should be effective during runtime for each
plugged in (or re-plugged) device.

If you alter the module parameter by means of /etc/modprobe.d/some_file or
/etc/modprobe.conf, then the module needs indeed to be reloaded to become aware
of the altered parameter.
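
The persistent variant could look like the sketch below. The function name and the default file name are hypothetical; the "options" line follows the usual modprobe configuration syntax, and the value 0x1 is the one used earlier in this report.

```shell
# Sketch: make the workaround persistent via a modprobe options file.
# $1 = bitmask (e.g. 0x1)
# $2 = config file (the default name below is made up for illustration)
persist_sbp2_workarounds() {
    mask=$1
    conf=${2:-/etc/modprobe.d/firewire-sbp2-workarounds.conf}
    echo "options firewire_sbp2 workarounds=$mask" > "$conf"
}

# After changing the file, the module has to be reloaded to pick it up:
#   modprobe -r firewire_sbp2 && modprobe firewire_sbp2
```

The echo into sysfs from comment #1 remains the quicker route for a one-off test, since it takes effect for the next plugged-in device without a reload.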

Comment 20 Piergiorgio Sartor 2008-03-15 18:25:08 UTC

(In reply to comment #18)
> D'oh, my mistake. I didn't check closely enough. The patch with the workaround
> for your device didn't actually make it into the f8 kernel yet. Its definitely
> there for rawhide, and I'll get it into F8 properly soonish.

Oh, OK, no problem.
As soon as it pops up I'll give it a try and report here, so you can close this
bug (or can I close it?).

pg

Comment 21 Jarod Wilson 2008-03-17 14:22:55 UTC

Got the patch into 2.6.24.3-39.fc8, but I haven't fired up a build (waiting for
other kernel guys to add more stuff before a new build). I think that as the
original bug reporter, you do have permissions to close the bug. I'm somewhat
inclined to just close it anyhow, we're about 99.999% sure this fixes the
problem, given that manually specifying the workaround fixes it.

Comment 22 Piergiorgio Sartor 2008-03-17 14:54:38 UTC

(In reply to comment #21)
> Got the patch into 2.6.24.3-39.fc8, but I haven't fired up a build (waiting for
> other kernel guys to add more stuff before a new build). I think that as the

Thank you very much!
I'll watch the koji page closely :-)

> original bug reporter, you do have permissions to close the bug. I'm somewhat

OK, good.

> inclined to just close it anyhow, we're about 99.999% sure this fixes the
> problem, given that manually specifying the workaround fixes it.

Are you in hurry? :-)
Actually, I like to close bugs, too... ;-)

Anyway, if it is OK with you, I would prefer to give at least one try to the
coming kernel, then I'll close this bug and the other, #429430.

Thanks again for your support!
If you need some help, testing something else in the firewire stack (or others),
just let me know.

pg

Comment 23 Jarod Wilson 2008-03-17 15:31:43 UTC

(In reply to comment #22)
> > inclined to just close it anyhow, we're about 99.999% sure this fixes the
> > problem, given that manually specifying the workaround fixes it.
> 
> Are you in hurry? :-)

Nah, I just like to try to keep the bug list I'm looking at as free of stuff
that I'm pretty sure is already fixed, or I get easily distracted. :)

> Actually, I like to close bugs, too... ;-)
> 
> Anyway, if it is OK with you, I would prefer to give at least one try to the
> coming kernel, then I'll close this bug and the other, #429430.

Works for me. I'll just put the bug into NEEDINFO for now (then it's not closed,
but it's also off my main bug tracking view :).

> Thanks again for your support!
> If you need some help, testing something else in the firewire stack (or others),
> just let me know.

Absolutely. I think we're in pretty damned good shape on the storage side now,
might be getting back over to the dv side of the house soon... :)

Comment 24 Jarod Wilson 2008-03-18 03:50:22 UTC

Okay, I've got 2.6.24.3-40.fc8 building right now, definitely *does* have the patch.

http://koji.fedoraproject.org/koji/taskinfo?taskID=520390

Comment 25 Piergiorgio Sartor 2008-03-18 18:48:30 UTC

OK, I quickly tested the new kernel, 2.6.24.3-40.fc8, and it seems to work fine:
hdparm and dd did not cause any errors, and the workaround activation is reported
by dmesg (of course, without the modprobe.conf entry).

I'll close the bug, BUT one thing I noticed is that:

cat /sys/bus/firewire/drivers/sbp2/module/parameters/workarounds

returns 0, I hope this is correct.

pg

Comment 26 Jarod Wilson 2008-03-18 19:27:18 UTC

The 0 is correct there, because there are no globally enabled workarounds,
they're only being activated for the one device. Glad we finally got it taken
care of! :)
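
The relationship between the global parameter and the per-device quirk can be modelled as below. This is a simplified illustration, not the driver source: the function name is made up, the single table entry is the device from this report, and matching on the firmware revision with its low byte ignored is an assumption taken from the "0x0a27??" notation in comment 11.

```shell
# Simplified model (not the driver code): the effective workarounds for a
# device are the global module parameter OR-ed with any built-in quirk
# entry that matches the device.
# $1 = value of the global "workarounds" parameter
# $2 = firmware_revision reported by the device
effective_workarounds() {
    global=$1
    fw_rev=$2
    quirks=0
    # Assumption: the low byte of the firmware revision is ignored.
    if [ $(( $fw_rev & 0xffff00 )) -eq $(( 0x002600 )) ]; then
        quirks=1   # SYM13FW500 bridge from this report: request size limit
    fi
    printf '0x%x\n' $(( $global | $quirks ))
}

# Global parameter is 0, yet the device from this bug still gets 0x1:
effective_workarounds 0 0x002600
```

In this model the sysfs parameter keeps reading 0 while the device from this bug still ends up with 0x1, which matches the explanation above.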

Comment 27 Piergiorgio Sartor 2008-03-18 19:55:22 UTC

(In reply to comment #26)
> The 0 is correct there, because there are no globally enabled workarounds,
> they're only being activated for the one device. Glad we finally got it taken
> care of! :)

Yep, now you'll have to fix the DV thing! :-)

Anyhow, thank you very much for your support, well done!

pg

Comment 28 Fedora Update System 2008-03-21 03:15:26 UTC

kernel-2.6.24.3-50.fc8 has been submitted as an update for Fedora 8

Comment 29 Fedora Update System 2008-03-26 17:14:56 UTC

kernel-2.6.24.3-50.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.
