Bug 435550

Summary:

Cannot capture DV stream with new firewire stack

Product:

[Fedora] Fedora

Reporter:

Piergiorgio Sartor <piergiorgio.sartor>

Component:

kernel

Assignee:

Jarod Wilson <jarod>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

low

Version:

CC:

maurizio.antillon, stefan-r-rhbz

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2008-11-26 21:02:45 UTC

Attachments:

Description	Flags
Terminal dump of dvgrab 3.1	none
mem=2G	none
mem=2200M	none
mem=2500M	none
no specific memory setup	none
log bus addresses in dualbuffer IR	none
Simple fix	none
logging + allocation test	none
31bit consistent DMA mask	none
firewire: fw-ohci: TSB43AB22/A dualbuffer workaround	none

Description Piergiorgio Sartor 2008-03-01 13:05:05 UTC

Description of problem:
It seems that the HW/SW combination I have at hand does not work for capturing
DV streams.

Version-Release number of selected component (if applicable):
kernel-2.6.23.15-137.fc8

How reproducible:
Not really systematically, but quite often. See below.

Steps to Reproduce:
1. Connect DV camera (Sony DCR-PC110E)
2. Launch: dvgrab -i -t -showstatus -debug all test
3. Play and capture.

Actual results:
It depends, I do not (yet) know on what.
Sometimes, very seldom (actually only once), the stream is captured.
Sometimes, there are a lot of "buffer underrun" errors and the stream is only
partially captured, with many "holes".
Almost always "dvgrab" reports something like "error no DV stream" (or similar).
Sometimes I got a kernel panic and a reboot was necessary (requested by the
kernel itself).

Expected results:
Well, the stream should be captured.

Additional info:
The old stack seems to work fine, with only one issue (see below).
The autosplit function of dvgrab does not work: the filename timestamp is not
updated, so dvgrab always overwrites the same file. This might be a problem in
dvgrab or in some library in between, since it also happens with kino and with
the old stack. Note that the timestamp is printed properly (old stack) when
dvgrab captures the stream.

The motherboard is an ASUS M2NPV-VM, with NVIDIA 6150 + 430 chipset, lspci -vv
returns the following for the firewire part:

01:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
Controller (PHY/Link) (prog-if 10 [OHCI])
        Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at fddff000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at fddf8000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME+
        Kernel driver in use: firewire_ohci
        Kernel modules: firewire-ohci

Comment 1 Jarod Wilson 2008-03-02 06:21:48 UTC

Please try out the same with the latest rawhide kernel build if you would, just
to make absolutely certain this isn't already fixed (I don't think so, but want
to verify).

Comment 2 Piergiorgio Sartor 2008-03-02 11:03:18 UTC

With kernel 2.6.24.3-13.fc8, from koji, trying the usual:

dvgrab -i -t -showstatus -debug all test

I got the following:

rom1394_1 warning: read failed: 0x0000fffff0000414
error reading config rom directory for node 1
Found AV/C device with GUID 0x08004601029441d8
Going interactive. Press '?' for help.
""     0.00 MB 0 frames"          sec                                           
Capture Stopped
Error: no DV

Which is the same result as with previous kernel, so no improvement.
Note that the first and second lines always appear, even when capturing works.

The kernels for F9 do not seem to fit properly in F8. When I tried those, I got
some errors/warnings at boot, apparently unrelated to the FW subsystem, but who
knows...

pg

Comment 3 Jarod Wilson 2008-03-02 17:49:32 UTC

Rawhide kernels are definitely installable on Fedora 8 systems, that's how a lot
of us Fedora kernel folk tend to roll, since other parts of rawhide may well be
broken, and we really only care about the kernel... :)

Although of late, it does seem you may need rawhide lvm2 and mkinitrd (plus
deps) to get booted on a rawhide kernel, but that should be it. Ah well, I'm
pretty sure nothing that's been added to rawhide kernels will make a difference
anyhow.

One other thing to double-check... That's the latest dvgrab for F8, right? I
think earlier versions still had some issues that have since been fixed. Not
quite sure where to poke next, would be easier if I could find a setup on my end
that produces the same results...

Comment 4 Piergiorgio Sartor 2008-03-02 18:33:32 UTC

I guess we have a problem here.
In order to upgrade to 2.6.25-0.80.rc3.git2.fc9 another 16 packages need to be
installed/upgraded.
Among all "libstdc++" and "initscripts"...
This is quite a bit too much, since I still need a stable working system.
One possible solution would be to compile a vanilla kernel, possibly
2.6.25-rc3, without all the other things, if this works.
If you think this helps, I could give it a try.

dvgrab is 3.0-2, should be latest from F8, but not latest in general, since
version 3.1 is out, which should fix something.
Anyway, with the old FW stack it was working, even though the old stack also
requires a different libraw1394; maybe there is something there to check.

Thanks

pg

Comment 5 Stefan Richter 2008-03-02 19:22:03 UTC

Unless Jarod has a better idea for you, you could try vanilla 2.6.24.y or
2.6.25-rcX with the very latest firewire patches from
http://me.in-berlin.de/~s5r6/linux1394/updates/.  Fedora kernels have much newer
firewire drivers than vanilla has.  (Git users can obtain firewire updates from
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6.git .)

Comment 6 Jarod Wilson 2008-03-02 21:56:28 UTC

For a minute, I thought I had a good idea, but now I don't think so... Best idea
I can come up with (as far as minimal time investment goes) is probably to just
take your current kernel, install the matching kernel-devel and build the
firewire drivers out of git, then drop the resulting .ko files in place of the
existing ones. But again, I doubt if there's any code changes that help this
particular problem, I think all the relevant firewire-ohci updates are already
included in the 2.6.24.3 Fedora kernel. Rawhide does have dvgrab 3.1, which
might be worth upgrading to and testing. I don't *think* it'll pull in a bunch
of other changes, but I'm not certain (worst case, the rawhide dvgrab should
rebuild on F8 just fine). I ought to push updated dvgrab packages for F8 too...

Comment 7 Piergiorgio Sartor 2008-03-03 08:29:50 UTC

(In reply to comment #6)
> For a minute, I thought I had a good idea, but now I don't think so... Best idea
> I can come up with (as far as minimal time investment goes) is probably to just

Well, time investment is a minor issue, the main concern I have is to keep the
system changes minimal or easily reversible, since the system should be still
"available", so to speak.

> take your current kernel, install the matching kernel-devel and build the
> firewire drivers out of git, then drop the resulting .ko files in place of the
> existing ones. But again, I doubt if there's any code changes that help this

Actually I'm on my way with 2.6.25-rc3 and Stefan's patches, but I'm anyway
interested in your proposal. Could you please give more details, or point me to
some documentation, on how to proceed with the kernel-devel package?
I quickly tried to copy the firewire*.[ch] file, from 2.6.25-rc3 to the same
place in the kernel-devel tree, but then I've no idea how to build the module.
Using "make drivers/firewire" complains it does not know how to build something.
I'm unsure on the correct procedure.

> particular problem, I think all the relevant firewire-ohci updates are already
> included in the 2.6.24.3 Fedora kernel. Rawhide does have dvgrab 3.1, which
> might be worth upgrading to and testing. I don't *think* it'll pull in a bunch
> of other changes, but I'm not certain (worst case, the rawhide dvgrab should
> rebuild on F8 just fine). I ought to push updated dvgrab packages for F8 too...

I suspect rawhide is "out of range", since dvgrab requires the new libstdc++,
which I would prefer not to upgrade.
I'll try with the src.rpm.

Thanks!

pg

Comment 8 Jarod Wilson 2008-03-04 05:35:35 UTC

Nb: there's an f8 kernel (2.6.24.3-17.fc8) with all the same firewire patches as
rawhide as of today, currently building in koji.

(In reply to comment #7)
> Actually I'm on my way with 2.6.25-rc3 and Stefan's patches, but I'm anyway
> interested in your proposal. Could you please give more details, or point me to
> some documentation, on how to proceed with the kernel-devel package?

Assuming for example you've got kernel-2.6.24.2-12.fc8 (i686) installed, you
want to then install kernel-devel-2.6.24.2-12.fc8 (i686) as well. From in
drivers/firewire, then run:

make -C /usr/src/kernels/2.6.24.2-12.fc8-i686/ M=`pwd` modules
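
As a rough sketch of the same idea, the build can be wrapped so that the kernel-devel tree is found through the /lib/modules/<release>/build symlink instead of a hard-coded /usr/src/kernels path. The helper names kbuild_dir and build_modules are made up for illustration, and the sketch assumes the installed kernel-devel package provides that symlink for the running kernel.

```shell
# Sketch only: build external modules against an installed kernel-devel tree.
# kbuild_dir and build_modules are hypothetical helper names.

kbuild_dir() {
    # $1: kernel release string; defaults to the running kernel
    printf '/lib/modules/%s/build\n' "${1:-$(uname -r)}"
}

build_modules() {
    # $1: directory holding the module sources and their Makefile
    make -C "$(kbuild_dir)" M="$1" modules
}

# Usage (from the firewire source directory):
#   build_modules "$(pwd)"
```

The resulting .ko files only load cleanly if the sources match the kernel they are built against, which is the precondition stated above.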

Comment 9 Piergiorgio Sartor 2008-03-04 08:15:20 UTC

So, some updates.
I was able to compile dvgrab-3.1, from rawhide.
This one does not seem to improve the situation, in one single sequence of
trials, I got always the "buffer underrun" errors and, in the end, a system
freeze (no log available).
One positive thing was that the lines:

rom1394_1 warning: read failed: 0x0000fffff0000414
error reading config rom directory for node 1

did not show up.

I was able to compile a vanilla kernel 2.6.25-rc3 with Stefan's firewire
patches. Unfortunately this did not boot, I guess the new lvm2 thing is needed
or I made some mistakes.

I tried then to build the FW modules (from 2.6.25-rc3 + patches) in the
2.6.24.3-13, following your instruction:

cd drivers/firewire
make -C /usr/src/kernels/2.6.24.3-13.fc8-x86_64 M=`pwd` modules

This one failed, claiming a function, "dma_allignement... something" is missing
(implicit declaration of function).

So, I guess I'll have to get the koji one and try it.

Side note, maybe unrelated to this one. You tell me if another bug report is
needed and to which component.
While testing the DV camera, an SBP2 device was attached to the PC.
Each time the camera was switched on, a bus reset occurred, detaching the SBP2
drive and then re-attaching it.
First of all, I'm not sure how good this is with a mounted device.
Second, if the device is not mounted, the re-attaching event causes
udev->hal->whatever->gnome-mount chain to be triggered, with the final result of
mounting it... Which is unwanted, of course.
Third, after these tricks, I started to get block errors while accessing the
SBP2 disk (maybe fixed in latest FW patches?), which were solved by un-mount,
detach and re-attach of the device.

Thanks.

pg

Comment 10 Stefan Richter 2008-03-04 09:53:09 UTC

> I tried then to build the FW modules (from 2.6.25-rc3 + patches)
> in the 2.6.24.3-13 [...] This one failed, claiming a function,
> "dma_allignement... something" is missing

Yes, alas copying sources from one kernel source tree to another is in general
not possible.  This is why I maintain the firewire patchkits on my website for a
few different kernel releases.  These patches still only work for kernel.org
kernels though, not necessarily for distributor kernels (in particular not for
Fedora, RHEL, CentOS, Oracle... kernels).  So, easiest is to wait for the Fedora
package maintainers to produce packages or source packages for you.

> Side note, maybe unrelated to this one. You tell me if another
> bug report is needed and to which component.
> Each time the camera was switched on, a bus reset occurred,
> detaching the SBP2 drive and then re-attaching it.

This is worth putting into another bug report.  Don't forget to quote the
relevant part of the kernel log.  (Log with time stamps please; dmesg perhaps
doesn't contain them, so you have to take them from /var/log/messages or
/var/log/syslog or wherever Fedora writes out kernel messages.  Hmm, maybe I
should finally install Fedora somewhere to be able to make qualified comments in
this bugtracker...)
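
One possible way to pull the relevant, timestamped lines out of the syslog file is sketched below. The function name firewire_log is invented for this example, the default path /var/log/messages is only a guess at where the system writes kernel messages, and the pattern covers just the module prefixes firewire_core, firewire_ohci and firewire_sbp2.

```shell
# Sketch only: extract firewire-related kernel messages from a syslog file.
# firewire_log is a hypothetical helper name.

firewire_log() {
    # $1: log file to search; /var/log/messages is only a default guess
    grep -E 'firewire_(core|ohci|sbp2)' "${1:-/var/log/messages}"
}

# Usage:
#   firewire_log /var/log/messages > firewire-excerpt.txt
```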

Comment 11 Stefan Richter 2008-03-04 11:39:17 UTC

> 01:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000
> Controller (PHY/Link) (prog-if 10 [OHCI])

Do you happen to know whether it is a TSB43AB22 or TSB43AB22A?  You would
probably have to look inside the PC to know this.  (Texas Instruments sell both
versions but recommend the latter, without going into detail on their website.)

Comment 12 Piergiorgio Sartor 2008-03-04 11:56:28 UTC

(In reply to comment #11)
> Do you happen to know whether it is a TSB43AB22 or TSB43AB22A?  You would
> probably have to look inside the PC to know this.  (Texas Instruments sell both
> versions but recommend the latter, without going into detail on their website.)

According to ASUS (motherboard docs) it is a TSB43AB22A, I'll check directly as
soon as possible.
In any case, keep always in mind that the old stack was working.

pg

Comment 13 Piergiorgio Sartor 2008-03-04 12:01:42 UTC

(In reply to comment #10)
> > Side note, maybe unrelated to this one. You tell me if another
> > bug report is needed and to which component.
> > Each time the camera was switched on, a bus reset occurred,
> > detaching the SBP2 drive and then re-attaching it.
> 
> This is worth putting into another bug report.  Don't forget to quote the
> relevant part of the kernel log.  (Log with time stamps please; dmesg perhaps
> doesn't contain them, so you have to take them from /var/log/messages or
> /var/log/syslog or wherever Fedora writes out kernel messages.  Hmm, maybe I
> should finally install Fedora somewhere to be able to make qualified comments in
> this bugtracker...)

A couple of questions:

1) Should this go to Fedora bugzilla or to kernel bug tracker?
2) Do you have any chance to test this situation? It seems to me a design issue:
a bus reset of FW should trigger udev (or whatever) only for added/removed
devices... Or not?

Thanks.

pg

Comment 14 Jarod Wilson 2008-03-04 14:50:16 UTC

(In reply to comment #9)
> I tried then to build the FW modules (from 2.6.25-rc3 + patches) in the
> 2.6.24.3-13, following your instruction:
> 
> cd drivers/firewire
> make -C /usr/src/kernels/2.6.24.3-13.fc8-x86_64 M=`pwd` modules
> 
> This one failed, claiming a function, "dma_allignement... something" is missing
> (implicit declaration of function).

Oh crud, yeah, forgot about that. Yeah, that works better when Linus' tree and
the linux1394 tree are based on a similar 2.6.x.

> So, I guess I'll have to get the koji one and try it.

It's built now. And I should add that the patchset it carries is actually off of
Stefan's site, which he referenced in comment #10.

http://me.in-berlin.de/~s5r6/linux1394/updates/2.6.24/


(In reply to comment #13)
> A couple of questions:
> 
> 1) Should this go to Fedora bugzilla or to kernel bug tracker?

Sounds like it should be generic enough that it could go in the kernel bugzilla,
but you may also put it in here if you like/prefer.

> 2) Do you have any chance to test this situation? It seems to me a design issue:
> a bus reset of FW should trigger udev (or whatever) only for added/removed
> devices... Or not?

I believe if we just reconnect to the device, no, we shouldn't trigger udev, but
if we have to disconnect (logout and re-do an sbp2 login), there's not yet any
way to distinguish between this being a re-login and a freshly plugged in
device. However, I have a thought on something that may improve this situation
slightly (patch coming soon, Stefan... :)

Comment 15 Stefan Richter 2008-03-04 15:41:40 UTC

Re comment #12:
> According to ASUS (motherboard docs) it is a TSB43AB22A, I'll check
> directly as soon as possible.
> In any case, keep always in mind that the old stack was working.

I just asked because I saw a presumably TSB43AB22A based card in a web shop. :-)

Side note:
The old stack programs iso reception in buffer-fill mode or packet-per-buffer
mode depending on what the application program requested.  I have to look up
which mode dvgrab would use.  (This is with raw1394 which dvgrab uses. 
video1394 always uses buffer-fill.)  The new stack OTOH always uses
packet-per-buffer on OHCI 1.0 chips and dual-buffer on OHCI 1.1 chips such as
yours.  The upshot:  There is now the possibility that we get bitten by
previously unknown chip quirks which were irrelevant for the old drivers.
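
The selection rule described in this comment can be written down as a small lookup. This only illustrates the rule as stated here and is not the driver's actual code; the function name ir_mode is made up, and any version string other than 1.0 or 1.1 is simply reported as unknown.

```shell
# Sketch only: the new stack's iso reception mode per OHCI version,
# as described in the comment above. ir_mode is a hypothetical name.

ir_mode() {
    # $1: OHCI version of the controller, e.g. 1.0 or 1.1
    case "$1" in
        1.0) echo packet-per-buffer ;;
        1.1) echo dual-buffer ;;
        *)   echo unknown ;;
    esac
}

# Usage:
#   ir_mode 1.1
```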

Comment 16 Stefan Richter 2008-03-04 16:57:59 UTC

Re comment #13, comment #14:
Please open another bug to keep this one on-topic, and post the log.

Comment 17 Piergiorgio Sartor 2008-03-04 19:00:30 UTC

OK, some updates.
kernel-2.6.24.3-17.fc8 did not improve the situation, as expected.
I tried dvgrab 3.0 and 3.1, with somehow different results:
While the 3.0 returned:

""     0.00 MB 0 frames"          sec                                           
Capture Stopped
Error: no DV

The 3.1 crashed. I'll provide the terminal dump.

The chip on the MB is an "A" version, also this as expected.

pg

Comment 18 Piergiorgio Sartor 2008-03-04 19:01:21 UTC

Created attachment 296782 [details]
Terminal dump of dvgrab 3.1

That's it, it seems something went wrong somewhere... :-)

pg

Comment 19 Piergiorgio Sartor 2008-03-12 09:57:51 UTC

Uhm, since it seems the other two FW issues are gone, maybe we continue here... :-)

Searching the web returned the full TSB43AB22A data sheet (112 pages) (TI seems
to offer the 2 pages version only).

Could this be of any interest for debugging?

Just for your info, the document I found is named "slls520.pdf".

pg

Comment 20 Jarod Wilson 2008-03-24 17:44:55 UTC

Comment #18 looks a lot like bug 370931, which I can reproduce on one of my own
boxes.

Comment 21 Jarod Wilson 2008-04-11 20:22:54 UTC

Piergiorgio, how much RAM is in your system? I'm wondering if your earlier
failures could possibly be the coherent DMA issues fixed in 2.6.24.3-50.fc8 or
later and in rawhide, and now we're just up against the same thing as bug 370931...

Comment 22 Piergiorgio Sartor 2008-04-11 22:21:27 UTC

(In reply to comment #21)
> Piergiorgio, how much RAM is in your system? I'm wondering if your earlier
> failures could possibly be the coherent DMA issues fixed in 2.6.24.3-50.fc8 or
> later and in rawhide, and now we're just up against the same thing as bug 370931...

The machine has 4GB of RAM... But...
I tried two modes.
First was without memory hole remapping, that is, I've got only 3.25GB, since
.75GB are the 32bit (PCI?) address space.
Second was with memory hole remapping, that is 4GB RAM (minus 64MB of the UMA
video buffer) crossing over the 4GB boundary, i.e. .75GB are mapped from 4GB to
4.75GB (for the same reason as above: 32bit address space).
In this mode, BTW, the kernel complains about IOMMU not being available (?) and
to work properly (3D OpenGL things) it needs "pci=nommconf".
Maybe not helpful, but I do not have more...

Well, anyhow both modes give same results.
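
To make the address arithmetic concrete: with remapping, the relocated 0.75GB sits between 0x100000000 (4GB) and 0x130000000 (4.75GB), which is beyond what a 32-bit address can express. The check below is a hypothetical helper (above_4g is not an existing tool) and assumes a shell with 64-bit arithmetic.

```shell
# Sketch only: does a physical address lie at or above the 4GB boundary?
# above_4g is a hypothetical helper name.

above_4g() {
    # $1: physical address, decimal or 0x-prefixed hex
    if [ $(( $1 >= 0x100000000 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# Usage:
#   above_4g 0x12fffffff    # an address inside the remapped 0.75GB
#   above_4g 0xfddff000     # Region 0 of the controller, from lspci
```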

Final note, the dvgrab crash was only observable with dvgrab 3.1, the 3.0
version never did it.

I've kernel-2.6.24.4-64.fc8 installed, I guess the DMA fix is in.

I noticed that kernels .25, for F9, seem to have some more patches for the
firewire (also DMA), any chance to get those in F8?
Or I'll have to go to F9?

Thanks,

pg

Comment 23 Stefan Richter 2008-04-12 17:33:31 UTC

fw-ohci stumbles over a bug in some TI controllers:  Bug 243081

The bug exists in TSB82AA2 and possibly also in TSB43AB22(A) (not fully proven
yet).  Depending on 1. whether the generation mismatch found in bug 243081 also
occurs on your setup and 2. how your camera would react on transaction timeouts,
this may or may not affect your setup too.  The failure mode here is quite
different from bug 243081 though.

Comment 24 Jarod Wilson 2008-04-13 03:03:41 UTC

I believe that up to now, F8 had all the same possibly relevant patches as F9,
but it definitely doesn't yet have the patch Stefan is referring to in comment
#23, which I only just now added to rawhide. I'd like to beat on it some in
rawhide before throwing it into F8, but soonish here I'll probably resync the F8
firewire bits with rawhide...

Comment 25 Piergiorgio Sartor 2008-04-14 08:18:26 UTC

Looking for errata of the TSB43AB22(A) returned this document, which does not
seem to include the A version (maybe that's why they made it), with an issue
about bus reset:

http://focus.ti.com/lit/er/sllz012/sllz012.pdf

They claim it is a "lab only" problem, etc., etc.; anyway, they provide a
software workaround. Maybe useful.

About a possible new kernel for F8, if you've something, even just the fw
modules, just let me know.

pg

Comment 26 Jarod Wilson 2008-04-15 19:25:18 UTC

Okay, I've added all the latest firewire bits to the F8 kernel tree. They'll be
present in 2.6.24.4-81.fc8 and later, should get a build started in just a sec.

Piergiorgio, if you want to try sooner than when the kernel is built, you can
grab the bits from cvs now and build 'em.

Comment 27 Piergiorgio Sartor 2008-04-16 17:53:35 UTC

Uhm, uhm, uhm...

I tried the new kernel...

The dvgrab problem is still there:

$] dvgrab -i -t -showstatus -debug all test

rom1394_1 warning: read failed: 0x0000fffff0000414
error reading config rom directory for node 1
Found AV/C device with GUID 0x08004601029441d8
Going interactive. Press '?' for help.
""     0.00 MB 0 frames"          sec                                           
Capture Stopped
Error: no DV

In addition, the new bus reset scheme kills the SBP2 device without any way out.
I can see the following as output of "dmesg", when the camera is switched on
(after the SBP2 is initialized and even mounted):

firewire_core: skipped bus generations, destroying all nodes
firewire_sbp2: released fw1.0
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
scsi15 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw1.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_core: created device fw1: GUID 0030ffa046010076, S400
firewire_core: phy config: card 0, new root=ffc3, gap_count=8
firewire_sbp2: fw1.0: error status: 0:4
firewire_core: skipped bus generations, destroying all nodes
firewire_sbp2: released fw1.0
firewire_core: giving up on config rom for node id ffc2
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
firewire_core: created device fw1: GUID 08004601029441d8, S100
scsi16 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw2.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_core: created device fw2: GUID 0030ffa046010076, S400
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: error status: 0:4
firewire_sbp2: fw2.0: failed to login to LUN 0000

Reconnecting the SBP2 (off/on sequence) somehow gets it back:

firewire_core: skipped bus generations, destroying all nodes
firewire_sbp2: released fw2.0
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
firewire_core: created device fw1: GUID 08004601029441d8, S100
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_core: skipped bus generations, destroying all nodes
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
firewire_core: created device fw1: GUID 08004601029441d8, S100
firewire_core: skipped bus generations, destroying all nodes
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
firewire_core: created device fw1: GUID 08004601029441d8, S100
scsi17 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw2.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_core: created device fw2: GUID 0030ffa046010076, S400
firewire_core: phy config: card 0, new root=ffc3, gap_count=8
firewire_core: skipped bus generations, destroying all nodes
firewire_core: created device fw0: GUID 0011d800012a56d3, S400
firewire_core: created device fw1: GUID 08004601029441d8, S100
scsi18 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw2.0: 0x1 (firmware_revision 0x002600, model_id
0x000000)
firewire_core: created device fw2: GUID 0030ffa046010076, S400
firewire_sbp2: fw2.0: orb reply timed out, rcode=0x11
firewire_sbp2: fw2.0: logged in to LUN 0000 (0 retries)
scsi 18:0:0:0: Direct-Access     LSILogic SYM13FW500-Disk  1.00 PQ: 0 ANSI: 0
sd 18:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB)
sd 18:0:0:0: [sdb] Write Protect is off
sd 18:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 18:0:0:0: [sdb] Cache data unavailable
sd 18:0:0:0: [sdb] Assuming drive cache: write through
sd 18:0:0:0: [sdb] 117210240 512-byte hardware sectors (60012 MB)
sd 18:0:0:0: [sdb] Write Protect is off
sd 18:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 18:0:0:0: [sdb] Cache data unavailable
sd 18:0:0:0: [sdb] Assuming drive cache: write through
 sdb: sdb1
sd 18:0:0:0: [sdb] Attached SCSI disk
sd 18:0:0:0: Attached scsi generic sg2 type 0
firewire_sbp2: released fw2.0

Switching off the camera does not seem to have negative effects:

firewire_core: phy config: card 0, new root=ffc2, gap_count=7
firewire_sbp2: fw2.0: orb reply timed out, rcode=0x11
firewire_sbp2: fw2.0: reconnected to LUN 0000 (1 retries)

All in all I would not say this patch improves the situation; if anything, it
makes it worse.
My suggestion would be to reconsider it...

pg

Comment 28 Stefan Richter 2008-04-16 18:22:42 UTC

> In addition, the new bus reset scheme kills the SBP2 device without
> any way out.  I can see the following as output of "dmesg", when the
> camera is switched on (after the SBP2 is initialized and even mounted):
[...]
> firewire_core: skipped bus generations, destroying all nodes
[...]
> firewire_core: created device fw2: GUID 0030ffa046010076, S400
> firewire_sbp2: fw2.0: error status: 0:4
[...]
> Switching off the camera does not seem to have negative effects:

Some explanation:

The cause for the regression is not the workaround for the TI bus reset packet
bug.  SBP-2 status writes are the only AR events in case of SBP-2, and the
status write works just fine.  (We get 0:4 = "access denied" status -- which is
of course not the optimum...)

The cause is probably the patch which introduced "skipped bus generations,
destroying all nodes".
Patch "firewire: insist on successive self ID complete events"
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=c4ea81fcdf2172f65632c3955a674b15bd1bb781
(Commit ID will become invalid soon when I'm going to prepare the next mainline
merge.)

This patch is necessary to prevent firewire-core from crashing the kernel.  Alas
firewire-sbp2 (or alternatively fw-device.c in firewire-core) has not yet been
extended to better handle fw-topology's fundamental inability to match nodes
across more than a single self ID generation increment.

> All in all I would not say this patch improves the situation;
> if anything, it makes it worse.
> My suggestion would be to reconsider it...

As I said, it evidently is not the patch with the TI specific workaround, but
that other patch which you probably did not have when you last tried the Datafab
enclosure together with this camcorder.

But it is good that you reported it.  The solution though cannot be to revoke
that patch; it needs to be to better handle the "destroying all nodes" situation
in one of the layers above the topology code.
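
To get a feel for how often the "destroying all nodes" path is hit on a given setup, one could count the message in the kernel log. This is a minimal sketch: count_node_resets is an invented name, and it matches only the exact message text quoted in this report.

```shell
# Sketch only: count how many times firewire-core discarded all nodes.
# count_node_resets is a hypothetical helper name; it reads a log on stdin.

count_node_resets() {
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    grep -c 'skipped bus generations, destroying all nodes' || true
}

# Usage:
#   dmesg | count_node_resets
```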

Thanks for being our Guinea pig once again...

Comment 29 Stefan Richter 2008-04-16 18:29:00 UTC

> The solution though cannot be to revoke that patch

Well, maybe the Fedora maintainers want to temporarily undo the patch until I
have improved the upper layers.

Vice versa, I am thinking about holding off the mainline submission of the patch
until I have those other bits in place.

Both mean living with the possibility of a crash or other corruption when self
ID complete events are not sequential, while avoiding the more frequent hassle
with the destruction and recreation of device representations (which can cause
data loss, e.g. if you have a filesystem mounted on a FireWire device).

Comment 30 Jarod Wilson 2008-04-16 18:44:52 UTC

Piergiorgio, can you try building replacement modules w/just the patch Stefan
referenced in comment #28 backed out and verify that you don't lose your disk
drive? If so, I'll just back out that patch in the F8 tree for now.

Comment 31 Piergiorgio Sartor 2008-04-16 19:12:49 UTC

(In reply to comment #30)
> Piergiorgio, can you try building replacement modules w/just the patch Stefan
> referenced in comment #28 backed out and verify that you don't lose your disk
> drive? If so, I'll just back out that patch in the F8 tree for now.

OK, no problem.
Where or how do I get the proper source(s)?
I guess I can just build the modules in the current kernel-devel dir tree, given
the sources, or do you recommend a different method?

pg

Comment 32 Stefan Richter 2008-04-16 19:23:18 UTC

I wrote:
> SBP-2 status writes are the only AR events in case of SBP-2

"request AR events", to be entirely precise.  SBP-2 may also involve response AR
events but those are not affected by the workaround for the TI quirk.

Comment 33 Jarod Wilson 2008-04-16 19:46:43 UTC

(In reply to comment #31)
> Where or how do I get the proper source(s)?
> I guess I can just build the modules in the current kernel-devel dir tree, given
> the sources, or do you recommend a different method?

Either grab the src.rpm out of koji and install it, then run 'rpmbuild -bp
kernel.spec' and you'll get a patched kernel tree, or just check stuff out of
cvs. From memory, cvs checkout procedure should be like so:

$ export CVSROOT=:pserver:anonymous.org:/cvs/pkgs
$ cvs co kernel/F-8
$ cd kernel/F-8
$ make prep
$ cd kernel-2.6.24/linux-2.6.24.noarch/drivers/firewire
$ <edit fw-topology.c, backing out that change in comment #28>
$ make -C /usr/src/kernels/2.6.24.4-81.fc8-i686/ M=`pwd` modules
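
If the change to back out is available as a unified diff, hand-editing fw-topology.c could probably be replaced by reverse-applying that diff with patch -R. The sketch below assumes a diff with -p1 style paths; the helper name revert_patch and the patch file name in the usage line are hypothetical.

```shell
# Sketch only: back out a single patch from a prepared source tree.
# revert_patch is a hypothetical helper name.

revert_patch() {
    # $1: root of the source tree, $2: unified diff with -p1 style paths
    patch -R -p1 -d "$1" < "$2"
}

# Usage (patch file name is made up):
#   revert_patch kernel-2.6.24/linux-2.6.24.noarch self-id-events.patch
```

After that, the make -C ... M=`pwd` modules step is unchanged.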

Comment 34 Piergiorgio Sartor 2008-04-16 20:30:43 UTC

(In reply to comment #33)

> Either grab the src.rpm out of koji and install it, then run 'rpmbuild -bp
> kernel.spec' and you'll get a patched kernel tree, or just check stuff out of
> cvs. From memory, cvs checkout procedure should be like so:
> 
> $ export CVSROOT=:pserver:anonymous.org:/cvs/pkgs
> $ cvs co kernel/F-8
> $ cd kernel/F-8
> $ make prep
> $ cd kernel-2.6.24/linux-2.6.24.noarch/drivers/firewire
> $ <edit fw-topology.c, backing out that change in comment #28>
> $ make -C /usr/src/kernels/2.6.24.4-81.fc8-i686/ M=`pwd` modules

This is really cool! :-)
I'll go for it!

OK, I removed the section as per comment #28, compiled and installed the new
modules (all three).

With this setup, switching on the camera does not kill the SBP2.
"dmesg" reports the following:

firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
firewire_core: phy config: card 0, new root=ffc3, gap_count=8
firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
firewire_core: created device fw2: GUID 08004601029441d8, S100

In one trial it required two retries to get it done:

firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
firewire_core: phy config: card 0, new root=ffc3, gap_count=8
firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
firewire_core: created device fw2: GUID 08004601029441d8, S100
firewire_core: phy config: card 0, new root=ffc2, gap_count=7
firewire_sbp2: fw1.0: orb reply timed out, rcode=0x11
firewire_sbp2: fw1.0: reconnected to LUN 0000 (1 retries)

I don't know if it matters, but this was with SBP2 fs mounted.

In any case, with or without patch, the DV capture does not work.

May I ask you both a completely unrelated question?
Where are you located?
I guess Stefan is in Berlin. And you, Jarod?

Thanks!

pg

Comment 35 Jarod Wilson 2008-04-16 20:46:14 UTC

(In reply to comment #34)
> OK, I removed the section as per comment #28, compiled and installed the new
> modules (all three).
> 
> With this setup, switching on the camera does not kill the SBP2.
> "dmesg" reports the following:
> 
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
> firewire_core: phy config: card 0, new root=ffc3, gap_count=8
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
> firewire_core: created device fw2: GUID 08004601029441d8, S100
> 
> In one trial it required two retries to get it done:
> 
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
> firewire_core: phy config: card 0, new root=ffc3, gap_count=8
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (0 retries)
> firewire_core: created device fw2: GUID 08004601029441d8, S100
> firewire_core: phy config: card 0, new root=ffc2, gap_count=7
> firewire_sbp2: fw1.0: orb reply timed out, rcode=0x11
> firewire_sbp2: fw1.0: reconnected to LUN 0000 (1 retries)
> 
> I don't know if it matters, but this was with SBP2 fs mounted.

Okay, this much looks good, I'll go ahead and back out that chunk for F8.

> In any case, with or without patch, the DV capture does not work.

Darn. Oh, now one thing I wanted to clarify... In comment #27, your command line
shows you using interactive mode (-i switch), but doesn't seem to have any
output suggesting the camera actually started rolling... To be 100% certain, if
you omit that, and simply run 'dvgrab -d 2', (grab for 2 seconds), I presume
nothing gets captured?

> May I ask you both a completely unrelated question?

Sure!

> Where are you located?
> I guess Stefan is in Berlin. And you, Jarod?

I work out of the Red Hat engineering office in Westford, Massachusetts, USA.
(Northeastern coast of the US).

Comment 36 Piergiorgio Sartor 2008-04-16 22:09:17 UTC

(In reply to comment #35)

> Darn. Oh, now one thing I wanted to clarify... In comment #27, your command line
> shows you using interactive mode (-i switch), but doesn't seem to have any
> output suggesting the camera actually started rolling... To be 100% certain, if
> you omit that, and simply run 'dvgrab -d 2', (grab for 2 seconds), I presume
> nothing gets captured?

Well, I used "p/space" to play (it works) and "c" to capture.
Actually the camera starts and it shows the movie (in its own screen).
I used the interactive mode to have a second "control" path, the AVC one.
This was to make sure other things, like cable, are working (I had bad
experience with 1394 cables...).

Interestingly enough, everything else is possible: play, ff, backward, step
motion and so on. The commands (asynchronous mode, I guess) work fine.

Just for further reassurance, I connected the camera to a Miranda Box, having
analog video and audio output. These were then connected to a monitor.
I can confirm that the camera and cable(s) work fine, I got perfect picture and
sound.

With "dvgrab -d 2" I get the same results.

pg

Comment 37 Stefan Richter 2008-04-16 23:47:04 UTC

Now that you got self-compiled drivers, you could try forcing fw-ohci into OHCI
1.0 mode.  In drivers/firewire/fw-ohci.c, change the three occurrences of "if
(... >= OHCI_VERSION_1_1)" to "if (0)".

(I assume unmodified fw-ohci drives TSB43AB22 in OHCI 1.1 mode --- you should
confirm that first before doing the modification, unless you already did so. 
E.g. insert a printk("...") into the >= OHCI_VERSION_1_1 branch of
ohci_allocate_iso_context.)

Comment 38 Piergiorgio Sartor 2008-04-17 08:38:38 UTC

(In reply to comment #37)
> Now that you got self-compiled drivers, you could try forcing fw-ohci into OHCI
> 1.0 mode.  In drivers/firewire/fw-ohci.c, change the three occurrences of "if
> (... >= OHCI_VERSION_1_1)" to "if (0)".
> 
> (I assume unmodified fw-ohci drives TSB43AB22 in OHCI 1.1 mode --- you should
> confirm that first before doing the modification, unless you already did so. 
> E.g. insert a printk("...") into the >= OHCI_VERSION_1_1 branch of
> ohci_allocate_iso_context.)

I've a couple of questions:

1) do you mean the driver with the selfID patch removed? (i.e. my own personal
version...)
2) what's the idea behind forcing OHCI 1.0?

Anyway, I'll try it this evening (CET).

pg

Comment 39 Stefan Richter 2008-04-17 09:43:40 UTC

> 1) do you mean the driver with the selfID patch removed?
>    (i.e. my own personal version...)

The self ID thing only influences operation of your Datafab disk.  It doesn't
matter to DV reception.  So you can keep or remove that patch.

> 2) what's the idea behind forcing OHCI 1.0?

firewire-ohci uses different DMA modes for isochronous reception in OHCI 1.1 vs.
OHCI 1.0 mode.  OHCI 1.1 chips get to do "dual buffer mode", OHCI 1.0 chips do
"packet per buffer mode".  Maybe the latter one gets other results.

The old stack used to do "buffer fill mode" or "packet per buffer mode",
depending on what the userspace program or library requested.  I would have to
investigate which of those would be used by dvgrab.

Comment 40 Piergiorgio Sartor 2008-04-17 11:24:16 UTC

(In reply to comment #39)

> The self ID thing only influences operation of your Datafab disk.  It doesn't
> matter to DV reception.  So you can keep or remove that patch.

Do you think, in general, it is better to test with or without other "things" on
the 1394 bus?

> firewire-ohci uses different DMA modes for isochronous reception in OHCI 1.1 vs.
> OHCI 1.0 mode.  OHCI 1.1 chips get to do "dual buffer mode", OHCI 1.0 chips do
> "packet per buffer mode".  Maybe the latter one gets other results.
> 
> The old stack used to do "buffer fill mode" or "packet per buffer mode",
> depending on what the userspace program or library requested.  I would have to
> investigate which of those would be used by dvgrab.

I was suspecting this...

Maybe, if/when you've time, it could be nice to have the 1.0/1.1 selection as
module parameter, as possible fallback.
Of course, if it is planned to do the same buffer handling mode in both
versions, then there is no point in having a module parameter.

pg

Comment 41 Stefan Richter 2008-04-17 14:33:41 UTC

> Maybe, if/when you've time, it could be nice to have the 1.0/1.1
> selection as module parameter, as possible fallback.
> Of course, if it is planned to do the same buffer handling mode in
> both versions, then there is no point in having a module parameter.

Actually the goal is that isochronous reception (and everything else) Just Works
eventually, without extra configuration by the user.

Comment 42 Piergiorgio Sartor 2008-04-17 18:35:50 UTC

OK, I confirmed that, in normal conditions, the TSB43AB22A is configured as OHCI
1.1.
Second, I forced the OHCI 1.0 mode, as you suggested, and dvgrab worked as per
the old 1394 stack.
I tried back and forth a couple of times, just to make sure it was not a false
positive, with similar results. So I'm quite confident that OHCI 1.0 is working
stable.

Only one note.
During the tests, in OHCI 1.0 mode, I switched off the camera and this caused a
complete sudden freeze of the PC (reset needed, no chances of anything else).

The selfID thing was disabled during all tests.

pg

Comment 43 Stefan Richter 2008-04-17 19:21:29 UTC

Someone else's TSB43AB22 or TSB43AB22A is able to receive:
https://bugzilla.redhat.com/show_bug.cgi?id=243081#c40
https://bugzilla.redhat.com/show_bug.cgi?id=243081#c90
https://bugzilla.redhat.com/show_bug.cgi?id=243081#c97

Sigh.

Comment 44 Stefan Richter 2008-04-17 19:41:34 UTC

> OK, I confirmed that, in normal conditions, the TSB43AB22A is configured
> as OHCI 1.1.
> Second, I forced the OHCI 1.0 mode, as you suggested, and dvgrab worked
> as per the old 1394 stack.

Jarod, maybe we should just remove all of the dual buffer code.  Unless someone
of us finds a TI chip with the same problem and is able to fix up dual buffer...
which right now sounds like a waste of time to me.

------------
> During the tests, in OHCI 1.0 mode, I switched off the camera and this
> caused a complete sudden freeze of the PC (reset needed, no chances of
> anything else).

IOW a panic in the bus reset handler.  This could be related to...

> The selfID thing was disabled during all tests.

...that one.  I will try working on the issue from comments #27 - #29 on the
weekend, so that we can do the strict self ID sequence checking without the
drawback of spurious device de- and reattachments.

Comment 45 Stefan Richter 2008-04-17 19:56:11 UTC

>>> Sometimes, very seldom (actually only once), the stream is captured.
>>> Sometimes, there are a lot of "buffer underrun" errors and the stream
>>> is only partially captured, with many "holes".
>>> Almost always "dvgrab" reports something like "error no DV stream"
>>> (or similar).
...
>> OHCI 1.0 is working stable.
...
> Unless someone of us finds a TI chip with the same problem [...]

Or maybe the mainboard's chipset rather than the controller (or the combination
of the two) is the culprit.  Though from what I understood about how the
packet-per-buffer replacement for dual-buffer works, it should cause similar
memory access patterns.

Comment 46 Piergiorgio Sartor 2008-04-17 20:04:18 UTC

(In reply to comment #43)
> Someone else's TSB43AB22 or TSB43AB22A is able to receive:
> https://bugzilla.redhat.com/show_bug.cgi?id=243081#c40
> https://bugzilla.redhat.com/show_bug.cgi?id=243081#c90
> https://bugzilla.redhat.com/show_bug.cgi?id=243081#c97
> 
> Sigh.


But different camera!
I've an i.Link(tm) one, maybe it does not like firewire(tm)... :-)

Nevertheless, here OHCI 1.0 seems to work (more or less).
Does this give any hints on how to proceed?

For example, what about testing the "buffer fill mode", since this is closer to
the "dual buffer mode" and see what happens.
Any chance to do this? Does it make sense to you?

pg

Comment 47 Jarod Wilson 2008-04-17 20:45:11 UTC

Hrmph. One TSB43AB22A works in dual-buffer mode, one doesn't... Yuk. But yeah,
Stefan's understanding is correct, the memory access usage and patterns between
dual-buffer and packet-per-buffer should be quite similar, actually more so than
dual-buffer vs. buffer-fill, I believe.

Regardless, there's no way to test buffer-fill mode w/the new driver, as nobody
has written buffer-fill code for this stack. I actually started down the
buffer-fill route when first working on OHCI 1.0 support, and quickly found it
would be a nasty mess to implement in a way where the upper layers wouldn't have
to care if the underlying device was OHCI 1.0 or 1.1.

Kristian would likely be very much against dumping dual-buffer mode, iirc, as it
does have some measurable benefits over packet-per-buffer in latency-sensitive
operations (I believe his primary example was high-end a/v stuff). At the
moment, I'd be more inclined to maybe make it a module option to firewire-ohci
to run dual-buffer or packet-per-buffer for 1.1 chips. (Actually, that makes me
wonder... If one were to force an OHCI 1.0 Via controller to try to use
dual-buffer, what would happen... unrelated to this bug, of course...)

So far, I'm not finding a TSB43AB22* in my stash of controllers to try out, the
closest I have is a TSB43AB23, which has worked just fine in dual-buffer mode
for as long as I can remember.

Comment 48 Jarod Wilson 2008-04-17 20:46:54 UTC

Also, who knows if there aren't actually some bugs lingering in
packet-per-buffer support as well... (back to the whole Via OHCI 1.0 thing --
bug 415841). :\

Comment 49 Stefan Richter 2008-04-17 20:50:04 UTC

[written before I read Jarod's last two comments]

> For example, what about testing the "buffer fill mode", since this is
> closer to the "dual buffer mode" and see what happens.

Dual-buffer has the feature to split a portion of each packet off and put it
into a separate buffer.  (A very handy feature for some important protocols. 
Every OHCI chip should have it... but alas that's not the case.)

Buffer-fill could emulate dual-buffer only by some copying by the CPU.  That
might be an issue with systems with low CPU power.  (It shouldn't be an issue on
desktop systems which aren't totally ancient.)

Packet-per-buffer can emulate dual-buffer simply by setting appropriate buffer
boundaries, without the CPU having to copy between buffers.

The old stack uses buffer-fill and packet-per-buffer, depending on whether
raw1394, video1394, or dv1394 is at work, and in case of raw1394, depending on
what the application client requested.  Nevertheless, raw1394 is not universal
enough to replace video1394.  firewire-core/-ohci on the other hand is supposed
to provide a single isochronous API for all purposes, hence started out with
dual-buffer which is the most capable of the modes.  When it became clear that
many card vendors disable OHCI 1.1 compatibility even if the chip supports it,
the packet-per-buffer emulation of dual-buffer was added to firewire-ohci,
having the benefit of providing the same split buffer layout to the application
client as dual-buffer and still being a zero copy implementation.

Still, firewire-ohci's packet-per-buffer isn't that great either, because VIA
VT6306/7 still make trouble with it (while VT6307 works fine with dual-buffer if
the card vendor didn't disable it --- it's a mess).

BTW, raw1394 --- when used with libiec61883 clients such as dvgrab and kino ---
as well as the old dv1394 driver use packet-per-buffer.  But I don't know if
they use it in a way like firewire-ohci.
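
A toy model of the three receive layouts discussed in this comment, assuming each packet is simply a (header, payload) pair of byte strings. The function names and packet sizes are made up for illustration; the point is only that buffer-fill needs a CPU copy to obtain the split layout, while the other two do not.

```python
# Sketch only: each function returns (header_area, payload_area, bytes_copied_by_cpu).

def dual_buffer(packets):
    """The controller itself splits every packet into the two areas."""
    headers = b"".join(h for h, _ in packets)
    payload = b"".join(p for _, p in packets)
    return headers, payload, 0


def packet_per_buffer_emulation(packets):
    """Buffer boundaries are chosen so header and payload land in separate areas."""
    headers, payload = bytearray(), bytearray()
    for h, p in packets:
        headers += h   # first buffer of the packet lies in the header area
        payload += p   # second buffer of the packet lies in the payload area
    return bytes(headers), bytes(payload), 0


def buffer_fill_emulation(packets):
    """The controller writes packets back to back; the CPU splits them afterwards."""
    stream = b"".join(h + p for h, p in packets)
    headers, payload = bytearray(), bytearray()
    pos = 0
    for h, p in packets:
        headers += stream[pos:pos + len(h)]
        pos += len(h)
        payload += stream[pos:pos + len(p)]
        pos += len(p)
    return bytes(headers), bytes(payload), pos  # pos equals the bytes the CPU touched
```

All three produce the same split layout for the application; only the last one pays for it with copying.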

Comment 50 Piergiorgio Sartor 2008-04-17 21:00:02 UTC

Maybe one more thing.

As I mentioned at the beginning, sometimes I get "buffer underrun" errors.
This (sometimes?) happens as soon as dvgrab is launched.
Since it is in interactive mode, this means I get these errors _before_ the
capture actually starts.

Now, the first question is "who is returning these errors"?
The second is "why"?

Is it possible there is some issue in the _initialization_ of some hardware,
that can result in different behavior in different environments? BIOS? Not so
safe 32/64bit code?
Maybe some undefined registers can lead to different reactions.

Specifically this could explain:

1) different performances in different MB
2) the random "buffer underrun" errors before capturing (depending on who is
generating those)
3) the fact that sometimes it works, sometimes not

pg

Comment 51 Stefan Richter 2008-04-17 21:01:51 UTC

> At the moment, I'd be more inclined to maybe make it a module option
> to firewire-ohci to run dual-buffer or packet-per-buffer for 1.1 chips.

And what would the default value of the option be?
Packet-per-buffer for all chips?
Or "automatic", i.e. dual-buffer for 1.1 chips unless a blacklisted chip was
detected?  (TSB43AB22/A to be blacklisted, for reasons that are still unclear.)

And should the ioctl ABI provide that switch too?  Probably not.

Comment 52 Stefan Richter 2008-04-17 21:08:30 UTC

> As I mentioned at the beginning, sometimes I get "buffer underrun" errors.
> This (sometimes?) happens as soon as dvgrab is launched.
> Since it is in interactive mode, this means I get these errors _before_ the
> capture actually starts.

I get this initial alleged buffer underrun too.

> Now, the first question is "who is returning these errors"?
> The second is "why"?

Maybe it is caused by junk timecode values.

Comment 53 Piergiorgio Sartor 2008-04-17 21:23:47 UTC

(In reply to comment #49)

> BTW, raw1394 --- when used with libiec61883 clients such as dvgrab and kino ---
> as well as the old dv1394 driver use packet-per-buffer.  But I don't know if
> they use it in a way like firewire-ohci.

Then, if I got it right, there is no way to test "buffer fill mode".

IMHO this would have revealed HW problems, since it behaves like "dual buffer
mode", but with only one buffer. One DMA vs. two, the rest is the same.
This means, "dual buffer" could have two DMA engines or one multiplexed, hence
problems if done not properly (or defective).
If there is a DMA HW problem, we have 50% chance to get it.

In "packet per buffer", likely the HW implementation is different, so there are
less chances to detect HW problems of "dual buffer".

Related to comment #52, what do you mean by "junk timecode values"?
Shouldn't everything be properly initialized?
On the other hand, when I get these errors, the capture is badly working and,
usually, I get later some kernel crash.

Just to close the circle, what if there is a bug in the libraw or libiec?
Could this cause all these issues?

pg

Comment 54 Jarod Wilson 2008-04-17 21:34:28 UTC

(In reply to comment #51)
> > At the moment, I'd be more inclined to maybe make it a module option
> > to firewire-ohci to run dual-buffer or packet-per-buffer for 1.1 chips.
> 
> And what would the default value of the option be?
> Packet-per-buffer for all chips?
> Or "automatic", i.e. dual-buffer for 1.1 chips unless a blacklisted chip was
> detected?  (TSB43AB22/A to be blacklisted, for reasons that are still unclear.)

My thought was that this is the one and only case I've seen/heard where
dual-buffer fell down, and I can reproduce the OHCI 1.0 Via packet-per-buffer
failure on 3 different Via controllers (as well as you being able to), so I'd go
with "automatic", using dual-buffer still on all OHCI 1.1 chips, save those that
are blacklisted. And then yeah, blacklist the TSB43AB22/A, but possibly with the
ability to override the blacklist (similar to how the sbp2 work-arounds are set up).

> And should the ioctl ABI provide that switch too?  Probably not.

I'd say probably not as well. I still think the upper layers shouldn't have to
care. Although perhaps that would have to change anyway if someone writes
buffer-fill support... (I have no plans to do so myself though).


Oh, and I do get the buffer underrun thing from time to time too, right near the
start of capture. Best as I could surmise, this happens when we start handling
descriptors, and we handle them faster than we queue them and reach the end of
the descriptor list (b=0x11, z=0) and the context is temporarily halted (this is
usually 1-2 frames into capture) until we queue up more descriptors and restart
the context. (Nb: this is also exactly where the Via controllers fall down and
stall out, so far as I can tell).

Comment 55 Jarod Wilson 2008-04-17 21:36:10 UTC

(In reply to comment #54)
> ... Best as I could surmise, this happens when we start handling
> descriptors, and we handle them faster than we queue them and reach the end of
> the descriptor list (b=0x11, z=0)

That is '...we handle all queued descriptors before queueing more and reach...'.
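
A small simulation of the underrun theory above, assuming the controller consumes one descriptor per tick and the host refills in batches. All parameter names and numbers are hypothetical; this only illustrates how a receive context can run dry shortly after start, not how firewire-ohci actually schedules refills.

```python
# Sketch only: counts ticks on which the controller finds no queued descriptor.

def count_stalls(initial, refill_batch, refill_every, ticks):
    queued, stalls = initial, 0
    for tick in range(1, ticks + 1):
        if queued == 0:
            stalls += 1            # context halts until more descriptors arrive
        else:
            queued -= 1            # one packet lands in one descriptor
        if tick % refill_every == 0:
            queued += refill_batch  # host queues the next batch
    return stalls
```

Starting with fewer descriptors than one refill interval needs produces stalls only near the start, which would match an underrun seen a frame or two into capture.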

Comment 56 Jarod Wilson 2008-04-17 21:41:05 UTC

(In reply to comment #53)
> Then, if I got it right, there is no way to test "buffer fill mode".

With the current firewire stack, that is correct. Not implemented at all.

> IMHO this would have revealed HW problems, since it behaves like "dual buffer
> mode", but with only one buffer. One DMA vs. two, the rest is the same.
> This means, "dual buffer" could have two DMA engines or one multiplexed, hence
> problems if done not properly (or defective).
> If there is a DMA HW problem, we have 50% chance to get it.

No, packet-per-buffer really does behave more like dual-buffer than buffer-fill
does in this case. In dual-buffer, each descriptor points to a header buffer and
a payload buffer. We emulate that in packet-per-buffer by chaining together two
descriptors, the first points to the header buffer, the second to the payload
buffer.
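
The descriptor chaining described above can be sketched as follows. This is a Python model with a hypothetical Descriptor type; the kind labels follow the OHCI naming of INPUT_MORE and INPUT_LAST, but the addresses are arbitrary and nothing here is taken from the driver source.

```python
# Sketch only: compares the descriptor programs for the two receive modes.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    kind: str                         # "dual", "input_more" or "input_last"
    first_buffer: int                 # bus address of the (first) buffer
    second_buffer: Optional[int] = None


def build_dual_buffer(header_addrs, payload_addrs) -> List[Descriptor]:
    """One descriptor per packet, pointing at a header and a payload buffer."""
    return [Descriptor("dual", h, p) for h, p in zip(header_addrs, payload_addrs)]


def build_packet_per_buffer(header_addrs, payload_addrs) -> List[Descriptor]:
    """Two chained descriptors per packet: header first, then payload."""
    program = []
    for h, p in zip(header_addrs, payload_addrs):
        program.append(Descriptor("input_more", h))
        program.append(Descriptor("input_last", p))
    return program
```

Both programs reference the same buffers in the same order; packet-per-buffer just needs twice as many descriptors.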

> Just to close the circle, what if there is a bug in the libraw or libiec?
> Could this cause all these issues?

I do still think it's possible there's an issue somewhere in userspace that
ultimately leads to the buffer underruns (and via stall-outs), but I don't think
a userspace issue could explain more than that, and the via stall-out I do think
is a controller problem (but it would possibly be circumvented if we didn't have
the buffer underrun).

Comment 57 Piergiorgio Sartor 2008-04-17 21:52:41 UTC

(In reply to comment #55)

> That is '...we handle all queued descriptors before queueing more and reach...'.

OK, but the symptoms here are not mixed up.

When the "underruns" occur, a specific situation is set up:

1) it happens _before_ capturing (the camera is stopped)
2) capture is working, but broken
3) later kernel crash

It never (ever) happened to have "underruns" and no capturing at all and it
never (ever) happened to have "underruns" at start up and then good capturing.

I'm still thinking that some HW is not completely or correctly initialized.
Maybe it is a BIOS fault.
Or a defective chip...

The TSB43AB22A datasheet mentions a "DV/link enhanced mode"; maybe
there is a chance to enable it and see (an explosion).

pg

Comment 58 Piergiorgio Sartor 2008-04-17 21:56:19 UTC

(In reply to comment #56)

> > IMHO this would have revealed HW problems, since it behaves like "dual buffer
> > mode", but with only one buffer. One DMA vs. two, the rest is the same.
> > This means, "dual buffer" could have two DMA engines or one multiplexed, hence
> > problems if done not properly (or defective).
> > If there is a DMA HW problem, we have 50% chance to get it.
> 
> No, packet-per-buffer really does behave more like dual-buffer than buffer-fill
> does in this case. In dual-buffer, each descriptor points to a header buffer and
> a payload buffer. We emulate that in packet-per-buffer by chaining together two
> descriptors, the first points to the header buffer, the second to the payload
> buffer.

I meant from the HW point of view.
To implement in the chip "dual buffer" or "buffer fill" it is almost the same
"logic", only dual vs. single DMA.

The other mode, "packet per buffer", is different, still from the HW point of view.

Hence, if there is a chip errata in "dual buffer", likely it will show up in
"buffer fill", less likely in "packet per buffer".
Anyway, this is just academic, since we cannot test.

pg

Comment 59 Stefan Richter 2008-04-17 22:16:03 UTC

> When the "underruns" occur, a specific situation is set up:
> 
> 1) it happens _before_ capturing (the camera is stopped)
> 2) capture is working, but broken
> 3) later kernel crash
> 
> It never (ever) happened to have "underruns" and no capturing at all and it
> never (ever) happened to have "underruns" at start up and then good capturing.

I get the alleged buffer underrun before capture starts on all chips which work
for me (FW323/1.0, NEC/1.0, TSB82AA2/1.1, VT6307/1.1).  "Work" means the stream
is captured 100% perfect, as far as I and dvgrab can tell.

I don't get this underrun on VT6306/1.0 which is currently unable to capture due
to bug 415841.

Both of these apparently always happen.

Comment 60 Jarod Wilson 2008-04-18 03:00:38 UTC

Hrm. I may be way off base then. My brain has been going all over the place
trying to come up with some sort of explanation for the via failures...

Comment 61 Piergiorgio Sartor 2008-04-18 12:09:53 UTC

The TSB43AB22A datasheet says that "dual buffer mode" does not work in multi
channel mode.
Specifically, it says that multichannel is automatically disabled, when enabling
"dual buffer mode".
Nevertheless, it is also mentioned that, in "single channel" mode, if multiple
channels are enabled (in the channel mask register(s)), the results are undefined.
Those ones (the channel mask register(s)) are undefined at reset.

Does this ring a bell? Or is it standard?

How difficult is it to play with the OHCI registers in the current stack?
Is there any chance I can have a look and experiment something?
Which are the files to investigate? And functions...?

pg

Comment 62 Stefan Richter 2008-04-18 13:37:59 UTC

Re comment 61:
I haven't checked the TSB43AB22A manual in detail, but apart from the vendor
extensions it repeats part of the OHCI 1.1 spec (particularly, the MMIO
registers specifications).  CPU and OHCI controller communicate by
  - memory mapped registers,
  - DMA programs (linked lists of buffer descriptors),
  - data buffers,
  - and of course interrupts.
The formats of the DMA programs and data buffers are only described in the OHCI
spec, not in TI's manual.  There is a link to the spec at
http://wiki.linux1394.org/Links/Specs .

The chip is programmed by drivers/firewire/fw-ohci.[ch].

DV reception is single channel.

Comment 63 Stefan Richter 2008-04-18 19:17:40 UTC

What does "lspci -nnv" say about the controller, BTW?

If we are going the blacklist route, maybe we want to narrow it down to the
subsystem_vendor:subsystem_device ID.  Until the next one with the problem comes
around.  (I'm going to get myself a TSB43AB22A card as well to see how it works.)

Comment 64 Piergiorgio Sartor 2008-04-19 09:51:32 UTC

(In reply to comment #63)
> What does "lspci -nnv" say about the controller, BTW?

01:05.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB22/A
IEEE-1394a-2000 Controller (PHY/Link) [104c:8023] (prog-if 10 [OHCI])
        Subsystem: ASUSTeK Computer Inc. K8N4-E Mainboard [1043:808b]
        Flags: bus master, medium devsel, latency 32, IRQ 19
        Memory at fddff000 (32-bit, non-prefetchable) [size=2K]
        Memory at fddf8000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 2
        Kernel driver in use: firewire_ohci
        Kernel modules: firewire-ohci

The reported mainboard name does not match the actual model, which is an M2NPV-VM.
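
A sketch of how a blacklist keyed on the subsystem IDs could be matched against "lspci -nnv" output like the one above. The table NO_DUAL_BUFFER and the override flag are hypothetical; no such table is confirmed to exist in firewire-ohci, and the single entry is just the ID set reported in this comment.

```python
# Sketch only: extract [vendor:device] pairs from lspci -nnv text and check a quirk table.
import re

# Hypothetical quirk table: (vendor, device, subsystem vendor, subsystem device).
NO_DUAL_BUFFER = {(0x104C, 0x8023, 0x1043, 0x808B)}

# Matches "[104c:8023]" but not "[0c00]", "[OHCI]" or "[size=2K]".
ID_PAIR = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def pci_ids(lspci_text):
    """Return (vendor, device, subsys_vendor, subsys_device); subsystem may be None."""
    pairs = [(int(v, 16), int(d, 16)) for v, d in ID_PAIR.findall(lspci_text)]
    if not pairs:
        raise ValueError("no [vendor:device] pair found")
    subsystem = pairs[1] if len(pairs) > 1 else (None, None)
    return pairs[0] + subsystem


def use_dual_buffer(ids, override=False):
    """Dual-buffer unless the exact ID set is blacklisted and not overridden."""
    return override or ids not in NO_DUAL_BUFFER
```

Matching on all four IDs keeps other TSB43AB22/A boards in dual-buffer mode until one of them shows the same problem.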

> If we are going the blacklist route, maybe we want to narrow it down to the
> subsystem_vendor:subsystem_device ID.  Until the next one with the problem comes

I'm looking forward to this. Hopefully I'll be able to capture something!
I'll then have to file another bug, because dvgrab's autosplit does not work...

> around.  (I'm going to get myself a TSB43AB22A card as well to see how it works.)

Well, I'm really curious to see if I'm the lucky one with broken chip/BIOS/MB...

Thanks!

pg

Comment 65 Piergiorgio Sartor 2008-04-19 09:57:39 UTC

(In reply to comment #62)

> http://wiki.linux1394.org/Links/Specs .

I'll have a look at this, maybe in comparison with the TI datasheet.
It could be that the TI chip has some constraints or so...

> DV reception is single channel.

Sorry, I was mis-quoting the DS, they claim (if I got it right) that multi
channel works only when a single ISO context has it enabled.
I guess this means an "active" ISO context, not all ISO contexts.

pg

Comment 66 Piergiorgio Sartor 2008-04-22 11:51:13 UTC

Maybe it is not a problem, but according to the OHCI 1.1 specs and TI datasheet,
the rcvSelfID bit of the LinkControl register must be set only _after_ a valid
address is loaded in the selfID buffer pointer register.
It seems to me, that in ohci_init(), the bit is first set and only later the
address loaded.
This might not be an issue, but it could be anyhow not within the specs.
Another point is the cycleSource bit of the same register.
According to the OHCI 1.1 specs, it is cleared at (hw) reset, but according to
TI (it seems) it is undefined (actually TI is a bit lacking info here).

I'm planning (when, I don't know) to move the selfID buffer pointer load before
the rcvSelfID is set and to explicitly clear (or set?) the cycleSource bit.

Unless you have other ideas about the topics (which would save me the time).
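
A minimal model of the ordering in question, assuming a fake register file that just records writes. The register names follow the OHCI naming used in this comment, the rcvSelfID bit position (bit 9 of LinkControl) is an assumption to be checked against the spec, and none of this is the actual ohci_init() code.

```python
# Sketch only: record register writes and check that the self ID buffer
# pointer is programmed before rcvSelfID is enabled.
RCV_SELF_ID = 1 << 9   # assumed bit position of LinkControl.rcvSelfID


class FakeRegisters:
    def __init__(self):
        self.writes = []

    def write(self, name, value):
        self.writes.append((name, value))


def init_self_id_reception(regs, self_id_bus_addr):
    """By-the-book order: valid buffer address first, then enable reception."""
    regs.write("SelfIDBuffer", self_id_bus_addr)
    regs.write("LinkControlSet", RCV_SELF_ID)


def self_id_order_ok(writes):
    names = [name for name, _ in writes]
    return names.index("SelfIDBuffer") < names.index("LinkControlSet")
```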

Topic change.
I enabled the irq debug option in firewire-ohci, and I can confirm that
interrupts are raining as soon as dvgrab is started (even if the camera is paused).
Is there any easy way to somehow benchmark these irqs?
I mean, is it possible to confirm the irq flow is consistent with the, supposed,
data flow?

Thanks.

pg

Comment 67 Stefan Richter 2008-04-22 19:58:07 UTC

I remember having had a conversation, or at least thought out loud, about the
selfID buffer pointer issue somewhere sometime ago.  Strange that I haven't
patched it yet.  However, selfID receive DMA is per se unrelated to isochronous
receive DMA.

Cycle master related functions are important.  But they should be OK if debug
logging shows regular cycle64Seconds interrupt events.

Comment 68 Stefan Richter 2008-04-22 20:14:07 UTC

> is it possible to confirm the irq flow is consistent with the,
> supposed, data flow?

Not entirely.  Interrupt events which fw-ohci logs as "IR" are the events per
OHCI 1.1 section 6.4.1 ("...if a packet completes and any of the buffers it
spans have the i bits set to 2'b11...").  That is, you only get these interrupts
as long as the DMA context keeps running && when the descriptors told the
controller to send an interrupt.

There is also an "unrecoverableError" interrupt event which would fire e.g. when
a DMA context goes dead.  But we don't enable this event in the IntMask register.
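
A sketch of why an event that is not enabled in IntMask never shows up in the log: only bits set in both IntEvent and IntMask are delivered. The bit positions below are assumptions recalled from the OHCI 1.1 IntEvent layout and should be verified against the spec; the table is deliberately incomplete.

```python
# Sketch only: decode which interrupt events would reach the driver.
INT_EVENTS = {
    "isochTx": 1 << 6,
    "isochRx": 1 << 7,
    "cycle64Seconds": 1 << 21,
    "unrecoverableError": 1 << 24,
}


def delivered_events(int_event, int_mask):
    """Names of events that are both pending and enabled."""
    active = int_event & int_mask
    return [name for name, bit in INT_EVENTS.items() if active & bit]
```

With unrecoverableError left out of the mask, a dead DMA context would go unreported while IR events already pending still appear.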

Comment 69 Jarod Wilson 2008-04-22 20:15:01 UTC

Stefan, I believe you and I discussed the ordering of setting the buffer pointer
and setting the rcvSelfID bit on irc when I was poking at LPS issues with my
JMicron card, but it was mostly inconsequential, since we don't do anything
selfID related until a bit later on. (At least, I think that was why, my
recollection may be slightly off... So yeah, technically, we should fix that
ordering up, but in practice, it shouldn't matter).

Comment 70 Piergiorgio Sartor 2008-04-22 20:57:26 UTC

(In reply to comment #67)
> I remember having had a conversation, or at least thought out loud, about the
> selfID buffer pointer issue somewhere sometime ago.  Strange that I haven't
> patched it yet.  However, selfID receive DMA is per se unrelated to isochronous
> receive DMA.

OK, I tried both, moving the selfID pointer before setting the rcvSelfID bit
and clearing the cycleSource bit, with no success.

About the buffer story, I strongly recommend doing it by the book, in order to
avoid potential problems somewhere in the future.
You don't need a patch from me, do you? :-)

> Cycle master related functions are important.  But they should be OK if debug
> logging shows regular cycle64Seconds interrupt events.

Regular? I saw only one "cycle64Seconds" in dmesg, when the camera started
(maybe during one of these "buffer underrun" situations).
They are supposed to happen every 64 seconds, I hope...

pg

Comment 71 Piergiorgio Sartor 2008-04-22 21:25:20 UTC

(In reply to comment #68)

> Not entirely.  Interrupt events which fw-ohci logs as "IR" are the events per
> OHCI 1.1 section 6.4.1 ("...if a packet completes and any of the buffers it
> spans have the i bits set to 2'b11...").  That is, you only get these interrupts
> as long as the DMA context keeps running && when the descriptors told the
> controller to send an interrupt.

The question is: how can dvgrab report "no DV" if the IR irq are coming?

Does this mean some data is in the buffers, but not of "DV" type?
Or that some event is not generated in time, thus leading dvgrab to believe
there is no data at all?

Thanks,

pg

Comment 72 Stefan Richter 2008-04-22 22:42:36 UTC

> I saw only one "cycle64Seconds" in dmesg, when the camera started

They should occur in 64 seconds intervals.  If not, something is wrong with the
controller's cycle counter, or with the cycle master.

(The cycle master is a node on the bus which sends "cycle start" packets in 125
µs  intervals.  Isochronous talkers listen for these packets and send an
isochronous packet whenever they got the cycle start.)
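
The timing above works out as follows. This sketch assumes the cycle timer register layout of a 7-bit seconds field, a 13-bit cycle count and a 12-bit cycle offset; that layout is recalled from the OHCI 1.1 spec and should be checked there before relying on it.

```python
# Sketch only: arithmetic around the 125 microsecond isochronous cycle.
CYCLES_PER_SECOND = 8000                           # one cycle start every 125 us
CYCLE_PERIOD_US = 1_000_000 // CYCLES_PER_SECOND


def decode_cycle_timer(value):
    """Split a 32-bit cycle timer value into (seconds, cycles, offset)."""
    seconds = (value >> 25) & 0x7F     # wraps every 128 s
    cycles = (value >> 12) & 0x1FFF    # 0..7999 within the current second
    offset = value & 0xFFF             # ticks within the current cycle
    return seconds, cycles, offset


def cycle_starts_per_64_seconds():
    """Cycle start packets expected between two cycle64Seconds events."""
    return 64 * CYCLES_PER_SECOND
```

If cycle64Seconds events are much rarer than one per 64 seconds, the cycle counter is not advancing at the expected 8000 cycles per second.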

Comment 73 Stefan Richter 2008-04-28 13:08:37 UTC

Good news and bad news:  I have got a TSB43AB22(A) CardBus card now (Exsys
EX-6600E).  It works fine with dvgrab, i.e. I can't reproduce the problem.

06:00.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB22/A
IEEE-1394a-2000 Controller (PHY/Link) [104c:8023] (prog-if 10 [OHCI])
        Flags: bus master, medium devsel, latency 64, IRQ 17
        Memory at 80004000 (32-bit, non-prefetchable) [size=2K]
        Memory at 80000000 (32-bit, non-prefetchable) [size=16K]
        Memory at 80004800 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [44] Power Management version 2
        Kernel driver in use: firewire_ohci
        Kernel modules: firewire-ohci, ohci1394

Comment 74 Piergiorgio Sartor 2008-04-28 17:48:04 UTC

(In reply to comment #73)
> Good news and bad news:  I have got a TSB43AB22(A) CardBus card now (Exsys
> EX-6600E).  It works fine with dvgrab, i.e. I can't reproduce the problem.

Sob, sob...
It seems I'm very lucky...

> 06:00.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB22/A
> IEEE-1394a-2000 Controller (PHY/Link) [104c:8023] (prog-if 10 [OHCI])
>         Flags: bus master, medium devsel, latency 64, IRQ 17
>         Memory at 80004000 (32-bit, non-prefetchable) [size=2K]
>         Memory at 80000000 (32-bit, non-prefetchable) [size=16K]
>         Memory at 80004800 (32-bit, non-prefetchable) [size=2K]

Why do you have one more memory range than I have?
The last 2K do not appear in my setup. Is this OK?

>         Capabilities: [44] Power Management version 2
>         Kernel driver in use: firewire_ohci
>         Kernel modules: firewire-ohci, ohci1394

So, I guess you'll then have to get the other TI card I have working, the OHCI 1.0
one... :-)

pg

Comment 75 Stefan Richter 2008-05-01 21:23:55 UTC

>>         Memory at 80004000 (32-bit, non-prefetchable) [size=2K]
>>         Memory at 80000000 (32-bit, non-prefetchable) [size=16K]
>>         Memory at 80004800 (32-bit, non-prefetchable) [size=2K]
> 
> Why do you have one more memory range than I have?
> The last 2K do not appear in my setup. Is this OK?

I have no idea.  OHCI requires just 2K but allows more for vendor-specific
memory-mapped registers.  A XIO2000 + TSB82AA2 PCIe card and a TSB82AA2 CardBus
card of mine feature a 2K and a 16K region.  Other cards have all sorts of other
configurations.  The Linux drivers don't use any vendor-specific registers.  In
theory, anything outside the OHCI range should not matter at all.

Comment 76 Stefan Richter 2008-05-01 22:47:25 UTC

Jarod, could you point Piergiorgio to the latest packages (kernel, libraw1394,
dvgrab) which he should have?

Would be good if he could retest.  The comment about IR interrupts happening all
the time though dvgrab not receiving anything makes me wonder if it actually is
a fw-ohci problem.

However, if the latest and greatest still doesn't work, it would be good if
Piergiorgio could test this (admittedly rather uninspired) patch:
http://marc.info/?l=linux1394-devel&m=120968142705449

Comment 77 Jarod Wilson 2008-05-02 01:48:20 UTC

Latest kernel, dvgrab and libraw1394 should all be available from the Fedora 8
updates repo now, so a simple 'yum upgrade' should do the trick (or 'yum upgrade
kernel dvgrab libraw1394' would do to limit the scope of the upgrade).

Comment 78 Piergiorgio Sartor 2008-05-02 08:02:22 UTC

I have kernel -85, which is the latest with some firewire updates; libraw1394 is
-6, also the latest, but dvgrab is still 3.0.  I see a 3.1 in koji, I'll pick it up
from there (why is this one not in updates? It's from last year...).

I've mixed feelings about this thing, at the moment.
On one side, it could be HW (MB, BIOS, broken chip, etc.) related, thus the only
solution could be the patch Stefan proposed, i.e. force packet-per-buffer, in
this case.
Another aspect could be the user space part, but this will not explain the
"kernel panic" following the "buffer underrun" things, I guess. Assuming this is
not a different problem.

There is, unfortunately, something else. I've a dual core CPU and 4GiB. Due to
the memory size, I've to boot with pci=nommconf, otherwise "unpredictable
results may occur" (especially with openGL).
The IOMMU seems to be "masked" by the BIOS, so the kernel has to work around it
on its own.
I also know that sometimes there are/were issues with dual core machines.
Furthermore, the BIOS enables (but it is disable-able) the virtualization
extensions of the CPU.
So, overall it's a mess...

In this situation, one possibility would be to physically remove 2GiB (is there
any "soft" possibility?) and/or boot with maxcpus=1 and/or without pci=nommconf.
If this could make any sense...

Coming back to (my) comment #71, is there any way to dump the packet headers
(and possibly data) from within the driver, after the reception?
I mean, of course: where is(are) the buffer(s)? Is this the 16MiB allocated
somewhere in fw-ohci.c?

It would be interesting to know if and which data is there.
I was thinking to memset this to some 0xD15EA5ED value (maybe 0x55 will be
enough) and then check if and how is it overwritten.

And how can dvgrab (because it's dvgrab, it seems) report the "underruns"?

Finally, I'll try to add the patch and see how much further I can go.

Thanks,

pg

Comment 79 Stefan Richter 2008-05-02 10:49:30 UTC

> Due to the memory size, I've to boot with pci=nommconf, otherwise
> "unpredictable results may occur" (especially with openGL).
> The IOMMU seems to be "masked" by the BIOS, so the kernel has to
> work around it on its own.

Yes, that's worrying.  Please test with the unpatched driver and mem=3G on the
boot loader's command line, perhaps also with memmap=something.  See
http://lxr.linux.no/linux/Documentation/kernel-parameters.txt
Or if that doesn't work out, a test with RAM physically reduced to anywhere <=
3G would be good.

> I also know that sometimes there are/were issues with dual core
> machines.  Furthermore, the BIOS enables (but it is disable-able) the
> virtualization extensions of the CPU.

I don't expect trouble from either of those.  Although, if Fedora starts a
userspace IRQ balancer, you may want to disable it to work around driver bugs.
However, these bugs should be random in nature == not as systematic as the
DV reception failure here, I presume.

> Coming back to (my) comment #71, is there any way to dump the packet
> headers (and possibly data) from within the driver, after the
> reception?

Only the upper layers can do that, i.e. fw-cdev.c or perhaps something in
fw-iso.c.  You can't do this in fw-ohci.c.  That's because the CPU will only see
the correct data in the buffers after having dma_unmap_*()'d them.  Looking
into the buffers before then is a bug which only works on simple platforms
without IOMMU or "software IOMMU".

Mmm, I wonder if fw-ohci already looks into the buffers.  The firewire drivers
used to be sprinkled with DMA mapping/ DMA syncing bugs.  However, I believe
there are already people using the current drivers successfully on platforms
which require proper mapping/ syncing.

> I mean, of course: where is(are) the buffer(s)? Is this the 16MiB
> allocated somewhere in fw-ohci.c?

No, these are the descriptor buffers.  (Call them metadata buffers if you will,
since they contain the description of the actual data buffers.)  The actual data
buffers are allocated by fw-core on behalf of the userspace client, if I'm not
mistaken.  (And these data buffers are split into header buffer and payload
buffer.)  Browse fw-iso.c.

If I knew how all this works, I would tell you.

> It would be interesting to know if and which data is there.
> I was thinking to memset this to some 0xD15EA5ED value (maybe 0x55
> will be enough) and then check if and how is it overwritten.

Yes, I'm under the impression that you are on to something.

> And how can dvgrab (because it's dvgrab, it seems) report the
> "underruns"?

If I knew how all this works...  (Oh the good old times when I only maintained
sbp2...)

Comment 80 Stefan Richter 2008-05-02 10:53:29 UTC

PS:
> You can't do this in fw-ohci.c.  That's because the CPU will only see
> the correct data in the buffers after having dma_unmap_*()'d them

This refers to the data buffers only.  The descriptor buffers reside in coherent
memory.  In that one, PCI device and CPU always see the same contents.

Comment 81 Jarod Wilson 2008-05-02 14:47:28 UTC

Gah, I thought I'd pushed dvgrab 3.1 ages ago... Okay, just did so now.

As for issues with multi-core systems and 4GB of RAM, fwiw, I have multiple
multi-core boxes with 4GB of RAM or more that work just fine, so there's no
generic issue there, but certainly could be a bios-specific issue.

Comment 82 Piergiorgio Sartor 2008-05-02 18:16:04 UTC

OK, I ran the following tests:

On boot kernel command line:

1) nothing
2) pci=nommconf mem=2G
3) mem=2G

The good news is that in cases 2) and 3) dvgrab captured the DV stream
flawlessly, while in 1) there was the usual "error no DV".
The bad news is that in case 1) there was the usual "error no DV", while in 2)
and 3) dvgrab captured the DV stream flawlessly.
;-)

The BIOS settings were untouched, in all cases.

One thing is, the device works in dual-buffer mode, at least something was
captured (actually I did not look at the video itself).
So, chip errata and so on are excluded.
The other thing is that only mem=xG seems to affect the functionality, so it is
either a BIOS mis-mis-mis-configuration or a kernel problem.
A thing I forgot (I was in hurry to inform you :-)) is to check what the kernel
said about the IOMMU.

What do you think?

pg

Comment 83 Piergiorgio Sartor 2008-05-02 18:58:57 UTC

I forgot, since the packet-per-buffer seems to work, with 4GiB, there could be
something between this and the other mode, which does not fit with mem > 3GiB.

What's the difference between the packet-per-buffer and dual-buffer modes, in
terms of DMA, memory allocation and so on?

Thanks again,

pg

Comment 84 Piergiorgio Sartor 2008-05-03 07:27:59 UTC

OK, further experiments, booting with pci=nommconf and the following:

1) mem=3G
2) mem=2800M
3) mem=2500M
4) mem=2200M
5) mem=2G (again)

Results were: 1), 2) and 3) do not work, while 4) and 5) do.
I checked, when working, the captured video and it is fine, no problems.

In all these cases, no mention of IOMMU is visible with dmesg.

pg

Comment 85 Stefan Richter 2008-05-03 10:34:28 UTC

> 3) mem=2500M
> 4) mem=2200M
...
> Results were: 1), 2) and 3) do not work, while 4) and 5) do.

Could you take the time to collect the first ~100 lines of dmesg for 2500M and
2200M?

Comment 86 Stefan Richter 2008-05-03 10:35:23 UTC

...fresh after boot that is, before the dmesg ring buffer wraps.

Comment 87 Stefan Richter 2008-05-03 11:53:32 UTC

As I just remembered while writing
http://bugzilla.kernel.org/show_bug.cgi?id=10342#c22 (on SBP-2 I/O errors on
Asus M2R32-MVP), we may still have memory access ordering bugs in fw-ohci.

Comment 88 Piergiorgio Sartor 2008-05-04 10:37:22 UTC

I'm under the impression that the limit is 2G and 2200M works just because the
upper 200MiB might be allocated before the firewire DMA buffers are.
So, I dumped the 2G, 2200M, 2500M and normal (4G) dmesg output.
Since dmesg is quite verbose, I cut them to 350 lines, after which nothing
more related to memory or I/O seems to happen.

pg
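
To make the relation between the tested mem= values and the suspected 2G limit easier to see, here is a small stand-alone C sketch. It is only an illustration: parse_size() and mib_above_2g() are hypothetical helpers written for this example, and the assumption that the K/M/G suffixes are binary units (powers of 1024) is mine.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a size such as "2G", "2200M" or "512K" into bytes.
 * Hypothetical helper for this example; suffixes are treated as
 * binary units (assumption).
 */
static uint64_t parse_size(const char *s)
{
    char *end;
    uint64_t v = strtoull(s, &end, 10);

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        break;
    default:
        break;
    }
    return v;
}

/* How many MiB of the given memory size lie above the 2 GiB boundary. */
static int64_t mib_above_2g(uint64_t bytes)
{
    return ((int64_t)bytes - (int64_t)(1ull << 31)) / (1024 * 1024);
}
```

With these helpers, mem=2200M leaves only about 150 MiB above 2 GiB, while mem=2500M leaves about 450 MiB and mem=3G a full 1024 MiB, which fits the idea that the small region above the boundary may already be used up before the FireWire buffers are allocated.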

Comment 89 Piergiorgio Sartor 2008-05-04 10:37:47 UTC

Created attachment 304485 [details]
mem=2G

Comment 90 Piergiorgio Sartor 2008-05-04 10:38:09 UTC

Created attachment 304486 [details]
mem=2200M

Comment 91 Piergiorgio Sartor 2008-05-04 10:38:28 UTC

Created attachment 304488 [details]
mem=2500M

Comment 92 Piergiorgio Sartor 2008-05-04 10:39:00 UTC

Created attachment 304489 [details]
no specific memory setup

Comment 93 Piergiorgio Sartor 2008-05-04 10:42:46 UTC

Forgot something...

It might still be a HW issue, maybe the chip can't DMA above 2G, in dual-buffer
mode (and maybe even in packet-per-buffer).
Somewhere I read about the PCI DMA mask, maybe this should be forced to 31 bits
(or it is already like this, but it is not used).

pg

Comment 94 Stefan Richter 2008-05-04 11:29:11 UTC

You could insert

        dma_set_mask(&dev->dev, DMA_31BIT_MASK);

in pci_probe() of fw-ohci.c.  I believe it can go between

        pci_set_master(dev);
        pci_write_config_dword(dev, OHCI1394_PCI_HCI_Control, 0);

If that makes it work, it still doesn't tell us whether it is a driver bug or
hardware bug --- and if the latter, OHCI chip bug or board bug; or if the
former, fw-ohci bug or platform code bug.  AFAIU.

The randomness of failures according to comment #0 indeed indicates that some
DMA memory ranges don't work for us, for whatever reason.  And I agree that your
latest findings point at a 2G limit.

Comment 95 Piergiorgio Sartor 2008-05-04 21:43:01 UTC

I added the DMA mask, as you wrote, and, from a first test, dvgrab was fine, the
stream was captured and playable.

I was considering the following.
All DMAs of this machine seem to work fine: SATA, an old BTTV PCI card, ethernet
(if it has a DMA), SBP2, even ISO packet-per-buffer. Only the DMA of dual-buffer
seems to be limited to 2GiB.
It could be TI added the dual-buffer on the side of an existing design, maybe
they did not make a fully new chip. So, they could have some "limitations" on
the new part, independently from the old one.

One possibility, would be to add the DMA mask as specific workaround for this
chip. Any drawbacks? What about SBP2? Maybe better than forcing packet-per-buffer?

Of course, it would be nice if TI confirmed/denied the findings.

Unless you can point out differences between one ISO mode and the other,
or someone can suggest some test revealing problems elsewhere in the system.

Any ideas?

pg

Comment 96 Stefan Richter 2008-05-10 16:07:21 UTC

I am currently testing TSB43AB22/A on an i945GT based board with 3.2 GB RAM. 
But I have yet to hit physical addresses above 2 GB.  (I added a printk to get
notified when that happens.)  I guess I have to allocate some memory before
running dvgrab.

Comment 97 Piergiorgio Sartor 2008-05-10 16:27:35 UTC

(In reply to comment #96)
> I am currently testing TSB43AB22/A on an i945GT based board with 3.2 GB RAM. 
> But I have yet to hit physical addresses above 2 GB.  (I added a printk to get
> notified when that happens.)  I guess I have to allocate some memory before
> running dvgrab.

Cool! Good you could get such a thing.
One question and one note.

Is this a 64bit machine with 64bit kernel?

I've the /tmp with tmpfs, I don't know if it matters, but this results in, more
or less, 2GiB virtually allocated (I guess they're not really allocated until
something is written there, but maybe this has an impact on subsequent allocations).

Hope this helps, and thanks again!

pg

Comment 98 Jarod Wilson 2008-05-14 18:39:01 UTC

I've got a pair of pcmcia cards coming that are both two-port fw400, hoping at
least one of them is a TSB43AB22/A. Will plug them into my core 2 duo laptop
running a 64-bit kernel with 4GB of RAM.

Comment 99 Piergiorgio Sartor 2008-05-14 19:39:38 UTC

(In reply to comment #98)
> I've got a pair of pcmcia cards coming that are both two-port fw400, hoping at
> least one of them is a TSB43AB22/A. Will plug them into my core 2 duo laptop
> running a 64-bit kernel with 4GB of RAM.

Great!
Hopefully someone can find what the matter is!

One question for Stefan, could you please tell where and how to add the printing
for the memory allocation?
I would like to check a couple of things.
Specifically, where, in my case, the memory is, with and without the 31 bit DMA
limitation.
What if in both cases the memory is below 2GiB?

What surprises me is that I get the problem immediately, without any particular
memory usage happening before the capture (OK, xorg, but nothing more...).
I was even thinking that the memory is allocated starting from the higher
addresses to the lower.

pg

Comment 100 Stefan Richter 2008-05-14 22:08:18 UTC

Created attachment 305409 [details]
log bus addresses in dualbuffer IR

>>> One possibility, would be to add the DMA mask as specific workaround
>>> for this chip. Any drawbacks? What about SBP2? Maybe better than
>>> forcing packet-per-buffer?

This will kill performance on machines without IOMMU if more than 2 GB of
memory is present because the CPU will have to copy back and forth to DMA
bounce buffers.  If it works in the first place.

>> Is this a 64bit machine with 64bit kernel?

32 bit kernel.  Shouldn't matter though, because PCI physical addresses are 32
bits wide.  (I have not yet continued to set it up that I actually get buffers
above 2 GB.)

> could you please tell where and how to add the printing
> for the memory allocation?

The virtual addresses which we get at memory allocation probably aren't
interesting (for now).  More so the physical addresses (a.k.a. bus addresses)
which we get by dma-mapping the memory.  Attached is a stupid patch which logs
the bus address of the descriptor and of the buffer page in the dualbuffer IR
path.  (If one of them is > 2G and only every 32nd time, to not flood the log.
Well, as I said, I haven't triggered that yet.  No guarantee that this patch
does what I thought it would do.)
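
The logging condition described above (log only when one of the two bus addresses is above 2G, and then only every 32nd time) could look roughly like the following. This is a user-space sketch of the idea, not the attached patch: the struct, the function name and the choice to treat an address of exactly 2 GiB as "above" are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define LIMIT_2G 0x80000000ull  /* 2 GiB boundary */

/* Hypothetical state for the sketch: counts events that qualify for logging. */
struct log_throttle {
    unsigned int hits;
};

/*
 * Returns true when a message should be logged: at least one of the two
 * bus addresses lies at or above 2 GiB, and this is the 1st, 33rd, 65th, ...
 * such event.  Addresses below the boundary never advance the counter.
 */
static bool should_log(struct log_throttle *t, uint64_t d_bus, uint64_t page_bus)
{
    if (d_bus < LIMIT_2G && page_bus < LIMIT_2G)
        return false;
    return (t->hits++ % 32) == 0;
}
```

A caller would keep one struct log_throttle per context and print the two addresses only when should_log() returns true.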

Comment 101 Piergiorgio Sartor 2008-05-15 07:51:25 UTC

(In reply to comment #100)

> This will kill performance on machines without IOMMU if more than 2 GB of
> memory is present because the CPU will have to copy back and forth to DMA
> bounce buffers.  If it works in the first place.

Assuming the chip cannot DMA above 2GiB in dual-buffer mode, would it be
possible to force _only_ the memory allocation of the iso transfers to 31 bit
addresses?
The question is if there is more overhead with 4GiB w/o IOMMU, in the cases when
this occurs, or in having always packet-per-buffer mode.
Assuming the async transfer part (SBP2) is working properly.
Of course, this would be only for this chip.

> 32 bit kernel.  Shouldn't matter though, because PCI physical addresses are 32
> bits wide.  (I have not yet continued to set it up that I actually get buffers
> above 2 GB.)

Uhm, but there is this story of high/low mem, with 32bit machines with more than
1GiB of memory.
I've an intel based, 32 bit, PC with 2GiB and the "memory split" is 3/1 (GiB),
dmesg reports 1151MB highmem and 896MB lowmem.
AFAIK there is bounce buffering going on with this setup, but I'm not sure about
the details, I was reading the explanation long ago...

> The virtual addresses which we get at memory allocation probably aren't
> interesting (for now).  More so the physical addresses (a.k.a. bus addresses)
> which we get by dma-mapping the memory.  Attached is a stupid patch which logs
> the bus address of the descriptor and of the buffer page in the dualbuffer IR
> path.  (If one of them is > 2G and only every 32nd time, to not flood the log.
> Well, as I said, I haven't triggered that yet.  No guarantee that this patch
> does what I thought it would do.)

Thanks, I'll have a look, I'm curious to see what the allocation patterns look like.

pg

Comment 102 Piergiorgio Sartor 2008-05-15 19:01:34 UTC

Hi again.
I tried the fw_notify() patch printing the DMA addresses.
If I understand it correctly, it prints only if the address is above 2GiB.

Well, the machine was running several different things, while I was patching,
compiling and so on. I could imagine some memory was allocated.
In these conditions, without the DMA_31BIT thing, there was no print and I was
able to capture the DV stream.

After this strange experience, I rebooted, and the situation went back to
"normal" (i.e. no capturing), with something like this in /var/log/messages:

...
May 15 20:47:28 lazy kernel: firewire_ohci: ##### d_bus 3415088752x, page_bus
3414638592x
May 15 20:47:28 lazy kernel: firewire_ohci: ##### d_bus 3415090352x, page_bus
3414654976x
May 15 20:47:28 lazy kernel: firewire_ohci: ##### d_bus 3415091888x, page_bus
3414671360x
May 15 20:47:28 lazy kernel: firewire_ohci: ##### d_bus 3414384880x, page_bus
3414687744x
May 15 20:47:28 lazy kernel: firewire_ohci: ##### d_bus 3414386416x, page_bus
3414704128x
...

If I get it correctly, this somehow confirms that, for some unknown reason, DMA
with high addresses does not work.

Note that this was after a fresh reboot; it also somehow confirms that at least
this memory is allocated starting from the higher addresses.

I think the 32 bit kernel will never go that far, due to the 3/1 split.

Hope this helps.

pg
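
For reference, the logged values can be compared against the 2 GiB boundary with a few lines of C. This is a stand-alone sketch: the helper names are invented, and it assumes the numbers in the log are decimal bus addresses (the trailing "x" looks like a leftover from the format string, but that is a guess).

```c
#include <stdbool.h>
#include <stdint.h>

#define BOUNDARY_2G (1ull << 31)   /* 2147483648 bytes */

/* True if the bus address needs bit 31 or higher, i.e. lies at or above 2 GiB. */
static bool above_2g(uint64_t bus_addr)
{
    return bus_addr >= BOUNDARY_2G;
}

/* Bus address expressed in whole MiB, rounded down. */
static uint64_t whole_mib(uint64_t bus_addr)
{
    return bus_addr >> 20;
}
```

Both the d_bus and page_bus values shown above come out at roughly 3256 MiB, i.e. well above 2 GiB, so neither would be reachable with a 31-bit DMA mask.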

Comment 103 Jarod Wilson 2008-05-20 02:53:21 UTC

So my two pcmcia cards arrived today. They're identical, save the stickers on
'em. That would be fine if they were the right chipset, but they aren't. They're
both NEC cards. D'oh.

Comment 104 Stefan Richter 2008-05-28 07:15:58 UTC

I was just reminded today that fw-ohci is still vulnerable to this:
http://lkml.org/lkml/2008/5/26/297
(reordering of MMIO accesses vs. DMA buffer accesses, for a bunch of reasons)
I don't know though whether this has a hand in this bug here.

Comment 105 Piergiorgio Sartor 2008-05-30 12:50:20 UTC

Hi all, I was a bit busy upgrading some machines to F9.
I also "patched" the fw-ohci.c for the new kernel (on the x86_64 box), with the
DMA limit "feature".
It even works :-), I can capture dv streams with dvgrab and kino.

How should we proceed, then? Any ideas or a plan?

pg

Comment 106 Stefan Richter 2008-05-30 15:35:38 UTC

We still need to pinpoint whether the drivers or the TSB43AB22 or the board is
at fault. A main obstacle on my side currently is lack of time.

Comment 107 Piergiorgio Sartor 2008-05-30 17:20:05 UTC

(In reply to comment #106)
> We still need to pinpoint whether the drivers or the TSB43AB22 or the board is
> at fault. A main obstacle on my side currently is lack of time.

Well, of course, the question was how to pinpoint this.

Considering that a couple solutions are available, I was just wondering if you
(both) had some ideas on how to go further (implement one, the other, all, try
to find the root cause, etc.).

Anyway, I guess we can wait until you'll have more time.

pg

Comment 108 Stefan Richter 2008-05-30 17:38:21 UTC

 - code inspection (I did it to some degree but may have missed something)
 - fix up those other seemingly unrelated bugs or sloppinesses while we are at
it and watch whether it has unexpected positive results
 - test different combos of software -- controller -- board to eliminate parts
of the equation (hard because juju is the only known software which utilizes
dual buffer, and you don't have a stash of controllers, notably not OHCI 1.1
ones, and I don't have this board)
 - attempt to find someone at TI who knows something about dualbuffer and big
physical addresses (questionable approach)

Comment 109 Piergiorgio Sartor 2008-05-30 18:01:06 UTC

(In reply to comment #108)
>  - code inspection (I did it to some degree but may have missed something)

For this I'll not be of big help.

>  - fix up those other seemingly unrelated bugs or sloppinesses while we are at
> it and watch whether it has unexpected positive results

This even less.

>  - test different combos of software -- controller -- board to eliminate parts
> of the equation (hard because juju is the only known software which utilizes
> dual buffer, and you don't have a stash of controllers, notably not OHCI 1.1
> ones, and I don't have this board)

Actually I've some, but all OHCI 1.0... :-(
I can try to get some OHCI 1.1, but I cannot promise anything.

>  - attempt to find someone at TI who knows something about dualbuffer and big
> physical addresses (questionable approach)

Why "questionable approach"?
I was thinking to email their support, maybe something will happen, in the past
they were quite friendly.

pg

Comment 110 Stefan Richter 2008-05-30 19:43:47 UTC

> Why "questionable approach"?

Depends on whether they already were in touch with somebody who extensively used
dual buffer mode.

Comment 111 Piergiorgio Sartor 2008-05-31 22:46:21 UTC

Created attachment 307293 [details]
Simple fix

Hi, I created this simple little patch to temporarily fix the issue.
Some notes:

1) It is entirely derived from Stefan's previous patch and suggestion
2) I inlined the TSB43AB22 PCI ID definition, I know it's ugly, but I was too
lazy to get a patch for the entire source tree... :-)
3) There is a debug print inside, but I'm not sure it is done the correct way
4) I did not investigate how to use a parameter or sysconfig for it, which
would be nice also for further testing (any hints, apart looking at fw-sbp2.c?)

5) it compiles, but I did not (yet) test it

Please have a look.

Jarod, would it make sense to get this (or a similar one) temporarily into the
Fedora kernel patchset? After verification, of course. At least until one of
you has more time to follow the issue again?

Thanks a lot in advance,

pg

Comment 112 Stefan Richter 2008-05-31 23:45:58 UTC

I would rather agree to a tentative workaround which switches IR to
packet-per-buffer, to avoid performance impact on all the other FireWire DMA
functions.  Notably, but not only, SBP-2.  I don't know if this board has an
IOMMU /and/ can transparently make up address mappings below 2G.  If not, then
the CPU will have to do useless copying to and fro bounce buffers, and buffer
allocations are more prone to fail because those bounce buffers are AFAIK a
scarce resource.

(BTW, callers of dma_set_mask should check its return value for possible error
return code, which happens if the architecture does not support the requested
mask.  But the architectures which can run on an Asus board probably all support
this mask.)

Comment 113 Stefan Richter 2008-05-31 23:49:19 UTC

...OTOH it is not my business what goes into Fedora kernels and what not.

Comment 114 Piergiorgio Sartor 2008-06-01 11:15:06 UTC

(In reply to comment #112)
> I would rather agree to a tentative workaround which switches IR to
> packet-per-buffer, to avoid performance impact on all the other FireWire DMA
> functions.  Notably, but not only, SBP-2.  I don't know if this board has an
> IOMMU /and/ can transparently make up address mappings below 2G.  If not, then
> the CPU will have to do useless copying to and fro bounce buffers, and buffer
> allocations are more prone to fail because those bounce buffers are AFAIK a
> scarce resource.

My thinking was the following:

1) the patch should be temporary, this means until we find the root cause and a
reasonable workaround or I change PC :-)
2) the DMA change is minimally invasive, making your life easier...
3) this type of workaround "captures" the findings we had so far
4) the platform *should* have IOMMU, even if the kernel complains about memory
aperture
5) the combination 64bit, 4GiB, no IOMMU seems to me in any case problematic
(32bit have anyway other issues)

That said, I have nothing against the packet-per-buffer vs. dual-buffer approach, I
would only like to have some stable temporary workaround, one or the other is the same.

> (BTW, callers of dma_set_mask should check its return value for possible error
> return code, which happens if the architecture does not support the requested
> mask.  But the architectures which can run on an Asus board probably all support
> this mask.)

Ops... :-)
Anyway, it was working for me... :-)

pg

Comment 115 Jarod Wilson 2008-06-09 20:28:46 UTC

I *finally* found a system here in the office with a TSB43AB22/A controller, and
have borrowed some memory for it to knock its total up to 3GB. Will beat on it
some tomorrow...

Comment 116 Jarod Wilson 2008-06-11 16:56:46 UTC

Bug 449252 is looking suspiciously like a duplicate of this one (erratic dvgrab
failure with a TSB43AB22/A controller and >2GB memory), and I've now reproduced
the problem on a system here on my end.

Comment 117 Piergiorgio Sartor 2008-06-11 17:46:34 UTC

(In reply to comment #116)
> Bug 449252 is looking suspiciously like a duplicate of this one (erratic dvgrab
> failure with a TSB43AB22/A controller and >2GB memory), and I've now reproduced
> the problem on a system here on my end.

Ah! Good, very good!
Luckily you were able to reproduce it. I started to feel like those characters,
in certain movies, witnessing some conspiracy, telling it to everybody, but
having no evidence... :-)

This is really good news!

pg

Comment 118 Stefan Richter 2008-06-11 17:58:01 UTC

pg's board: nVIDIA GeForce 6150 based, Asus
Jarod's: nVIDIA nForce Pro 3600 based, Tyan

Jarod, did you already try other chips on the Tyan board?

Comment 119 Jarod Wilson 2008-06-11 20:00:08 UTC

Yup, and just triple-checked again. Texas Instruments TSB82AA2 IEEE-1394b Link
Layer Controller (rev 01) in the same box captured video perfectly each of a
dozen times attempted just now.

Comment 120 Piergiorgio Sartor 2008-06-12 08:04:41 UTC

Jarod, it would be also interesting to confirm, if you did not already, that
booting with mem=2G fixes the problem with the TSB43AB22/A.

So we will be (if confirmed) in sync also with this finding.

pg

Comment 121 Stefan Richter 2008-06-12 10:41:08 UTC

Also, try IIDC capture too if you can spare a few minutes for that.  We have
seen IIDC and DV capture behave differently in bug 415841.  But in this bug here
I expect IIDC to fail very similarly to DV, i.e. frames will be corrupted or no
frames received, while the DMA program keeps going.

Comment 122 Jarod Wilson 2008-06-13 15:36:34 UTC

(In reply to comment #120)
> Jarod, it would be also interesting to confirm, if you did not already, that
> booting with mem=2G fixes the problem with the TSB43AB22/A.
> 
> So we will be (if confirmed) in sync also with this finding.

Didn't try mem=2G, but did patch the driver to set a 31-bit DMA mask. Works just
fine then. Also works just fine when forced into packet-per-buffer mode.

Comment 123 Jarod Wilson 2008-06-13 15:38:06 UTC

(In reply to comment #121)
> Also, try IIDC capture too if you can spare a few minutes for that.  We have
> seen IIDC and DV capture behave differently in bug 415841.  But in this bug here
> I expect IIDC to fail very similarly to DV, i.e. frames will be corrupted or no
> frames received, while the DMA program keeps going.

Hrm. Just tried IIDC, and somehow or another, IIDC is working fine. (This is
even with dvgrab attempts mixed in around it, all of which stalled).

Comment 124 Jarod Wilson 2008-06-13 15:58:41 UTC

(In reply to comment #123)
> Hrm. Just tried IIDC, and somehow or another, IIDC is working fine. (This is
> even with dvgrab attempts mixed in around it, all of which stalled).

There's no visual corruption, and I could just be imagining things, but
actually, IIDC seems to be a touch choppy. When I set the 31-bit DMA mask, video
appears to be smoother.

Comment 125 Piergiorgio Sartor 2008-06-14 16:57:25 UTC

(In reply to comment #124)

> There's no visual corruption, and I could just be imagining things, but
> actually, IIDC seems to be a touch choppy. When I set the 31-bit DMA mask, video
> appears to be smoother.

Well, this could be something similar to what I get sometimes together with
those "buffer underrun" messages.
Something is captured, but not really in a clean way.
Maybe some buffer is above and some below the 2GiB border...

pg

Comment 126 Stefan Richter 2008-06-15 13:32:44 UTC

Created attachment 309397 [details]
logging + allocation test

This debug patch adds the __GFP_HIGHMEM flag to descriptor allocations and
buffer allocations and logs when descriptor physical addresses or buffer
physical addresses are located above 2G.

The GFP flag causes my 945GM/ICH7 based system with 32bit kernel and 3.2GB
usable RAM to use buffer addresses above 2G.  But I did not get descriptor
addresses above 2G yet.  I captured 20GB from my TSB43AB22/A CardBus card, and
dvgrab was entirely satisfied with what it got so far.

This /may/ mean that the problem specifically depends on descriptors located at
physical addresses above 2G, while the data buffer locations don't matter.

Comment 127 Stefan Richter 2008-06-15 13:37:43 UTC

Created attachment 309398 [details]
31bit consistent DMA mask

This patch only forces consistent allocations to be located below 2G.  This
influences allocations of descriptors and some others, but not data buffer
allocations.  pg or Jarod, please test this at your leisure to narrow the issue
down a little bit further.
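
To picture what this narrower restriction means, here is a small user-space model in C. It is a sketch under assumptions, not the attached patch: the struct and helper names are invented, BIT_MASK_N() is modelled on the kernel's DMA_BIT_MASK() idea, and the model simply treats the consistent (coherent) mask as governing descriptors and the streaming mask as governing data buffers.

```c
#include <stdbool.h>
#include <stdint.h>

/* All-ones mask of n low bits; redefined here for user space (illustrative). */
#define BIT_MASK_N(n) (((n) == 64) ? ~0ull : ((1ull << (n)) - 1))

/* Hypothetical model of the two DMA masks a device can have. */
struct dma_limits {
    uint64_t streaming_mask;   /* applies to mapped data buffers */
    uint64_t coherent_mask;    /* applies to consistent memory, e.g. descriptors */
};

/* True if every set bit of the bus address is covered by the mask. */
static bool addr_ok(uint64_t bus_addr, uint64_t mask)
{
    return (bus_addr & ~mask) == 0;
}
```

With streaming_mask = BIT_MASK_N(32) and coherent_mask = BIT_MASK_N(31), a data buffer at 0xC0000000 is still acceptable while a descriptor at the same address is not, which is the split this test is meant to probe.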

Comment 128 Piergiorgio Sartor 2008-06-15 14:36:03 UTC

Hi, I just patched the module, rebooted (to have all memory free) and tried.

It works!
I could capture the DV streams and play them, without issues.

Good point!

pg

Comment 129 Stefan Richter 2008-07-22 16:44:17 UTC

Created attachment 312363 [details]
firewire: fw-ohci: TSB43AB22/A dualbuffer workaround

Years later... Proposed patch, posted on lkml/linux1394-devel:
http://lkml.org/lkml/2008/7/22/331

Comment 130 Piergiorgio Sartor 2008-08-13 18:09:37 UTC

Hi all, I was just trying out kernel-2.6.26.2-14.fc9.x86_64, which seems to have the latest patch from Stefan.
It seems to be working: I could capture the DV stream without issues.

What happens now? Will this issue go to update --> QA --> CLOSED, or is there still something to do?

Thanks,

pg

Comment 131 Bug Zapper 2008-11-26 09:59:08 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 132 Piergiorgio Sartor 2008-11-26 19:00:43 UTC

I changed version to 9, but I guess this can be officially closed, since the fix is working and included.

Should I close it, or will you do it, Jarod?

Thanks,

pg

Comment 133 Jarod Wilson 2008-11-26 21:02:45 UTC

Either one works, and since I'm commenting, I'll just close it too... :)