1822948 – Hang when opening /dev/sr0 with encrypted movie DVD in drive

Bug 1822948 - Hang when opening /dev/sr0 with encrypted movie DVD in drive

Summary: Hang when opening /dev/sr0 with encrypted movie DVD in drive

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	31
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-04-10 15:36 UTC by Tom Horsley
Modified:	2020-11-24 17:17 UTC (History)
CC List:	20 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-11-24 17:17:53 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
output from requested journalctl command above (92.62 KB, text/plain) 2020-04-10 15:36 UTC, Tom Horsley	no flags	Details
new dmesg output (2.81 KB, text/plain) 2020-04-13 23:18 UTC, Tom Horsley	no flags	Details
rerun of journalctl on the asus mini-pc with usb debugging on (24.22 KB, text/plain) 2020-04-14 16:26 UTC, Tom Horsley	no flags	Details
the usb debug trace output from the same run (14.90 MB, text/plain) 2020-04-14 16:29 UTC, Tom Horsley	no flags	Details
log of a data dvd working in the same drive (7.87 KB, text/plain) 2020-04-15 17:11 UTC, Tom Horsley	no flags	Details
dmesg during run of xorriso hang (137.49 KB, text/plain) 2020-04-15 19:52 UTC, Tom Horsley	no flags	Details
kernel debug during run of xorriso hang (2.93 MB, text/plain) 2020-04-15 19:53 UTC, Tom Horsley	no flags	Details
lsmod listing (5.08 KB, text/plain) 2020-04-16 00:12 UTC, Tom Horsley	no flags	Details
xorriso log with udevd squashed (281 bytes, text/plain) 2020-04-16 13:46 UTC, Tom Horsley	no flags	Details
xorriso log with udevd squashed (12.15 KB, text/plain) 2020-04-16 13:49 UTC, Tom Horsley	no flags	Details
trace of loading disk on direct connect sata drive (3.11 MB, text/plain) 2020-04-16 20:24 UTC, Tom Horsley	no flags	Details
systemd-udevd records with SCSI commands extracted from trace-5.txt. (3.91 KB, text/plain) 2020-04-16 21:15 UTC, Steve	no flags	Details
dmesg from system where trace-5 was done (78.83 KB, text/plain) 2020-04-16 23:00 UTC, Tom Horsley	no flags	Details
Screenshot showing Wireshark on Windows 10 capturing USB traffic including a SCSI Read Capacity command (115.33 KB, image/png) 2020-04-17 21:27 UTC, Steve	no flags	Details
output from xorriso check media command (204.56 KB, text/plain) 2020-04-18 01:04 UTC, Tom Horsley	no flags	Details
SCSI commands for host_no=5 from trace-5.txt (40.89 KB, text/plain) 2020-04-18 01:37 UTC, Steve	no flags	Details
trace-5.txt analyzer script used to generate scsi-commands-for-host_no-5-from-trace-5-txt-1.txt (667 bytes, application/x-perl) 2020-04-18 01:42 UTC, Steve	no flags	Details
Show Obsolete (2) View All

Description Tom Horsley 2020-04-10 15:36:29 UTC

Created attachment 1677807 [details]
output from requested journalctl command above

1. Please describe the problem:

With a commercial Movie DVD (not a Blu-ray) in my drive, any open() call on /dev/sr0 appears to hang forever. This happens with my sata drive in a external USB3 enclosure connected via a USB3 port. I decided this must be a linux kernel problem when I discovered I could play the same DVD on the same drive just find on a Windows 10 system using the Windows version of VLC. I then booted that exact same Windows box with the fedora 31 iso image from a USB stick, and watched it hang with any program that tried to open /dev/sr0.

The drive: LG WH16NS40 Super Multi Blue Internal SATA 16x Blu-ray Disc Rewriter
The enclosure: OWC Mercury Pro 5.25" Optical Drive External Enclosure

Everything else works fine - cdrom, data DVD (not encrypted), Blu-ray, and writing to everything I've tried. It is just when a movie DVD is in the drive that the hang happens.


2. What is the Version-Release number of the kernel:
The live USB running in the Windows box is kernel 5.3.7-301.fc31.x86_64
The identical hang happens on my desktop running 5.5.15-200.fc31.x86_64


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Since this is a brand new drive, I don't know if it worked before, but I can say that a very old slim usb powered pioneer BDR-XD05S works perfectly fine on the same systems where the LG drive hangs.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Put DVD Movie disk in drive
Try to open it with any program
Watch it hang till you unplug or power off the drive, at which point it fails.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Can't really do that from live media

6. Are you running any modules that not shipped with directly Fedora's kernel?:

Nope, booted directly from live workstation iso image

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Tom Horsley 2020-04-10 15:42:00 UTC

More info: I had an ASUS optical drive that showed the same symptoms, and when I put the ASUS drive in a computer with an available drive bay and connected it with sata directly to the motherboard, it can read a movie DVD just fine, so it is apparently something about the USB interface that triggers the hang.

Comment 2 Tom Horsley 2020-04-13 15:13:15 UTC

Just tried the ultimate experiment: took the LG drive mentioned above out of the eternal enclosure and put it in a drive bay in a computer which hung previously when trying the USB connection. With a direct sata connection, the exact same LG drive that hangs when using USB works fine when using sata. Yet it isn't a fundamental USB hardware problem because the USB interface works fine in Windows 10 and fails in the same windows box booted into linux from a live USB. Possibly this does have something to do with the new "phase II" region playback control, but I set the drive to region 1 and I'm trying to open a region 1 dvd in all my experiments. The old pioneer USB drive that works is using the phase 1 RPC. Whatever is going on has apparently been going on a long time. I tried an old version 3 kernel and it had the same problem.

Comment 3 Steve 2020-04-13 21:54:19 UTC

Thanks for your detailed report. The USB enclosure is getting repeatedly reset:

$ grep -n 'usb 4-1' dmesg.txt
972:Apr 10 10:42:40 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 3 using xhci_hcd
973:Apr 10 10:42:40 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
974:Apr 10 10:42:40 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
975:Apr 10 10:42:40 kernel: usb 4-1: Product: Mercury Pro Optical
976:Apr 10 10:42:40 kernel: usb 4-1: Manufacturer: Other World Computing
977:Apr 10 10:42:40 kernel: usb 4-1: SerialNumber: 002933001438
986:Apr 10 10:43:45 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd
990:Apr 10 10:44:09 kernel: usb 4-1: USB disconnect, device number 3
1004:Apr 10 10:46:32 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 4 using xhci_hcd
1005:Apr 10 10:46:32 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
1006:Apr 10 10:46:32 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
1007:Apr 10 10:46:32 kernel: usb 4-1: Product: Mercury Pro Optical
1008:Apr 10 10:46:32 kernel: usb 4-1: Manufacturer: Other World Computing
1009:Apr 10 10:46:32 kernel: usb 4-1: SerialNumber: 002933001438
1017:Apr 10 10:47:21 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
1021:Apr 10 10:47:52 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd
1022:Apr 10 10:50:07 kernel: usb 4-1: USB disconnect, device number 4
1030:Apr 10 11:00:52 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 5 using xhci_hcd
1031:Apr 10 11:00:52 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
1032:Apr 10 11:00:52 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
1033:Apr 10 11:00:52 kernel: usb 4-1: Product: Mercury Pro Optical
1034:Apr 10 11:00:52 kernel: usb 4-1: Manufacturer: Other World Computing
1035:Apr 10 11:00:52 kernel: usb 4-1: SerialNumber: 002933001438
1043:Apr 10 11:01:42 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 5 using xhci_hcd
1047:Apr 10 11:02:12 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 5 using xhci_hcd
1103:Apr 10 11:27:39 kernel: usb 4-1: USB disconnect, device number 5

Comment 4 Steve 2020-04-13 22:42:54 UTC

Could you try the following:

Eject any disc from the USB enclosure.

Disconnect the USB enclosure from your system. (The enclosure appears to use an external power supply, so leave the enclosure powered up.)

If you have auto-mounting enabled, try to disable it before running the test.

In a full-screen terminal window, run:

$ journalctl --no-hostname -k -f | tee /tmp/dmesg-2.txt

That should log messages that you can watch as they are logged.

Connect the USB enclosure to your system.

Wait for log messages to stop scrolling.

Load the test DVD.

After it appears to have settled and log messages have stopped scrolling, eject the DVD, using the eject button, if possible.

Press "ctrl-c" to terminate the logging.

Attach /tmp/dmesg-2.txt.

NB: Waiting for the log messages to stop scrolling may or may not be possible. In a test on my system with a USB flash drive, there was a small flurry of messages when the drive was inserted and another small furry when the drive was removed. In both cases, there were pauses after the flurry.

If you see a continuous stream of messages about resets, please make a note of that.

Comment 5 Tom Horsley 2020-04-13 23:18:27 UTC

Created attachment 1678574 [details]
new dmesg output

I'm pretty sure those resets were all from me turning off the drive to get the hung open to fail, here's the new log with me just loading the DVD, not even trying to read it.

Comment 6 Thomas Schmitt 2020-04-13 23:51:15 UTC

Hi,

in
  https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org/message/KWNGRQW4ETEFAVNRPMZBBCG7UHNZ22FK/
you report libburn's error message
  SG_IO: host_status= 0x7 SG_ERR_DID_ERROR
which is a transport problem mentioned in
  https://www.tldp.org/HOWTO/SCSI-Generic-HOWTO/x291.html

Further the libburn SCSI log snippets shown in that mail tell that some
normally fast running SCSI commands for drive inspection completed
successfully, but in the sum needed 180 seconds to finally run into
SG_ERR_DID_ERROR with command 12h INQUIRY.

So i deem it unlikely that it is about DRM regions.

Have a nice day :)

Thomas

Comment 7 Steve 2020-04-14 00:23:31 UTC

(In reply to Tom Horsley from comment #5)
> Created attachment 1678574 [details]
> new dmesg output
> 
> I'm pretty sure those resets were all from me turning off the drive to get the hung open to fail, here's the new log with me just loading the DVD, not even trying to read it.

Thanks. There are two resets.

Did you mean to change the USB port (4-1 -> 2-1)? Which USB controller the enclosure is connected to could affect the behavior, especially if some are USB 2.0 and some are USB 3.0. I'm not a USB expert, but Comment 3 shows "SuperSpeed Gen 1" and this one shows "high-speed".

BTW, there is the tail end of a Call Trace at the beginning of dmesg-2.txt. Could you post the whole thing?

$ grep -n 'usb 2-1' dmesg-2.txt
11:Apr 13 18:51:26 kernel: usb 2-1.2: USB disconnect, device number 3
12:Apr 13 18:57:35 kernel: usb 2-1.2: new high-speed USB device number 6 using ehci-pci
13:Apr 13 18:57:35 kernel: usb 2-1.2: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
14:Apr 13 18:57:35 kernel: usb 2-1.2: New USB device strings: Mfr=2, Product=3, SerialNumber=1
15:Apr 13 18:57:35 kernel: usb 2-1.2: Product: Mercury Pro Optical
16:Apr 13 18:57:35 kernel: usb 2-1.2: Manufacturer: Other World Computing
17:Apr 13 18:57:35 kernel: usb 2-1.2: SerialNumber: 002933001438
28:Apr 13 18:58:49 kernel: usb 2-1.2: reset high-speed USB device number 6 using ehci-pci
32:Apr 13 18:59:19 kernel: usb 2-1.2: reset high-speed USB device number 6 using ehci-pci

Comment 8 Tom Horsley 2020-04-14 00:42:35 UTC

I may have done this most recent log on a different computer (but I thought it was the same one I used before but I lose track of which experiments I conduct where). This computer doesn't have any working USB 3 ports, but the USB drive hangs the exact same way in a usb 2 port.

One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1).

The call trace is just the trace everyone using rpmfusion nvidia drivers get every time X starts. See bug 1806257 for an example.

Comment 9 Steve 2020-04-14 00:50:16 UTC

$ grep DMI: dmesg-1.txt 
Apr 10 06:36:12 kernel: DMI: ASUSTeK Computer INC. VM40B/VM40B, BIOS 1501 12/09/2014

That's a mini-PC. According to the specs it has:

2 x USB 3.1 Gen 1
4 x USB 2.0 

https://www.asus.com/us/Mini-PCs/VivoPC_VM40B/specifications/

Comment 10 Steve 2020-04-14 01:09:19 UTC

(In reply to Tom Horsley from comment #8)
> I may have done this most recent log on a different computer (but I thought
> it was the same one I used before but I lose track of which experiments I
> conduct where). This computer doesn't have any working USB 3 ports, but the
> USB drive hangs the exact same way in a usb 2 port.

OK.

> One other oddity I noticed. This computer has an ASUS drive connected via
> sata, and it is normally /dev/sr0, but apparently plugging in the USB drive
> stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1).

Do you still show two drives?

$ ls -l /dev/sr*

> The call trace is just the trace everyone using rpmfusion nvidia drivers get
> every time X starts. See bug 1806257 for an example.

Thanks for mentioning that. dmesg-1.txt does not show any "taints" and is for an ASUS mini-PC.

Could you post this for the system you used when capturing dmesg-2.txt so we can keep track of how you are testing:

$ journalctl --no-hostname -k | egrep -i 'DMI:|taint'

Comment 11 Steve 2020-04-14 11:50:00 UTC

You might be able to get more information about what is happening at the USB layer by running these commands as root from the command-line:

# echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control
# echo 'module usbcore =pmf'  > /sys/kernel/debug/dynamic_debug/control

# echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
# echo 1     > /sys/kernel/debug/tracing/events/xhci-hcd/enable

Collect logs with:

$ journalctl --no-hostname -k > /tmp/dmesg-1.txt
$ sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt

Those are from Mathias upstream, except that I have added the "m" and "f" flags to log the module and function names:

https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21

I am not a USB expert, so this is speculative:

Enable "xhci_hcd" logging for the "SuperSpeed Gen 1" controller.
Enable "ehci-pci" logging for the "high-speed" controller.

Or enable both.

You can get a list of controllers with:

$ journalctl --no-hostname -b | grep -i controller

For more about the kernel's dynamic debugging facilities, see:

Dynamic debug
https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

Comment 12 Steve 2020-04-14 13:56:08 UTC

(In reply to Tom Horsley from comment #8)
...
> One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1).
...

Could you open a new bug for that with component "systemd"? An already-connected drive should not be renamed when you connect a second drive.

I believe that the component is "systemd" from reading the man page for "systemd-udevd", which says:

"systemd-udevd listens to kernel uevents. For every event, systemd-udevd executes matching instructions specified in udev rules."

Comment 13 Tom Horsley 2020-04-14 16:26:44 UTC

Created attachment 1678735 [details]
rerun of journalctl on the asus mini-pc with usb debugging on

I've run the journalctl command, this time on the mini-pc I used for the first set of journalctl output and with the usb debug enabled.

There was a lot of clutter from linux not liking the wifi chip, so I disabled wifi and ran it for a second time, there may be stuff at the beginning from the first try.

Comment 14 Tom Horsley 2020-04-14 16:29:11 UTC

Created attachment 1678737 [details]
the usb debug trace output from the same run

No doubt this has output from both times I loaded the DVD since I didn't turn trace off and back on for the 2nd time I ran the test.

Comment 15 Tom Horsley 2020-04-14 16:45:19 UTC

(In reply to Steve from comment #12)
> (In reply to Tom Horsley from comment #8)
> ...
> > One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1).
> ...
> 
> Could you open a new bug for that with component "systemd"? An
> already-connected drive should not be renamed when you connect a second
> drive.

This was spurious. Apparently somewhere in there one of the cables came loose on the ASUS drive, so it wasn't there at all. I wiggled the cables and it is better now (and the usb does show up as /dev/sr1).

Comment 16 Steve 2020-04-14 19:24:16 UTC

(In reply to Tom Horsley from comment #13)
> Created attachment 1678735 [details]
> rerun of journalctl on the asus mini-pc with usb debugging on
> 
> I've run the journalctl command, this time on the mini-pc I used for the first set of journalctl output and with the usb debug enabled.
> 
> There was a lot of clutter from linux not liking the wifi chip, so I disabled wifi and ran it for a second time, there may be stuff at the beginning from the first try.

Thanks. That looks much cleaner. I am not qualified to give a diagnosis except to say that the USB layer looks happy as a clam. :-)

At this point you could:

1. Try to enable dynamic debugging output for the next layer up. "sr"?

2. Open an upstream bug at https://bugzilla.kernel.org.
   I looked at the choices for Product/Component and two possibilities are: Drivers/USB and IO/Storage/SCSI.

$ less -N dmesg-3.txt
...
    163 Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
    164 Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command
    165 Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
    166 Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00
    167 Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
    168 Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling.
    169 Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x
    169 3faeda020
    170 Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
...

Comment 17 Steve 2020-04-14 19:36:35 UTC

(In reply to Tom Horsley from comment #14)
> Created attachment 1678737 [details]
> the usb debug trace output from the same run
> 
> No doubt this has output from both times I loaded the DVD since I didn't turn trace off and back on for the 2nd time I ran the test.

Thanks. That is bigger than I was expecting:

$ wc -l trace-3.txt
88468 trace-3.txt

There are two "Stall Error" events, which probably correspond to your two test runs, but that is as far as I care to comment on this trace. A USB expert would need to look at it to say what is normal and what is not:

$ grep -C1 'Stall Error' trace-3.txt 
     usb-storage-3363  [001] d...  2362.886878: xhci_ring_ep_doorbell: Ring doorbell for Slot 6 ep1in
 systemd-journal-445   [000] d.h.  2362.886891: xhci_handle_event: EVENT: TRB 00000003bfb8b090 status 'Stall Error' len 13 slot 6 ep 3 type 'Transfer Event' flags e:C
 systemd-journal-445   [000] d.h.  2362.886907: xhci_handle_transfer: BULK: Buffer 0000000406879080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
--
     usb-storage-3461  [001] d...  2612.623140: xhci_ring_ep_doorbell: Ring doorbell for Slot 7 ep1in
 systemd-journal-445   [000] d.h.  2612.623154: xhci_handle_event: EVENT: TRB 00000003faeda090 status 'Stall Error' len 13 slot 7 ep 3 type 'Transfer Event' flags e:C
 systemd-journal-445   [000] d.h.  2612.623167: xhci_handle_transfer: BULK: Buffer 0000000406879080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C

BTW, the trace appears to have timestamps in column four, so your two test runs would have been about 250 seconds apart.

Comment 18 Steve 2020-04-14 19:43:14 UTC

(In reply to Tom Horsley from comment #15)
> (In reply to Steve from comment #12)
> > (In reply to Tom Horsley from comment #8)
> > ...
> > > One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1).
> > ...
> > 
> > Could you open a new bug for that with component "systemd"? An > already-connected drive should not be renamed when you connect a second drive.
> 
> This was spurious. Apparently somewhere in there one of the cables came loose on the ASUS drive, so it wasn't there at all. I wiggled the cables and it is better now (and the usb does show up as /dev/sr1).

Thanks for resolving that to a hardware problem.

I had an ethernet connection fail when I tried to use a connector with the clip broken off, so my "free" advice is to throw away any SATA cables that do not have clips on them. :-)

Comment 19 Steve 2020-04-14 20:20:21 UTC

I found a way to correlate the dmesg output and the trace by searching for "0x3faeda020". The "-B6" and the "-B7" options were chosen to cover the previous "ding dong". The results below are hard to read, but there is no indication that the USB layer is having any problems, except possibly for the "CErr 3" in line 79145 of trace-3.txt.


$ grep -n -B6 '0x3faeda020' dmesg-3.txt 
163-Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
164-Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command
165-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
166-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00
167-Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
168-Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling.
169:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x3faeda020
170-Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
171-Apr 14 12:15:38 kernel: xhci_hcd:handle_tx_event: xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for slot 7 ep 2
172:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Removing canceled TD starting at 0x3faeda020 (dma).


$ grep -n -B7 '0x3faeda020 (dma)' trace-3.txt
79139-    kworker/u4:0-3251  [001] d...  2715.633800: xhci_ring_host_doorbell: Ring doorbell for Command Ring 0
79140-          <idle>-0     [000] d.h.  2715.633849: xhci_handle_event: EVENT: TRB 00000003faeda020 status 'Stopped' len 2048 slot 7 ep 3 type 'Transfer Event' flags e:c
79141-          <idle>-0     [000] d.h.  2715.633871: xhci_handle_transfer: BULK: Buffer 00000003edacd000 length 4096 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
79142-          <idle>-0     [000] d.h.  2715.633876: xhci_inc_deq: EVENT 000000002c5e8e5e: enq 0x0000000407eba000(0x0000000407eba000) deq 0x0000000407ebac00(0x0000000407eba000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
79143-          <idle>-0     [000] d.h.  2715.633878: xhci_handle_event: EVENT: TRB 0000000407eb8490 status 'Success' len 0 slot 7 ep 0 type 'Command Completion Event' flags e:c
79144-          <idle>-0     [000] d.h.  2715.633879: xhci_handle_command: CMD: Stop Ring Command: slot 7 sp 0 ep 3 flags C
79145-          <idle>-0     [000] d.h.  2715.633883: xhci_handle_cmd_stop_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 15 maxp 1024 deq 00000003faeda021 avg trb len 0
79146:          <idle>-0     [000] d.h.  2715.633891: xhci_dbg_cancel_urb: Removing canceled TD starting at 0x3faeda020 (dma).

Comment 20 Steve 2020-04-14 20:37:25 UTC

According to the timestamps below, there is a 10 second gap between these two records. Would that be enough to cause a timeout in the upper layers?

$ grep -n -A1 '2705.384478:.*EVENT' trace-3.txt
79134:          <idle>-0     [000] d.h.  2705.384478: xhci_inc_deq: EVENT 000000002c5e8e5e: enq 0x0000000407eba000(0x0000000407eba000) deq 0x0000000407ebabf0(0x0000000407eba000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
79135-    kworker/u4:0-3251  [001] d...  2715.633748: xhci_urb_dequeue: ep1in-bulk: urb 0000000074932569 pipe 3221259392 slot 7 length 0/4096 sgs 1/1 stream 0 flags 00040201

Comment 21 Steve 2020-04-14 20:49:44 UTC

Use a bigger number here to look further back in time. Use the "G" command in "less" or the "End" key to jump to the end and then scroll backwards with the up-arrow key. There are no other large time gaps than the 10 second gap noted in Comment 20:

$ grep -n -B100 '0x3faeda020 (dma)' trace-3.txt | less
            ^^^

Comment 22 Steve 2020-04-14 21:05:42 UTC

Tom in Comment 0:

"... I discovered I could play the same DVD on the same drive just find on a Windows 10 system using the Windows version of VLC."

Could you try a timing test with Windows? I'm not sure what events to use, but when the DVD drive tray closes would be a good starting event. If there is a light on the DVD drive, it might be possible to wait for it to go out and use that as the ending event.

A wristwatch "stopwatch" should be accurate enough for the purpose. If you take it off, it might be possible to operate it with one hand.

Specifically, I'm wondering if Windows is just more "patient". :-)

Comment 23 Steve 2020-04-14 22:14:18 UTC

Under /sys/module/scsi_mod/parameters/ there a scsi parameter called "inq_timeout". According to the kernel source code "inq_timeout" is the:

"Timeout (in seconds) waiting for devices to answer INQUIRY."
" Default is 20. Some devices may need more; most need less.");

So try this:

$ sudo sh -c "echo 60 > /sys/module/scsi_mod/parameters/inq_timeout"
$ cat /sys/module/scsi_mod/parameters/inq_timeout

("60" is a guess.)

Details:

$ less -N drivers/scsi/scsi_scan.c
...
    108 static unsigned int scsi_inq_timeout = SCSI_TIMEOUT/HZ + 18;
    109 
    110 module_param_named(inq_timeout, scsi_inq_timeout, uint, S_IRUGO|S_IWUSR);
    111 MODULE_PARM_DESC(inq_timeout, 
    112                  "Timeout (in seconds) waiting for devices to answer INQUIRY."
    113                  " Default is 20. Some devices may need more; most need less.");
...

Comment 24 Steve 2020-04-14 22:31:39 UTC

(In reply to Thomas Schmitt from comment #6)

Thanks for posting that link, where you say in the same thread:

"139255322 us     [ 180481085 ] 

This says that the run already lasted 180 seconds and the SCSI command
to which this line belongs lasted 139 seconds. (Your patience is in good shape.)"
https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org/message/QIIFFUF4SQCWEWAIV26SMTEWKEPKUWQI/

So the "inq_timeout" may need to be much higher that "60" seconds.

If that turns out to be the case, I would say that this is a firmware problem with the USB drive enclosure.

Comment 25 Steve 2020-04-14 22:59:02 UTC

(In reply to Steve from comment #11)
> You might be able to get more information about what is happening at the USB layer by running these commands as root from the command-line:
> 
> # echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control
> # echo 'module usbcore =pmf'  > /sys/kernel/debug/dynamic_debug/control
> 
> # echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> # echo 1     > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> 
> Collect logs with:
> 
> $ journalctl --no-hostname -k > /tmp/dmesg-1.txt
> $ sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt

This enables dynamic debugging for "scsi_mod":

# echo 'module scsi_mod =pmf' > /sys/kernel/debug/dynamic_debug/control

Tested in a VM with this running in a terminal window:

$ journalctl --no-hostname -k -f

And then by inserting and removing a USB flash drive with the F31 Live image on it.

I don't know how informative the output will be, because I got messages like these:

Apr 14 22:54:24 kernel: scsi_mod:scsi_runtime_idle: scsi target6:0:0: scsi_runtime_idle
Apr 14 22:54:24 kernel: scsi_mod:scsi_runtime_suspend: scsi target6:0:0: scsi_runtime_suspend

For tracing, you may need to do this too:

# echo 1     > /sys/kernel/debug/tracing/events/scsi_mod/enable

Comment 26 Tom Horsley 2020-04-14 23:53:58 UTC

On windows 10 (on the same asus mini pc) it takes 13 seconds from the time I hit the button to close the tray and load the disc till the lights stop flashing, so that's comfortably under 20 seconds. VLC can start playing immediately if I hit play at the end of that time. So I don't think there's a firmware problem in the USB enclosure :-).

Comment 27 Thomas Schmitt 2020-04-15 07:34:38 UTC

Hi,

> So the "inq_timeout" may need to be much higher that "60" seconds.

INQUIRY is an SCSI command which should be done very quickly. It tells
no info about the medium but only static drive properties.
SPC-3 says in 6.1:
"The INQUIRY data should be returned even though the device server is
not ready for other commands. The standard INQUIRY data should be available
without incurring any media access delays."

The further statement does not apply to DVD, because no INQUIRY data
come from CD, DVD, or BD. Nevertheless it further confirms that the
INQUIRY should be done quickly:
"If the device server [the drive] does store some of the standard INQUIRY
data or VPD data on the media, it may return ASCII spaces (20h) in ASCII
fields and zeros in other fields until the data is available from thei
media."


> If that turns out to be the case, I would say that this is a firmware
> problem with the USB drive enclosure.

The error symptoms are like that.
But why only with (copy protected) DVD ? 
My first guess would be power supply problems. This would match the
observation that the problem is not present via direct SATA connection
and happens only with particular media.

But it works in the USB box under MS-Windows and the box has its own
power supply rather than getting its power via data cable. Both does
not give the idea of plain power shortage.

We could learn more about the timing if Tom Horsley would post the
complete libburn SCSI command log obtained by

  xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log

(Maybe now the drive is /dev/sr1.
 xorriso uses libburn, which uses ioctl(SG_IO) for its SCSI transactions.)


Have a nice day :)

Thomas

Comment 28 Steve 2020-04-15 15:07:35 UTC

(In reply to Tom Horsley from comment #26)

Thanks for doing that timing test. 13 seconds sounds like a good benchmark for Linux to meet. :-)

Could you post the output from "lsmod"? (per the following)

(In reply to Thomas Schmitt from comment #27)

Thanks for those quotes. So the comments in the scsi driver code appear to be inconsistent with the SCSI specification:

"Default is 20 [seconds]. Some devices may need more; most need less [for a SCSI INQUIRY to complete]."

And, as Tom notes, there are no problems when the DVD drive is connected directly to a SATA port. (Comment 2)

However, the log shows several drivers at play. Why do we need cxgb3i, cxgb4i, bnx2i, and be2iscsi?

The problem here could be that too many or the wrong drivers are being loaded.

$ fgrep -B20 -n -w 'uas' dmesg-1.txt | egrep -v 'audit|RPC'
758-Apr 10 06:36:12 kernel: Chelsio T3 iSCSI Driver cxgb3i v2.0.1-ko (Apr. 2015)
759-Apr 10 06:36:12 kernel: iscsi: registered transport (cxgb3i)
760-Apr 10 06:36:12 kernel: Chelsio T4-T6 iSCSI Driver cxgb4i v0.9.5-ko (Apr. 2015)
761-Apr 10 06:36:12 kernel: iscsi: registered transport (cxgb4i)
762-Apr 10 06:36:12 kernel: cxgb4i:cxgb4i_init_module: cxgb4i dcb enabled.
763-Apr 10 06:36:12 kernel: cnic: QLogic cnicDriver v2.5.22 (July 20, 2015)
764-Apr 10 06:36:12 kernel: QLogic NetXtreme II iSCSI Driver bnx2i v2.7.10.1 (Jul 16, 2014)
765-Apr 10 06:36:12 kernel: iscsi: registered transport (bnx2i)
766-Apr 10 06:36:12 kernel: iscsi: registered transport (be2iscsi)
767-Apr 10 06:36:12 kernel: In beiscsi_module_init, tt=(____ptrval____)
775-Apr 10 06:36:13 kernel: usb-storage 4-2:1.0: USB Mass Storage device detected
776-Apr 10 06:36:13 kernel: scsi host4: usb-storage 4-2:1.0
777-Apr 10 06:36:13 kernel: usbcore: registered new interface driver usb-storage
778:Apr 10 06:36:13 kernel: usbcore: registered new interface driver uas

Comment 29 Thomas Schmitt 2020-04-15 16:11:21 UTC

Hi,

> 13 seconds sounds like a good benchmark for Linux to meet. :-)

This depends on drive and medium. The drive won't tell much during most
of the medium accessing time. But INQURIY should work.

The first INQURY of a libburn access here with drive tray sticking out
succeeds 52 milliseconds after the library was started up. 1.8 seconds
later the command to load the tray gets issued. It needs 1.25 seconds
to complete. After 13 more seconds the drive reports to be ready for
work with the medium.
The first SCSI command READ(10) happens 16.2 seconds after library start.

> The problem here could be that too many or the wrong drivers are being loaded.

iSCSI is SCSI-over-internet. So cxgb3i, cxgb4i, bnx2i, be2iscsi should
all play no role in Tom's situation. (It might be interesting to find
out why they get started and why there are four of them.)

The communication between userland, kernel, and drive goes through the
upper and middle layers of Linux scsi driver as described in
  https://www.kernel.org/doc/html/v4.13/driver-api/scsi.html
where the lowest layer is only sparsely mentioned.
I assume that in Tom's case this layer is described in
  https://www.kernel.org/doc/html/v4.13/driver-api/usb/index.html
but there i get lost with my userland-to-drive SCSI knowledge.

It is a riddle how the functionality of USB can get disturbed by a
medium condition which is supposed to affect the relation of drive to
medium and of Linux uppermost SCSI level to drive.

Have a nice day :)

Thomas

Comment 30 Tom Horsley 2020-04-15 17:05:50 UTC

(In reply to Thomas Schmitt from comment #27)

> We could learn more about the timing if Tom Horsley would post the
> complete libburn SCSI command log obtained by
> 
>   xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log

I doubt you'll learn much. Here's the complete output:

[tweety@vivo ~]$ xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log
xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

 740819134 us     [ 789589505 ]
libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive
xorriso : FAILURE : Cannot acquire drive '/dev/sr0'
xorriso : aborting : -abort_on 'FAILURE' encountered 'FAILURE'

The only thing that came out was the header line, the time and failure messages only appeared when I powered off the drive to get the hung process to exit. Merely ejecting the disk with the button on the drive had no effect, the command was still hung.

Comment 31 Tom Horsley 2020-04-15 17:11:04 UTC

Created attachment 1679148 [details]
log of a data dvd working in the same drive

Just for contrast, I tried a data DVD (no encryption) in the drive and I do get a perfectly wonderful log from it.

Comment 32 Thomas Schmitt 2020-04-15 17:36:31 UTC

Hi,

> I doubt you'll learn much.
>  740819134 us     [ 789589505 ]
> libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive

Indeed that's sparse. The last time your xorriso run could at least
perform some SCSI commands after having acquired /dev/sr0.

I wonder why the name of the SCSI command is not shown which was stuck 
for 740 seconds. I assume it is the first INQUIRY which shall confirm 
that the device claims to be a CD drive. Timeout for this command is
30 seconds. So its end was 710 seconds overdue.
(It might also be a TEST UNIT READY. More than those two cannot be
 issued before the program code with the FAILURE message is reached.)

Whatever, the connection to the drive seems to have been completely
unusable. 49 seconds seem to have been spent before the SCSI command 
was started (789589505 - 740819134 microseconds). I have few idea what 
system calls might have needed so long.
As said, for me it lasts about 50 milliseconds to get that far.

Have a nice day :)

Thomas

Comment 33 Thomas Schmitt 2020-04-15 17:57:54 UTC

Hi,

> I tried a data DVD (no encryption) in the drive and I do get a 
> perfectly wonderful log from it.

With the one riddle that again a SCSI transaction timestamp is shown
with no command before it. It is not the INQUIRY, which is shown next.
So it must be the TEST UNIT READY.
(I will have to investigate why you get that timestamp whereas i don't.)

The reported times look good. After 3 seconds the drive is done with
reporting its state and its perception of the loaded medium.

Have a nice day :)

Thomas

Comment 34 Steve 2020-04-15 18:48:02 UTC

(In reply to Thomas Schmitt from comment #29)

Thanks for the timings and for the doc links, which already help.*

To add dynamic debugging output for the CD-ROM SCSI upper layer:

In a terminal window, run:

$ journalctl --no-hostname -k | tee -i /tmp/sr_mod-log-1.txt

In a separate terminal window, enable logging for the "sr_mod" module and for the USB modules:

# echo 'module sr_mod =pmf'   > /sys/kernel/debug/dynamic_debug/control
# echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control
# echo 'module usbcore =pmf'  > /sys/kernel/debug/dynamic_debug/control

Load the test DVD.

Run "xorriso".

NB: I don't have a working optical drive ATM, but the commands run fine in a VM and some logging is generated when a USB flash drive is inserted and removed.

* SCSI Interfaces Guide
https://www.kernel.org/doc/html/v4.13/driver-api/scsi.html

Comment 35 Thomas Schmitt 2020-04-15 18:55:38 UTC

Hi,

at least libburn has a bug less now. The SCSI commands issued in the
course of the Linux-specific drive acquiration were not logged.

This was not noticed here, because they get emitted only if ioctl
HDIO_GET_IDENTITY says the drive is under control of the old driver
"hd" for Parallel ATA (decommissioned long ago), or if ioctl
SG_GET_SCSI_ID returns error. I bet that the latter happens to you.
In this case, Linux would refuse to tell the address which it logged
e.g. as "6:0:0:0:" in
  Apr 13 18:57:36 kernel: sr 6:0:0:0: Attached scsi CD-ROM sr0


So the lonely timestamp probably stems from TEST UNIT READY, which gets
emitted in either case. An INQUIRY is emitted only if HDIO_GET_IDENTITY
does not indicate "hd". But that second command would have caused a
second timestamp. So i guess it's TEST UNIT READY.

ioctl SG_GET_SCSI_ID might have consumed the 49 seconds before the
TEST UNIT READY command was started.

Have a nice day :)

Thomas

Comment 36 Steve 2020-04-15 19:35:56 UTC

(In reply to Steve from comment #19)

I just noticed a 30 second gap in the log output posted in Comment 19. That shows that the log records may be out of order, because the "sr" upper layer "hostbyte=DID_TIME_OUT" entry is logged *before* the 30 second gap.

So why is xhci_hcd "stopping port polling" and 30 seconds later cancelling a URB*?

Copied from Comment 19:

$ grep -n -B6 '0x3faeda020' dmesg-3.txt 
163-Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
164-Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command
165-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
166-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00
167-Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
168-Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling.
                 ^^
                 vv
169:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x3faeda020
170-Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
171-Apr 14 12:15:38 kernel: xhci_hcd:handle_tx_event: xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for slot 7 ep 2
172:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Removing canceled TD starting at 0x3faeda020 (dma).

* "There are two ways to cancel an URB you’ve submitted but which hasn’t been returned to your driver yet."

How to cancel an already running URB?
https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html#how-to-cancel-an-already-running-urb

"URB" means "USB Request Block"
https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html

Comment 37 Tom Horsley 2020-04-15 19:50:11 UTC

OK, I did another run of xorriso with lots of kernel debugging turned on (I don't understand it, I just turned on everything I saw mentioned in the comments). The xorriso log acted a bit different, it printed the initial timestamp all by itself without me ejecting the disk or turning off the drive:

xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

 30214061 us     [ 54708892 ]
libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive
xorriso : FAILURE : Cannot acquire drive '/dev/sr0'
xorriso : aborting : -abort_on 'FAILURE' encountered 'FAILURE'

The FAILURE lines did come out after I turned off the drive to get it out of the hang.

I'll attach dmesg.txt and kernel-debug.txt created during the run of the hung xorriso with the movie dvd in the drive.

Comment 38 Tom Horsley 2020-04-15 19:52:15 UTC

Created attachment 1679160 [details]
dmesg during run of xorriso hang

Comment 39 Tom Horsley 2020-04-15 19:53:00 UTC

Created attachment 1679161 [details]
kernel debug during run of xorriso hang

Comment 40 Steve 2020-04-15 21:20:10 UTC

(In reply to Tom Horsley from comment #39)
> Created attachment 1679161 [details]
> kernel debug during run of xorriso hang

Thanks. The good news is that some SCSI commands were logged. The bad news is that they don't seem to explain the 30 second gap here. This is extremely hard to read, but there are line numbers on the left side and timestamps in column 5. I've separated the records where there is a 30 second gap.

The search pattern is 'Cancel URB' and the entries before it are displayed up to the first 'Ring doorbell' entry (going backwards or "up").

$ fgrep -B9 -n -m1 'Cancel URB' trace-4.txt
2967-     usb-storage-315   [001] d... 10102.320216: xhci_ring_ep_doorbell: Ring doorbell for Slot 3 ep4in
2968-          <idle>-0     [000] d.h. 10102.320342: xhci_handle_event: EVENT: TRB 0000000407b21910 status 'Success' len 0 slot 3 ep 9 type 'Transfer Event' flags e:c
2969-          <idle>-0     [000] d.h. 10102.320347: xhci_handle_transfer: BULK: Buffer 000000040615c000 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
2970-          <idle>-0     [000] d.h. 10102.320350: xhci_inc_deq: BULK 000000009249b2eb: enq 0x0000000407b21920(0x0000000407b21000) deq 0x0000000407b21920(0x0000000407b21000) segs 2 stream 0 free_trbs 509 bounce 1024 cycle 1
2971-          <idle>-0     [000] d.h. 10102.320353: xhci_urb_giveback: ep4in-bulk: urb 00000000ae4357dc pipe 3221357184 slot 3 length 13/13 sgs 0/0 stream 0 flags 00000204
2972-          <idle>-0     [000] d.h. 10102.320356: xhci_inc_deq: EVENT 000000009abdcb21: enq 0x0000000407e6c000(0x0000000407e6c000) deq 0x0000000407e6c210(0x0000000407e6c000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
2973-     usb-storage-315   [001] .... 10102.320425: scsi_dispatch_cmd_done: host_no=4 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=86866184 txlen=8 protect=0 raw=2a 00 05 2d 79 08 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
                                          ^^
                                          vv
2974-    kworker/1:1H-335   [001] .... 10132.015859: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
2975-    kworker/u4:2-6805  [000] d... 10132.026879: xhci_urb_dequeue: ep1in-bulk: urb 000000001e88754a pipe 3221259392 slot 7 length 0/4096 sgs 1/1 stream 0 flags 00040201
2976:    kworker/u4:2-6805  [000] d... 10132.026912: xhci_dbg_cancel_urb: Cancel URB 000000001e88754a, dev 1, ep 0x81, starting at offset 0x3f8dc8900

NB: I renamed the log "trace-4.txt" for consistency and clarity.

Comment 41 Steve 2020-04-15 21:42:10 UTC

This will display just the SCSI commands:

$ fgrep -i -n scsi trace-4.txt | less

There is one oddity that can be viewed with:

$ fgrep -C4 -n scsi_eh_wakeup trace-4.txt | less

There are three of these timeouts:

$ fgrep -n scsi_dispatch_cmd_timeout trace-4.txt | wc -l
3

The first two show a significant time gap before the previous record (29.695434 s, 24.566161 s). The third does not (0.683883 s):

$ fgrep -B1 -n scsi_dispatch_cmd_timeout trace-4.txt | less

Comment 42 Thomas Schmitt 2020-04-15 21:46:11 UTC

Hi,

> the "sr" upper layer "hostbyte=DID_TIME_OUT" entry is logged *before*
> the 30 second gap.
> $ grep -n -B6 '0x3faeda020' dmesg-3.txt 

It could be from a previous SCSI-via-USB transaction attempt.

  https://bugzilla.redhat.com/attachment.cgi?id=1678735
  "rerun of journalctl on the asus mini-pc with usb debugging on "

shows:

Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 000000003f0a2866, dev 1, ep 0x81, starting at offset 0x3faedac40
...
Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong!
Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command
Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s

> https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html

If i only could see the way how an URB transports a SCSI command, its
possible data, its possible result data, and its success indication.

> https://bugzilla.redhat.com/attachment.cgi?id=1679160
> dmesg during run of xorriso hang

Says that sr0 has SCSI address 5:0:0:0 , which corresponds in

> https://bugzilla.redhat.com/attachment.cgi?id=1679161
> kernel debug during run of xorriso hang

to "host_no=5 channel=0 id=0 lun=0".

These two might mark start and timeout of a READ command for 2 blocks
from Logical Block Address 2:

   kworker/1:1H-335   [001] .... 10162.370474: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)
   ...
   kworker/1:1H-335   [001] .... 10192.432179: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
          <idle>-0     [001] d.s. 10192.438071: scsi_eh_wakeup: host_no=5

Problem with interpreting the lines inbetween is the busy device 4:0:0:0,
which is reported in dmesg as
  Apr 15 08:37:52 kernel: usb 4-2: Product: My Book 1130
  Apr 15 08:37:52 kernel: usb 4-2: Manufacturer: Western Digital
  ...
  Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access     WD       My Book 1130     1016 PQ: 0 ANSI: 6
  ...
  Apr 15 08:37:53 kernel: sd 4:0:0:0: [sdb] 1953458176 512-byte logical blocks: (1.00 TB/931 GiB)

How to filter away the USB messages of this disk ?

---------------------------------------------------------------
About comment 40:

> 2973-     usb-storage-315   [001] .... 10102.320425: scsi_dispatch_cmd_done:
> host_no=4 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL
> cmnd=(WRITE_10 lba=86866184 txlen=8 protect=0 raw=2a 00 05 2d 79 08 00 00 08
> 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE

That's the disk. The DVD drive will get no WRITE_10 and offer no LBA
86866184 (0x052d7908) for reading. That's byte address 177,901,944,832.
I understand this transaction was successful.

> 2974-    kworker/1:1H-335   [001] .... 10132.015859:
> scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1
> prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2
> protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK
> host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

This is the DVD drive having a timeout.
Probably started by

> systemd-udevd-7136  [000] .... 10101.875658: scsi_dispatch_cmd_start:
> host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0
> prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0
> raw=28 00 00 00 02 00 00 00 02 00)

immediately after another READ_10 succeeded:

> usb-storage-4803  [001] .... 10101.875369: scsi_dispatch_cmd_done:
> host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0
> prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=256 txlen=2 protect=0
> raw=28 00 00 00 01 00 00 00 02 00)
> result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE
> status=SAM_STAT_GOOD)

(There are many instances interested in reading blocks from the start
of the DVD.)

Have a nice day :)

Thomas

Comment 43 Thomas Schmitt 2020-04-15 22:03:20 UTC

Errata:

I wrote:
> These two might mark start and timeout of a READ command for 2 blocks
> from Logical Block Address 2:
> [...] raw=28 00 00 00 02 00 00 00 02 00 [...]

This is Logical Block Address 512 (0x00000200), not 2 (0x00000002).

Comment 44 Thomas Schmitt 2020-04-15 22:20:14 UTC

Hi,

> $ fgrep -B1 -n scsi_dispatch_cmd_timeout trace-4.txt | less

If i run

  grep -n 'scsi_dispatch_cmd.*: host_no=5' trace-4.txt | less

i see lots of successful SCSI transactions
  scsi_dispatch_cmd_start
  scsi_dispatch_cmd_done

Most are automatic 4Ah GET EVENT STATUS NOTIFICATION issued by the
kernel
  raw=4a 01 00 00 10 00 00 00 08 00

Only three timeouts are to see
  10101.875658 to 10132.015859
  10132.156237 to 10162.224030
  10162.370474 to 10192.432179
all nearly 30 seconds. All with command READ_10 at LBA 512.

Have a nice day :)

Thomas

Comment 45 Steve 2020-04-15 22:44:38 UTC

Thomas:

You already realized this, but when I went to post it, there was a BZ mid-air collision, so I am posting after your comments:

There are three values for "host_no" in trace-4.txt: 0, 4, 5. So the trace log has entries for several devices.

They appear to correspond to these devices:

$ grep -n 'scsi [0-9]' dmesg-4.txt 
691:Apr 15 08:37:52 kernel: scsi 0:0:0:0: Direct-Access     ATA      Crucial_CT512MX1 MU02 PQ: 0 ANSI: 5
782:Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access     WD       My Book 1130     1016 PQ: 0 ANSI: 6
...
1073:Apr 15 12:45:37 kernel: scsi 5:0:0:0: CD-ROM            HL-DT-ST BD-RE  WH16NS40  1.04 PQ: 0 ANSI: 0
...

Comment 40 shows that host_no=4 is being accessed. That's the "My Book", which is an external USB drive. Why would it be getting accessed?

For the record:

$ fgrep 'DMI:' dmesg-4.txt 
Apr 15 08:37:52 kernel: DMI: ASUSTeK Computer INC. VM40B/VM40B, BIOS 1501 12/09/2014

Comment 46 Steve 2020-04-15 23:02:11 UTC

Thomas wrote in Comment 42:

> How to filter away the USB messages of this disk ?

$ grep -v <pattern> ... # With the "-v" option, records matching the pattern are removed from the output.

So this will remove the SCSI records for the My Book 1130:

$ grep -v -n 'host_no=4' trace-4.txt | less

Comment 47 Tom Horsley 2020-04-15 23:13:14 UTC

The WD My Book is where fedora is installed for purposes of running fedora tests on what is normally a Windows 10 mini pc (the internal disk is windows only).

Comment 48 Steve 2020-04-15 23:41:17 UTC

(In reply to Tom Horsley from comment #47)
> The WD My Book is where fedora is installed for purposes of running fedora tests on what is normally a Windows 10 mini pc (the internal disk is windows only).

Thanks. The "-k" option removes a lot of context from the log, but there is just enough, barely, to infer that, in retrospect.* :-)

Could you post or attach the list of your kernel modules:

$ lsmod > /tmp/lsmod-1.txt

* ==
$ grep -n 'sd[ab][0-9]' dmesg-4.txt 
700:Apr 15 08:37:52 kernel:  sda: sda1 sda2 sda3 sda4 sda5 sda6
806:Apr 15 08:37:59 kernel:  sdb: sdb1 sdb2 sdb3 sdb4 sdb5
814:Apr 15 08:38:00 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null)
889:Apr 15 12:38:08 kernel: EXT4-fs (sdb4): re-mounted. Opts: (null)
891:Apr 15 12:38:08 kernel: Adding 8214524k swap on /dev/sdb3.  Priority:-2 extents:1 across:8214524k FS
968:Apr 15 12:38:15 kernel: EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
971:Apr 15 12:38:15 kernel: EXT4-fs (sdb5): mounted filesystem with ordered data mode. Opts: (null)

Comment 49 Tom Horsley 2020-04-16 00:12:25 UTC

Created attachment 1679183 [details]
lsmod listing

I tried running this before and after accessing the dvd, but got the same list, so apparently nothing was dynamically loaded during the test.

Comment 50 Steve 2020-04-16 01:25:32 UTC

(In reply to Tom Horsley from comment #49)
> Created attachment 1679183 [details]
> lsmod listing
> 
> I tried running this before and after accessing the dvd, but got the same list, so apparently nothing was dynamically loaded during the test.

Thanks. Apparently there are kernel modules for enclosures:

SCSI- and enclosure-related:

$ egrep 'scsi|ses|enclosure' lsmod.txt
ses                    20480  0
enclosure              16384  1 ses
scsi_transport_sas     49152  1 ses

$ modinfo ses | egrep 'description|^name'
description:    SCSI Enclosure Services (ses) driver
name:           ses

$ modinfo enclosure | egrep 'description|^name'
description:    Enclosure Services
name:           enclosure

$ modinfo scsi_transport_sas | egrep 'description|^name'
description:    SAS Transport Attributes
name:           scsi_transport_sas


USB-related:

$ egrep 'usb|usb_storage|uas' lsmod.txt
btusb                  57344  0
btrtl                  24576  1 btusb  # "bt" is bluetooth, so we can ignore those.
btbcm                  16384  1 btusb
btintel                32768  1 btusb
bluetooth             634880  33 btrtl,btintel,btbcm,bnep,btusb,rfcomm
uas                    32768  0
usb_storage            81920  6 uas

$ modinfo uas | egrep 'description|^name|import_ns'
import_ns:      USB_STORAGE  # "ns" is namespace.
name:           uas

$ modinfo usb_storage | egrep 'description|^name|import_ns'
description:    USB Mass Storage driver for Linux
name:           usb_storage

Comment 51 Steve 2020-04-16 02:11:25 UTC

The My Book also has an enclosure, but the enclosure is logged as LUN* 1 on SCSI host 4:

$ fgrep -A3 -n -m1 'scsi 4' dmesg-4.txt 
782:Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access     WD       My Book 1130     1016 PQ: 0 ANSI: 6
783-Apr 15 08:37:53 kernel: scsi 4:0:0:1: Enclosure         WD       SES Device       1016 PQ: 0 ANSI: 6 <<< note "SES Device" here.
784-Apr 15 08:37:53 kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
785-Apr 15 08:37:53 kernel: scsi 4:0:0:1: Attached scsi generic sg2 type 13

And there are some problems reported for it:

$ fgrep -A3 -n 'scsi 4:0:0:1: Wrong' dmesg-4.txt 
917:Apr 15 12:38:11 kernel: scsi 4:0:0:1: Wrong diagnostic page; asked for 1 got 8
918-Apr 15 12:38:11 kernel: scsi 4:0:0:1: Failed to get diagnostic page 0x1
919-Apr 15 12:38:11 kernel: scsi 4:0:0:1: Failed to bind enclosure -19
920-Apr 15 12:38:11 kernel: ses 4:0:0:1: Attached Enclosure device <<<<<<<<<<<<<<< "ses" module logged here.

* Logical Unit Number
https://www.kernel.org/doc/html/latest/driver-api/scsi.html

Comment 52 Steve 2020-04-16 02:53:32 UTC

In contrast to the My Book enclosure, the OWC enclosure appears as a "USB Mass Storage device":

$ fgrep -A7 -n -m1 'usb 4-1' dmesg-4.txt 
1065:Apr 15 12:45:36 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 3 using xhci_hcd
1066-Apr 15 12:45:36 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
1067-Apr 15 12:45:36 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
1068-Apr 15 12:45:36 kernel: usb 4-1: Product: Mercury Pro Optical
1069-Apr 15 12:45:36 kernel: usb 4-1: Manufacturer: Other World Computing
1070-Apr 15 12:45:36 kernel: usb 4-1: SerialNumber: 002933001438
1071-Apr 15 12:45:36 kernel: usb-storage 4-1:1.0: USB Mass Storage device detected
1072-Apr 15 12:45:36 kernel: scsi host5: usb-storage 4-1:1.0

Comment 53 Steve 2020-04-16 03:47:29 UTC

Thomas wrote in Comment 42:

> How to filter away the USB messages of this disk ?

Second reply with a more precise answer than in Comment 42:

Many of the USB records are tagged with one of only two identifiers: "usb-storage-315" or "usb-storage-4803":

$ cat trace-4.txt | awk '/usb-storage-/ { print $1 }' | sort -u
usb-storage-315
usb-storage-4803

I believe the number is a process or thread ID.

This command selects records that have one of those tags and a "host_no" field:

$ cat trace-4.txt | awk '/usb-storage-.*host_no/ { print $1, $6 }' | sort -u
usb-storage-315 host_no=4
usb-storage-4803 host_no=5

Comment 54 Steve 2020-04-16 03:50:14 UTC

(In reply to Steve from comment #53)
...
> Second reply with a more precise answer than in Comment 42:
...

That back reference should be to Comment 46.

Comment 55 Steve 2020-04-16 04:30:45 UTC

(In reply to Steve from comment #53)
... 
> Many of the USB records are tagged with one of only two identifiers: "usb-storage-315" or "usb-storage-4803":
...

Actually, the log has records for many processes that ran during the test, including "vi":

$ cat trace-4.txt | awk '/^#.*/ { next }; { print $1 }' | sort -u
cdrom_id-7134
cdrom_id-7135
cdrom_id-7137
dbus-broker-3532
dconf-service-3777
gdbus-3632
gdbus-4073
gnome-shell-3619
gnome-terminal--5456
gvfs-udisks2-vo-3782
<idle>-0
jbd2/sdb4-8-350
jbd2/sdb5-8-577
journal-offline-7141
kworker/0:1H-186
kworker/0:2-6861
kworker/1:1-7061
kworker/1:1H-335
kworker/u4:0-6955
kworker/u4:2-6805
NetworkManager-739
pool-gnome-shel-7153
pool-udisksd-7142
rcu_sched-10
systemd-journal-444
systemd-udevd-460
systemd-udevd-7133
systemd-udevd-7136
tracker-miner-f-3890
usb-storage-315
usb-storage-4803
vi-7157
xorriso-7138

Comment 56 Steve 2020-04-16 05:38:14 UTC

Here, some of the SCSI commands are not interpreted in the output. There is one time gap of 24.706237 seconds when the process id changes.

The awk script simply extracts the process name/id, the timestamp, and the command, and prints them.

$ cat trace-4.txt | awk '/^ *cdrom_id-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }'
cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
cdrom_id-7134 10075.091894: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7134 10075.092257: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7134 10075.092645: cmnd=(INQUIRY - raw=12 00 00 00 24 00)
cdrom_id-7134 10075.093117: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00)
cdrom_id-7134 10075.094305: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00)
cdrom_id-7134 10075.096011: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00)
cdrom_id-7134 10075.096510: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 14 00)
cdrom_id-7134 10075.096953: cmnd=(READ_TOC - raw=43 00 01 00 00 00 00 00 0c 00)
cdrom_id-7134 10075.097332: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 20 00)
cdrom_id-7134 10075.097914: cmnd=(ALLOW_MEDIUM_REMOVAL - raw=1e 00 00 00 00 00)
cdrom_id-7134 10075.098267: cmnd=(START_STOP - raw=1b 00 00 00 02 00)
cdrom_id-7135 10076.555466: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
cdrom_id-7135 10076.555865: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7135 10076.556471: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7135 10076.556979: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
cdrom_id-7135 10076.557221: cmnd=(INQUIRY - raw=12 00 00 00 24 00)
cdrom_id-7135 10076.557526: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00)
cdrom_id-7135 10076.558995: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00)
cdrom_id-7137 10101.265232: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
cdrom_id-7137 10101.265706: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7137 10101.266109: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
cdrom_id-7137 10101.266478: cmnd=(INQUIRY - raw=12 00 00 00 24 00)
cdrom_id-7137 10101.266972: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00)
cdrom_id-7137 10101.268269: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00)
cdrom_id-7137 10101.269830: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00)
cdrom_id-7137 10101.270420: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 14 00)
cdrom_id-7137 10101.271000: cmnd=(READ_TOC - raw=43 00 01 00 00 00 00 00 0c 00)
cdrom_id-7137 10101.271523: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 20 00)
cdrom_id-7137 10101.272007: cmnd=(ALLOW_MEDIUM_REMOVAL - raw=1e 00 00 00 01 00)

The uninterpreted commands can be found here:

SCSI command
https://en.wikipedia.org/wiki/SCSI_command

Comment 57 Steve 2020-04-16 06:40:09 UTC

(In reply to Steve from comment #56)
...
cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
...

cdrom_id is a user space program in the systemd-udev package. And it is run from:

/usr/lib/udev/rules.d/60-cdrom_id.rules

systemd-udevd is also a user space program ("man systemd-udevd"). There is a 24.692882 second gap where the process id changes, and there is a READ_10 sequence at the end of this:

$ cat trace-4.txt | awk '/^ *systemd-udevd-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }'
systemd-udevd-7133 10075.063881: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
systemd-udevd-7133 10075.064351: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7133 10076.541158: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
systemd-udevd-7133 10076.541595: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7133 10076.542209: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7133 10076.542660: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.235542: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
systemd-udevd-7136 10101.236058: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.236813: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.237177: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.237511: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 40)
systemd-udevd-7136 10101.237942: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 00)
systemd-udevd-7136 10101.238405: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00)
systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00)
systemd-udevd-7136 10101.239432: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 02 00)
systemd-udevd-7136 10101.239850: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 22 00)
systemd-udevd-7136 10101.240180: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 08 00)
systemd-udevd-7136 10101.240521: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 20 00)
systemd-udevd-7136 10101.240863: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.241083: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 40)
systemd-udevd-7136 10101.241427: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 00)
systemd-udevd-7136 10101.241904: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00)
systemd-udevd-7136 10101.242366: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00)
systemd-udevd-7136 10101.242762: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 02 00)
systemd-udevd-7136 10101.243178: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 22 00)
systemd-udevd-7136 10101.243573: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 08 00)
systemd-udevd-7136 10101.244047: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 20 00)
systemd-udevd-7136 10101.277361: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
systemd-udevd-7136 10101.277807: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
systemd-udevd-7136 10101.278369: cmnd=(READ_10 lba=4101504 txlen=2 protect=0 raw=28 00 00 3e 95 80 00 00 02 00)
systemd-udevd-7136 10101.516146: cmnd=(READ_10 lba=4101506 txlen=1 protect=0 raw=28 00 00 3e 95 82 00 00 01 00)
systemd-udevd-7136 10101.517146: cmnd=(READ_10 lba=0 txlen=2 protect=0 raw=28 00 00 00 00 00 00 00 02 00)
systemd-udevd-7136 10101.775259: cmnd=(READ_10 lba=8 txlen=2 protect=0 raw=28 00 00 00 00 08 00 00 02 00)
systemd-udevd-7136 10101.776173: cmnd=(READ_10 lba=16 txlen=2 protect=0 raw=28 00 00 00 00 10 00 00 02 00)
systemd-udevd-7136 10101.784320: cmnd=(READ_10 lba=32 txlen=2 protect=0 raw=28 00 00 00 00 20 00 00 02 00)
systemd-udevd-7136 10101.793401: cmnd=(READ_10 lba=64 txlen=2 protect=0 raw=28 00 00 00 00 40 00 00 02 00)
systemd-udevd-7136 10101.826396: cmnd=(READ_10 lba=128 txlen=2 protect=0 raw=28 00 00 00 00 80 00 00 02 00)
systemd-udevd-7136 10101.847772: cmnd=(READ_10 lba=256 txlen=2 protect=0 raw=28 00 00 00 01 00 00 00 02 00)
systemd-udevd-7136 10101.875658: cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)

Comment 58 Thomas Schmitt 2020-04-16 07:53:13 UTC

Hi,

> Here, some of the SCSI commands are not interpreted in the output.

They are ok for optical drives. From MMC-5 specs:
  4Ah GET EVENT STATUS NOTIFICATION 6.7
  46h GET CONFIGURATION 6.6
  51h READ DISC INFORMATION 6.22
  52h READ TRACK INFORMATION 6.27

It is somewhat astounding that the debug log does not tell their
command names.

libburn does not emit 4Ah GET EVENT STATUS NOTIFICATION. Nevertheless
there is such a command attributed to xorriso-7138 in the log:

  xorriso-7138  [000] .... 10132.155618: scsi_dispatch_cmd_start: ...
  raw=4a 01 00 00 10 00 00 00 08 00)

It might get emitted automatically by ioctl(SG_IO) before the payload
command of SG_IO gets sent.

But the two commands attributed to xorriso get no corresponding
scsi_dispatch_cmd_done or scsi_dispatch_cmd_timeout. There are matching
lines attributed to usb-storage, though.

3325:         xorriso-7138  [000] .... 10132.155618: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
3353:     usb-storage-4803  [000] .... 10132.155909: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

3354:         xorriso-7138  [000] .N.. 10132.155951: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=0 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00)
3373:     usb-storage-4803  [000] .... 10132.156207: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=0 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

I am not sure whether we always can trust that process name attribution.
At least not with scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout
lines.


> There is one time gap of 24.706237 seconds when the process id changes.
> ...
> cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
> ...
> cdrom_id-7135 10076.555466: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)
> ...
> cdrom_id-7137 10101.265232: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00)

Afaik, cdrom-id belongs to sytemd's udev. The time gap seems just to
be the interval between three drive inspections by udev.
(If we can trust the process name ...)

-----------------------------------------------------------------------------

I have this nebulous theory stub:

The READ_10 commands to host=5 lba=512 (possibly to identify the
filesystem on DVD) get stuck and finally time out. Interestingly, reads
to lba=4101504 or lba=0 to lba=256 succeed. Only lba=512 times out.

If some other SCSI command is issued to the drive during the waiting
time of the timeout, then it falls victim to this state.
There is no such occasion to see in trace-4.txt, though.

(If both commands are emitted via ioctl(SG_IO) to /dev/sr, then they
come through sr_block_ioctl() in drivers/scsi/sr.c with its quite
brainless mutex_lock(). In this case the other SCSI command would have
to wait until the READ_10 is done. Even if both commands are for
different drives.
But POSIX i/o circumvents that mutex-stricken function.
So i assume that commands to the same drive queue in the lower layers.)

Have a nice day :)

Thomas

Comment 59 Steve 2020-04-16 09:01:29 UTC

(In reply to Thomas Schmitt from comment #58)

Thanks for your very informative analysis. I never would have thought to pair up different process names. :-)

> I have this nebulous theory stub:

I was thinking along those lines too, although you are much better informed about the details than I am.

There was another BZ mid-air collision, so I will post what I had written before I read your analysis. Can you suggest any other tests?

==
Here is another analysis. Several processes are sending SCSI commands to the DVD drive. Some have system-level privileges, so they could take control of the DVD drive and lock out all other processes.

So here are some simple tests, the first of which is to log out and repeat the test. (running xorriso won't be possible, though)

A second one is to boot to "runlevel" 3 (append "3" to the kernel command line from grub2), log in, and repeat the test. Reboot when done.

And a third one is to stop systemd-udevd:

# systemctl stop systemd-udevd
NB: This change is not permanent. However, it could cause some things to stop working. Reboot or use the "start" command when done.

Verify with:

$ systemctl status systemd-udevd

Details:

These processes are sending SCSI commands to the DVD drive (host_no=5):

$ cat trace-4.txt | grep 'host_no=5.*cmnd=' | awk '{ print $1 }' | sort -u
cdrom_id-7134
cdrom_id-7135
cdrom_id-7137
kworker/0:1H-186
kworker/0:2-6861
kworker/1:1-7061
kworker/1:1H-335
systemd-udevd-7133
systemd-udevd-7136
usb-storage-4803
xorriso-7138

These processes are sending SCSI commands to the My Book drive (host_no=4):

$ cat trace-4.txt | grep 'host_no=4.*cmnd=' | awk '{ print $1 }' | sort -u
dconf-service-3777
jbd2/sdb4-8-350
jbd2/sdb5-8-577
journal-offline-7141
kworker/0:1H-186
kworker/1:1H-335
kworker/u4:0-6955
NetworkManager-739
pool-gnome-shel-7153
systemd-journal-444
usb-storage-315
vi-7157

And these to the SSD (host_no=0):

$ cat trace-4.txt | grep 'host_no=0.*cmnd=' | awk '{ print $1 }' | sort -u
<idle>-0
pool-udisksd-7142

Comment 60 Steve 2020-04-16 09:10:11 UTC

systemd-udevd might get restarted if a process attempts to connect to one of its sockets:

# systemctl stop systemd-udevd
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket

So stop all of those systemd units:

# systemctl stop systemd-udevd\*

Verify with:

# systemctl list-units --all systemd-udevd\*

Tested in an F31 VM.

Comment 61 Thomas Schmitt 2020-04-16 09:52:14 UTC

Hi,

@Tom:

Did you already try a xorriso run which begins when the tray with
the DVD is still sticking out ?

If not, then it might be worth a try in the hope to keep udev
from opening the drive (see below: O_EXCL).

If yes, then was it the run which you reported on the mailing list
and which came far enough to run into
"SG_IO: host_status= 0x7 SG_ERR_DID_ERROR" ?

Also i support Steve's proposal to stop udev completely for a test.

-----------------------------------------------------------------
More theory and background:

The theory about the bad READ_10 to lba=512 would even match the
observation that MS-Windows succeeds:
If it does not try to read LBA 512 out of mere curiosity, and
if the filesystem on the DVD does not advertise LBA 512 as part
of any read-worthy object, then MS-Windows could stay away from
that address and experience no timeout.

LBA 512 is 1 megabyte from medium start (programmer's MiB: 1024 * 1024).
In an ISO 9660 i would expect to find valuable data there. But UDF
is wasteful with storage space at the start of the medium. So it might
really be a spot which is not referred by the filesystem metadata.

> Several processes are sending SCSI commands to the DVD drive.

Yes. A READ_10 by udev seems to be the trigger for the recognizable
timeouts. Question is whether this influences others.

> Some have system-level privileges, so they could take control of
> the DVD drive and lock out all other processes.

There is no such privileged control known to me.
Success of open(2) depends on device file permissions or Linux
capabilities(7) of the process.

Linux has the O_EXCL protocol for open(2) on device files:
While defined in POSIX only for data files, open(...,O_EXCL) on Linux
/dev/sr implements advisory locking. If a /dev/sr is already opened with
O_EXCL, then another open(2) with O_EXCL will fail with errno EBUSY.
open(2) without O_EXCL will succeed.

burn programs use O_EXCL to keep each other from interfering with their
respective activities.

> to stop systemd-udevd:
> # systemctl stop systemd-udevd

If udev's READ_10 lba=512 is indeed the trigger for the problems, then
this should show at least some change in behavior.

Have a nice day :)

Thomas

Comment 62 Tom Horsley 2020-04-16 13:46:52 UTC

Created attachment 1679385 [details]
xorriso log with udevd squashed

Holy cow! It seems to work properly when udevd is well and truly squashed. To use belt and suspenders, I ran these commands:

[root@vivo ~]# systemctl stop systemd-udevd
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket
[root@vivo ~]# systemctl stop systemd-udevd-kernel.socket
[root@vivo ~]# systemctl stop systemd-udevd-control.socket
[root@vivo ~]# systemctl disable systemd-udevd
[root@vivo ~]# systemctl mask systemd-udevd
Created symlink /etc/systemd/system/systemd-udevd.service → /dev/null.

Then, starting with the tray open and the dvd in the tray, I ran the xorriso command, and it never hung and appears to have functioned properly.

Just started dvdbackup to see if it could read and decrypt the disc, and it is running just fine. (I better go re-enable udevd now before I forget)

Comment 63 Tom Horsley 2020-04-16 13:49:17 UTC

Created attachment 1679386 [details]
xorriso log with udevd squashed

Yikes, picked wrong log file. This is the log of it working.

Comment 64 Steve 2020-04-16 14:42:12 UTC

Thomas in Comment 61:

> LBA 512 is 1 megabyte from medium start (programmer's MiB: 1024 * 1024).

Some of the other systemd-udevd READ_10 commands use LBAs that increase in powers of 2. (Comment 57)

That sounds like a "sizing" algorithm. Why would systemd-udevd need to do that?

Can you suggest a test program that does something similar?

Tom in Comment 62:

> Holy cow! It seems to work properly when udevd is well and truly squashed.

Thanks for squashing and testing. :-)

Comment 65 Thomas Schmitt 2020-04-16 14:44:49 UTC

Hi,

> Holy cow! It seems to work properly when udevd is well and truly squashed.

So we possibly have the trigger in form of a SCSI command 28h READ(10)
with LBA 512 and read size 2.

What happens if you cause that SCSI command with deactivated udev by

  dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null

while the tray is already loaded and the drive stopped blinking.
(I doubt that the tray load waiting bug is fixed in kernel 5.
If the tray later refuses to come out by its own eject button,
use program "eject". That's another bug here when USB is involved.)

dd should normally end after 0 to 5 seconds and report 4096 bytes
being copied.

> ISO session  :   1 ,         0 ,   4101507s , STALLFF
> Media summary: 1 session, 4101507 data blocks, 8011m data,     0 free

xorriso recognizes an ISO 9660 filesystem. But i assume that this is
just a protective hull around an UDF filesystem for video.

Well, everything seems to work well.
(I still riddle why the drive acquiration code of libburn sends a
single SCSI command of which we only see the timestamp in the log.
It appears that ioctl HDIO_GET_IDENTITY states that the drive is
attached via ATAPI. Very strange.)

------------------------------------------------------------------------

What's still missing for a plausible theory is how USB contributes to
the timeout which seems not to happen or not to matter with direct SATA.

Depending on the outcome of above dd it would be worth to spy on
the SATA attached drive whether udev tries that READ(10) for LBA 512
too and what is the outcome of that try.

But first we should learn whether simply reading blocks 512 and 513
really makes the USB connection mad. For now this is the biggest
plausibility gap in my theory.

It could of course be that READ(10) is a red herring and that udev does
something to USB which produces the timeout as mere consequence.

Have a nice day :)

Thomas

Comment 66 Steve 2020-04-16 14:56:10 UTC

(In reply to Tom Horsley from comment #62)
...
> [root@vivo ~]# systemctl stop systemd-udevd
...

For the record, could you post the version of systemd-udev you are testing with:

$ rpm -q systemd-udev

> I better go re-enable udevd now before I forget

This should show you the current configuration:

$ systemctl list-units --all systemd-udevd\*

Comment 67 Tom Horsley 2020-04-16 15:23:23 UTC

Here's the version and unit info:

systemd-udev-243.8-1.fc31.x86_64

[tweety@vivo ~]$ systemctl list-units --all systemd-udevd\*
UNIT                         LOAD   ACTIVE SUB     DESCRIPTION               
systemd-udevd.service        loaded active running udev Kernel Device Manager
systemd-udevd-control.socket loaded active running udev Control Socket       
systemd-udevd-kernel.socket  loaded active running udev Kernel Socket        

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

3 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.

Comment 68 Tom Horsley 2020-04-16 15:40:32 UTC

I made sure udev was squashed again, verified that xorriso worked well (and the dvd was loaded and ready), then ran the dd command:

[tweety@vivo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
dd: error reading '/dev/sr0': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 146.173 s, 0.0 kB/s

It sat there with no lights flashing on the drive for a minute or two, ejecting the disk didn't do anything, finally powering off the disk made it exit with errors.

I guess it really hates that location on the disk.

Comment 69 Thomas Schmitt 2020-04-16 15:46:51 UTC

Hi,

> That sounds like a "sizing" algorithm. Why would systemd-udevd need
> to do that?

It is a strange pattern, indeed.

There is such a size detector named
  blkdev_find_size()
in line 52 ff. of
  https://github.com/karelzak/util-linux/blob/master/lib/blkdev.c

I understand that it starts its search for a high limit at byte 1024.
That would still be LBA 0. Then it doubles the byte address.
This is not the sequence of 2048-byte LBAs we see in trace-4.txt:
0, 8, 16, 32, ...
The missing 1 is covered by the read size of 2 blocks in the READ_10.

But LBA 2 and 4 are not covered by the first READ_10 and not read by
the subsequent READ_10s. So why don't READ_10s happen for them ?

Grrr.

Whatever, blkdev_find_size() is the last resort of
  blkdev_get_size()
in line 83.

This gets called by
  blkid_probe_set_device()
in line 858 ff. of
  https://github.com/karelzak/util-linux/blob/master/libblkid/src/probe.c

blkid_probe_set_device() is called by function
  builtin_blkid()
in line 272 of
  https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-blkid.c

But why would none of the ioctls in blkdev_get_size() yield a result ?

Have a nice day :)

Thomas

Comment 70 Thomas Schmitt 2020-04-16 15:55:45 UTC

Hi,

> [tweety@vivo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
> dd: error reading '/dev/sr0': Input/output error
> 0 bytes copied, 146.173 s, 0.0 kB/s
> [...] finally powering off the disk made it exit with errors.

Now this is significant.

Can i talk you into connecting the drive directly by SATA and to
try getting a log like
  https://bugzilla.redhat.com/attachment.cgi?id=1679161
  "kernel debug during run of xorriso hang"
which Steve now calls "trace-4.txt" but you called "kernel-log.txt".

Goal is to see whether
  ... READ_10 lba=512 ...
gets reported and what is its outcome.

It would also be interesting to see whether
  dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
works with direct SATA.

Have a nice day :)

Thomas

Comment 71 Tom Horsley 2020-04-16 16:19:27 UTC

Out of random curiosity, I tried plugging in the old pioneer slim USB drive which never had problems. The dd command works fine with it, no problem reading that block with the exact same disk in the drive.

Comment 72 Tom Horsley 2020-04-16 16:53:12 UTC

Too much trouble to unscrew drives again, but I have a different system with an ASUS BW-16D1HT drive connected directly via sata. It had the same problem when it was in the usb enclosure, so I assume it will act the same. Running the dd command with the movie in the drive fails, but does not hang, just gets I/O errors:

[tweety@rebo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
dd: error reading '/dev/sr0': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 2.94269 s, 0.0 kB/s

So it looks like something about the USB enclosure is causing the hang and it still hates that address on the disk even with a direct sata connection.

Comment 73 Steve 2020-04-16 17:00:39 UTC

(In reply to Thomas Schmitt from comment #70)
> which Steve now calls "trace-4.txt" but you called "kernel-log.txt".

So the next one will be "trace-5.txt". :-)

Using incremental version numbers prevents name collisions when downloading several attachments. They also make it easier to discuss specific files.

> Goal is to see whether
>   ... READ_10 lba=512 ...
> gets reported and what is its outcome.

Tom: If you do that again, could you post the commands you used to enable tracing so we have a record of how the trace log was collected?

They are probably still in your shell history, so this should find them:

# history | grep '/sys/kernel'

> It would also be interesting to see whether
>   dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
> works with direct SATA.

I was going to speculate that there is something about the way the DVD was mastered. Even with a SATA-attached DVD drive, I have had inconsistent results with commercial DVDs. Sometimes they cannot be read. (Admittedly, the DVDs are from the library, which means that they are not always in perfect condition.)

So that suggests another test -- what happens with a different commercial DVD?

> Have a nice day :)
> 
> Thomas

Comment 74 Thomas Schmitt 2020-04-16 17:27:39 UTC

Hi,

> ASUS BW-16D1HT drive connected directly via sata.
> [tweety@rebo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null
> dd: error reading '/dev/sr0': Input/output error
> 0 bytes copied, 2.94269 s, 0.0 kB/s
> So it looks like something about the USB enclosure is causing the hang and
> it still hates that address on the disk even with a direct sata connection.

Yes. The READ_10 command fails for some systematic reason of DVD and
drive (hardware or firmware).

The dramatic reaction with USB must be something between USB box and
drive or between USB box and the Linux USB drivers.

Do you see messages about sr0 and "Read(10)" from the SATA experiment ?

What do you get from this xorriso run with the SATA attached drive:

  xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 --

(This run will emit a READ(10) for LBA 512 with size 2. It should report
any SCSI error codes from that command.)

Have a nice day :)

Thomas

Comment 75 Thomas Schmitt 2020-04-16 17:35:07 UTC

With "Do you see messages" i meant "in dmesg".

Comment 76 Steve 2020-04-16 19:27:47 UTC

(In reply to Thomas Schmitt from comment #69)
...
> There is such a size detector named
>   blkdev_find_size()
> in line 52 ff. of
>   https://github.com/karelzak/util-linux/blob/master/lib/blkdev.c
...

Thanks for that link. systemd-udevd links to a library from util-linux:

$ rpm -q --requires systemd-udev | grep -m1 libblkid
libblkid.so.1()(64bit)

Indeed, there is a call to a blkid sizing function here:

$ git -C systemd-git grep blkid_probe_get_size
src/udev/udev-builtin-blkid.c:            blkid_probe_get_size(pr) <= 1024 * 1440 &&

However, that file appears to be a copy:

$ git -C systemd-git grep 'Karel Zak' src
src/udev/udev-builtin-blkid.c: * Copyright © 2011 Karel Zak <kzak>

The systemd repo is here:
https://github.com/systemd/systemd

Comment 77 Steve 2020-04-16 19:38:07 UTC

(In reply to Steve from comment #76)
> ... systemd-udevd links to a library from util-linux:

Strictly speaking, libblkid.so.1 is in a sub-package:

$ rpm -qf /usr/lib64/libblkid.so.1
libblkid-2.34-4.fc31.x86_64

$ rpm -qi libblkid | grep Source
Source RPM  : util-linux-2.34-4.fc31.src.rpm

Comment 78 Tom Horsley 2020-04-16 20:24:01 UTC

Created attachment 1679482 [details]
trace of loading disk on direct connect sata drive

Here's the trace I got of simply loading the disk in the ASUS sata drive. There are 4 instances of READ_10 lba=512 in there.

I had previously made a script to log stuff, so I just ran it again (it didn't like the attempt to enable scsi_mod, I guess that files doesn't exist):

#!/bin/sh
echo 'module sr_mod =pmf'   > /sys/kernel/debug/dynamic_debug/control
echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =pmf'  > /sys/kernel/debug/dynamic_debug/control
echo 'module scsi_mod =pmf' > /sys/kernel/debug/dynamic_debug/control
echo 'module scsi =pmf' > /sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1     > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1     > /sys/kernel/debug/tracing/events/scsi/enable
echo 1     > /sys/kernel/debug/tracing/events/scsi_mod/enable

Probably the usb related stuff is of no value as well.

I tried several different DVDs early on with the same hangs on all of them, so I don't think there is anything special about the DVD I've been sticking with for these tests.

Comment 79 Tom Horsley 2020-04-16 20:36:49 UTC

(In reply to Thomas Schmitt from comment #74)

> Do you see messages about sr0 and "Read(10)" from the SATA experiment ?

dmesg has lots of this stuff in it (just a sample):

[  107.297869] sr 5:0:0:0: [sr0] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[  107.297876] sr 5:0:0:0: [sr0] tag#7 Sense Key : Illegal Request [current] 
[  107.297881] sr 5:0:0:0: [sr0] tag#7 Add. Sense: Read of scrambled sector without authentication
[  107.297885] sr 5:0:0:0: [sr0] tag#7 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00
[  107.297889] blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  109.397832] sr 5:0:0:0: [sr0] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[  109.397840] sr 5:0:0:0: [sr0] tag#15 Sense Key : Illegal Request [current] 
[  109.397844] sr 5:0:0:0: [sr0] tag#15 Add. Sense: Read of scrambled sector without authentication
[  109.397848] sr 5:0:0:0: [sr0] tag#15 CDB: Read(10) 28 00 00 00 02 01 00 00 01 00
[  109.397852] blk_update_request: I/O error, dev sr0, sector 2052 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  109.397856] Buffer I/O error on dev sr0, logical block 513, async page read
[  110.546831] sr 5:0:0:0: [sr0] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[  110.546838] sr 5:0:0:0: [sr0] tag#21 Sense Key : Illegal Request [current] 
[  110.546842] sr 5:0:0:0: [sr0] tag#21 Add. Sense: Read of scrambled sector without authentication
[  110.546846] sr 5:0:0:0: [sr0] tag#21 CDB: Read(10) 28 00 00 3e 94 80 00 00 40 00
[  110.546851] blk_update_request: I/O error, dev sr0, sector 16404992 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0

> 
> What do you get from this xorriso run with the SATA attached drive:
> 
>   xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 --
> 
> (This run will emit a READ(10) for LBA 512 with size 2. It should report
> any SCSI error codes from that command.)

[tweety@rebo ~]$ xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 --
xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

xorriso : NOTE : Disc status unsuitable for writing
Drive current: -outdev '/dev/sr0'
Media current: DVD-ROM
Media status : is written , is closed
Media summary: 1 session, 4101507 data blocks, 8011m data,     0 free
libburn : SORRY : SCSI error on read_10(512,2): [5 6F 03] Illegal request. Read of scrambled sector without authentication.
xorriso : UPDATE : 2 of 2 blocks read in 1 s , 0.0xD
xorriso : UPDATE : 2 of 2 blocks read in 1 s = 0.0xD
Media checks :        lba ,       size , quality
Media region :          0 ,        512 , 0 untested
Media region :        512 ,          2 , - unreadable
Media region :        514 ,    4100993 , 0 untested
xorriso : NOTE : Tolerated problem event of severity 'SORRY'
xorriso : NOTE : -return_with SORRY 32 triggered by problem severity SORRY

Looks like it knows something about copy protection in the hardware. Possibly the old pioneer drive just goes ahead and reads the scrambled data without complaint.

Comment 80 Steve 2020-04-16 20:47:46 UTC

(In reply to Tom Horsley from comment #78)

Thanks for posting the shell script and for attaching trace-5.txt.

There are more SCSI devices on this system:

$ cat trace-5.txt | awk '/^#/ { next }; { print $6 }' | sort -u
host_no=0
host_no=1
host_no=2
host_no=4
host_no=5

Could you also attach dmesg-5.txt from the same system you used to run your test:

$ journalctl --no-hostname -k > /tmp/dmesg-5.txt

And to simplify further discussion, could you suggest a name for it? (ASUS desktop?)

Comment 81 Steve 2020-04-16 21:15:12 UTC

Created attachment 1679492 [details]
systemd-udevd records with SCSI commands extracted from trace-5.txt.

systemd-udevd records with SCSI commands extracted from trace-5.txt.

The records all have "host_no=5", but I am simply assuming that that is the SCSI id for the SATA-connected ASUS DVD drive.

Command-line is:

$ cat trace-5.txt | awk '/^ *systemd-udevd-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }'

Comment 82 Thomas Schmitt 2020-04-16 21:41:11 UTC

Hi,

> [  107.297876] sr 5:0:0:0: [sr0] tag#7 Sense Key : Illegal Request [current]
> [  107.297881] sr 5:0:0:0: [sr0] tag#7 Add. Sense: Read of scrambled sector
> without authentication
> [  107.297885] sr 5:0:0:0: [sr0] tag#7 CDB: Read(10) 28 00 00 00 02 00 00 00
> 02 00

That's indeed a READ_10 to LBA 0x00000200 = 512.
The "Add. Sense:" text is the cleartext for SCSI error code 5 6F 03.
MMC-5 has about READ(10) in section 6.15.3:
 "Any read by the Host to a Logical Block with a Title Key present in the
  sector (DVD-ROM Media Only), when the Authentication Success Flag (ASF)
  is set to zero, shall be blocked. The command shall be terminated with
  CHECK CONDITION status and SK/ASC/ASCQ values shall be set to
  ILLEGAL REQUEST/READ OF SCRAMBLED SECTOR WITHOUT AUTHENTICATION."

Section 4.10 "Content Protection" outlines various copy prevention methods:
CSS, CPRM/CPPM, VCPS, AACS. Quite complicated and surely not my kind of
programming topic.

> libburn : SORRY : SCSI error on read_10(512,2): [5 6F 03] Illegal request.
> Read of scrambled sector without authentication.

So libburn and kernel experience the same.

> Looks like it knows something about copy protection in the hardware.

libburn only knows the error text associated with [5 6F 03].
The drive firmware indeed seems to know more.

> Possibly the old pioneer drive just goes ahead and reads the scrambled
> data without complaint.

Good theory.

What do you get to see when dd-ing them from old pioneer through
a hex dumper ? Like 

  dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -t x1 | less

  dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -c | less

Zeros ? Cleartext snippets with od -c ?

---------------------------------------------------------------------

The question remains why a normal SCSI error condition wrecks the USB
connection. So back to trace-4.txt, where USB stuff is logged.

I compared the lines from
  scsi_dispatch_cmd_start ... READ_10 lba=256 ...
to
  scsi_dispatch_cmd_done ... READ_10 lba=256 ...
with the lines from
  scsi_dispatch_cmd_start ... READ_10 lba=512 ...
to
  scsi_dispatch_cmd_done ... READ_10 lba=512 ...
following
  usb-storage-4803
which seems to be in charge of processing the READ_10s.

They begin to differ after a dozen lines.
lba=256:

     usb-storage-4803  [001] .... 10101.875226: xhci_urb_enqueue: ep1in-bulk: urb 0000000036e6ba93 pipe 3221259392 slot 7 length 0/13 sgs 0/0 stream 0 flags 00000204
     usb-storage-4803  [001] d... 10101.875233: xhci_queue_trb: BULK: Buffer 000000040615c080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c

versus lba=512:

     usb-storage-4803  [000] .... 10132.027142: xhci_urb_enqueue: ep0out-control: urb 0000000072fd3232 pipe 2147484672 slot 7 length 0/0 sgs 0/0 stream 0 flags 00000000
     usb-storage-4803  [000] d... 10132.027146: xhci_queue_trb: CTRL: bRequestType 00 bRequest 01 wValue 0030 wIndex 0000 wLength 0 length 8 TD size 0 intr 0 type 'Setup Stage' flags I:i:c

from then on they run apart.
No idea what this USB stuff means, though.

Have a nice day :)

Thomas

Comment 83 Steve 2020-04-16 21:56:45 UTC

This is easier now that we know what to look for. :-)

These are the only trace records with a 'host_no' field and 'lba=512':

$ cat trace-5.txt | grep -n 'host_no=[0-9]' | grep 'lba=512'

8703:   systemd-udevd-2532  [001] ....   627.381674: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)

8716:    kworker/6:1H-469   [006] ....   627.424401: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

8718:    kworker/1:1H-536   [001] ....   627.441450: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00)

8789:          <idle>-0     [006] d.h.   628.987442: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

NB: I added newlines to make the records slightly more readable.

Comment 84 Steve 2020-04-16 22:27:43 UTC

(In reply to Thomas Schmitt from comment #82)
...
> ... when the Authentication Success Flag (ASF) is set to zero, shall be blocked.
...

Who can set that flag?

Comment 85 Steve 2020-04-16 22:59:50 UTC

Alan, upstream, enables usbmon tracing:

https://bugzilla.kernel.org/show_bug.cgi?id=207219#c1

But the output can only be interpreted by a USB expert. The output is in the attachments here:

Bug_207219 - [BISECTED] Sony Vaio laptop built-in ricoh webcam no longer found on Fedora 31 with 5.4.11 kernel 
https://bugzilla.kernel.org/show_bug.cgi?id=207219

So if the problem is now down to debugging at the USB level, I would suggest opening a bug upstream under Drivers/USB:

https://bugzilla.kernel.org/

Comment 86 Tom Horsley 2020-04-16 23:00:45 UTC

Created attachment 1679500 [details]
dmesg from system where trace-5 was done

This is a dmesg right after booting the system again, so there won't be any dvd errors in here, but you can see all the devices and such.

The host names I've given these systems are:

zooty - my main system I use all the time and the one where I first encountered the bug, but very little of the rest of the testing was done on zooty.

vivo - the ASUS mini PC which normally runs windows and where I've done 90% of the testing using an external USB drive where I installed fedora 31. (This is where I learned Windows 10 does not have a problem with DVDs in the external drive).

rebo - the system where I've done the latest tests using the sata ASUS drive. It also accumulated a lot of disks left over from other systems, so there are indeed many sata drives in here. This dmesg is from rebo.

Comment 87 Tom Horsley 2020-04-16 23:14:22 UTC

(In reply to Thomas Schmitt from comment #82)

> What do you get to see when dd-ing them from old pioneer through
> a hex dumper ? Like 
> 
>   dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -t x1 | less
> 
>   dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -c | less
> 
> Zeros ? Cleartext snippets with od -c ?

It looks quite a lot like random noise. Not zeroes, but nothing intelligible.

Comment 88 Steve 2020-04-16 23:23:33 UTC

(In reply to Steve from comment #85)
> Alan, upstream, enables usbmon tracing:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=207219#c1

BTW, usbmon tracing is easy to try out:

Figure out which USB bus your mouse is on.* Mine is on 3. And then run:

# cat /sys/kernel/debug/usb/usbmon/3u

As you move your mouse, output will scroll by in your terminal window.

NB: I don't know what the "s", "t", and "u" suffixes mean:

# ls /sys/kernel/debug/usb/usbmon
0s  0u  1s  1t  1u  2s  2t  2u  3s  3t  3u  4s  4t  4u

* Run:

$ lsusb

And look for the bus number for your mouse.

Comment 89 Steve 2020-04-16 23:31:11 UTC

(In reply to Tom Horsley from comment #86)
> Created attachment 1679500 [details]
> dmesg from system where trace-5 was done
> 
> This is a dmesg right after booting the system again, so there won't be any
> dvd errors in here, but you can see all the devices and such.

Thanks. That's perfect:

$ grep 'scsi [0-9]' dmesg-rebo-1.txt 
[    1.423646] scsi 0:0:0:0: Direct-Access     ATA      M4-CT064M4SSD2   000F PQ: 0 ANSI: 5
[    1.425494] scsi 1:0:0:0: Direct-Access     ATA      Hitachi HDS5C303 A5C0 PQ: 0 ANSI: 5
[    1.426253] scsi 2:0:0:0: Direct-Access     ATA      WDC WD4003FZEX-0 1A01 PQ: 0 ANSI: 5
[    1.426976] scsi 4:0:0:0: Direct-Access     ATA      WDC WD4003FZEX-0 1A01 PQ: 0 ANSI: 5
[    1.430499] scsi 5:0:0:0: CD-ROM            ASUS     BW-16D1HT        3.00 PQ: 0 ANSI: 5

> The host names I've given these systems are:

Great names and informative background:

> zooty - my main system I use all the time and the one where I first
> encountered the bug, but very little of the rest of the testing was done on
> zooty.
> 
> vivo - the ASUS mini PC which normally runs windows and where I've done 90%
> of the testing using an external USB drive where I installed fedora 31.
> (This is where I learned Windows 10 does not have a problem with DVDs in the
> external drive).
> 
> rebo - the system where I've done the latest tests using the sata ASUS
> drive. It also accumulated a lot of disks left over from other systems, so
> there are indeed many sata drives in here. This dmesg is from rebo.

Comment 90 Steve 2020-04-16 23:50:57 UTC

(In reply to Steve from comment #88)
...
> BTW, usbmon tracing is easy to try out:
...

Here is some documentation:

usbmon
https://www.kernel.org/doc/html/latest/usb/usbmon.html

Comment 91 Steve 2020-04-17 00:15:24 UTC

Wireshark can capture USB traffic using the /dev/usbmon devices:

$ ls /dev/usbmon*
/dev/usbmon0  /dev/usbmon1  /dev/usbmon2  /dev/usbmon3  /dev/usbmon4

USB capture setup
https://wiki.wireshark.org/CaptureSetup/USB

Comment 92 Steve 2020-04-17 01:23:26 UTC

I'm already in group "wireshark", but I couldn't run as a non-privileged user. However, this works as a demonstration using the bus that my USB keyboard and mouse are on (3).

All traffic on the bus is captured, so in the below example, both key presses and mouse movements generated capture data (not shown).

There are options to report more information: "man tshark". The wireshark GUI could be used instead.

$ sudo tshark -i usbmon3
Running as user "root" and group "root". This could be dangerous.
Capturing on 'usbmon3'
    1   0.000000         host → 3.3.0        USB 64 GET DESCRIPTOR Request DEVICE
...
18 packets captured

Comment 93 Steve 2020-04-17 02:56:57 UTC

(In reply to Steve from comment #92)
> I'm already in group "wireshark", but I couldn't run as a non-privileged user.

The /dev/usbmon device nodes are in group "usbmon"*, so I took a guess and added myself to that group. Now I can capture USB data without running as root.

The next problem is that wireshark dissects USB packets, but it shows nothing but "USBMS" and binary data for the SCSI layer.

However, it is fundamentally working, because when I attached an external USB hard drive with an OS installed on it, I found this in one of the packets:

0000   eb 63 90 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0   .c..............
...
0180   7d e8 2e 00 cd 18 eb fe 47 52 55 42 20 00 47 65   }.......GRUB .Ge
0190   6f 6d 00 48 61 72 64 20 44 69 73 6b 00 52 65 61   om.Hard Disk.Rea
01a0   64 00 20 45 72 72 6f 72 0d 0a 00 bb 01 00 b4 0e   d. Error........
...

That was captured while using the wireshark GUI, which works well.

Some packet filters can be applied, for example: "usb.transfer_type == 0x03".

According to the GUI that means: "URB transfer type: URB_BULK (0x03)".


* $ ls -l /dev/usbmon*
crw-r-----. 1 root usbmon 244, 0 Apr 16 06:56 /dev/usbmon0
crw-r-----. 1 root usbmon 244, 1 Apr 16 06:56 /dev/usbmon1
crw-r-----. 1 root usbmon 244, 2 Apr 16 06:56 /dev/usbmon2
crw-r-----. 1 root usbmon 244, 3 Apr 16 06:56 /dev/usbmon3
crw-r-----. 1 root usbmon 244, 4 Apr 16 06:56 /dev/usbmon4

Comment 94 Steve 2020-04-17 09:00:14 UTC

(In reply to Thomas Schmitt from comment #82)
...
versus lba=512:

     usb-storage-4803  [000] .... 10132.027142: xhci_urb_enqueue: ep0out-control: urb 0000000072fd3232 pipe 2147484672 slot 7 length 0/0 sgs 0/0 stream 0 flags 00000000
     usb-storage-4803  [000] d... 10132.027146: xhci_queue_trb: CTRL: bRequestType 00 bRequest 01 wValue 0030 wIndex 0000 wLength 0 length 8 TD size 0 intr 0 type 'Setup Stage' flags I:i:c
...

The kernel source for that 'Setup Stage' message is here:

drivers/usb/host/xhci.h:

static inline const char *xhci_decode_trb(...)
...
	case TRB_SETUP:
		sprintf(...)
...

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/usb/host/xhci.h#n2248

An upper-case flag means a set bit, a lower-case flag means an unset bit.

Comment 95 Thomas Schmitt 2020-04-17 09:26:48 UTC

Hi,

MMC-5, 6.15.3 says:
> > when the Authentication Success Flag (ASF) is set to zero, shall be
> > blocked.

Steve asks:
> Who can set that flag?

That's subject to research in MMC specs and maybe specs about the copy
prevention method that is used. So first the prevention method and the
purpose of the encrypted blocks would have to be determined.
I guess this means to dive into MS-Windows-centric DVD ripper fora and
start there as total noob with the computational handicap of using
GNU/Linux.

The encrypted blocks seem not to be used by computer software for the
normal operation of Tom's DVD. They might be needed for extra features
of the DVD, which Tom's usage does not try to provide. Or they might
be needed for the internal negociation between drive and medium about
the decryption key for the other blocks which contain the viceo payload.

Last time i looked, MMC specs were available for 30 USD at t10.org.
Google ...
   https://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc5r04.pdf
shows a "Guest Access" form, which may or may not give access to the
last draft version of MMC-5 for free.

  https://www.t10.org/drafts.htm#MMC_Family
shows "purchase" links for MMC-2 to MMC-6. The numbers are revisions.
So you do not need the older versions to learn about recent features.
Older versions can be informative about the specs history, though.
(I was quite disappointed by MMC-6, as MMC-5 has the better PDF
navigation elements and MMC-6 seemed not to contain anything new for
my purpose.)

Have a nice day :)

Thomas

Comment 96 Steve 2020-04-17 09:39:22 UTC

(In reply to Thomas Schmitt from comment #95)

OK on all that.

After looking at the xhci source code (Comment 94), I am now convinced that we are not going to debug this, and that this needs to be moved upstream:

https://bugzilla.kernel.org

And, IMO, the OWC firmware is somehow violating the USB protocol.

Comment 97 Steve 2020-04-17 09:43:01 UTC

(In reply to Steve from comment #96)
> And, IMO, the OWC firmware is somehow violating the USB protocol.

BTW, if Window authenticates, everything is hunky-dory.

Comment 98 Steve 2020-04-17 10:02:35 UTC

(In reply to Steve from comment #97)
> (In reply to Steve from comment #96)
> > And, IMO, the OWC firmware is somehow violating the USB protocol.
> 
> BTW, if Window authenticates, everything is hunky-dory.

Conjecture: The OWC firmware is trying to stop hosts from reading certain "secret" sectors, but trying to do that in the USB layer by sniffing SCSI packets.

If so, that is a design flaw in the OWC firmware, because it should not be looking into the encapsulated packets it is supposed to be just transferring.

Comment 99 Thomas Schmitt 2020-04-17 10:08:01 UTC

Hi,

now that the SCSI layer aspects seem to be explained and the riddles
moved out of my scope i will enter lurk mode with this bug report.
It now needs other experts.

-------------------------------------------------------------------
Summary of the findings so far:

The BD/DVD/CD drive LG WH16NS40 (and probably others) refuses to read
block 512 of a commercial video DVD, whereas an older Pioneer drive
delivers the data of this block.
The WH16NS40 returns SCSI error code:
  5 6F 03 READ OF SCRAMBLED SECTOR WITHOUT AUTHENTICATION

When attached to the computer directly by SATA, this error reply does
not cause further harm.
But if the SATA-USB converter of an USB box is involved, the READ(10)
SCSI command does not return but rather times out after 30 seconds.
The drive stays unusable until power-cycle.

The USB box is reported by dmesg as:
  New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00
  New USB device strings: Mfr=2, Product=3, SerialNumber=1
  Product: Mercury Pro Optical
  Manufacturer: Other World Computing

The "scrambled sector" seems not to be read by unauthorized READ(10)
commands during normal operation of the DVD. No error appears.

But for some reason, udev reads block 0, 8, 32, 64, 128, and finally 
tries block 512, where it causes the timeout and the loss of connection
to the USB box.
It looks like udev, or libblkid, or util-linux cannot determine the DVD
size by other means than a binary search for the last readable sector.
Normally there are ioctls which can do that (or SCSI commands like
25Ah READ CAPACITY).

The main riddle is now how a regular and expectable SCSI error condition
can break the connection between USB box and the higher Linux SCSI layers.

-----------------------------------------------------------------------

Have a nice day :)

Thomas

Comment 100 Steve 2020-04-17 10:14:11 UTC

(In reply to Steve from comment #98)
> (In reply to Steve from comment #97)
> > (In reply to Steve from comment #96)
> > > And, IMO, the OWC firmware is somehow violating the USB protocol.
> > 
> > BTW, if Window authenticates, everything is hunky-dory.
> 
> Conjecture: The OWC firmware is trying to stop hosts from reading certain "secret" sectors, but trying to do that in the USB layer by sniffing SCSI packets.
> 
> If so, that is a design flaw in the OWC firmware, because it should not be looking into the encapsulated packets it is supposed to be just transferring.

Thomas: You gave me that thought with some of your previous comments, including this one:

"What's still missing for a plausible theory is how USB contributes to the timeout which seems not to happen or not to matter with direct SATA." (Comment 65)

The short answer: The OWC firmware is sniffing SCSI packets.

BTW, I still highly recommend trying out Wireshark's USB capture facilities. Doing that gave me the insight to propose that explanation.

Comment 101 Thomas Schmitt 2020-04-17 10:40:52 UTC

Hi,

> The OWC firmware is sniffing SCSI packets.

This is not impossible, but:

The box would intrude a layer of communications which is not in its
scope as data transport device. Judgement of suitability of commands
is a matter of Linux SCSI layers and the DVD drive's firmware.

If the consequences are intentional, the box would react very harshly
to an attempt which already the drive prevents in the interest of the
DVD vendor.
The box itself cannot know which sectors of the DVD are banned for
unauthorized reading. It has to rely on the drive's judgement and would
then decide to aggravate the drive's reaction.

If the consequences are not intentional, then the box would stop working
by mistake or would drive the USB layer(s) of Linux nuts by behavior
which Linux does not expect.
This is the case where debugging by a Linux USB expert would help.

-----------------------------------------------------------------------

We could of course try to find out why udev decides to do the read
experiments with the DVD. This is not normal.

There is also the minor riddle why libburn's Linux adapter emits one
own SCSI command when acquiring the drive. Source study indicates that
it is a TEST UNIT READY command caused by
  ioctl(fd, HDIO_GET_IDENTITY, &tm)
returning
  (tm.config & 0x8000) || (tm.config & 0x4000)
which it should only do if the old /dev/hd driver controls the DVD drive
as ATAPI device.
I plan to work on this.

But the big riddle is why the drive in the USB box cannot be reached
any more via the Linux SCSI layers after the failed read attempt.

Have a nice day :)

Thomas

Comment 102 Steve 2020-04-17 10:56:39 UTC

(In reply to Thomas Schmitt from comment #101)
> Hi,
> 
> > The OWC firmware is sniffing SCSI packets.
> 
> This is not impossible, but:
> 
> The box would intrude a layer of communications which is not in its
> scope as data transport device. Judgement of suitability of commands
> is a matter of Linux SCSI layers and the DVD drive's firmware.
...

I completely agree. And, if I am right, the OWC enclosure is a defective product, because it is just supposed to transfer SCSI packets, not read them.

> The box itself cannot know which sectors of the DVD are banned for unauthorized reading.

The OWC firmware authors can read the appropriate specifications. From your results and Tom's, sector 512 is a "secret" sector.

I couldn't find an exact statement, but the Wikipedia article on the "Content Scramble System"* has a link to:

Standard ECMA-267 3rd Edition - April 2001 
120 mm DVD - Read-Only Disk
https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-267.pdf

See "Section 4  -  Data Format". I couldn't see an exact statement that sector 512 is a "secret" sector, but maybe I didn't look carefully enough.

* https://en.wikipedia.org/wiki/Content_Scramble_System

Comment 103 Steve 2020-04-17 11:10:40 UTC

(In reply to Thomas Schmitt from comment #101)
...
> We could of course try to find out why udev decides to do the read experiments with the DVD. This is not normal.
...

Yes. A properly mastered DVD should be fully indexed, so there would be no need to go poking around in the sectors just to see what is there. :-)

As you suggested before*, Windows could be simply avoiding the "secret" sector. So Linux should do the same thing when it detects a video DVD. That is a change that could be done in user space (in systemd-udevd/blkid).

Indeed, I would propose changing the component of this bug to "systemd".

* "If it [Windows] does not try to read LBA 512 out of mere curiosity, and if the filesystem on the DVD does not advertise LBA 512 as part of any read-worthy object, then MS-Windows could stay away from that address and experience no timeout." (Comment 61)

Comment 104 Thomas Schmitt 2020-04-17 11:33:05 UTC

Hi,

> Standard ECMA-267 3rd Edition - April 2001 

That's about the mechanical engineering, electrical engineering, and
- especially for people like me - algebra on finite rings and fields
which does the error prevention and correction for DVD.

The part for our problem is extremely sparse, though:

  16.3 Copyright Management Information (CPR_MAI)
  This field shall consist of 6 bytes. Their setting is application-
  dependent, for instance a video application. If this setting is not
  specified by the application, the default setting shall be to set
  all bytes to all ZEROs.

The other spot that looks like mentioning our problem is actually about
prevention of systematic transmission errors by non-secret scrambling:
   17 Scrambled Frames

I assume that the interested parties defined own specs on base of 16.3.
Possibly one has to be member of some consortium to get to see those.

----------------------------------------------------------------------

> As you suggested before*, Windows could be simply avoiding the
> "secret" sector.

Or both, MS-Windows and Tom's dvdbackup, know how to authorize for
access to block 512.
Obviously udev and i don't know. 

> Indeed, I would propose changing the component of this bug to "systemd".

One could spin off a systemd bug report which only mentions the fact
that udev seems not to be able to learn the size by other means.
The drive's refusal to read block 512 could possibly derail the size
determination by binary search, so that a wrong size is used in udev
rules. (I am not sure whether i have found the code which does the search.
So i cannot tell what a failed READ(10) will trigger.)

But i am quite sure that systemd developers will not take responsibility
for the USB hangs.

Have a nice day :)

Thomas

Comment 105 Tom Horsley 2020-04-17 12:46:26 UTC

My thought is that Windows doesn't go reading random addresses like udev apparently does. After all, when I squashed udev, xorriso was able to read all the info it printed just fine and apparently didn't touch any protected sectors. But it does sure seem like USB ought to be able to return an error rather than hang.

I was able to try this in a completely different USB enclosure and got the same hangs, so I doubt there is any specific OWC firmware doing something tricky, both the enclosures are probably using the same chips and getting the same results.

Comment 106 Tom Horsley 2020-04-17 13:01:32 UTC

Meanwhile I tried to see if I could make a kernel bug, but it tells me I'm not allowed to create a new account.

Comment 107 Tom Horsley 2020-04-17 13:11:02 UTC

The only chip google could find for sata to usb3 is from TI and the spec sheet says it comes with default firmware. I have this feeling no one in the external enclosure business would feel like modifying the firmware. I definitely remember seeing a surface mount chip in both enclosures that looked about 7mm on a side:

http://www.ti.com/lit/ds/symlink/tusb9261.pdf

Comment 108 Steve 2020-04-17 13:22:11 UTC

(In reply to Tom Horsley from comment #105)
> But it does sure seem like USB ought to be able to return an error rather than hang.

Refining my theory slightly: The OWC firmware is crashing. There could be a firmware update. If not, it might be possible to contact OWC and ask for tech support.

> I was able to try this in a completely different USB enclosure and got the
> same hangs, so I doubt there is any specific OWC firmware doing something
> tricky, both the enclosures are probably using the same chips and getting
> the same results.

What is the enclosure?

$ lsusb

Also, how did you change the region code? Was that done by a method supported by OWC?

Comment 109 Steve 2020-04-17 13:29:22 UTC

(In reply to Tom Horsley from comment #107)
> The only chip google could find for sata to usb3 is from TI and the spec
> sheet says it comes with default firmware. I have this feeling no one in the
> external enclosure business would feel like modifying the firmware. I
> definitely remember seeing a surface mount chip in both enclosures that
> looked about 7mm on a side:
> 
> http://www.ti.com/lit/ds/symlink/tusb9261.pdf

A search for "SATA USB bridge asmedia" found this:

ASM235CM
https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=172&item=177

"ASM235CM is the latest generation of ASMedia’s single chip solution, bridging the USB3.1 to Serial ATA host interface with highly integrated SuperSpeedPlus, SuperSpeed, High Speed, SATA 1.5/3.0/6.0 Gbps ASMedia self-designed PHYs as well as built-in CC logic circuit enabling reversible USB cable orientation and direction. Along with excellent compatibility with USB3.1 hosts and SATA devices, ASM235CM utilizes advanced process technology to optimize the chip power consumption."

Comment 110 Steve 2020-04-17 13:39:30 UTC

This Plugable page explicitly mentions firmware:

"We have made the following table to help you find out if there is a firmware update available for your hard drive dock and what the updates are for."

However:

"The manufacturers of the chipsets used in our docks (ASMedia, JMicron, LucidPort) do not make update programs for Apple Mac OS X or Linux/Unix."

Plugable USB 3.0 Hard Drive Dock Firmware History & Updates
https://plugable.com/drivers/hddockfirmware/

Comment 111 Thomas Schmitt 2020-04-17 13:54:00 UTC

Hi,

> I was able to try this in a completely different USB enclosure and got the
> same hangs, so I doubt there is any specific OWC firmware doing something
> tricky

This info should be added to the potential bug report towards Linux USB.

> My thought is that Windows doesn't go reading random addresses

Quite plausible. But it is still not outruled that MS-Windows and
dvdbackup know how to enable decryption. After all your drive is entitled
to read the vast majority of the data.

You could get an impression how many blocks are encrypted by a full
-check_media run with possibly an error message every 32 KiB of data.

   xorriso -outdev /dev/sr0 \
           -check_media use=outdev what=disc time_limit=1800 -- \
     2>&1 | tee -i /tmp/xorriso_check_media_log

This will try to read the blocks from 0 to 4101506 in chunks of 32 KiB
with hopefully no repetition.
The attempts will go on at most 1800 seconds, not to overwork your drive.
But i assume you will get possible error messages with high speed.
If you see much good progress and the xorriso run ends by
  xorriso : NOTE : -check_media: Reached time_limit=1800
then repeat it without "time_limit=1800".

The run can be aborted by Ctrl+C, or more gracefully by
  touch /var/opt/xorriso/do_abort_check_media

In the end there will be printed a summary about the readable and
unreadble block intervals. Maybe like:
  Media checks :        lba ,       size , quality
  Media region :          0 ,        512 , + good
  Media region :        512 ,       1024 , - unreadable
  Media region :       1536 ,      20480 , + good  
  Media region :      22016 ,    4079490 , - unreadable

If nearly the whole DVD is encrypted, then the success of reader software
means that it can enable the decryption capabilities of the drive.

Have a nice day :)

Thomas

Comment 112 Steve 2020-04-17 14:29:50 UTC

(In reply to Thomas Schmitt from comment #61)
...
> > Some have system-level privileges, so they could take control of the DVD drive and lock out all other processes.
> 
> There is no such privileged control known to me.
> Success of open(2) depends on device file permissions or Linux capabilities(7) of the process.
...

That theory is obsolete now, but thank-you for citing those man pages and for citing O_EXCL, specifically.

Comment 113 Steve 2020-04-17 15:12:31 UTC

OWC says they support Linux here:
https://www.owcdigital.com/products/mercury-pro-optical#tech-specs

And OWC appears to have excellent support, including by chat, phone, and email:
https://www.owcdigital.com/support/contact

Since you have a reliable and fairly simple reproducer, it might be useful ask OWC to take a look at the problem.

Comment 114 Tom Horsley 2020-04-17 17:40:25 UTC

Finally got the kernel bug submitted:

https://bugzilla.kernel.org/show_bug.cgi?id=207327

Some info I apparently neglected to save previously is missing, so here it is again:

The other enclosure I got a chance to test was a Vantec NST-536S3-BK NexStar DX
(but I don't have it any longer, it was borrowed). It hug the same way.

I set the region in the LG drive using the linux regionset utility.
While I had it connected to windows, I dove into the drive properties
and found Windows believed it was a region 1 drive, so regionset
seems to have worked properly.

Comment 115 Tom Horsley 2020-04-17 18:09:52 UTC

(In reply to Thomas Schmitt from comment #111)

> You could get an impression how many blocks are encrypted by a full
> -check_media run with possibly an error message every 32 KiB of data.
> 
>    xorriso -outdev /dev/sr0 \
>            -check_media use=outdev what=disc time_limit=1800 -- \
>      2>&1 | tee -i /tmp/xorriso_check_media_log

Seems to be virtually all of them. It starts like this:

xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

xorriso : NOTE : Disc status unsuitable for writing
Drive current: -outdev '/dev/sr0'
Media current: DVD-ROM
Media status : is written , is closed
Media summary: 1 session, 4101507 data blocks, 8011m data,     0 free
xorriso : UPDATE : 32 of 4101507 blocks read in 1 s , 0.0xD
libburn : SORRY : SCSI error on read_10(288,32): [5 6F 03] Illegal request. Read of scrambled sector without authentication.

Then the illegal request errors go on for as long as I could stand watching them :-).

(This was, of course, on rebo with the sata connected drive).

Comment 116 Steve 2020-04-17 19:34:22 UTC

(In reply to Tom Horsley from comment #114)
> Finally got the kernel bug submitted:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=207327

That bug id can also be put in the "Links" field at the top of the bug page. "Linux Kernel" seems to be the right choice for the "System".

> Some info I apparently neglected to save previously is missing, so here it is again:
> 
> The other enclosure I got a chance to test was a Vantec NST-536S3-BK NexStar DX
> (but I don't have it any longer, it was borrowed). It hug the same way.

Thanks for your followup reply. The specs page for the Vantec enclosure doesn't mention Linux, so that is a difference from the OWC enclosure:

https://www.vantecusa.com/products_detail.php?p_id=51&p_name=Vantec#tab-2

> I set the region in the LG drive using the linux regionset utility. While I had it connected to windows, I dove into the drive properties
> and found Windows believed it was a region 1 drive, so regionset seems to have worked properly.

Thanks, I didn't know about that:

$ dnf -Cq repoquery --whatprovides \*/regionset
regionset-0:0.2-13.fc31.x86_64

Comment 117 Thomas Schmitt 2020-04-17 19:59:50 UTC

Hi,

> > You could get an impression how many blocks are encrypted

> Seems to be virtually all of them. 

Then they don't get avoided by MS-Windows and dvdbackup, but it's rather
that those readers know more and better about DVD encryption than udev
and myself.

I don't feel much embarrassed for my ignorance in this aspect. :))

But if a bug report to systemd/udev is submitted, it should be mentioned
that dvdbackup seems to know how to enable the read operations which would
be needed for the binary search for the device's size.

> libburn : SORRY : SCSI error on read_10(288,32): [5 6F 03] Illegal request.
> Read of scrambled sector without authentication.

So the encrypted part begins at latest at block 288+31 = 319.

> Then the illegal request errors go on for as long as I could stand watching

If its only about the ugliness of error messages then we can
suppress them by xorriso command -report_about :

   xorriso -report_about update \
           -outdev /dev/sr0 \
           -report_about failure \
           -check_media use=outdev what=disc time_limit=1800 -- 

The messages about drive acquiration will appear. But then all messages
which would not end the program run will be suppressed.
The final summary will be emitted on standard output, nevertheless.

It would give us some statistical evidence, beyond ugliness.

-----------------------------------------------------------------------

Steve wrote:
> https://www.owcdigital.com/support/contact

If Tom, the customer, wants to submit a request, then i propose to
point the support to my summary in comment 99
  https://bugzilla.redhat.com/show_bug.cgi?id=1822948#c99
and to ask whether they have an idea how to learn more about what is
derailing the USB connection.

The content of "kernel-debug.txt" (alias "trace-4.txt") is the deepest
raw info of the problem we have for now.
  https://bugzilla.redhat.com/attachment.cgi?id=1679161

An example of the mishap begins at line

     systemd-udevd-7136  [000] .... 10101.875658: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)

is then probably handled by multiple occasions of

     usb-storage-4803 ...

until timeout happens

    kworker/1:1H-335   [001] .... 10132.015859: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

-----------------------------------------------------------------------

> https://bugzilla.kernel.org/show_bug.cgi?id=207327

Aha. You already made use of my summary.
Consider to add info how much of the DVD is found to be unreadable
without decryption preparations. (In trace-4 are successful reads to
blocks near the DVD's end.)

Now i am curious ...

Have a nice day :)

Thomas

Comment 118 Tom Horsley 2020-04-17 20:32:29 UTC

dvdbackup's goal in life is to dig up the decryption keys using libdvdcss (now available in the "tainted" rpmfusion repos), VLC for windows has the same code. I think as far as just loading the DVD, Windows doesn't touch anything but readable headers to get the label and wot-not (much like the info xorriso prints if I have udev turned off first).

Comment 119 Steve 2020-04-17 20:40:25 UTC

(In reply to Thomas Schmitt from comment #117)

Since you are delurking ... :-)

Could you comment on the SCSI READ_10 commands in the file* attached in Comment 81?

That is the direct-to-SATA case (trace-5.txt), and it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence with the other powers-of-two reads.

There are also non-sequential READ_10s beyond the lba=512 read (presumably because the drive/enclosure does not hang there in the direct-to-SATA case).

* systemd-udevd-scsi-commands-from-trace-5-txt-1.txt

Comment 120 Steve 2020-04-17 20:44:23 UTC

(In reply to Tom Horsley from comment #118)
> dvdbackup's goal in life is to dig up the decryption keys using libdvdcss
> (now available in the "tainted" rpmfusion repos), VLC for windows has the
> same code. I think as far as just loading the DVD, Windows doesn't touch
> anything but readable headers to get the label and wot-not (much like the
> info xorriso prints if I have udev turned off first).

Wireshark runs on Windows, so you might be able get a USB capture for that case:
https://www.wireshark.org/#download

Comment 121 Steve 2020-04-17 21:27:07 UTC

Created attachment 1679760 [details]
Screenshot showing Wireshark on Windows 10 capturing USB traffic including a SCSI Read Capacity command

This screenshot shows Wireshark running on Windows 10.

The capture shows USB packets as an external USB disk drive was attached to the Windows 10 system.

In particular, it shows that Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive.

Comment 122 Thomas Schmitt 2020-04-17 22:08:58 UTC

Hi,

> Since you are delurking ... :-)

Yeah. That plan failed thoroughly.

> Could you comment on the SCSI READ_10 commands in the file* attached
> in Comment 81?
> [systemd-udevd-scsi-commands-from-trace-5-txt-1.txt]

Well, first i wondered why the READ_10 attempts to LBA 513 don't show up.
(They come from kworker/6:1H-469 instead of systemd-udevd-*. So they
did not pass the grep.)

Then i see an allegedly successful READ_10 to LBA 512 in trace-5.txt.
First it fails for systemd-udevd-2532:

8703:   systemd-udevd-2532  [001] ....   627.381674: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)
8716:    kworker/6:1H-469   [006] ....   627.424401: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

Then it succeeds within 0.02 seconds for kworker/1:1H-536:

8718:    kworker/1:1H-536   [001] ....   627.441450: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00)
8789:          <idle>-0     [006] d.h.   628.987442: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)

As if a repetition of the SCSI command would switch to the behavior of
the old pioneer drive.

(For the new drive and USB box, any repetition would not happen.)

> it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence
> with the other powers-of-two reads.

But too late to have served a potential binary search which would have
caused the 8, 16, 32, 64, 128, 256, 512 sequence.

There are also reads to 18, 34, 66, 130, 194.
18 is probably part of the ISO 9660 Volume descriptors. The others could
be in the range where ISO 9660 directory entries are stored. UDF normally
begins after 256.

I fail to invent a fully plausible theory what purpose is behind these
read attempts.

> Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive.

That would be "cmnd=(0x25" and deliver the LBA of the last readable block.

It's not in trace-5.txt.
But the xorriso log on SATA has it

  READ CAPACITY
  25 00 00 00 00 00 00 00 00 00
  From drive: 8b
  00 3e 95 82 00 00 08 00
      376 us     [ 12227982 ]

The reply is 0x003e9582 = 4101506.
So xorriso reports to the user
  Media blocks : 4101507 readable , 0 writable , 4101507 overall

This is part of the riddle why udev begins to search for the end block
when it can inquire it from the kernel which could inquire it like
above.

Have a nice day :)

Thomas

Comment 123 Thomas Schmitt 2020-04-17 22:17:54 UTC

Correction:
The repeated read to 512 succeeds in 1.5 seconds. Not in 0.02 seconds.

Comment 124 Tom Horsley 2020-04-18 01:04:59 UTC

Created attachment 1679788 [details]
output from xorriso check media command

I just let this run for a couple of hours or so and it eventually finished with a vast collection of interleaved error and no error results.

Comment 125 Steve 2020-04-18 01:37:10 UTC

Created attachment 1679790 [details]
SCSI commands for host_no=5 from trace-5.txt

As Thomas notes in Comment 122, my earlier attempt at extracting SCSI commands from trace-5.txt excluded major events.

This includes every SCSI event for host_no=5.

Columns are timestamp, process descriptor, event code, SCSI command or SCSI result.

Event codes are as follows:

"s" = start
"d" = done
"t" = timeout
"w" = wakeup

"-" means "not present".

Example output:

628.988440: kworker/6:1H-469   |s|cmnd=(READ_10 lba=513 txlen=1 protect=0 raw=28 00 00 00 02 01 00 00 01 00)
                                 |-
629.480827: kworker/6:1H-469   |t|cmnd=(READ_10 lba=513 txlen=1 protect=0 raw=28 00 00 00 02 01 00 00 01 00) 
                                 |result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD)
629.498929: <idle>-0           |w|-
                                 |-

Comment 126 Steve 2020-04-18 01:42:47 UTC

Created attachment 1679791 [details]
trace-5.txt analyzer script used to generate scsi-commands-for-host_no-5-from-trace-5-txt-1.txt

For documentation purposes only.

Comment 127 Thomas Schmitt 2020-04-18 08:29:15 UTC

Hi,

> https://bugzilla.redhat.com/attachment.cgi?id=1679788
> it eventually finished with a vast collection of interleaved error 
> and no error results.

That's really unexpected. The unreadable intervals are larger than the
readable ones. 4,014,176 versus 87,331 blocks.
But the number of readables and the sucessful repetition of the READ_10
to lba=512 in trace-5.txt give me the idea that the drive is not overly
steady in its refusals.

Have a nice day :)

Thomas

Comment 128 Steve 2020-04-18 11:13:41 UTC

The "blkid" command is invoked by certain udev rules for optical drives.*

Although those rules do not explicitly use the "--probe" option**, which bypasses the cache, I am guessing that a hang can be induced by this procedure:

1. Disable systemd-udevd as before.
2. Load a commercial DVD.
3. Run:
   # blkid --probe /dev/sr0

Further, if SCSI tracing is enabled, I am guessing that the powers-of-two search will be observed.

The "blkid" command was tested in a VM with the F31 Live DVD ISO image configured as a virtual disk:

# blkid --probe /dev/sr0
/dev/sr0: UUID="2019-10-23-23-21-29-00" BOOT_SYSTEM_ID="EL TORITO SPECIFICATION" VERSION="Joliet Extension" LABEL="Fedora-WS-Live-31-1-9" TYPE="iso9660" USAGE="filesystem" PTUUID="6f869649" PTTYPE="dos"

== footnotes ==

* Udev rules for optical drives:

$ grep -A8 -m1 probe /usr/lib/udev/rules.d/60-persistent-storage.rules
# probe filesystem metadata of optical drives which have a media inserted
KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="?*", \
  IMPORT{builtin}="blkid --offset=$env{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}"
# single-session CDs do not have ID_CDROM_MEDIA_SESSION_LAST_OFFSET
KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="", \
  IMPORT{builtin}="blkid --noraid"

# probe filesystem metadata of disks
KERNEL!="sr*", IMPORT{builtin}="blkid"

** $ man blkid

Comment 129 Tom Horsley 2020-04-18 14:07:06 UTC

That sounds like I might be able to override just the udev rules for cdrom media and make udev utterly ignore cdrom activity. Might be worth a try at least as a workaround.

Comment 130 Steve 2020-04-18 16:07:47 UTC

(In reply to Tom Horsley from comment #129)
> That sounds like I might be able to override just the udev rules for cdrom media and make udev utterly ignore cdrom activity.
> Might be worth a try at least as a workaround.

Good idea. Although, you will probably want to check that your apps still work. An alternative would be to configure your own program to do something less intrusive on the DVD drive.*

BTW, the rules I listed in Comment 128 have an assignment to this variable: "IMPORT{builtin}". You can list those with:

$ udevadm test-builtin --help
udevadm test-builtin [OPTIONS] COMMAND DEVPATH

Test a built-in command.
...
Commands:
  blkid           Filesystem and partition probing <<<<<< udev has its own version of "blkid".
...

And that can be invoked with:

$ udevadm test-builtin blkid /sys/class/block/sr0

However, it doesn't seem to accept a "--probe" option, so it must not be the same as "/usr/sbin/blkid":

$ udevadm test-builtin blkid --probe /sys/class/block/sr0
test-builtin: unrecognized option '--probe'

* According to "man udev", you can assign to "IMPORT{type}", where "type" can be "program".

Comment 131 Steve 2020-04-18 16:27:09 UTC

(In reply to Steve from comment #130)
> An alternative would be to configure your own program to do something less intrusive on the DVD drive.

A wrapper shell script that returns dummy data in a valid format when passed sr0 might work.

Otherwise it would invoke the "real" udev builtin "blkid", which in my VM returns:

$ udevadm test-builtin blkid /sys/class/block/sr0 2>/dev/null
ID_FS_UUID=2019-10-23-23-21-29-00
ID_FS_UUID_ENC=2019-10-23-23-21-29-00
ID_FS_BOOT_SYSTEM_ID=EL\x20TORITO\x20SPECIFICATION
ID_FS_VERSION=Joliet Extension
ID_FS_LABEL=Fedora-WS-Live-31-1-9
ID_FS_LABEL_ENC=Fedora-WS-Live-31-1-9
ID_FS_TYPE=iso9660
ID_FS_USAGE=filesystem
ID_PART_TABLE_UUID=6f869649
ID_PART_TABLE_TYPE=dos

The format is a list of key/value pairs per "man udev":

IMPORT{type}
...
"program"
Execute an external program specified as the assigned value and, if it returns successfully, import its output, which must be in environment key format. Path specification, command/argument separation, and quoting work like in RUN.

Comment 132 Steve 2020-04-18 16:40:41 UTC

The version of "blkid" in the util-linux package produces identical output when invoked with the "-o udev" option:

$ /usr/sbin/blkid --probe -o udev /dev/sr0
ID_FS_UUID=2019-10-23-23-21-29-00
ID_FS_UUID_ENC=2019-10-23-23-21-29-00
ID_FS_BOOT_SYSTEM_ID=EL TORITO SPECIFICATION
ID_FS_VERSION=Joliet\x20Extension
ID_FS_LABEL=Fedora-WS-Live-31-1-9
ID_FS_LABEL_ENC=Fedora-WS-Live-31-1-9
ID_FS_TYPE=iso9660
ID_FS_USAGE=filesystem
ID_PART_TABLE_UUID=6f869649
ID_PART_TABLE_TYPE=dos

NB: I used the full path for clarity. The command is in this package:

$ rpm -qf /usr/sbin/blkid
util-linux-2.34-4.fc31.x86_64

Comment 133 Tom Horsley 2020-04-18 16:54:42 UTC

Amazing! I copied /usr/lib/udev/rules.d/60-persistent-storage.rules to /etc/udev/rules.d/ and changed the two lines claiming to be for optical media to GOTO="persistent_storage_end" instead of doing the IMPORT from blkid.

I then rebooted to make absolutely sure the new rule was in effect (I don't trust the inotify magic it claims to use :-).

I loaded a data dvd then ran a mount command and it worked fine.

I then loaded the movie DVD I've been using for testing, and dvdbackup is reading it perfectly well. No more hangs:

zooty> dvdbackup -M -i /dev/sr0 -o /home/tom

libdvdread: Attempting to retrieve all CSS keys
libdvdread: This can take a _long_ time, please be patient

libdvdread: Get key for /VIDEO_TS/VIDEO_TS.VOB at 0x00000126
libdvdread: Elapsed time 0
libdvdread: Get key for /VIDEO_TS/VTS_01_0.VOB at 0x0000c8de
libdvdread: Elapsed time 0
libdvdread: Get key for /VIDEO_TS/VTS_01_1.VOB at 0x0000cf23
libdvdread: Elapsed time 0
libdvdread: Get key for /VIDEO_TS/VTS_02_0.VOB at 0x001f7686
libdvdread: Elapsed time 0
libdvdread: Get key for /VIDEO_TS/VTS_02_1.VOB at 0x001f7ccb
libdvdread: Elapsed time 0
libdvdread: Found 2 VTS's
libdvdread: Elapsed time 0

I've never had any use for the automounting junk anyway (heck, I know I just put a disk in - I was there doing it). This seems to be a good work around.

Comment 134 Steve 2020-04-18 17:31:20 UTC

(In reply to Tom Horsley from comment #133)
> This seems to be a good work around.

Awesome!

While you are at it, could you see if this causes a hang:

# blkid --probe /dev/sr0

If so, that might be grounds for opening a bug against util-linux:

$ rpm -qf /usr/sbin/blkid
util-linux-2.34-4.fc31.x86_64

Comment 135 Steve 2020-04-18 17:39:31 UTC

(In reply to Tom Horsley from comment #133)
> Amazing! I copied /usr/lib/udev/rules.d/60-persistent-storage.rules to /etc/udev/rules.d/ and changed the two lines claiming to be for optical media to GOTO="persistent_storage_end" instead of doing the IMPORT from blkid.
...

Could you post a diff (or patch) for your workaround, so there is an exact record of what you did?

Comment 136 Tom Horsley 2020-04-18 17:52:11 UTC

Just to see if anything happens, I submitted bug 1825533 against blkid.

Here's the diff, just changed two lines:

--- /usr/lib/udev/rules.d/60-persistent-storage.rules   2020-03-26 09:34:16.000000000 -0400
+++ /etc/udev/rules.d/60-persistent-storage.rules       2020-04-18 12:32:46.837614823 -0400
@@ -100,10 +100,10 @@
 
 # probe filesystem metadata of optical drives which have a media inserted
 KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="?*", \
-  IMPORT{builtin}="blkid --offset=$env{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}"
+  GOTO="persistent_storage_end"
 # single-session CDs do not have ID_CDROM_MEDIA_SESSION_LAST_OFFSET
 KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="", \
-  IMPORT{builtin}="blkid --noraid"
+  GOTO="persistent_storage_end"
 
 # probe filesystem metadata of disks
 KERNEL!="sr*", IMPORT{builtin}="blkid"

Comment 137 Thomas Schmitt 2020-04-18 18:21:21 UTC

Hi,

congrats to the workaround.

> +  GOTO="persistent_storage_end"

And by such classic means. :))

Computer science tries to eliminate GOTO since more than 50 years
without any success.
  http://www.u.arizona.edu/~rubinson/copyright_violations/Go_To_Considered_Harmful.html
Roll over Dijkstra and Wirth !

Have a nice day :)

Thomas

Comment 138 Steve 2020-04-18 21:13:29 UTC

(In reply to Thomas Schmitt from comment #122)

> Hi,

>> Since you are delurking ... :-)

> Yeah. That plan failed thoroughly.

...
> Well, first i wondered why the READ_10 attempts to LBA 513 don't show up.
> (They come from kworker/6:1H-469 instead of systemd-udevd-*. So they did not pass the grep.)

Thanks for pointing that out. Although it is moot now, I rewrote the awk script to display all events for "host_no=5". That provides much better insight into what is going on.

> Then i see an allegedly successful READ_10 to LBA 512 in trace-5.txt.
> First it fails for systemd-udevd-2532:
...
> Then it succeeds within 0.02 seconds for kworker/1:1H-536:
...

Yes, it seems peculiar that a timeout would occur in less time than a successful transaction.

> As if a repetition of the SCSI command would switch to the behavior of the old pioneer drive.
> (For the new drive and USB box, any repetition would not happen.)

OK.

>> it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence with the other powers-of-two reads.

> But too late to have served a potential binary search which would have caused the 8, 16, 32, 64, 128, 256, 512 sequence.

Good point.

> There are also reads to 18, 34, 66, 130, 194.
> 18 is probably part of the ISO 9660 Volume descriptors. The others could be in the range where ISO 9660 directory entries are stored. UDF normally begins after 256.

So those would be "non-sizing" reads.

> I fail to invent a fully plausible theory what purpose is behind these read attempts.

udev/blkid need some of that info to generate their output. Comment 131 has an example.

There could also be directory caching for a file manager or a video player, which would certainly need to read more than just the Volume descriptors. 

> Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive.

> That would be "cmnd=(0x25" and deliver the LBA of the last readable block.

OK. That is in pane 2 of the Wireshark on Windows screenshot. (Comment 121)

> It's not in trace-5.txt.
> But the xorriso log on SATA has it

>   READ CAPACITY
>   25 00 00 00 00 00 00 00 00 00
...

Wireshark shows all 10 bytes in the bottom pane. Wireshark is usually used to capture networking packets but it works very well with USB traffic too:

$ rpm -q wireshark
wireshark-3.2.0-1.fc31.x86_64

> The reply is 0x003e9582 = 4101506.
> So xorriso reports to the user
>   Media blocks : 4101507 readable , 0 writable , 4101507 overall

So it seems that the kernel could be dropping some transfers when tracing.

> This is part of the riddle why udev begins to search for the end block when it can inquire it from the kernel which could inquire it like
above.

My current theory is that udev/blkid are somehow running of the rails when trying to identify the disc and using a fallback sizing algorithm. Assuming the disc is correctly mastered, that would mean that there is a bug in udev/blkid. (Bug 1825533)

> Have a nice day :)

> Thomas

Comment 139 Steve 2020-04-18 21:54:43 UTC

(In reply to Tom Horsley from comment #136)
> Just to see if anything happens, I submitted bug 1825533 against blkid.

I'm guessing the problem is some easily fixed misconfiguration, because libblkid is designed to be configurable*.

The libblkid man page, which is in libblkid-devel**, says:

"The  libblkid library is used to identify block devices (disks) as to their content (e.g. filesystem type) as well as extracting additional information such as filesystem labels/volume names, unique identifiers/serial numbers. A common use is to allow use of LABEL= and UUID= tags instead of hard-coding specific block device names into configuration files."

libblkid also supports caching, which would explain why blkid seems to be reading more than is necessary for mere identification.

> Here's the diff, just changed two lines:
...

Thanks. That's much simpler than anything I had in mind. :-)

== footnotes ==

* With /etc/blkid.conf. See "man 8 blkid".
 
** $ rpm -q libblkid-devel
libblkid-devel-2.34-4.fc31.x86_64

Comment 140 Steve 2020-04-18 22:32:36 UTC

(In reply to Thomas Schmitt from comment #122)
...
> It's not in trace-5.txt.
> But the xorriso log on SATA has it
> 
>   READ CAPACITY
>   25 00 00 00 00 00 00 00 00 00
>   From drive: 8b
>   00 3e 95 82 00 00 08 00
>       376 us     [ 12227982 ]
> 
> The reply is 0x003e9582 = 4101506.
> So xorriso reports to the user
>   Media blocks : 4101507 readable , 0 writable , 4101507 overall
...

So it looks like blkid is probing the last two blocks of the disc instead of sending a SCSI "READ CAPACITY" request:

626.759905: systemd-udevd-2532 |s|cmnd=(READ_10 lba=4101504 txlen=2 protect=0 raw=28 00 00 3e 95 80 00 00 02 00)
626.991589: systemd-udevd-2532 |s|cmnd=(READ_10 lba=4101506 txlen=1 protect=0 raw=28 00 00 3e 95 82 00 00 01 00)

Those succeed, but why couldn't blkid do BOTH? (Trust but verify. :-))

Comment 141 Steve 2020-04-18 22:47:00 UTC

"... probing the last two blocks of the disc ..."

Correcting myself: 4101504,4101505 for the first of the two reads (txlen=2).
                   4101506 for the second of the two reads (txlen=1).

That's *three* ... :-)

(Assuming that "txlen" is a block count.)

"... 4101507 overall"

Presumably, blocks are numbered starting from 0.

Comment 142 Thomas Schmitt 2020-04-19 07:27:45 UTC

Hi,

> So it looks like blkid is probing the last two blocks of the disc
> instead of sending a SCSI "READ CAPACITY" request:

The purpose of the first reads to the end blocks might be to detect
a possible GPT backup header block. (*)

But to know the LBA of the last blocks it has to already know the size.
(This information can be obtained by READ CAPACITY or READ DISC INFORMATION
and READ TRACK INFORMATION.)

So it is extra riddling why udev begins to suck off the lower blocks
of the DVD.

-------------------------------------------------------------------
(*)
I remember an incident about Archlinux ISOs which contained a blkid
program that was curious about GPT. When the ISO was burnt to CD by
write type TAO, and the reader was in the majority group of drives
which report the Run-out blocks of CD TAO as readable, then this read
attempt failed and Archlinux did not boot from the CD.

This was repaired by making blkid tolerant towards read failure.
Probably this is also the reason why not only the last block but also
the second and third last block are read.

The two Run-out sectors of TAO CDs are not readable by READ(10) but
only by READ CD commands. So the drives which count them as readable
are not really liars.
("I'm not crazy ! Don't call me crazy ! I'm just not user friendly !"
 - Roberto in Futurama.)

This drive dependent ambiguity with TAO CD and READ CAPACITY plus the
credulence of linux-scsi establish the old Linux TAO Read-Ahead Bug.
It caused the cargo cult of adding 300 KB of padding to CD burn runs
and to the production of ISO 9660 images.

Have a nice day :)

Thomas

Comment 143 Steve 2020-04-19 10:50:37 UTC

(In reply to Thomas Schmitt from comment #142)
...
> But to know the LBA of the last blocks it has to already know the size.
> (This information can be obtained by READ CAPACITY or READ DISC INFORMATION and READ TRACK INFORMATION.)
...

Thanks for pointing that out. 

"READ_CAPACITY" is in the trace logs -- I just didn't think to look for it before now. :-)

From Comment 57 (analysis of trace-4.txt):

systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00)

Likewise for Attachment 1679790 [details] (analysis of trace-5.txt: scsi-commands-for-host_no-5-from-trace-5-txt-1.txt):

626.731734: systemd-udevd-2532 |s|cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00)

"READ_CAPACITY" is, of course, in the original trace logs:

Attachment 1679161 [details] (kernel-debug.txt aka trace-4.txt)
Attachment 1679482 [details] (trace-5.txt)


BTW, this is a perfect example of the phenomenon of "seeing but not observing" as described by Sherlock Holmes in "A Scandal in Bohemia".

(cited by Maria Konnikova in "Mastermind: How to Think Like Sherlock Holmes")

Comment 144 Steve 2020-04-19 11:20:25 UTC

(In reply to Steve from comment #143)
...
> "READ_CAPACITY" is in the trace logs ...
...

The replies are there too, but they are not SCSI commands so they don't show up in my analyses.

The replies come back as USB packets. I believe they are logged as "BULK" packets, as in this sequence of records from trace-4.txt.

Does "length 31" mean the length of the READ_CAPACITY reply? And does "Slot 7 ep2out" mean the recipient of the reply?

systemd-udevd-7136  [001] .... 10101.238988: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00)

usb-storage-4803  [000] .... 10101.239052: xhci_urb_enqueue: ep2out-bulk: urb 0000000036e6ba93 pipe 3221292032 slot 7 length 0/31 sgs 0/0 stream 0 flags 00000004

usb-storage-4803  [000] d... 10101.239057: xhci_queue_trb: BULK: Buffer 000000040615c080 length 31 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c

usb-storage-4803  [000] d... 10101.239058: xhci_inc_enq: BULK 00000000ae4f77c9: enq 0x0000000407fd6430(0x0000000407fd6000) deq 0x0000000407fd6420(0x0000000407fd6000) segs 2 stream 0 free_trbs 508 bounce 1024 cycle 1

usb-storage-4803  [000] d... 10101.239059: xhci_ring_ep_doorbell: Ring doorbell for Slot 7 ep2out

Comment 145 Thomas Schmitt 2020-04-19 11:45:10 UTC

Hi,

> systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY

Grmpf. I searched for "cmnd=(0x25" and thought there were none.

> Sherlock Holmes 

The grand master of hunting nasty bugs, indeed.
"Once you eliminate the impossible, whatever remains, no matter how
improbable, must be the truth."

Ok. It is obviously not a binary search for the size what we see.
(This was already improbable by the late reading of blocks 2 to 7.)

> Does "length 31" mean the length of the READ_CAPACITY reply?

Not on SCSI level. The data reply from the drive is 8 bytes long.
But of course there might be USB stuff wrapped around the SCSI payload.

The payload is obviously not shown in trace-4.txt. Nothing to see that
looks like the byte sequence 3e 95 82, which the drive replied to
libburn underneath xorriso.

Have a nice day :)

Thomas

Comment 146 Steve 2020-04-19 11:57:31 UTC

(In reply to Thomas Schmitt from comment #145)
...
> Grmpf. I searched for "cmnd=(0x25" and thought there were none.
...

Now that you point that out, I <observe> that the logs use "raw=25":  :-)

"... cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) ..."

Comment 147 Steve 2020-04-19 12:26:29 UTC

(In reply to Thomas Schmitt from comment #145)
...
> > Does "length 31" mean the length of the READ_CAPACITY reply?
> 
> Not on SCSI level. The data reply from the drive is 8 bytes long.
> But of course there might be USB stuff wrapped around the SCSI payload.
...

Thanks. The "length 31" appears to be the length of the encapsulated SCSI READ_CAPACITY command in the USB packet. From my Wireshark capture on Windows*:

Frame 23: 58 bytes on wire (464 bits), 58 bytes captured (464 bits) on interface wireshark_extcap1640, id 0
USB URB
    [Source: host]
    [Destination: 2.2.2]
    USBPcap pseudoheader length: 27
    IRP ID: 0xffff92014d97b460
    IRP USBD_STATUS: USBD_STATUS_SUCCESS (0x00000000)
    URB Function: URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER (0x0009)
    IRP information: 0x00, Direction: FDO -> PDO
    URB bus id: 2
    Device address: 2
    Endpoint: 0x02, Direction: OUT
    URB transfer type: URB_BULK (0x03)
    Packet Data Length: 31  <<<<<<<<<<<<<<<<<<<<<<<<<<<<< 31
    [Response in: 24]
    [bInterfaceClass: Mass Storage (0x08)]
USB Mass Storage
    Signature: 0x43425355
    Tag: 0x4d97b460
    DataTransferLength: 8
    Flags: 0x80
    .000 .... = Target: 0x0 (0)
    .... 0000 = LUN: 0x0
    ...0 1010 = CDB Length: 0x0a
SCSI CDB Read Capacity(10)
    [LUN: 0x0000]
    [Command Set:Direct Access Device (0x00) (Using default commandset)]
    [Response in: 28]
    Opcode: Read Capacity(10) (0x25)
    Control: 0x00

* The Wireshark capture was saved to a file in Windows, transferred to my Linux system, and I am viewing the file with Wireshark on Linux.

Comment 148 Steve 2020-04-19 12:44:59 UTC

The hexdump is easier to understand. I've marked the lengths:

0000   1b 00 60 b4 97 4d 01 92 ff ff 00 00 00 00 09 00   ..`..M..........
0010   00 02 00 02 00 02 03 1f 00 00 00 55 53 42 43 60   ...........USBC`
0020   b4 97 4d 08 00 00 00 80 00 0a 25 00 00 00 00 00   ..M.......%.....
0030   00 00 00 00 00 00 00 00 00 00                     ..........

                                  ^^ 0a is the length of the following SCSI command 25.
                            ^^ 1f is the length of the SCSI command 25 with signature and tag (in little-endian order).
                ^^ That 08 could be the length of the expected SCSI command 25 reply. ("DataTransferLength" in Comment 147)
       ^^ 1b is the length of the "USBPcap pseudoheader" (per Comment 147)

There appear to be six additional 00 bytes after the SCSI command 25.

Comment 149 Steve 2020-04-19 14:48:23 UTC

I finally found an actual anomaly. In trace-4.txt, after two of the timeouts, there is no wakeup:

$ cat trace-4.txt | awk '/host_no=5/ { print $4, $5 }' | fgrep -A20 'timeout'
10132.015859: scsi_dispatch_cmd_timeout:
10132.153972: scsi_dispatch_cmd_start:   <<< no wakeup here
10132.155549: scsi_dispatch_cmd_done:
10132.155618: scsi_dispatch_cmd_start:
10132.155909: scsi_dispatch_cmd_done:
10132.155951: scsi_dispatch_cmd_start:
10132.156207: scsi_dispatch_cmd_done:
10132.156237: scsi_dispatch_cmd_start:
10162.224030: scsi_dispatch_cmd_timeout:
10162.369979: scsi_dispatch_cmd_start:   <<< no wakeup here
10162.370359: scsi_dispatch_cmd_done:
10162.370474: scsi_dispatch_cmd_start:
10192.432179: scsi_dispatch_cmd_timeout:
10192.438071: scsi_eh_wakeup:            <<< but there is a wakeup here

In contrast, trace-5.txt shows a wakeup after each timeout:

$ cat trace-5.txt | awk '/host_no=5/ { print $4, $5 }' | fgrep -A1 'timeout'
626.713247: scsi_dispatch_cmd_timeout:
626.729892: scsi_eh_wakeup:
--
627.424401: scsi_dispatch_cmd_timeout:
627.440905: scsi_eh_wakeup:
--
629.480827: scsi_dispatch_cmd_timeout:
629.498929: scsi_eh_wakeup:
--
630.630342: scsi_dispatch_cmd_timeout:
630.646961: scsi_eh_wakeup:
--
630.704811: scsi_dispatch_cmd_timeout:
630.720960: scsi_eh_wakeup:
--
630.992037: scsi_dispatch_cmd_timeout:
631.008961: scsi_eh_wakeup:
--
631.051348: scsi_dispatch_cmd_timeout:
631.067966: scsi_eh_wakeup:

If you want to look at the raw data, run these:

$ egrep 'timeout|wakeup' trace-4.txt
$ egrep 'timeout|wakeup' trace-5.txt

Comment 150 Steve 2020-04-19 15:17:45 UTC

The "eh" in the function name means "error handler":

$ git grep -l scsi_eh_wakeup drivers/
drivers/scsi/scsi_error.c
...

So the USB-attached case and the SATA-attached case are different in how errors are being handled.

That points the finger at the kernel ...

Comment 151 Steve 2020-04-19 17:11:31 UTC

(In reply to Tom Horsley from comment #133)

> I've never had any use for the automounting junk anyway ...

I stopped using gnome as my prime desktop for that reason and others involving too much automation. Now I use Xfce, which isn't so "busy".

Anyway, I attached strace to systemd-udevd to see what happens. Here are the results:

# strace -f -p `pgrep systemd-udevd` -o strace-1.txt

Insert and remove a USB flash drive.

Terminate strace with ctrl-c.

# fgrep -m1 exec strace-1.txt 
2275  execve("/usr/lib/udev/mtp-probe", ["/usr/lib/udev/mtp-probe", "/sys/devices/pci0000:00/0000:00:"..., "3", "4"], 0x5643824dd340 /* 26 vars */) = 0

I know my USB flash drive is not a media player, but udev checks for me anyway. :-)

$ rpm -qf /usr/lib/udev/mtp-probe
libmtp-1.1.16-2.fc30.x86_64

$ rpm -q systemd-udev
systemd-udev-241-14.git18dd3fb.fc30.x86_64

Comment 152 Steve 2020-04-20 12:13:38 UTC

Tom: Could you please add a link to the upstream bug in the "Links" field at the top of the bug page so it is easy to find. The "System" is "Linux Kernel".

Thomas: Thanks for your possible explanation for why the second and third last blocks are being read by blkid. And thanks for the informative and entertaining background on disc formatting. (Comment 142)

Comment 153 Tom Horsley 2020-04-21 13:56:09 UTC

(In reply to Steve from comment #152)
> Tom: Could you please add a link to the upstream bug in the "Links" field at
> the top of the bug page so it is easy to find. The "System" is "Linux
> Kernel".

I can't see any field named "Links" or find any place I might be able to point
at the upstream bug other than perhaps putting it in the URL: field.

Comment 154 Steve 2020-04-21 15:15:18 UTC

(In reply to Thomas Schmitt from comment #142)
> But to know the LBA of the last blocks it has to already know the size.
> (This information can be obtained by READ CAPACITY or READ DISC INFORMATION and READ TRACK INFORMATION.)

This gets more interesting all the time. :-) I tried enabling scsi tracing on my F30 system while connecting and disconnecting an external USB hard drive and found several of these in the trace log:

kworker/u8:1-9073  [000] .... 14855.987940: scsi_dispatch_cmd_start: host_no=6 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00)

"raw=9e 10" appears to be yet another variant of SCSI support for disk sizing. There must be an "extended" SCSI command set.

NB: "host_no=6" is the external USB hard drive. (per "journalctl")

Comment 155 Thomas Schmitt 2020-04-21 15:43:41 UTC

Hi,

> cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e
>
> "raw=9e 10" appears to be yet another variant of SCSI support for disk
> sizing.
> There must be an "extended" SCSI command set.

It is a tradition in SCSI to introduce extended versions of older
commands. Like 88h READ(16) extends A8h READ(12) which extends
28h READ(10) which extends 08h READ(6). (Whew wheeze.)

SPC-3 lists as command 9Eh only SERVICE ACTION IN(16) without giving
any SCSI volume (like MMC or SBC) where it would be documented.
The internet says that 9Eh READ CAPACITY(16) belongs to SBC-3.


Have a nice day :)

Thomas

Comment 156 Steve 2020-04-21 18:03:43 UTC

(In reply to Thomas Schmitt from comment #155)
> Hi,
> 
> > cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e
> >
> > "raw=9e 10" appears to be yet another variant of SCSI support for disk sizing.
> > There must be an "extended" SCSI command set.
> 
> It is a tradition in SCSI to introduce extended versions of older commands. Like 88h READ(16) extends A8h READ(12) which extends 28h READ(10) which extends 08h READ(6). (Whew wheeze.)

Thanks. That trend must be motivated by the need to accommodate larger devices. The USB hard drive is 500GB:

$ lsscsi -ts
[2:0:0:0]    disk    sata:[wwn removed..*]           /dev/sda    500GB
[4:0:0:0]    disk    sata:[wwn removed...]           /dev/sdb    500GB
[5:0:0:0]    cd/dvd  sata:                           /dev/sr0        -
[6:0:0:0]    disk    usb:3-11:1.0                    /dev/sdc    500GB

> SPC-3 lists as command 9Eh only SERVICE ACTION IN(16) without giving any SCSI volume (like MMC or SBC) where it would be documented.
> The internet says that 9Eh READ CAPACITY(16) belongs to SBC-3.

The official docs require registration to view, but Seagate has a doc that gives the details in:

3.23 READ CAPACITY (16) command (p. 157)

SCSI Commands Reference Manual
Fibre Channel (FC)
Serial Attached SCSI (SAS)
October 2016
https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf

(Also documented in Table 30, page 66)

> Have a nice day :)
> 
> Thomas

* The "lsscsi" command's "-t" option displays the "WWN", which is the "World Wide Number". Apparently, SCSI devices have unique identifiers that probably shouldn't be posted on the internet. I had to use the "hdparm -I" command to figure out what that number is.

Comment 157 Ben Cotton 2020-11-03 16:35:22 UTC

This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 31 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 158 Ben Cotton 2020-11-24 17:17:53 UTC

Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.

airlied
bskeggs
hdegoede
ichavero
itamar
jarodwilson
jeremy
jglisse
john.j5live
jonathan
josef
kernel-maint
linville
masami256
mchehab
mjg59
samuel-rhbugs
scdbackup
steved
y9t7sypezp