Created attachment 1677807 [details] output from requested journalctl command above 1. Please describe the problem: With a commercial Movie DVD (not a Blu-ray) in my drive, any open() call on /dev/sr0 appears to hang forever. This happens with my sata drive in a external USB3 enclosure connected via a USB3 port. I decided this must be a linux kernel problem when I discovered I could play the same DVD on the same drive just find on a Windows 10 system using the Windows version of VLC. I then booted that exact same Windows box with the fedora 31 iso image from a USB stick, and watched it hang with any program that tried to open /dev/sr0. The drive: LG WH16NS40 Super Multi Blue Internal SATA 16x Blu-ray Disc Rewriter The enclosure: OWC Mercury Pro 5.25" Optical Drive External Enclosure Everything else works fine - cdrom, data DVD (not encrypted), Blu-ray, and writing to everything I've tried. It is just when a movie DVD is in the drive that the hang happens. 2. What is the Version-Release number of the kernel: The live USB running in the Windows box is kernel 5.3.7-301.fc31.x86_64 The identical hang happens on my desktop running 5.5.15-200.fc31.x86_64 3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 : Since this is a brand new drive, I don't know if it worked before, but I can say that a very old slim usb powered pioneer BDR-XD05S works perfectly fine on the same systems where the LG drive hangs. 4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below: Put DVD Movie disk in drive Try to open it with any program Watch it hang till you unplug or power off the drive, at which point it fails. 5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``: Can't really do that from live media 6. Are you running any modules that not shipped with directly Fedora's kernel?: Nope, booted directly from live workstation iso image 7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
More info: I had an ASUS optical drive that showed the same symptoms, and when I put the ASUS drive in a computer with an available drive bay and connected it with sata directly to the motherboard, it can read a movie DVD just fine, so it is apparently something about the USB interface that triggers the hang.
Just tried the ultimate experiment: took the LG drive mentioned above out of the eternal enclosure and put it in a drive bay in a computer which hung previously when trying the USB connection. With a direct sata connection, the exact same LG drive that hangs when using USB works fine when using sata. Yet it isn't a fundamental USB hardware problem because the USB interface works fine in Windows 10 and fails in the same windows box booted into linux from a live USB. Possibly this does have something to do with the new "phase II" region playback control, but I set the drive to region 1 and I'm trying to open a region 1 dvd in all my experiments. The old pioneer USB drive that works is using the phase 1 RPC. Whatever is going on has apparently been going on a long time. I tried an old version 3 kernel and it had the same problem.
Thanks for your detailed report. The USB enclosure is getting repeatedly reset: $ grep -n 'usb 4-1' dmesg.txt 972:Apr 10 10:42:40 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 3 using xhci_hcd 973:Apr 10 10:42:40 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 974:Apr 10 10:42:40 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 975:Apr 10 10:42:40 kernel: usb 4-1: Product: Mercury Pro Optical 976:Apr 10 10:42:40 kernel: usb 4-1: Manufacturer: Other World Computing 977:Apr 10 10:42:40 kernel: usb 4-1: SerialNumber: 002933001438 986:Apr 10 10:43:45 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 3 using xhci_hcd 990:Apr 10 10:44:09 kernel: usb 4-1: USB disconnect, device number 3 1004:Apr 10 10:46:32 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 4 using xhci_hcd 1005:Apr 10 10:46:32 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 1006:Apr 10 10:46:32 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 1007:Apr 10 10:46:32 kernel: usb 4-1: Product: Mercury Pro Optical 1008:Apr 10 10:46:32 kernel: usb 4-1: Manufacturer: Other World Computing 1009:Apr 10 10:46:32 kernel: usb 4-1: SerialNumber: 002933001438 1017:Apr 10 10:47:21 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd 1021:Apr 10 10:47:52 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 4 using xhci_hcd 1022:Apr 10 10:50:07 kernel: usb 4-1: USB disconnect, device number 4 1030:Apr 10 11:00:52 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 5 using xhci_hcd 1031:Apr 10 11:00:52 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 1032:Apr 10 11:00:52 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 1033:Apr 10 11:00:52 kernel: usb 4-1: Product: Mercury Pro Optical 1034:Apr 10 11:00:52 kernel: usb 4-1: Manufacturer: Other World Computing 1035:Apr 10 11:00:52 kernel: usb 4-1: SerialNumber: 002933001438 1043:Apr 10 11:01:42 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 5 using xhci_hcd 1047:Apr 10 11:02:12 kernel: usb 4-1: reset SuperSpeed Gen 1 USB device number 5 using xhci_hcd 1103:Apr 10 11:27:39 kernel: usb 4-1: USB disconnect, device number 5
Could you try the following: Eject any disc from the USB enclosure. Disconnect the USB enclosure from your system. (The enclosure appears to use an external power supply, so leave the enclosure powered up.) If you have auto-mounting enabled, try to disable it before running the test. In a full-screen terminal window, run: $ journalctl --no-hostname -k -f | tee /tmp/dmesg-2.txt That should log messages that you can watch as they are logged. Connect the USB enclosure to your system. Wait for log messages to stop scrolling. Load the test DVD. After it appears to have settled and log messages have stopped scrolling, eject the DVD, using the eject button, if possible. Press "ctrl-c" to terminate the logging. Attach /tmp/dmesg-2.txt. NB: Waiting for the log messages to stop scrolling may or may not be possible. In a test on my system with a USB flash drive, there was a small flurry of messages when the drive was inserted and another small furry when the drive was removed. In both cases, there were pauses after the flurry. If you see a continuous stream of messages about resets, please make a note of that.
Created attachment 1678574 [details] new dmesg output I'm pretty sure those resets were all from me turning off the drive to get the hung open to fail, here's the new log with me just loading the DVD, not even trying to read it.
Hi, in https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org/message/KWNGRQW4ETEFAVNRPMZBBCG7UHNZ22FK/ you report libburn's error message SG_IO: host_status= 0x7 SG_ERR_DID_ERROR which is a transport problem mentioned in https://www.tldp.org/HOWTO/SCSI-Generic-HOWTO/x291.html Further the libburn SCSI log snippets shown in that mail tell that some normally fast running SCSI commands for drive inspection completed successfully, but in the sum needed 180 seconds to finally run into SG_ERR_DID_ERROR with command 12h INQUIRY. So i deem it unlikely that it is about DRM regions. Have a nice day :) Thomas
(In reply to Tom Horsley from comment #5) > Created attachment 1678574 [details] > new dmesg output > > I'm pretty sure those resets were all from me turning off the drive to get the hung open to fail, here's the new log with me just loading the DVD, not even trying to read it. Thanks. There are two resets. Did you mean to change the USB port (4-1 -> 2-1)? Which USB controller the enclosure is connected to could affect the behavior, especially if some are USB 2.0 and some are USB 3.0. I'm not a USB expert, but Comment 3 shows "SuperSpeed Gen 1" and this one shows "high-speed". BTW, there is the tail end of a Call Trace at the beginning of dmesg-2.txt. Could you post the whole thing? $ grep -n 'usb 2-1' dmesg-2.txt 11:Apr 13 18:51:26 kernel: usb 2-1.2: USB disconnect, device number 3 12:Apr 13 18:57:35 kernel: usb 2-1.2: new high-speed USB device number 6 using ehci-pci 13:Apr 13 18:57:35 kernel: usb 2-1.2: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 14:Apr 13 18:57:35 kernel: usb 2-1.2: New USB device strings: Mfr=2, Product=3, SerialNumber=1 15:Apr 13 18:57:35 kernel: usb 2-1.2: Product: Mercury Pro Optical 16:Apr 13 18:57:35 kernel: usb 2-1.2: Manufacturer: Other World Computing 17:Apr 13 18:57:35 kernel: usb 2-1.2: SerialNumber: 002933001438 28:Apr 13 18:58:49 kernel: usb 2-1.2: reset high-speed USB device number 6 using ehci-pci 32:Apr 13 18:59:19 kernel: usb 2-1.2: reset high-speed USB device number 6 using ehci-pci
I may have done this most recent log on a different computer (but I thought it was the same one I used before but I lose track of which experiments I conduct where). This computer doesn't have any working USB 3 ports, but the USB drive hangs the exact same way in a usb 2 port. One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1). The call trace is just the trace everyone using rpmfusion nvidia drivers get every time X starts. See bug 1806257 for an example.
$ grep DMI: dmesg-1.txt Apr 10 06:36:12 kernel: DMI: ASUSTeK Computer INC. VM40B/VM40B, BIOS 1501 12/09/2014 That's a mini-PC. According to the specs it has: 2 x USB 3.1 Gen 1 4 x USB 2.0 https://www.asus.com/us/Mini-PCs/VivoPC_VM40B/specifications/
(In reply to Tom Horsley from comment #8) > I may have done this most recent log on a different computer (but I thought > it was the same one I used before but I lose track of which experiments I > conduct where). This computer doesn't have any working USB 3 ports, but the > USB drive hangs the exact same way in a usb 2 port. OK. > One other oddity I noticed. This computer has an ASUS drive connected via > sata, and it is normally /dev/sr0, but apparently plugging in the USB drive > stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1). Do you still show two drives? $ ls -l /dev/sr* > The call trace is just the trace everyone using rpmfusion nvidia drivers get > every time X starts. See bug 1806257 for an example. Thanks for mentioning that. dmesg-1.txt does not show any "taints" and is for an ASUS mini-PC. Could you post this for the system you used when capturing dmesg-2.txt so we can keep track of how you are testing: $ journalctl --no-hostname -k | egrep -i 'DMI:|taint'
You might be able to get more information about what is happening at the USB layer by running these commands as root from the command-line: # echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control # echo 'module usbcore =pmf' > /sys/kernel/debug/dynamic_debug/control # echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb # echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable Collect logs with: $ journalctl --no-hostname -k > /tmp/dmesg-1.txt $ sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt Those are from Mathias upstream, except that I have added the "m" and "f" flags to log the module and function names: https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21 I am not a USB expert, so this is speculative: Enable "xhci_hcd" logging for the "SuperSpeed Gen 1" controller. Enable "ehci-pci" logging for the "high-speed" controller. Or enable both. You can get a list of controllers with: $ journalctl --no-hostname -b | grep -i controller For more about the kernel's dynamic debugging facilities, see: Dynamic debug https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
(In reply to Tom Horsley from comment #8) ... > One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1). ... Could you open a new bug for that with component "systemd"? An already-connected drive should not be renamed when you connect a second drive. I believe that the component is "systemd" from reading the man page for "systemd-udevd", which says: "systemd-udevd listens to kernel uevents. For every event, systemd-udevd executes matching instructions specified in udev rules."
Created attachment 1678735 [details] rerun of journalctl on the asus mini-pc with usb debugging on I've run the journalctl command, this time on the mini-pc I used for the first set of journalctl output and with the usb debug enabled. There was a lot of clutter from linux not liking the wifi chip, so I disabled wifi and ran it for a second time, there may be stuff at the beginning from the first try.
Created attachment 1678737 [details] the usb debug trace output from the same run No doubt this has output from both times I loaded the DVD since I didn't turn trace off and back on for the 2nd time I ran the test.
(In reply to Steve from comment #12) > (In reply to Tom Horsley from comment #8) > ... > > One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1). > ... > > Could you open a new bug for that with component "systemd"? An > already-connected drive should not be renamed when you connect a second > drive. This was spurious. Apparently somewhere in there one of the cables came loose on the ASUS drive, so it wasn't there at all. I wiggled the cables and it is better now (and the usb does show up as /dev/sr1).
(In reply to Tom Horsley from comment #13) > Created attachment 1678735 [details] > rerun of journalctl on the asus mini-pc with usb debugging on > > I've run the journalctl command, this time on the mini-pc I used for the first set of journalctl output and with the usb debug enabled. > > There was a lot of clutter from linux not liking the wifi chip, so I disabled wifi and ran it for a second time, there may be stuff at the beginning from the first try. Thanks. That looks much cleaner. I am not qualified to give a diagnosis except to say that the USB layer looks happy as a clam. :-) At this point you could: 1. Try to enable dynamic debugging output for the next layer up. "sr"? 2. Open an upstream bug at https://bugzilla.kernel.org. I looked at the choices for Product/Component and two possibilities are: Drivers/USB and IO/Storage/SCSI. $ less -N dmesg-3.txt ... 163 Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! 164 Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command 165 Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s 166 Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00 167 Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 168 Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling. 169 Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x 169 3faeda020 170 Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! ...
(In reply to Tom Horsley from comment #14) > Created attachment 1678737 [details] > the usb debug trace output from the same run > > No doubt this has output from both times I loaded the DVD since I didn't turn trace off and back on for the 2nd time I ran the test. Thanks. That is bigger than I was expecting: $ wc -l trace-3.txt 88468 trace-3.txt There are two "Stall Error" events, which probably correspond to your two test runs, but that is as far as I care to comment on this trace. A USB expert would need to look at it to say what is normal and what is not: $ grep -C1 'Stall Error' trace-3.txt usb-storage-3363 [001] d... 2362.886878: xhci_ring_ep_doorbell: Ring doorbell for Slot 6 ep1in systemd-journal-445 [000] d.h. 2362.886891: xhci_handle_event: EVENT: TRB 00000003bfb8b090 status 'Stall Error' len 13 slot 6 ep 3 type 'Transfer Event' flags e:C systemd-journal-445 [000] d.h. 2362.886907: xhci_handle_transfer: BULK: Buffer 0000000406879080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C -- usb-storage-3461 [001] d... 2612.623140: xhci_ring_ep_doorbell: Ring doorbell for Slot 7 ep1in systemd-journal-445 [000] d.h. 2612.623154: xhci_handle_event: EVENT: TRB 00000003faeda090 status 'Stall Error' len 13 slot 7 ep 3 type 'Transfer Event' flags e:C systemd-journal-445 [000] d.h. 2612.623167: xhci_handle_transfer: BULK: Buffer 0000000406879080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C BTW, the trace appears to have timestamps in column four, so your two test runs would have been about 250 seconds apart.
(In reply to Tom Horsley from comment #15) > (In reply to Steve from comment #12) > > (In reply to Tom Horsley from comment #8) > > ... > > > One other oddity I noticed. This computer has an ASUS drive connected via sata, and it is normally /dev/sr0, but apparently plugging in the USB drive stole the /dev/sr0 device (I was expecting the usb drive to show up as /dev/sr1). > > ... > > > > Could you open a new bug for that with component "systemd"? An > already-connected drive should not be renamed when you connect a second drive. > > This was spurious. Apparently somewhere in there one of the cables came loose on the ASUS drive, so it wasn't there at all. I wiggled the cables and it is better now (and the usb does show up as /dev/sr1). Thanks for resolving that to a hardware problem. I had an ethernet connection fail when I tried to use a connector with the clip broken off, so my "free" advice is to throw away any SATA cables that do not have clips on them. :-)
I found a way to correlate the dmesg output and the trace by searching for "0x3faeda020". The "-B6" and the "-B7" options were chosen to cover the previous "ding dong". The results below are hard to read, but there is no indication that the USB layer is having any problems, except possibly for the "CErr 3" in line 79145 of trace-3.txt. $ grep -n -B6 '0x3faeda020' dmesg-3.txt 163-Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! 164-Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command 165-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s 166-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00 167-Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 168-Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling. 169:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x3faeda020 170-Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! 171-Apr 14 12:15:38 kernel: xhci_hcd:handle_tx_event: xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for slot 7 ep 2 172:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Removing canceled TD starting at 0x3faeda020 (dma). $ grep -n -B7 '0x3faeda020 (dma)' trace-3.txt 79139- kworker/u4:0-3251 [001] d... 2715.633800: xhci_ring_host_doorbell: Ring doorbell for Command Ring 0 79140- <idle>-0 [000] d.h. 2715.633849: xhci_handle_event: EVENT: TRB 00000003faeda020 status 'Stopped' len 2048 slot 7 ep 3 type 'Transfer Event' flags e:c 79141- <idle>-0 [000] d.h. 2715.633871: xhci_handle_transfer: BULK: Buffer 00000003edacd000 length 4096 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C 79142- <idle>-0 [000] d.h. 2715.633876: xhci_inc_deq: EVENT 000000002c5e8e5e: enq 0x0000000407eba000(0x0000000407eba000) deq 0x0000000407ebac00(0x0000000407eba000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0 79143- <idle>-0 [000] d.h. 2715.633878: xhci_handle_event: EVENT: TRB 0000000407eb8490 status 'Success' len 0 slot 7 ep 0 type 'Command Completion Event' flags e:c 79144- <idle>-0 [000] d.h. 2715.633879: xhci_handle_command: CMD: Stop Ring Command: slot 7 sp 0 ep 3 flags C 79145- <idle>-0 [000] d.h. 2715.633883: xhci_handle_cmd_stop_ep: State stopped mult 1 max P. Streams 0 interval 125 us max ESIT payload 0 CErr 3 Type Bulk IN burst 15 maxp 1024 deq 00000003faeda021 avg trb len 0 79146: <idle>-0 [000] d.h. 2715.633891: xhci_dbg_cancel_urb: Removing canceled TD starting at 0x3faeda020 (dma).
According to the timestamps below, there is a 10 second gap between these two records. Would that be enough to cause a timeout in the upper layers? $ grep -n -A1 '2705.384478:.*EVENT' trace-3.txt 79134: <idle>-0 [000] d.h. 2705.384478: xhci_inc_deq: EVENT 000000002c5e8e5e: enq 0x0000000407eba000(0x0000000407eba000) deq 0x0000000407ebabf0(0x0000000407eba000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0 79135- kworker/u4:0-3251 [001] d... 2715.633748: xhci_urb_dequeue: ep1in-bulk: urb 0000000074932569 pipe 3221259392 slot 7 length 0/4096 sgs 1/1 stream 0 flags 00040201
Use a bigger number here to look further back in time. Use the "G" command in "less" or the "End" key to jump to the end and then scroll backwards with the up-arrow key. There are no other large time gaps than the 10 second gap noted in Comment 20: $ grep -n -B100 '0x3faeda020 (dma)' trace-3.txt | less ^^^
Tom in Comment 0: "... I discovered I could play the same DVD on the same drive just find on a Windows 10 system using the Windows version of VLC." Could you try a timing test with Windows? I'm not sure what events to use, but when the DVD drive tray closes would be a good starting event. If there is a light on the DVD drive, it might be possible to wait for it to go out and use that as the ending event. A wristwatch "stopwatch" should be accurate enough for the purpose. If you take it off, it might be possible to operate it with one hand. Specifically, I'm wondering if Windows is just more "patient". :-)
Under /sys/module/scsi_mod/parameters/ there a scsi parameter called "inq_timeout". According to the kernel source code "inq_timeout" is the: "Timeout (in seconds) waiting for devices to answer INQUIRY." " Default is 20. Some devices may need more; most need less."); So try this: $ sudo sh -c "echo 60 > /sys/module/scsi_mod/parameters/inq_timeout" $ cat /sys/module/scsi_mod/parameters/inq_timeout ("60" is a guess.) Details: $ less -N drivers/scsi/scsi_scan.c ... 108 static unsigned int scsi_inq_timeout = SCSI_TIMEOUT/HZ + 18; 109 110 module_param_named(inq_timeout, scsi_inq_timeout, uint, S_IRUGO|S_IWUSR); 111 MODULE_PARM_DESC(inq_timeout, 112 "Timeout (in seconds) waiting for devices to answer INQUIRY." 113 " Default is 20. Some devices may need more; most need less."); ...
(In reply to Thomas Schmitt from comment #6) Thanks for posting that link, where you say in the same thread: "139255322 us [ 180481085 ] This says that the run already lasted 180 seconds and the SCSI command to which this line belongs lasted 139 seconds. (Your patience is in good shape.)" https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org/message/QIIFFUF4SQCWEWAIV26SMTEWKEPKUWQI/ So the "inq_timeout" may need to be much higher that "60" seconds. If that turns out to be the case, I would say that this is a firmware problem with the USB drive enclosure.
(In reply to Steve from comment #11) > You might be able to get more information about what is happening at the USB layer by running these commands as root from the command-line: > > # echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control > # echo 'module usbcore =pmf' > /sys/kernel/debug/dynamic_debug/control > > # echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb > # echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable > > Collect logs with: > > $ journalctl --no-hostname -k > /tmp/dmesg-1.txt > $ sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt This enables dynamic debugging for "scsi_mod": # echo 'module scsi_mod =pmf' > /sys/kernel/debug/dynamic_debug/control Tested in a VM with this running in a terminal window: $ journalctl --no-hostname -k -f And then by inserting and removing a USB flash drive with the F31 Live image on it. I don't know how informative the output will be, because I got messages like these: Apr 14 22:54:24 kernel: scsi_mod:scsi_runtime_idle: scsi target6:0:0: scsi_runtime_idle Apr 14 22:54:24 kernel: scsi_mod:scsi_runtime_suspend: scsi target6:0:0: scsi_runtime_suspend For tracing, you may need to do this too: # echo 1 > /sys/kernel/debug/tracing/events/scsi_mod/enable
On windows 10 (on the same asus mini pc) it takes 13 seconds from the time I hit the button to close the tray and load the disc till the lights stop flashing, so that's comfortably under 20 seconds. VLC can start playing immediately if I hit play at the end of that time. So I don't think there's a firmware problem in the USB enclosure :-).
Hi, > So the "inq_timeout" may need to be much higher that "60" seconds. INQUIRY is an SCSI command which should be done very quickly. It tells no info about the medium but only static drive properties. SPC-3 says in 6.1: "The INQUIRY data should be returned even though the device server is not ready for other commands. The standard INQUIRY data should be available without incurring any media access delays." The further statement does not apply to DVD, because no INQUIRY data come from CD, DVD, or BD. Nevertheless it further confirms that the INQUIRY should be done quickly: "If the device server [the drive] does store some of the standard INQUIRY data or VPD data on the media, it may return ASCII spaces (20h) in ASCII fields and zeros in other fields until the data is available from thei media." > If that turns out to be the case, I would say that this is a firmware > problem with the USB drive enclosure. The error symptoms are like that. But why only with (copy protected) DVD ? My first guess would be power supply problems. This would match the observation that the problem is not present via direct SATA connection and happens only with particular media. But it works in the USB box under MS-Windows and the box has its own power supply rather than getting its power via data cable. Both does not give the idea of plain power shortage. We could learn more about the timing if Tom Horsley would post the complete libburn SCSI command log obtained by xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log (Maybe now the drive is /dev/sr1. xorriso uses libburn, which uses ioctl(SG_IO) for its SCSI transactions.) Have a nice day :) Thomas
(In reply to Tom Horsley from comment #26) Thanks for doing that timing test. 13 seconds sounds like a good benchmark for Linux to meet. :-) Could you post the output from "lsmod"? (per the following) (In reply to Thomas Schmitt from comment #27) Thanks for those quotes. So the comments in the scsi driver code appear to be inconsistent with the SCSI specification: "Default is 20 [seconds]. Some devices may need more; most need less [for a SCSI INQUIRY to complete]." And, as Tom notes, there are no problems when the DVD drive is connected directly to a SATA port. (Comment 2) However, the log shows several drivers at play. Why do we need cxgb3i, cxgb4i, bnx2i, and be2iscsi? The problem here could be that too many or the wrong drivers are being loaded. $ fgrep -B20 -n -w 'uas' dmesg-1.txt | egrep -v 'audit|RPC' 758-Apr 10 06:36:12 kernel: Chelsio T3 iSCSI Driver cxgb3i v2.0.1-ko (Apr. 2015) 759-Apr 10 06:36:12 kernel: iscsi: registered transport (cxgb3i) 760-Apr 10 06:36:12 kernel: Chelsio T4-T6 iSCSI Driver cxgb4i v0.9.5-ko (Apr. 2015) 761-Apr 10 06:36:12 kernel: iscsi: registered transport (cxgb4i) 762-Apr 10 06:36:12 kernel: cxgb4i:cxgb4i_init_module: cxgb4i dcb enabled. 763-Apr 10 06:36:12 kernel: cnic: QLogic cnicDriver v2.5.22 (July 20, 2015) 764-Apr 10 06:36:12 kernel: QLogic NetXtreme II iSCSI Driver bnx2i v2.7.10.1 (Jul 16, 2014) 765-Apr 10 06:36:12 kernel: iscsi: registered transport (bnx2i) 766-Apr 10 06:36:12 kernel: iscsi: registered transport (be2iscsi) 767-Apr 10 06:36:12 kernel: In beiscsi_module_init, tt=(____ptrval____) 775-Apr 10 06:36:13 kernel: usb-storage 4-2:1.0: USB Mass Storage device detected 776-Apr 10 06:36:13 kernel: scsi host4: usb-storage 4-2:1.0 777-Apr 10 06:36:13 kernel: usbcore: registered new interface driver usb-storage 778:Apr 10 06:36:13 kernel: usbcore: registered new interface driver uas
Hi, > 13 seconds sounds like a good benchmark for Linux to meet. :-) This depends on drive and medium. The drive won't tell much during most of the medium accessing time. But INQURIY should work. The first INQURY of a libburn access here with drive tray sticking out succeeds 52 milliseconds after the library was started up. 1.8 seconds later the command to load the tray gets issued. It needs 1.25 seconds to complete. After 13 more seconds the drive reports to be ready for work with the medium. The first SCSI command READ(10) happens 16.2 seconds after library start. > The problem here could be that too many or the wrong drivers are being loaded. iSCSI is SCSI-over-internet. So cxgb3i, cxgb4i, bnx2i, be2iscsi should all play no role in Tom's situation. (It might be interesting to find out why they get started and why there are four of them.) The communication between userland, kernel, and drive goes through the upper and middle layers of Linux scsi driver as described in https://www.kernel.org/doc/html/v4.13/driver-api/scsi.html where the lowest layer is only sparsely mentioned. I assume that in Tom's case this layer is described in https://www.kernel.org/doc/html/v4.13/driver-api/usb/index.html but there i get lost with my userland-to-drive SCSI knowledge. It is a riddle how the functionality of USB can get disturbed by a medium condition which is supposed to affect the relation of drive to medium and of Linux uppermost SCSI level to drive. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #27) > We could learn more about the timing if Tom Horsley would post the > complete libburn SCSI command log obtained by > > xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log I doubt you'll learn much. Here's the complete output: [tweety@vivo ~]$ xorriso -scsi_log on -outdev /dev/sr0 -toc 2>&1 | tee -i /tmp/xorriso.log xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project. 740819134 us [ 789589505 ] libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive xorriso : FAILURE : Cannot acquire drive '/dev/sr0' xorriso : aborting : -abort_on 'FAILURE' encountered 'FAILURE' The only thing that came out was the header line, the time and failure messages only appeared when I powered off the drive to get the hung process to exit. Merely ejecting the disk with the button on the drive had no effect, the command was still hung.
Created attachment 1679148 [details] log of a data dvd working in the same drive Just for contrast, I tried a data DVD (no encryption) in the drive and I do get a perfectly wonderful log from it.
Hi, > I doubt you'll learn much. > 740819134 us [ 789589505 ] > libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive Indeed that's sparse. The last time your xorriso run could at least perform some SCSI commands after having acquired /dev/sr0. I wonder why the name of the SCSI command is not shown which was stuck for 740 seconds. I assume it is the first INQUIRY which shall confirm that the device claims to be a CD drive. Timeout for this command is 30 seconds. So its end was 710 seconds overdue. (It might also be a TEST UNIT READY. More than those two cannot be issued before the program code with the FAILURE message is reached.) Whatever, the connection to the drive seems to have been completely unusable. 49 seconds seem to have been spent before the SCSI command was started (789589505 - 740819134 microseconds). I have few idea what system calls might have needed so long. As said, for me it lasts about 50 milliseconds to get that far. Have a nice day :) Thomas
Hi, > I tried a data DVD (no encryption) in the drive and I do get a > perfectly wonderful log from it. With the one riddle that again a SCSI transaction timestamp is shown with no command before it. It is not the INQUIRY, which is shown next. So it must be the TEST UNIT READY. (I will have to investigate why you get that timestamp whereas i don't.) The reported times look good. After 3 seconds the drive is done with reporting its state and its perception of the loaded medium. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #29) Thanks for the timings and for the doc links, which already help.* To add dynamic debugging output for the CD-ROM SCSI upper layer: In a terminal window, run: $ journalctl --no-hostname -k | tee -i /tmp/sr_mod-log-1.txt In a separate terminal window, enable logging for the "sr_mod" module and for the USB modules: # echo 'module sr_mod =pmf' > /sys/kernel/debug/dynamic_debug/control # echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control # echo 'module usbcore =pmf' > /sys/kernel/debug/dynamic_debug/control Load the test DVD. Run "xorriso". NB: I don't have a working optical drive ATM, but the commands run fine in a VM and some logging is generated when a USB flash drive is inserted and removed. * SCSI Interfaces Guide https://www.kernel.org/doc/html/v4.13/driver-api/scsi.html
Hi, at least libburn has a bug less now. The SCSI commands issued in the course of the Linux-specific drive acquiration were not logged. This was not noticed here, because they get emitted only if ioctl HDIO_GET_IDENTITY says the drive is under control of the old driver "hd" for Parallel ATA (decommissioned long ago), or if ioctl SG_GET_SCSI_ID returns error. I bet that the latter happens to you. In this case, Linux would refuse to tell the address which it logged e.g. as "6:0:0:0:" in Apr 13 18:57:36 kernel: sr 6:0:0:0: Attached scsi CD-ROM sr0 So the lonely timestamp probably stems from TEST UNIT READY, which gets emitted in either case. An INQUIRY is emitted only if HDIO_GET_IDENTITY does not indicate "hd". But that second command would have caused a second timestamp. So i guess it's TEST UNIT READY. ioctl SG_GET_SCSI_ID might have consumed the 49 seconds before the TEST UNIT READY command was started. Have a nice day :) Thomas
(In reply to Steve from comment #19) I just noticed a 30 second gap in the log output posted in Comment 19. That shows that the log records may be out of order, because the "sr" upper layer "hostbyte=DID_TIME_OUT" entry is logged *before* the 30 second gap. So why is xhci_hcd "stopping port polling" and 30 seconds later cancelling a URB*? Copied from Comment 19: $ grep -n -B6 '0x3faeda020' dmesg-3.txt 163-Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! 164-Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command 165-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s 166-Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00 167-Apr 14 12:15:07 kernel: blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 168-Apr 14 12:15:08 kernel: xhci_hcd:xhci_hub_status_data: xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling. ^^ vv 169:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 0000000074932569, dev 1, ep 0x81, starting at offset 0x3faeda020 170-Apr 14 12:15:38 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! 171-Apr 14 12:15:38 kernel: xhci_hcd:handle_tx_event: xhci_hcd 0000:00:14.0: Stopped on Transfer TRB for slot 7 ep 2 172:Apr 14 12:15:38 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Removing canceled TD starting at 0x3faeda020 (dma). * "There are two ways to cancel an URB you’ve submitted but which hasn’t been returned to your driver yet." How to cancel an already running URB? https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html#how-to-cancel-an-already-running-urb "URB" means "USB Request Block" https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html
OK, I did another run of xorriso with lots of kernel debugging turned on (I don't understand it, I just turned on everything I saw mentioned in the comments). The xorriso log acted a bit different, it printed the initial timestamp all by itself without me ejecting the disk or turning off the drive: xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project. 30214061 us [ 54708892 ] libburn : FAILURE : Cannot access '/dev/sr0' as SG_IO CDROM drive xorriso : FAILURE : Cannot acquire drive '/dev/sr0' xorriso : aborting : -abort_on 'FAILURE' encountered 'FAILURE' The FAILURE lines did come out after I turned off the drive to get it out of the hang. I'll attach dmesg.txt and kernel-debug.txt created during the run of the hung xorriso with the movie dvd in the drive.
Created attachment 1679160 [details] dmesg during run of xorriso hang
Created attachment 1679161 [details] kernel debug during run of xorriso hang
(In reply to Tom Horsley from comment #39) > Created attachment 1679161 [details] > kernel debug during run of xorriso hang Thanks. The good news is that some SCSI commands were logged. The bad news is that they don't seem to explain the 30 second gap here. This is extremely hard to read, but there are line numbers on the left side and timestamps in column 5. I've separated the records where there is a 30 second gap. The search pattern is 'Cancel URB' and the entries before it are displayed up to the first 'Ring doorbell' entry (going backwards or "up"). $ fgrep -B9 -n -m1 'Cancel URB' trace-4.txt 2967- usb-storage-315 [001] d... 10102.320216: xhci_ring_ep_doorbell: Ring doorbell for Slot 3 ep4in 2968- <idle>-0 [000] d.h. 10102.320342: xhci_handle_event: EVENT: TRB 0000000407b21910 status 'Success' len 0 slot 3 ep 9 type 'Transfer Event' flags e:c 2969- <idle>-0 [000] d.h. 10102.320347: xhci_handle_transfer: BULK: Buffer 000000040615c000 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C 2970- <idle>-0 [000] d.h. 10102.320350: xhci_inc_deq: BULK 000000009249b2eb: enq 0x0000000407b21920(0x0000000407b21000) deq 0x0000000407b21920(0x0000000407b21000) segs 2 stream 0 free_trbs 509 bounce 1024 cycle 1 2971- <idle>-0 [000] d.h. 10102.320353: xhci_urb_giveback: ep4in-bulk: urb 00000000ae4357dc pipe 3221357184 slot 3 length 13/13 sgs 0/0 stream 0 flags 00000204 2972- <idle>-0 [000] d.h. 10102.320356: xhci_inc_deq: EVENT 000000009abdcb21: enq 0x0000000407e6c000(0x0000000407e6c000) deq 0x0000000407e6c210(0x0000000407e6c000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0 2973- usb-storage-315 [001] .... 10102.320425: scsi_dispatch_cmd_done: host_no=4 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(WRITE_10 lba=86866184 txlen=8 protect=0 raw=2a 00 05 2d 79 08 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) ^^ vv 2974- kworker/1:1H-335 [001] .... 10132.015859: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) 2975- kworker/u4:2-6805 [000] d... 10132.026879: xhci_urb_dequeue: ep1in-bulk: urb 000000001e88754a pipe 3221259392 slot 7 length 0/4096 sgs 1/1 stream 0 flags 00040201 2976: kworker/u4:2-6805 [000] d... 10132.026912: xhci_dbg_cancel_urb: Cancel URB 000000001e88754a, dev 1, ep 0x81, starting at offset 0x3f8dc8900 NB: I renamed the log "trace-4.txt" for consistency and clarity.
This will display just the SCSI commands: $ fgrep -i -n scsi trace-4.txt | less There is one oddity that can be viewed with: $ fgrep -C4 -n scsi_eh_wakeup trace-4.txt | less There are three of these timeouts: $ fgrep -n scsi_dispatch_cmd_timeout trace-4.txt | wc -l 3 The first two show a significant time gap before the previous record (29.695434 s, 24.566161 s). The third does not (0.683883 s): $ fgrep -B1 -n scsi_dispatch_cmd_timeout trace-4.txt | less
Hi, > the "sr" upper layer "hostbyte=DID_TIME_OUT" entry is logged *before* > the 30 second gap. > $ grep -n -B6 '0x3faeda020' dmesg-3.txt It could be from a previous SCSI-via-USB transaction attempt. https://bugzilla.redhat.com/attachment.cgi?id=1678735 "rerun of journalctl on the asus mini-pc with usb debugging on " shows: Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Cancel URB 000000003f0a2866, dev 1, ep 0x81, starting at offset 0x3faedac40 ... Apr 14 12:15:07 kernel: xhci_hcd:xhci_ring_cmd_db: xhci_hcd 0000:00:14.0: // Ding dong! Apr 14 12:15:07 kernel: xhci_hcd:xhci_dbg_trace: xhci_hcd 0000:00:14.0: Successful evaluate context command Apr 14 12:15:07 kernel: sr 5:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s > https://www.kernel.org/doc/html/latest/driver-api/usb/URB.html If i only could see the way how an URB transports a SCSI command, its possible data, its possible result data, and its success indication. > https://bugzilla.redhat.com/attachment.cgi?id=1679160 > dmesg during run of xorriso hang Says that sr0 has SCSI address 5:0:0:0 , which corresponds in > https://bugzilla.redhat.com/attachment.cgi?id=1679161 > kernel debug during run of xorriso hang to "host_no=5 channel=0 id=0 lun=0". These two might mark start and timeout of a READ command for 2 blocks from Logical Block Address 2: kworker/1:1H-335 [001] .... 10162.370474: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) ... kworker/1:1H-335 [001] .... 10192.432179: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) <idle>-0 [001] d.s. 10192.438071: scsi_eh_wakeup: host_no=5 Problem with interpreting the lines inbetween is the busy device 4:0:0:0, which is reported in dmesg as Apr 15 08:37:52 kernel: usb 4-2: Product: My Book 1130 Apr 15 08:37:52 kernel: usb 4-2: Manufacturer: Western Digital ... Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access WD My Book 1130 1016 PQ: 0 ANSI: 6 ... Apr 15 08:37:53 kernel: sd 4:0:0:0: [sdb] 1953458176 512-byte logical blocks: (1.00 TB/931 GiB) How to filter away the USB messages of this disk ? --------------------------------------------------------------- About comment 40: > 2973- usb-storage-315 [001] .... 10102.320425: scsi_dispatch_cmd_done: > host_no=4 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL > cmnd=(WRITE_10 lba=86866184 txlen=8 protect=0 raw=2a 00 05 2d 79 08 00 00 08 > 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE That's the disk. The DVD drive will get no WRITE_10 and offer no LBA 86866184 (0x052d7908) for reading. That's byte address 177,901,944,832. I understand this transaction was successful. > 2974- kworker/1:1H-335 [001] .... 10132.015859: > scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 > prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 > protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK > host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) This is the DVD drive having a timeout. Probably started by > systemd-udevd-7136 [000] .... 10101.875658: scsi_dispatch_cmd_start: > host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 > prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 > raw=28 00 00 00 02 00 00 00 02 00) immediately after another READ_10 succeeded: > usb-storage-4803 [001] .... 10101.875369: scsi_dispatch_cmd_done: > host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 > prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=256 txlen=2 protect=0 > raw=28 00 00 00 01 00 00 00 02 00) > result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE > status=SAM_STAT_GOOD) (There are many instances interested in reading blocks from the start of the DVD.) Have a nice day :) Thomas
Errata: I wrote: > These two might mark start and timeout of a READ command for 2 blocks > from Logical Block Address 2: > [...] raw=28 00 00 00 02 00 00 00 02 00 [...] This is Logical Block Address 512 (0x00000200), not 2 (0x00000002).
Hi, > $ fgrep -B1 -n scsi_dispatch_cmd_timeout trace-4.txt | less If i run grep -n 'scsi_dispatch_cmd.*: host_no=5' trace-4.txt | less i see lots of successful SCSI transactions scsi_dispatch_cmd_start scsi_dispatch_cmd_done Most are automatic 4Ah GET EVENT STATUS NOTIFICATION issued by the kernel raw=4a 01 00 00 10 00 00 00 08 00 Only three timeouts are to see 10101.875658 to 10132.015859 10132.156237 to 10162.224030 10162.370474 to 10192.432179 all nearly 30 seconds. All with command READ_10 at LBA 512. Have a nice day :) Thomas
Thomas: You already realized this, but when I went to post it, there was a BZ mid-air collision, so I am posting after your comments: There are three values for "host_no" in trace-4.txt: 0, 4, 5. So the trace log has entries for several devices. They appear to correspond to these devices: $ grep -n 'scsi [0-9]' dmesg-4.txt 691:Apr 15 08:37:52 kernel: scsi 0:0:0:0: Direct-Access ATA Crucial_CT512MX1 MU02 PQ: 0 ANSI: 5 782:Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access WD My Book 1130 1016 PQ: 0 ANSI: 6 ... 1073:Apr 15 12:45:37 kernel: scsi 5:0:0:0: CD-ROM HL-DT-ST BD-RE WH16NS40 1.04 PQ: 0 ANSI: 0 ... Comment 40 shows that host_no=4 is being accessed. That's the "My Book", which is an external USB drive. Why would it be getting accessed? For the record: $ fgrep 'DMI:' dmesg-4.txt Apr 15 08:37:52 kernel: DMI: ASUSTeK Computer INC. VM40B/VM40B, BIOS 1501 12/09/2014
Thomas wrote in Comment 42: > How to filter away the USB messages of this disk ? $ grep -v <pattern> ... # With the "-v" option, records matching the pattern are removed from the output. So this will remove the SCSI records for the My Book 1130: $ grep -v -n 'host_no=4' trace-4.txt | less
The WD My Book is where fedora is installed for purposes of running fedora tests on what is normally a Windows 10 mini pc (the internal disk is windows only).
(In reply to Tom Horsley from comment #47) > The WD My Book is where fedora is installed for purposes of running fedora tests on what is normally a Windows 10 mini pc (the internal disk is windows only). Thanks. The "-k" option removes a lot of context from the log, but there is just enough, barely, to infer that, in retrospect.* :-) Could you post or attach the list of your kernel modules: $ lsmod > /tmp/lsmod-1.txt * == $ grep -n 'sd[ab][0-9]' dmesg-4.txt 700:Apr 15 08:37:52 kernel: sda: sda1 sda2 sda3 sda4 sda5 sda6 806:Apr 15 08:37:59 kernel: sdb: sdb1 sdb2 sdb3 sdb4 sdb5 814:Apr 15 08:38:00 kernel: EXT4-fs (sdb4): mounted filesystem with ordered data mode. Opts: (null) 889:Apr 15 12:38:08 kernel: EXT4-fs (sdb4): re-mounted. Opts: (null) 891:Apr 15 12:38:08 kernel: Adding 8214524k swap on /dev/sdb3. Priority:-2 extents:1 across:8214524k FS 968:Apr 15 12:38:15 kernel: EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null) 971:Apr 15 12:38:15 kernel: EXT4-fs (sdb5): mounted filesystem with ordered data mode. Opts: (null)
Created attachment 1679183 [details] lsmod listing I tried running this before and after accessing the dvd, but got the same list, so apparently nothing was dynamically loaded during the test.
(In reply to Tom Horsley from comment #49) > Created attachment 1679183 [details] > lsmod listing > > I tried running this before and after accessing the dvd, but got the same list, so apparently nothing was dynamically loaded during the test. Thanks. Apparently there are kernel modules for enclosures: SCSI- and enclosure-related: $ egrep 'scsi|ses|enclosure' lsmod.txt ses 20480 0 enclosure 16384 1 ses scsi_transport_sas 49152 1 ses $ modinfo ses | egrep 'description|^name' description: SCSI Enclosure Services (ses) driver name: ses $ modinfo enclosure | egrep 'description|^name' description: Enclosure Services name: enclosure $ modinfo scsi_transport_sas | egrep 'description|^name' description: SAS Transport Attributes name: scsi_transport_sas USB-related: $ egrep 'usb|usb_storage|uas' lsmod.txt btusb 57344 0 btrtl 24576 1 btusb # "bt" is bluetooth, so we can ignore those. btbcm 16384 1 btusb btintel 32768 1 btusb bluetooth 634880 33 btrtl,btintel,btbcm,bnep,btusb,rfcomm uas 32768 0 usb_storage 81920 6 uas $ modinfo uas | egrep 'description|^name|import_ns' import_ns: USB_STORAGE # "ns" is namespace. name: uas $ modinfo usb_storage | egrep 'description|^name|import_ns' description: USB Mass Storage driver for Linux name: usb_storage
The My Book also has an enclosure, but the enclosure is logged as LUN* 1 on SCSI host 4: $ fgrep -A3 -n -m1 'scsi 4' dmesg-4.txt 782:Apr 15 08:37:53 kernel: scsi 4:0:0:0: Direct-Access WD My Book 1130 1016 PQ: 0 ANSI: 6 783-Apr 15 08:37:53 kernel: scsi 4:0:0:1: Enclosure WD SES Device 1016 PQ: 0 ANSI: 6 <<< note "SES Device" here. 784-Apr 15 08:37:53 kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0 785-Apr 15 08:37:53 kernel: scsi 4:0:0:1: Attached scsi generic sg2 type 13 And there are some problems reported for it: $ fgrep -A3 -n 'scsi 4:0:0:1: Wrong' dmesg-4.txt 917:Apr 15 12:38:11 kernel: scsi 4:0:0:1: Wrong diagnostic page; asked for 1 got 8 918-Apr 15 12:38:11 kernel: scsi 4:0:0:1: Failed to get diagnostic page 0x1 919-Apr 15 12:38:11 kernel: scsi 4:0:0:1: Failed to bind enclosure -19 920-Apr 15 12:38:11 kernel: ses 4:0:0:1: Attached Enclosure device <<<<<<<<<<<<<<< "ses" module logged here. * Logical Unit Number https://www.kernel.org/doc/html/latest/driver-api/scsi.html
In contrast to the My Book enclosure, the OWC enclosure appears as a "USB Mass Storage device": $ fgrep -A7 -n -m1 'usb 4-1' dmesg-4.txt 1065:Apr 15 12:45:36 kernel: usb 4-1: new SuperSpeed Gen 1 USB device number 3 using xhci_hcd 1066-Apr 15 12:45:36 kernel: usb 4-1: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 1067-Apr 15 12:45:36 kernel: usb 4-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 1068-Apr 15 12:45:36 kernel: usb 4-1: Product: Mercury Pro Optical 1069-Apr 15 12:45:36 kernel: usb 4-1: Manufacturer: Other World Computing 1070-Apr 15 12:45:36 kernel: usb 4-1: SerialNumber: 002933001438 1071-Apr 15 12:45:36 kernel: usb-storage 4-1:1.0: USB Mass Storage device detected 1072-Apr 15 12:45:36 kernel: scsi host5: usb-storage 4-1:1.0
Thomas wrote in Comment 42: > How to filter away the USB messages of this disk ? Second reply with a more precise answer than in Comment 42: Many of the USB records are tagged with one of only two identifiers: "usb-storage-315" or "usb-storage-4803": $ cat trace-4.txt | awk '/usb-storage-/ { print $1 }' | sort -u usb-storage-315 usb-storage-4803 I believe the number is a process or thread ID. This command selects records that have one of those tags and a "host_no" field: $ cat trace-4.txt | awk '/usb-storage-.*host_no/ { print $1, $6 }' | sort -u usb-storage-315 host_no=4 usb-storage-4803 host_no=5
(In reply to Steve from comment #53) ... > Second reply with a more precise answer than in Comment 42: ... That back reference should be to Comment 46.
(In reply to Steve from comment #53) ... > Many of the USB records are tagged with one of only two identifiers: "usb-storage-315" or "usb-storage-4803": ... Actually, the log has records for many processes that ran during the test, including "vi": $ cat trace-4.txt | awk '/^#.*/ { next }; { print $1 }' | sort -u cdrom_id-7134 cdrom_id-7135 cdrom_id-7137 dbus-broker-3532 dconf-service-3777 gdbus-3632 gdbus-4073 gnome-shell-3619 gnome-terminal--5456 gvfs-udisks2-vo-3782 <idle>-0 jbd2/sdb4-8-350 jbd2/sdb5-8-577 journal-offline-7141 kworker/0:1H-186 kworker/0:2-6861 kworker/1:1-7061 kworker/1:1H-335 kworker/u4:0-6955 kworker/u4:2-6805 NetworkManager-739 pool-gnome-shel-7153 pool-udisksd-7142 rcu_sched-10 systemd-journal-444 systemd-udevd-460 systemd-udevd-7133 systemd-udevd-7136 tracker-miner-f-3890 usb-storage-315 usb-storage-4803 vi-7157 xorriso-7138
Here, some of the SCSI commands are not interpreted in the output. There is one time gap of 24.706237 seconds when the process id changes. The awk script simply extracts the process name/id, the timestamp, and the command, and prints them. $ cat trace-4.txt | awk '/^ *cdrom_id-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }' cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) cdrom_id-7134 10075.091894: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7134 10075.092257: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7134 10075.092645: cmnd=(INQUIRY - raw=12 00 00 00 24 00) cdrom_id-7134 10075.093117: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00) cdrom_id-7134 10075.094305: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00) cdrom_id-7134 10075.096011: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00) cdrom_id-7134 10075.096510: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 14 00) cdrom_id-7134 10075.096953: cmnd=(READ_TOC - raw=43 00 01 00 00 00 00 00 0c 00) cdrom_id-7134 10075.097332: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 20 00) cdrom_id-7134 10075.097914: cmnd=(ALLOW_MEDIUM_REMOVAL - raw=1e 00 00 00 00 00) cdrom_id-7134 10075.098267: cmnd=(START_STOP - raw=1b 00 00 00 02 00) cdrom_id-7135 10076.555466: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) cdrom_id-7135 10076.555865: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7135 10076.556471: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7135 10076.556979: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) cdrom_id-7135 10076.557221: cmnd=(INQUIRY - raw=12 00 00 00 24 00) cdrom_id-7135 10076.557526: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00) cdrom_id-7135 10076.558995: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00) cdrom_id-7137 10101.265232: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) cdrom_id-7137 10101.265706: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7137 10101.266109: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) cdrom_id-7137 10101.266478: cmnd=(INQUIRY - raw=12 00 00 00 24 00) cdrom_id-7137 10101.266972: cmnd=(0x46 - raw=46 00 00 00 00 00 00 00 08 00) cdrom_id-7137 10101.268269: cmnd=(0x46 - raw=46 00 00 00 00 00 00 01 d8 00) cdrom_id-7137 10101.269830: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00) cdrom_id-7137 10101.270420: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 14 00) cdrom_id-7137 10101.271000: cmnd=(READ_TOC - raw=43 00 01 00 00 00 00 00 0c 00) cdrom_id-7137 10101.271523: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 20 00) cdrom_id-7137 10101.272007: cmnd=(ALLOW_MEDIUM_REMOVAL - raw=1e 00 00 00 01 00) The uninterpreted commands can be found here: SCSI command https://en.wikipedia.org/wiki/SCSI_command
(In reply to Steve from comment #56) ... cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) ... cdrom_id is a user space program in the systemd-udev package. And it is run from: /usr/lib/udev/rules.d/60-cdrom_id.rules systemd-udevd is also a user space program ("man systemd-udevd"). There is a 24.692882 second gap where the process id changes, and there is a READ_10 sequence at the end of this: $ cat trace-4.txt | awk '/^ *systemd-udevd-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }' systemd-udevd-7133 10075.063881: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) systemd-udevd-7133 10075.064351: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7133 10076.541158: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) systemd-udevd-7133 10076.541595: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7133 10076.542209: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7133 10076.542660: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.235542: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) systemd-udevd-7136 10101.236058: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.236813: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.237177: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.237511: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 40) systemd-udevd-7136 10101.237942: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 00) systemd-udevd-7136 10101.238405: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00) systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) systemd-udevd-7136 10101.239432: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 02 00) systemd-udevd-7136 10101.239850: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 22 00) systemd-udevd-7136 10101.240180: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 08 00) systemd-udevd-7136 10101.240521: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 20 00) systemd-udevd-7136 10101.240863: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.241083: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 40) systemd-udevd-7136 10101.241427: cmnd=(READ_TOC - raw=43 00 00 00 00 00 00 00 0c 00) systemd-udevd-7136 10101.241904: cmnd=(READ_TOC - raw=43 00 00 00 00 00 01 00 0c 00) systemd-udevd-7136 10101.242366: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) systemd-udevd-7136 10101.242762: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 02 00) systemd-udevd-7136 10101.243178: cmnd=(0x51 - raw=51 00 00 00 00 00 00 00 22 00) systemd-udevd-7136 10101.243573: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 08 00) systemd-udevd-7136 10101.244047: cmnd=(0x52 - raw=52 01 00 00 00 01 00 00 20 00) systemd-udevd-7136 10101.277361: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) systemd-udevd-7136 10101.277807: cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) systemd-udevd-7136 10101.278369: cmnd=(READ_10 lba=4101504 txlen=2 protect=0 raw=28 00 00 3e 95 80 00 00 02 00) systemd-udevd-7136 10101.516146: cmnd=(READ_10 lba=4101506 txlen=1 protect=0 raw=28 00 00 3e 95 82 00 00 01 00) systemd-udevd-7136 10101.517146: cmnd=(READ_10 lba=0 txlen=2 protect=0 raw=28 00 00 00 00 00 00 00 02 00) systemd-udevd-7136 10101.775259: cmnd=(READ_10 lba=8 txlen=2 protect=0 raw=28 00 00 00 00 08 00 00 02 00) systemd-udevd-7136 10101.776173: cmnd=(READ_10 lba=16 txlen=2 protect=0 raw=28 00 00 00 00 10 00 00 02 00) systemd-udevd-7136 10101.784320: cmnd=(READ_10 lba=32 txlen=2 protect=0 raw=28 00 00 00 00 20 00 00 02 00) systemd-udevd-7136 10101.793401: cmnd=(READ_10 lba=64 txlen=2 protect=0 raw=28 00 00 00 00 40 00 00 02 00) systemd-udevd-7136 10101.826396: cmnd=(READ_10 lba=128 txlen=2 protect=0 raw=28 00 00 00 00 80 00 00 02 00) systemd-udevd-7136 10101.847772: cmnd=(READ_10 lba=256 txlen=2 protect=0 raw=28 00 00 00 01 00 00 00 02 00) systemd-udevd-7136 10101.875658: cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00)
Hi, > Here, some of the SCSI commands are not interpreted in the output. They are ok for optical drives. From MMC-5 specs: 4Ah GET EVENT STATUS NOTIFICATION 6.7 46h GET CONFIGURATION 6.6 51h READ DISC INFORMATION 6.22 52h READ TRACK INFORMATION 6.27 It is somewhat astounding that the debug log does not tell their command names. libburn does not emit 4Ah GET EVENT STATUS NOTIFICATION. Nevertheless there is such a command attributed to xorriso-7138 in the log: xorriso-7138 [000] .... 10132.155618: scsi_dispatch_cmd_start: ... raw=4a 01 00 00 10 00 00 00 08 00) It might get emitted automatically by ioctl(SG_IO) before the payload command of SG_IO gets sent. But the two commands attributed to xorriso get no corresponding scsi_dispatch_cmd_done or scsi_dispatch_cmd_timeout. There are matching lines attributed to usb-storage, though. 3325: xorriso-7138 [000] .... 10132.155618: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) 3353: usb-storage-4803 [000] .... 10132.155909: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) 3354: xorriso-7138 [000] .N.. 10132.155951: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=0 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) 3373: usb-storage-4803 [000] .... 10132.156207: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=0 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(TEST_UNIT_READY - raw=00 00 00 00 00 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) I am not sure whether we always can trust that process name attribution. At least not with scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout lines. > There is one time gap of 24.706237 seconds when the process id changes. > ... > cdrom_id-7134 10075.091385: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) > ... > cdrom_id-7135 10076.555466: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) > ... > cdrom_id-7137 10101.265232: cmnd=(0x4a - raw=4a 01 00 00 10 00 00 00 08 00) Afaik, cdrom-id belongs to sytemd's udev. The time gap seems just to be the interval between three drive inspections by udev. (If we can trust the process name ...) ----------------------------------------------------------------------------- I have this nebulous theory stub: The READ_10 commands to host=5 lba=512 (possibly to identify the filesystem on DVD) get stuck and finally time out. Interestingly, reads to lba=4101504 or lba=0 to lba=256 succeed. Only lba=512 times out. If some other SCSI command is issued to the drive during the waiting time of the timeout, then it falls victim to this state. There is no such occasion to see in trace-4.txt, though. (If both commands are emitted via ioctl(SG_IO) to /dev/sr, then they come through sr_block_ioctl() in drivers/scsi/sr.c with its quite brainless mutex_lock(). In this case the other SCSI command would have to wait until the READ_10 is done. Even if both commands are for different drives. But POSIX i/o circumvents that mutex-stricken function. So i assume that commands to the same drive queue in the lower layers.) Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #58) Thanks for your very informative analysis. I never would have thought to pair up different process names. :-) > I have this nebulous theory stub: I was thinking along those lines too, although you are much better informed about the details than I am. There was another BZ mid-air collision, so I will post what I had written before I read your analysis. Can you suggest any other tests? == Here is another analysis. Several processes are sending SCSI commands to the DVD drive. Some have system-level privileges, so they could take control of the DVD drive and lock out all other processes. So here are some simple tests, the first of which is to log out and repeat the test. (running xorriso won't be possible, though) A second one is to boot to "runlevel" 3 (append "3" to the kernel command line from grub2), log in, and repeat the test. Reboot when done. And a third one is to stop systemd-udevd: # systemctl stop systemd-udevd NB: This change is not permanent. However, it could cause some things to stop working. Reboot or use the "start" command when done. Verify with: $ systemctl status systemd-udevd Details: These processes are sending SCSI commands to the DVD drive (host_no=5): $ cat trace-4.txt | grep 'host_no=5.*cmnd=' | awk '{ print $1 }' | sort -u cdrom_id-7134 cdrom_id-7135 cdrom_id-7137 kworker/0:1H-186 kworker/0:2-6861 kworker/1:1-7061 kworker/1:1H-335 systemd-udevd-7133 systemd-udevd-7136 usb-storage-4803 xorriso-7138 These processes are sending SCSI commands to the My Book drive (host_no=4): $ cat trace-4.txt | grep 'host_no=4.*cmnd=' | awk '{ print $1 }' | sort -u dconf-service-3777 jbd2/sdb4-8-350 jbd2/sdb5-8-577 journal-offline-7141 kworker/0:1H-186 kworker/1:1H-335 kworker/u4:0-6955 NetworkManager-739 pool-gnome-shel-7153 systemd-journal-444 usb-storage-315 vi-7157 And these to the SSD (host_no=0): $ cat trace-4.txt | grep 'host_no=0.*cmnd=' | awk '{ print $1 }' | sort -u <idle>-0 pool-udisksd-7142
systemd-udevd might get restarted if a process attempts to connect to one of its sockets: # systemctl stop systemd-udevd Warning: Stopping systemd-udevd.service, but it can still be activated by: systemd-udevd-kernel.socket systemd-udevd-control.socket So stop all of those systemd units: # systemctl stop systemd-udevd\* Verify with: # systemctl list-units --all systemd-udevd\* Tested in an F31 VM.
Hi, @Tom: Did you already try a xorriso run which begins when the tray with the DVD is still sticking out ? If not, then it might be worth a try in the hope to keep udev from opening the drive (see below: O_EXCL). If yes, then was it the run which you reported on the mailing list and which came far enough to run into "SG_IO: host_status= 0x7 SG_ERR_DID_ERROR" ? Also i support Steve's proposal to stop udev completely for a test. ----------------------------------------------------------------- More theory and background: The theory about the bad READ_10 to lba=512 would even match the observation that MS-Windows succeeds: If it does not try to read LBA 512 out of mere curiosity, and if the filesystem on the DVD does not advertise LBA 512 as part of any read-worthy object, then MS-Windows could stay away from that address and experience no timeout. LBA 512 is 1 megabyte from medium start (programmer's MiB: 1024 * 1024). In an ISO 9660 i would expect to find valuable data there. But UDF is wasteful with storage space at the start of the medium. So it might really be a spot which is not referred by the filesystem metadata. > Several processes are sending SCSI commands to the DVD drive. Yes. A READ_10 by udev seems to be the trigger for the recognizable timeouts. Question is whether this influences others. > Some have system-level privileges, so they could take control of > the DVD drive and lock out all other processes. There is no such privileged control known to me. Success of open(2) depends on device file permissions or Linux capabilities(7) of the process. Linux has the O_EXCL protocol for open(2) on device files: While defined in POSIX only for data files, open(...,O_EXCL) on Linux /dev/sr implements advisory locking. If a /dev/sr is already opened with O_EXCL, then another open(2) with O_EXCL will fail with errno EBUSY. open(2) without O_EXCL will succeed. burn programs use O_EXCL to keep each other from interfering with their respective activities. > to stop systemd-udevd: > # systemctl stop systemd-udevd If udev's READ_10 lba=512 is indeed the trigger for the problems, then this should show at least some change in behavior. Have a nice day :) Thomas
Created attachment 1679385 [details] xorriso log with udevd squashed Holy cow! It seems to work properly when udevd is well and truly squashed. To use belt and suspenders, I ran these commands: [root@vivo ~]# systemctl stop systemd-udevd Warning: Stopping systemd-udevd.service, but it can still be activated by: systemd-udevd-kernel.socket systemd-udevd-control.socket [root@vivo ~]# systemctl stop systemd-udevd-kernel.socket [root@vivo ~]# systemctl stop systemd-udevd-control.socket [root@vivo ~]# systemctl disable systemd-udevd [root@vivo ~]# systemctl mask systemd-udevd Created symlink /etc/systemd/system/systemd-udevd.service → /dev/null. Then, starting with the tray open and the dvd in the tray, I ran the xorriso command, and it never hung and appears to have functioned properly. Just started dvdbackup to see if it could read and decrypt the disc, and it is running just fine. (I better go re-enable udevd now before I forget)
Created attachment 1679386 [details] xorriso log with udevd squashed Yikes, picked wrong log file. This is the log of it working.
Thomas in Comment 61: > LBA 512 is 1 megabyte from medium start (programmer's MiB: 1024 * 1024). Some of the other systemd-udevd READ_10 commands use LBAs that increase in powers of 2. (Comment 57) That sounds like a "sizing" algorithm. Why would systemd-udevd need to do that? Can you suggest a test program that does something similar? Tom in Comment 62: > Holy cow! It seems to work properly when udevd is well and truly squashed. Thanks for squashing and testing. :-)
Hi, > Holy cow! It seems to work properly when udevd is well and truly squashed. So we possibly have the trigger in form of a SCSI command 28h READ(10) with LBA 512 and read size 2. What happens if you cause that SCSI command with deactivated udev by dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null while the tray is already loaded and the drive stopped blinking. (I doubt that the tray load waiting bug is fixed in kernel 5. If the tray later refuses to come out by its own eject button, use program "eject". That's another bug here when USB is involved.) dd should normally end after 0 to 5 seconds and report 4096 bytes being copied. > ISO session : 1 , 0 , 4101507s , STALLFF > Media summary: 1 session, 4101507 data blocks, 8011m data, 0 free xorriso recognizes an ISO 9660 filesystem. But i assume that this is just a protective hull around an UDF filesystem for video. Well, everything seems to work well. (I still riddle why the drive acquiration code of libburn sends a single SCSI command of which we only see the timestamp in the log. It appears that ioctl HDIO_GET_IDENTITY states that the drive is attached via ATAPI. Very strange.) ------------------------------------------------------------------------ What's still missing for a plausible theory is how USB contributes to the timeout which seems not to happen or not to matter with direct SATA. Depending on the outcome of above dd it would be worth to spy on the SATA attached drive whether udev tries that READ(10) for LBA 512 too and what is the outcome of that try. But first we should learn whether simply reading blocks 512 and 513 really makes the USB connection mad. For now this is the biggest plausibility gap in my theory. It could of course be that READ(10) is a red herring and that udev does something to USB which produces the timeout as mere consequence. Have a nice day :) Thomas
(In reply to Tom Horsley from comment #62) ... > [root@vivo ~]# systemctl stop systemd-udevd ... For the record, could you post the version of systemd-udev you are testing with: $ rpm -q systemd-udev > I better go re-enable udevd now before I forget This should show you the current configuration: $ systemctl list-units --all systemd-udevd\*
Here's the version and unit info: systemd-udev-243.8-1.fc31.x86_64 [tweety@vivo ~]$ systemctl list-units --all systemd-udevd\* UNIT LOAD ACTIVE SUB DESCRIPTION systemd-udevd.service loaded active running udev Kernel Device Manager systemd-udevd-control.socket loaded active running udev Control Socket systemd-udevd-kernel.socket loaded active running udev Kernel Socket LOAD = Reflects whether the unit definition was properly loaded. ACTIVE = The high-level unit activation state, i.e. generalization of SUB. SUB = The low-level unit activation state, values depend on unit type. 3 loaded units listed. To show all installed unit files use 'systemctl list-unit-files'.
I made sure udev was squashed again, verified that xorriso worked well (and the dvd was loaded and ready), then ran the dd command: [tweety@vivo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null dd: error reading '/dev/sr0': Input/output error 0+0 records in 0+0 records out 0 bytes copied, 146.173 s, 0.0 kB/s It sat there with no lights flashing on the drive for a minute or two, ejecting the disk didn't do anything, finally powering off the disk made it exit with errors. I guess it really hates that location on the disk.
Hi, > That sounds like a "sizing" algorithm. Why would systemd-udevd need > to do that? It is a strange pattern, indeed. There is such a size detector named blkdev_find_size() in line 52 ff. of https://github.com/karelzak/util-linux/blob/master/lib/blkdev.c I understand that it starts its search for a high limit at byte 1024. That would still be LBA 0. Then it doubles the byte address. This is not the sequence of 2048-byte LBAs we see in trace-4.txt: 0, 8, 16, 32, ... The missing 1 is covered by the read size of 2 blocks in the READ_10. But LBA 2 and 4 are not covered by the first READ_10 and not read by the subsequent READ_10s. So why don't READ_10s happen for them ? Grrr. Whatever, blkdev_find_size() is the last resort of blkdev_get_size() in line 83. This gets called by blkid_probe_set_device() in line 858 ff. of https://github.com/karelzak/util-linux/blob/master/libblkid/src/probe.c blkid_probe_set_device() is called by function builtin_blkid() in line 272 of https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-blkid.c But why would none of the ioctls in blkdev_get_size() yield a result ? Have a nice day :) Thomas
Hi, > [tweety@vivo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null > dd: error reading '/dev/sr0': Input/output error > 0 bytes copied, 146.173 s, 0.0 kB/s > [...] finally powering off the disk made it exit with errors. Now this is significant. Can i talk you into connecting the drive directly by SATA and to try getting a log like https://bugzilla.redhat.com/attachment.cgi?id=1679161 "kernel debug during run of xorriso hang" which Steve now calls "trace-4.txt" but you called "kernel-log.txt". Goal is to see whether ... READ_10 lba=512 ... gets reported and what is its outcome. It would also be interesting to see whether dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null works with direct SATA. Have a nice day :) Thomas
Out of random curiosity, I tried plugging in the old pioneer slim USB drive which never had problems. The dd command works fine with it, no problem reading that block with the exact same disk in the drive.
Too much trouble to unscrew drives again, but I have a different system with an ASUS BW-16D1HT drive connected directly via sata. It had the same problem when it was in the usb enclosure, so I assume it will act the same. Running the dd command with the movie in the drive fails, but does not hang, just gets I/O errors: [tweety@rebo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null dd: error reading '/dev/sr0': Input/output error 0+0 records in 0+0 records out 0 bytes copied, 2.94269 s, 0.0 kB/s So it looks like something about the USB enclosure is causing the hang and it still hates that address on the disk even with a direct sata connection.
(In reply to Thomas Schmitt from comment #70) > which Steve now calls "trace-4.txt" but you called "kernel-log.txt". So the next one will be "trace-5.txt". :-) Using incremental version numbers prevents name collisions when downloading several attachments. They also make it easier to discuss specific files. > Goal is to see whether > ... READ_10 lba=512 ... > gets reported and what is its outcome. Tom: If you do that again, could you post the commands you used to enable tracing so we have a record of how the trace log was collected? They are probably still in your shell history, so this should find them: # history | grep '/sys/kernel' > It would also be interesting to see whether > dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null > works with direct SATA. I was going to speculate that there is something about the way the DVD was mastered. Even with a SATA-attached DVD drive, I have had inconsistent results with commercial DVDs. Sometimes they cannot be read. (Admittedly, the DVDs are from the library, which means that they are not always in perfect condition.) So that suggests another test -- what happens with a different commercial DVD? > Have a nice day :) > > Thomas
Hi, > ASUS BW-16D1HT drive connected directly via sata. > [tweety@rebo ~]$ dd if=/dev/sr0 bs=2048 skip=512 count=2 of=/dev/null > dd: error reading '/dev/sr0': Input/output error > 0 bytes copied, 2.94269 s, 0.0 kB/s > So it looks like something about the USB enclosure is causing the hang and > it still hates that address on the disk even with a direct sata connection. Yes. The READ_10 command fails for some systematic reason of DVD and drive (hardware or firmware). The dramatic reaction with USB must be something between USB box and drive or between USB box and the Linux USB drivers. Do you see messages about sr0 and "Read(10)" from the SATA experiment ? What do you get from this xorriso run with the SATA attached drive: xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 -- (This run will emit a READ(10) for LBA 512 with size 2. It should report any SCSI error codes from that command.) Have a nice day :) Thomas
With "Do you see messages" i meant "in dmesg".
(In reply to Thomas Schmitt from comment #69) ... > There is such a size detector named > blkdev_find_size() > in line 52 ff. of > https://github.com/karelzak/util-linux/blob/master/lib/blkdev.c ... Thanks for that link. systemd-udevd links to a library from util-linux: $ rpm -q --requires systemd-udev | grep -m1 libblkid libblkid.so.1()(64bit) Indeed, there is a call to a blkid sizing function here: $ git -C systemd-git grep blkid_probe_get_size src/udev/udev-builtin-blkid.c: blkid_probe_get_size(pr) <= 1024 * 1440 && However, that file appears to be a copy: $ git -C systemd-git grep 'Karel Zak' src src/udev/udev-builtin-blkid.c: * Copyright © 2011 Karel Zak <kzak> The systemd repo is here: https://github.com/systemd/systemd
(In reply to Steve from comment #76) > ... systemd-udevd links to a library from util-linux: Strictly speaking, libblkid.so.1 is in a sub-package: $ rpm -qf /usr/lib64/libblkid.so.1 libblkid-2.34-4.fc31.x86_64 $ rpm -qi libblkid | grep Source Source RPM : util-linux-2.34-4.fc31.src.rpm
Created attachment 1679482 [details] trace of loading disk on direct connect sata drive Here's the trace I got of simply loading the disk in the ASUS sata drive. There are 4 instances of READ_10 lba=512 in there. I had previously made a script to log stuff, so I just ran it again (it didn't like the attempt to enable scsi_mod, I guess that files doesn't exist): #!/bin/sh echo 'module sr_mod =pmf' > /sys/kernel/debug/dynamic_debug/control echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control echo 'module usbcore =pmf' > /sys/kernel/debug/dynamic_debug/control echo 'module scsi_mod =pmf' > /sys/kernel/debug/dynamic_debug/control echo 'module scsi =pmf' > /sys/kernel/debug/dynamic_debug/control echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable echo 1 > /sys/kernel/debug/tracing/events/scsi/enable echo 1 > /sys/kernel/debug/tracing/events/scsi_mod/enable Probably the usb related stuff is of no value as well. I tried several different DVDs early on with the same hangs on all of them, so I don't think there is anything special about the DVD I've been sticking with for these tests.
(In reply to Thomas Schmitt from comment #74) > Do you see messages about sr0 and "Read(10)" from the SATA experiment ? dmesg has lots of this stuff in it (just a sample): [ 107.297869] sr 5:0:0:0: [sr0] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 107.297876] sr 5:0:0:0: [sr0] tag#7 Sense Key : Illegal Request [current] [ 107.297881] sr 5:0:0:0: [sr0] tag#7 Add. Sense: Read of scrambled sector without authentication [ 107.297885] sr 5:0:0:0: [sr0] tag#7 CDB: Read(10) 28 00 00 00 02 00 00 00 02 00 [ 107.297889] blk_update_request: I/O error, dev sr0, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 109.397832] sr 5:0:0:0: [sr0] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 109.397840] sr 5:0:0:0: [sr0] tag#15 Sense Key : Illegal Request [current] [ 109.397844] sr 5:0:0:0: [sr0] tag#15 Add. Sense: Read of scrambled sector without authentication [ 109.397848] sr 5:0:0:0: [sr0] tag#15 CDB: Read(10) 28 00 00 00 02 01 00 00 01 00 [ 109.397852] blk_update_request: I/O error, dev sr0, sector 2052 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 109.397856] Buffer I/O error on dev sr0, logical block 513, async page read [ 110.546831] sr 5:0:0:0: [sr0] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [ 110.546838] sr 5:0:0:0: [sr0] tag#21 Sense Key : Illegal Request [current] [ 110.546842] sr 5:0:0:0: [sr0] tag#21 Add. Sense: Read of scrambled sector without authentication [ 110.546846] sr 5:0:0:0: [sr0] tag#21 CDB: Read(10) 28 00 00 3e 94 80 00 00 40 00 [ 110.546851] blk_update_request: I/O error, dev sr0, sector 16404992 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0 > > What do you get from this xorriso run with the SATA attached drive: > > xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 -- > > (This run will emit a READ(10) for LBA 512 with size 2. It should report > any SCSI error codes from that command.) [tweety@rebo ~]$ xorriso -outdev /dev/sr0 -check_media use=outdev min_lba=512 max_lba=513 -- xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project. xorriso : NOTE : Disc status unsuitable for writing Drive current: -outdev '/dev/sr0' Media current: DVD-ROM Media status : is written , is closed Media summary: 1 session, 4101507 data blocks, 8011m data, 0 free libburn : SORRY : SCSI error on read_10(512,2): [5 6F 03] Illegal request. Read of scrambled sector without authentication. xorriso : UPDATE : 2 of 2 blocks read in 1 s , 0.0xD xorriso : UPDATE : 2 of 2 blocks read in 1 s = 0.0xD Media checks : lba , size , quality Media region : 0 , 512 , 0 untested Media region : 512 , 2 , - unreadable Media region : 514 , 4100993 , 0 untested xorriso : NOTE : Tolerated problem event of severity 'SORRY' xorriso : NOTE : -return_with SORRY 32 triggered by problem severity SORRY Looks like it knows something about copy protection in the hardware. Possibly the old pioneer drive just goes ahead and reads the scrambled data without complaint.
(In reply to Tom Horsley from comment #78) Thanks for posting the shell script and for attaching trace-5.txt. There are more SCSI devices on this system: $ cat trace-5.txt | awk '/^#/ { next }; { print $6 }' | sort -u host_no=0 host_no=1 host_no=2 host_no=4 host_no=5 Could you also attach dmesg-5.txt from the same system you used to run your test: $ journalctl --no-hostname -k > /tmp/dmesg-5.txt And to simplify further discussion, could you suggest a name for it? (ASUS desktop?)
Created attachment 1679492 [details] systemd-udevd records with SCSI commands extracted from trace-5.txt. systemd-udevd records with SCSI commands extracted from trace-5.txt. The records all have "host_no=5", but I am simply assuming that that is the SCSI id for the SATA-connected ASUS DVD drive. Command-line is: $ cat trace-5.txt | awk '/^ *systemd-udevd-.*cmnd=.*$/ { pid=$1; ts=$4; n=match($0, "cmnd=.*$"); print pid, ts, substr($0, n) }'
Hi, > [ 107.297876] sr 5:0:0:0: [sr0] tag#7 Sense Key : Illegal Request [current] > [ 107.297881] sr 5:0:0:0: [sr0] tag#7 Add. Sense: Read of scrambled sector > without authentication > [ 107.297885] sr 5:0:0:0: [sr0] tag#7 CDB: Read(10) 28 00 00 00 02 00 00 00 > 02 00 That's indeed a READ_10 to LBA 0x00000200 = 512. The "Add. Sense:" text is the cleartext for SCSI error code 5 6F 03. MMC-5 has about READ(10) in section 6.15.3: "Any read by the Host to a Logical Block with a Title Key present in the sector (DVD-ROM Media Only), when the Authentication Success Flag (ASF) is set to zero, shall be blocked. The command shall be terminated with CHECK CONDITION status and SK/ASC/ASCQ values shall be set to ILLEGAL REQUEST/READ OF SCRAMBLED SECTOR WITHOUT AUTHENTICATION." Section 4.10 "Content Protection" outlines various copy prevention methods: CSS, CPRM/CPPM, VCPS, AACS. Quite complicated and surely not my kind of programming topic. > libburn : SORRY : SCSI error on read_10(512,2): [5 6F 03] Illegal request. > Read of scrambled sector without authentication. So libburn and kernel experience the same. > Looks like it knows something about copy protection in the hardware. libburn only knows the error text associated with [5 6F 03]. The drive firmware indeed seems to know more. > Possibly the old pioneer drive just goes ahead and reads the scrambled > data without complaint. Good theory. What do you get to see when dd-ing them from old pioneer through a hex dumper ? Like dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -t x1 | less dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -c | less Zeros ? Cleartext snippets with od -c ? --------------------------------------------------------------------- The question remains why a normal SCSI error condition wrecks the USB connection. So back to trace-4.txt, where USB stuff is logged. I compared the lines from scsi_dispatch_cmd_start ... READ_10 lba=256 ... to scsi_dispatch_cmd_done ... READ_10 lba=256 ... with the lines from scsi_dispatch_cmd_start ... READ_10 lba=512 ... to scsi_dispatch_cmd_done ... READ_10 lba=512 ... following usb-storage-4803 which seems to be in charge of processing the READ_10s. They begin to differ after a dozen lines. lba=256: usb-storage-4803 [001] .... 10101.875226: xhci_urb_enqueue: ep1in-bulk: urb 0000000036e6ba93 pipe 3221259392 slot 7 length 0/13 sgs 0/0 stream 0 flags 00000204 usb-storage-4803 [001] d... 10101.875233: xhci_queue_trb: BULK: Buffer 000000040615c080 length 13 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c versus lba=512: usb-storage-4803 [000] .... 10132.027142: xhci_urb_enqueue: ep0out-control: urb 0000000072fd3232 pipe 2147484672 slot 7 length 0/0 sgs 0/0 stream 0 flags 00000000 usb-storage-4803 [000] d... 10132.027146: xhci_queue_trb: CTRL: bRequestType 00 bRequest 01 wValue 0030 wIndex 0000 wLength 0 length 8 TD size 0 intr 0 type 'Setup Stage' flags I:i:c from then on they run apart. No idea what this USB stuff means, though. Have a nice day :) Thomas
This is easier now that we know what to look for. :-) These are the only trace records with a 'host_no' field and 'lba=512': $ cat trace-5.txt | grep -n 'host_no=[0-9]' | grep 'lba=512' 8703: systemd-udevd-2532 [001] .... 627.381674: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) 8716: kworker/6:1H-469 [006] .... 627.424401: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) 8718: kworker/1:1H-536 [001] .... 627.441450: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) 8789: <idle>-0 [006] d.h. 628.987442: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) NB: I added newlines to make the records slightly more readable.
(In reply to Thomas Schmitt from comment #82) ... > ... when the Authentication Success Flag (ASF) is set to zero, shall be blocked. ... Who can set that flag?
Alan, upstream, enables usbmon tracing: https://bugzilla.kernel.org/show_bug.cgi?id=207219#c1 But the output can only be interpreted by a USB expert. The output is in the attachments here: Bug_207219 - [BISECTED] Sony Vaio laptop built-in ricoh webcam no longer found on Fedora 31 with 5.4.11 kernel https://bugzilla.kernel.org/show_bug.cgi?id=207219 So if the problem is now down to debugging at the USB level, I would suggest opening a bug upstream under Drivers/USB: https://bugzilla.kernel.org/
Created attachment 1679500 [details] dmesg from system where trace-5 was done This is a dmesg right after booting the system again, so there won't be any dvd errors in here, but you can see all the devices and such. The host names I've given these systems are: zooty - my main system I use all the time and the one where I first encountered the bug, but very little of the rest of the testing was done on zooty. vivo - the ASUS mini PC which normally runs windows and where I've done 90% of the testing using an external USB drive where I installed fedora 31. (This is where I learned Windows 10 does not have a problem with DVDs in the external drive). rebo - the system where I've done the latest tests using the sata ASUS drive. It also accumulated a lot of disks left over from other systems, so there are indeed many sata drives in here. This dmesg is from rebo.
(In reply to Thomas Schmitt from comment #82) > What do you get to see when dd-ing them from old pioneer through > a hex dumper ? Like > > dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -t x1 | less > > dd if=/dev/sr0 bs=2048 skip=512 count=2 | od -c | less > > Zeros ? Cleartext snippets with od -c ? It looks quite a lot like random noise. Not zeroes, but nothing intelligible.
(In reply to Steve from comment #85) > Alan, upstream, enables usbmon tracing: > > https://bugzilla.kernel.org/show_bug.cgi?id=207219#c1 BTW, usbmon tracing is easy to try out: Figure out which USB bus your mouse is on.* Mine is on 3. And then run: # cat /sys/kernel/debug/usb/usbmon/3u As you move your mouse, output will scroll by in your terminal window. NB: I don't know what the "s", "t", and "u" suffixes mean: # ls /sys/kernel/debug/usb/usbmon 0s 0u 1s 1t 1u 2s 2t 2u 3s 3t 3u 4s 4t 4u * Run: $ lsusb And look for the bus number for your mouse.
(In reply to Tom Horsley from comment #86) > Created attachment 1679500 [details] > dmesg from system where trace-5 was done > > This is a dmesg right after booting the system again, so there won't be any > dvd errors in here, but you can see all the devices and such. Thanks. That's perfect: $ grep 'scsi [0-9]' dmesg-rebo-1.txt [ 1.423646] scsi 0:0:0:0: Direct-Access ATA M4-CT064M4SSD2 000F PQ: 0 ANSI: 5 [ 1.425494] scsi 1:0:0:0: Direct-Access ATA Hitachi HDS5C303 A5C0 PQ: 0 ANSI: 5 [ 1.426253] scsi 2:0:0:0: Direct-Access ATA WDC WD4003FZEX-0 1A01 PQ: 0 ANSI: 5 [ 1.426976] scsi 4:0:0:0: Direct-Access ATA WDC WD4003FZEX-0 1A01 PQ: 0 ANSI: 5 [ 1.430499] scsi 5:0:0:0: CD-ROM ASUS BW-16D1HT 3.00 PQ: 0 ANSI: 5 > The host names I've given these systems are: Great names and informative background: > zooty - my main system I use all the time and the one where I first > encountered the bug, but very little of the rest of the testing was done on > zooty. > > vivo - the ASUS mini PC which normally runs windows and where I've done 90% > of the testing using an external USB drive where I installed fedora 31. > (This is where I learned Windows 10 does not have a problem with DVDs in the > external drive). > > rebo - the system where I've done the latest tests using the sata ASUS > drive. It also accumulated a lot of disks left over from other systems, so > there are indeed many sata drives in here. This dmesg is from rebo.
(In reply to Steve from comment #88) ... > BTW, usbmon tracing is easy to try out: ... Here is some documentation: usbmon https://www.kernel.org/doc/html/latest/usb/usbmon.html
Wireshark can capture USB traffic using the /dev/usbmon devices: $ ls /dev/usbmon* /dev/usbmon0 /dev/usbmon1 /dev/usbmon2 /dev/usbmon3 /dev/usbmon4 USB capture setup https://wiki.wireshark.org/CaptureSetup/USB
I'm already in group "wireshark", but I couldn't run as a non-privileged user. However, this works as a demonstration using the bus that my USB keyboard and mouse are on (3). All traffic on the bus is captured, so in the below example, both key presses and mouse movements generated capture data (not shown). There are options to report more information: "man tshark". The wireshark GUI could be used instead. $ sudo tshark -i usbmon3 Running as user "root" and group "root". This could be dangerous. Capturing on 'usbmon3' 1 0.000000 host → 3.3.0 USB 64 GET DESCRIPTOR Request DEVICE ... 18 packets captured
(In reply to Steve from comment #92) > I'm already in group "wireshark", but I couldn't run as a non-privileged user. The /dev/usbmon device nodes are in group "usbmon"*, so I took a guess and added myself to that group. Now I can capture USB data without running as root. The next problem is that wireshark dissects USB packets, but it shows nothing but "USBMS" and binary data for the SCSI layer. However, it is fundamentally working, because when I attached an external USB hard drive with an OS installed on it, I found this in one of the packets: 0000 eb 63 90 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0 .c.............. ... 0180 7d e8 2e 00 cd 18 eb fe 47 52 55 42 20 00 47 65 }.......GRUB .Ge 0190 6f 6d 00 48 61 72 64 20 44 69 73 6b 00 52 65 61 om.Hard Disk.Rea 01a0 64 00 20 45 72 72 6f 72 0d 0a 00 bb 01 00 b4 0e d. Error........ ... That was captured while using the wireshark GUI, which works well. Some packet filters can be applied, for example: "usb.transfer_type == 0x03". According to the GUI that means: "URB transfer type: URB_BULK (0x03)". * $ ls -l /dev/usbmon* crw-r-----. 1 root usbmon 244, 0 Apr 16 06:56 /dev/usbmon0 crw-r-----. 1 root usbmon 244, 1 Apr 16 06:56 /dev/usbmon1 crw-r-----. 1 root usbmon 244, 2 Apr 16 06:56 /dev/usbmon2 crw-r-----. 1 root usbmon 244, 3 Apr 16 06:56 /dev/usbmon3 crw-r-----. 1 root usbmon 244, 4 Apr 16 06:56 /dev/usbmon4
(In reply to Thomas Schmitt from comment #82) ... versus lba=512: usb-storage-4803 [000] .... 10132.027142: xhci_urb_enqueue: ep0out-control: urb 0000000072fd3232 pipe 2147484672 slot 7 length 0/0 sgs 0/0 stream 0 flags 00000000 usb-storage-4803 [000] d... 10132.027146: xhci_queue_trb: CTRL: bRequestType 00 bRequest 01 wValue 0030 wIndex 0000 wLength 0 length 8 TD size 0 intr 0 type 'Setup Stage' flags I:i:c ... The kernel source for that 'Setup Stage' message is here: drivers/usb/host/xhci.h: static inline const char *xhci_decode_trb(...) ... case TRB_SETUP: sprintf(...) ... https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/usb/host/xhci.h#n2248 An upper-case flag means a set bit, a lower-case flag means an unset bit.
Hi, MMC-5, 6.15.3 says: > > when the Authentication Success Flag (ASF) is set to zero, shall be > > blocked. Steve asks: > Who can set that flag? That's subject to research in MMC specs and maybe specs about the copy prevention method that is used. So first the prevention method and the purpose of the encrypted blocks would have to be determined. I guess this means to dive into MS-Windows-centric DVD ripper fora and start there as total noob with the computational handicap of using GNU/Linux. The encrypted blocks seem not to be used by computer software for the normal operation of Tom's DVD. They might be needed for extra features of the DVD, which Tom's usage does not try to provide. Or they might be needed for the internal negociation between drive and medium about the decryption key for the other blocks which contain the viceo payload. Last time i looked, MMC specs were available for 30 USD at t10.org. Google ... https://www.t10.org/cgi-bin/ac.pl?t=f&f=mmc5r04.pdf shows a "Guest Access" form, which may or may not give access to the last draft version of MMC-5 for free. https://www.t10.org/drafts.htm#MMC_Family shows "purchase" links for MMC-2 to MMC-6. The numbers are revisions. So you do not need the older versions to learn about recent features. Older versions can be informative about the specs history, though. (I was quite disappointed by MMC-6, as MMC-5 has the better PDF navigation elements and MMC-6 seemed not to contain anything new for my purpose.) Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #95) OK on all that. After looking at the xhci source code (Comment 94), I am now convinced that we are not going to debug this, and that this needs to be moved upstream: https://bugzilla.kernel.org And, IMO, the OWC firmware is somehow violating the USB protocol.
(In reply to Steve from comment #96) > And, IMO, the OWC firmware is somehow violating the USB protocol. BTW, if Window authenticates, everything is hunky-dory.
(In reply to Steve from comment #97) > (In reply to Steve from comment #96) > > And, IMO, the OWC firmware is somehow violating the USB protocol. > > BTW, if Window authenticates, everything is hunky-dory. Conjecture: The OWC firmware is trying to stop hosts from reading certain "secret" sectors, but trying to do that in the USB layer by sniffing SCSI packets. If so, that is a design flaw in the OWC firmware, because it should not be looking into the encapsulated packets it is supposed to be just transferring.
Hi, now that the SCSI layer aspects seem to be explained and the riddles moved out of my scope i will enter lurk mode with this bug report. It now needs other experts. ------------------------------------------------------------------- Summary of the findings so far: The BD/DVD/CD drive LG WH16NS40 (and probably others) refuses to read block 512 of a commercial video DVD, whereas an older Pioneer drive delivers the data of this block. The WH16NS40 returns SCSI error code: 5 6F 03 READ OF SCRAMBLED SECTOR WITHOUT AUTHENTICATION When attached to the computer directly by SATA, this error reply does not cause further harm. But if the SATA-USB converter of an USB box is involved, the READ(10) SCSI command does not return but rather times out after 30 seconds. The drive stays unusable until power-cycle. The USB box is reported by dmesg as: New USB device found, idVendor=1e91, idProduct=de2c, bcdDevice= 1.00 New USB device strings: Mfr=2, Product=3, SerialNumber=1 Product: Mercury Pro Optical Manufacturer: Other World Computing The "scrambled sector" seems not to be read by unauthorized READ(10) commands during normal operation of the DVD. No error appears. But for some reason, udev reads block 0, 8, 32, 64, 128, and finally tries block 512, where it causes the timeout and the loss of connection to the USB box. It looks like udev, or libblkid, or util-linux cannot determine the DVD size by other means than a binary search for the last readable sector. Normally there are ioctls which can do that (or SCSI commands like 25Ah READ CAPACITY). The main riddle is now how a regular and expectable SCSI error condition can break the connection between USB box and the higher Linux SCSI layers. ----------------------------------------------------------------------- Have a nice day :) Thomas
(In reply to Steve from comment #98) > (In reply to Steve from comment #97) > > (In reply to Steve from comment #96) > > > And, IMO, the OWC firmware is somehow violating the USB protocol. > > > > BTW, if Window authenticates, everything is hunky-dory. > > Conjecture: The OWC firmware is trying to stop hosts from reading certain "secret" sectors, but trying to do that in the USB layer by sniffing SCSI packets. > > If so, that is a design flaw in the OWC firmware, because it should not be looking into the encapsulated packets it is supposed to be just transferring. Thomas: You gave me that thought with some of your previous comments, including this one: "What's still missing for a plausible theory is how USB contributes to the timeout which seems not to happen or not to matter with direct SATA." (Comment 65) The short answer: The OWC firmware is sniffing SCSI packets. BTW, I still highly recommend trying out Wireshark's USB capture facilities. Doing that gave me the insight to propose that explanation.
Hi, > The OWC firmware is sniffing SCSI packets. This is not impossible, but: The box would intrude a layer of communications which is not in its scope as data transport device. Judgement of suitability of commands is a matter of Linux SCSI layers and the DVD drive's firmware. If the consequences are intentional, the box would react very harshly to an attempt which already the drive prevents in the interest of the DVD vendor. The box itself cannot know which sectors of the DVD are banned for unauthorized reading. It has to rely on the drive's judgement and would then decide to aggravate the drive's reaction. If the consequences are not intentional, then the box would stop working by mistake or would drive the USB layer(s) of Linux nuts by behavior which Linux does not expect. This is the case where debugging by a Linux USB expert would help. ----------------------------------------------------------------------- We could of course try to find out why udev decides to do the read experiments with the DVD. This is not normal. There is also the minor riddle why libburn's Linux adapter emits one own SCSI command when acquiring the drive. Source study indicates that it is a TEST UNIT READY command caused by ioctl(fd, HDIO_GET_IDENTITY, &tm) returning (tm.config & 0x8000) || (tm.config & 0x4000) which it should only do if the old /dev/hd driver controls the DVD drive as ATAPI device. I plan to work on this. But the big riddle is why the drive in the USB box cannot be reached any more via the Linux SCSI layers after the failed read attempt. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #101) > Hi, > > > The OWC firmware is sniffing SCSI packets. > > This is not impossible, but: > > The box would intrude a layer of communications which is not in its > scope as data transport device. Judgement of suitability of commands > is a matter of Linux SCSI layers and the DVD drive's firmware. ... I completely agree. And, if I am right, the OWC enclosure is a defective product, because it is just supposed to transfer SCSI packets, not read them. > The box itself cannot know which sectors of the DVD are banned for unauthorized reading. The OWC firmware authors can read the appropriate specifications. From your results and Tom's, sector 512 is a "secret" sector. I couldn't find an exact statement, but the Wikipedia article on the "Content Scramble System"* has a link to: Standard ECMA-267 3rd Edition - April 2001 120 mm DVD - Read-Only Disk https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-267.pdf See "Section 4 - Data Format". I couldn't see an exact statement that sector 512 is a "secret" sector, but maybe I didn't look carefully enough. * https://en.wikipedia.org/wiki/Content_Scramble_System
(In reply to Thomas Schmitt from comment #101) ... > We could of course try to find out why udev decides to do the read experiments with the DVD. This is not normal. ... Yes. A properly mastered DVD should be fully indexed, so there would be no need to go poking around in the sectors just to see what is there. :-) As you suggested before*, Windows could be simply avoiding the "secret" sector. So Linux should do the same thing when it detects a video DVD. That is a change that could be done in user space (in systemd-udevd/blkid). Indeed, I would propose changing the component of this bug to "systemd". * "If it [Windows] does not try to read LBA 512 out of mere curiosity, and if the filesystem on the DVD does not advertise LBA 512 as part of any read-worthy object, then MS-Windows could stay away from that address and experience no timeout." (Comment 61)
Hi, > Standard ECMA-267 3rd Edition - April 2001 That's about the mechanical engineering, electrical engineering, and - especially for people like me - algebra on finite rings and fields which does the error prevention and correction for DVD. The part for our problem is extremely sparse, though: 16.3 Copyright Management Information (CPR_MAI) This field shall consist of 6 bytes. Their setting is application- dependent, for instance a video application. If this setting is not specified by the application, the default setting shall be to set all bytes to all ZEROs. The other spot that looks like mentioning our problem is actually about prevention of systematic transmission errors by non-secret scrambling: 17 Scrambled Frames I assume that the interested parties defined own specs on base of 16.3. Possibly one has to be member of some consortium to get to see those. ---------------------------------------------------------------------- > As you suggested before*, Windows could be simply avoiding the > "secret" sector. Or both, MS-Windows and Tom's dvdbackup, know how to authorize for access to block 512. Obviously udev and i don't know. > Indeed, I would propose changing the component of this bug to "systemd". One could spin off a systemd bug report which only mentions the fact that udev seems not to be able to learn the size by other means. The drive's refusal to read block 512 could possibly derail the size determination by binary search, so that a wrong size is used in udev rules. (I am not sure whether i have found the code which does the search. So i cannot tell what a failed READ(10) will trigger.) But i am quite sure that systemd developers will not take responsibility for the USB hangs. Have a nice day :) Thomas
My thought is that Windows doesn't go reading random addresses like udev apparently does. After all, when I squashed udev, xorriso was able to read all the info it printed just fine and apparently didn't touch any protected sectors. But it does sure seem like USB ought to be able to return an error rather than hang. I was able to try this in a completely different USB enclosure and got the same hangs, so I doubt there is any specific OWC firmware doing something tricky, both the enclosures are probably using the same chips and getting the same results.
Meanwhile I tried to see if I could make a kernel bug, but it tells me I'm not allowed to create a new account.
The only chip google could find for sata to usb3 is from TI and the spec sheet says it comes with default firmware. I have this feeling no one in the external enclosure business would feel like modifying the firmware. I definitely remember seeing a surface mount chip in both enclosures that looked about 7mm on a side: http://www.ti.com/lit/ds/symlink/tusb9261.pdf
(In reply to Tom Horsley from comment #105) > But it does sure seem like USB ought to be able to return an error rather than hang. Refining my theory slightly: The OWC firmware is crashing. There could be a firmware update. If not, it might be possible to contact OWC and ask for tech support. > I was able to try this in a completely different USB enclosure and got the > same hangs, so I doubt there is any specific OWC firmware doing something > tricky, both the enclosures are probably using the same chips and getting > the same results. What is the enclosure? $ lsusb Also, how did you change the region code? Was that done by a method supported by OWC?
(In reply to Tom Horsley from comment #107) > The only chip google could find for sata to usb3 is from TI and the spec > sheet says it comes with default firmware. I have this feeling no one in the > external enclosure business would feel like modifying the firmware. I > definitely remember seeing a surface mount chip in both enclosures that > looked about 7mm on a side: > > http://www.ti.com/lit/ds/symlink/tusb9261.pdf A search for "SATA USB bridge asmedia" found this: ASM235CM https://www.asmedia.com.tw/eng/e_show_products.php?cate_index=172&item=177 "ASM235CM is the latest generation of ASMedia’s single chip solution, bridging the USB3.1 to Serial ATA host interface with highly integrated SuperSpeedPlus, SuperSpeed, High Speed, SATA 1.5/3.0/6.0 Gbps ASMedia self-designed PHYs as well as built-in CC logic circuit enabling reversible USB cable orientation and direction. Along with excellent compatibility with USB3.1 hosts and SATA devices, ASM235CM utilizes advanced process technology to optimize the chip power consumption."
This Plugable page explicitly mentions firmware: "We have made the following table to help you find out if there is a firmware update available for your hard drive dock and what the updates are for." However: "The manufacturers of the chipsets used in our docks (ASMedia, JMicron, LucidPort) do not make update programs for Apple Mac OS X or Linux/Unix." Plugable USB 3.0 Hard Drive Dock Firmware History & Updates https://plugable.com/drivers/hddockfirmware/
Hi, > I was able to try this in a completely different USB enclosure and got the > same hangs, so I doubt there is any specific OWC firmware doing something > tricky This info should be added to the potential bug report towards Linux USB. > My thought is that Windows doesn't go reading random addresses Quite plausible. But it is still not outruled that MS-Windows and dvdbackup know how to enable decryption. After all your drive is entitled to read the vast majority of the data. You could get an impression how many blocks are encrypted by a full -check_media run with possibly an error message every 32 KiB of data. xorriso -outdev /dev/sr0 \ -check_media use=outdev what=disc time_limit=1800 -- \ 2>&1 | tee -i /tmp/xorriso_check_media_log This will try to read the blocks from 0 to 4101506 in chunks of 32 KiB with hopefully no repetition. The attempts will go on at most 1800 seconds, not to overwork your drive. But i assume you will get possible error messages with high speed. If you see much good progress and the xorriso run ends by xorriso : NOTE : -check_media: Reached time_limit=1800 then repeat it without "time_limit=1800". The run can be aborted by Ctrl+C, or more gracefully by touch /var/opt/xorriso/do_abort_check_media In the end there will be printed a summary about the readable and unreadble block intervals. Maybe like: Media checks : lba , size , quality Media region : 0 , 512 , + good Media region : 512 , 1024 , - unreadable Media region : 1536 , 20480 , + good Media region : 22016 , 4079490 , - unreadable If nearly the whole DVD is encrypted, then the success of reader software means that it can enable the decryption capabilities of the drive. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #61) ... > > Some have system-level privileges, so they could take control of the DVD drive and lock out all other processes. > > There is no such privileged control known to me. > Success of open(2) depends on device file permissions or Linux capabilities(7) of the process. ... That theory is obsolete now, but thank-you for citing those man pages and for citing O_EXCL, specifically.
OWC says they support Linux here: https://www.owcdigital.com/products/mercury-pro-optical#tech-specs And OWC appears to have excellent support, including by chat, phone, and email: https://www.owcdigital.com/support/contact Since you have a reliable and fairly simple reproducer, it might be useful ask OWC to take a look at the problem.
Finally got the kernel bug submitted: https://bugzilla.kernel.org/show_bug.cgi?id=207327 Some info I apparently neglected to save previously is missing, so here it is again: The other enclosure I got a chance to test was a Vantec NST-536S3-BK NexStar DX (but I don't have it any longer, it was borrowed). It hug the same way. I set the region in the LG drive using the linux regionset utility. While I had it connected to windows, I dove into the drive properties and found Windows believed it was a region 1 drive, so regionset seems to have worked properly.
(In reply to Thomas Schmitt from comment #111) > You could get an impression how many blocks are encrypted by a full > -check_media run with possibly an error message every 32 KiB of data. > > xorriso -outdev /dev/sr0 \ > -check_media use=outdev what=disc time_limit=1800 -- \ > 2>&1 | tee -i /tmp/xorriso_check_media_log Seems to be virtually all of them. It starts like this: xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project. xorriso : NOTE : Disc status unsuitable for writing Drive current: -outdev '/dev/sr0' Media current: DVD-ROM Media status : is written , is closed Media summary: 1 session, 4101507 data blocks, 8011m data, 0 free xorriso : UPDATE : 32 of 4101507 blocks read in 1 s , 0.0xD libburn : SORRY : SCSI error on read_10(288,32): [5 6F 03] Illegal request. Read of scrambled sector without authentication. Then the illegal request errors go on for as long as I could stand watching them :-). (This was, of course, on rebo with the sata connected drive).
(In reply to Tom Horsley from comment #114) > Finally got the kernel bug submitted: > > https://bugzilla.kernel.org/show_bug.cgi?id=207327 That bug id can also be put in the "Links" field at the top of the bug page. "Linux Kernel" seems to be the right choice for the "System". > Some info I apparently neglected to save previously is missing, so here it is again: > > The other enclosure I got a chance to test was a Vantec NST-536S3-BK NexStar DX > (but I don't have it any longer, it was borrowed). It hug the same way. Thanks for your followup reply. The specs page for the Vantec enclosure doesn't mention Linux, so that is a difference from the OWC enclosure: https://www.vantecusa.com/products_detail.php?p_id=51&p_name=Vantec#tab-2 > I set the region in the LG drive using the linux regionset utility. While I had it connected to windows, I dove into the drive properties > and found Windows believed it was a region 1 drive, so regionset seems to have worked properly. Thanks, I didn't know about that: $ dnf -Cq repoquery --whatprovides \*/regionset regionset-0:0.2-13.fc31.x86_64
Hi, > > You could get an impression how many blocks are encrypted > Seems to be virtually all of them. Then they don't get avoided by MS-Windows and dvdbackup, but it's rather that those readers know more and better about DVD encryption than udev and myself. I don't feel much embarrassed for my ignorance in this aspect. :)) But if a bug report to systemd/udev is submitted, it should be mentioned that dvdbackup seems to know how to enable the read operations which would be needed for the binary search for the device's size. > libburn : SORRY : SCSI error on read_10(288,32): [5 6F 03] Illegal request. > Read of scrambled sector without authentication. So the encrypted part begins at latest at block 288+31 = 319. > Then the illegal request errors go on for as long as I could stand watching If its only about the ugliness of error messages then we can suppress them by xorriso command -report_about : xorriso -report_about update \ -outdev /dev/sr0 \ -report_about failure \ -check_media use=outdev what=disc time_limit=1800 -- The messages about drive acquiration will appear. But then all messages which would not end the program run will be suppressed. The final summary will be emitted on standard output, nevertheless. It would give us some statistical evidence, beyond ugliness. ----------------------------------------------------------------------- Steve wrote: > https://www.owcdigital.com/support/contact If Tom, the customer, wants to submit a request, then i propose to point the support to my summary in comment 99 https://bugzilla.redhat.com/show_bug.cgi?id=1822948#c99 and to ask whether they have an idea how to learn more about what is derailing the USB connection. The content of "kernel-debug.txt" (alias "trace-4.txt") is the deepest raw info of the problem we have for now. https://bugzilla.redhat.com/attachment.cgi?id=1679161 An example of the mishap begins at line systemd-udevd-7136 [000] .... 10101.875658: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) is then probably handled by multiple occasions of usb-storage-4803 ... until timeout happens kworker/1:1H-335 [001] .... 10132.015859: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) ----------------------------------------------------------------------- > https://bugzilla.kernel.org/show_bug.cgi?id=207327 Aha. You already made use of my summary. Consider to add info how much of the DVD is found to be unreadable without decryption preparations. (In trace-4 are successful reads to blocks near the DVD's end.) Now i am curious ... Have a nice day :) Thomas
dvdbackup's goal in life is to dig up the decryption keys using libdvdcss (now available in the "tainted" rpmfusion repos), VLC for windows has the same code. I think as far as just loading the DVD, Windows doesn't touch anything but readable headers to get the label and wot-not (much like the info xorriso prints if I have udev turned off first).
(In reply to Thomas Schmitt from comment #117) Since you are delurking ... :-) Could you comment on the SCSI READ_10 commands in the file* attached in Comment 81? That is the direct-to-SATA case (trace-5.txt), and it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence with the other powers-of-two reads. There are also non-sequential READ_10s beyond the lba=512 read (presumably because the drive/enclosure does not hang there in the direct-to-SATA case). * systemd-udevd-scsi-commands-from-trace-5-txt-1.txt
(In reply to Tom Horsley from comment #118) > dvdbackup's goal in life is to dig up the decryption keys using libdvdcss > (now available in the "tainted" rpmfusion repos), VLC for windows has the > same code. I think as far as just loading the DVD, Windows doesn't touch > anything but readable headers to get the label and wot-not (much like the > info xorriso prints if I have udev turned off first). Wireshark runs on Windows, so you might be able get a USB capture for that case: https://www.wireshark.org/#download
Created attachment 1679760 [details] Screenshot showing Wireshark on Windows 10 capturing USB traffic including a SCSI Read Capacity command This screenshot shows Wireshark running on Windows 10. The capture shows USB packets as an external USB disk drive was attached to the Windows 10 system. In particular, it shows that Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive.
Hi, > Since you are delurking ... :-) Yeah. That plan failed thoroughly. > Could you comment on the SCSI READ_10 commands in the file* attached > in Comment 81? > [systemd-udevd-scsi-commands-from-trace-5-txt-1.txt] Well, first i wondered why the READ_10 attempts to LBA 513 don't show up. (They come from kworker/6:1H-469 instead of systemd-udevd-*. So they did not pass the grep.) Then i see an allegedly successful READ_10 to LBA 512 in trace-5.txt. First it fails for systemd-udevd-2532: 8703: systemd-udevd-2532 [001] .... 627.381674: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) 8716: kworker/6:1H-469 [006] .... 627.424401: scsi_dispatch_cmd_timeout: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=2 protect=0 raw=28 00 00 00 02 00 00 00 02 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) Then it succeeds within 0.02 seconds for kworker/1:1H-536: 8718: kworker/1:1H-536 [001] .... 627.441450: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) 8789: <idle>-0 [006] d.h. 628.987442: scsi_dispatch_cmd_done: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=512 txlen=1 protect=0 raw=28 00 00 00 02 00 00 00 01 00) result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) As if a repetition of the SCSI command would switch to the behavior of the old pioneer drive. (For the new drive and USB box, any repetition would not happen.) > it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence > with the other powers-of-two reads. But too late to have served a potential binary search which would have caused the 8, 16, 32, 64, 128, 256, 512 sequence. There are also reads to 18, 34, 66, 130, 194. 18 is probably part of the ISO 9660 Volume descriptors. The others could be in the range where ISO 9660 directory entries are stored. UDF normally begins after 256. I fail to invent a fully plausible theory what purpose is behind these read attempts. > Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive. That would be "cmnd=(0x25" and deliver the LBA of the last readable block. It's not in trace-5.txt. But the xorriso log on SATA has it READ CAPACITY 25 00 00 00 00 00 00 00 00 00 From drive: 8b 00 3e 95 82 00 00 08 00 376 us [ 12227982 ] The reply is 0x003e9582 = 4101506. So xorriso reports to the user Media blocks : 4101507 readable , 0 writable , 4101507 overall This is part of the riddle why udev begins to search for the end block when it can inquire it from the kernel which could inquire it like above. Have a nice day :) Thomas
Correction: The repeated read to 512 succeeds in 1.5 seconds. Not in 0.02 seconds.
Created attachment 1679788 [details] output from xorriso check media command I just let this run for a couple of hours or so and it eventually finished with a vast collection of interleaved error and no error results.
Created attachment 1679790 [details] SCSI commands for host_no=5 from trace-5.txt As Thomas notes in Comment 122, my earlier attempt at extracting SCSI commands from trace-5.txt excluded major events. This includes every SCSI event for host_no=5. Columns are timestamp, process descriptor, event code, SCSI command or SCSI result. Event codes are as follows: "s" = start "d" = done "t" = timeout "w" = wakeup "-" means "not present". Example output: 628.988440: kworker/6:1H-469 |s|cmnd=(READ_10 lba=513 txlen=1 protect=0 raw=28 00 00 00 02 01 00 00 01 00) |- 629.480827: kworker/6:1H-469 |t|cmnd=(READ_10 lba=513 txlen=1 protect=0 raw=28 00 00 00 02 01 00 00 01 00) |result=(driver=DRIVER_OK host=DID_OK message=COMMAND_COMPLETE status=SAM_STAT_GOOD) 629.498929: <idle>-0 |w|- |-
Created attachment 1679791 [details] trace-5.txt analyzer script used to generate scsi-commands-for-host_no-5-from-trace-5-txt-1.txt For documentation purposes only.
Hi, > https://bugzilla.redhat.com/attachment.cgi?id=1679788 > it eventually finished with a vast collection of interleaved error > and no error results. That's really unexpected. The unreadable intervals are larger than the readable ones. 4,014,176 versus 87,331 blocks. But the number of readables and the sucessful repetition of the READ_10 to lba=512 in trace-5.txt give me the idea that the drive is not overly steady in its refusals. Have a nice day :) Thomas
The "blkid" command is invoked by certain udev rules for optical drives.* Although those rules do not explicitly use the "--probe" option**, which bypasses the cache, I am guessing that a hang can be induced by this procedure: 1. Disable systemd-udevd as before. 2. Load a commercial DVD. 3. Run: # blkid --probe /dev/sr0 Further, if SCSI tracing is enabled, I am guessing that the powers-of-two search will be observed. The "blkid" command was tested in a VM with the F31 Live DVD ISO image configured as a virtual disk: # blkid --probe /dev/sr0 /dev/sr0: UUID="2019-10-23-23-21-29-00" BOOT_SYSTEM_ID="EL TORITO SPECIFICATION" VERSION="Joliet Extension" LABEL="Fedora-WS-Live-31-1-9" TYPE="iso9660" USAGE="filesystem" PTUUID="6f869649" PTTYPE="dos" == footnotes == * Udev rules for optical drives: $ grep -A8 -m1 probe /usr/lib/udev/rules.d/60-persistent-storage.rules # probe filesystem metadata of optical drives which have a media inserted KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="?*", \ IMPORT{builtin}="blkid --offset=$env{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}" # single-session CDs do not have ID_CDROM_MEDIA_SESSION_LAST_OFFSET KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="", \ IMPORT{builtin}="blkid --noraid" # probe filesystem metadata of disks KERNEL!="sr*", IMPORT{builtin}="blkid" ** $ man blkid
That sounds like I might be able to override just the udev rules for cdrom media and make udev utterly ignore cdrom activity. Might be worth a try at least as a workaround.
(In reply to Tom Horsley from comment #129) > That sounds like I might be able to override just the udev rules for cdrom media and make udev utterly ignore cdrom activity. > Might be worth a try at least as a workaround. Good idea. Although, you will probably want to check that your apps still work. An alternative would be to configure your own program to do something less intrusive on the DVD drive.* BTW, the rules I listed in Comment 128 have an assignment to this variable: "IMPORT{builtin}". You can list those with: $ udevadm test-builtin --help udevadm test-builtin [OPTIONS] COMMAND DEVPATH Test a built-in command. ... Commands: blkid Filesystem and partition probing <<<<<< udev has its own version of "blkid". ... And that can be invoked with: $ udevadm test-builtin blkid /sys/class/block/sr0 However, it doesn't seem to accept a "--probe" option, so it must not be the same as "/usr/sbin/blkid": $ udevadm test-builtin blkid --probe /sys/class/block/sr0 test-builtin: unrecognized option '--probe' * According to "man udev", you can assign to "IMPORT{type}", where "type" can be "program".
(In reply to Steve from comment #130) > An alternative would be to configure your own program to do something less intrusive on the DVD drive. A wrapper shell script that returns dummy data in a valid format when passed sr0 might work. Otherwise it would invoke the "real" udev builtin "blkid", which in my VM returns: $ udevadm test-builtin blkid /sys/class/block/sr0 2>/dev/null ID_FS_UUID=2019-10-23-23-21-29-00 ID_FS_UUID_ENC=2019-10-23-23-21-29-00 ID_FS_BOOT_SYSTEM_ID=EL\x20TORITO\x20SPECIFICATION ID_FS_VERSION=Joliet Extension ID_FS_LABEL=Fedora-WS-Live-31-1-9 ID_FS_LABEL_ENC=Fedora-WS-Live-31-1-9 ID_FS_TYPE=iso9660 ID_FS_USAGE=filesystem ID_PART_TABLE_UUID=6f869649 ID_PART_TABLE_TYPE=dos The format is a list of key/value pairs per "man udev": IMPORT{type} ... "program" Execute an external program specified as the assigned value and, if it returns successfully, import its output, which must be in environment key format. Path specification, command/argument separation, and quoting work like in RUN.
The version of "blkid" in the util-linux package produces identical output when invoked with the "-o udev" option: $ /usr/sbin/blkid --probe -o udev /dev/sr0 ID_FS_UUID=2019-10-23-23-21-29-00 ID_FS_UUID_ENC=2019-10-23-23-21-29-00 ID_FS_BOOT_SYSTEM_ID=EL TORITO SPECIFICATION ID_FS_VERSION=Joliet\x20Extension ID_FS_LABEL=Fedora-WS-Live-31-1-9 ID_FS_LABEL_ENC=Fedora-WS-Live-31-1-9 ID_FS_TYPE=iso9660 ID_FS_USAGE=filesystem ID_PART_TABLE_UUID=6f869649 ID_PART_TABLE_TYPE=dos NB: I used the full path for clarity. The command is in this package: $ rpm -qf /usr/sbin/blkid util-linux-2.34-4.fc31.x86_64
Amazing! I copied /usr/lib/udev/rules.d/60-persistent-storage.rules to /etc/udev/rules.d/ and changed the two lines claiming to be for optical media to GOTO="persistent_storage_end" instead of doing the IMPORT from blkid. I then rebooted to make absolutely sure the new rule was in effect (I don't trust the inotify magic it claims to use :-). I loaded a data dvd then ran a mount command and it worked fine. I then loaded the movie DVD I've been using for testing, and dvdbackup is reading it perfectly well. No more hangs: zooty> dvdbackup -M -i /dev/sr0 -o /home/tom libdvdread: Attempting to retrieve all CSS keys libdvdread: This can take a _long_ time, please be patient libdvdread: Get key for /VIDEO_TS/VIDEO_TS.VOB at 0x00000126 libdvdread: Elapsed time 0 libdvdread: Get key for /VIDEO_TS/VTS_01_0.VOB at 0x0000c8de libdvdread: Elapsed time 0 libdvdread: Get key for /VIDEO_TS/VTS_01_1.VOB at 0x0000cf23 libdvdread: Elapsed time 0 libdvdread: Get key for /VIDEO_TS/VTS_02_0.VOB at 0x001f7686 libdvdread: Elapsed time 0 libdvdread: Get key for /VIDEO_TS/VTS_02_1.VOB at 0x001f7ccb libdvdread: Elapsed time 0 libdvdread: Found 2 VTS's libdvdread: Elapsed time 0 I've never had any use for the automounting junk anyway (heck, I know I just put a disk in - I was there doing it). This seems to be a good work around.
(In reply to Tom Horsley from comment #133) > This seems to be a good work around. Awesome! While you are at it, could you see if this causes a hang: # blkid --probe /dev/sr0 If so, that might be grounds for opening a bug against util-linux: $ rpm -qf /usr/sbin/blkid util-linux-2.34-4.fc31.x86_64
(In reply to Tom Horsley from comment #133) > Amazing! I copied /usr/lib/udev/rules.d/60-persistent-storage.rules to /etc/udev/rules.d/ and changed the two lines claiming to be for optical media to GOTO="persistent_storage_end" instead of doing the IMPORT from blkid. ... Could you post a diff (or patch) for your workaround, so there is an exact record of what you did?
Just to see if anything happens, I submitted bug 1825533 against blkid. Here's the diff, just changed two lines: --- /usr/lib/udev/rules.d/60-persistent-storage.rules 2020-03-26 09:34:16.000000000 -0400 +++ /etc/udev/rules.d/60-persistent-storage.rules 2020-04-18 12:32:46.837614823 -0400 @@ -100,10 +100,10 @@ # probe filesystem metadata of optical drives which have a media inserted KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="?*", \ - IMPORT{builtin}="blkid --offset=$env{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}" + GOTO="persistent_storage_end" # single-session CDs do not have ID_CDROM_MEDIA_SESSION_LAST_OFFSET KERNEL=="sr*", ENV{DISK_EJECT_REQUEST}!="?*", ENV{ID_CDROM_MEDIA_TRACK_COUNT_DATA}=="?*", ENV{ID_CDROM_MEDIA_SESSION_LAST_OFFSET}=="", \ - IMPORT{builtin}="blkid --noraid" + GOTO="persistent_storage_end" # probe filesystem metadata of disks KERNEL!="sr*", IMPORT{builtin}="blkid"
Hi, congrats to the workaround. > + GOTO="persistent_storage_end" And by such classic means. :)) Computer science tries to eliminate GOTO since more than 50 years without any success. http://www.u.arizona.edu/~rubinson/copyright_violations/Go_To_Considered_Harmful.html Roll over Dijkstra and Wirth ! Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #122) > Hi, >> Since you are delurking ... :-) > Yeah. That plan failed thoroughly. ... > Well, first i wondered why the READ_10 attempts to LBA 513 don't show up. > (They come from kworker/6:1H-469 instead of systemd-udevd-*. So they did not pass the grep.) Thanks for pointing that out. Although it is moot now, I rewrote the awk script to display all events for "host_no=5". That provides much better insight into what is going on. > Then i see an allegedly successful READ_10 to LBA 512 in trace-5.txt. > First it fails for systemd-udevd-2532: ... > Then it succeeds within 0.02 seconds for kworker/1:1H-536: ... Yes, it seems peculiar that a timeout would occur in less time than a successful transaction. > As if a repetition of the SCSI command would switch to the behavior of the old pioneer drive. > (For the new drive and USB box, any repetition would not happen.) OK. >> it has READ_10s at lba=2, lba=4, and lba=1024 that are out of sequence with the other powers-of-two reads. > But too late to have served a potential binary search which would have caused the 8, 16, 32, 64, 128, 256, 512 sequence. Good point. > There are also reads to 18, 34, 66, 130, 194. > 18 is probably part of the ISO 9660 Volume descriptors. The others could be in the range where ISO 9660 directory entries are stored. UDF normally begins after 256. So those would be "non-sizing" reads. > I fail to invent a fully plausible theory what purpose is behind these read attempts. udev/blkid need some of that info to generate their output. Comment 131 has an example. There could also be directory caching for a file manager or a video player, which would certainly need to read more than just the Volume descriptors. > Windows 10 sends a "SCSI: Read Capacity(10)" command to the disk drive. > That would be "cmnd=(0x25" and deliver the LBA of the last readable block. OK. That is in pane 2 of the Wireshark on Windows screenshot. (Comment 121) > It's not in trace-5.txt. > But the xorriso log on SATA has it > READ CAPACITY > 25 00 00 00 00 00 00 00 00 00 ... Wireshark shows all 10 bytes in the bottom pane. Wireshark is usually used to capture networking packets but it works very well with USB traffic too: $ rpm -q wireshark wireshark-3.2.0-1.fc31.x86_64 > The reply is 0x003e9582 = 4101506. > So xorriso reports to the user > Media blocks : 4101507 readable , 0 writable , 4101507 overall So it seems that the kernel could be dropping some transfers when tracing. > This is part of the riddle why udev begins to search for the end block when it can inquire it from the kernel which could inquire it like above. My current theory is that udev/blkid are somehow running of the rails when trying to identify the disc and using a fallback sizing algorithm. Assuming the disc is correctly mastered, that would mean that there is a bug in udev/blkid. (Bug 1825533) > Have a nice day :) > Thomas
(In reply to Tom Horsley from comment #136) > Just to see if anything happens, I submitted bug 1825533 against blkid. I'm guessing the problem is some easily fixed misconfiguration, because libblkid is designed to be configurable*. The libblkid man page, which is in libblkid-devel**, says: "The libblkid library is used to identify block devices (disks) as to their content (e.g. filesystem type) as well as extracting additional information such as filesystem labels/volume names, unique identifiers/serial numbers. A common use is to allow use of LABEL= and UUID= tags instead of hard-coding specific block device names into configuration files." libblkid also supports caching, which would explain why blkid seems to be reading more than is necessary for mere identification. > Here's the diff, just changed two lines: ... Thanks. That's much simpler than anything I had in mind. :-) == footnotes == * With /etc/blkid.conf. See "man 8 blkid". ** $ rpm -q libblkid-devel libblkid-devel-2.34-4.fc31.x86_64
(In reply to Thomas Schmitt from comment #122) ... > It's not in trace-5.txt. > But the xorriso log on SATA has it > > READ CAPACITY > 25 00 00 00 00 00 00 00 00 00 > From drive: 8b > 00 3e 95 82 00 00 08 00 > 376 us [ 12227982 ] > > The reply is 0x003e9582 = 4101506. > So xorriso reports to the user > Media blocks : 4101507 readable , 0 writable , 4101507 overall ... So it looks like blkid is probing the last two blocks of the disc instead of sending a SCSI "READ CAPACITY" request: 626.759905: systemd-udevd-2532 |s|cmnd=(READ_10 lba=4101504 txlen=2 protect=0 raw=28 00 00 3e 95 80 00 00 02 00) 626.991589: systemd-udevd-2532 |s|cmnd=(READ_10 lba=4101506 txlen=1 protect=0 raw=28 00 00 3e 95 82 00 00 01 00) Those succeed, but why couldn't blkid do BOTH? (Trust but verify. :-))
"... probing the last two blocks of the disc ..." Correcting myself: 4101504,4101505 for the first of the two reads (txlen=2). 4101506 for the second of the two reads (txlen=1). That's *three* ... :-) (Assuming that "txlen" is a block count.) "... 4101507 overall" Presumably, blocks are numbered starting from 0.
Hi, > So it looks like blkid is probing the last two blocks of the disc > instead of sending a SCSI "READ CAPACITY" request: The purpose of the first reads to the end blocks might be to detect a possible GPT backup header block. (*) But to know the LBA of the last blocks it has to already know the size. (This information can be obtained by READ CAPACITY or READ DISC INFORMATION and READ TRACK INFORMATION.) So it is extra riddling why udev begins to suck off the lower blocks of the DVD. ------------------------------------------------------------------- (*) I remember an incident about Archlinux ISOs which contained a blkid program that was curious about GPT. When the ISO was burnt to CD by write type TAO, and the reader was in the majority group of drives which report the Run-out blocks of CD TAO as readable, then this read attempt failed and Archlinux did not boot from the CD. This was repaired by making blkid tolerant towards read failure. Probably this is also the reason why not only the last block but also the second and third last block are read. The two Run-out sectors of TAO CDs are not readable by READ(10) but only by READ CD commands. So the drives which count them as readable are not really liars. ("I'm not crazy ! Don't call me crazy ! I'm just not user friendly !" - Roberto in Futurama.) This drive dependent ambiguity with TAO CD and READ CAPACITY plus the credulence of linux-scsi establish the old Linux TAO Read-Ahead Bug. It caused the cargo cult of adding 300 KB of padding to CD burn runs and to the production of ISO 9660 images. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #142) ... > But to know the LBA of the last blocks it has to already know the size. > (This information can be obtained by READ CAPACITY or READ DISC INFORMATION and READ TRACK INFORMATION.) ... Thanks for pointing that out. "READ_CAPACITY" is in the trace logs -- I just didn't think to look for it before now. :-) From Comment 57 (analysis of trace-4.txt): systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) Likewise for Attachment 1679790 [details] (analysis of trace-5.txt: scsi-commands-for-host_no-5-from-trace-5-txt-1.txt): 626.731734: systemd-udevd-2532 |s|cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) "READ_CAPACITY" is, of course, in the original trace logs: Attachment 1679161 [details] (kernel-debug.txt aka trace-4.txt) Attachment 1679482 [details] (trace-5.txt) BTW, this is a perfect example of the phenomenon of "seeing but not observing" as described by Sherlock Holmes in "A Scandal in Bohemia". (cited by Maria Konnikova in "Mastermind: How to Think Like Sherlock Holmes")
(In reply to Steve from comment #143) ... > "READ_CAPACITY" is in the trace logs ... ... The replies are there too, but they are not SCSI commands so they don't show up in my analyses. The replies come back as USB packets. I believe they are logged as "BULK" packets, as in this sequence of records from trace-4.txt. Does "length 31" mean the length of the READ_CAPACITY reply? And does "Slot 7 ep2out" mean the recipient of the reply? systemd-udevd-7136 [001] .... 10101.238988: scsi_dispatch_cmd_start: host_no=5 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) usb-storage-4803 [000] .... 10101.239052: xhci_urb_enqueue: ep2out-bulk: urb 0000000036e6ba93 pipe 3221292032 slot 7 length 0/31 sgs 0/0 stream 0 flags 00000004 usb-storage-4803 [000] d... 10101.239057: xhci_queue_trb: BULK: Buffer 000000040615c080 length 31 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c usb-storage-4803 [000] d... 10101.239058: xhci_inc_enq: BULK 00000000ae4f77c9: enq 0x0000000407fd6430(0x0000000407fd6000) deq 0x0000000407fd6420(0x0000000407fd6000) segs 2 stream 0 free_trbs 508 bounce 1024 cycle 1 usb-storage-4803 [000] d... 10101.239059: xhci_ring_ep_doorbell: Ring doorbell for Slot 7 ep2out
Hi, > systemd-udevd-7136 10101.238988: cmnd=(READ_CAPACITY Grmpf. I searched for "cmnd=(0x25" and thought there were none. > Sherlock Holmes The grand master of hunting nasty bugs, indeed. "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth." Ok. It is obviously not a binary search for the size what we see. (This was already improbable by the late reading of blocks 2 to 7.) > Does "length 31" mean the length of the READ_CAPACITY reply? Not on SCSI level. The data reply from the drive is 8 bytes long. But of course there might be USB stuff wrapped around the SCSI payload. The payload is obviously not shown in trace-4.txt. Nothing to see that looks like the byte sequence 3e 95 82, which the drive replied to libburn underneath xorriso. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #145) ... > Grmpf. I searched for "cmnd=(0x25" and thought there were none. ... Now that you point that out, I <observe> that the logs use "raw=25": :-) "... cmnd=(READ_CAPACITY - raw=25 00 00 00 00 00 00 00 00 00) ..."
(In reply to Thomas Schmitt from comment #145) ... > > Does "length 31" mean the length of the READ_CAPACITY reply? > > Not on SCSI level. The data reply from the drive is 8 bytes long. > But of course there might be USB stuff wrapped around the SCSI payload. ... Thanks. The "length 31" appears to be the length of the encapsulated SCSI READ_CAPACITY command in the USB packet. From my Wireshark capture on Windows*: Frame 23: 58 bytes on wire (464 bits), 58 bytes captured (464 bits) on interface wireshark_extcap1640, id 0 USB URB [Source: host] [Destination: 2.2.2] USBPcap pseudoheader length: 27 IRP ID: 0xffff92014d97b460 IRP USBD_STATUS: USBD_STATUS_SUCCESS (0x00000000) URB Function: URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER (0x0009) IRP information: 0x00, Direction: FDO -> PDO URB bus id: 2 Device address: 2 Endpoint: 0x02, Direction: OUT URB transfer type: URB_BULK (0x03) Packet Data Length: 31 <<<<<<<<<<<<<<<<<<<<<<<<<<<<< 31 [Response in: 24] [bInterfaceClass: Mass Storage (0x08)] USB Mass Storage Signature: 0x43425355 Tag: 0x4d97b460 DataTransferLength: 8 Flags: 0x80 .000 .... = Target: 0x0 (0) .... 0000 = LUN: 0x0 ...0 1010 = CDB Length: 0x0a SCSI CDB Read Capacity(10) [LUN: 0x0000] [Command Set:Direct Access Device (0x00) (Using default commandset)] [Response in: 28] Opcode: Read Capacity(10) (0x25) Control: 0x00 * The Wireshark capture was saved to a file in Windows, transferred to my Linux system, and I am viewing the file with Wireshark on Linux.
The hexdump is easier to understand. I've marked the lengths: 0000 1b 00 60 b4 97 4d 01 92 ff ff 00 00 00 00 09 00 ..`..M.......... 0010 00 02 00 02 00 02 03 1f 00 00 00 55 53 42 43 60 ...........USBC` 0020 b4 97 4d 08 00 00 00 80 00 0a 25 00 00 00 00 00 ..M.......%..... 0030 00 00 00 00 00 00 00 00 00 00 .......... ^^ 0a is the length of the following SCSI command 25. ^^ 1f is the length of the SCSI command 25 with signature and tag (in little-endian order). ^^ That 08 could be the length of the expected SCSI command 25 reply. ("DataTransferLength" in Comment 147) ^^ 1b is the length of the "USBPcap pseudoheader" (per Comment 147) There appear to be six additional 00 bytes after the SCSI command 25.
I finally found an actual anomaly. In trace-4.txt, after two of the timeouts, there is no wakeup: $ cat trace-4.txt | awk '/host_no=5/ { print $4, $5 }' | fgrep -A20 'timeout' 10132.015859: scsi_dispatch_cmd_timeout: 10132.153972: scsi_dispatch_cmd_start: <<< no wakeup here 10132.155549: scsi_dispatch_cmd_done: 10132.155618: scsi_dispatch_cmd_start: 10132.155909: scsi_dispatch_cmd_done: 10132.155951: scsi_dispatch_cmd_start: 10132.156207: scsi_dispatch_cmd_done: 10132.156237: scsi_dispatch_cmd_start: 10162.224030: scsi_dispatch_cmd_timeout: 10162.369979: scsi_dispatch_cmd_start: <<< no wakeup here 10162.370359: scsi_dispatch_cmd_done: 10162.370474: scsi_dispatch_cmd_start: 10192.432179: scsi_dispatch_cmd_timeout: 10192.438071: scsi_eh_wakeup: <<< but there is a wakeup here In contrast, trace-5.txt shows a wakeup after each timeout: $ cat trace-5.txt | awk '/host_no=5/ { print $4, $5 }' | fgrep -A1 'timeout' 626.713247: scsi_dispatch_cmd_timeout: 626.729892: scsi_eh_wakeup: -- 627.424401: scsi_dispatch_cmd_timeout: 627.440905: scsi_eh_wakeup: -- 629.480827: scsi_dispatch_cmd_timeout: 629.498929: scsi_eh_wakeup: -- 630.630342: scsi_dispatch_cmd_timeout: 630.646961: scsi_eh_wakeup: -- 630.704811: scsi_dispatch_cmd_timeout: 630.720960: scsi_eh_wakeup: -- 630.992037: scsi_dispatch_cmd_timeout: 631.008961: scsi_eh_wakeup: -- 631.051348: scsi_dispatch_cmd_timeout: 631.067966: scsi_eh_wakeup: If you want to look at the raw data, run these: $ egrep 'timeout|wakeup' trace-4.txt $ egrep 'timeout|wakeup' trace-5.txt
The "eh" in the function name means "error handler": $ git grep -l scsi_eh_wakeup drivers/ drivers/scsi/scsi_error.c ... So the USB-attached case and the SATA-attached case are different in how errors are being handled. That points the finger at the kernel ...
(In reply to Tom Horsley from comment #133) > I've never had any use for the automounting junk anyway ... I stopped using gnome as my prime desktop for that reason and others involving too much automation. Now I use Xfce, which isn't so "busy". Anyway, I attached strace to systemd-udevd to see what happens. Here are the results: # strace -f -p `pgrep systemd-udevd` -o strace-1.txt Insert and remove a USB flash drive. Terminate strace with ctrl-c. # fgrep -m1 exec strace-1.txt 2275 execve("/usr/lib/udev/mtp-probe", ["/usr/lib/udev/mtp-probe", "/sys/devices/pci0000:00/0000:00:"..., "3", "4"], 0x5643824dd340 /* 26 vars */) = 0 I know my USB flash drive is not a media player, but udev checks for me anyway. :-) $ rpm -qf /usr/lib/udev/mtp-probe libmtp-1.1.16-2.fc30.x86_64 $ rpm -q systemd-udev systemd-udev-241-14.git18dd3fb.fc30.x86_64
Tom: Could you please add a link to the upstream bug in the "Links" field at the top of the bug page so it is easy to find. The "System" is "Linux Kernel". Thomas: Thanks for your possible explanation for why the second and third last blocks are being read by blkid. And thanks for the informative and entertaining background on disc formatting. (Comment 142)
(In reply to Steve from comment #152) > Tom: Could you please add a link to the upstream bug in the "Links" field at > the top of the bug page so it is easy to find. The "System" is "Linux > Kernel". I can't see any field named "Links" or find any place I might be able to point at the upstream bug other than perhaps putting it in the URL: field.
(In reply to Thomas Schmitt from comment #142) > But to know the LBA of the last blocks it has to already know the size. > (This information can be obtained by READ CAPACITY or READ DISC INFORMATION and READ TRACK INFORMATION.) This gets more interesting all the time. :-) I tried enabling scsi tracing on my F30 system while connecting and disconnecting an external USB hard drive and found several of these in the trace log: kworker/u8:1-9073 [000] .... 14855.987940: scsi_dispatch_cmd_start: host_no=6 channel=0 id=0 lun=0 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00) "raw=9e 10" appears to be yet another variant of SCSI support for disk sizing. There must be an "extended" SCSI command set. NB: "host_no=6" is the external USB hard drive. (per "journalctl")
Hi, > cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e > > "raw=9e 10" appears to be yet another variant of SCSI support for disk > sizing. > There must be an "extended" SCSI command set. It is a tradition in SCSI to introduce extended versions of older commands. Like 88h READ(16) extends A8h READ(12) which extends 28h READ(10) which extends 08h READ(6). (Whew wheeze.) SPC-3 lists as command 9Eh only SERVICE ACTION IN(16) without giving any SCSI volume (like MMC or SBC) where it would be documented. The internet says that 9Eh READ CAPACITY(16) belongs to SBC-3. Have a nice day :) Thomas
(In reply to Thomas Schmitt from comment #155) > Hi, > > > cmnd=(SERVICE_ACTION_IN_16 READ_CAPACITY_16 lba=0 alloc_len=32 raw=9e > > > > "raw=9e 10" appears to be yet another variant of SCSI support for disk sizing. > > There must be an "extended" SCSI command set. > > It is a tradition in SCSI to introduce extended versions of older commands. Like 88h READ(16) extends A8h READ(12) which extends 28h READ(10) which extends 08h READ(6). (Whew wheeze.) Thanks. That trend must be motivated by the need to accommodate larger devices. The USB hard drive is 500GB: $ lsscsi -ts [2:0:0:0] disk sata:[wwn removed..*] /dev/sda 500GB [4:0:0:0] disk sata:[wwn removed...] /dev/sdb 500GB [5:0:0:0] cd/dvd sata: /dev/sr0 - [6:0:0:0] disk usb:3-11:1.0 /dev/sdc 500GB > SPC-3 lists as command 9Eh only SERVICE ACTION IN(16) without giving any SCSI volume (like MMC or SBC) where it would be documented. > The internet says that 9Eh READ CAPACITY(16) belongs to SBC-3. The official docs require registration to view, but Seagate has a doc that gives the details in: 3.23 READ CAPACITY (16) command (p. 157) SCSI Commands Reference Manual Fibre Channel (FC) Serial Attached SCSI (SAS) October 2016 https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf (Also documented in Table 30, page 66) > Have a nice day :) > > Thomas * The "lsscsi" command's "-t" option displays the "WWN", which is the "World Wide Number". Apparently, SCSI devices have unique identifiers that probably shouldn't be posted on the internet. I had to use the "hdparm -I" command to figure out what that number is.
This message is a reminder that Fedora 31 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '31'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 31 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.