Bug 1277060 - vioscsi loading stuck with uas device on scsi-block
vioscsi loading stuck with uas device on scsi-block
Status: NEW
Product: Virtualization Tools
Classification: Community
Component: virtio-win (Show other bugs)
unspecified
x86_64 Unspecified
unspecified Severity medium
: ---
: ---
Assigned To: Vadim Rozenfeld
Virtualization Bugs
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-02 03:52 EST by Tom Yan
Modified: 2017-09-04 00:21 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
screenshot of guest when issue occured (16.35 KB, image/png)
2015-11-02 03:54 EST, Tom Yan
no flags Details
screenshot of terminals on host (53.63 KB, image/png)
2015-11-02 03:54 EST, Tom Yan
no flags Details
screenshot (28.67 KB, image/png)
2015-11-04 01:50 EST, Min Deng
no flags Details
ehci + uas + vioscsi + scsi-block (119.24 KB, image/png)
2016-05-02 12:39 EDT, Tom Yan
no flags Details
xhci + uas + vioscsi + scsi-block (113.82 KB, image/png)
2016-05-02 12:54 EDT, Tom Yan
no flags Details
"trace-event scsi_* on" (110-2 on SuperSpeed port) (10.78 KB, text/plain)
2017-08-25 22:43 EDT, Tom Yan
no flags Details
"trace-event scsi_* on" (110-2 on HiSpeed port) (434.16 KB, text/plain)
2017-08-25 22:44 EDT, Tom Yan
no flags Details
"trace-event scsi_* on" (141-1 on SuperSpeed port) (444.46 KB, text/plain)
2017-08-25 22:45 EDT, Tom Yan
no flags Details
vioscsi drivers for testing from builds 119,121,123 and 124 (2.60 MB, application/zip)
2017-08-30 21:25 EDT, Vadim Rozenfeld
no flags Details

  None (edit)
Description Tom Yan 2015-11-02 03:52:08 EST
Description of problem:

When trying to load the vioscsi in Windows 10 installation, the guest stucks at the loading screen. It does not freeze though, I can still move the cursor and the progress bar is still going back and forth.

Then when I try to kill qemu in terminal with ^C, the process turns into some kind of zombie (some "qemu-system-x86" which doesn't really exist).

The device is a first gen Intel X25-M SSD connected to USB 3 through a StarTech UAS adapter:
http://ark.intel.com/products/56604/Intel-SSD-X25-M-Series-80GB-2_5in-SATA-3Gbs-50nm-MLC
http://www.startech.com/HDD/Adapters/USB-3-SATA-adapter-cable-with-UASP~USB3S2SAT3CB

I can load vioscsi successfully in the guest if I use the "u" quirk of "usb-storage" in host so that the device does not bind to the uas driver.

scsi-hd and viostor works on uas too, just not scsi-block. Also the device never showed any problem when being used physically under Windows or Linux.

Same issue occured on a Windows Server 2012 R2 installation.

I can complete the installation of Windows 10 with scsi-block on usb-storage and Arch Linux with scsi-block on uas.

Feel free to let me know if you need some more other info.


Version-Release number of selected component (if applicable):

0.1.110


Additional info:

linux 4.2.5-1 and qemu-2.4.0.1-1 on Arch Linux

command: qemu-system-x86_64 -enable-kvm -cpu host -m 4G -device virtio-scsi-pci -drive file=PATH_TO_DRIVE,if=none,format=raw,id=system -device scsi-block,drive=system -drive file=PATH_TO_WIN_ISO,media=cdrom -drive file=PATH_TO_VIRTIO_WIN_ISO,media=cdrom -full-screen
Comment 1 Tom Yan 2015-11-02 03:54 EST
Created attachment 1088489 [details]
screenshot of guest when issue occured

it did not freeze, just stuck
Comment 2 Tom Yan 2015-11-02 03:54 EST
Created attachment 1088490 [details]
screenshot of terminals on host
Comment 3 Vadim Rozenfeld 2015-11-02 04:33:18 EST
can we try reproducing this problem?

Thanks,
Vadim.
Comment 4 Min Deng 2015-11-04 01:48:10 EST
(In reply to Vadim Rozenfeld from comment #3)
> can we try reproducing this problem?
> 
> Thanks,
> Vadim.

QE already reproduce similar issue with the following builds,
qemu-kvm-rhev-2.3.0-31.el7.x86_64
virtio-win-1.8.0-4.el7.noarch.rpm or build 110
kernel-3.10.0-326.el7.x86_64

cli,/usr/libexec/qemu-kvm -name win10-32 -enable-kvm -m 4G -smp 4 -uuid 9e59c1d2-23ee-41cb-a992-6c45ae4d9cb5 -nodefconfig -nodefaults -rtc base=localtime,driftfix=slew -boot order=cd,menu=on -device virtio-scsi-pci,bus=pci.0,addr=0x5,id=scsi0 -drive file=/dev/vgtest/lvtest,if=none,id=drive-virtio-disk,format=raw,cache=none,aio=native,werror=stop,rerror=stop -device scsi-block,bus=scsi0.0,drive=drive-virtio-disk,id=scsi1 -drive file=en_windows_10_enterprise_x86_dvd_6851156.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=isa_serial0 -vnc 0.0.0.0:0 -vga cirrus -monitor stdio -drive file=virtio-win-prewhql-0.1-110.iso,if=none,media=cdrom,id=drive-ide0-1-1,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -drive file=/usr/share/virtio-win/virtio-win-1.8.0_x86.vfd,if=none,id=drive-fdc0-0-0,format=raw,cache=none -global isa-fdc.driveA=drive-fdc0-0-0 -netdev tap,script=/etc/qemu-ifup,downscript=no,id=hostnet1,vhost=on -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:52:14:21:12:13,bus=pci.0
Comment 5 Min Deng 2015-11-04 01:50 EST
Created attachment 1089427 [details]
screenshot
Comment 6 Tom Yan 2016-04-30 05:52:54 EDT
Any progress so far? I am still experiencing the issue with virtio-win-0.1.117.iso / qemu 2.5.0 / linux 4.5.1 / Windows 10 Build 10586.

Btw this works fine (I mean the viostor driver loads without a problem):

qemu-system-x86_64 -enable-kvm -cpu host -smp cores=4 -m 4G -net none -full-screen -drive file=Downloads/10586.0.151029-1700.TH2_RELEASE_CLIENTENTERPRISEEVAL_OEMRET_X64FRE_EN-US.ISO,media=cdrom -drive file=Downloads/virtio-win-0.1.117.iso,media=cdrom -drive file=/dev/sdc,format=raw,cache=none,aio=native,if=none,id=system -device virtio-blk-pci,drive=system,scsi=on

With "scsi=on" virtio-blk-pci does SCSI passthrough just like virtio-scsi-pci does (can be checked with sg3_utils for example, and also `hdparm -I` if qemu is run with root/sudo), so that's probably not the cause of the issue?

Should I also file a bug report on the qemu tracker, in case it's a bug in the virtio-scsi-pci code instead of the vioscsi Windows driver?
Comment 7 Tom Yan 2016-04-30 08:16:16 EDT
(In reply to Tom Yan from comment #6)
> 
> With "scsi=on" virtio-blk-pci does SCSI passthrough just like
> virtio-scsi-pci does (can be checked with sg3_utils for example, and also
> `hdparm -I` if qemu is run with root/sudo), so that's probably not the cause
> of the issue?
> 
Hmm never mind. It seems that "scsi=on" only has effect on Linux guests. viostor on Windows seems to respond to SCSI commands (issued with sg_vpd, sg_inq...) but expose only a virtual SCSI layer anyway, while on Linux guests virtio-blk does not respond to SCSI commands unless "scsi=on" is specified, and when it is specified, it expose the physical SCSI layer to the guest.
Comment 8 Tom Yan 2016-05-02 12:39 EDT
Created attachment 1153006 [details]
ehci + uas + vioscsi + scsi-block

Just found out that this issue only occur with xhci + uas but not ehci + uas (same computer with "Intel xHCI Mode" in UEFI settings set to "Disabled" instead of "Enabled").

Maybe it has something to do with the "streams" thing (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/storage/uas.c?h=v4.5#n869)?
Comment 9 Tom Yan 2016-05-02 12:54 EDT
Created attachment 1153009 [details]
xhci + uas + vioscsi + scsi-block

I then disable the disk drive in Device Manager of the Windows guest, shutdown the guest, then reboot the host with "Intel xHCI Mode" set to "Enabled". The guest starts fine with the drive (in xhci + uas mode) attached (but disabled in the guest). It only struggles when I try to enable the disk drive in Device Manager again.
Comment 10 lijin 2017-08-17 03:23:32 EDT
Hi Tom,

Is this still an issue for you?Could you try with latest virtio-win version?

Thanks.
Comment 11 Tom Yan 2017-08-17 12:42:12 EDT
Well I will not be at liberty to try reproduce it with similar hardware configuration in the foreseeable future. (Namely, connecting the adapter to a SuperSpeed port)

As I told in my past message, ths issue only occurs with XHCI. However, what I meant by "XHCI" was merely an XHCI SuperSpeed port. I never tested with an XHCI HiSpeed port, which is now the only hardware I can get access to. The problem does not occur with that, just like it didn't occur with EHCI.

I don't see how it could suddenly get fixed anyway, since I am not seeing any commit to the vioscsi win driver or the uas kernel driver since I last tested. And this is still the best lead I could see myself:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/storage/uas.c?h=v4.12#n905

(mainly usb_alloc_streams() / use_streams, because IIRC, qdepth is set to the same value for either case with this particular device)
Comment 12 Peixiu Hou 2017-08-25 03:46:47 EDT
Hi,

Tried to reproduce this bug with our test env:

1. cannot reproduce it with the comment#4 steps and versions. vioscsi loading normally with scsi-block.

2. cannot reproduce it with the comment#4 steps and our latest versions. vioscsi loading normally with scsi-block.

Used versions:
kernel-3.10.0-691.el7.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
virtio-win-1.9.3-1.el7.noarch
seabios-1.10.2-3.el7.x86_64

3. Tried to reproduce this bug with a USB3.0-HD due to we don't have the env "Intel X25-M SSD connected to USB 3 through a StarTech UAS adapter", didn't reproduce it. vioscsi driver loading normally with scsi-block. Used vioscsi driver version is build 110.

Best Regards~
Peixiu Hou
Comment 13 lijin 2017-08-25 03:56:16 EDT
Hi Vadim,

QE can't reproduce this issue as we don't have such device.

Do you have any idea about this bug?
Comment 14 Vadim Rozenfeld 2017-08-25 06:11:23 EDT
(In reply to lijin from comment #13)
> Hi Vadim,
> 
> QE can't reproduce this issue as we don't have such device.
> 
> Do you have any idea about this bug?

Hi Li Jin,
Honestly I don't see any other way to troubleshoot this case rather than
reproducing the problem on our setup with a slightly customized driver to print out SRBs execution flow. Otherwise, only if Tom can dump scsi related traffic with "trace-event scsi_* on" command, issued in qemu monitor for both, successul and non-successful runs, we can try analysing two different trace logs and try to
guess the problem.

Best regards,
Vadim.
Comment 15 Tom Yan 2017-08-25 22:40:20 EDT
Okay I end up with getting access to a machine with XHCI SuperSpeed ports (and HiSpeed ports). I can reproduce the issue with virtio-win-0.1.110-2 ISO, and can also confirm that the issue is apparently fixed in both virtio-win-0.1.126-2 and virtio-win-0.1.141-1 ISOs (current stable and latest).

I tested three cases with the "trace-event scsi_* on" suggested by Vadim anyway and did found something.

110-2 on SuperSpeed port:
...
1955@1503712118.332653:scsi_req_parsed target 0 lun 0 tag 925820064 command 40 dir 1 length 512
1955@1503712118.332657:scsi_req_parsed_lba target 0 lun 0 tag 925820064 command 40 lba 0
1955@1503712118.332662:scsi_req_alloc target 0 lun 0 tag 925820064
1955@1503712118.332666:scsi_req_continue target 0 lun 0 tag 925820064
1955@1503712118.332920:scsi_req_dequeue target 0 lun 0 tag 925820064
1955@1503712118.333066:scsi_req_parsed target 0 lun 0 tag 1904113856 command 0 dir 1 length 556
1955@1503712118.333074:scsi_req_parsed_lba target 0 lun 0 tag 1904113856 command 0 lba 0
1955@1503712118.333078:scsi_req_alloc target 0 lun 0 tag 1904113856
1955@1503712118.333080:scsi_test_unit_ready target 0 lun 0 tag 1904113856
1955@1503712118.333084:scsi_req_continue target 0 lun 0 tag 1904113856
^Cqemu-system-x86_64: terminating on signal 2
(Had to unplug the device to get qemu killed)
1955@1503712159.447021:scsi_req_dequeue target 0 lun 0 tag 1904113856
...

110-2 on HiSpeed port:
...
2033@1503712305.375597:scsi_req_parsed target 0 lun 0 tag -829434720 command 40 dir 1 length 512
2033@1503712305.375601:scsi_req_parsed_lba target 0 lun 0 tag -829434720 command 40 lba 0
2033@1503712305.375604:scsi_req_alloc target 0 lun 0 tag -829434720
2033@1503712305.375607:scsi_req_continue target 0 lun 0 tag -829434720
2033@1503712305.375861:scsi_req_dequeue target 0 lun 0 tag -829434720
2033@1503712305.375980:scsi_req_parsed target 0 lun 0 tag 1599908032 command 0 dir 1 length 556
2033@1503712305.375983:scsi_req_parsed_lba target 0 lun 0 tag 1599908032 command 0 lba 0
2033@1503712305.375985:scsi_req_alloc target 0 lun 0 tag 1599908032
2033@1503712305.375987:scsi_test_unit_ready target 0 lun 0 tag 1599908032
2033@1503712305.375989:scsi_req_continue target 0 lun 0 tag 1599908032
2033@1503712305.376080:scsi_req_data target 0 lun 0 tag 1599908032 len 556
2033@1503712305.376083:scsi_req_continue target 0 lun 0 tag 1599908032
2033@1503712305.376085:scsi_req_dequeue target 0 lun 0 tag 1599908032
2033@1503712305.376111:scsi_req_parsed target 0 lun 0 tag -813637728 command 26 dir 1 length 192
2033@1503712305.376114:scsi_req_parsed_lba target 0 lun 0 tag -813637728 command 26 lba 7168
2033@1503712305.376116:scsi_req_alloc target 0 lun 0 tag -813637728
2033@1503712305.376119:scsi_req_continue target 0 lun 0 tag -813637728
...

141-1 on SuperSpeed port:
...
2425@1503714206.930049:scsi_req_parsed target 0 lun 0 tag -417548272 command 40 dir 1 length 512
2425@1503714206.930053:scsi_req_parsed_lba target 0 lun 0 tag -417548272 command 40 lba 0
2425@1503714206.930055:scsi_req_alloc target 0 lun 0 tag -417548272
2425@1503714206.930058:scsi_req_continue target 0 lun 0 tag -417548272
2425@1503714206.930306:scsi_req_dequeue target 0 lun 0 tag -417548272
2425@1503714206.930416:scsi_req_parsed target 0 lun 0 tag -417545360 command 26 dir 1 length 192
2425@1503714206.930420:scsi_req_parsed_lba target 0 lun 0 tag -417545360 command 26 lba 7168
2425@1503714206.930422:scsi_req_alloc target 0 lun 0 tag -417545360
2425@1503714206.930425:scsi_req_continue target 0 lun 0 tag -417545360
...

It seems that the issue was triggered by the peculiar "command 0 dir 1 length 556", which could be got through if the port is a HiSpeed one (because not using streams?) and no longer found if the latest ISO is used.

It would be great if its origin (and the exact commit that got rid of it) can be identified, otherwise we cannot call it fixed, because we don't really know its nature and can't be certain that it wouldn't be triggered in some other occasions.

Will attach the full trace-event logs of the three cases. There is a single blank line in all three of them, which I created at the point just before I press "Next" after I picked the driver to install.

Btw, in case you are interested in reproducing it yourself, maybe it can be done with nested kvm (nec-usb-xhci + usb-uas + scsi-hd/scsi-block on the first level and virtio-scsi-pci + scsi-block on the second level).
Comment 16 Tom Yan 2017-08-25 22:43 EDT
Created attachment 1318354 [details]
"trace-event scsi_* on" (110-2 on SuperSpeed port)
Comment 17 Tom Yan 2017-08-25 22:44 EDT
Created attachment 1318355 [details]
"trace-event scsi_* on" (110-2 on HiSpeed port)
Comment 18 Tom Yan 2017-08-25 22:45 EDT
Created attachment 1318356 [details]
"trace-event scsi_* on" (141-1 on SuperSpeed port)
Comment 19 Tom Yan 2017-08-26 02:02:40 EDT
Just tested with nested kvm. Apparently the emulated UAS will not trigger the same issue. I even tried to change this to 5:

https://git.qemu.org/gitweb.cgi?p=qemu.git;a=blob;f=hw/usb/dev-uas.c;h=fffc4243969611d263247889b5175e938cd6ce57;hb=359c41abe32638adad503e386969fa428cecff52#l104

So that its max streams will be 32 just like the adapter. Still it does not stall.
Comment 20 Vadim Rozenfeld 2017-08-27 06:13:17 EDT
(In reply to Tom Yan from comment #15)
> Okay I end up with getting access to a machine with XHCI SuperSpeed ports
> (and HiSpeed ports). I can reproduce the issue with virtio-win-0.1.110-2
> ISO, and can also confirm that the issue is apparently fixed in both
> virtio-win-0.1.126-2 and virtio-win-0.1.141-1 ISOs (current stable and
> latest).
> 
.........................................................
> I tested three cases with the "trace-event scsi_* on" suggested by Vadim
> anyway and did found something.
> It would be great if its origin (and the exact commit that got rid of it)
> can be identified, otherwise we cannot call it fixed, because we don't
> really know its nature and can't be certain that it wouldn't be triggered in
> some other occasions.
>

Hi Tom.
First of all let me thank you for your invaluable help with this issue. You are really doing a great job helping us ut. I don't like to put more pressure on you, but if you have some time, could you give me a favor and give a try to some more drivers released between builds 110 and 126? The difference between these two versions is just as huge, that I even wouldn't try guessing which particular patch fixed the problem without trying to narrow down the options.

Thank you in advance,
Vadim.
Comment 21 Tom Yan 2017-08-28 00:11:29 EDT
Apparently 126 is the first version that works. I tried 118, 117 and 113, none of them works.

There seems to be quite a gap between 126 and 118 though. However, there isn't any ISO in between them that can be found here:

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/archive-virtio/

So I am not sure if I can help further.
Comment 22 Vadim Rozenfeld 2017-08-28 21:26:18 EDT
builds from 119 to 125 were created for internal testing purpose only, this is the reason why they are not available for direct download. If you have some time to give a try to the drivers from those builds I can pack and send them to you.

Thanks,
Vadim.
Comment 23 Tom Yan 2017-08-28 21:39:17 EDT
No problem. I can have them tested.
Comment 24 Vadim Rozenfeld 2017-08-30 21:25 EDT
Created attachment 1320357 [details]
vioscsi drivers for testing from builds 119,121,123 and 124
Comment 25 Vadim Rozenfeld 2017-08-30 21:37:32 EDT
(In reply to Tom Yan from comment #23)
> No problem. I can have them tested.

That will be great.
There is collection of vioscsi drivers from four different builds at https://bugzilla.redhat.com/attachment.cgi?id=1320357

Thank you a lot.
Vadim.
Comment 26 Tom Yan 2017-08-31 23:23:42 EDT
I confirmed that all four builds work. In other words, it starts working since 119. I also ran "trace-event scsi_* on" to confirm that the apparently-relevant "command 0 dir 1 length 556" is gone since 119.
Comment 27 Vadim Rozenfeld 2017-09-03 05:50:41 EDT
can you give another try to 119 with virtio 1.0 disabled. You can turn it off
by adding the following parameters ",disable-legacy=off,disable-modern=on" to 
"-device virtio-scsi-pci," string.
Thanks,
Vadim.
Comment 28 Tom Yan 2017-09-04 00:21:02 EDT
The options do not stop 119 from working.

Note You need to log in before you can comment on or make changes to this bug.