Bug 1711045 - Windows 2016 guests report "The required inquiry data (SCSI page 83h VPD descriptor) was reported as not being supported." when using shared disks
Summary: Windows 2016 guests report "The required inquiry data (SCSI page 83h VPD descriptor) was reported as not being supported." when using shared disks
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ryan Barry
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-16 19:25 UTC by Allie DeVolder
Modified: 2023-12-15 16:30 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-27 22:00:52 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments
datadomain.PNG (70.01 KB, image/png), 2019-05-17 14:15 UTC, Shawn B.
diskpart.PNG (40.39 KB, image/png), 2019-05-17 14:15 UTC, Shawn B.
test-cluster_report.PNG (72.07 KB, image/png), 2019-05-17 14:16 UTC, Shawn B.
virtioscsi_driver_details.PNG (49.59 KB, image/png), 2019-05-17 14:16 UTC, Shawn B.
vm_shareddisk.PNG (19.33 KB, image/png), 2019-05-17 14:16 UTC, Shawn B.
clearreservation_ps.PNG (34.55 KB, image/png), 2019-05-17 16:44 UTC, Shawn B.
clearreservation_scsicmd.PNG (69.60 KB, image/png), 2019-05-17 16:44 UTC, Shawn B.
Validation Report 2019.05.17 At 16.01.30.htm (814.76 KB, text/html), 2019-05-17 20:10 UTC, Shawn B.
directlun.PNG (15.64 KB, image/png), 2019-05-17 20:11 UTC, Shawn B.
Win32_DiskDrive.PNG (45.19 KB, image/png), 2019-05-17 20:11 UTC, Shawn B.
vdsm.log (11.34 MB, text/plain), 2019-05-17 20:44 UTC, Shawn B.
vdsm.log (11.46 MB, text/plain), 2019-05-20 13:41 UTC, Shawn B.
qemu-pr-helper.strace.out (312.24 KB, text/plain), 2019-10-09 10:22 UTC, Roman Hodain
strace during .\sg_persist.exe --in -k -d f: (7.74 KB, text/plain), 2019-11-04 13:20 UTC, Roman Hodain
strace -ff -p 202536 of the qemu-pr-helper (419.49 KB, text/plain), 2019-11-12 09:41 UTC, Roman Hodain
strace during reservation from host (4.95 KB, text/plain), 2019-11-22 14:25 UTC, Roman Hodain
strace during reservation from vm (301.94 KB, text/plain), 2019-11-22 14:26 UTC, Roman Hodain
Report from cluster verification (44.90 KB, text/html), 2020-04-01 09:40 UTC, Roman Hodain


Links
Red Hat Knowledge Base (Solution) 4741251, last updated 2020-01-17 18:40:05 UTC

Internal Links: 1710323

Description Allie DeVolder 2019-05-16 19:25:34 UTC
Description of problem:
Windows 2016 guests report "The required inquiry data (SCSI page 83h VPD descriptor) was reported as not being supported." when using shared disks

Version-Release number of selected component (if applicable):
RHV-M: rhvm-4.3.3.7-0.1.el7.noarch
vdsm: vdsm-4.30.13-1.el7ev.x86_64


How reproducible:
unknown

Steps to Reproduce:
1. Build Windows 2016 Standard Core VM in RHV 4.3
2. Attempt clustering within Windows using shared virtio OR virtio-scsi disks

Actual results:
"The required inquiry data (SCSI page 83h VPD descriptor) was reported as not being supported." reported by Windows guest

Expected results:
Functional cluster

Additional info:
This appears to be the same issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1111783 but for Windows 2016 guests.

Comment 1 Ryan Barry 2019-05-17 00:19:47 UTC
This is not the same issue.

That bug was about SCSI reservation only.

Please provide a screenshot of the failure in Windows, as the full cluster message is important.

In addition, a screenshot of the disk sharing settings in RHV. And any SAN settings related to the VPD ID (some vendors allow setting this, and it must be set to 3 or above, but vendor defaults may be lower).

Is MPIO used? Are the disk signatures normal? Windows clustering has many places where it can fail, and the following needs to be verified before we get to RHV:

SAN VPD settings
LUN signature
Then, in RHV, direct LUN? Marked sharable? SCSI reservation set?

In Windows, MPIO configured correctly? Disk partitioned?

Comment 2 Shawn B. 2019-05-17 14:14:59 UTC
Hello,

We are using iSCSI data domains with shared disks, not direct LUNs as the BZ above details. See https://bugzilla.redhat.com/show_bug.cgi?id=1111784

Comment 3 Shawn B. 2019-05-17 14:15:30 UTC
Created attachment 1570119 [details]
datadomain.PNG

Comment 4 Shawn B. 2019-05-17 14:15:42 UTC
Created attachment 1570120 [details]
diskpart.PNG

Comment 5 Shawn B. 2019-05-17 14:16:00 UTC
Created attachment 1570121 [details]
test-cluster_report.PNG

Comment 6 Shawn B. 2019-05-17 14:16:12 UTC
Created attachment 1570122 [details]
virtioscsi_driver_details.PNG

Comment 7 Shawn B. 2019-05-17 14:16:28 UTC
Created attachment 1570123 [details]
vm_shareddisk.PNG

Comment 8 Ryan Barry 2019-05-17 14:43:54 UTC
(In reply to Shawn B. from comment #2)
> Hello,
> 
> We are using iSCSI data domains with shared disks not direct luns as the BZ
> above details. See https://bugzilla.redhat.com/show_bug.cgi?id=1111784

Hi Shawn. That bug is about SCSI reservations, and the cluster deployment here is failing before it even gets to the point of SCSI-3 reservations, which points to a different issue.

After ensuring that the VPD Inquiry type is set to 3 on the SAN controller (no images of that), please do the following:

Add the disk as a direct LUN in RHV (still marked as sharable, with reservations set, and the checkbox for privileged I/O operations set), and re-test

The reason for this is that, even though virtio-scsi makes a best effort to pass raw SCSI commands (including VPD inquiries) back to the host storage, creating a disk on an iSCSI domain and attaching that shared disk to VMs is not going to work the way Windows clustering expects it to, and will not pass VPD inquiries to the backend storage.

Yes, this does mean that it is not a "plain" disk image in RHV, and must be managed as any other LUN. From the libvirt docs:

--------------------------------------------------------------------

disk
    The disk element is the main container for describing disks and supports the following attributes: 
  device
    Indicates how the disk is to be exposed to the guest OS. Possible values for this attribute are "floppy", "disk", "cdrom", and "lun", defaulting to "disk".

    Using "lun" (since 0.9.10) is only valid when the type is "block" or "network" for protocol='iscsi' or when the type is "volume" when using an iSCSI source pool for mode "host" or as an NPIV virtual Host Bus Adapter (vHBA) using a Fibre Channel storage pool. Configured in this manner, the LUN behaves identically to "disk", except that generic SCSI commands from the guest are accepted and passed through to the physical device. Also note that device='lun' will only be recognized for actual raw devices, but never for individual partitions or LVM partitions (in those cases, the kernel will reject the generic SCSI commands, making it identical to device='disk'). Since 0.1.4


--------------------------------------------------------------------

The important part for the error you are seeing is "generic SCSI commands from the guest are accepted and passed through to the physical device". It must be device type "lun".
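
For reference, a quick way to confirm the passthrough from inside a Linux guest (a sketch only; it assumes the shared LUN appears as /dev/sdb and that sg3_utils is installed):

# Read the Device Identification VPD page (0x83); with device='lun' this
# inquiry should reach the physical LUN instead of the emulated disk.
# (/dev/sdb is an assumed example path.)
sg_vpd --page=di /dev/sdb

# Read any persistent reservation the same way Windows clustering will:
sg_persist --in -r -d /dev/sdb

If the first command reports the page as unsupported, the inquiry is still being answered by the emulated disk rather than passed through.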

Comment 9 Shawn B. 2019-05-17 15:31:52 UTC
That indeed works as expected.

Is there anything on the roadmap or BZ for supporting plain virtual disks instead of direct luns? 

Thanks

Comment 10 Ryan Barry 2019-05-17 15:37:58 UTC
Not from RHV, but we depend on libvirt for this functionality.

I'll try to find an RFE, and create one if it doesn't exist, then link it to this bug.

Since it works, ok to close?

Comment 11 Shawn B. 2019-05-17 15:39:53 UTC
Yes thanks

Comment 12 Shawn B. 2019-05-17 16:40:09 UTC
Actually, I am going to have to retract that. The test now shows 83h as supported, but for some reason the guest thinks a reservation is held when there is not. I am unable to clear said reservation.

Comment 13 Shawn B. 2019-05-17 16:44:03 UTC
Created attachment 1570174 [details]
clearreservation_ps.PNG

Comment 14 Shawn B. 2019-05-17 16:44:20 UTC
Created attachment 1570175 [details]
clearreservation_scsicmd.PNG

Comment 15 Ryan Barry 2019-05-17 17:06:30 UTC
I'd actually call this a positive result, since the reservation is visible at all, which means https://bugzilla.redhat.com/show_bug.cgi?id=1111784 is working...

If you haven't yet, please stop the cluster service on both nodes and add -Force to clear it

Comment 16 Shawn B. 2019-05-17 17:21:46 UTC
Same I/O error (clearreservation_ps.PNG) on all nodes.

Comment 17 Ryan Barry 2019-05-17 17:32:27 UTC
Ok --

Not to put too fine a point on it, but the RHV bug here is intended to expose SCSI reservation support, and that seems to be working. I'm happy to help you work through the issue, but strictly speaking it's not an RHV problem (RHV is doing its job by making SCSI persistent reservations visible to the guest)

Have you tried clearing the reservation from one of the hypervisors? Are you sure the cluster service is stopped on all nodes? Re-creating the LUN? Is there a screenshot from the Windows Cluster Wizard? Are you using the disk id from Disk Management (correct) or from the cluster wizard (will not work)?
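
For reference, clearing from a hypervisor would look something like this (a sketch; 0xFFFFFFFF is an example key, and the device path is taken from this environment):

# Register, ignoring any existing key, then CLEAR, which releases the
# reservation and removes all registrations (0xFFFFFFFF is an example key):
sg_persist --out --register-ignore --param-sark=0xFFFFFFFF /dev/mapper/36000d310048cee00000000000000001d
sg_persist --out --clear --param-rk=0xFFFFFFFF /dev/mapper/36000d310048cee00000000000000001d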

Comment 18 Shawn B. 2019-05-17 19:05:58 UTC
I am able to view, create, and release a reservation on the host machines, but cannot in the guest VM.




$ ansible rhev_hosts -a "sg_persist --in -r -d /dev/mapper/36000d310048cee00000000000000001d"
rvhost06.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held

rvhost02.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held

rvhost01.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held

rvhost04.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held

rvhost03.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held

rvhost07.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, there is NO reservation held




[root@rvhost04 ~]# sg_persist --out --register --param-sark=0xFFFFFFFF /dev/mapper/36000d310048cee00000000000000001d
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk





[root@rvhost04 ~]# sg_persist --out --reserve --param-rk=0xFFFFFFFF --prout-type=5 /dev/mapper/36000d310048cee00000000000000001d
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk





$ ansible rhev_hosts -a "sg_persist --in -r -d /dev/mapper/36000d310048cee00000000000000001d"
rvhost06.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

rvhost02.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

rvhost01.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

rvhost04.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

rvhost03.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only

rvhost07.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x3, Reservation follows:
    Key=0xffffffff
    scope: LU_SCOPE,  type: Write Exclusive, registrants only





[root@rvhost04 ~]# sg_persist --out --release --param-rk=0xFFFFFFFF --prout-type=5 /dev/mapper/36000d310048cee00000000000000001d
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk





[root@rvhost04 ~]# sg_persist --out --register --param-rk=0xFFFFFFFF /dev/mapper/36000d310048cee00000000000000001d
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk




$ ansible rhev_hosts -a "sg_persist --in -r -d /dev/mapper/36000d310048cee00000000000000001d"
rvhost02.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost06.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost01.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost03.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost04.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost07.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held







With one cluster guest VM running:

$ ansible rhev_hosts -a "sg_persist --in -r -d /dev/mapper/36000d310048cee00000000000000001d"
rvhost06.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost02.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost01.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost04.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost03.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held

rvhost07.cmg.pok.med | CHANGED | rc=0 >>
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
  PR generation=0x4, there is NO reservation held



PS C:\> .\scsicmd.exe -d1 -sscsi3_reserve



************************* SCSI-3 RESEVE OPERATION *****************************
* The SCSI-3 RESERVATION performs SCSI-3 reservation on disk(s) given in the  *
*  -d option.                                                                 *
* SCSICMD uses the predefined SCSI-3 '1234567812345678' to perform SCSI-3     *
* reservation on a disk that has not had any SCSI-2/SCSI-3 reservation.       *
*                                                                             *
* If SFW DMP (5.0 DDI-3/5.1 GA/5.1 DDI-1 or a newer DDI version) hasn't been  *
* installed nor claim the disk, make sure that there is only one HBA path     *
* connected to the disk(s) from the testing host before running SCSICMD tool. *
*******************************************************************************


Harddisk1

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

ERROR: Failed to perform SCSI-3 RESERVE action on harddisk1.
          *****  PGR IN <== READ RESERVATION operation on harddisk1  *****

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

Retry flag: 0
Scsi status: 02h

Sense Info -- consult SCSI spec for details
-------------------------------------------------------------
70 00 0B 00 00 00 00 0A 00 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Sense code Interpretation
Major (0x70):
=============
ABORTED COMMAND: Indicates that the device server aborted the command. The application
client may be able to recover by trying the command again.
Ch Obsolete

Minor (0x08 0x00):
==================
D T L WR OMA E B K V F LOGICAL UNIT COMMUNICATION FAILURE


ERROR: Unable to perform SCSI-3 read reservation on harddisk1.
       The disk may have under a cluster resource or may not support SCSI-3 PR.

Comment 19 Ryan Barry 2019-05-17 19:35:11 UTC
This is a change from the last comment, which showed the reservation held, and no abort.

Still missing screenshots of:

The cluster wizard
Updated storage settings for the LUN now that it is a direct LUN instead of a shared disk: privileged SCSI I/O enabled, SCSI pass-through enabled, shareable, reservation set?

Additionally, ValidateStorage.txt from the cluster wizard would be helpful, plus verifying the disk ID. Disk 1 is almost always C: in Windows.

$AllDevices = gwmi -Class Win32_DiskDrive -Namespace 'root\CIMV2'
ForEach ($Device in $AllDevices) {
  @{
    Name=$Device.Name;
    Caption=$Device.Caption;
    Index=$Device.Index;
    SerialNo=$Device.SerialNumber;
  } | Format-Table -AutoSize
}

Comment 20 Shawn B. 2019-05-17 20:10:05 UTC
Right, the check no longer complained about 83h not being supported, but now it just doesn't seem to be able to detect, release, or create a reservation.

I'm attaching the report generated by Test-Cluster, the direct LUN screenshot, and the Win32_DiskDrive screenshot.

Comment 21 Shawn B. 2019-05-17 20:10:58 UTC
Created attachment 1570358 [details]
Validation Report 2019.05.17 At 16.01.30.htm

Comment 22 Shawn B. 2019-05-17 20:11:16 UTC
Created attachment 1570359 [details]
directlun.PNG

Comment 23 Shawn B. 2019-05-17 20:11:35 UTC
Created attachment 1570360 [details]
Win32_DiskDrive.PNG

Comment 24 Ryan Barry 2019-05-17 20:28:19 UTC
Ok, so now we may be back to https://bugzilla.redhat.com/show_bug.cgi?id=1111784

Can you please attach vdsm.log from the host, and check that 'qemu-pr-helper' is in `ps`?

Comment 25 Shawn B. 2019-05-17 20:44:06 UTC
[root@rvhost07 ~]# ps auxww | grep "qemu-pr-helper"
root     41558  0.0  0.0 112708   980 pts/0    S+   16:43   0:00 grep --color=auto qemu-pr-helper
root     44513  0.0  0.0 100028 10568 ?        S    15:49   0:00 /usr/bin/qemu-pr-helper -k /var/lib/libvirt/qemu/domain-15-FILESVR01NEW/pr-helper0.sock -f /var/lib/libvirt/qemu/domain-15-FILESVR01NEW/pr-helper0.pid

Comment 26 Shawn B. 2019-05-17 20:44:27 UTC
Created attachment 1570378 [details]
vdsm.log

Comment 27 Ryan Barry 2019-05-19 08:16:38 UTC
Hi Shawn -

This vdsm log doesn't contain the libvirt XML I was looking for. Can you snag it out of engine.log, or attach whichever vdsm log contains it?

Petr -

This one is iSCSI. I can't reproduce, but I'm traveling this week. Can you get me a Win2016 environment joined to AD with an iSCSI LUN available?

Comment 28 Petr Matyáš 2019-05-20 12:50:27 UTC
I created the env for you and sent you an email with information about it.

Comment 29 Shawn B. 2019-05-20 13:41:09 UTC
Created attachment 1571294 [details]
vdsm.log

Comment 30 Shawn B. 2019-05-20 13:42:50 UTC
Hopefully this log has what you're looking for. I shut down the VM and started it before pulling the log.

2019-05-20 09:33:47,108-0400 INFO  (jsonrpc/1) [api.virt] START create(vmParams={u'xml': u'<?xml version="1.0" encoding="UTF-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0"><name>FILESVR01NEW

Comment 31 Ryan Barry 2019-05-28 16:56:28 UTC
The logs here look ok, and I can't reproduce locally.

Is this a domain-independent cluster or no?

Comment 32 Shawn B. 2019-06-03 14:38:19 UTC
Hi Ryan,

The windows cluster is AD joined. 

I ran scsicmd's SCSI-3 test on the disk in Windows; maybe there's some insight there.

$ scsicmd.exe -d1 -sscsi3_test



********************** SCSI-3 SUPPORT TEST  ***********************************
* The SCSI-3 SUPPORT TEST performs a set of SCSI-3 PR on disk(s) specified in *
* -d option.                                                                  *
* SCSICMD uses the predefined '1234567812345678' key in SCSI-3 support test.  *
*                                                                             *
* Make sure that the testing shouldn't be under the cluster resource.         *
*                                                                             *
* If SFW DMP (5.0 DDI-3/5.1 GA/5.1 DDI-1 or a newer DDI version) hasn't been  *
* installed nor claim the disk, make sure that there is only one HBA path     *
* connected to the disk(s) from the testing host before running SCSICMD tool. *
*******************************************************************************


Harddisk1

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

******** PERFORM SCSI-3 PR OPERATION TESTS ON Harddisk1   *************


===>Test #1:  Clean up any SCSI-3 keys left on harddisk1

          *****  PGR IN <== READ KEY operation on harddisk1  *****

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

Retry flag: 0
Scsi status: 02h

Sense Info -- consult SCSI spec for details
-------------------------------------------------------------
70 00 0B 00 00 00 00 0A 00 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Sense code Interpretation
Major (0x70):
=============
ABORTED COMMAND: Indicates that the device server aborted the command. The application
client may be able to recover by trying the command again.
Ch Obsolete

Minor (0x08 0x00):
==================
D T L WR OMA E B K V F LOGICAL UNIT COMMUNICATION FAILURE


Test #1 - *** Failed.  Unable to perform SCSI-3 read action. 
If the error shows error 21, it means that the disk may have been placed
under cluster resoure and the disk may have been on-line on the other node. 

If the above error occurred, you should try SCSI-3 persistence operation tests on
a different disk. 
          *****  PGR OUT ==> REGISTER IGNORE EXISITING KEY operation on harddisk1  *****

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

Retry flag: 0
Scsi status: 02h

Sense Info -- consult SCSI spec for details
-------------------------------------------------------------
70 00 0B 00 00 00 00 0A 00 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Sense code Interpretation
Major (0x70):
=============
ABORTED COMMAND: Indicates that the device server aborted the command. The application
client may be able to recover by trying the command again.
Ch Obsolete

Minor (0x08 0x00):
==================
D T L WR OMA E B K V F LOGICAL UNIT COMMUNICATION FAILURE


          *****  PGR OUT ==> CLEAR operation on harddisk1  *****

Scsi Address
------------------
  Length     : 0x8
  PortNumber : 0x3
  PathId     : 0x0
  TargetId   : 0x0
  Lun        : 0x1

Retry flag: 0
Scsi status: 00h, Bytes returned: 2Ch, Data buffer length: 18h


Returned SCSI_PASS_THROUGH_DIRECT_WITH_BUFFER

data info:
      00  01  02  03  04  05  06  07   08  09  0A  0B  0C  0D  0E  0F
      ---------------------------------------------------------------
 000  2C  00  00  00  00  01  0A  00   00  00  00  00  18  00  00  00   
 010  1E  00  00  00  80  9D  01  01   30  00  00  00  5F  03  00  00   
 020  00  00  00  00  18  00  00  00   00  00  00  00  00  00  00  00   
 030  00  00  00  00  00  00  00  00   00  00  00  00  00  00  00  00   
 040  00  00  00  00  00  00  00  00   00  00  00  00  00  00  00  00   


Returned data:

data info:
      00  01  02  03  04  05  06  07   08  09  0A  0B  0C  0D  0E  0F
      ---------------------------------------------------------------
 000  12  34  56  78  12  34  56  78   00  00  00  00  00  00  00  00   
 010  00  00  00  00  00  00  00  00

Comment 33 Shawn B. 2019-06-04 19:30:15 UTC
Decided to do a quick test with a CentOS 7 and Ubuntu 16.04 guest using the same LUN.

# parted /dev/sdb p
Model: COMPELNT Compellent Vol (scsi)
Disk /dev/sdb: 1074MB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      65.5kB  1073MB  1073MB  primary  ntfs

# sg_persist --in -r -d /dev/sdb
  COMPELNT  Compellent Vol    0703
  Peripheral device type: disk
PR in (Read reservation): aborted command

Comment 34 Robert McSwain 2019-06-20 18:19:22 UTC
Any updates to follow up on the last set of new information? Thanks!

Comment 35 Ryan Barry 2019-06-20 18:27:21 UTC
Still no reproducer, and likely to still be storage configuration and NOTABUG

Comment 36 Shawn B. 2019-06-20 18:31:13 UTC
Hi Ryan,

If it's a storage configuration issue, I'd be curious what it could be, as it must be within oVirt: I can make the reservation manually from the hosts. I ended up passing the iSCSI networks to the guest and have no problems with reservations there.

Comment 37 Ryan Barry 2019-06-20 18:33:45 UTC
That is what I'm trying to figure out, but I'm hunting additional systems. My lab uses targetd, without a separate storage network, and I cannot reproduce, but I'm looking...

When you say you passed it to the guest, what do you mean?

Comment 38 Shawn B. 2019-06-20 18:44:13 UTC
I converted the iSCSI networks used for our storage domains into VM networks. After that I passed said networks to the guest and connected to our iSCSI target directly.

Would it be helpful for you to connect to our environment to get a better idea?

Comment 39 Ryan Barry 2019-06-20 19:40:37 UTC
Unfortunately, it probably would not. However, knowing who your storage vendor is may help in reproducing.

Comment 40 Shawn B. 2019-06-20 19:45:22 UTC
Dell SCv3020

Comment 41 Ryan Barry 2019-06-20 19:52:12 UTC
Thanks. That helps me narrow my search

Comment 42 Daniel Gur 2019-08-28 13:13:51 UTC
sync2jira

Comment 43 Daniel Gur 2019-08-28 13:18:05 UTC
sync2jira

Comment 44 Ryan Barry 2019-10-09 03:51:22 UTC
Michal -

We have a solid reproducer in the lab. qemu-pr-helper is started, the target (lio) supports reservations on the LUN, and libvirt XML is correct:

        <disk device="lun" sgio="unfiltered" snapshot="no" type="block">
            <target bus="scsi" dev="sdb"/>
            <source dev="/dev/mapper/360014056e7434bf5df14ac0a74b18c59">
                <reservations managed="yes"/>
                <seclabel model="dac" relabel="no" type="none"/>
            </source>
            <driver cache="none" error_policy="stop" io="native" name="qemu" type="raw"/>
            <alias name="ua-f68f0713-56fa-41b4-b7ee-1db46c83ecb8"/>
            <address bus="0" controller="0" target="0" type="drive" unit="3"/>
            <shareable/>
        </disk>

However, the Windows clustering wizard fails to set or clear a reservation, much less SCSI-3. This bug was verified not long ago. Any ideas about what's happening here?

Comment 45 Roman Hodain 2019-10-09 10:22:05 UTC
Created attachment 1623772 [details]
qemu-pr-helper.strace.out

Just an additional thing: I have collected an strace of the qemu-pr-helper process when running

PS C:\Users\Administrator\Desktop> .\sg_persist.exe --in -k -d f:
  LIO-ORG   iscsi07           4.0
  Peripheral device type: disk
PR in: aborted command


The process was 

root     28876     1  0 11:30 ?        00:00:00 /usr/bin/qemu-pr-helper -k /var/lib/libvirt/qemu/domain-76-rhodain-win2016-02/pr-helper0.sock -f /var/lib/libvirt/qemu/domain-76-rhodain-win2016-02/pr-helper0.pid

The disk is a multipath device 

360014056e7434bf5df14ac0a74b18c59 dm-91 LIO-ORG ,iscsi07         
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 17:0:0:7  sdac 65:192 active ready running

There is no specific multipath configuration for this device. Shall we set the reservation_key on the multipath device in multipath.conf?
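
For reference, a minimal sketch of what that might look like (the key value below is an arbitrary placeholder):

# /etc/multipath.conf (sketch; 0x123abc is a placeholder key)
defaults {
    # Key registered for persistent reservations sent through the
    # multipath layer (used by mpathpersist and qemu-pr-helper):
    reservation_key 0x123abc
}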

Comment 46 Michal Privoznik 2019-10-09 10:49:34 UTC
(In reply to Ryan Barry from comment #44)
> Michal -
> 
> We have a solid reproducer in the lab. qemu-pr-helper is started, the target
> (lio) supports reservations on the LUN, and libvirt XML is correct:
> 
>         <disk device="lun" sgio="unfiltered" snapshot="no" type="block">
>             <target bus="scsi" dev="sdb"/>
>             <source dev="/dev/mapper/360014056e7434bf5df14ac0a74b18c59">
>                 <reservations managed="yes"/>
>                 <seclabel model="dac" relabel="no" type="none"/>
>             </source>
>             <driver cache="none" error_policy="stop" io="native" name="qemu"
> type="raw"/>
>             <alias name="ua-f68f0713-56fa-41b4-b7ee-1db46c83ecb8"/>
>             <address bus="0" controller="0" target="0" type="drive"
> unit="3"/>
>             <shareable/>
>         </disk>
> 
> However, the Windows clustering wizard fails to set or clear a reservation,
> much less SCSI-3. This bug was verified not long ago. Any ideas about what's
> happening here?

Nothing rings a bell; the XML looks okay, and since qemu-pr-helper was started, libvirt's part was done. I don't know enough details of the SCSI protocol to suggest anything useful, sorry. Maybe there's a bug in the helper binary? Paolo, any thoughts?

Comment 47 Paolo Bonzini 2019-10-18 13:27:31 UTC
When you test setting the persistent reservation manually, please do it using mpathpersist.  This is the same code path that qemu-pr-helper uses (qemu-pr-helper and mpathpersist are basically wrappers for the same code).
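
For example, against the multipath device from comment 44 (a sketch; the key is a placeholder):

# Register a key and read it back through the multipath layer, which is
# the same code path qemu-pr-helper uses (0x123abc is an example key):
mpathpersist --out --register --param-sark=0x123abc /dev/mapper/360014056e7434bf5df14ac0a74b18c59
mpathpersist --in -k /dev/mapper/360014056e7434bf5df14ac0a74b18c59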

Comment 48 Ryan Barry 2019-10-18 17:28:25 UTC
Ack, I'll check on Roman's environment again on Monday

Comment 49 Ryan Barry 2019-10-21 18:40:27 UTC
The LUN can also be successfully reserved on the host with mpathpersist. Paolo, any suggestions for digging into qemu-pr-helper? It looks like the logs don't go anywhere.

Comment 50 Paolo Bonzini 2019-10-23 13:36:13 UTC
You could try a "strace -ff -e ioctl" of the qemu-pr-helper process. The LUN_COMM_FAILURE sense is sent in two cases: either a "MPATH_PR_OTHER" result from libmpathpersist, or a failure to send the request to qemu-pr-helper. So if we get a request at all, and/or if we see an error from qemu-pr-helper's SG_IO ioctls, we can narrow it to one of the two cases.
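
For example (a sketch; it assumes a single qemu-pr-helper process on the host):

# Attach to the running helper, follow forked children, and log only
# ioctl calls to per-process output files:
strace -ff -e trace=ioctl -p $(pidof qemu-pr-helper) -o /tmp/qemu-pr-helper.strace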

Comment 51 Roman Hodain 2019-11-04 13:20:32 UTC
Created attachment 1632531 [details]
strace during .\sg_persist.exe --in -k -d f:

Comment 61 Paolo Bonzini 2019-11-11 17:18:55 UTC
Roman, an oVirt user reported seeing something like

open("/dev/sded", O_RDONLY)             = -1 ENOENT (No such file or directory)

So it's worth trying a "strace -ff" again but this time without "-e ioctl".  If we can reproduce it, this would be a libvirt bug in how it starts qemu-pr-helper.

The ENOENT happens because qemu-pr-helper needs access to other devices than the ones in the VM configuration (namely those belonging to the multipath device).  I asked Michal Privoznik whether libvirt takes care with mount namespaces and/or device cgroups when it runs qemu-pr-helper, and if there is possibly a way to disable those security features and test whether they are the culprit.

Comment 62 Roman Hodain 2019-11-12 09:39:21 UTC
(In reply to Paolo Bonzini from comment #61)
> Roman, an oVirt user reported seeing something like
> 
> open("/dev/sded", O_RDONLY)             = -1 ENOENT (No such file or
> directory)
> 
> So it's worth trying a "strace -ff" again but this time without "-e ioctl". 
> If we can reproduce it, this would be a libvirt bug in how it starts
> qemu-pr-helper.
> 
> The ENOENT happens because qemu-pr-helper needs access to other devices than
> the ones in the VM configuration (namely those belonging to the multipath
> device).  I asked Michal Privoznik whether libvirt takes care with mount
> namespaces and/or device cgroups when it runs qemu-pr-helper, and if there
> is possibly a way to disable those security features and test whether they
> are the culprit.

Hi Paolo,

I have not noticed this issue. I will upload the strace shortly.

Comment 63 Roman Hodain 2019-11-12 09:41:21 UTC
Created attachment 1635267 [details]
strace -ff -p 202536  of the qemu-pr-helper

Executed during this command run on the guest.

.\sg_persist.exe -v --out --register --param-sark=0xDEADBEEF e:

Comment 64 Michal Privoznik 2019-11-13 15:36:42 UTC
(In reply to Paolo Bonzini from comment #61)
> Roman, an oVirt user reported seeing something like
> 
> open("/dev/sded", O_RDONLY)             = -1 ENOENT (No such file or
> directory)
> 
> So it's worth trying a "strace -ff" again but this time without "-e ioctl". 
> If we can reproduce it, this would be a libvirt bug in how it starts
> qemu-pr-helper.
> 
> The ENOENT happens because qemu-pr-helper needs access to other devices than
> the ones in the VM configuration (namely those belonging to the multipath
> device).  I asked Michal Privoznik whether libvirt takes care with mount
> namespaces and/or device cgroups when it runs qemu-pr-helper, and if there
> is possibly a way to disable those security features and test whether they
> are the culprit.

I've created a scratch build that fixes the ENOENT issue here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=24669501

But I don't think it is the same problem as the one we are seeing here. Anyway, if anybody wants to test it, please do so.

Comment 65 Paolo Bonzini 2019-11-13 17:09:56 UTC
> Created attachment 1635267 [details]
> strace -ff -p 202536  of the qemu-pr-helper
> 
> Executed during this command run on the guest.
> 
> .\sg_persist.exe -v --out --register --param-sark=0xDEADBEEF e:

This seems to be a strace for "mpathpersist --out --register --param-sark=0xABC123 /dev/mapper/..." run on a bare metal host.  It shows a 02h/08h/00 (Not Ready, LUN Communication Failure) sense code that could be the same symptom as seen in comment 58.  IIUC that is a LIO bug that has been fixed (Roman, do you have a number?)

Comment 66 Michal Privoznik 2019-11-19 13:54:47 UTC
(In reply to Paolo Bonzini from comment #61)
> 
> open("/dev/sded", O_RDONLY)             = -1 ENOENT (No such file or
> directory)
> 

BTW: we can rule libvirt out if you disable namespaces (namespaces=[] in qemu.conf).
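
That is, a sketch of the change (restart libvirtd afterwards for it to take effect):

# /etc/libvirt/qemu.conf
# An empty list disables mount namespace isolation, so qemu-pr-helper
# can see all host device nodes:
namespaces = [ ]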

Comment 67 Roman Hodain 2019-11-22 14:23:12 UTC
Just to make it clear, I am attaching the straces once more:

1) 
Hypervisor:

    [root@dell-r440-01 ~]# pidof qemu-pr-helper
    261059

    [root@dell-r440-01 ~]# strace -ff -p 261059 -o strace_reg_from_vm
    strace: Process 261059 attached
    ^Cstrace: Process 261059 detached

    [root@dell-r440-01 ~]# sg_persist --in -s -d /dev/mapper/36001405a2de76f5a0094b27b75d9333e
      LIO-ORG   iscsi10           4.0
      Peripheral device type: disk
      PR generation=0x35
      No full status descriptors


On the VM:

    PS C:\Users\Administrator\Desktop> .\sg_persist.exe -v --out --register --param-sark=0xDEADBEEF e:
        inquiry cdb: 12 00 00 00 24 00
      LIO-ORG   iscsi10           4.0
      Peripheral device type: disk
        Persistent Reservation Out cmd: 5f 00 00 00 00 00 00 00 18 00
    persistent reserve out:  Fixed format, current;  Sense key: Aborted Command
     Additional sense: Logical unit communication failure
    PR out: aborted command

2) 
On the hypervisor again:

    [root@dell-r440-01 ~]# strace -ff -o strace_reg_from_host sg_persist -v --out --register --param-sark=0xDEADBEEF /dev/mapper/36001405a2de76f5a0094b27b75d9333e
        inquiry cdb: 12 00 00 00 24 00 
      LIO-ORG   iscsi10           4.0
      Peripheral device type: disk
        Persistent Reservation Out cmd: 5f 00 00 00 00 00 00 00 18 00 
    PR out: command (Register) successful

    [root@dell-r440-01 ~]# sg_persist --in -s -d /dev/mapper/36001405a2de76f5a0094b27b75d9333e
      LIO-ORG   iscsi10           4.0
      Peripheral device type: disk
      PR generation=0x35
        Key=0xdeadbeef
          All target ports bit clear
          Relative port address: 0x1
          not reservation holder
          Transport Id of initiator:
            iSCSI name and session id: iqn.1994-05.com.redhat:83a6f6026f2

Comment 68 Roman Hodain 2019-11-22 14:25:27 UTC
Created attachment 1638761 [details]
strace during reservation from host

Comment 69 Roman Hodain 2019-11-22 14:26:02 UTC
Created attachment 1638762 [details]
strace during reservation from vm

Comment 73 Paolo Bonzini 2019-12-02 18:45:05 UTC
Roman, is it possible to get access to a reproduction environment?

Comment 74 Paolo Bonzini 2019-12-02 18:46:14 UTC
The /etc/target/pr issue is bug 1658988.

Comment 75 Marina Kalinin 2020-01-16 23:01:20 UTC
(In reply to Paolo Bonzini from comment #73)
> Roman, is it possible to get access to a reproduction environment?

Roman, seems like Paolo is asking you here?

Comment 79 Roman Hodain 2020-04-01 09:40:50 UTC
Created attachment 1675360 [details]
Report from cluster verification

Comment 80 Ryan Barry 2020-04-01 11:38:30 UTC
(In reply to Roman Hodain from comment #79)
> Created attachment 1675360 [details]
> Report from cluster verification

https://bugzilla.redhat.com/show_bug.cgi?id=1658988 is still not shipped

