Bug 1930388

Summary: [RHEL8] st driver unit attention behavior over iSCSI
Product: Red Hat Enterprise Linux 8 Reporter: Alex Wang <alex.wang>
Component: kernelAssignee: John Meneghini <jmeneghi>
kernel sub component: Storage Drivers QA Contact: Martin Hoyer <mhoyer>
Status: ASSIGNED --- Docs Contact:
Severity: high    
Priority: high CC: ccalhoun, cleech, jmeneghi, jpittman, loberman, mgandhi, mhoyer, revers
Version: 8.3Keywords: Reopened, Triaged
Target Milestone: rcFlags: jpittman: needinfo? (cleech)
loberman: needinfo? (cleech)
jpittman: needinfo? (revers)
ccalhoun: needinfo? (cleech)
ccalhoun: needinfo? (revers)
loberman: needinfo? (jmeneghi)
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2210024 (view as bug list) Environment:
Last Closed: 2023-03-14 19:26:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2210024    
Attachments:
Description Flags
Wireshark Expert Summary
none
Wireshark command summary
none
Wireshark SCSI Unit Attention none

Comment 13 loberman 2021-12-14 19:39:06 UTC
Chris

I managed to get this working with ESOS and a physical tape drive.

I will work on this now, may need help though if I get stuck.
We can get protocol traces using tcpdump

[7:0:0:0]    tape    HP       Ultrium 5-SCSI   Z64D  /dev/st0

target: iqn.2003-01.org.linux-iscsi.trace.x8664:sn.a59f99ea082b
	Portal: 192.168.1.107:3260,1
		Iface Name: default
Target: iqn.2021-12.esos.localhost:89136
	Portal: 192.168.1.178:3260,1
		Iface Name: default

[root@loberhel ~]# iscsiadm -m session -P3
iSCSI Transport Class version 2.0-870
version 6.2.0.874-19
Target: iqn.2021-12.esos.localhost:89136 (non-flash)
	Current Portal: 192.168.1.178:3260,1
	Persistent Portal: 192.168.1.178:3260,1
		**********
		Interface:
		**********
		Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.1994-05.com.redhat:29d829ee4a97
		Iface IPaddress: 192.168.1.214
		Iface HWaddress: <empty>
		Iface Netdev: <empty>
		SID: 3
		iSCSI Connection State: LOGGED IN
		iSCSI Session State: LOGGED_IN
		Internal iscsid Session State: NO CHANGE
		*********
		Timeouts:
		*********
		Recovery Timeout: 120
		Target Reset Timeout: 30
		LUN Reset Timeout: 30
		Abort Timeout: 15
		*****
		CHAP:
		*****
		username: <empty>
		password: ********
		username_in: <empty>
		password_in: ********
		************************
		Negotiated iSCSI params:
		************************
		HeaderDigest: None
		DataDigest: None
		MaxRecvDataSegmentLength: 262144
		MaxXmitDataSegmentLength: 1048576
		FirstBurstLength: 65536
		MaxBurstLength: 1048576
		ImmediateData: Yes
		InitialR2T: No
		MaxOutstandingR2T: 1
		************************
		Attached SCSI devices:
		************************
		Host Number: 7	State: running
		scsi7 Channel 00 Id 0 Lun: 0

Comment 14 loberman 2021-12-15 17:15:55 UTC
Customers current workaround

(*) It turns out that an otherwise-innocuous MTIOCTOP with opcode MTTELL, while not actually supported by st, will reset STps->rw to ST_IDLE, preventing the automatic filemark WEOF that leads to the problem. This itself is probably a bug, but we see no shame in exploiting it to get the behavior we need.

Comment 17 loberman 2022-08-09 15:00:59 UTC
I am trying to help now 

I have a device ready now served by another server connected to a Ultrium tape drive

[  388.884193] iscsi: registered transport (tcp)
[  465.744253] scsi host15: iSCSI Initiator over TCP/IP
[  494.465017] scsi host15: iSCSI Initiator over TCP/IP
[  494.504173] scsi 15:0:0:0: Sequential-Access HP       Ultrium 5-SCSI   Z64D PQ: 0 ANSI: 6
[  494.546018] scsi 15:0:0:0: alua: disable for non-disk devices
[  494.573621] scsi 15:0:0:0: Attached scsi generic sg5 type 1
[  494.601449] st: Version 20160209, fixed bufsize 32768, s/g segs 256
[  494.632130] st 15:0:0:0: Attached scsi tape st0
[  494.653792] st 15:0:0:0: st0: try direct i/o: yes (alignment 1 B)

[15:0:0:0]   tape    HP       Ultrium 5-SCSI   Z64D  /dev/st0   /dev/sg5

Comment 18 loberman 2022-08-09 15:14:30 UTC
Reproducer

Start a backup to the iscsi tape
Down the network interface on the target server and wait for the initiator to time out

ifconfig eno1 down

/usr/lib/debug/usr/lib/modules/4.18.0-305.30.1.el8_4.x86_64/kernel/drivers/mfd/
/usr/lib/debug/usr/lib/modules/4.18.0-305.30.1.el8_4.x86_64/kernel/drivers/mfd/intel-lpss-acpi.ko.debug
/usr/lib/debug/usr/lib/modules/4.18.0-305.30.1.el8_4.x86_64/kernel/drivers/mfd/intel-lpss-pci.ko.debug
tar: /dev/st0: Cannot write: Input/output error

tar: Error is not recoverable: exiting now

ifconfig eno1 up

I hear the tape rewind

List the contents and they are there so in this case I never got the issue reproduced
This issue "our next rewind attempt also triggers an implied MTWEOF, which does _not_ fail"

[root@ml150 ~]# tar tvf /dev/st0
drwxr-xr-x root/root         0 2021-11-11 16:04 usr/
dr-xr-xr-x root/root         0 2022-07-27 12:27 usr/bin/
-rwxr-xr-x root/root     71504 2019-11-19 09:21 usr/bin/diff3
-rwxr-xr-x root/root      3283 2022-06-08 15:13 usr/bin/catchsegv
-rwxr-xr-x root/root     54768 2019-11-19 09:21 usr/bin/sdiff
-rwxr-xr-x root/root     25424 2022-06-08 15:25 usr/bin/gencat
-rwxr-xr-x root/root    163416 2021-09-23 08:56 usr/bin/cpio


From the BZ

Per AWS support, when the gateway crashes in this manner, its notion of the current tape position is reset to the beginning of the tape. It also sets a unit attention condition, such that the next request results in a CHECK CONDITION status with sense key UNIT ATTENTION and asc/ascq indicating a device reset. According to their logs and our own, separate analysis, the next command being sent is WRITE FILEMARK, which results in writing an FM at the beginning of the tape, effectively discarding its contents.

In fact, once the write fails with EIO, our software attempts to recover by rewinding and repositioning the tape, then resuming operation. If this fails, it attempts to rewind and reposition again, write a marker at the end of the tape, and then unmount. It does not under any circumstances write either data or filemarks without having successfully positioned the tape to a known point.

What actually happens is that, since the last operation was a write, the kernel executes an implied MTWEOF operation (which translate to a Write Filemarks command) before the rewind that was actually requested. This seems not entirely unreasonable, provided the tape position is known.  However, once this request fails (due to the unit attention condition), our next rewind attempt also triggers an implied MTWEOF, which does _not_ fail (the unit attention condition persists only until the initiator has been notified); this is the command that unexpectedly erases the tape.

I will keep at this but it will probably need code inspection review etc. to figure out unless we can get a reproducer from the customer to use.

Its been so long, I am not keen to reach back out the customer for a reproducer.

Regards
Laurence

Comment 19 loberman 2022-08-09 20:55:52 UTC
Multiple attempts using tar to reproduce this have not worked

The last attempt included stopping and then restarting the target daemon on the target server but still never resulted in the MTWEOF being sent such that the beginning of the tape was compromised.

We will need to be doing what their backup software was doing to reproduce this I think. 

[root@dl385g10 ~]# systemctl stop target

/usr/bin/mysql_waitpid
/usr/bin/mysqlaccess
/usr/bin/mysqladmin
/usr/bin/mysqlbinlog
/usr/bin/mysqlcheck
/usr/bin/mysqldump
tar: /dev/nst0: Cannot write: Input/output error
tar: Error is not recoverable: exiting now

After waiting a while and restarting the target daemon (iscsit)

tar: Error is not recoverable: exiting now

Then list the tape, I hear it rewind, we are at BOT and data is still intact

[root@loberman ~]# tar tvf /dev/st0
drwxr-xr-x root/root         0 2021-06-28 14:41 usr/
drwxr-xr-x root/root         0 2022-05-13 09:11 usr/bin/
-rwxr-xr-x root/root      2161 2016-03-04 07:56 usr/bin/xzmore
-rwxr-xr-x root/root      3336 2021-03-26 08:45 usr/bin/catchsegv
-rwxr-xr-x root/root    199288 2018-05-03 11:50 usr/bin/find
-rwxr-xr-x root/root    154808 2014-08-15 00:05 usr/bin/m4
-rwxr-xr-x root/root     22640 2021-03-26 09:15 usr/bin/gencat

Comment 20 John Meneghini 2022-08-10 03:07:18 UTC
(In reply to loberman from comment #18)
> 
> Per AWS support, when the gateway crashes in this manner, its notion of the
> current tape position is reset to the beginning of the tape. It also sets a
> unit attention condition, such that the next request results in a CHECK
> CONDITION status with sense key UNIT ATTENTION and asc/ascq indicating a
> device reset. According to their logs and our own, separate analysis, the
> next command being sent is WRITE FILEMARK, which results in writing an FM at
> the beginning of the tape, effectively discarding its contents.

Yes, agreed.

I've looked at the provided .pcap trace with Wireshark.  The problem is a number of Malformed Packet
errors led to a TCP RST.  This had the effect of an iSCSI logout, which is a nexus clearing event. 
This resulted in SCSI Target device doing in internal LUN reset.  There is no way the SCSI Target
can be expected to preserve any internal state following the LUN reset. 

After the host logs in again the first unit attention advises the initiator that the device has
been reset and the tape should be re-positioned.  There is no way to continue at the same position
following a reset.  This has nothing to do with iSCSI, the SCSI protocols dictates that the tape must
be re-positioned following a reset.
   
> In fact, once the write fails with EIO, our software attempts to recover by
> rewinding and repositioning the tape, then resuming operation. If this
> fails, it attempts to rewind and reposition again, write a marker at the end
> of the tape, and then unmount. It does not under any circumstances write
> either data or filemarks without having successfully positioned the tape to
> a known point.

Yes, this is what it should do.

> What actually happens is that, since the last operation was a write, the
> kernel executes an implied MTWEOF operation (which translate to a Write
> Filemarks command) before the rewind that was actually requested. This seems
> not entirely unreasonable, provided the tape position is known.  However,
> once this request fails (due to the unit attention condition), our next
> rewind attempt also triggers an implied MTWEOF, which does _not_ fail (the
> unit attention condition persists only until the initiator has been
> notified); this is the command that unexpectedly erases the tape.

Following a reset many tape devices will immediately rewind.  So writing a filemark without knowing
the position of the tape is an error.  The initiator should first rewind and then forward to the end of
the tape to find the last valid file mark.  I agree that these implied MTWEOF operations coming from
the kernel are probably the cause of the unexpected erase.
  
> I will keep at this but it will probably need code inspection review etc. to
> figure out unless we can get a reproducer from the customer to use.
> 
> Its been so long, I am not keen to reach back out the customer for a
> reproducer.

Agreed. If you have the ability to reproduce this, we should probably work on the MTWEOF code. These implied
write filemarks operations are probably the source of the problem.

See the attached Wireshark images.

Comment 21 John Meneghini 2022-08-10 03:10:27 UTC
Created attachment 1904603 [details]
Wireshark Expert Summary

Comment 22 John Meneghini 2022-08-10 03:11:21 UTC
Created attachment 1904604 [details]
Wireshark command summary

Comment 23 John Meneghini 2022-08-10 03:12:32 UTC
Created attachment 1904605 [details]
Wireshark SCSI Unit Attention

Comment 24 loberman 2022-08-10 13:13:59 UTC
Yep, its not iscsi, the customer specifically points out behavior in the st driver.

The WireShark has the customer comments you can click on to see each event they speak of.
Whatever is creating the MTWEOF as an un-flushed pending command to the tape does not happen with the test setup I have.

I am looking writing a test program that directly communicated with the tape device and issues commands in the order the customer describes.

This may be a combination of behavior in that AWS VTL and the customers backup (AFS) application.

Regards
Laurence

Comment 25 loberman 2022-08-25 16:11:39 UTC
Customer Update
--------------------

I'm sorry; I don't have a test program per se. I'm attaching the relevant portions of the backup software, which may be helpful in following my earlier description of the problem and understanding the sequence of operations. The bulk of the work happens in tapeappend(), dumptotape(), and the support functions in tapedrive.c. You may also be interested in the sections protected by #ifdef MTTELL, which are the workaround we applied to avoid the problem.

Taking down the network interface on the target server is not going to trigger this problem. That results in a timeout generated in the iscsi stack on the initiator, but doesn't actually change the state of the target. In order to trigger the problem, you need the following sequence of events:
1) iSCSI communication is interrupted, e.g. because the target process crashed
2) the target resets its notion of the tape position to the beginning of the tape
3) the target responds to the next command with CHECK CONDITION / UNIT ATTENTION / device reset

In the test case you describe, all you are doing is causing (1) by temporarily taking down the network interface. Since the target itself is still fully operational, the system recovers (albeit after generating EIO) once network connectivity is restored.

If I understand correctly, your test case involves a software iscsi target fronting for a physical tape drive. In that case, even if the target process crashes, it seems unlikely that the drive is actually rewound, so the unwanted EOF (if written) will not be at the beginning of the tape. Our targets are virtual tape drives; the state is presumably stored entirely in memory and certainly not in hardware that doesn't magically rewind when the process crashes.

Item (3) is also essential, because the bug is that the st driver is ignoring the UNIT ATTENTION condition entirely, and therefore does not realize that its model of the state of the tape drive may no longer be correct. If, after a crash or similar situation, the target does _not_ respond to the next command with a UNIT ATTENTION condition, then there is no way for the initiator to know anything is wrong. That would be a (probable) bug in the target, but it's not what we're seeing.

Our test rig involved an AWS virtual tape library using their Storage Gateway product. The library is an on-premises VM which communicates with the AWS-hosted storage from behind a firewall, via an HTTP proxy we control. Testing consisted of running a copy of our software against one of those drives, with the proxy configured to inject failures into the VTL appliance's connections to AWS (_not_ into the iSCSI connection). This eventually causes the iSCSI target process to fail in the manner described in the original ticket, thereby reproducing the problem.


I did try restarting the target to see if that made a difference but I don't have one of these Amazon VTL's.

I updated the customer as follows:

Hello Jeff

Many thanks for the details.
I did restart the iscsi target service as well to try reproduce, i.e. not interrupt the network.
I think its the fact that have a physical-backed tape drive perhaps that is preventing a reproducer.
I do have a way to use mhvtl but I suppose what we really need is to get access to AWS and a virt guest with the AWS type tape devices.

I will keep you updated, but I expect engineering will deal with this via code review rather than a reproducer and then provide a test you can run yourself.

Sincerely
Laurence Oberman

My update to them was in response to their detailed update. I will attach the two c code files they attached to the case for review.

Comment 28 John Pittman 2023-03-14 23:04:30 UTC
Chris, can we at least get an explanation on the close here?  The customer put in a lot of effort on this and deserves an explanation.

Comment 32 loberman 2023-07-26 14:36:34 UTC
Hello John

I have this setup 
Client (iscsi initiator Virtual machine) is accessing a tape on the ISCSI server

Linux rhel9.1 5.14.0-284.18.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May 31 10:39:18 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@rhel9 ~]# lsscsi -g
[33:0:0:0]   tape    HP       Ultrium 5-SCSI   Z64D  /dev/st0   /dev/sg1 

The tape is connected via SAS to a ML150G9 and shared with ISCSI over 1Gbit Ethernet

root@dl385 ~]# targetcli ls
o- / ..................................................................... [...]
  o- backstores .......................................................... [...]
  | o- block .............................................. [Storage Objects: 0]
  | o- fileio ............................................. [Storage Objects: 0]
  | o- pscsi .............................................. [Storage Objects: 1]
  | | o- tape0 ............................................ [/dev/st0 activated]
  | |   o- alua ............................................... [ALUA Groups: 0]
  | o- ramdisk ............................................ [Storage Objects: 0]
  o- iscsi ........................................................ [Targets: 1]
  | o- iqn.2003-01.org.linux-iscsi.dl385.x8664:sn.f4fb265e42ba ....... [TPGs: 1]
  |   o- tpg1 ........................................... [no-gen-acls, no-auth]
  |     o- acls ...................................................... [ACLs: 1]
  |     | o- iqn.1994-05.com.redhat:9aab1462433f .............. [Mapped LUNs: 1]
  |     |   o- mapped_lun0 ............................. [lun0 pscsi/tape0 (rw)]
  |     o- luns ...................................................... [LUNs: 1]
  |     | o- lun0 .............................. [pscsi/tape0 (/dev/st0) (None)]
  |     o- portals ................................................ [Portals: 1]
  |       o- 0.0.0.0:3260 ................................................. [OK]
  o- loopback ..................................................... [Targets: 0]
  o- srpt ......................................................... [Targets: 0]

It is here at my house so what would you like to capture etc.

We can install an upstream kernel if needed.

Regards
Laurence

Comment 33 loberman 2023-07-27 19:32:10 UTC
Rest by other initiator

Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Number of r/w requests 35913, dio used in 35913, pages 107739.
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write filemark.
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress



Tape powered off here

Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready [current] 
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Logical unit communication failure
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Number of r/w requests 27982, dio used in 27982, pages 83946.
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready [current] 
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Logical unit communication failure
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write filemark.
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready [current]

Comment 34 loberman 2023-07-27 19:39:31 UTC
Next steps
capture a wireshark trace on the initiator where it fails and match the trace to the logging.

Comment 35 loberman 2023-07-27 20:15:21 UTC
wireshark/tcpdump of this sequence

Jul 27 15:58:51 iscsi2rhel9 kernel: device ens160 entered promiscuous mode
Jul 27 15:59:12 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Block limits 1 - 16777215 bytes.
Jul 27 15:59:12 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Mode sense. Length 11, medium 0, WBS 10, BLL 8
Jul 27 15:59:12 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Density 58, tape length: 0, drv buffer: 1
Jul 27 15:59:12 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Block size: 0, buffer size: 4096 (1 blocks).
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Number of r/w requests 6045, dio used in 6045, pages 18135.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write filemark.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:59:31 iscsi2rhel9 kernel: device ens160 left promiscuous mode


Opcode 0x10 is Synchronize buffer in below trace

See attachments
screenshots.tar.xz has the screenshots from WireShark taken on the initiator doing the writes

I will add a people page link for the pcap file ens160.pcap
https://people.redhat.com/loberman/customer/.1930388/

Comment 36 loberman 2023-07-27 20:24:15 UTC
Second attempt no wireshark trace
Resetting device using sg_reset from other initiator

Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Number of r/w requests 104974, dio used in 104974, pages 314922.
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write filemark.
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 16:22:48 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress

Comment 37 loberman 2023-07-27 21:13:55 UTC
Repeating the original description to tie it in with what we are seeing


Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Number of r/w requests 6045, dio used in 6045, pages 18135.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write filemark.                               ***** Failed here
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 15:59:20 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress


31890 15:59:20.574462 192.168.1.146 → 192.168.1.143 iSCSI 10306 SCSI: Write(6) LUN: 0x00 (LBA: 0x000028, Len: 0)SCSI: Data Out LUN: 0x00 (Write(6) Request Data) 
31893 15:59:20.575448 192.168.1.143 → 192.168.1.146 iSCSI 114 SCSI: Response LUN: 0x00 (Write(6)) (Good)
31895 15:59:20.575711 192.168.1.146 → 192.168.1.143 iSCSI 10306 SCSI: Write(6) LUN: 0x00 (LBA: 0x000028, Len: 0)SCSI: Data Out LUN: 0x00 (Write(6) Request Data) 
31898 15:59:20.576752 192.168.1.143 → 192.168.1.146 iSCSI 114 SCSI: Response LUN: 0x00 (Write(6)) (Good)
31900 15:59:20.577061 192.168.1.146 → 192.168.1.143 iSCSI 10306 SCSI: Write(6) LUN: 0x00 (LBA: 0x000028, Len: 0)SCSI: Data Out LUN: 0x00 (Write(6) Request Data) 

reset from other initiator here

31903 15:59:20.579732 192.168.1.143 → 192.168.1.146 iSCSI 214 SCSI Response (Check Condition) LUN:0x00 
31904 15:59:20.579947 192.168.1.146 → 192.168.1.143 iSCSI 114 SCSI Command: 0x10 LUN:0x00                       ***  10                 SYNCHRONIZE BUFFER 
31907 15:59:20.581635 192.168.1.143 → 192.168.1.146 iSCSI 214 SCSI Response (Check Condition) LUN:0x00 
31908 15:59:20.581734 192.168.1.146 → 192.168.1.143 iSCSI 114 SCSI Command: 0x01 LUN:0x00                       *** 01    M            REWIND
31909 15:59:20.583450 192.168.1.143 → 192.168.1.146 iSCSI 214 SCSI Response (Check Condition) LUN:0x00 
31926 15:59:25.639608 192.168.1.146 → 192.168.1.143 iSCSI 114 NOP Out
31927 15:59:25.641052 192.168.1.143 → 192.168.1.146 iSCSI 114 NOP In

I dont see a Unit attention in the trace, just the check condition, going to put SCSI debug on and test again tomorrow


Per AWS support, when the gateway crashes in this manner, its notion of the current tape position is reset to the beginning of the tape. It also sets a unit attention condition, such that the next request results in a CHECK CONDITION status with sense key UNIT ATTENTION and asc/ascq indicating a device reset. According to their logs and our own, separate analysis, the next command being sent is WRITE FILEMARK, which results in writing an FM at the beginning of the tape, effectively discarding its contents.

In fact, once the write fails with EIO, our software attempts to recover by rewinding and repositioning the tape, then resuming operation. If this fails, it attempts to rewind and reposition again, write a marker at the end of the tape, and then unmount. It does not under any circumstances write either data or filemarks without having successfully positioned the tape to a known point.

What actually happens is that, since the last operation was a write, the kernel executes an implied MTWEOF operation (which translate to a Write Filemarks command) before the rewind that was actually requested. This seems not entirely unreasonable, provided the tape position is known.  However, once this request fails (due to the unit attention condition), our next rewind attempt also triggers an implied MTWEOF, which does _not_ fail (the unit attention condition persists only until the initiator has been notified); this is the command that unexpectedly erases the tape.

Our analysis is that the st driver is in fact completely ignoring the UNIT ATTENTION and associated reset notification from the device. This is not a condition that can be detected in the transport or mid-layer, as it occurs entirely within the target and is reported only via the UNIT ATTENTION sense key. The upper driver (i.e. st) needs to detect this indication and reset its internal model of the device to an unknown state.

Comment 38 loberman 2023-07-27 21:21:57 UTC
Typical tape stuff that should apply to all tapes

Table 1 gives the ASC and ASCQ summary for Sense Key 6 (Unit Attention) in the 3584 Tape Library.

SENSE key 6 is UA

Table 1. ASC and ASCQ Summary for Sense Key 6 (Unit Attention)
ASC ASCQ
Description
28 00	Not Ready to Ready Transition, Medium May Have Changed
28 01	Import or Export Element Accessed
29 00	Power On, Reset, or Bus Device Reset Occurred
2A 01	Mode Parameters Changed
3F 01	Microcode Has Been Changed

Comment 39 loberman 2023-07-27 21:48:34 UTC
Added scsi_logging

Did not do any wireshark trace as its not showing anything extra
I am thinking we have not reproduced enough of this to make sense of it.

I mean we send a reset and there is data at the BOT until th reset happened
so we are not seeing a write-file-mark at BOT like The customer sees

-rwxr-xr-x root/root     30472 2021-08-10 00:27 usr/bin/iptc
-rwxr-xr-x root/root     16040 2021-08-10 19:41 usr/bin/dltest
-rwxr-xr-x root/root    199672 2021-08-10 12:47 usr/bin/patch
-rwxr-xr-x root/root     40952 2021-08-10 19:41 usr/bin/isql
-rwxr-xr-x root/root    132976 2021-10-18 09:56 usr/bin/pigz
-rwxr-xr-x root/root   9926208 2022-10-13 09:46 usr/bin/runc
tar: /dev/st0: Cannot read: Input/output error
tar: /dev/st0: Cannot read: Input/output error
tar: /dev/st0: Cannot read: Input/output error
tar: /dev/st0: Cannot read: Input/output error
tar: /dev/st0: Cannot read: Input/output error
tar: /dev/st0: Cannot read: Input/output error


Where is Sense Key 6

Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#111 CDB: Write(6) 0a 00 00 28 00 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#111 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#111 CDB: Write(6) 0a 00 00 28 00 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#111 scsi host busy 1 failed 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: Notifying upper driver of completion (result 0)
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#111 20 sectors total, 10240 bytes done.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 Send: scmd 0x00000000f79ed774
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 CDB: Write(6) 0a 00 00 28 00 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 CDB: Write(6) 0a 00 00 28 00 00

Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 Add. Sense: Rewind operation in progress
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 scsi host busy 1 failed 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: Notifying upper driver of completion (result 2)
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#96 20 sectors total, 10240 bytes done.

Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write:
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Number of r/w requests 2478, dio used in 2478, pages 7434.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Async write waits 0, finished 0.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 Send: scmd 0x000000000cfe376e
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 CDB: Write Filemarks 10 00 00 00 01 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 CDB: Write Filemarks 10 00 00 00 01 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 Add. Sense: Rewind operation in progress
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 scsi host busy 1 failed 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: Notifying upper driver of completion (result 2)
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#97 0 sectors total, 0 bytes done.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error on write filemark.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s) written
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Rewinding tape.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 Send: scmd 0x0000000029d3c08e
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 CDB: Rezero Unit/Rewind 01 00 00 00 00 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 CDB: Rezero Unit/Rewind 01 00 00 00 00 00
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 Add. Sense: Rewind operation in progress
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 scsi host busy 1 failed 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: Notifying upper driver of completion (result 2)
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: tag#98 0 sectors total, 0 bytes done.
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Sense Key : No Sense [current] 
Jul 27 17:42:39 iscsi2rhel9 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind operation in progress

Comment 40 John Meneghini 2023-07-28 16:04:24 UTC
Thanks Laurence.  Here's what I think we've learned.

The tape drive we are using is not returning proper sense data.  It looks like it isn't returning any kind of Unit Attention at all.  However, what is is doing is: rewinding the tape.  This is good because that's exactly what the customer is seeing with the AWS gateway: after a reset or power up event: 1) the tape rewinds and 2) a Unit Attention 0x29 is returned.  I suspect that the AWS gateway has properly implemented Unit Attentions, according to the SCSI specification - while the tape drive is not.

Comments below.
 
(In reply to loberman from comment #33)
> Reset by other initiator
> 
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0
> 0

This is a write command... which is what we expect.

> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense
> [current] 
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind
> operation in progress


The write command failed and returned an additional sense code of 00h/1Ah - rewind in progress.  The problem is, the Tape drive did not fill in the Key Field in the sense data.  So the sense key is 0x0, No Sense.  This should have been a Unit Attention sense key 0x6, 29h/03h  BUS DEVICE RESET FUNCTION OCCURRED or something. I think that's a bug in the tape drive, but at least the tape is rewinding, so that's what we expect. 

> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write:
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Number of r/w requests
> 35913, dio used in 35913, pages 107739.
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Async write waits 0,
> finished 0.
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1
> 0

This is very important also.  The next command is a Write Filemark.  So it look like either the application, or the driver is attempting to write a file make immediately after the error.  We need to figure out if this Write Filemark is coming from the application or from the driver. This is what the customer is seeing too, a write filemark after a rewind.

> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense
> [current] 
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind
> operation in progress

The Write Filemark command fails, but that's only because the tape device is still rewinding.  The Tape returns the same error: No Sense with an ASC/ASCQ of 00h/1Ah - rewind in progress.  One difference between this tape and the AWS gateway is: rewinds are probably instantaneous with AWS. So I'll bet that if this were the AWS gateway, instead of this physical tape which is slowly rewinding, we would have seen the corrupted data. 
 
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write filemark.
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s)
> written

Next the driver sends a Rewind command - 0x1. Again we need understand if this is the driver, or the application that's doing this.

> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Rewinding tape.
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : No Sense
> [current] 
> Jul 27 15:18:29 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Rewind
> operation in progress

The Rewind command also fails with No Sense and ASC/ASCQ of 00h/1Ah.  So the driver is just plain clueless. It has no idea that the tape has been re-positioned already.
 
> 
> Tape powered off here

The power on condition is slightly different because the Tape drive is returning a different error.  But the st driver looks like it's behaving the same. 

> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: a 0 0 28 0
> 0

Here's the Write command, which we expect.

> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready
> [current] 
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Logical unit
> communication failure

The tape is returning a Not Ready Sense Key (0x2) with an asc/ascq of 08h/00h LOGICAL UNIT COMMUNICATION FAILURE.  This is still incorrect. This should have been a Unit Attention sense key 0x6, 29h/00h  POWER ON, RESET, OR BUS DEVICE RESET OCCURRED.

It's not reporting anything about a rewind or tape position error.  We should probably double check and see if the tape has physically rewound when you power cycled.  It should have. 

> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write:
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Number of r/w requests
> 27982, dio used in 27982, pages 83946.
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Async write waits 0,
> finished 0.

The driver or the application turns around and does the same thing: tries to write a file make. If this is really the driver, it should NOT be doing this.  Write Filemarks after an unknown error should not be done.
  
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 10 0 0 0 1
> 0
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready
> [current] 
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Add. Sense: Logical unit
> communication failure

Again, the tape is going slow, and returning the same Not Ready condition.  If this were the AWS gateway, or some other software emulated SCSI Tape target, this command would have succeeded. 

> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error on write filemark.
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Buffer flushed, 1 EOF(s)
> written

And the driver, or something, tries to Rewind the tape again.

> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Rewinding tape.
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Error: 2, cmd: 1 0 0 0 0 0
> Jul 27 15:21:48 rhel9.1 kernel: st 33:0:0:0: [st0] Sense Key : Not Ready
> [current]

I think a wireshark trace would be helpful, but the information we have so far has given me a lot to work with.

Thanks

Comment 41 loberman 2023-07-28 16:09:07 UTC
Excellent, Always happy to try help Engineering when possible

I have multiple wireshark traces from before, just not the last one,
Let me redo the test with all logging on and do a trace on initiator and target 
I will summarize it all and report back later

I do expect a UA from this tape device for a media error or when media is inserted though.
I will check that to make sure generic UA's are sent and maybe just not for the rewind.

Regards
Laurence

Comment 42 John Meneghini 2023-07-28 16:18:45 UTC
(In reply to loberman from comment #41)
> Excellent, Always happy to try help Engineering when possible
> 
> I have multiple wireshark traces from before, just not the last one,
> Let me redo the test with all logging on and do a trace on initiator and
> target. I will summarize it all and report back later

No need Laurence, if you can just get me the .pcap files I'd like to look at the trace myself.

> I do expect a UA from this tape device for a media error or when media is
> inserted though.

Yes, this tape drive is definitely NOT returning Unit Attentions when it should.  If you can verify that ejecting media returns a unit attention, that would be good.
 
> I will check that to make sure generic UA's are sent and maybe just not for
> the rewind.

Agreed. Thanks.

> Per AWS support, when the gateway crashes in this manner, its notion of the
> current tape position is reset to the beginning of the tape. It also sets a
> unit attention condition, such that the next request results in a CHECK
> CONDITION status with sense key UNIT ATTENTION and asc/ascq indicating a
> device reset. According to their logs and our own, separate analysis, the
> next command being sent is WRITE FILEMARK, which results in writing an FM at
> the beginning of the tape, effectively discarding its contents.

This is what we are seeing for ANY unknown error.  Not just a Unit Attention.  
This a problem with the st.c driver.  It should not be writing file marks after error.

I will look at this further and see if I can come up with a patch. I think the customers patch would help 
with AWS, but I think there is also a fundamental problem here in that the driver should not be writing filemarks 
following an error... any error.

Comment 43 loberman 2023-07-28 17:51:41 UTC
Hi John

See comment https://bugzilla.redhat.com/show_bug.cgi?id=1930388#c35
I had already provided pcap and other notes

Set needinfo so you see this