Description of problem:
This is a RHEL 7.7 version of the RHEL6.10 bz1749512 bug.
This is the same customer with an identical delay.
scsi_cmnd's stalled for up to 543 seconds or just over 9 minutes.
crash> scsishow --time
..
scsi_cmnd ffff9913422bfb80 on scsi_device 0xffff98de12dfa000 (3:0:0:14) is unknown, deadline: -440391 cmnd-alloc: -543113 rq-alloc: -543113
Also the first hung_task was reported 418 seconds before the panic. AS this timeout was 120 seconds 418+120 = 538 or just over 9 minutes.
[607630.264537] INFO: task in:imjournal:5089 blocked for more than 120 seconds.
..
.. 418 seconds later
..
[608048.236401] SysRq : Trigger a crash
From scsishow we see the target3:0:0 was blocked.
-------------------------------------------------------------------------------
TARGET DEVICE scsi_target CHANNEL ID TARGET STATUS TARGET_BUSY TARGET_BLOCKED
target3:0:0 ffff98fc61263c00 0 0 STARGET_RUNNING 0 3 <====
Using the oldest request check the state of the related storage structures and confirmed this is a result of the blocked target. :
scsishow -r
..
ffff98fcb8ed7a80 (3:0:0:4) start_time: 4902223086 special: 0xffff9913422bd180
..
crash> struct request.q ffff98fcb8ed7a80
q = 0xffff98fc7170b990
crash> struct -x request_queue.queue_flags,queuedata 0xffff98fc7170b990
queue_flags = 0x42112a00
queuedata = 0xffff98fcad735800
crash> struct scsi_device.sdev_target 0xffff98fcad735800
sdev_target = 0xffff98fc61263c00
crash> struct scsi_target.target_blocked 0xffff98fc61263c00
target_blocked = {
counter = 3 <<<---
}
crash>
The fc_rport struct for the port was currently marked as online and not blocked:
crash> struct fc_rport.port_state ffff98fcadf1b000
port_state = FC_PORTSTATE_ONLINE
A scsi_target's target_blocked state gets set if an HBA driver returns the SCSI_MLQUEUE_TARGET_BUSY error code. The HBA driver was bnx2fc, which has 2 conditions it will return this code. If its session is failed for some reason or if the storage had returned a retry delay timestamp. Checking bnx2fc state:
crash> p ((struct fc_rport *)0xffff98fcadf1b000)->dd_data
$1 = (void *) 0xffff98fcadf1b468
crash> p &((struct fc_rport_libfc_priv *)0xffff98fcadf1b468)[1]
$2 = (struct fc_rport_libfc_priv *) 0xffff98fcadf1b480
crash> px ((struct bnx2fc_rport *)0xffff98fcadf1b480)->flags
$3 = 0x426
crash>
The BNX2FC_FLAG_SESSION_READY would be value 0x2, which is set in the flags field. So the busy error was not from a failed session state. Checking for the retry delay timestamp we find a non-zero timestamp value:
crash> p ((struct bnx2fc_rport *)0xffff98fcadf1b480)->retry_delay_timestamp
$4 = 4903861586
crash> p jiffies
jiffies = $5 = 4902766219
crash> eval 4903861586 - 4902766219
hexadecimal: 10b6c7
decimal: 1095367
octal: 4133307
binary: 0000000000000000000000000000000000000000000100001011011011000111
crash>
So there were still 1095 seconds (over 18 minutes) until this retry delay would be over. With the oldest commands having been waiting roughly 540 seconds, this would imply the hardware returned the maximum delay value of 0x3fff, which would be a delay of 27 minutes.
So the I/O is being delayed because the storage device requested a delay. A command was sent to the storage on this FC rport, and the storage returned a code and retry delay timestamp telling the system to delay before resending the I/O to try again. But the storage returned a huge delay value, and so the I/O to this FC port was being delayed a long time as requested.
Comment 17guazhang@redhat.com
2020-01-06 04:35:48 UTC
Hello
I did code review on fixed package and run the regression testing, After check, all modified can be found in source and regression has passed.
so move to verified.
thanks
guazhang
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:1016