Bug 1665903

Summary: hw/scsi/scsi-bus.c:1374: scsi_req_complete: Assertion `req->status == -1' failed.
Product: Red Hat Enterprise Linux 8 Reporter: Xueqiang Wei <xuwei>
Component: qemu-kvmAssignee: Philippe Mathieu-Daudé <philmd>
Status: CLOSED CURRENTRELEASE QA Contact: qing.wang <qinwang>
Severity: high Docs Contact:
Priority: high    
Version: 8.1CC: chayang, coli, juzhang, knoel, mtessun, ngu, pbonzini, philmd, qzhang, rbalakri, rjones, virt-maint, xuwei
Target Milestone: rc   
Target Release: 8.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1650975 Environment:
Last Closed: 2020-01-30 16:25:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1650975    
Bug Blocks: 910269, 1755139    

Description Xueqiang Wei 2019-01-14 12:10:53 UTC
+++ This bug was initially created as a clone of Bug #1650975 +++

Description of problem:

qemu in Rawhide fails when I test injecting EIO errors into requests
using nbdkit:

nbdkit: memory[1]: debug: error: pread count=1024 offset=102400 flags=0x0
nbdkit: memory[1]: error: injecting EIO error into pread
nbdkit: memory[1]: debug: sending error reply: Input/output error
qemu-system-x86_64: /builddir/build/BUILD/qemu-3.1.0-rc1/hw/scsi/scsi-bus.c:1374: scsi_req_complete: Assertion `req->status == -1' failed.

Version-Release number of selected component (if applicable):

qemu 2:3.1.0-0.1.rc1.fc30

How reproducible:

Unknown.

Steps to Reproduce:
1. Unknown, I'll try to come up with a reproducer if I can make one work locally.

--- Additional comment from Richard W.M. Jones on 2018-11-18 11:37 UTC ---

build.log from Koji showing the failure

--- Additional comment from Richard W.M. Jones on 2018-11-18 11:53:50 UTC ---

Actually yes this is easily reproducible with qemu from git.

(1) nbdkit -f -v --filter=error memory size=64M error-rate=100%

(2) x86_64-softmmu/qemu-system-x86_64 -device virtio-scsi,id=scsi -drive file=nbd:localhost:10809,format=raw,id=hd0,if=none -device scsi-hd,drive=hd0

qemu-system-x86_64: hw/scsi/scsi-bus.c:1374: scsi_req_complete: Assertion `req->status == -1' failed.
Aborted (core dumped)

Stack trace:

(gdb) bt
#0  0x00007f7f18d4253f in raise () at /lib64/libc.so.6
#1  0x00007f7f18d2c895 in abort () at /lib64/libc.so.6
#2  0x00007f7f18d2c769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007f7f18d3a9f6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x000055ce0f920fb0 in scsi_req_complete (req=<optimized out>, status=<optimized out>) at hw/scsi/scsi-bus.c:1374
#5  0x000055ce0f91b850 in scsi_dma_complete_noio (r=0x55ce116ea090, ret=<optimized out>) at hw/scsi/scsi-disk.c:281
#6  0x000055ce0f91b8ff in scsi_dma_complete (opaque=0x55ce116ea090, ret=-5)
    at hw/scsi/scsi-disk.c:302
#7  0x000055ce0f8103c7 in dma_complete (ret=-5, dbs=0x55ce11d36c00)
    at dma-helpers.c:116
#8  0x000055ce0f8103c7 in dma_blk_cb (opaque=0x55ce11d36c00, ret=-5)
    at dma-helpers.c:138
#9  0x000055ce0fa42cce in blk_aio_complete (acb=0x55ce10d36300)
    at block/block-backend.c:1345
#10 0x000055ce0fafce6b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116
#11 0x00007f7f18d58200 in __start_context () at /lib64/libc.so.6
#12 0x00007fff50b87130 in  ()
#13 0x0000000000000000 in  ()

--- Additional comment from Richard W.M. Jones on 2018-11-18 15:11:57 UTC ---

40dce4ee61c68395f6d463fae792f61b7c003bce is the first bad commit
commit 40dce4ee61c68395f6d463fae792f61b7c003bce
Author: Paolo Bonzini <pbonzini>
Date:   Sat Oct 13 11:52:34 2018 +0200

    scsi-disk: fix rerror/werror=ignore
    
    rerror=ignore was returning true from scsi_handle_rw_error but the callers were not
    calling scsi_req_complete when rerror=ignore returns true (this is the correct thing
    to do when true is returned after executing a passthrough command).  Fix this by
    calling it in scsi_handle_rw_error.
    
    Signed-off-by: Paolo Bonzini <pbonzini>

:040000 040000 311386b9b91d77840a849459ab6ae41a37fd7f42 8adcda67d7487bcc18966f096c9923da3b8dc0b9 M	hw

--- Additional comment from Cole Robinson on 2018-11-19 15:00:22 UTC ---

Rich reported this upstream: https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg03508.html

--- Additional comment from Richard W.M. Jones on 2018-11-20 22:16:43 UTC ---

Reported upstream:
https://bugs.launchpad.net/qemu/+bug/1804323

Comment 1 Xueqiang Wei 2019-01-14 12:34:03 UTC
Versions:

kernel-4.18.0-58.el8.x86_64
qemu-kvm-2.12.0-57.module+el8+2683+02b3b955




1. Configure the multipath device on host 

# multipath -ll
mpatha (36001405c6eefcbb279645ef9d02f3f91) dm-3 LIO-ORG,stor0
size=40G features='0' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 15:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=50 status=enabled
  `- 16:0:0:0 sdc 8:32 active ready running

2. passthrough it withount "werror=stop,rerror=stop"

    -device virtio-scsi-pci,id=scsi1,bus=pci.0 \
    -drive file=/dev/sdb,format=raw,if=none,id=drive-scsi0-0-0-0,cache=none,aio=threads \
    -device scsi-block,bus=scsi1.0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \

3. create partition on passthrough disk


after step 3, qemu core dump



if add "werror=stop,rerror=stop" to cmd lines, qemu does not core dump but failed to create partition. It will prompt "the requested resource is in use" when create partition.


This issue block bug [1], so set status to high.

[1] Bug 1633179 - Hit io-error when run validate SCSI-3 Persistent Reservation test on win2012 guest

Comment 3 Ademar Reis 2019-01-15 20:53:34 UTC
Rich Jones reported it as a regression from 40dce4ee61[1] in Bug #1650975, but then Xueqiang Wei reproduced it with QEMU-2.12 in RHEL8. We probably have different issues here.

Xueqiang Wei: why do you think your problem (comment #1) is the same as the one reported by Rich?

The commit in question is not in RHEL-8, it's in QEMU-3.1 (RHEL-8 virt fast train).

1:
commit 40dce4ee61c68395f6d463fae792f61b7c003bce
Author: Paolo Bonzini <pbonzini>
Date:   Sat Oct 13 11:52:34 2018 +0200

    scsi-disk: fix rerror/werror=ignore
    
    rerror=ignore was returning true from scsi_handle_rw_error but the callers were not
    calling scsi_req_complete when rerror=ignore returns true (this is the correct thing
    to do when true is returned after executing a passthrough command).  Fix this by
    calling it in scsi_handle_rw_error.
    
    Signed-off-by: Paolo Bonzini <pbonzini>

Comment 4 Xueqiang Wei 2019-01-16 02:59:48 UTC
(In reply to Ademar Reis from comment #3)
> Rich Jones reported it as a regression from 40dce4ee61[1] in Bug #1650975,
> but then Xueqiang Wei reproduced it with QEMU-2.12 in RHEL8. We probably
> have different issues here.
> 
> Xueqiang Wei: why do you think your problem (comment #1) is the same as the
> one reported by Rich?
> 
> The commit in question is not in RHEL-8, it's in QEMU-3.1 (RHEL-8 virt fast
> train).
> 
> 1:
> commit 40dce4ee61c68395f6d463fae792f61b7c003bce
> Author: Paolo Bonzini <pbonzini>
> Date:   Sat Oct 13 11:52:34 2018 +0200
> 
>     scsi-disk: fix rerror/werror=ignore
>     
>     rerror=ignore was returning true from scsi_handle_rw_error but the
> callers were not
>     calling scsi_req_complete when rerror=ignore returns true (this is the
> correct thing
>     to do when true is returned after executing a passthrough command).  Fix
> this by
>     calling it in scsi_handle_rw_error.
>     
>     Signed-off-by: Paolo Bonzini <pbonzini>


The output and core dump file log are same.

QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: hw/scsi/scsi-bus.c:1374: scsi_req_complete: Assertion `req->status == -1' failed.


(gdb) bt
#0  0x00007f3e4bcf193f in raise () at /lib64/libc.so.6
#1  0x00007f3e4bcdbc95 in abort () at /lib64/libc.so.6
#2  0x00007f3e4bcdbb69 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007f3e4bce9df6 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x000055a83bea7040 in scsi_req_complete
    (req=<optimized out>, status=<optimized out>) at hw/scsi/scsi-bus.c:1374
#5  0x000055a83bea138a in scsi_handle_rw_error
    (acct_failed=false, error=0, r=0x55a83e852880) at hw/scsi/scsi-disk.c:471
#6  0x000055a83bea138a in scsi_disk_req_check_error
    (r=0x55a83e852880, ret=<optimized out>, acct_failed=<optimized out>)
    at hw/scsi/scsi-disk.c:196
#7  0x000055a83bea2225 in scsi_dma_complete_noio
    (r=0x55a83e852880, ret=<optimized out>) at hw/scsi/scsi-disk.c:269
#8  0x000055a83bea2303 in scsi_dma_complete (opaque=0x55a83e852880, ret=0)
    at hw/scsi/scsi-disk.c:300
#9  0x000055a83bdea4d2 in dma_complete (ret=0, dbs=0x7f3e38091ee0)
    at dma-helpers.c:116
#10 0x000055a83bdea4d2 in dma_blk_cb (opaque=0x7f3e38091ee0, ret=0)
    at dma-helpers.c:138
#11 0x000055a83bf69a9e in blk_aio_complete (acb=0x7f3e30000f20)
    at block/block-backend.c:1346
#12 0x000055a83c0080f3 in coroutine_trampoline
    (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116
--Type <RET> for more, q to quit, c to continue without paging--
#13 0x00007f3e4bd07600 in __start_context () at /lib64/libc.so.6
#14 0x00007f3e22ffbc10 in  ()
#15 0x0000000000000000 in  ()

Comment 6 Paolo Bonzini 2019-02-05 12:12:39 UTC
There is no fix yet.  We can delay it.

Comment 8 Paolo Bonzini 2019-09-20 12:51:53 UTC
The bug has been fixed for a long time in RHEL AV, so moving the bug for simplicity to RHEL qemu-kvm.

It's fixed by upstream commit 1c7f618f689b0b5b6bbed23a7c159e7dad7b996f.

Comment 11 Philippe Mathieu-Daudé 2020-01-30 11:11:08 UTC
Reassigning (with ack from Paolo).

Comment 13 Richard W.M. Jones 2020-01-30 17:08:04 UTC
Fixed in current release, right?