Bug 1990835 - Src qemu coredump after multiple times of storage vm migration operations (mirror->migrate->kill dst vm->cancel migrate)
Summary: Src qemu coredump after multiple times of storage vm migration operations (mir...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Hanna Czenczek
QA Contact: aihua liang
URL:
Whiteboard:
Depends On: 2033626
Blocks:
 
Reported: 2021-08-06 11:43 UTC by aihua liang
Modified: 2023-01-10 20:42 UTC (History)
10 users (show)

Fixed In Version: qemu-kvm-6.2.0-9.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-10 20:38:38 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-92652 0 None None None 2021-08-06 11:46:53 UTC

Description aihua liang 2021-08-06 11:43:39 UTC
Description of problem:
 Src qemu coredump after multiple times of storage vm migration operations (mirror->migrate->kill dst vm->cancel migrate)

Version-Release number of selected component (if applicable):
 src:
   kernel version:4.18.0-325.el8.x86_64
   qemu-kvm version:qemu-kvm-6.0.0-26.module+el8.5.0+12044+525f0ebc
 
 dst:
   kernel version:4.18.0-315.el8.x86_64
   qemu-kvm version:qemu-kvm-6.0.0-21.module+el8.5.0+11555+e0ab0d09

How reproducible:
>10%

Steps to Reproduce:
1.Start src guest with qemu cmd:
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3000,server=on,wait=off \

2. Create a mirror.qcow2 in dst, then start guest with it
    #qemu-img create -f qcow2 /home/mirror.qcow2 20G
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/mirror.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3001,server=on,wait=off \
    -incoming defer \

3. In dst: start nbd server, then expose the image
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.196.25","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}

 4. In src, add the dst's exposed image.
    {"execute":"blockdev-add","arguments":{"driver":"nbd","node-name":"mirror","server":{"type":"inet","host":"10.73.196.25","port":"3333"},"export":"drive_image1"}}

 5. In src, do mirror from src to dst.
     { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1628241373, "microseconds": 53592}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1628241373, "microseconds": 53669}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1628241417, "microseconds": 458870}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1628241417, "microseconds": 459001}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 21475164160, "offset": 21475164160, "speed": 0, "type": "mirror"}}

 6. In src, when the mirror job reaches "ready" status, set migration capabilities
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}

 7. In dst, set migration capabilities and set up the migration
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}
{"execute": "migrate-incoming","arguments": {"uri": "tcp:[::]:5000"}}
{"timestamp": {"seconds": 1628245994, "microseconds": 902021}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}

 8. In src, migrate from src to dst.
     {"execute": "migrate","arguments":{"uri": "tcp:10.73.196.25:5000"}}
{"timestamp": {"seconds": 1628241463, "microseconds": 89698}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}
{"timestamp": {"seconds": 1628241463, "microseconds": 92459}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
{"timestamp": {"seconds": 1628241463, "microseconds": 92535}, "event": "MIGRATION", "data": {"status": "active"}}
{"timestamp": {"seconds": 1628241478, "microseconds": 759562}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
{"timestamp": {"seconds": 1628241479, "microseconds": 210013}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
{"timestamp": {"seconds": 1628241479, "microseconds": 214126}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1628241479, "microseconds": 214214}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1628241479, "microseconds": 234621}, "event": "STOP"}
{"timestamp": {"seconds": 1628241479, "microseconds": 234649}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1628241479, "microseconds": 234688}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1628241479, "microseconds": 236263}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}

 9. When migration reaches "pre-switchover" status, kill the dst vm
    #kill -9 2068205($dst_vm_qemu_pid)

 10. In src, cancel migrate.
    {"execute":"migrate_cancel"}
{"timestamp": {"seconds": 1628241509, "microseconds": 761946}, "event": "MIGRATION", "data": {"status": "cancelling"}}
{"return": {}}
{"timestamp": {"seconds": 1628241509, "microseconds": 762033}, "event": "RESUME"}
{"timestamp": {"seconds": 1628241509, "microseconds": 762795}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1628241509, "microseconds": 762898}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1628241509, "microseconds": 785977}, "event": "MIGRATION", "data": {"status": "cancelled"}}
{"timestamp": {"seconds": 1628241524, "microseconds": 104565}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1628241524, "microseconds": 104858}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1628241524, "microseconds": 105054}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21477392384, "offset": 21477326848, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1628241524, "microseconds": 105094}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1628241524, "microseconds": 105118}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

 11. Start dst vm again; then in dst, start the nbd server and expose the image again.
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.196.25","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}

 12. In src, do block mirror from src to dst.
     { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }

 13. If step 12 executes successfully, repeat steps 6 through 12.
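
 (The repetition in step 13 can be automated; below is a minimal sketch only, not the harness actually used for this report. It drives the TCP QMP sockets opened above with "-qmp tcp:0:3000" / "-qmp tcp:0:3001", assumes nc(1) and timeout(1) are available, and the qmp() helper is hypothetical: it just performs the QMP handshake and sends one command.)
    # Hypothetical helper: handshake + one QMP command over a TCP QMP socket.
    qmp() {   # usage: qmp <host> <port> '<json command>'
        printf '{"execute":"qmp_capabilities"}\n%s\n' "$3" | timeout 3 nc "$1" "$2"
    }
    # Example: restart the mirror on the src VM (step 12) from the src host.
    qmp 127.0.0.1 3000 '{"execute":"blockdev-mirror","arguments":{"device":"drive_image1","target":"mirror","sync":"full","job-id":"j1"}}'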

Actual results:
 After roughly 10 iterations, src qemu coredumps after the block mirror fails with the following error info:
  { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1628246397, "microseconds": 895777}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 895835}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897368}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897614}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897666}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897772}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897938}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21474836480, "offset": 0, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 897977}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1628246397, "microseconds": 898001}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

 Src qemu coredump info:
  (qemu) qemu-kvm: ../io/channel.c:484: qio_channel_restart_read: Assertion `qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)' failed.

 Gdb info:
  (gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {18446744067266837079, 140492272740560, 93911023032132, 140492592137383, 4222451712, 140492272740560, 140492272740661, 140492272740560, 140492272740560, 140492272740706, 140492272740860, 140492272740560, 
            140492272740860, 0, 0, 0}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007fc6fb06cdb5 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x7fc6e8028ea0, sa_sigaction = 0x7fc6e8028ea0}, sa_mask = {__val = {0, 140492272568564, 6, 0, 0, 0, 140492506346888, 21474836480, 140492506346736, 140492593271472, 140492593256088, 0, 
              8630141220583323648, 140492593241005, 0, 140492593256088}}, sa_flags = 1525398025, sa_restorer = 0x55695aebb908}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fc6fb06cc89 in __assert_fail_base (fmt=0x7fc6fb1d5698 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55695aebb908 "qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)", 
    file=0x55695aebba09 "../io/channel.c", line=484, function=<optimized out>) at assert.c:92
        str = 0x7fc6e8028ea0 "\300~\002\350\306\177"
        total = 4096
#3  0x00007fc6fb07aa76 in __GI___assert_fail (assertion=assertion@entry=0x55695aebb908 "qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)", file=file@entry=0x55695aebba09 "../io/channel.c", line=line@entry=484, 
    function=function@entry=0x55695aebba80 <__PRETTY_FUNCTION__.18557> "qio_channel_restart_read") at assert.c:101
No locals.
#4  0x000055695ad01530 in qio_channel_restart_read (opaque=<optimized out>) at ../io/channel.c:483
        ioc = <optimized out>
        co = <optimized out>
        ioc = <optimized out>
        co = <optimized out>
        __PRETTY_FUNCTION__ = "qio_channel_restart_read"
#5  qio_channel_restart_read (opaque=<optimized out>) at ../io/channel.c:477
        ioc = <optimized out>
        co = 0x55695d2a4400
        __PRETTY_FUNCTION__ = "qio_channel_restart_read"
#6  0x000055695ada6bc9 in aio_dispatch_handler (ctx=ctx@entry=0x55695d2b16a0, node=0x55695d6cfb20) at ../util/aio-posix.c:329
        progress = false
        revents = <optimized out>
#7  0x000055695ada764c in aio_dispatch_ready_handlers (ready_list=<optimized out>, ctx=<optimized out>) at ../util/aio-posix.c:359
        progress = false
        node = <optimized out>
        progress = <optimized out>
        node = <optimized out>
#8  aio_poll (ctx=0x55695d2b16a0, blocking=blocking@entry=true) at ../util/aio-posix.c:662
        ready_list = {lh_first = 0x0}
        ret = <optimized out>
        progress = false
        use_notify_me = <optimized out>
        timeout = <optimized out>
        start = <optimized out>
        __PRETTY_FUNCTION__ = "aio_poll"
#9  0x000055695ac75fba in iothread_run (opaque=0x55695d272320) at ../iothread.c:73
        iothread = 0x55695d272320
#10 0x000055695ad7c484 in qemu_thread_start (args=0x55695d2b1be0) at ../util/qemu-thread-posix.c:521
        __clframe = {__cancel_routine = <optimized out>, __cancel_arg = 0x0, __do_it = 1, __cancel_type = <optimized out>}
        qemu_thread_args = 0x55695d2b1be0
        start_routine = 0x55695ac75f70 <iothread_run>
        arg = 0x55695d272320
        r = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--
#11 0x00007fc6fb41814a in start_thread (arg=<optimized out>) at pthread_create.c:479
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140492506351360, -5495574922517979624, 140733520111390, 140733520111391, 140733520111536, 140492506347520, 5491612426208370200, 5491644117275860504}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#12 0x00007fc6fb147dc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.


Expected results:


Additional info:

Comment 1 aihua liang 2021-08-06 11:51:57 UTC
qemu-kvm-6.0.0-27.module+el8.5.0+12121+c40c8708 also hit this issue.

Comment 2 aihua liang 2021-08-06 12:43:37 UTC
coredump file at:
 10.73.194.27:/vol/s2coredump/bug1990835/core.qemu-kvm.0.e28bd05e1db0409fb6f705424b08b5de.127796.1628242518000000.lz4
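
(A minimal sketch for inspecting that core, assuming the share has been mounted or the file copied locally and the matching qemu-kvm debuginfo packages are installed:)
    $ lz4 -d core.qemu-kvm.0.e28bd05e1db0409fb6f705424b08b5de.127796.1628242518000000.lz4
    $ gdb /usr/libexec/qemu-kvm \
          core.qemu-kvm.0.e28bd05e1db0409fb6f705424b08b5de.127796.1628242518000000 \
          -ex 'bt full' -ex 'info threads'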

Comment 4 Klaus Heinrich Kiwi 2021-08-11 12:27:27 UTC
Hanna,

 can you take a look? Perhaps it's something related to the cancel job work you've been doing upstream?

Thanks.

Comment 5 John Ferlan 2021-09-08 21:49:54 UTC
Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 6 Klaus Heinrich Kiwi 2022-01-27 17:31:48 UTC
@hreitz any updates?

Comment 7 Hanna Czenczek 2022-01-28 15:51:21 UTC
Oops, I completely let this fall through the cracks.

I could reproduce this even quicker than described in comment 0, namely in the first iteration, before the migration was even started.  So I think this actually has nothing to do with the migration, but rather with the NBD block device; in fact, I can reproduce the same assertion failure just by setting up a mirror job from a device in an I/O thread to an NBD block device.

I think the problem is in our NBD client block driver, in that it attaches the QIOChannel it uses for the network connection to its AioContext, but this is never changed when the NBD block device is moved to some other AioContext.  The mirror job needs both source and target to be in the same context, so it moves the target (the NBD device), because the source is fixed into its I/O thread by the attached guest device.  Thus, the QIOChannel is attached to a different AioContext (probably still the main context) than what the mirror runs in (the I/O thread), and so we get the assertion failure:

qemu-system-x86_64: ../io/channel.c:483: qio_channel_restart_read: Assertion `qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)' failed.

I.e. the context in which the QIOChannel’s handlers run (still qemu’s main context) is not the same as the coroutine that is supposed to be run by these handlers (the mirror code, which runs in the I/O thread).

I think we can solve this by changing the QIOChannel’s context when the NBD block device is moved to a different context; but besides this connection QIOChannel, there are also some timers that are attached to the block device’s AioContext, and perhaps we need to take care of that, too.  I’ll ask upstream.

Comment 8 Hanna Czenczek 2022-01-28 16:11:03 UTC
Hi Aliang,

Could you perhaps see whether the build in http://download.eng.bos.redhat.com/brewroot/work/tasks/5485/42665485/ fixes this bug?

Thanks!

Comment 9 Hanna Czenczek 2022-02-01 16:34:21 UTC
This has all become a bit more complicated.  Prior to 6.2, block/nbd.c did implement the AioContext attach/detach handlers, so the issue I’m seeing is probably different than what is reported in this BZ here (which was for 6.0).

So I need to have another try at reproducing with 6.0 (specifically qemu-kvm-6.0.0-33.module+el8.5.0).


But this also means that the problem I’m seeing is a regression in 6.2, and it probably affects all cases where an NBD block device is used in an I/O thread.  For example, just the following will already fail:

$ qemu-nbd --fork -k /tmp/nbd.sock -f raw null-co://

$ qemu-system-x86_64 \
    -object iothread,id=iothr0 \
    -device virtio-scsi,id=vscsi,iothread=iothr0 \
    -blockdev '{
        "driver": "nbd",
        "node-name": "nbd",
        "server": {
            "type": "unix",
            "path": "/tmp/nbd.sock"
        } }' \
    -device scsi-hd,bus=vscsi.0,drive=nbd
qemu-system-x86_64: ../qemu-6.2.0/io/channel.c:483: qio_channel_restart_read: Assertion `qemu_get_current_aio_context() == qemu_coroutine_get_aio_context(co)' failed.
qemu-nbd: Disconnect client, due to: Unable to read from socket: Connection reset by peer
Aborted (core dumped)


So that’s kind of bad.

Comment 10 Hanna Czenczek 2022-02-01 17:04:51 UTC
I can reproduce something on 6.0.0-33.module+el8.5.0, which looks different from what I was seeing before.  Specifically, what I was seeing was the mirror coroutine belonging to the I/O thread being started in the main context (because the network connection remained associated with the main context).

Now, I see that NBD’s connection_co (which was removed in commit 4ddb5d2fde6f22b2cf65f314107e890a7ca14fcf[1]) is attempted to be run in the I/O thread, but still is marked as belonging to the main context, i.e. exactly the other way around.  I believe the problem could be that we reassign connection_co to the new context in nbd_client_attach_aio_context_bh(), but that’s a BH that’s only scheduled by nbd_client_attach_aio_context(), but not immediately run by it.  So perhaps it’s possible for some I/O to sneak in after the BDS went to the new context and is undrained, and before that BH runs and reassigns connection_co to the new context.

The problem of course is that 4ddb5d2fde says that the whole connection_co code is very complicated, and thus rewrote it.  So it’s questionable whether there’s a simple fix for those pre-6.2 versions.

[1] That’s also the commit that removed the attach/detach handlers from block/nbd.c altogether, and so caused precisely the other problem I was seeing...

Comment 11 Hanna Czenczek 2022-02-01 17:48:21 UTC
This here seems to work as a reproducer for the pre-6.2 problem:

$ ./qemu-img create -f qcow2 src.qcow2 128M

$ ./qemu-io -c 'write 0 128M' src.qcow2

$ ./qemu-nbd --fork --pid-file=/tmp/nbd.pid --socket=/tmp/nbd.sock --image-opts driver=null-co,size=128M

$ (echo '{"execute": "qmp_capabilities"}';
sleep 1;
kill $(cat /tmp/nbd.pid);
echo '{"execute": "blockdev-mirror", "arguments": {
    "device": "src",
    "target": "nbd",
    "sync": "full",
    "job-id": "j1" } }';
./qemu-nbd --fork --pid-file=/tmp/nbd.pid --socket=/tmp/nbd.sock --image-opts driver=null-co,size=128M;
sleep 1;
echo '{"execute": "quit"}') \
| ./qemu-system-x86_64 \
    -qmp stdio \
    -blockdev '{
        "driver": "nbd",
        "node-name": "nbd",
        "server": {
            "type": "unix",
            "path": "/tmp/nbd.sock"
        } }' \
    -blockdev file,node-name=file,filename=src.qcow2 \
    -blockdev qcow2,node-name=src,file=file \
    -object iothread,id=iothr0 \
    -device virtio-blk,drive=src,iothread=iothr0

The errors this produces aren’t limited to just assertion failures, but you also get the odd memory corruption thrown in (i.e. glib’s malloc implementation complains about corrupted linked lists, or corrupted unsorted chunks), so this too is kind of really bad.  At least I don’t see this failing on 6.2.0.


(There’s also a reproducer without a mirror job, which goes like this:

./qemu-nbd --fork --pid-file=/tmp/nbd.pid --socket=/tmp/nbd.sock --image-opts driver=null-co,size=128M

(echo '{"execute": "qmp_capabilities"}';
sleep 1;
echo '{"execute": "x-blockdev-set-iothread", "arguments": {
    "node-name": "nbd",
    "iothread": "iothr0" } }';
kill $(cat /tmp/nbd.pid);
./qemu-nbd --fork --pid-file=/tmp/nbd.pid --socket=/tmp/nbd.sock --image-opts driver=null-co,size=128M;
echo '{"execute": "human-monitor-command", "arguments": {
    "command-line": "qemu-io nbd \"write 0 128M\"" } }';
sleep 1;
echo '{"execute": "quit"}') \
| ./qemu-system-x86_64 \
    -qmp stdio \
    -object iothread,id=iothr0 \
    -blockdev '{
        "driver": "nbd",
        "node-name": "nbd",
        "server": {
            "type": "unix",
            "path": "/tmp/nbd.sock"
        } }'


But in order not to get the unrelated “qemu: qemu_mutex_unlock_impl: Operation not permitted” error, 78632a3d1685454fec83f83586c675e7f2a1a84a needs to be applied.)
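
(A sketch of what applying that commit could look like, assuming a git checkout of the affected qemu sources with upstream history available; this is not a step from the original report:)
    $ git cherry-pick 78632a3d1685454fec83f83586c675e7f2a1a84a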

Comment 12 Hanna Czenczek 2022-02-02 13:31:47 UTC
I think I’ve tracked the problem down, and it basically comes down to dd1ec1a4afe190e030edfa052d95c9e6e065438c (“block/nbd: allow drain during reconnect attempt”).  As far as I can tell, the problem is always that we are in a reconnect attempt while the nbd block device’s AioContext is changed.  Vladimir noticed that this used to cause a deadlock[1], and so wrote dd1ec1a4 to fix that, assuming that it would be fine to have the reconnect running in the background while the node is drained, given that this reconnect is technically not an I/O request.  The problem is that it isn’t fine, because in a drained section, the node can be moved to a different AioContext, and this will break the concurrent reconnect attempt.

Upstream, that’s no longer an issue thanks to the rewrite that was 4ddb5d2fd, but I don’t yet know how to fix it downstream, other than backporting 4ddb5d2fd, or disabling auto-reconnect.  If we disallow reconnecting in drained sections, we’ll simply reintroduce the deadlock that dd1ec1a4 wanted to fix.

[1] I’m not quite sure why yet, though.  As far as I can tell, the commit message doesn’t really explain.  I think it’s because the connection will no longer transmit data, because everything’s supposed to be drained; but then we wait for the connection to transmit some data so we can call it truly drained, and that will wait forever.

Comment 13 Hanna Czenczek 2022-02-03 16:48:22 UTC
Oh, I forgot: Please disregard comment 8.  I wrote that when I thought this was a different bug (specifically what BZ 2033626 describes).  For this BZ here (i.e. qemu before 6.2), we’d need a different fix, but I don’t know what yet.

Comment 14 Hanna Czenczek 2022-02-09 10:43:55 UTC
I only just now noticed that this BZ has ITR 9.0, and is for RHEL 9 (as of comment 5).  I don’t think this particular bug (that I described in comment 12) exists in qemu 6.2, and so should not appear in RHEL 9.

(We probably don’t want to fix this in RHEL 8 either, because 8.6 will use qemu 6.2, too, and so similarly the bug won’t appear there.)

When this bug here was fixed upstream, however, a new bug was introduced, but that is tracked by BZ 2033626 (RHEL 9) and BZ 2035185 (RHEL 8).

Therefore, I think there’s no reason to keep this bug here open.  The problem is that I don’t know with what to close it.  I can’t say the bug is fixed until BZ 2033626 is done, because it can’t be verified that this bug here has been fixed while that other bug is present.

For the time being, I’ll mark this one as being dependent on BZ 2033626, and then I hope once that one is closed we’ll remember to maybe put this one to POST + fixed in version or something.

Comment 15 Klaus Heinrich Kiwi 2022-02-16 12:24:39 UTC
(In reply to Hanna Reitz from comment #14)
> 
> For the time being, I’ll mark this one as being dependant on BZ 2033626, and
> then I hope once that one is closed we’ll remember to maybe put this one to
> POST + fixed in version or something.

I'll go ahead and move it to POST as I believe it's a 'more accurate' description of this bug's state: it had a lot of work/debug, and we're somewhat confident that it's a symptom of Bug 2033626.

Comment 16 Hanna Czenczek 2022-02-16 15:50:14 UTC
(In reply to Klaus Heinrich Kiwi from comment #15)
> I'll go ahead and move it to POST as I believe it's a 'more accurate'
> description of this bug's state: it had a lot of work/debug, and we're
> somewhat confident that it's a symptom of Bug 2033626.

It is not a symptom of BZ 2033626, though, it is a separate bug.  This separate bug was fixed in the 6.2 release, but this cannot be verified, because the fix caused NBD with I/O threads to break in a completely different manner (BZ 2033626).

(It’s like your car’s lights broke, and they got fixed, but the fix accidentally broke your car’s battery and generator, so you can’t easily verify whether your lights are working or not.  You’ll have to wait until the battery or generator work again.)

We can leave it in POST, but it’s important to note that this cannot be verified until BZ 2033626 is MODIFIED.  (And so I’ll leave the “Fixed in Version” field blank for now.)

Comment 17 CongLi 2022-02-21 10:26:42 UTC
(In reply to Hanna Reitz from comment #16)
> (In reply to Klaus Heinrich Kiwi from comment #15)
> > I'll go ahead and move it to POST as I believe it's a 'more accurate'
> > description of this bug's state: it had a lot of work/debug, and we're
> > somewhat confident that it's a symptom of Bug 2033626.
> 
> It is not a symptom of BZ 2033626, though, it is a separate bug.  This
> separate bug was fixed in the 6.2 release, but this cannot be verified,
> because the fix caused NBD with I/O threads to break in a completely
> different manner (BZ 2033626).
> 
> (It’s like your car’s lights broke, and they got fixed, but the fix
> accidentally broke your car’s battery and generator, so you can’t easily
> verify whether your lights are working or not.  You’ll have to wait until
> the battery or generator work again.)
> 
> We can leave it in POST, but it’s important to note that this cannot be
> verified until BZ 2033626 is MODIFIED.  (And so I’ll leave the “Fixed in
> Version” field blank for now.)

Hi Aihua,

Can you verify this bug too since BZ2033626 is ONQA now?

Thanks.

Comment 18 aihua liang 2022-02-22 06:29:19 UTC
Tested with qemu-kvm-6.2.0-9.el9: the src qemu core dump issue is not hit, but the mirror job failed with "Input/output error" after several rounds of testing; details as below:

Test Env:
  kernel version: 5.14.0-62.el9.x86_64
  qemu-kvm version: qemu-kvm-6.2.0-9.el9

Test Steps:
  1.Start src guest with qemu cmd:
    /usr/libexec/qemu-kvm \
    -name &apos;avocado-vt-vm1&apos;  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu &apos;Cascadelake-Server-noTSX&apos;,+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3000,server=on,wait=off \

2. Create a mirror.qcow2 in dst, then start guest with it
    #qemu-img create -f qcow2 /home/mirror.qcow2 20G
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/mirror.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3001,server=on,wait=off \
    -incoming defer \

3. In dst: start nbd server, then expose the image
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.114.14","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}

 4. In src, add the dst's exposed image.
    {"execute":"blockdev-add","arguments":{"driver":"nbd","node-name":"mirror","server":{"type":"inet","host":"10.73.114.14","port":"3333"},"export":"drive_image1"}}

 5. In src, do mirror from src to dst.
    { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1645502413, "microseconds": 863259}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1645502413, "microseconds": 863387}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1645502458, "microseconds": 542540}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645502458, "microseconds": 542594}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 21475426304, "offset": 21475426304, "speed": 0, "type": "mirror"}}

 6. In src, when the mirror job reaches "ready" status, set migration capabilities
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}

 7. In dst, set migration capabilities and set up the migration
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}
{"execute": "migrate-incoming","arguments": {"uri": "tcp:[::]:5000"}}
{"timestamp": {"seconds": 1628245994, "microseconds": 902021}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}

 8. In src, migrate from src to dst.
     {"execute": "migrate","arguments":{"uri": "tcp:10.73.114.14:5000"}}
{"timestamp": {"seconds": 1645502507, "microseconds": 472148}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}
{"timestamp": {"seconds": 1645502507, "microseconds": 474602}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
{"timestamp": {"seconds": 1645502507, "microseconds": 474669}, "event": "MIGRATION", "data": {"status": "active"}}
{"timestamp": {"seconds": 1645502520, "microseconds": 989563}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
{"timestamp": {"seconds": 1645502521, "microseconds": 263378}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
{"timestamp": {"seconds": 1645502521, "microseconds": 316550}, "event": "MIGRATION_PASS", "data": {"pass": 4}}
{"timestamp": {"seconds": 1645502521, "microseconds": 321147}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645502521, "microseconds": 321293}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645502521, "microseconds": 341198}, "event": "STOP"}
{"timestamp": {"seconds": 1645502521, "microseconds": 341229}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645502521, "microseconds": 341260}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645502521, "microseconds": 342867}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}

 9. When migration reaches "pre-switchover" status, kill the dst vm
    #kill -9 $dst_vm_qemu_pid

 10. In src, cancel migrate.
    {"execute":"migrate_cancel"}
{"timestamp": {"seconds": 1645502575, "microseconds": 838874}, "event": "RESUME"}
{"timestamp": {"seconds": 1645502575, "microseconds": 839619}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645502575, "microseconds": 839677}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645502576, "microseconds": 29512}, "event": "MIGRATION", "data": {"status": "cancelled"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 665781}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 665889}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 666118}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 666213}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21476474880, "offset": 21476343808, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 666246}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1645502597, "microseconds": 666268}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

 11. Start dst vm again; then in dst, start the nbd server and expose the image again.
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.114.14","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}

 12. In src, do block mirror from src to dst.
     { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }

 13. If step 12 executes successfully, repeat steps 6 through 12.

Actual results:
 After several tries, block mirror failed with the following error info:
  { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1645502983, "microseconds": 23704}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 23768}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25165}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25226}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25253}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25355}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25423}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21474836480, "offset": 0, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25454}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1645502983, "microseconds": 25484}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}


Hi Hanna,
 Can you help to check it? Thanks.

Aliang

Comment 19 Hanna Czenczek 2022-02-22 08:23:00 UTC
Hi Aliang,

In step 9, the destination VM is killed, so I don’t find it surprising that the mirror job would encounter errors once its target (the NBD server in the destination VM) has disappeared.  So I think this result is to be expected.  (It’s also entirely possible that there won’t be any errors, specifically if the source VM doesn’t write anything to disk between steps 8 and 10.)

Hanna

Comment 20 aihua liang 2022-02-22 08:48:04 UTC
(In reply to Hanna Reitz from comment #19)
> Hi Aliang,
> 
> In step 9, the destination VM is killed, so I don’t find it surprising that
> the mirror job would encounter errors once its target (the NBD server in the
> destination VM) has disappeared.  So I think this result is to be expected. 
> (It’s also entirely possible that there won’t be any errors, specifically if
> the source VM doesn’t write anything to disk between steps 8 and 10.)
> 
> Hanna

HI, Hanna

 In step 11, I restart the dst vm and expose the dst image again. Then I do the mirror from src to dst; sometimes the mirror executes successfully, but sometimes it fails with "Input/output error".

BR,
Aliang

Comment 21 Hanna Czenczek 2022-02-22 12:07:24 UTC
That doesn’t matter, the network connection was established with the old server, and writing to it will incur an error.  The NBD client will at some point try to reestablish the connection with the new server, but it will still show an error.

Compare this:

In one shell, run:

shell1$ qemu-nbd -f raw null-co://

In another, run:

shell2$ qemu-io -f raw nbd://localhost:10809
qemu-io>

Then, in the first, kill qemu-nbd (control-C), and restart it.  In qemu-io’s prompt, try some I/O accesses:

qemu-io> write 0 64k
write failed: Input/output error
qemu-io> write 0 64k
write failed: Input/output error
qemu-io> write 0 64k
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.00 sec (344.835 MiB/sec and 5517.3631 ops/sec)

As you can see, after a while it reconnects, but it does encounter I/O errors first.

Hanna

Comment 22 aihua liang 2022-02-24 04:29:55 UTC
Thanks, Hanna, for your detailed explanation.

With the extra step '{"execute":"blockdev-del","arguments":{"node-name":"mirror"}}' added after '{"execute":"migrate_cancel"}', and the test run 10 times on qemu-kvm-6.2.0-9.el9, all runs pass.

Test Steps:
  1.Start src guest with qemu cmd:
    /usr/libexec/qemu-kvm \
    -name &apos;avocado-vt-vm1&apos;  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu &apos;Cascadelake-Server-noTSX&apos;,+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYt,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3000,server=on,wait=off \

2. Create a mirror.qcow2 in dst, then start guest with it
    #qemu-img create -f qcow2 /home/mirror.qcow2 20G
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35,memory-backend=mem-machine_mem \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 4096 \
    -object memory-backend-ram,size=4096M,id=mem-machine_mem  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,path=/tmp/monitor-qmpmonitor1-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_qmpmonitor1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,path=/tmp/monitor-catch_monitor-20210803-235709-zuoFCFYu,wait=off,server=on,id=qmp_id_catch_monitor  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idczkm5z \
    -chardev socket,path=/tmp/serial-serial0-20210803-235709-zuoFCFYt,wait=off,server=on,id=chardev_serial0 \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210803-235709-zuoFCFYt,path=/tmp/seabios-20210803-235709-zuoFCFYt,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210803-235709-zuoFCFYt,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -object iothread,id=iothread0 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/mirror.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:d4:3c:c8:71:f0,id=idM0s1KB,netdev=idoOHlmh,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=idoOHlmh,vhost=on  \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3001,server=on,wait=off \
    -incoming defer \

3. In dst: start nbd server, then expose the image
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.114.14","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}

 4. In src, add the dst's exposed image.
    {"execute":"blockdev-add","arguments":{"driver":"nbd","node-name":"mirror","server":{"type":"inet","host":"10.73.114.14","port":"3333"},"export":"drive_image1"}}

 5. In src, do mirror from src to dst.
    { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1645502413, "microseconds": 863259}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1645502413, "microseconds": 863387}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1645502458, "microseconds": 542540}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645502458, "microseconds": 542594}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 21475426304, "offset": 21475426304, "speed": 0, "type": "mirror"}}

 6. In src, when the mirror job reaches "ready" status, set migration capabilities
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}

 7. In dst, set migration capabilities and set up the migration
    {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"events","state":true},{"capability":"pause-before-switchover","state":true}]}}
{"return": {}}
{"execute": "migrate-incoming","arguments": {"uri": "tcp:[::]:5000"}}
{"timestamp": {"seconds": 1628245994, "microseconds": 902021}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}

 8. In src, migrate from src to dst.
     {"execute": "migrate","arguments":{"uri": "tcp:10.73.114.14:5000"}}
{"timestamp": {"seconds": 1645673654, "microseconds": 753733}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}}
{"timestamp": {"seconds": 1645673654, "microseconds": 756384}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
{"timestamp": {"seconds": 1645673654, "microseconds": 756453}, "event": "MIGRATION", "data": {"status": "active"}}
{"timestamp": {"seconds": 1645673670, "microseconds": 957450}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
{"timestamp": {"seconds": 1645673671, "microseconds": 232578}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
{"timestamp": {"seconds": 1645673671, "microseconds": 237052}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645673671, "microseconds": 237116}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673671, "microseconds": 257395}, "event": "STOP"}
{"timestamp": {"seconds": 1645673671, "microseconds": 257419}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645673671, "microseconds": 257448}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673671, "microseconds": 258942}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}

 9. When migration reaches "pre-switchover" status, kill the dst VM.
    #kill -9 $dst_vm_qemu_pid
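    (One way to obtain $dst_vm_qemu_pid, a sketch assuming the destination QEMU is the only qemu-kvm process on that host started with "-incoming":)
    #dst_vm_qemu_pid=$(pgrep -f 'qemu-kvm.*-incoming')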

 10. In src, cancel the migration.
    {"execute":"migrate_cancel"}
{"timestamp": {"seconds": 1645673693, "microseconds": 347607}, "event": "MIGRATION", "data": {"status": "cancelling"}}
{"return": {}}
{"timestamp": {"seconds": 1645673693, "microseconds": 347679}, "event": "RESUME"}
{"timestamp": {"seconds": 1645673693, "microseconds": 348259}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645673693, "microseconds": 348307}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673693, "microseconds": 359202}, "event": "MIGRATION", "data": {"status": "cancelled"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917111}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917206}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917421}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917494}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21475229696, "offset": 21475098624, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917536}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917558}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}
 11. In src, delete the mirror node.
     {"execute":"blockdev-del","arguments":{"node-name":"mirror"}}
{"return": {}}

 12. Start the dst VM again; then, in dst, start the NBD server and expose the image again.
    { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet", "data": { "host": "10.73.114.14","port":"3333"}}}}
{"return": {}}
{"execute":"block-export-add","arguments":{"id": "export0", "node-name": "drive_image1", "type": "nbd", "writable": true}}
{"return": {}}
  
  13. In src, add the mirror node.
      {"execute":"blockdev-add","arguments":{"driver":"nbd","node-name":"mirror","server":{"type":"inet","host":"10.73.114.14","port":"3333"},"export":"drive_image1"}}
{"return": {}}
{ "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1645673812, "microseconds": 775162}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1645673812, "microseconds": 775225}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1645673875, "microseconds": 755327}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673875, "microseconds": 755373}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 21474967552, "offset": 21474967552, "speed": 0, "type": "mirror"}}

  14. In src, start the block mirror from src to dst.
     { "execute": "blockdev-mirror", "arguments": { "device": "drive_image1","target": "mirror", "sync": "full","job-id":"j1" } }
{"timestamp": {"seconds": 1645673812, "microseconds": 775162}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1645673812, "microseconds": 775225}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1645673875, "microseconds": 755327}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673875, "microseconds": 755373}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 21474967552, "offset": 21474967552, "speed": 0, "type": "mirror"}}

  15. Repeat steps 6 through 14 ten times.

Test Result:
  All tests pass.

Comment 23 Yanan Fu 2022-02-24 10:23:06 UTC
Add 'Verified:Tested,SanityOnly': gating test with qemu-kvm-6.2.0-9.el9 passed.

Comment 24 aihua liang 2022-02-24 10:25:23 UTC
Per comment 22 and comment 23, setting the bug's status to "VERIFIED".

Comment 26 Hanna Czenczek 2023-01-09 08:40:00 UTC
Seems like this BZ has fallen through the cracks, not sure why it isn’t CLOSED already.  As far as I understand, the issue should be fixed as of 9.0.  I don’t mind having it closed as WONTFIX, though it doesn’t seem ideal to me.

Comment 27 aihua liang 2023-01-09 09:54:09 UTC
(In reply to Hanna Reitz from comment #26)
> Seems like this BZ has fallen through the cracks, not sure why it isn’t
> CLOSED already.  As far as I understand, the issue should be fixed as of
> 9.0.  I don’t mind having it closed as WONTFIX, though it doesn’t seem ideal
> to me.

I notice that devel_ack and release are not set; I'm not sure whether that is the reason this bug's status never changed.
I also checked with jinzhao this morning, but she couldn't provide a good solution at present.

Comment 28 aihua liang 2023-01-09 10:19:34 UTC
(In reply to aihua liang from comment #27)
> (In reply to Hanna Reitz from comment #26)
> > Seems like this BZ has fallen through the cracks, not sure why it isn’t
> > CLOSED already.  As far as I understand, the issue should be fixed as of
> > 9.0.  I don’t mind having it closed as WONTFIX, though it doesn’t seem ideal
> > to me.
> 
> I notice that devel_ack and release are not set, not sure if it's the root
> cause of this bug's unchanged status.
> And I haved checked it with jinzhao this morning, but she could't provide a
> good solution at present.

Latest news:
 The root cause: an abnormal workflow for the bug status change (POST -> VERIFIED).
 I will verify the bug on the latest qemu; if it works, jinzhao will coordinate the remaining steps with @mierk.


Thanks,
Aliang

Comment 29 aihua liang 2023-01-10 08:11:22 UTC
Did not hit the src coredump issue on either RHEL9.0 (kernel: 5.14.0-70.13.1.el9_0.x86_64) + qemu-kvm-6.2.0-11.el9_0.2 or RHEL9.2 + qemu-kvm-7.2.0-3.el9.

But I found another issue during my test with the steps in the bug's description:
 After cancelling migration in step 10, the mirror job sometimes cannot be terminated immediately.

Reproduction rate: 1/10
Details:
 10. In src, cancel the migration.
{"execute":"migrate_cancel"}
{"timestamp": {"seconds": 1645673693, "microseconds": 347607}, "event": "MIGRATION", "data": {"status": "cancelling"}}
{"return": {}}
{"timestamp": {"seconds": 1645673693, "microseconds": 347679}, "event": "RESUME"}
{"timestamp": {"seconds": 1645673693, "microseconds": 348259}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"timestamp": {"seconds": 1645673693, "microseconds": 348307}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1645673693, "microseconds": 359202}, "event": "MIGRATION", "data": {"status": "cancelled"}}

*****************
 About 1-2 minutes later, the mirror job terminates automatically, with the following events sent out (the job state can also be polled as sketched after the events):
************
{"timestamp": {"seconds": 1645673733, "microseconds": 917111}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917206}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917421}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917494}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21475229696, "offset": 21475098624, "speed": 0, "type": "mirror", "error": "Input/output error"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917536}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1645673733, "microseconds": 917558}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}



Hi, Hanna

 Can you help check it? If the delay is not tolerable, I'll file a new bug to track it.

Thanks,
Aliang

Comment 30 Hanna Czenczek 2023-01-10 09:27:33 UTC
Hi Aliang,

What does “mirror job can't be terminated” mean exactly?  The log contains no block-job-cancel command, so I guess this relies on the destination to go down, breaking up the NBD connection, thus inducing an error in the job.  The delay could be due to anything, maybe the destination didn’t get to terminate the connection properly, maybe the source is still trying to re-establish the connection.

I can also see a small delay when I don’t issue a block-job-cancel, but when issuing block-job-cancel, the job does terminate immediately.  So I don’t think there’s a bug there.  If the job is to be cancelled, it makes sense to cancel it explicitly, I think.

Hanna
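
(As suggested above, a minimal sketch of cancelling the ready mirror job explicitly from the src QMP monitor, using the job-id "j1" from the steps in the description:)
{"execute": "block-job-cancel", "arguments": {"device": "j1"}}
(Per comment 30, the job then terminates immediately instead of waiting for the NBD connection error.)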

Comment 31 aihua liang 2023-01-10 10:05:54 UTC
(In reply to Hanna Reitz from comment #30)
> Hi Aliang,
> 
> What does “mirror job can't be terminated” mean exactly?  The log contains

I meant that the mirror job couldn't be completed with "Input/output error" immediately
after the migration was cancelled, as the dst vm had been killed in step 9.

> no block-job-cancel command, so I guess this relies on the destination to
> go down, breaking up the NBD connection, thus inducing an error in the job. 
> The delay could be due to anything, maybe the destination didn’t get to
> terminate the connection properly, maybe the source is still trying to
> re-establish the connection.
> 
> I can also see a small delay when I don’t issue a block-job-cancel, but when
> issuing block-job-cancel, the job does terminate immediately.  So I don’t
> think there’s a bug there.  If the job is to be cancelled, it makes sense to
> cancel it explicitly, I think.

Agree.
If upper layers also hit this issue, we'll investigate further then.

> 
> Hanna

Thanks, Hanna.

Comment 32 Yash Mankad 2023-01-10 20:38:38 UTC
Based on aihua's #c20, CLOSING this bug as CURRENTRELEASE as it has been fixed since RHEL 9.0

Comment 33 Yash Mankad 2023-01-10 20:42:59 UTC
(In reply to Yash Mankad from comment #32)
> Based on aihua's #c20, CLOSING this bug as CURRENTRELEASE as it has been
> fixed since RHEL 9.0

I meant comment 29, not 20 (typo)
https://bugzilla.redhat.com/show_bug.cgi?id=1990835#c29

