Bug 1981782 - qemu segfault after the 2nd postcopy live migration with vhost-user
Summary: qemu segfault after the 2nd postcopy live migration with vhost-user
Keywords:
Status: CLOSED ERRATA
Alias: None
Deadline: 2021-12-13
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.6
Assignee: Juan Quintela
QA Contact: Pei Zhang
URL:
Whiteboard:
Duplicates: 2024981
Depends On: 2027716
Blocks: 1982224 2021976 2021981 2024981 2025609
 
Reported: 2021-07-13 11:58 UTC by Pei Zhang
Modified: 2022-05-10 13:28 UTC
CC List: 12 users

Fixed In Version: qemu-kvm-6.2.0-1.module+el8.6.0+13725+61ae1949
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1982224 2021976 2021981 2024981
Environment:
Last Closed: 2022-05-10 13:20:14 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 58 0 None None None 2021-11-10 11:33:28 UTC
Red Hat Product Errata RHSA-2022:1759 0 None Closed [bug] [S3][QA][OSP16.1] An unclear precaution is described in the document of ReaR backup for controller nodes.[Stack] 2022-05-19 02:34:03 UTC

Description Pei Zhang 2021-07-13 11:58:13 UTC
Description of problem:
Boot qemu with vhost-user; qemu crashes after the 2nd postcopy live migration.

Version-Release number of selected component (if applicable):
4.18.0-322.el8.x86_64
qemu-kvm-6.0.0-23.module+el8.5.0+11740+35571f13.x86_64
libvirt-7.5.0-1.module+el8.5.0+11664+59f87560.x86_64
openvswitch2.15-2.15.0-26.el8fdp.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Boot ovs-dpdk with postcopy enabled and vhost-user 2 queues on the src and des hosts. For the full command, refer to [1].

# ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-postcopy-support=true

# ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:5e:00.0  options:n_rxq=2  options:n_txq=2
# ovs-vsctl add-port ovsbr0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhostuser0.sock


2. Boot the VM with vhost-user 2 queues. For the full command, refer to the next comment.

    <interface type="vhostuser">
      <mac address="18:66:da:5f:dd:02" />
      <source mode="server" path="/tmp/vhostuser0.sock" type="unix" />
      <model type="virtio" />
      <driver ats="on" iommu="on" name="vhost" queues="2" rx_queue_size="1024" />
      <address bus="0x6" domain="0x0000" function="0x0" slot="0x00" type="pci" />
    </interface>


3. After the VM starts up, do postcopy live migration from src to des; it succeeds.

# virsh migrate --verbose --persistent --postcopy  --live rhel8.5 qemu+ssh://192.168.1.2/system
Migration: [100 %]


4. Do postcopy live migration back from des to src. The migration can finish; however, qemu crashes on des, and every subsequent ping-pong postcopy live migration causes the src and des qemu to crash.

# virsh migrate --verbose --persistent --postcopy --live rhel8.5 qemu+ssh://192.168.1.1/system
Migration: [100 %]

dmesg output:
...
[20193.252622] qemu-kvm[15140]: segfault at 8 ip 00005626add0b329 sp 00007ffc1542d840 error 4 in qemu-kvm[5626ad8fa000+b41000]
[20193.263753] Code: 00 48 83 c4 08 5b 5d c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 55 48 89 fd 53 48 83 ec 08 e8 de 4a fa ff 48 8b b8 a0 01 00 00 <8b> 57 08 85 d2 74 28 48 89 c3 48 8b 07 8b 4d 00 39 08 74 22 48 83



Actual results:
qemu segfaults after postcopy live migration with vhost-user multiqueue.

Expected results:
qemu should not crash after postcopy live migration with vhost-user multiqueue.

Additional info:
1. Without vhost-user ports, postcopy works well in the same environment.

2. With vhost-user single queue, postcopy works well in the same environment.

Reference
[1]
# cat boot_ovs_client.sh 
#!/bin/bash

set -e

echo "killing old ovs process"
pkill -f ovs-vswitchd || true
sleep 5
pkill -f ovsdb-server || true

echo "probing ovs kernel module"
modprobe -r openvswitch || true
modprobe openvswitch

echo "clean env"
DB_FILE=/etc/openvswitch/conf.db
rm -rf /var/run/openvswitch
mkdir /var/run/openvswitch
rm -f $DB_FILE

echo "init ovs db and boot db server"
export DB_SOCK=/var/run/openvswitch/db.sock
ovsdb-tool create /etc/openvswitch/conf.db /usr/share/openvswitch/vswitch.ovsschema
ovsdb-server --remote=punix:$DB_SOCK --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach --log-file
ovs-vsctl --no-wait init

echo "start ovs vswitch daemon"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask="0x1"
ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-postcopy-support=true
ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file=/var/log/openvswitch/ovs-vswitchd.log

echo "creating bridge and ports"

ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:5e:00.0 
ovs-vsctl add-port ovsbr0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhostuser0.sock
ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 "in_port=1,idle_timeout=0 actions=output:2"
ovs-ofctl add-flow ovsbr0 "in_port=2,idle_timeout=0 actions=output:1"

ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x1
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x154
ovs-vsctl set Interface dpdk0 options:n_rxq=2


echo "all done"

Comment 2 Dr. David Alan Gilbert 2021-07-14 10:40:37 UTC
Can you please capture a backtrace from the crash.

Comment 3 Pei Zhang 2021-07-15 04:52:58 UTC
(In reply to Dr. David Alan Gilbert from comment #2)
> Can you please capture a backtrace from the crash.

David, 

When I tried to capture the backtrace from the qemu crash, I did the migration from the qemu side. It seems the root cause is that a postcopy + vhost-user 2Q live migration cannot finish, so no crash was captured when testing with qemu directly.

Here is the details:

1. Boot ovs-dpdk with postcopy enabled and vhost-user 2 queues on src and des hosts. (Same as above step 1)

2. Boot qemu with vhost-user 2 queues on src host, and same qemu cmd but along with "-incoming defer" on des host.


-chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server \
-netdev vhost-user,chardev=charnet1,queues=2,id=hostnet1 \
-device virtio-net-pci,mq=on,vectors=6,rx_queue_size=1024,netdev=hostnet1,id=net1,mac=88:66:da:5f:dd:02,bus=pci.6,addr=0x0,iommu_platform=on,ats=on \

on des:
...
-incoming defer


3. Enable postcopy on both src and dst host 

(qemu) migrate_set_capability postcopy-ram on

4. Do postcopy live migration

on des:
(qemu) migrate_incoming tcp:[::]:4444

on src:
(qemu) migrate -d tcp:$dst_ip:4444
(qemu) migrate_start_postcopy

5. Check migration status on src. The migration cannot finish (even after 10 minutes and more); its status stays "postcopy-active". Only the value of "total time" increases as time goes by; the values of all other items are frozen.

(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: postcopy-active
                  ^^^^^^^^^^^^^^
total time: 37547 ms
            ^^^^^
expected downtime: 300 ms
setup: 6 ms
transferred ram: 669171 kbytes
throughput: 944.60 mbps
remaining ram: 0 kbytes
total ram: 8405832 kbytes
duplicate: 1938743 pages
skipped: 0 pages
normal: 162715 pages
normal bytes: 650860 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 28770
postcopy request count: 2
(qemu) 
(qemu) 
(qemu) 

(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: postcopy-active
                  ^^^^^^^^^^^^^^
total time: 493861 ms
            ^^^^^^
expected downtime: 300 ms
setup: 6 ms
transferred ram: 669171 kbytes
throughput: 944.60 mbps
remaining ram: 0 kbytes
total ram: 8405832 kbytes
duplicate: 1938743 pages
skipped: 0 pages
normal: 162715 pages
normal bytes: 650860 kbytes
dirty sync count: 2
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 28770
postcopy request count: 2


6. Check the qemu terminal on the des host; it is not responsive to any input and looks stuck.

# sh qemu_q35_multifunction_vhost_user.sh 
qemu-kvm: -chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server: warning: short-form boolean option 'server' deprecated
Please use server=on instead
qemu-kvm: -chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server: info: QEMU waiting for connection on: disconnected:unix:/tmp/vhostuser0.sock,server=on
QEMU 6.0.0 monitor - type 'help' for more information
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_incoming tcp:[::]:4444
(qemu) (stuck here)


Reference: Full qemu cmd line:
/usr/libexec/qemu-kvm \
-name guest=rhel8.5 \
-machine q35,kernel_irqchip=split \
-cpu Skylake-Server-IBRS,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,tsc-deadline=on,pmu=off \
-m 8192 \
-smp 6,sockets=6,cores=1,threads=1 \
-object memory-backend-file,id=mem,size=8G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem -mem-prealloc \
-device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on \
-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfv/rhel8.5.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my,file=my_file \
-device virtio-blk-pci,scsi=off,iommu_platform=on,ats=on,bus=pci.2,addr=0x0,drive=my,id=virtio-disk0,bootindex=1,write-cache=on \
-chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server \
-netdev vhost-user,chardev=charnet1,queues=2,id=hostnet1 \
-device virtio-net-pci,mq=on,vectors=6,rx_queue_size=1024,netdev=hostnet1,id=net1,mac=88:66:da:5f:dd:02,bus=pci.6,addr=0x0,iommu_platform=on,ats=on \
-monitor stdio \
-vnc :2 \


More info:
1. postcopy + vhost-user single queue works well.

Comment 4 Pei Zhang 2021-07-15 04:59:05 UTC
(In reply to Pei Zhang from comment #3)
...
> 
> 5. Check migration status on src. Migration cannot be finished(even after 10
> minutes and more time), it's status keep "postcopy-active". Only the value
> of "total time" increases as time goes by, but the value change of all other
> items become freeze alreay. 
> 
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> clear-bitmap-shift: 18
> Migration status: postcopy-active
>                   ^^^^^^^^^^^^^^
> total time: 37547 ms
>             ^^^^^
> expected downtime: 300 ms
> setup: 6 ms
> transferred ram: 669171 kbytes
> throughput: 944.60 mbps
> remaining ram: 0 kbytes
> total ram: 8405832 kbytes
> duplicate: 1938743 pages
> skipped: 0 pages
> normal: 162715 pages
> normal bytes: 650860 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> pages-per-second: 28770
> postcopy request count: 2
> (qemu) 
> (qemu) 
> (qemu) 
> 
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> clear-bitmap-shift: 18
> Migration status: postcopy-active
>                   ^^^^^^^^^^^^^^
> total time: 493861 ms
>             ^^^^^^
> expected downtime: 300 ms
> setup: 6 ms
> transferred ram: 669171 kbytes
> throughput: 944.60 mbps
> remaining ram: 0 kbytes
> total ram: 8405832 kbytes
> duplicate: 1938743 pages
> skipped: 0 pages
> normal: 162715 pages
> normal bytes: 650860 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> pages-per-second: 28770
> postcopy request count: 2
> 

(qemu) info status 
VM status: paused (finish-migrate)


Also update the VM status info on src host.

Comment 5 Dr. David Alan Gilbert 2021-07-20 11:24:17 UTC
Dan suggests that :

set    max_core = "unlimited" in  /etc/libvirt/qemu.conf and restart libvirtd

and then when qemu crashes you should be able to run

coredumpctl

and see if it lists a crash.

Comment 6 Pei Zhang 2021-07-20 11:35:27 UTC
(In reply to Dr. David Alan Gilbert from comment #5)
> Dan suggests that :
> 
> set    max_core = "unlimited" in  /etc/libvirt/qemu.conf and restart libvirtd
> 
> and then when qemu crashes you should be able to run
> 
> coredumpctl
> 
> and see if it lists a crash.

Thanks David.

I got the backtrace info below by following this method.
# coredumpctl 
TIME                            PID   UID   GID SIG COREFILE  EXE
Tue 2021-07-20 07:30:38 EDT    3638     0   107  11 present   /usr/libexec/qemu-kvm

# coredumpctl info
           PID: 3638 (qemu-kvm)
           UID: 0 (root)
           GID: 107 (qemu)
        Signal: 11 (SEGV)
     Timestamp: Tue 2021-07-20 07:30:38 EDT (2min 13s ago)
  Command Line: /usr/libexec/qemu-kvm -name guest=rhel8.5,debug-threads=on -S -object {"qom-type":"secret","id">
    Executable: /usr/libexec/qemu-kvm
 Control Group: /machine.slice/machine-qemu\x2d1\x2drhel8.5.scope
          Unit: machine-qemu\x2d1\x2drhel8.5.scope
         Slice: machine.slice
       Boot ID: d1dadff44a3b46e58305dfdb65b72c15
    Machine ID: 86efb994179d4e1cbdff7b166627b3b9
      Hostname: dell-per740-03.lab.eng.pek2.redhat.com
       Storage: /var/lib/systemd/coredump/core.qemu-kvm.0.d1dadff44a3b46e58305dfdb65b72c15.3638.162678063800000>
       Message: Process 3638 (qemu-kvm) of user 0 dumped core.
                
                Stack trace of thread 3638:
                #0  0x000055abe37782d9 postcopy_unregister_shared_ufd (qemu-kvm)
                #1  0x000055abe38d1573 vhost_user_backend_cleanup (qemu-kvm)
                #2  0x000055abe390aee2 vhost_dev_cleanup (qemu-kvm)
                #3  0x000055abe3749939 net_vhost_user_cleanup (qemu-kvm)
                #4  0x000055abe36c060d qemu_del_net_client (qemu-kvm)
                #5  0x000055abe36c1201 net_cleanup (qemu-kvm)
                #6  0x000055abe387e64c qemu_cleanup (qemu-kvm)
                #7  0x000055abe36986d7 main (qemu-kvm)
                #8  0x00007fb6b776e493 __libc_start_main (libc.so.6)
                #9  0x000055abe369af4e _start (qemu-kvm)
                
                Stack trace of thread 3657:
                #0  0x00007fb6b784252d syscall (libc.so.6)
                #1  0x000055abe3a3a96f qemu_event_wait (qemu-kvm)
                #2  0x000055abe3a51252 call_rcu_thread (qemu-kvm)
                #3  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #4  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #5  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3666:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3665:
                #0  0x00007fb6b783ca41 __poll (libc.so.6)
                #1  0x00007fb6b851ac86 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                #2  0x00007fb6b851b042 g_main_loop_run (libglib-2.0.so.0)
                #3  0x000055abe39541e9 iothread_run (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3667:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3668:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3669:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3670:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)
                
                Stack trace of thread 3671:
                #0  0x00007fb6b7b1e2fc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
                #1  0x000055abe3a3a45d qemu_cond_wait_impl (qemu-kvm)
                #2  0x000055abe3882e1f qemu_wait_io_event (qemu-kvm)
                #3  0x000055abe38ba588 kvm_vcpu_thread_fn (qemu-kvm)
                #4  0x000055abe3a3a0a4 qemu_thread_start (qemu-kvm)
                #5  0x00007fb6b7b1814a start_thread (libpthread.so.0)
                #6  0x00007fb6b7847dc3 __clone (libc.so.6)

Comment 8 Pei Zhang 2021-07-20 12:03:40 UTC
Additional info:

When there are packets in flight in the VM during postcopy live migration, both single-queue and 2-queue setups hit the qemu segfault issue.

Comment 9 Dr. David Alan Gilbert 2021-07-22 14:09:27 UTC
OK, so it looks like we have two problems:
  a) The postcopy stalling when using multiple queues
  b) The segfault

Comment 10 John Ferlan 2021-07-22 18:40:53 UTC
Assigned to Meirav for initial triage per bz process and age of bug; created or assigned to virt-maint without triage.

Comment 11 Dr. David Alan Gilbert 2021-07-28 10:03:47 UTC
I had a quick look at the cleanup code;

vhost_user_postcopy_end

    postcopy_unregister_shared_ufd(&u->postcopy_fd);
    close(u->postcopy_fd.fd);
    u->postcopy_fd.handler = NULL;

vhost_user_backend_cleanup
    if (u->postcopy_fd.handler) {
        postcopy_unregister_shared_ufd(&u->postcopy_fd);
        close(u->postcopy_fd.fd);
        u->postcopy_fd.handler = NULL;
    }

so it *looks* OK; we should be going through postcopy_end, which should clean it up first and then set the handler to NULL; then later, when we quit, we hit that backend_cleanup, which is where the segfault happens; but it's not clear why

Comment 13 Juan Quintela 2021-09-09 12:04:38 UTC
I got today the machines to reproduce this one.  Will see what I can do tomorrow.

Comment 14 John Ferlan 2021-09-09 15:15:11 UTC
Bulk update: Move RHEL-AV bugs to RHEL8 with existing RHEL9 clone.

Comment 15 Juan Quintela 2021-11-09 10:26:07 UTC
Hi Pei

This is the official brew build for the merge request.
You had already tested that it worked with the brew build I suggested last week; could you test that this one works?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=41112346

Thanks, Juan.

Comment 17 Juan Quintela 2021-11-09 15:25:10 UTC
Forget the previous comment; I am sending a new merge request. This one was built against RHEL 8.5.0, but I think it should be against RHEL-AV 8.5.0.

Sorry for the confusion.

Comment 19 Juan Quintela 2021-11-10 08:32:42 UTC
Hi Pei.

No, it would go into 8.5 and perhaps 8.4.z.  What I am wondering is if we should only do it for AV products or also for non-AV ones.

Later, Juan.

Comment 20 Juan Quintela 2021-11-10 11:26:16 UTC
Hi Pei

This is the RHEL-8.5.0 brew build.

Working on cloning this for AV 8.4.z and AV 8.5.0

Could you try?

Thanks, Juan.

Comment 21 John Ferlan 2021-11-10 11:57:03 UTC
Juan - this would need to go into 8.6 first (I assume via the rebase using a process like I described for bug 1982993)... Although in this case, you can start a z-stream request for 8.5 and perhaps 8.4 sooner... Use the "zstream target release" dropdown (just below internal target release). I think only one can be selected at a time. Just drop a needinfo on Yash Mankad if you want both - he manages that process for us.

Comment 22 Juan Quintela 2021-11-10 13:22:43 UTC
Thanks very much.

Comment 24 Pei Zhang 2021-11-15 01:43:07 UTC
Testing update:

This issue is gone with qemu-kvm-6.2.0-1.rc0.scrmod+el8.6.0+13251+1ebd03e3.wrb211110.x86_64. No qemu crash during vhost-user postcopy live migration.

Comment 27 John Ferlan 2021-11-23 20:31:52 UTC
Clearing the zstream? flag, as the fix is not needed for RHEL but rather RHEL-AV; we've cloned this to bug 2024981 in order to create AV 8.5 and 8.4 z-stream clones.

Comment 29 John Ferlan 2021-11-27 13:29:47 UTC
*** Bug 2024981 has been marked as a duplicate of this bug. ***

Comment 31 John Ferlan 2021-12-22 18:01:48 UTC
Mass update of DTM/ITM to +3 values since the rebase of qemu-6.2 into RHEL 8.6 has been delayed or slowed due to process roadblocks (authentication changes, gating issues). This avoids the DevMissed bot and worse the bot that could come along and strip release+. The +3 was chosen mainly to give a cushion. 

Also added the qemu-6.2 rebase bug 2027716 as a dependent.

Comment 34 Yanan Fu 2021-12-24 02:48:01 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 35 Pei Zhang 2021-12-27 09:19:29 UTC
Verification:

Versions:
qemu-kvm-6.2.0-1.module+el8.6.0+13725+61ae1949.x86_64:
tuned-2.16.0-1.el8.noarch
libvirt-7.10.0-1.module+el8.6.0+13502+4f24a11d.x86_64
openvswitch2.16-2.16.0-35.el8fdp.x86_64
dpdk-20.11-3.el8.x86_64

vhost-user 1Q/2Q postcopy live migration works well. No errors any more.


Testcase: live_migration_nonrt_server_2Q_1G_iommu_ovs_postcopy
PASS


Testcase: live_migration_nonrt_server_1Q_1G_iommu_ovs_postcopy
PASS

So this issue has been fixed. Moving to Verified.

Comment 37 errata-xmlrpc 2022-05-10 13:20:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759

